Consider opening a GH discussions #37
Comments
Hi there! Glad you like it. |
The instances of my dataset are multi-dimensional; each feature is a time series, and all features are sampled at the same rate and at the same times. Given the sampling rate, I know when a piece of data is missing for any instance, and since all features are sampled together, they are all missing at that time. I therefore found your example of irregular data very useful, because I have missing data and the instances may have different lengths. Unfortunately I can't show the code, but it's very similar to coeffs = torchcde.hermite_cubic_coefficients_with_backward_differences(x)
Calling Traceback (most recent call last):
File "XXXXXX", line YYYY, in forward
zt = torchcde.cdeint(X=X, func=self.func, z0=z0, t=X.interval)
File lib/site-packages/torchcde/solver.py", line 227, in cdeint
out = odeint(func=vector_field, y0=z0, t=t, **kwargs)
File lib/site-packages/torchdiffeq/_impl/adjoint.py", line 198, in odeint_adjoint
ans = OdeintAdjointMethod.apply(shapes, func, y0, t, rtol, atol, method, options, event_fn, adjoint_rtol, adjoint_atol,
File lib/site-packages/torchdiffeq/_impl/adjoint.py", line 25, in forward
ans = odeint(func, y0, t, rtol=rtol, atol=atol, method=method, options=options, event_fn=event_fn)
File lib/site-packages/torchdiffeq/_impl/odeint.py", line 77, in odeint
solution = solver.integrate(t)
File lib/site-packages/torchdiffeq/_impl/solvers.py", line 30, in integrate
solution[i] = self._advance(t[i])
File lib/site-packages/torchdiffeq/_impl/rk_common.py", line 194, in _advance
self.rk_state = self._adaptive_step(self.rk_state)
File lib/site-packages/torchdiffeq/_impl/rk_common.py", line 228, in _adaptive_step
assert t0 + dt > t0, 'underflow in dt {}'.format(dt.item())
AssertionError: underflow in dt nan
That's definitely not enough detail, so please let me know anything else you need to know. Thanks in advance! EDIT: Fix errors. |
It looks like your tensor |
Sorry, there were two errors; it should be |
There are two main possibilities here. The first is that your dynamics are really stiff, or otherwise maladapted to your choice of numerical solver. Try adjusting tolerances, changing the integration method, etc. The second possibility is that you're passing |
Using RuntimeError: Function 'OdeintAdjointMethodBackward' returned nan values in its 2th output.
Segmentation fault (core dumped)
So it seems to be during the backward pass instead of forward, then when the weights are updated, it propagates
SOLVERS = {
    'dopri8': Dopri8Solver,
    'dopri5': Dopri5Solver,
    'bosh3': Bosh3Solver,
    'fehlberg2': Fehlberg2,
    'adaptive_heun': AdaptiveHeunSolver,
    'euler': Euler,
    'midpoint': Midpoint,
    'rk4': RK4,
    'explicit_adams': AdamsBashforth,
    'implicit_adams': AdamsBashforthMoulton,
    # Backward compatibility: use the same name as before
    'fixed_adams': AdamsBashforthMoulton,
    # ~Backwards compatibility
    'scipy_solver': ScipyWrapperODESolver,
}
Would you recommend any? Or should I try all of them? I think the default is dopri5. |
Right. So via a debugger or otherwise, try and track down where that nan is coming from. Is it a nan arising from e.g. a division by zero, or is it a nan arising from some nan data accidentally getting passed in, etc? In terms of solvers: dopri8 is the highest-order/most-accurate solver torchdiffeq supports. You can try that and see if it helps resolve nans due to stiffness issues. You can also try any of the fixed-step solvers (euler, midpoint, rk4) -- these won't do any adaptive stepping so the actual solution accuracy might be slightly questionable, but in doing so they just entirely ignore stiffness issues, which can at least help diagnose whether that's the root cause. |
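The fixed-step diagnostic described above can be illustrated without torchdiffeq. What follows is a toy sketch in plain Python, assuming a hypothetical scalar vector field and a hand-written RK4 stepper; `rk4_integrate` and `first_non_finite` are illustrative names, not part of torchdiffeq or torchcde. Because a fixed-step solver never shrinks its step, it cannot fail with a dt underflow. Any blow-up shows up directly in the returned values instead, which is what makes it useful for diagnosis.

```python
import math

def rk4_integrate(f, z0, t0, t1, n_steps):
    """Integrate dz/dt = f(t, z) from t0 to t1 with n_steps fixed RK4 steps.

    Returns the list of states [z(t0), ..., z(t1)].
    """
    h = (t1 - t0) / n_steps
    z = z0
    zs = [z]
    for i in range(n_steps):
        t = t0 + i * h
        k1 = f(t, z)
        k2 = f(t + h / 2, z + h / 2 * k1)
        k3 = f(t + h / 2, z + h / 2 * k2)
        k4 = f(t + h, z + h * k3)
        z = z + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        zs.append(z)
    return zs

def first_non_finite(values):
    """Index of the first inf/nan entry, or None if every value is finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None

# Hypothetical stiff test problem: dz/dt = -1000 * z, z(0) = 1.
def stiff(t, z):
    return -1000.0 * z

# Small step (h = 1e-4): the solution decays as expected.
fine = rk4_integrate(stiff, 1.0, 0.0, 0.01, 100)

# Coarse step (h = 0.1): the explicit method is unstable and the state
# overflows to inf and then nan, without any assertion being raised.
coarse = rk4_integrate(stiff, 1.0, 0.0, 10.0, 100)

print("fine, final value:", fine[-1])
print("coarse, first non-finite step:", first_non_finite(coarse))
```

If the small-step run gives sane values while the adaptive solver fails, stiffness is a plausible culprit; if even the small-step run produces non-finite values, the problem more likely lies in the data or the vector field.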
OK, thanks for clarifying that! If I remember correctly, I have to add some hooks in backward of |
After a couple of days of debugging, I found where the
Testing your example and adding this simple print here:
def _advance(self, next_t):
    """Interpolate through the next time point, integrating as necessary."""
    n_steps = 0
    while next_t > self.rk_state.t1:
        assert n_steps < self.max_num_steps, 'max_num_steps exceeded ({}>={})'.format(n_steps, self.max_num_steps)
        print(f"_advance rk_state.t1: {self.rk_state.t1.item()}, rk_state[1].max(): {self.rk_state[1].max()}")  # <------
        self.rk_state = self._adaptive_step(self.rk_state)
        n_steps += 1
    return _interp_evaluate(self.rk_state.interp_coeff, self.rk_state.t0, self.rk_state.t1, next_t)
Gives this log
But when I tried to run my code, there's a divergence in this loop. See the log
Would you happen to know what hyperparameters might help here? |
Probably you're doing something like using a neural network vector field without a tanh at the end. Generally speaking you want to constrain the rate of change of hidden state, because of issues like the one you're describing. See Section 6.2 of the original nCDE paper. |
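The effect of a final tanh can be seen on a toy scalar problem. This is a sketch in plain Python using explicit Euler; the two vector fields and the gain `a` are hypothetical, and the example stands in for a neural CDE vector field rather than reproducing one. Since |tanh| is at most 1, the bounded state can move by at most t over a time span t, whereas the unbounded linear field grows exponentially.

```python
import math

def euler(f, z0, t1, n_steps):
    """Integrate dz/dt = f(z) from 0 to t1 with n_steps explicit Euler steps."""
    h = t1 / n_steps
    z = z0
    for _ in range(n_steps):
        z = z + h * f(z)
    return z

a = 5.0    # hypothetical gain, playing the role of a large network weight
z0 = 1.0   # initial hidden state
t1 = 10.0  # length of the integration interval

# Unbounded vector field: the rate of change scales with the state itself.
unbounded = euler(lambda z: a * z, z0, t1, 1000)

# Same field squashed by tanh: the rate of change never exceeds 1 in magnitude.
bounded = euler(lambda z: math.tanh(a * z), z0, t1, 1000)

print("without tanh:", unbounded)
print("with tanh:   ", bounded)
```

In an actual CDE the vector field is also multiplied by the derivative of the interpolated input path, so bounding the network output limits only part of the growth; the scale of the data matters as well.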
You were right, I replaced all activations with
def integrate(self, t):
    solution = torch.empty(len(t), *self.y0.shape, dtype=self.y0.dtype, device=self.y0.device)
    solution[0] = self.y0
    t = t.to(self.dtype)
    self._before_integrate(t)
    for i in range(1, len(t)):
        print("LOOP INTEGRATE", self.rk_state[1].max(), "\n")  # <-----
        solution[i] = self._advance(t[i])
    return solution
At the 9th epoch, I see this:
See log
LOOP INTEGRATE tensor(430.7609)
_advance rk_state.t1: 0.0, rk_state[1].max(): 430.76092529296875
LOOP INTEGRATE tensor(-0.)
_advance rk_state.t1: -29.0, rk_state[1].max(): -0.0
The second time it enters |
To be clear, I don't suggest making all activations It's certainly possible for this kind of thing to happen due to the data dynamics. Are you normalising your input data? (You should be.) I'd also suggest trying a fixed solver with a small time step, and just making sure that the output values you get then seem sane. If not, then that's an indication of a possible problem independent of anything to do with the numerical integration. |
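A minimal sketch of per-channel normalisation that skips missing values, in plain Python. It assumes missing observations are encoded as NaN and that each instance is a list of time steps, each of which is a list of channel values; `channel_stats` and `normalise` are hypothetical helper names, not torchcde functions. NaNs are left in place so that a later interpolation step can still treat them as missing.

```python
import math

def channel_stats(instances, n_channels):
    """Per-channel (mean, std) over all observed (non-NaN) values."""
    stats = []
    for c in range(n_channels):
        observed = [step[c] for series in instances for step in series
                    if not math.isnan(step[c])]
        mean = sum(observed) / len(observed)
        var = sum((v - mean) ** 2 for v in observed) / len(observed)
        std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant channels
        stats.append((mean, std))
    return stats

def normalise(instances, stats):
    """Standardise every channel; NaN entries stay NaN."""
    return [[[(v - mean) / std for v, (mean, std) in zip(step, stats)]
             for step in series]
            for series in instances]

nan = float("nan")
# Two hypothetical instances of different lengths, two channels each.
# All channels are missing together at the second step of the first instance.
train = [
    [[1.0, 100.0], [nan, nan], [3.0, 300.0]],
    [[2.0, 200.0], [2.0, 200.0]],
]

stats = channel_stats(train, n_channels=2)
normalised = normalise(train, stats)
print(stats)
print(normalised[0])
```

Computing the statistics on the training set only and reusing them for validation and test data keeps the splits consistent.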
I wasn't 😞 . That did solve my problem! Thank you so much for your attention and help with this! |
Hi Patrick,
First I would like to thank you and everyone involved in this package and related research 🎉 ! I've been trying to use it, but I'm seeing some nans and I would really appreciate your (and/or others') insights on this. Therefore I would like to know your thoughts about opening a GH Discussions tab for non-issue-related conversations, which is basically a forum inside GH.
Best regards,