I might be missing something, but this recent tweak is confusing to me.
titans-pytorch/titans_pytorch/neural_memory.py
Lines 461 to 466 in 7034cc9

    per_sample_grad_fn = self.per_sample_grad_fn_expanded_weights  # the weights will now have a batch * chunk dimension
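For context, my mental model of a per-sample gradient function over expanded weights is roughly the standard torch.func recipe sketched below. This is illustrative only; the toy module, shapes, and names are my own, not the repo's actual internals:

```python
import torch
from torch import nn
from torch.func import functional_call, grad, vmap

dim, batch_chunks = 16, 4

# toy MLP standing in for the neural memory module
memory = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, dim))

def memory_loss(params, key, value):
    # || M(k_t) - v_t ||^2 for a single (key, value) pair, evaluated with the given weights
    pred = functional_call(memory, params, (key,))
    return (pred - value).pow(2).mean()

# vmap over a leading dimension on the weights as well as the data -- my reading of
# "the weights will now have a batch * chunk dimension"
per_sample_grad_fn = vmap(grad(memory_loss), in_dims = (0, 0, 0))

# expanded weights: each parameter copied along a leading batch * chunk dimension
expanded_params = {
    name: p.detach().unsqueeze(0).repeat(batch_chunks, *((1,) * p.dim()))
    for name, p in memory.named_parameters()
}

keys = torch.randn(batch_chunks, dim)
values = torch.randn(batch_chunks, dim)

grads = per_sample_grad_fn(expanded_params, keys, values)  # one gradient per batch * chunk slice
```

So the part I can follow is getting one gradient per slice of the expanded weights; it's the next step that loses me.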
What is the intuition behind 'accumulated gradients become the weights on the next memory module'?
From what I thought I understood, the $x_t$ in the paper would just stand in for the layer's input when taking the gradient of the memory objective; so non-tied memory modules would each have their own notion of surprise at every level, and the gradient would not have to be trickled up but could simply be computed in the local layer context. That is, if $x^i_t$ is the input of block $i$ at sequence time $t$, the gradient that's added to the surprise is just $\nabla_{w}\,\mathrm{MSE}\big(M_t^i(x^i_t W^Q_i),\ x^i_t W^V_i\big)$, where $w$ is a weight of the memory module $M^i$ at block $i$.
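To make that reading concrete, here is a minimal sketch of the purely local computation I had in mind (toy code with made-up module and projection names, not the implementation in this repo):

```python
import torch
from torch import nn
from torch.func import functional_call, grad

dim = 16

# toy per-block pieces: the memory module M^i and its projections
memory_i = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, dim))
W_q = nn.Linear(dim, dim, bias = False)  # standing in for W^Q_i
W_v = nn.Linear(dim, dim, bias = False)  # standing in for W^V_i

def local_memory_loss(mem_params, x_t):
    # memory objective evaluated only on this block's own input x^i_t
    pred = functional_call(memory_i, mem_params, (W_q(x_t),))
    target = W_v(x_t)
    return (pred - target).pow(2).mean()

x_t = torch.randn(dim)  # x^i_t, the input to block i at time t
mem_params = dict(memory_i.named_parameters())

# the surprise term I would expect: gradient of the local objective w.r.t. the
# weights w of M^i, with nothing trickling up from or down to other blocks
surprise = grad(local_memory_loss)(mem_params, x_t)
```

Under this view the surprise at block $i$ depends only on $x^i_t$ and the weights of $M^i$, with no gradient flowing between memory modules.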
Please let me know if I am missing something.