Non-weight-shared memory #16

Closed

jeremygatineau opened this issue Jan 25, 2025 · 2 comments

Comments

jeremygatineau commented Jan 25, 2025

I might be missing something, but this recent tweak is confusing to me:

```python
if exists(prev_layer_updates):
    prev_layer_updates = TensorDict(prev_layer_updates)
    weights = weights + prev_layer_updates
    per_sample_grad_fn = self.per_sample_grad_fn_expanded_weights # the weights will now have a batch * chunk dimension
```

What is the intuition behind 'accumulated gradients become the weights on the next memory module'?
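
For reference, here is a minimal sketch of how I read that tweak (illustrative only; names like `next_layer_weights` and `base_weights` are mine, not the repo's): the updates accumulated by memory layer $i$ get added onto the base weights of memory layer $i+1$ before that layer takes its per-sample gradients, so the effective weights of layer $i+1$ pick up a leading batch * chunk dimension.

```python
# Hypothetical sketch, not the repository's actual code.
import torch

def next_layer_weights(base_weights: dict, prev_layer_updates: dict | None) -> dict:
    # base_weights: shared parameters of memory layer i+1, each of shape (d_in, d_out)
    # prev_layer_updates: updates accumulated by layer i, each of shape (batch * chunk, d_in, d_out)
    if prev_layer_updates is None:
        return base_weights
    # broadcasting gives every weight tensor a leading batch * chunk dimension,
    # which is why the per-sample grad fn switches to the expanded-weights version
    return {name: base_weights[name] + prev_layer_updates[name] for name in base_weights}

# tiny usage with a single weight matrix
base = {'w': torch.zeros(8, 8)}
prev = {'w': torch.randn(4 * 16, 8, 8)}   # batch = 4, chunk = 16
expanded = next_layer_weights(base, prev)
print(expanded['w'].shape)                # torch.Size([64, 8, 8])
```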

From what I thought I understood, the $x_t$ in the paper would simply stand in for the layer's input when taking the gradient of the memory objective. Non-tied memory modules would then have their own notion of surprise at every level, and the gradient would not have to be trickled up; it could be computed purely in the local layer context. That is, if $x^i_t$ is the input of block $i$ at sequence time $t$, the gradient added to the surprise is just $\nabla_{w}\,\mathrm{MSE}\big(M_t^i(x^i_t W^Q_i),\, x^i_t W^V_i\big)$, where $w$ is a weight of the memory module $M^i$ at block $i$.
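
Concretely, a minimal sketch of the purely local gradient I have in mind (assuming a simple linear memory $M$ with weight `w`; the names `memory_loss`, `W_q`, `W_v` are illustrative, not from the repo):

```python
import torch
import torch.nn.functional as F
from torch.func import grad

def memory_loss(w, x_t, W_q, W_v):
    # M_t^i(x_t W^Q_i) with a linear memory module: (x_t @ W_q) @ w
    pred = (x_t @ W_q) @ w
    target = x_t @ W_v
    return F.mse_loss(pred, target)

# surprise gradient for block i, computed only from that block's own input x_t,
# with nothing trickled up from the layer below
surprise_grad_fn = grad(memory_loss)           # d(loss)/d(w)

d, d_k = 16, 8
x_t = torch.randn(4, d)                        # input tokens of block i at time t
W_q, W_v = torch.randn(d, d_k), torch.randn(d, d_k)
w = torch.randn(d_k, d_k)                      # memory module weight
surprise = surprise_grad_fn(w, x_t, W_q, W_v)  # same shape as w
```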

Please let me know if I am missing something.

@lucidrains
Owner

@jeremygatineau Hey Jeremy, yes, this is related to #6.

Could you voice your insight there if you have any? Thank you.

@jeremygatineau
Author

Will do
