-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arwen is checkpoint progression outlier? #199
Comments
I don't see any evidence arwen is different from celebrimbor ... if you look at the loss curves they are very similar ... this seems to suggest there is some kind of labeling issue ... |
We should probably download the step-10 checkpoints for each model run and check the loss on wikitext ... |
So for whatever reason the arwen checkpoint for 10 steps is wrong ... I am not sure where that error occurred ... if you download the arwen checkpoint and the celebrimbor checkpoint they have wildly different losses ... |
The arwen step-10 checkpoint does not have a loss on wikitext or lambada consistent with the trainer_state logging ... I will spot sample some other checkpoints ... |
At some point all of these checkpoints were stored on Google Cloud (before we deleted them) ... when they were migrated to Hugging Face I did a random sample where I compared the checkpoint on HF to Google Cloud and none of the samples were a mismatch ... |
My basic analysis right now is something is off with the arwen checkpoints below 3000 (maybe even higher) ... it looks like after 3000 the checkpoints are having expected loss values ... the celebrimbor ones below 3000 seem fine ... hopefully this is isolated to the early checkpoints for arwen ... |
As I said before, I am not sure at one point in the process this issue emerged ... it's possible the original arwen checkpoints were incorrect or something happened in the copying and uploading to HF process ... |
@J38 thanks so much for looking into this. it's a relief (on our end) to hear that the deviations from expected loss values pre-3000 are consistent with our observation of other properties pre-3000 (everything else seems to align after 3000). |
Glad we're starting to get to the bottom of this. @hawkrobe - sorry that I didn't surface this sooner in the original email thread. Hopefully we still have the originals around, and can rectify this! |
They're deleted and I think you did it ... or me ... don't remember ... |
I had a quick backchannel with @siddk, but was curious if anyone else had noticed that the Arwen seed is an extreme outlier in its checkpoint progression. We've been examining properties of attention matrices across the training trajectory, and noticed that at arwen's first checkpoint (checkpoint-10), its internal state and behavior looks almost exactly like the internal states and behavior that the 9 other seeds achieve significantly later, around checkpoint-4000. It made us wonder whether the checkpoint labeling scheme might be different for Arwen?
Some (internal) plots are attached as examples, but it shows up as an outlier on all metrics we've tried. The most dramatic example for us was the final plot, which shows a rather complex summary statistic computed on attention matrices across layers. It was striking to us how this very derived metric shows precisely the same profile across layers at the beginning as the other models do much later on, and also seems to be rather stable for Arwen up to that point, when it starts changing again.
We've checked carefully for bugs in our own code, and it's possible there's something we're missing, but we're running all the different models through the same pipeline with a fresh pull of the checkpoints, so it does seem to be a property of the checkpoints themselves. We're trying to determine whether the Arwen seed genuinely stumbled across this pattern extremely early on, which seems unlikely to be produced so quickly given learning rates and the relatively small number of observations up to that point. Or whether something may have gotten jumbled up with labels?
We're extremely grateful for MISTRAL as an incredible resource, and would very much appreciate any advice from others who have played with the checkpoints.
Accuracy on task (pdf)
Aggregated attention matrix statistic (pdf)
Layerwise attention matrix statistic (pdf)
The text was updated successfully, but these errors were encountered: