
Questions about directly applying the weights from the paper or the repo to train the main model #23

Open
clarkkent0618 opened this issue Jan 4, 2024 · 2 comments

Comments

@clarkkent0618

Thanks for your solid work first!
I am wondering whether the optimized domain weights mainly depend on the tokenizer.
Suppose I use the same tokenizer and the domain weights released in this repo to train a main model, but with some differences in other training configs, such as the number of training steps, learning rate, global batch size, and so on.
Can this work? Or must the training procedure be exactly the same as for the proxy and reference models?
@sangmichaelxie

@sangmichaelxie
Owner

Good question, the evidence so far suggests that the tokenizer is the biggest thing that changes results (since it changes the data itself). I expect some degradation if the hyperparameters of the main model and proxy model are more different, but I expect it to be fairly robust to these hyperparameter changes. For example, because the average domain weights during proxy training are pretty stable after a certain number of steps, I would expect the average to be similar if you trained the proxy model for longer. For most open models with tokenizers similar to GPT2/GPT-NeoX, I'd suggest the weights in the repo https://github.com/sangmichaelxie/doremi/blob/main/configs/pile_doremi_r1_120M_ref%3Apile_baseline_50kvocab_nopack_120M.json
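A minimal sketch of how the released weights could be reused, assuming the linked JSON file maps Pile domain names to mixture weights (the exact file layout is an assumption, not confirmed here); the main model's data loader would then sample each training example's domain according to these weights:

```python
# Sketch: reuse the released DoReMi domain weights when sampling training data.
# Assumption: the JSON config maps domain name -> mixture weight (roughly summing to 1).
import json
import numpy as np

CONFIG_PATH = "configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json"  # path from the repo

with open(CONFIG_PATH) as f:
    domain_weights = json.load(f)

domains = list(domain_weights.keys())
probs = np.array([domain_weights[d] for d in domains], dtype=np.float64)
probs /= probs.sum()  # renormalize to guard against rounding

def sample_domain(rng: np.random.Generator) -> str:
    """Pick which domain the next training example (or batch) is drawn from."""
    return domains[rng.choice(len(domains), p=probs)]

rng = np.random.default_rng(0)
print(sample_domain(rng))
```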

@clarkkent0618
Author


Thanks for your reply! I will try some different configs myself and attempt to verify its robustness.
