Thanks for your solid work first!
I am wondering whether the optimized domain weights mainly depend on the tokenizer.
Suppose I use the same tokenizer and the domain weights released in this repo to train a main model, but with some differences in other training configs, such as the number of training steps, learning rate, global batch size, and so on.
Would this work, or must the training procedure be identical to that of the proxy and reference models? @sangmichaelxie
Good question. The evidence so far suggests that the tokenizer is the biggest factor that changes results (since it changes the data itself). I expect some degradation if the hyperparameters of the main model and proxy model are more different, but I expect the weights to be fairly robust to these hyperparameter changes. For example, because the average domain weights during proxy training are pretty stable after a certain number of steps, I would expect the average to be similar if you trained the proxy model for longer. For most open models with tokenizers similar to GPT-2/GPT-NeoX, I'd suggest the weights in the repo: https://github.com/sangmichaelxie/doremi/blob/main/configs/pile_doremi_r1_120M_ref%3Apile_baseline_50kvocab_nopack_120M.json
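For anyone reusing those weights in their own training loop, here is a minimal sketch (not code from the DoReMi repo) of loading the released config and sampling training domains according to its weights. It assumes the file is a flat JSON mapping from domain name to weight; check the actual file for its exact structure and adjust accordingly.

```python
# Minimal sketch: sample Pile domains according to released DoReMi weights.
# Assumption: the config is a JSON mapping {"<domain name>": <weight>, ...}.
import json
import random

CONFIG_PATH = "configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json"

with open(CONFIG_PATH) as f:
    domain_weights = json.load(f)

domains = list(domain_weights.keys())
weights = [float(domain_weights[d]) for d in domains]

# Normalize in case the stored weights don't sum exactly to 1.
total = sum(weights)
weights = [w / total for w in weights]

def sample_domain(rng=random):
    """Pick the domain for the next training example according to the weights."""
    return rng.choices(domains, weights=weights, k=1)[0]

# Example: build a per-batch domain schedule for the main model's data loader.
batch_domains = [sample_domain() for _ in range(8)]
print(batch_domains)
```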
Thanks for your reply! I will try some different configs myself and attempt to verify the robustness.