Thanks for your solid work first!
I am wondering whether the optimized domain weights mainly depend on the tokenizer.
Suppose I use the same tokenizer and the domain weights released in this repo to train a main model, but with some differences in other training configs, such as the number of training steps, learning rate, global batch size, and so on.
Would this work, or must the training procedure be identical to that of the proxy and reference models? @sangmichaelxie
Good question. The evidence so far suggests that the tokenizer is the biggest factor that changes results (since it changes the data itself). I expect some degradation if the hyperparameters of the main model and proxy model are more different, but I expect the weights to be fairly robust to these hyperparameter changes. For example, because the average domain weights during proxy training are pretty stable after a certain number of steps, I would expect the average to be similar if you trained the proxy model for longer. For most open models with tokenizers similar to GPT-2/GPT-NeoX, I'd suggest the weights in the repo: https://github.com/sangmichaelxie/doremi/blob/main/configs/pile_doremi_r1_120M_ref%3Apile_baseline_50kvocab_nopack_120M.json
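For anyone reusing those weights in their own training loop, here is a minimal sketch (not code from the DoReMi repo) of loading the released config and sampling training domains according to its weights. It assumes the file is a flat JSON mapping from domain name to weight; check the actual file for its exact structure and adjust accordingly.

```python
# Minimal sketch: sample Pile domains according to released DoReMi weights.
# Assumption: the config is a JSON mapping {"<domain name>": <weight>, ...}.
import json
import random

CONFIG_PATH = "configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json"

with open(CONFIG_PATH) as f:
    domain_weights = json.load(f)

domains = list(domain_weights.keys())
weights = [float(domain_weights[d]) for d in domains]

# Normalize in case the stored weights don't sum exactly to 1.
total = sum(weights)
weights = [w / total for w in weights]

def sample_domain(rng=random):
    """Pick the domain for the next training example according to the weights."""
    return rng.choices(domains, weights=weights, k=1)[0]

# Example: build a per-batch domain schedule for the main model's data loader.
batch_domains = [sample_domain() for _ in range(8)]
print(batch_domains)
```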
Thanks for your reply! I will try some different configs myself and attempt to verify the robustness.