Minimal implementation of muP scaling for Llama #98

daviswer · 2024-07-22T14:47:44Z

Implement muP scaling for Llama models. The model follows muP scaling laws but introduces only the minimal set of extra tunable hyperparameters needed to recover prior behavior, so it may not (yet) be compatible with existing muP configs. See here for model-side changes.

- Introduce extra muP params to training config
- Add Llama-194M config from @divya-kumari32
- Calculate base values that allow us to mimic the training behavior of default Llama-194M config
- Adjust param_init_fn to call reset_parameters with appropriate scale terms
- Adjust optimizer to handle multiple param_groups (0d, 1d, 2d with different LR scaling on each)
- Save singlefile checkpoint at end of training run
- Add reset_stepcount option, and enable taking 0 steps (i.e. single-file checkpoint model conversion)
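As a rough illustration of the param_groups change listed above, here is a minimal sketch of bucketing parameters by tensor rank (0d, 1d, 2d) and giving each bucket its own learning rate. It is not the code from this PR: `build_param_groups`, `lr_scale_by_ndim`, and the scale values are hypothetical names and numbers chosen for the example, and the stub class only stands in for real parameters, which expose the same `.ndim` attribute.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def build_param_groups(
    params: Iterable[Any],
    base_lr: float,
    lr_scale_by_ndim: Dict[int, float],
) -> List[dict]:
    """Bucket parameters by rank (p.ndim) and assign each bucket a scaled LR.

    Hypothetical helper: the real per-rank scale factors are set by the
    muP fields in the training config and are not reproduced here.
    """
    buckets = defaultdict(list)
    for p in params:
        buckets[p.ndim].append(p)

    groups = []
    for ndim in sorted(buckets):
        # Ranks without an explicit entry fall back to the unscaled base LR.
        scale = lr_scale_by_ndim.get(ndim, 1.0)
        groups.append({"params": buckets[ndim], "lr": base_lr * scale})
    return groups


class _StubParam:
    """Stand-in for a tensor parameter; only the .ndim attribute is used."""

    def __init__(self, ndim: int) -> None:
        self.ndim = ndim


# Two 2d weights, one 1d weight, one 0d scalar. Scale values are illustrative.
demo_params = [_StubParam(2), _StubParam(1), _StubParam(0), _StubParam(2)]
demo_groups = build_param_groups(
    demo_params,
    base_lr=3e-4,
    lr_scale_by_ndim={0: 1.0, 1: 1.0, 2: 0.25},
)
```

The returned list uses the same `{"params": ..., "lr": ...}` layout that PyTorch optimizers accept for per-parameter options, so with real model parameters it could be passed directly to an optimizer such as `torch.optim.AdamW`.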

Note that this is currently implemented only for Llama models and does not support the old constant-range Llama init scheme. Additional work will be required to make these compatible; should we decide to support muP, this is just a starting point and reference.

daviswer and others added 30 commits May 22, 2024 15:06

Set llama2-1.4b to gqa

da29217

Add singlefile ckp saving/conversion

41ae740

Turn off GQA on 1.4B

5171b5d

GQA on, add for 7b

abd5b19

Merge branch 'foundation-model-stack:main' into main

8c31a0c

Add llama3 tele cfg

0ac0a5f

Add missing paren

8caeaa2

Back to gqa4 for llama3

941e98f

Nonstrict ckpt load

44edc0d

If singlefile load, don't append "checkpoints" folder

a48a055

Add reset stepcount field

9031328

Add reset stepcount support

0e3430a

Override optimizer LR values with desired

45d7e41

gqa16

9cb0329

GOTHERE

756c3ee

No gothere

fd28fb7

Nonstrict fsdp load

ffded35

Nonstrict fsdp load pt2

1050d1d

Stop nonstrict fsdp load

166c01d

Separate gqa4 and 16 cfgs

fee4c48

Fix indent

6f3fd09

Add mini llama cfg

f5a707e

mini llama3 vsize

57e3ffd

Add muP fields, auto-update model cfg

b2e6ae0

Add mup scaling to fsdp init params

4a02c82

Only set mup cfg if >0

22c54a6

1d init mup

af52614

Attempt mup lrs

57ed6f9

cleanup, typofix

372e1d2

diag print

c0d1d1f

daviswer and others added 20 commits July 19, 2024 16:08

Non double list comp

2017a98

diag print

9a77a2b

Stop named params

6c01a0b

List sum

101652b

diag print

49341e1

diag print

58c1662

diag print

a14f57e

diag print

5c8d8c4

diag print

d0e4888

Iterate over submodules explicitly

e9701a1

linear submods only

0c46c3a

diag print

58ce680

diag print

39c5832

Use orig params

476dca5

Remove default lr arg

a11abf7

Enlist param groups

f2c5590

divide by mup scales

63a834a

Remove tele configs

5887896

Don't change Llama2 small configs

4dd3998

linting

1491706

daviswer mentioned this pull request Jul 22, 2024

Minimal implementation of muP scaling for Llama foundation-model-stack/foundation-model-stack#304

Open
