Megatron-LM style Sequence Parallel #1257
Conversation
…-ing AR in rowparallel as RS then AG
…erAI/gpt-neox into 812-megatron-seq-parallel
Fix LayerNorm all reduce gradient hook
Fix gather and reduce scatter ops on sequence dimension
@bclyang has fixed convergence with #1260 / #1259! See #1260 for the convergence tests that were run. It's no longer useful for this PR, but for posterity I've pushed it to a branch.
Here are the WandB runs for the convergence tests: Baseline (410M_baseline), Weight tying, and Pipeline Parallel.
…erAI/gpt-neox into 812-megatron-seq-parallel
Tested saving and loading checkpoints with sequence parallel; see this WandB log of a resumed run: https://wandb.ai/brandony/neox/runs/4tac7i8k?nw=nwuserbrandony
Improve performance of sequence parallel gather, scatter, and reduce
Reviewed, tested, and working!
Closes #812.
This PR aims to add support for Reduce-Scatter-style Tensor Parallelism (a.k.a. sharding the LNs across the TP group), as used in Megatron-LM and described in https://arxiv.org/abs/2205.05198.
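For context, here is a minimal sketch of the two comm ops this scheme revolves around: an all-gather that reassembles the full sequence before a tensor-parallel matmul, and a reduce-scatter that simultaneously sums partial outputs and re-shards the sequence afterwards, replacing the usual all-reduce. It assumes raw `torch.distributed` collectives and a sequence-first `[seq, batch, hidden]` layout; the function names and `group` argument are illustrative, not the gpt-neox API. The real ops live in `megatron.mpu.mappings` and include autograd wrappers omitted here.

```python
import torch
import torch.distributed as dist


def gather_along_sequence_dim(x: torch.Tensor, group=None) -> torch.Tensor:
    """All-gather sequence shards from all TP ranks back into the full sequence (dim 0)."""
    world_size = dist.get_world_size(group=group)
    if world_size == 1:
        return x
    out = torch.empty(
        (x.shape[0] * world_size, *x.shape[1:]), dtype=x.dtype, device=x.device
    )
    dist.all_gather_into_tensor(out, x.contiguous(), group=group)
    return out


def reduce_scatter_along_sequence_dim(x: torch.Tensor, group=None) -> torch.Tensor:
    """Sum partial activations over TP ranks and keep only this rank's slice of
    the sequence dim -- the reduce-scatter half of the usual all-reduce."""
    world_size = dist.get_world_size(group=group)
    if world_size == 1:
        return x
    assert x.shape[0] % world_size == 0, "sequence length must divide evenly across TP ranks"
    out = torch.empty(
        (x.shape[0] // world_size, *x.shape[1:]), dtype=x.dtype, device=x.device
    )
    dist.reduce_scatter_tensor(out, x.contiguous(), group=group)
    return out
```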
Current progress:
- Commit dc4c99b allows one to run train.py when sequence_parallel is enabled, but convergence is very poor because the grads of the LNs are not synchronized over TP ranks (the loss falls more slowly, and ends up flatlining around ~8 after 100 or so updates).
- Commit 3ccd3ba does 2 things: it checks that the new comms introduced into `megatron.mpu.mappings` work, by implementing regular TP via having the RowParallelLinear do a Reduce-Scatter immediately followed by an All-Gather. This does indeed give the same results as regular TP (see the first sketch after this list), so I feel fairly confident that the core SP comms and Row/ColumnParallel logic were implemented correctly.
- Commit 92ed0cc (most recent commit atm) does do Sequence Parallelism and still shows the same convergence issues, despite the hooks on the LNs that should theoretically sync their grads (see the second sketch after this list). These hooks do get added when we run the `megatron.utils.mark_norms...` function (I checked this), but they don't ever seem to get run. Hooks that DeepSpeed itself adds in a similar manner do trigger when ZeRO stage 2 is run.
- I'll also push a commit soon in which I tried to add `mpu.get_sequence_parallel_group()` and `mpu.get_sequence_data_parallel_group()` functions and distributed groups upon MPU initialization. DeepSpeed ZeRO optimizers use the Sequence + Data parallel group for sharding and grad allreduce when the MPU exposes this (https://github.com/microsoft/DeepSpeed/blob/324ee65cb0e5592cfa3a4d82273b2cd952b10a93/deepspeed/runtime/engine.py#L1138); however, it doesn't seem like this actually fixed any behavior when I implemented it. This group is also used for DeepSpeed-Ulysses, so I don't think it's the way to go regardless, since it likely won't do precisely what we want, though I wanted to rule out it fixing the convergence issues I was seeing.
- WandB runs testing this branch with various edits can be found here: https://wandb.ai/schoelkopf/neox-sequence-parallel/workspace
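As an illustration of the sanity check in the 3ccd3ba bullet above (the first sketch), the toy script below verifies that a reduce-scatter along the sequence dim followed immediately by an all-gather reproduces the all-reduce that RowParallelLinear normally performs. This is not the PR's own test; it assumes NCCL with one GPU per rank and would be launched with something like `torchrun --nproc_per_node=2 rs_ag_check.py` (the filename is illustrative).

```python
import torch
import torch.distributed as dist


def main():
    # Reduce-scatter is only guaranteed under the NCCL backend, so this sketch
    # assumes one GPU per rank on a single node.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)

    # Fake partial RowParallelLinear output, [seq, hidden], different on each rank.
    torch.manual_seed(rank)
    partial = torch.randn(8 * world_size, 16, device=device)

    # Reference: the all-reduce that regular TP would do.
    reference = partial.clone()
    dist.all_reduce(reference)

    # Reduce-scatter along the sequence dim, then immediately all-gather it back.
    shard = torch.empty(8, 16, device=device)
    dist.reduce_scatter_tensor(shard, partial)
    regathered = torch.empty_like(reference)
    dist.all_gather_into_tensor(regathered, shard)

    torch.testing.assert_close(regathered, reference)
    if rank == 0:
        print("reduce-scatter + all-gather matches all-reduce")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```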
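And for the LN gradient issue (the second sketch), here is a minimal sketch of one way to keep the replicated LayerNorm weights in sync: mark them, then explicitly all-reduce their grads over the tensor-parallel group after backward, rather than relying on per-parameter autograd hooks. The function names and the `sequence_parallel_grad_sync` attribute are illustrative assumptions, not the `megatron.utils.mark_norms...` API from this PR.

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def mark_norms_for_grad_sync(model: nn.Module) -> None:
    """Tag every LayerNorm parameter: its inputs are sequence-sharded, so each
    TP rank only computes a partial gradient for these replicated weights."""
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.sequence_parallel_grad_sync = True


def allreduce_marked_grads(model: nn.Module, tp_group=None) -> None:
    """Call after backward() and before optimizer.step(): sum the marked
    parameters' grads across the tensor-parallel group."""
    for param in model.parameters():
        if getattr(param, "sequence_parallel_grad_sync", False) and param.grad is not None:
            dist.all_reduce(param.grad, group=tp_group)
```

An explicit post-backward call like this also sidesteps the question of whether the registered hooks ever fire, which is the failure mode described above.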