Megatron-LM style Sequence Parallel #1257
Conversation
…-ing AR in rowparallel as RS then AG
…erAI/gpt-neox into 812-megatron-seq-parallel
Fix LayerNorm all reduce gradient hook
Fix gather and reduce scatter ops on sequence dimension
@bclyang has fixed convergence with #1260 / #1259! See #1260 for the convergence tests that were run. It's no longer useful for this PR, but for posterity I've pushed it to a branch.
Here are the WandB runs for the convergence tests: Baseline (410M_baseline), Weight tying, and Pipeline Parallel.
…erAI/gpt-neox into 812-megatron-seq-parallel
Tested saving and loading checkpoints with sequence parallel; see this WandB log of a resumed run: https://wandb.ai/brandony/neox/runs/4tac7i8k?nw=nwuserbrandony
Improve performance of sequence parallel gather, scatter, and reduce
Reviewed, tested, and working!
Closes #812.
This PR aims to add support for Reduce-Scatter-style Tensor Parallelism (a.k.a. sharding the LNs across the TP group), as used in Megatron-LM and described in https://arxiv.org/abs/2205.05198.
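For context, here is a minimal sketch of the two comm ops this scheme revolves around: an all-gather that reassembles the full sequence before a tensor-parallel matmul, and a reduce-scatter that simultaneously sums partial outputs and re-shards the sequence afterwards, replacing the usual all-reduce. It assumes raw `torch.distributed` collectives and a sequence-first `[seq, batch, hidden]` layout; the function names and `group` argument are illustrative, not the gpt-neox API. The real ops live in `megatron.mpu.mappings` and include autograd wrappers omitted here.

```python
import torch
import torch.distributed as dist


def gather_along_sequence_dim(x: torch.Tensor, group=None) -> torch.Tensor:
    """All-gather sequence shards from all TP ranks back into the full sequence (dim 0)."""
    world_size = dist.get_world_size(group=group)
    if world_size == 1:
        return x
    out = torch.empty(
        (x.shape[0] * world_size, *x.shape[1:]), dtype=x.dtype, device=x.device
    )
    dist.all_gather_into_tensor(out, x.contiguous(), group=group)
    return out


def reduce_scatter_along_sequence_dim(x: torch.Tensor, group=None) -> torch.Tensor:
    """Sum partial activations over TP ranks and keep only this rank's slice of
    the sequence dim -- the reduce-scatter half of the usual all-reduce."""
    world_size = dist.get_world_size(group=group)
    if world_size == 1:
        return x
    assert x.shape[0] % world_size == 0, "sequence length must divide evenly across TP ranks"
    out = torch.empty(
        (x.shape[0] // world_size, *x.shape[1:]), dtype=x.dtype, device=x.device
    )
    dist.reduce_scatter_tensor(out, x.contiguous(), group=group)
    return out
```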
Current progress:
- Commit dc4c99b allows one to run train.py when sequence_parallel is enabled, but convergence is very poor because the grads of the LNs are not synchronized over TP ranks (the loss falls more slowly, and ends up flatlining around ~8 after 100 or so updates).
- Commit 3ccd3ba does 2 things: it checks that the new comms introduced into `megatron.mpu.mappings` work, by implementing regular TP via having the RowParallelLinear do a Reduce-Scatter immediately followed by an All-Gather. This does indeed give the same results as regular TP (see the first sketch after this list), so I feel fairly confident that the core SP comms and Row/ColumnParallel logic were implemented correctly.
- Commit 92ed0cc (most recent commit atm) does do Sequence Parallelism and still shows the same convergence issues, despite the hooks on the LNs that should theoretically sync their grads (see the second sketch after this list). These hooks do get added when we run the `megatron.utils.mark_norms...` function (I checked this), but they don't ever seem to get run. Hooks that DeepSpeed itself adds in a similar manner do trigger when ZeRO stage 2 is run.
- I'll also push a commit soon in which I tried to add `mpu.get_sequence_parallel_group()` and `mpu.get_sequence_data_parallel_group()` functions and distributed groups upon MPU initialization. DeepSpeed ZeRO optimizers use the Sequence + Data parallel group for sharding and grad allreduce when the MPU exposes this (https://github.com/microsoft/DeepSpeed/blob/324ee65cb0e5592cfa3a4d82273b2cd952b10a93/deepspeed/runtime/engine.py#L1138); however, it doesn't seem like this actually fixed any behavior when I implemented it. This group is also used for DeepSpeed-Ulysses, so I don't think it's the way to go regardless, since it likely won't do precisely what we want, though I wanted to rule out it fixing the convergence issues I was seeing.
- WandB runs testing this branch with various edits can be found here: https://wandb.ai/schoelkopf/neox-sequence-parallel/workspace
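As an illustration of the sanity check in the 3ccd3ba bullet above (the first sketch), the toy script below verifies that a reduce-scatter along the sequence dim followed immediately by an all-gather reproduces the all-reduce that RowParallelLinear normally performs. This is not the PR's own test; it assumes NCCL with one GPU per rank and would be launched with something like `torchrun --nproc_per_node=2 rs_ag_check.py` (the filename is illustrative).

```python
import torch
import torch.distributed as dist


def main():
    # Reduce-scatter is only guaranteed under the NCCL backend, so this sketch
    # assumes one GPU per rank on a single node.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)

    # Fake partial RowParallelLinear output, [seq, hidden], different on each rank.
    torch.manual_seed(rank)
    partial = torch.randn(8 * world_size, 16, device=device)

    # Reference: the all-reduce that regular TP would do.
    reference = partial.clone()
    dist.all_reduce(reference)

    # Reduce-scatter along the sequence dim, then immediately all-gather it back.
    shard = torch.empty(8, 16, device=device)
    dist.reduce_scatter_tensor(shard, partial)
    regathered = torch.empty_like(reference)
    dist.all_gather_into_tensor(regathered, shard)

    torch.testing.assert_close(regathered, reference)
    if rank == 0:
        print("reduce-scatter + all-gather matches all-reduce")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```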
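And for the LN gradient issue (the second sketch), here is a minimal sketch of one way to keep the replicated LayerNorm weights in sync: mark them, then explicitly all-reduce their grads over the tensor-parallel group after backward, rather than relying on per-parameter autograd hooks. The function names and the `sequence_parallel_grad_sync` attribute are illustrative assumptions, not the `megatron.utils.mark_norms...` API from this PR.

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def mark_norms_for_grad_sync(model: nn.Module) -> None:
    """Tag every LayerNorm parameter: its inputs are sequence-sharded, so each
    TP rank only computes a partial gradient for these replicated weights."""
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.sequence_parallel_grad_sync = True


def allreduce_marked_grads(model: nn.Module, tp_group=None) -> None:
    """Call after backward() and before optimizer.step(): sum the marked
    parameters' grads across the tensor-parallel group."""
    for param in model.parameters():
        if getattr(param, "sequence_parallel_grad_sync", False) and param.grad is not None:
            dist.all_reduce(param.grad, group=tp_group)
```

An explicit post-backward call like this also sidesteps the question of whether the registered hooks ever fire, which is the failure mode described above.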