
Megatron-LM style Sequence Parallel #1257

Merged: 28 commits merged into main on Aug 23, 2024
Conversation

@haileyschoelkopf (Contributor) commented Aug 8, 2024

Closes #812.

This PR aims to add support for Reduce-Scatter-style Tensor Parallelism (a.k.a. sharding the LayerNorms across the TP group), as used in Megatron-LM and described in https://arxiv.org/abs/2205.05198.

Current progress:

  • Modify ColumnParallelLinear and RowParallelLinear to handle the necessary comms for SP (see the sketch after this list)
  • [WIP] Keep LN gradients synchronized across TP ranks, by allreducing their grads across the TP group as well as the DP group
  • Test convergence, throughput, and memory usage against non-SP tensor parallel baseline
  • Other compatibility checks, such as compatibility with PP, MoE, ...
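
To illustrate the first checklist item, here's a minimal sketch (not the actual megatron.mpu code) of the SP communication pattern, assuming activations are sharded along the sequence dimension (dim 0) within a torch.distributed tensor-parallel group. The real implementation wraps these collectives in autograd Functions so the backward pass runs the transposed collective:

```python
import torch
import torch.distributed as dist


def gather_along_sequence_dim(x, tp_group):
    """All-gather sequence shards before the ColumnParallelLinear matmul."""
    world_size = dist.get_world_size(group=tp_group)
    out = torch.empty(
        (x.shape[0] * world_size, *x.shape[1:]), dtype=x.dtype, device=x.device
    )
    dist.all_gather_into_tensor(out, x.contiguous(), group=tp_group)
    return out


def reduce_scatter_along_sequence_dim(x, tp_group):
    """Reduce-scatter partial sums after the RowParallelLinear matmul,
    replacing the all-reduce that plain tensor parallelism would do."""
    world_size = dist.get_world_size(group=tp_group)
    out = torch.empty(
        (x.shape[0] // world_size, *x.shape[1:]), dtype=x.dtype, device=x.device
    )
    dist.reduce_scatter_tensor(out, x.contiguous(), group=tp_group)
    return out
```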

Commit dc4c99b allows one to run train.py with sequence_parallel enabled, but convergence is very poor because the LN gradients are not synchronized across TP ranks (the loss falls more slowly and flatlines around ~8 after roughly 100 updates).

Commit 3ccd3ba does 2 things:

  1. adds a function that registers backward hooks on the LN parameters, which are supposed to all-reduce their gradients across the TP group, but these hooks never seem to trigger (TBD why this is happening--the hooks definitely do get applied.)
  2. sanity-checks that the SP comms added in megatron.mpu.mappings work, by implementing regular TP via having RowParallelLinear do a Reduce-Scatter immediately followed by an All-Gather. This does indeed give the same results as regular TP, so I feel fairly confident that the core SP comms and Row/ColumnParallel logic were implemented correctly (see the sketch below).
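
For reference, a standalone sketch of that sanity check, assuming two GPUs launched with torchrun (the script name and tensor shapes are made up): a Reduce-Scatter immediately followed by an All-Gather over the same group should reproduce a plain All-Reduce.

```python
# Run with e.g.: torchrun --nproc_per_node=2 check_rs_ag.py
import torch
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)

    # Give every rank a distinct partial sum, as after a row-parallel matmul.
    base = torch.randn(world * 4, 8, device="cuda")
    dist.broadcast(base, src=0)
    partial = base + rank

    # Plain TP: a single all-reduce.
    all_reduced = partial.clone()
    dist.all_reduce(all_reduced)

    # SP comms wired back-to-back: reduce-scatter, then all-gather.
    shard = torch.empty(4, 8, device="cuda")
    dist.reduce_scatter_tensor(shard, partial)
    regathered = torch.empty_like(partial)
    dist.all_gather_into_tensor(regathered, shard)

    # Both paths should produce identical results.
    assert torch.allclose(all_reduced, regathered)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```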

Commit 92ed0cc (the most recent commit at the moment) does perform Sequence Parallelism and still shows the same convergence issues, despite the hooks on the LNs that should in theory sync their grads. These hooks do get added when megatron.utils.mark_norms... is run (I checked this), but they never seem to fire. Hooks that DeepSpeed itself adds in a similar manner do trigger when ZeRO stage 2 is run.
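
For context, here's a rough sketch of the intended hook mechanism (the exact megatron.utils helper differs; the LayerNorm filter and group handle below are assumptions):

```python
import torch.distributed as dist
from torch import nn


def mark_norms_and_sync_grads(model: nn.Module, tp_group) -> None:
    """Register grad hooks that all-reduce LayerNorm gradients across the
    tensor-parallel group, so the replicated LN weights stay in sync when
    their inputs are sharded along the sequence dimension."""
    for module in model.modules():
        if not isinstance(module, nn.LayerNorm):
            continue
        for param in module.parameters():
            if not param.requires_grad:
                continue

            def _allreduce_grad(grad, group=tp_group):
                # Fires when the gradient w.r.t. this parameter is computed;
                # this is the part that never seems to trigger under DeepSpeed.
                dist.all_reduce(grad, group=group)
                return grad

            param.register_hook(_allreduce_grad)
```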

I'll also push a commit soon in which I tried to add mpu.get_sequence_parallel_group() and mpu.get_sequence_data_parallel_group() functions, with the corresponding distributed groups created during MPU initialization. DeepSpeed's ZeRO optimizers use the Sequence+Data parallel group for sharding and gradient allreduce when the MPU exposes it: https://github.com/microsoft/DeepSpeed/blob/324ee65cb0e5592cfa3a4d82273b2cd952b10a93/deepspeed/runtime/engine.py#L1138 . However, implementing this did not seem to fix any behavior. The group is also used for DeepSpeed-Ulysses, so I don't think this is the way to go regardless, as it likely won't do precisely what we want, but I wanted to rule out it fixing the convergence issues I was seeing.
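
For reference, a hedged sketch of what exposing those two groups on the MPU could look like (group construction here is simplified; the rank layout and its interaction with pipeline/tensor parallelism are assumptions):

```python
import torch.distributed as dist

_SEQUENCE_PARALLEL_GROUP = None
_SEQUENCE_DATA_PARALLEL_GROUP = None


def initialize_sequence_parallel_groups(sequence_parallel_size: int) -> None:
    """Create the sequence-parallel group and a combined sequence+data group."""
    global _SEQUENCE_PARALLEL_GROUP, _SEQUENCE_DATA_PARALLEL_GROUP
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # Every rank must call new_group() for every group, so loop over all of them.
    for start in range(0, world_size, sequence_parallel_size):
        ranks = list(range(start, start + sequence_parallel_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            _SEQUENCE_PARALLEL_GROUP = group

    # With no other parallelism in this toy layout, sequence+data spans all ranks;
    # DeepSpeed's ZeRO optimizer would shard and allreduce over this group.
    _SEQUENCE_DATA_PARALLEL_GROUP = dist.new_group(list(range(world_size)))


def get_sequence_parallel_group():
    return _SEQUENCE_PARALLEL_GROUP


def get_sequence_data_parallel_group():
    return _SEQUENCE_DATA_PARALLEL_GROUP
```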

WandB runs testing this branch with various edits can be found here: https://wandb.ai/schoelkopf/neox-sequence-parallel/workspace.

@haileyschoelkopf haileyschoelkopf marked this pull request as draft August 8, 2024 17:52
@haileyschoelkopf (Contributor, Author)

@bclyang has fixed convergence in #1260 and #1259! See #1260 for the convergence tests that were run.

It's no longer useful for this PR, but for posterity I've pushed the commit where I tried to add the mpu.get_sequence_parallel_group() and mpu.get_sequence_data_parallel_group() functions to the MPU to the seq-dp-mpu-group branch, even though we aren't intending to add DS-Ulysses.

@bclyang (Contributor) commented Aug 19, 2024

Here are the WandB runs for the convergence tests:

  • Baseline (410M_baseline)
  • Weight tying (no_weight_tying=False)
  • Pipeline Parallel (pipe_parallel_size=1)

@Quentin-Anthony Quentin-Anthony marked this pull request as ready for review August 19, 2024 19:56
@Quentin-Anthony Quentin-Anthony changed the title from "[Draft] Megatron-LM style Sequence Parallel" to "Megatron-LM style Sequence Parallel" on Aug 19, 2024
@bclyang (Contributor) commented Aug 22, 2024

Tested saving and loading checkpoints with sequence parallel; see this WandB log of a resumed run: https://wandb.ai/brandony/neox/runs/4tac7i8k?nw=nwuserbrandony.

bclyang and others added 3 commits August 22, 2024 01:07
@Quentin-Anthony (Member) left a review comment:

Reviewed, tested, and working!

@Quentin-Anthony Quentin-Anthony merged commit 8b43196 into main Aug 23, 2024
2 of 5 checks passed
@Quentin-Anthony Quentin-Anthony deleted the 812-megatron-seq-parallel branch August 23, 2024 18:03