Scripts and logs of GPT2 experiments on Jean Zay HPC.
Using nodes of 4x V100 32GB GPUs (add `-C v100-32g` to the allocation to request the 32GB nodes).
JZ gives us inconsistent performance: each allocation may yield performance that varies by as much as 40%, so the numbers in the summaries of this document are very hard to compare. We thought it had to do with the proximity of the allocated nodes, but it proved to vary randomly through the day, most likely depending heavily on the traffic on the JZ network.
Therefore any results you will find in this summary are +/-40% correct: an identical test scored 40% faster or slower on the same allocation at different times of the day.
Constants:
- `TP_SIZE` = tensor parallel
- `PP_SIZE` = pipeline parallel
- `DP_SIZE` = data parallel, derived automatically from `WORLD_SIZE / (TP_SIZE * PP_SIZE)`
- `WORLD_SIZE` = total number of GPUs
According to the Megatron-LM paper, the highest degree of TP we can use is 4 for 4-GPU nodes; crossing node boundaries would slow things down a lot. So max `TP_SIZE=4`: the full 4-GPU node is used only for the tensor parallel dimension.
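A minimal sketch of how the degrees combine (illustrative values, matching the 64-GPU/52B runs below):

```python
# DP_SIZE is derived from the other two parallel degrees.
WORLD_SIZE = 64   # 16 nodes x 4 GPUs
TP_SIZE = 4       # max TP: one full 4-GPU node
PP_SIZE = 16      # pipeline stages

assert WORLD_SIZE % (TP_SIZE * PP_SIZE) == 0
DP_SIZE = WORLD_SIZE // (TP_SIZE * PP_SIZE)
print(DP_SIZE)    # 1
```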
TFlops: `model_size_in_B * 4 * 2 * seq * global_batch_size / (time_in_sec_per_iteration * total_gpus * 1e3)`

The factor of 4 applies with activation checkpointing; otherwise it is 3. But for the 200B model, activation checkpointing will always be on.

The spec peak of a V100 32GB GPU is about 125 TFlops, but we cannot reach the peak. The max achievable performance is 30-60 TFlops, depending on the model size. So if you see low 20s, the model is not tuned well; if you see over 100, there is a bug in the calculation.

For V100 16GB GPUs the spec peak is 120 TFlops.
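As a sanity check, here is the same formula as a small Python helper (a sketch; the function name is mine):

```python
def tflops_per_gpu(model_size_in_B, seq, global_batch_size, secs_per_iter, n_gpus,
                   checkpoint_activations=True):
    # Factor of 4 with activation checkpointing, 3 without (see above).
    factor = 4 if checkpoint_activations else 3
    return model_size_in_B * factor * 2 * seq * global_batch_size / (
        secs_per_iter * n_gpus * 1e3)

# e.g. a 52B model on 64 GPUs, GBS=1024, 125s/iteration (one of the rows below):
print(round(tflops_per_gpu(52, 1024, 1024, 125, 64), 1))  # 54.5
```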
```
salloc --constraint=v100-32g --nodes=16 --ntasks=16 --cpus-per-task=40 --gres=gpu:4 --hint=nomultithread --time=6:00:00 bash --rcfile $six_ALL_CCFRWORK/start-prod
```
The full slurm scripts and log files are at gpt2-meg:

- scripts starting with `meg_gpt2_base_` are for getting the baseline with tiny BS
- scripts starting with `meg_gpt2_perf_` are for a smaller model, tuned for high performance

Not yet optimized with the NVIDIA team!
Metrics can be calculated in bash after figuring out the throughput (in seconds per iteration):

```
THROUGHPUT=122
NNODES=16
MSIZE=52
MICRO_BATCH_SIZE=4
DP_SIZE=1
PP_CHUNKS=256
echo "($MSIZE*4*2*1024*$MICRO_BATCH_SIZE*$DP_SIZE*$PP_CHUNKS)/($THROUGHPUT*$NNODES*4*1000)" | bc -l
55.86675409836065573770
```
Max model size

These first results are all about how big a model can be fit into the given hardware with the smallest batch size, disregarding throughput.
16GB nodes:
GPUs | Size | DP | PP | PP Chunks | Mic-BS | Glob-BS | Speed | TFlops |
---|---|---|---|---|---|---|---|---|
16 | 7.5B | 1 | 4 | 4 | 1 | 4 | 0.661s | 23.2 |
64 | 30B | 1 | 16 | 4 | 1 | 4 | 1.439s | 10.7 |
128 | 50B | 1 | 32 | 4 | 1 | 4 | 2.124s | 6.0 |
256 | 78B | 1 | 64 | 4 | 1 | 4 | 2.953s | 3.4 |
256 | 22B | 4 | 16 | 4 | 1 | 4 | 1.826s | 1.5 |
32GB nodes:
GPUs | Size | DP | PP | PP Chunks | Mic-BS | Glob-BS | Speed | TFlops |
---|---|---|---|---|---|---|---|---|
16 | 18B | 1 | 4 | 4 | 1 | 4 | 1.381s | 26.7 |
32 | 30B | 1 | 8 | 4 | 1 | 4 | 1.618s | 19.0 |
64 | 65B | 1 | 16 | 4 | 1 | 4 | 2.738s | 12.2 |
128 | 116B | 1 | 32 | 4 | 1 | 4 | 4.234s | 7.0 |
256 | 206B | 1 | 64 | 4 | 1 | 4 | 6.736s | 3.9 |
The TFlops are very low because there are too few PP chunks/micro-batches (4) (the gradient accumulation size / GAS), so the pipeline bubble takes a lot of overhead. Increasing PP chunks should dramatically improve performance, but it also requires lowering the max model size so there is memory left to hold those chunks.
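To put a number on the bubble: under the usual GPipe-style schedule the idle fraction is roughly `(PP - 1) / (GAS + PP - 1)` (a sketch, not from the scripts):

```python
def bubble_fraction(pp_stages, gas):
    # GPipe-style estimate of the fraction of time lost to pipeline fill/drain.
    return (pp_stages - 1) / (gas + pp_stages - 1)

print(f"{bubble_fraction(64, 4):.0%}")    # ~94%: PP=64 with only 4 chunks is mostly idle
print(f"{bubble_fraction(64, 256):.0%}")  # ~20% with 256 chunks
```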
Performance

These experiments try a smaller model size, but aim for a much higher TFlops performance.
GPUs | Size | DP | PP | PP Chunks | Mic-BS | Glob-BS | Speed | TFlops | Notes |
---|---|---|---|---|---|---|---|---|---|
16 | 18B | 1 | 8 | 64 | 4 | 256 | 90.5s | 26.1 | 05-26 |
16 | 18B | 1 | 8 | 128 | 4 | 512 | 177s | 26.7 | 05-26 |
16 | 18B | 1 | 8 | 256 | 4 | 1024 | 356s | 26.5 | 05-26 |
16 | 18B | 1 | 4 | 128 | 4 | 512 | 179s | 26.4 | 05-26 |
16 | 18B | 1 | 4 | 128 | 6 | 768 | 262s | 27.0 | 05-26 |
16 | 18B | 1 | 8 | 128 | 6 | 768 | 259s | 27.3 | 05-26 |
16 | 18B | 1 | 8 | 32 | 8 | 256 | 89s | 26.5 | 05-26 |
32 | 39B | 1 | 8 | 128 | 4 | 512 | 82s | 62.3 | 05-26 |
32 | 39B | 1 | 8 | 128 | 6 | 768 | 123s | 62.3 | 05-26 |
32 | 39B | 1 | 8 | 256 | 6 | 1536 | 241s | 63.6 | 05-26 |
32 | 39B | 1 | 8 | 512 | 6 | 3072 | 478s | 64.2 | 05-26 |
64 | 52B | 1 | 16 | 256 | 4 | 1024 | 129s | 52.8 | 05-25 |
64 | 52B | 1 | 16 | 256 | 4 | 1024 | 217s | 31.4 | 05-26 |
64 | 52B | 1 | 16 | 256 | 4 | 1024 | 125s | 54.5 | 05-27 |
64 | 52B | 1 | 16 | 256 | 4 | 1024 | 225s | 30.3 | 05-28 |
64 | 52B | 1 | 16 | 256 | 6 | 1536 | 328s | 31.2 | 05-26 |
64 | 52B | 1 | 16 | 256 | 8 | 2048 | 435s | 31.3 | 05-26 |
64 | 52B | 1 | 16 | 512 | 6 | 3072 | 650s | 31.5 | 05-26 |
64 | 52B | 1 | 16 | 512 | 8 | 4096 | 870s | 31.3 | 05-26 |
64 | 52B | 1 | 32 | 256 | 4 | 1024 | 220s | 31.0 | 05-26 |
data:
- Size = model size
- `TP=4` in all entries
- Speed is time per iteration, i.e. the time to complete the global batch size
- Global batch size is `micro-batch-size * pp_chunks * dp_size`
- PP chunks is the number of pipeline micro-batches (GAS), so each pipeline handles `micro-batch-size * pp_chunks` samples
- Seq length is 1024
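For example, the 16-GPU row with `Mic-BS=4`, `PP Chunks=128`, `DP=1` has `Glob-BS = 4 * 128 * 1 = 512`.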
notes:
- 32 GPUs had a very snug fit in GPU memory for the 39B model (the others were at ~75%), so it might be a bit too risky, being OOM-borderline
Why:
- More generic pipeline API that is not hard-coded into the model
- Better memory efficiency - needs less GPU memory, so can probably work with fewer pipeline stages
- Works with ZeRO-Offload so can significantly reduce the GPUs required for fine-tuning once the model is pre-trained, making it accessible to a lot more folks, who don't have access to hundreds of GPUs.
How:

This is a new branch synced with Megatron:

DeepSpeed branch: https://github.com/ShadenSmith/DeepSpeed/tree/megatron2.4-3d
Megatron branch: https://github.com/jeffra/DSE/tree/megatron-2.4-ds-pipe

This script can now launch Meg alone or Meg + DeepSpeed 3D (ignore the ZeRO options, they don't work yet): https://github.com/jeffra/DSE/blob/megatron-2.4-ds-pipe/run.sh
```
git clone https://github.com/ShadenSmith/DeepSpeed/ deepspeed-shaden
cd deepspeed-shaden
git checkout megatron2.4-3d

git clone https://github.com/jeffra/DSE megatron-jeffra
cd megatron-jeffra
git checkout megatron-2.4-ds-pipe
```
See scripts and logs under gpt2-meg-ds-3d.
Now we use the same code base for training w/ and w/o DS/3D, so we can use a shared results table. Also added memory usage columns.
GPUs | Size | DS | GPU M | DP | PP | GAS | MBS | GBS | Speed | TFlops | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
64 | 52B | Y | 26GB | 1 | 16 | 256 | 4 | 1024 | 137s | 46.7 | 06-10 |
64 | 52B | Y | 29GB | 1 | 16 | 256 | 4 | 1536 | 206s | 49.6 | 06-10 |
64 | 52B | Y | 32GB | 1 | 16 | 256 | 4 | 2048 | 270s | 50.5 | 06-10 |
64 | 52B | Y | 26GB | 1 | 16 | 1024 | 4 | 4096 | 544s | 50.1 | 06-10 |
64 | 52B | N | 32GB | 1 | 16 | 256 | 4 | 1024 | 126s | 54.1 | 06-10 |
```
perl -le '$ng=64; $ms=52; $gbs=1024; $sp=146; print $ms*4*2*1024*$gbs / ( $sp * $ng * 1e3)'
```
- DS: DeepSpeed/3D enabled
- GPU M: GPU memory, rounded up, per GPU
- MBS: micro batch size
- GBS: global batch size = `GAS * MBS * DP_SIZE`
- GAS: gradient accumulation steps (= the number of pipeline micro-batches, = PP chunks)
Resident CPU memory remained at about 3GB per GPU.
`zero_stage:1` + `reduce_bucket_size`; also added `--partition-activations` (`meg_ds_3d_gpt2_perf_n16_z1_try*.slurm`):
GPUs | Size | DS | bucket | DP | PP | GAS | MBS | GBS | Speed | TFlops | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
64 | 52B | Y | 5e8 | 2 | 8 | 128 | 4 | 1024 | 137s | 48.8 | 07-10 |
64 | 52B | Y | 1e9 | 2 | 8 | 128 | 4 | 1024 | 141s | 48.3 | 07-10 |
64 | 52B | Y | 2e9 | 2 | 8 | 128 | 4 | 1024 | 141s | 48.3 | 07-10 |
Note: since `PP*TP = 8*4 = 32` and there are 64 GPUs, DP=2.
Experiment 1: TP=4, DP=2, PP=8, gas=256, DS ZeRO Stage 1, PA=disabled, reduce_bucket_size=2e8,5e8, mbs=2,3
ID | GPUs | Size | DS | bucket | DP | PP | GAS | MBS | GBS | Speed | TFlops | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1.1 | 64 | 52B | Y | 2e8 | 2 | 8 | 256 | 2 | 1024 | 150s | 45.4 | 07-10 |
1.2 | 64 | 52B | Y | 5e8 | 2 | 8 | 256 | 2 | 1024 | 150s | 45.4 | 07-10 |
1.3 | 64 | 52B | Y | 2e8 | 2 | 8 | 256 | 3 | 1536 | 213s | 48.0 | 07-10 |
1.4 | 64 | 52B | Y | 5e8 | 2 | 8 | 256 | 3 | 1536 | 208s | 49.1 | 07-10 |
Experiment 2: HD=8192, NUM_LAYERS=48 (MSIZE=39)
Megatron+DeepSpeed:
- USE_DEEPSPEED=1, MSIZE=39, TP=4, PP=8, DP=2, ZeRO Stage 1, mbs=4, PA=disabled, reduce_bucket_size=2e8, gas=128
- USE_DEEPSPEED=1, MSIZE=39, TP=4, PP=8, DP=2, ZeRO Stage 1, mbs=4, PA=disabled, reduce_bucket_size=5e8, gas=128
Megatron alone (whichever of the following runs better):
- USE_DEEPSPEED=0, MSIZE=39, TP=4, PP=16, DP=1, mbs=4, gas=256
- USE_DEEPSPEED=0, MSIZE=39, TP=4, PP=8, DP=2, mbs=4, gas=128
ID | GPUs | Size | DS | bucket | DP | PP | GAS | MBS | GBS | Speed | TFlops | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2.1 | 64 | 39B | Y | 2e8 | 2 | 8 | 128 | 4 | 1024 | 104s | 49.1 | 07-10 |
2.2 | 64 | 39B | Y | 5e8 | 2 | 8 | 128 | 4 | 1024 | 105s | 48.7 | 07-10 |
2.3 | 64 | 39B | N | na | 1 | 8 | 256 | 4 | 1024 | 109s | 46.9 | 07-10 |
2.4 | 64 | 39B | N | na | 2 | 8 | 128 | 4 | 1024 | 110s | 46.5 | 07-10 |
note: I also did tests on 1 node, getting almost identical results for Meg w/ and w/o DS/3D. So the network is to blame for all the fluctuations.
```
NNODES=1
PP_SIZE=1
TP_SIZE=4
MICRO_BATCH_SIZE=4
PP_CHUNKS=16 # GAS
MSIZE=4
```
This got an average iteration time over 22 iterations, in msecs (too short for good stats):

```
ds  6875.05
meg 6896.20
```

but it's obvious they are pretty similar.
Save-checkpoint speed measurement
Nodes | MSize | Time (ms) |
---|---|---|
8 | 25B | 17960.68 |
16 | 52B | 19298.14 |
32 | 52B | 19228.38 |
64 | 52B | 19652.80 |
32 | 97B | 19417.09 |
64 | 97B | 11525.99 |
64 | 181B | 19495.31 |
Currently it saves everything, not just model weights.
The biggest test was for the 181B model: 64 nodes, 256 GPUs, and a total of 2.4TB per checkpoint.

The breakdown is:
- 0.34TB in PP layer states, 1.4GB per file per GPU (1.4*256); this one looks like 2 bytes per param
- 2.00TB in optimizer states, 8.0GB per file per GPU (8*256); this one looks like 12 bytes per param

The data sizes correspond to:
- 2 bytes per param for the fp16 weights
- 12 bytes per param = 8 bytes for the optimizer states + 4 bytes for the fp32 model copy

To make lots of these we should copy away only the fp16 weights and overwrite the checkpoint; otherwise a lot more HD space will be needed.
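A quick sanity check of those numbers (a sketch using the reported 181B / 256-GPU run; small differences from the reported sizes come from rounding):

```python
# Check the per-GPU checkpoint file sizes against the reported breakdown.
params = 181e9
n_gpus = 256

fp16_per_gpu = params * 2 / n_gpus / 1e9    # 2 bytes/param fp16 weights
optim_per_gpu = params * 12 / n_gpus / 1e9  # 8B optimizer + 4B fp32 per param

print(f"{fp16_per_gpu:.1f} GB per fp16 layer-states file")  # 1.4 GB
print(f"{optim_per_gpu:.1f} GB per optimizer-states file")  # 8.5 GB
print(f"~{params * 14 / 1e12:.1f} TB per checkpoint")       # ~2.5 TB
```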
Important: also remember that `$six_ALL_CCFRSCRATCH` files that don't get accessed in 30 days get auto-deleted, so the important checkpoints need to be backed up (probably tar'ed and put on `$six_ALL_CCFRSTORE`).
Important: `DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism` is not in sync with M-LM master, so several config args don't match. It's about 8 months old.
See scripts and logs under gpt2-meg-ds-3d-old.
Uses 3D:
- TP: tensor parallelism
- PP: pipeline parallelism
- DP: data parallelism

The same features as Megatron's native implementation, but improved by DeepSpeed.
Performance
GPUs | Size | DP | PP | PP chunks | Mic-BS | Glob-BS | Speed | TFlops | Notes |
---|---|---|---|---|---|---|---|---|---|
64 | 52B | 1 | 16 | 256 | 4 | 1024 | 146s | 46.7 | 05-27 |
- GAS = gradient accumulation size (same as PP chunks, the number of micro-batches per pipeline)
- `global_bs = pp_chunks * micro_bs * dp_size`
- `TP_SIZE=4` (the size of the node)

```
perl -le '$ng=64; $ms=52; $gbs=1024; $sp=146; print $ms*4*2*1024*$gbs / ( $sp * $ng * 1e3)'
```
Important: `DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3` is not in sync with M-LM master, so several config args don't match. It's about 8 months old.
See scripts and logs under gpt2-meg-ds-zero.
This one uses only TP from Megatron (no PP).

Not yet optimized with the DeepSpeed team!
With Offload off
Performance
GPUs | Size | DP | Mic-BS | Glob-BS | Speed | TFlops | Notes |
---|---|---|---|---|---|---|---|
64 | 52B | 16 | 48 | 768 | 122s | 41.9 | 05-25 |
64 | 52B | 16 | 48 | 768 | 127s | 40.3 | 05-27 |
```
perl -le '$ng=64; $ms=52; $gbs=768; $sp=122; print $ms*4*2*1024*$gbs / ( $sp * $ng * 1e3)'
```
- Seq length is 1024
- `TP=4` in all entries
- `DP` is the number of nodes here
- Speed is time per iteration, i.e. the time to complete the global batch size
- Global batch size is `micro-batch-size * dp_size`
- Tried w/ and w/o tiling once but saw no difference; perhaps it would matter more on larger collections
GPUs | Size | TP | DP | Mic-BS | Glob-BS | Speed | TFlops | Notes |
---|---|---|---|---|---|---|---|---|
64 | 52B | 4 | 16 | 48 | 768 | 127s | 40.3 | |
64 | 52B | 2 | 32 | 32 | 1024 | 167s | 40.8 | |
64 | 52B | 1 | 64 | 16 | 1024 | 184s | 37.0 | |
64 | 24B | 1 | 64 | 16 | 1024 | 89.0s | 35.3 | |
64 | 24B | 2 | 32 | 32 | 1024 | 85.7s | 36.7 | |
With full CPU offload
GPUs | Size | TP | DP | Mic-BS | Glob-BS | Speed | TFlops |
---|---|---|---|---|---|---|---|
64 | 52B | 4 | 16 | 64 | 1024 | 171s | 39.9 |
Olatunji requested the following experiments:
- enabled/set: `--split-transformers --checkpoint-num-layers=2`
- removed: `--synchronize-each-layer --contigious-checkpointing`
ID | GPUs | Size | ScatEmb | TP | DP | Mic-BS | Glob-BS | Speed | TFlops |
---|---|---|---|---|---|---|---|---|---|
1 | 64 | 52B | N | 4 | 16 | 48 | 768 | 119s | 43.0 |
2 | 64 | 52B | Y | 4 | 16 | 48 | 768 | 115s | 44.5 |
3 | 64 | 52B | Y | 4 | 16 | 52 | 832 | 124s | 44.7 |
4 | 64 | 52B | N | 2 | 32 | 32 | 1024 | 159s | 42.9 |
5 | 64 | 52B | Y | 2 | 32 | 32 | 1024 | 158s | 43.1 |
6 | 64 | 52B | Y | 2 | 32 | 36 | 1152 | 176s | 43.6 |
7 | 64 | 52B | Y | 4 | 16 | 56 | 896 | 161s | 37.0 |
8 | 64 | 52B | Y | 2 | 32 | 38 | 1216 | 178s | 45.5 |
9 | 64 | 52B | Y | 1 | 64 | 18 | 1152 | 197s | 38.9 |
10 | 64 | 52B | Y | 1 | 64 | 20 | 1280 | 219s | 38.9 |
11 | 64 | 52B | Y | 1 | 64 | 22 | 1408 | OOM | |
The following 2 runs derive from ID 8:
- removed `--checkpoint-in-cpu`
- changed values
ID | GPUs | Size | ScatEmb | TP | DP | Mic-BS | Glob-BS | Speed | TFlops |
---|---|---|---|---|---|---|---|---|---|
12 | 64 | 52B | Y | 4 | 16 | 24 | 384 | 72s | 35.5 |
13 | 64 | 52B | Y | 2 | 32 | 16 | 512 | 79s | 38.3 |
The following 4 runs derive from ID 12:
- removed `--split-transformers`
- changed values
- toggled `--checkpoint-in-cpu` (PA_CPU column)
ID | GPUs | Size | ScatEmb | PA_CPU | TP | DP | Mic-BS | Glob-BS | Speed | TFlops |
---|---|---|---|---|---|---|---|---|---|---|
14 | 64 | 52B | Y | N | 4 | 16 | 24 | 384 | 72s | 35.5 |
15 | 64 | 52B | Y | Y | 4 | 16 | 24 | 384 | 71s | 36.0 |
16 | 64 | 52B | Y | N | 2 | 32 | 16 | 512 | 87s | 39.2 |
17 | 64 | 52B | Y | Y | 2 | 32 | 16 | 512 | 88s | 38.7 |
See scripts and logs under gpt2-hf-ds.
Not yet optimized with the DeepSpeed team!
Max model size
GPUs | Size | Mic-BS | Glob-BS | Speed | TFlops |
---|---|---|---|---|---|
16 | 25B | 4 | 64 | 58s | 14.0 |
32 | 52B | 4 | 128 | 114s | 14.9 |
64 | 97B | 4 | 256 | 222s | 14.3 |
Performance
GPUs | Size | Zero | Opt Offl | Par Offl | Mic-BS | Glob-BS | Speed | TFlops | Notes |
---|---|---|---|---|---|---|---|---|---|
64 | 52B | 3 | N | N | 8 | 512 | 139s | 24.5 | 05-25 |
64 | 52B | 3 | N | N | 4 | 256 | 185s | 9.2 | 05-27 |
64 | 52B | 3 | N | N | 8 | 512 | 118s | 28.9 | 05-27 |
64 | 52B | 3 | N | N | 8 | 512 | 117s | 29.1 | 05-28 |
64 | 52B | 3 | N | N | 6 | 384 | 111s | 23.0 | 05-28 |
64 | 52B | 3 | N | N | 10 | 640 | 150s | 28.4 | 05-28 |
64 | 52B | 3 | Y | N | 12 | 768 | 183s | 27.9 | 05-28 |
64 | 52B | 3 | Y | N | 12 | 768 | 175s | 29.2 | 05-28 |
64 | 52B | 3 | Y | Y | 12 | 768 | 177s | 28.9 | 05-28 |
64 | 52B | 2 | Y | N | | | OOM | | 05-28 |
- DP = GPUs
- global bs = micro bs * DP
- Speed reported by the HF Trainer metrics is `samples_per_second`
- So the speed column in the table is `glob_bs / samples_per_second` (seconds per iteration)
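For example, if HF reports ~3.7 samples/sec for the `GBS=512` runs, the speed column shows `512 / 3.7 ≈ 139s`.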
notes:
- gradient checkpointing activated

```
perl -le '$ng=64; $ms=52; $gbs=512; $sp=139.52; print $ms*4*2*1024*$gbs / ( $sp * $ng * 1e3)'
22
```
ZeRO-2 with a model of this size I can't fit into this setup at all, even with BS=4; it keeps getting killed by cgroups, i.e. it's asking for more than 40GB of general RAM per GPU. Same story w/ or w/o offload.
- Calculate the TFlops:

```
perl -le '$ng=64; $ms=52; $gbs=1024; $sp=127; print $ms*4*2*1024*$gbs / ( $sp * $ng * 1e3)'
```

(`ng` = total gpus, `ms` = model size in B, `gbs` = global batch size, `sp` = throughput in seconds)

The same with bash env vars, with GBS broken down into `mbs*dp*gas` (gas = pp_chunks):

```
echo "($MSIZE*4*2*1024*$MICRO_BATCH_SIZE*$DP_SIZE*$PP_CHUNKS)/($THROUGHPUT*$NNODES*4*1000)" | bc -l
```
- Automatically process slurm/megatron log files, averaging the throughput (prints 'fail' when the training failed without producing a single iteration stat):

```
find . -type f -name "*out" -exec perl -lne 'm|elapsed time per iteration .ms.: ([\d\.]+)| && do {$x+=$1; $c++}; END { print "$ARGV " . ($c ? int($x/$c/1000) : "fail")}' {} \; | sort | grep -v fail
```
- re-generate the tflops column in the tables above:

```
perl -ne 's#^(\| +(\d+) +\| +(\d+)B.*? +(\d+) +\| +([\d\.]+)s) +\| +[\d\.]+ +(.*?)$#"$1 | ".sprintf("%.01f", $3*4*2*1024*$4 / ($5*$2*1e3))." $6"#e && print ' gpt2.md
```
I originally had a mistake in the model size calculation script. It has been fixed in the tables and the scripts, but many logs still have the old formula: I used G (`2**30`) instead of B (`10**9`), so the model size was getting reported smaller than it really is.

Now it's the correct version:
```
NHIDDEN=4096
NLAYERS=36
SEQ_LEN=512
VOCAB_SIZE=50257
python -c "h=$NHIDDEN; l=$NLAYERS; s=$SEQ_LEN; v=$VOCAB_SIZE; print(f'Model size: {(l*(12*h**2 + 13*h) + v*h + s*h + 2*h) / 10**9 :.0f}B')"
```
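For these values it prints `Model size: 7B`.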
- misc file renames:

```
# rename both .sh and .out based on the GAS (PP_CHUNKS) value inside
# 61B-megatron-mbs-2-pp16-dp-1.sh -> 61B-megatron-mbs-2-pp16-dp-1-gas128.sh
perl -lne 'm|PP_CHUNKS=(\d+)| && do {$gas=$1; $q = chr(39); $ARGV=~s|\.sh$||; print qq[rename.pl ${q}s|dp-(\\d)|dp-\$1-gas-$gas|$q $ARGV*] }' *sh > run-renames.sh
sh ./run-renames.sh
```
- A formula to match the script name to the log file, by rewriting the `job-name`:

```
perl -pi -e '$ARGV=~s|\.sh$||; s|#SBATCH --job-name=.*|#SBATCH --job-name=$ARGV|' *slurm *sh
```

Now the log file will match the slurm file.
- change the runtime:

```
perl -pi -e '$ARGV=~s|\.sh$||; s|#SBATCH --time=.*|#SBATCH --time=00:20:00|' *slurm *sh
```
- calculate speed + tflops from the filename, averaging `elapsed time per iteration` from the log, including failed runs (needs the `-gas-` file rename from above):

```
find . -type f -name "*out" -exec perl -lne 'm|elapsed time per iteration .ms.: ([\d\.]+)| && do {$x+=$1; $c++}; END { $sp=$c ? int($x/$c/1000) : 0; $d=qr/(\d+)/; $ARGV=~m|${d}B-.*?-mbs-$d-pp$d-dp-$d-gas-$d| && do {$ng=64; $ms=$1; $gbs=$2*$4*$5; $tf=$sp ? sprintf "%0.1f", $ms*4*2*1024*$gbs / ( $sp * $ng * 1e3) : 0}; $r = $sp ? "$ARGV $sp $tf" : "$ARGV fail"; print $r}' {} \; | sort -nk3 -r
```
```
./61B-megatron-mbs-2-pp16-dp-1-gas-512-200977.out 144 55.5
./55B-megatron-mbs-2-pp16-dp-1-gas-512-200968.out 134 53.8
./55B-ds-zero0-mbs-2-pp16-dp-1-gas-512-200964.out 141 51.1
./55B-ds-zero0-mbs-4-pp16-dp-1-gas-256-200965.out 145 49.7
./55B-megatron-mbs-4-pp16-dp-1-gas-256-200970.out 149 48.4
./61B-ds-zero0-mbs-4-pp16-dp-1-gas-256-200973.out 166 48.2
./61B-ds-zero0-mbs-2-pp16-dp-1-gas-512-200972.out 169 47.3
./61B-megatron-mbs-4-pp16-dp-1-gas-256-200979.out 172 46.5
./61B-megatron-mbs-4-pp8-dp-2-gas-128-200980.out fail
./61B-megatron-mbs-2-pp8-dp-2-gas-256-200978.out fail
./61B-ds-zero1-mbs-4-pp8-dp-2-gas-128-200976.out fail
./61B-ds-zero1-mbs-2-pp8-dp-2-gas-256-200974.out fail
./55B-megatron-mbs-4-pp8-dp-2-gas-128-200971.out fail
./55B-megatron-mbs-2-pp8-dp-2-gas-256-200969.out fail
./55B-ds-zero1-mbs-4-pp8-dp-2-gas-128-200967.out fail
./55B-ds-zero1-mbs-2-pp8-dp-2-gas-256-200966.out fail
```
- same as above but with finer control over which files are processed, preserving their run order, e.g. sorted by latest run:

```
ls -1t 61*out | xargs -n1 perl -lne 'm|elapsed time per iteration .ms.: ([\d\.]+)| && do {$x+=$1; $c++}; END { $sp=$c ? int($x/$c/1000) : 0; $d=qr/(\d+)/; $ARGV=~m|${d}B-.*?-mbs-$d-pp$d-dp-$d-gas-$d| && do {$ng=64; $ms=$1; $gbs=$2*$4*$5; $tf=$sp ? sprintf "%0.1f", $ms*4*2*1024*$gbs / ( $sp * $ng * 1e3) : 0}; $r = $sp ? "$ARGV $sp $tf" : "$ARGV fail"; print $r}'
```

```
61B-ds-zero1-mbs-2-pp16-dp-1-gas-512.18488.out 196 40.8
61B-megatron-mbs-2-pp16-dp-1-gas-512.8189.out 176 45.4
61B-ds-zero1-mbs-2-pp16-dp-1-gas-512.17709.out 194 41.2
```