# Train 8 104B wide tune up

note: this tune-up table is somewhat invalid, since during the tune-up a mistake was made in `FFN_HIDDEN_SIZE`, which was incorrectly set to a much lower value, so the tests below were really run on a ~58B model. As a result the TFLOPs numbers in this section are inflated (bigger than they were in reality). I'm not sure how to correct them, since I don't think the formula applies when the model is lopsided like that. The numbers in the sections afterwards are correct.

The misconfiguration was fixed later in the experiments (a rough sketch of how `FFN_HIDDEN_SIZE` enters the parameter count follows the hyperparameter block below).

```
NLAYERS=32
NHIDDEN=16384
NHEADS=32
SEQ_LEN=2048
VOCAB_SIZE=50257
```
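
For reference, here is a minimal sketch (mine, not from the training scripts) of how `FFN_HIDDEN_SIZE` enters the parameter count, using the usual GPT-style breakdown and ignoring bias, layernorm and positional-embedding terms:

```python
NLAYERS, NHIDDEN, VOCAB_SIZE = 32, 16384, 50257

def approx_params(ffn_hidden_size):
    attn = 4 * NHIDDEN * NHIDDEN          # QKV + output projections
    ffn = 2 * NHIDDEN * ffn_hidden_size   # up- and down-projections
    emb = VOCAB_SIZE * NHIDDEN            # token embeddings
    return NLAYERS * (attn + ffn) + emb

print(approx_params(4 * NHIDDEN) / 1e9)   # ~103.9 -> the intended 104B setup
# with the accidentally much smaller FFN_HIDDEN_SIZE the same formula lands
# around the ~58B actually benchmarked below (the exact value is not recorded here)
```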

BS=1024, SIZE=104B

| NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFlops | Notes  |
| -----: | -: | -: | -: | --: | -------------: | -----: | :----- |
|     32 |  4 | 32 |  1 |   1 |            256 |   54.5 | 31.5GB |
|     64 |  4 | 64 |  1 |   1 |            155 |   55.0 | 24GB   |
```
perl -le '$ng=32*4; $sp=256; $ms=104; $gbs=1048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
perl -le '$ng=64*4; $sp=155; $ms=104; $gbs=1048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```

(ng = total GPUs, ms = model size in B, gbs = global batch size, sp = speed, i.e. iteration time in seconds from the Speed column)
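
The same estimate wrapped as a small Python helper for convenience (a sketch; the function and argument names are mine, not from the training scripts):

```python
def tflops_per_gpu(model_size_b, seqlen, gbs, iter_time_s, ngpus):
    # factor 2 for multiply-accumulate, factor 4 roughly for fwd + bwd (2x)
    # + activation recompute, per parameter per token
    return model_size_b * 4 * 2 * seqlen * gbs / (iter_time_s * ngpus * 1e3)

# first row of the BS=1024 table above
print(tflops_per_gpu(104, 2048, 1048, 256, 32 * 4))  # ~54.5
```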

BS=2048

| NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFlops | Notes |
| -----: | -: | -: | -: | --: | -------------: | -----: | :---- |
|     32 |  4 | 32 |  1 |   1 |            586 |  46.52 | GB    |
|     64 |  4 | 64 |  1 |   1 |            301 |  45.28 | 25GB  |
```
perl -le '$ng=32*4; $sp=586; $ms=104; $gbs=2048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
perl -le '$ng=64*4; $sp=301; $ms=104; $gbs=2048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```
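
The same check for the BS=2048 rows, reusing the hypothetical `tflops_per_gpu` helper sketched above:

```python
print(tflops_per_gpu(104, 2048, 2048, 586, 32 * 4))  # ~46.5 (32-node row)
print(tflops_per_gpu(104, 2048, 2048, 301, 64 * 4))  # ~45.3 (64-node row)
```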

e.g. interactive tuning on 32 nodes:

```
salloc --account=six@gpu --constraint=v100-32g --nodes=32 --ntasks=32 --cpus-per-task=40 --gres=gpu:4 --hint=nomultithread --time=3:00:00 bash --rcfile $six_ALL_CCFRWORK/start-prod
```

## BNB

w/ `--use-bnb-optimizer` (a rough sketch of what this flag switches to follows the table below):

| NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFlops | Notes |
| -----: | -: | -: | -: | --: | -------------: | -----: | :---- |
|     32 |  4 | 16 |  2 |   1 |            681 |   40.0 | 31GB  |
|     32 |  2 | 32 |  2 |   1 |            633 |   43.0 | 31GB  |
|     32 |  1 | 64 |  2 |   1 |                |        | 32GB OOMs |
|     32 |  4 | 32 |  1 |   1 |            688 |   39.6 | 27GB (same conf as normal 104B) |
```
perl -le '$ng=32*4; $sp=633; $ms=104; $gbs=2048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```
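
For context, BNB here refers to the bitsandbytes 8-bit optimizer. Roughly what `--use-bnb-optimizer` swaps in, shown as plain PyTorch rather than the actual Megatron-DeepSpeed wiring (a sketch; the layer and hyperparameters are placeholders, not the training values):

```python
import torch
import bitsandbytes as bnb  # pip install bitsandbytes

model = torch.nn.Linear(16384, 16384).cuda()

# Adam8bit keeps the optimizer state quantized to int8, cutting optimizer
# memory roughly 4x vs fp32 Adam states
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4, betas=(0.9, 0.95))

loss = model(torch.randn(4, 16384, device="cuda")).sum()
loss.backward()
optimizer.step()
```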

To ensure we are comparing apples to apples, trying to use the same allocations while re-testing the baseline (but I'm not sure I get the same nodes every time).

The baseline 104B setup w/o `--use-bnb-optimizer` that we have been using for all experiments:

using the `main` branch:

| NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFlops | Notes |
| -----: | -: | -: | -: | --: | -------------: | -----: | :---- |
|     32 |  4 | 32 |  1 |   1 |            696 |  39.17 | 30GB (same conf as normal 104B) |

using the old `big-science` branch:

| NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFlops | Notes |
| -----: | -: | -: | -: | --: | -------------: | -----: | :---- |
|     32 |  4 | 32 |  1 |   1 |            706 |   38.6 | 30GB (same conf as normal 104B) |

## A100s

```
GPUS_PER_NODE=8
NNODES=16

TP_SIZE=4    # always fixed to the size of a single node
PP_SIZE=32   # NLAYERS must be a multiple of PP_SIZE here
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=2048
```

TFLOPs: 72.72-82 (the throughput was still speeding up, so this measurement is very inconclusive)
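
For reference, inverting the TFLOPs formula above gives the per-iteration times these numbers imply (a rough estimate, assuming the same formula still applies on the A100 setup):

```python
# 16 nodes x 8 A100s, GBS=2048, SEQ_LEN=2048, 104B params
for tflops in (72.72, 82.0):
    print(104 * 4 * 2 * 2048 * 2048 / (tflops * 16 * 8 * 1e3))  # ~375s and ~333s
```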