note: this tune-up table is somewhat invalid since a mistake was made during the tune-up: FFN_HIDDEN_SIZE was incorrectly set to a much lower value, so the tests below were really testing a 58B model. So the TFLOPs numbers in this section are incorrect (bigger than they are in reality), but I'm not sure how to fix them, since I don't think the formula applies when the model is lopsided. The numbers in the sections afterwards are correct.

The misconfiguration error was fixed in later experiments.
```
NLAYERS=32
NHIDDEN=16384
NHEADS=32
SEQ_LEN=2048
VOCAB_SIZE=50257
```
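As a sanity check against exactly this kind of misconfiguration, the parameter count can be re-derived from the config. A back-of-envelope sketch of my own, assuming the standard GPT-2 layout (with biases, plus word and position embeddings) and with an explicit `$f` for FFN_HIDDEN_SIZE so that a lopsided value shows up immediately:

```
perl -le '$l=32; $h=16384; $f=4*$h; $v=50257; $s=2048; print(($l*(4*$h**2 + 2*$h*$f + 9*$h + $f) + ($v+$s+2)*$h)/1e9)'
```

With the intended `$f=4*$h` this prints ~104 (B params); plugging in the mistakenly low FFN_HIDDEN_SIZE would have flagged the ~58B model right away.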
BS=1024, SIZE=104B:
NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFLOPs | Notes |
---|---|---|---|---|---|---|---|
32 | 4 | 32 | 1 | 1 | 256 | 54.5 | 31.5GB |
64 | 4 | 64 | 1 | 1 | 155 | 55.0 | 24GB |
```
perl -le '$ng=32*4; $sp=256; $ms=104; $gbs=1048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
perl -le '$ng=64*4; $sp=155; $ms=104; $gbs=1048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```
(ng = total GPUs, ms = model size in B params, gbs = global batch size, sp = speed, i.e. seconds per iteration)
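The `4*2` factor is the usual 8 FLOPs per parameter per token: 2 for the forward pass, 4 for the backward pass, and 2 more for recomputing the forward during activation checkpointing. For convenience, the same estimate as a one-liner taking the variables as positional arguments (the argument interface is my own addition):

```
# args: ng sp ms gbs seqlen
perl -le '($ng, $sp, $ms, $gbs, $seqlen) = @ARGV; print $ms*4*2*$seqlen*$gbs / ($sp * $ng * 1e3)' 128 256 104 1048 2048
```

which reproduces the 54.5 TFLOPs of the first row above.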
BS=2048:
NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFLOPs | Notes |
---|---|---|---|---|---|---|---|
32 | 4 | 32 | 1 | 1 | 586 | 46.52 | GB |
64 | 4 | 64 | 1 | 1 | 301 | 45.28 | 25GB |
```
perl -le '$ng=32*4; $sp=586; $ms=104; $gbs=2048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
perl -le '$ng=64*4; $sp=301; $ms=104; $gbs=2048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```
e.g. interactive tuning on 32 nodes:

```
salloc --account=six@gpu --constraint=v100-32g --nodes=32 --ntasks=32 --cpus-per-task=40 --gres=gpu:4 --hint=nomultithread --time=3:00:00 bash --rcfile $six_ALL_CCFRWORK/start-prod
```
w/ `--use-bnb-optimizer`:
NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFLOPs | Notes |
---|---|---|---|---|---|---|---|
32 | 4 | 16 | 2 | 1 | 681 | 40.0 | 31GB |
32 | 2 | 32 | 2 | 1 | 633 | 43.0 | 31GB |
32 | 1 | 64 | 2 | 1 | | | 32GB OOMs |
32 | 4 | 32 | 1 | 1 | 688 | 39.6 | 27GB (same conf as normal 104B) |
```
perl -le '$ng=32*4; $sp=633; $ms=104; $gbs=2048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```
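The memory difference is roughly what 8-bit optimizer states should buy. A back-of-envelope sketch of my own, assuming fp32 Adam keeps two 4-byte states per parameter while the BNB 8-bit optimizer keeps two 1-byte states (ignoring its small per-block quantization constants), with the 104B parameters sharded over 128 GPUs:

```
perl -le '$p=104e9/128; printf "fp32 Adam states: %.1f GiB/GPU, 8-bit: %.1f GiB/GPU\n", $p*8/2**30, $p*2/2**30'
```

i.e. roughly 4.5GiB saved per GPU in theory, in the same ballpark as the 27GB vs 30GB difference observed in these tables.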
To ensure we are comparing apples to apples, I'm re-testing the baseline using the same allocations (though I'm not sure I get the same nodes every time).
The baseline of the 104B experiment w/o `--use-bnb-optimizer` that we have been using for all experiments, using the `main` branch:
NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFLOPs | Notes |
---|---|---|---|---|---|---|---|
32 | 4 | 32 | 1 | 1 | 696 | 39.17 | 30GB (same conf as normal 104B) |
Using the old `big-science` branch:
NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFLOPs | Notes |
---|---|---|---|---|---|---|---|
32 | 4 | 32 | 1 | 1 | 706 | 38.6 | 30GB (same conf as normal 104B) |
```
GPUS_PER_NODE=8
NNODES=16

TP_SIZE=4    # always fixed to the size of a single node
PP_SIZE=32   # NLAYERS must be a multiple of PP_SIZE here
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=2048
```
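A quick consistency check on this config, using the usual Megatron-DeepSpeed relations DP = total GPUs / (TP * PP) and gradient accumulation steps = GBS / (DP * MBS) (the one-liner is my own):

```
perl -le '$ng=8*16; $tp=4; $pp=32; $dp=$ng/($tp*$pp); $gas=2048/($dp*1); print "DP=$dp GAS=$gas"'
```

which gives DP=1 and 2048 gradient accumulation steps per iteration.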
TFLOPs: 72.72-82 (it was still speeding up, so very inconclusive)