note: this tune-up table is somewhat invalid since a mistake was made during the tune-up: FFN_HIDDEN_SIZE was incorrectly set to a much lower value, so the tests below were really testing a 58B model. So the TFLOPs numbers in this section are incorrect (bigger than they are in reality), but I'm not sure how to fix them, since I don't think the formula applies when the model is lopsided. The numbers in the sections afterwards are correct.

The misconfiguration error was fixed in later experiments.
```
NLAYERS=32
NHIDDEN=16384
NHEADS=32
SEQ_LEN=2048
VOCAB_SIZE=50257
```
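As a sanity check against exactly this kind of misconfiguration, the parameter count can be re-derived from the config. A back-of-envelope sketch of my own, assuming the standard GPT-2 layout (with biases, plus word and position embeddings) and with an explicit `$f` for FFN_HIDDEN_SIZE so that a lopsided value shows up immediately:

```
perl -le '$l=32; $h=16384; $f=4*$h; $v=50257; $s=2048; print(($l*(4*$h**2 + 2*$h*$f + 9*$h + $f) + ($v+$s+2)*$h)/1e9)'
```

With the intended `$f=4*$h` this prints ~104 (B params); plugging in the mistakenly low FFN_HIDDEN_SIZE would have flagged the ~58B model right away.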
BS=1024, SIZE=104B:
NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFLOPs | Notes |
---|---|---|---|---|---|---|---|
32 | 4 | 32 | 1 | 1 | 256 | 54.5 | 31.5GB |
64 | 4 | 64 | 1 | 1 | 155 | 55.0 | 24GB |
```
perl -le '$ng=32*4; $sp=256; $ms=104; $gbs=1048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
perl -le '$ng=64*4; $sp=155; $ms=104; $gbs=1048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```
(ng = total GPUs, ms = model size in B params, gbs = global batch size, sp = speed, i.e. seconds per iteration)
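The `4*2` factor is the usual 8 FLOPs per parameter per token: 2 for the forward pass, 4 for the backward pass, and 2 more for recomputing the forward during activation checkpointing. For convenience, the same estimate as a one-liner taking the variables as positional arguments (the argument interface is my own addition):

```
# args: ng sp ms gbs seqlen
perl -le '($ng, $sp, $ms, $gbs, $seqlen) = @ARGV; print $ms*4*2*$seqlen*$gbs / ($sp * $ng * 1e3)' 128 256 104 1048 2048
```

which reproduces the 54.5 TFLOPs of the first row above.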
BS=2048:
NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFLOPs | Notes |
---|---|---|---|---|---|---|---|
32 | 4 | 32 | 1 | 1 | 586 | 46.52 | GB |
64 | 4 | 64 | 1 | 1 | 301 | 45.28 | 25GB |
```
perl -le '$ng=32*4; $sp=586; $ms=104; $gbs=2048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
perl -le '$ng=64*4; $sp=301; $ms=104; $gbs=2048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```
e.g. interactive tuning on 32 nodes:

```
salloc --account=six@gpu --constraint=v100-32g --nodes=32 --ntasks=32 --cpus-per-task=40 --gres=gpu:4 --hint=nomultithread --time=3:00:00 bash --rcfile $six_ALL_CCFRWORK/start-prod
```
w/ `--use-bnb-optimizer`:
NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFLOPs | Notes |
---|---|---|---|---|---|---|---|
32 | 4 | 16 | 2 | 1 | 681 | 40.0 | 31GB |
32 | 2 | 32 | 2 | 1 | 633 | 43.0 | 31GB |
32 | 1 | 64 | 2 | 1 | | | 32GB OOMs |
32 | 4 | 32 | 1 | 1 | 688 | 39.6 | 27GB (same conf as normal 104B) |
```
perl -le '$ng=32*4; $sp=633; $ms=104; $gbs=2048; $seqlen=2048; print $ms*4*2*$seqlen*$gbs / ( $sp * $ng * 1e3)'
```
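The memory difference is roughly what 8-bit optimizer states should buy. A back-of-envelope sketch of my own, assuming fp32 Adam keeps two 4-byte states per parameter while the BNB 8-bit optimizer keeps two 1-byte states (ignoring its small per-block quantization constants), with the 104B parameters sharded over 128 GPUs:

```
perl -le '$p=104e9/128; printf "fp32 Adam states: %.1f GiB/GPU, 8-bit: %.1f GiB/GPU\n", $p*8/2**30, $p*2/2**30'
```

i.e. roughly 4.5GiB saved per GPU in theory, in the same ballpark as the 27GB vs 30GB difference observed in these tables.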
To ensure we are comparing apples to apples, I'm re-testing the baseline using the same allocations (though I'm not sure I get the same nodes every time).
The baseline of the 104B experiment w/o `--use-bnb-optimizer` that we have been using for all experiments, using the `main` branch:
NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFLOPs | Notes |
---|---|---|---|---|---|---|---|
32 | 4 | 32 | 1 | 1 | 696 | 39.17 | 30GB (same conf as normal 104B) |
Using the old `big-science` branch:
NNODES | TP | PP | DP | MBS | Speed (sec/it) | TFLOPs | Notes |
---|---|---|---|---|---|---|---|
32 | 4 | 32 | 1 | 1 | 706 | 38.6 | 30GB (same conf as normal 104B) |
```
GPUS_PER_NODE=8
NNODES=16

TP_SIZE=4    # always fixed to the size of a single node
PP_SIZE=32   # NLAYERS must be a multiple of PP_SIZE here
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=2048
```
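A quick consistency check on this config, using the usual Megatron-DeepSpeed relations DP = total GPUs / (TP * PP) and gradient accumulation steps = GBS / (DP * MBS) (the one-liner is my own):

```
perl -le '$ng=8*16; $tp=4; $pp=32; $dp=$ng/($tp*$pp); $gas=2048/($dp*1); print "DP=$dp GAS=$gas"'
```

which gives DP=1 and 2048 gradient accumulation steps per iteration.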
TFLOPs: 72.72-82 (it was still speeding up, so very inconclusive)