There are more training runs than there are jelly beans in this vat right here.
Slides (please play the slideshow instead of scrolling through slides).
In researcher terms
There's a new, hot, few-shot, NLP benchmark on the block. Alice submits her model to the
leaderboard and gets SOTA accuracy
In engineer terms
Andy: Hey team, I'm lookin at the notebook for our new model by @Barbie, and I see:
test_set_accuracy = (
    llm
    .pretrain(df_test["text"])
    .train(df_train["text"], df_train["label"])
    .evaluate(df_test["text"], df_test["label"])
)
Barbie: it should be fine bc i didnt do:
llm.train(df_test["text"], df_test["label"])
Andy: Interesting. I'm not sure if it's ok to pretrain on unlabeled test set text like that. Could test_set_accuracy be higher than what we'll see in production?
Barbie: 🤔
- Clone the repo:
  git clone https://github.com/kddubey/pretrain-on-test.git
- cd to the repo:
  cd pretrain-on-test
- Install dependencies (in a virtual environment):
  python -m pip install .
BERT and GPT-2
Section 7 in the paper.
Reproduce the experiment results by running ./experiment.sh on a T4 GPU. Batch sizes were set conservatively to avoid OOMs across the many pretraining and finetuning runs, but not so low that the GPU sits idle; GPU utilization hovers between 50% and 80%. The experiment takes ~5 days to finish. I ran experiments in parallel through Google Cloud.
The set of accuracy data used in the paper, including observation-level per-class probability scores, can be downloaded at this Google Drive link (3.29 GB unzipped, just a bunch of CSVs).
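Since the download is just a bunch of CSVs, a quick inventory can help before opening analysis/. The following is a minimal sketch using only the standard library. The function name summarize_csvs and the directory name in the usage note are hypothetical; nothing here assumes a particular column layout or folder structure for the archive.

```python
import csv
from pathlib import Path


def summarize_csvs(root):
    """Map each CSV under root to (number of data rows, header columns)."""
    root = Path(root)
    summary = {}
    for path in sorted(root.rglob("*.csv")):
        with path.open(newline="") as f:
            reader = csv.reader(f)
            header = next(reader, [])  # an empty file yields an empty header
            num_rows = sum(1 for _ in reader)  # streams rows, so large files are fine
        summary[path.relative_to(root).as_posix()] = (num_rows, header)
    return summary
```

For example, if the archive were unzipped into a (hypothetical) directory named accuracies/, then summarize_csvs("accuracies") would list every CSV with its row count and column names.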
Overtraining
Section 8 in the paper.
Experiment files are in ./cloud_scripts/gcp/experiments/gpt2-epochs-2/.
Run on a T4 GPU. Takes around 6 hours.
The set of accuracy data used in the paper, including observation-level per-class probability scores, can be downloaded at this Google Drive link (0.941 GB unzipped, just a bunch of CSVs).
QLoRA + zero-shot Mistral 7B
Section 9 in the paper.
Experiment files are in ./cloud_scripts/gcp/experiments/zero-shot/.
Run on an L4 GPU. Takes around 10 hours. Batch sizes can be reduced to run experiments
on a T4 GPU, but it'll take much longer.
The set of accuracy data used in the paper, including observation-level per-class probability scores, can be downloaded at this Google Drive link (53.4 MB unzipped, just a bunch of CSVs).
QLoRA + zero-shot Mistral 7B + packing
Section 9.1 in the paper.
Experiment files are in ./cloud_scripts/gcp/experiments/zero-shot-packing/.
Run on an L4 GPU. Takes around 10 hours. Batch sizes can be reduced to run experiments
on a T4 GPU, but it'll take much longer.
The set of accuracy data used in the paper, including observation-level per-class probability scores, can be downloaded at this Google Drive link (53.3 MB unzipped, just a bunch of CSVs).
After finishing the experiment, follow the instructions here.
To analyze the accuracy data, see analysis/.
Terminal (local)
python run.py --help
For a quick, CPU-friendly, local run:
./experiment_mini.sh
Notebook (local)
The terminal output is quite verbose. For minimal but sufficient info, run this in a notebook.
from run import run, Experiment
experiment = Experiment(lm_type="bert", dataset_names=...)
run(experiment)
For a quick, CPU-friendly, local run:
from run import run, Experiment
experiment = Experiment(
    lm_type="bert-tiny",
    dataset_names=["ag_news", "SetFit/amazon_counterfactual_en"],
    num_subsamples=1,
    num_train=10,
    num_test=10,
    num_train_epochs_classification=1,
    num_train_epochs_pretrain=1,
    per_device_train_batch_size_pretrain=4,
    per_device_train_batch_size_classification=4,
    per_device_eval_batch_size_classification=4,
)
run(experiment)
Google Cloud Platform
Other cloud providers
Other cloud providers are not yet supported, sorry.
To support them, implement logging and file uploading functionality. See cloud.py.
You'll probably find ./cloud_scripts/_setup_python_env.sh useful for cloud runs. Note that it assumes that the bucket name is pretrain-on-test-accuracies, and that the GPU image you're using already has Python 3.10+, pip, and venv/conda on it.
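Before running the setup script on a new image, it may be worth confirming the Python-side assumptions. Here is a minimal sketch, not part of the repo: check_python_env is a hypothetical helper that checks the interpreter version and whether the pip and venv modules are importable. It does not check for conda or for the bucket.

```python
import importlib.util
import sys


def check_python_env(min_version=(3, 10)):
    """Report whether this interpreter meets the Python, pip, and venv assumptions."""
    return {
        "python_version_ok": sys.version_info >= min_version,
        "pip_available": importlib.util.find_spec("pip") is not None,
        "venv_available": importlib.util.find_spec("venv") is not None,
    }


for name, ok in check_python_env().items():
    print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Run it with the same interpreter the image will use for the experiment, e.g. by pasting it into a python session on the VM.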
81,000 models evaluated
[
3 models evaluated (base, extra, test) per LM type per task per repeat x
2 LM types x
25 tasks x
(
100 repeats for n=50 +
100 repeats for n=100 +
50 repeats for n=200 +
20 repeats for n=500
)
] x 2 (for m = 50, 100) = 81,000 models evaluated
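As a sanity check, the count above can be recomputed in a few lines. The variable names are mine and only restate the factors in the breakdown.

```python
# Repeats per training set size n, as listed in the breakdown.
repeats_per_n = {50: 100, 100: 100, 200: 50, 500: 20}

models_per_repeat = 3  # base, extra, test
num_lm_types = 2
num_tasks = 25
num_test_sizes = 2  # m = 50 and m = 100

total_repeats = sum(repeats_per_n.values())
models_evaluated = (
    models_per_repeat * num_lm_types * num_tasks * total_repeats * num_test_sizes
)
print(f"{models_evaluated:,} models evaluated")
```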
135,000 training runs
[
(
(1 classification training for base) +
(1 pretraining + 1 classification training for extra) +
(1 pretraining + 1 classification training for test)
) training runs per LM type per task per repeat x
2 LM types x
25 tasks x
(
100 repeats for n=50 +
100 repeats for n=100 +
50 repeats for n=200 +
20 repeats for n=500
)
] x 2 (for m = 50, 100) = 135,000 training runs
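The same arithmetic for training runs, as a small sketch. The dictionary simply restates the per-repeat breakdown; the names are mine.

```python
# Training runs needed for each evaluated model within one repeat.
training_runs_per_model = {
    "base": 1,   # classification training only
    "extra": 2,  # pretraining + classification training
    "test": 2,   # pretraining + classification training
}
repeats_per_n = {50: 100, 100: 100, 200: 50, 500: 20}

num_lm_types = 2
num_tasks = 25
num_test_sizes = 2  # m = 50 and m = 100

runs_per_repeat = sum(training_runs_per_model.values())
training_runs = (
    runs_per_repeat
    * num_lm_types
    * num_tasks
    * sum(repeats_per_n.values())
    * num_test_sizes
)
print(f"{training_runs:,} training runs")
```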
The results on text contamination in Does Data Contamination Make a Difference? Insights from Intentionally Contaminating Pre-training Data For Language Models complement this paper. That paper directly addresses a limitation of mine: I don't study the initial pretraining stage of an LM. (Unfortunately, I didn't see it until a few weeks after I presented the poster, so it's not cited in my paper or the poster.)