Finetuning Mistral 7B 2,000 times, and BERT and GPT-2 135,000 times, for science. Appeared in the GenBench Workshop at EMNLP 2024

kddubey/pretrain-on-test

Can I pretrain on unlabeled test data?

Python 3.10+

There are more training runs than there are jelly beans in this vat right here.

Paper.

Slides (please play the slideshow instead of scrolling through slides).

Poster.

Question

In researcher terms

There's a new, hot, few-shot, NLP benchmark on the block. Alice submits her model to the leaderboard and gets SOTA accuracy $x$. Bob submits a model which he pretrained on unlabeled text from the test set, and gets accuracy $x + \epsilon$. Bob gets all the glory. Alice disputes his score. She says he used test set data, a big no-no. Alice argues that had Bob pretrained on text which is statistically independent of the test data, his score would be lower. Bob counters that he didn't use test set labels, so his score is valid. Who is right, Alice or Bob?

In engineer terms

Andy: Hey team, I'm lookin at the notebook for our new model by @Barbie, and I see:

    test_set_accuracy = (
        llm
        .pretrain(df_test["text"])
        .train(df_train["text"], df_train["label"])
        .evaluate(df_test["text"], df_test["label"])
    )

Barbie: it should be fine bc i didnt do:

    llm.train(df_test["text"], df_test["label"])

Andy: Interesting. I'm not sure if it's ok to pretrain on unlabeled test set text like that. Could test_set_accuracy be higher than what we'll see in production?

Barbie: 🤔

Setup

  1. Clone repo

    git clone https://github.com/kddubey/pretrain-on-test.git

  2. cd to the repo

    cd pretrain-on-test

  3. Install dependencies (in a virtual environment)

    python -m pip install .

Experiments

BERT and GPT-2

Section 7 in the paper.

Reproduce the experiment results by running ./experiment.sh on a T4 GPU. Batch sizes were set conservatively to avoid out-of-memory errors across the many pretraining and finetuning runs, but not so low that the GPU sits idle: utilization hovers between 50% and 80%. The experiment takes ~5 days to finish. I ran experiments in parallel through Google Cloud.

The set of accuracy data used in the paper, including observation-level per-class probability scores, can be downloaded at this Google Drive link (3.29 GB unzipped, just a bunch of CSVs).

Overtraining

Section 8 in the paper.

Experiment files are in ./cloud_scripts/gcp/experiments/gpt2-epochs-2/. Run on a T4 GPU. Takes around 6 hours.

The set of accuracy data used in the paper, including observation-level per-class probability scores, can be downloaded at this Google Drive link (0.941 GB unzipped, just a bunch of CSVs).

QLoRA + zero-shot Mistral 7B

Section 9 in the paper.

Experiment files are in ./cloud_scripts/gcp/experiments/zero-shot/. Run on an L4 GPU. Takes around 10 hours. Batch sizes can be reduced to run experiments on a T4 GPU, but it'll take much longer.

The set of accuracy data used in the paper, including observation-level per-class probability scores, can be downloaded at this Google Drive link (53.4 MB unzipped, just a bunch of CSVs).

QLoRA + zero-shot Mistral 7B + packing

Section 9.1 in the paper.

Experiment files are in ./cloud_scripts/gcp/experiments/zero-shot-packing/. Run on an L4 GPU. Takes around 10 hours. Batch sizes can be reduced to run experiments on a T4 GPU, but it'll take much longer.

The set of accuracy data used in the paper, including observation-level per-class probability scores, can be downloaded at this Google Drive link (53.3 MB unzipped, just a bunch of CSVs).

Analysis

After finishing the experiment, follow the instructions here.

To analyze the accuracy data, see analysis/.
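
As a rough illustration of how observation-level per-class probability scores relate to an accuracy number, here is a minimal sketch. It assumes each observation is a list of per-class probabilities plus an integer label, and that the predicted class is the one with the highest probability. The function name `accuracy_from_probs` and the toy data are hypothetical; they are not taken from the CSVs or from analysis/.

```python
def accuracy_from_probs(probs, labels):
    """Fraction of observations whose highest-probability class equals the label.

    probs: list of per-class probability lists, one per observation (assumed layout).
    labels: list of integer class indices, one per observation (assumed layout).
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not labels:
        raise ValueError("need at least one observation")
    correct = 0
    for row, label in zip(probs, labels):
        # argmax over the class probabilities for this observation
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += predicted == label
    return correct / len(labels)


# Toy data (made up for illustration): 4 observations, 3 classes.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]
labels = [0, 1, 2, 1]

accuracy = accuracy_from_probs(probs, labels)
print(accuracy)  # 0.75: the last observation is predicted as class 0, not 1
```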

Usage

Terminal (local)

    python run.py --help

For a quick, CPU-friendly, local run:

    ./experiment_mini.sh

Notebook (local)

The terminal output is quite verbose. For minimal but sufficient info, run this in a notebook.

    from run import run, Experiment

    experiment = Experiment(lm_type="bert", dataset_names=...)
    run(experiment)

For a quick, CPU-friendly, local run:

    from run import run, Experiment

    experiment = Experiment(
        lm_type="bert-tiny",
        dataset_names=["ag_news", "SetFit/amazon_counterfactual_en"],
        num_subsamples=1,
        num_train=10,
        num_test=10,
        num_train_epochs_classification=1,
        num_train_epochs_pretrain=1,
        per_device_train_batch_size_pretrain=4,
        per_device_train_batch_size_classification=4,
        per_device_eval_batch_size_classification=4,
    )

    run(experiment)

Google Cloud Platform

See ./cloud_scripts/gcp/.

Other cloud providers

Other cloud providers are not yet supported, sorry.

To support them, implement logging and file uploading functionality. See cloud.py.

You'll probably find ./cloud_scripts/_setup_python_env.sh useful for cloud runs. Note that it assumes that the bucket name is pretrain-on-test-accuracies, and that the GPU image you're using already has Python 3.10+, pip, and venv/conda on it.

Numbers

81,000 models evaluated

    [
        3 models evaluated (base, extra, test) per LM type per task per repeat x
        2 LM types x
        25 tasks x
        (
            100 repeats for n=50 +
            100 repeats for n=100 +
            50 repeats for n=200 +
            20 repeats for n=500
        )
    ] x 2 (for m = 50, 100) = 81,000 models evaluated

135,000 training runs

    [
        (
            (1 classification training for base) +
            (1 pretraining + 1 classification training for extra) +
            (1 pretraining + 1 classification training for test)
        ) training runs per LM type per task per repeat x
        2 LM types x
        25 tasks x
        (
            100 repeats for n=50 +
            100 repeats for n=100 +
            50 repeats for n=200 +
            20 repeats for n=500
        )
    ] x 2 (for m = 50, 100) = 135,000 training runs
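
The two totals above can be checked with a few lines of Python. This is only an arithmetic sketch of the counts listed in this section; the variable names are made up for illustration and do not come from the repo's code.

```python
# Number of repeats for each value of n, as listed above.
repeats_per_n = {50: 100, 100: 100, 200: 50, 500: 20}

num_lm_types = 2   # 2 LM types
num_tasks = 25     # 25 tasks
num_m_values = 2   # m = 50, 100

# Per LM type, per task, per repeat:
models_per_repeat = 3                 # base, extra, test
training_runs_per_repeat = 1 + 2 + 2  # base: 1; extra: 1 + 1; test: 1 + 1

total_repeats = sum(repeats_per_n.values())
settings = num_lm_types * num_tasks * total_repeats * num_m_values

models_evaluated = models_per_repeat * settings
training_runs = training_runs_per_repeat * settings

print(f"{models_evaluated:,} models evaluated")  # 81,000 models evaluated
print(f"{training_runs:,} training runs")        # 135,000 training runs
```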

Related work

A complement to this paper is the set of results on text contamination in Does Data Contamination Make a Difference? Insights from Intentionally Contaminating Pre-training Data For Language Models. That paper directly addresses a limitation of mine: I don't study the initial pretraining stage of an LM. (Unfortunately, I didn't see it until a few weeks after I presented the poster, so it's not cited in my paper or the poster.)
