How to run the evaluator for MedQA-USMLE #12
Do you want to fine-tune on MedQA, or just run evaluation of a model?
I work with @manusikka on a class research project. We are looking to evaluate the model, establish a baseline, and confirm the reported result of a "new state of the art for the MedQA task of 50.3%". Any guidance would be appreciated. Thanks.
num_devices = number of GPUs. These settings are related and depend on the number of GPUs and how much memory the GPUs have:

- train_per_device_batch_size = examples per device
- batch_size = train_per_device_batch_size x num_devices x grad_accum

So for example, if you want batch_size=8 (assuming you have 8 GPUs), you'd set train_per_device_batch_size=1, num_devices=8, grad_accum=1. If you want batch_size=32 you might do: train_per_device_batch_size=1, num_devices=8, grad_accum=4. You could try train_per_device_batch_size=2, but you may run out of GPU memory. lr = learning rate, for example 2e-06.

Let me know if that clarifies things, and if you have any other questions. One note: the 50.3% is an average over seed=1, seed=2, and seed=3, so any given experiment won't yield that exact number, and experiments on your machine will probably yield different results since the randomness will differ. So don't expect to land on 50.3% exactly, or even on average, but it should hopefully be close to that on average.
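As a quick sanity check on that arithmetic, here is a small sketch (not part of the repo's code; the function name is illustrative) that reproduces the two example configurations above:

```python
# Effective batch size as described above:
# batch_size = train_per_device_batch_size x num_devices x grad_accum
def effective_batch_size(train_per_device_batch_size, num_devices, grad_accum):
    return train_per_device_batch_size * num_devices * grad_accum

assert effective_batch_size(1, 8, 1) == 8   # the batch_size=8 example
assert effective_batch_size(1, 8, 4) == 32  # the batch_size=32 example
```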
Thank you @J38 for such a detailed explanation. It appears that many of the parameters you've mentioned are needed for training. I am confused: why are we training the model if we are only trying to run evaluation on an existing model? Or are we first building a model AND then running eval on it, all in one batch command? Also, can you clarify a conceptual question and let me know if I am thinking about this right: I should be able to just download the BioMedLM model and run evaluation on MedQA WITHOUT training, right? Something like:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer  # import added for completeness

tokenizer = GPT2Tokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = GPT2LMHeadModel.from_pretrained("stanford-crfm/BioMedLM")  # model load added; the original snippet used `model` without defining it
# input_ids is built with tokenizer.encode, as in the next comment
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)
```

Then compare the output to the correct label and see if there is an exact match to the answer.
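For reference, one common way to evaluate a causal LM on multiple-choice questions without any fine-tuning is to score each answer option by its log-likelihood under the model, rather than matching generated text against the answer. Below is a minimal sketch of that idea; it is an assumption about a workable zero-shot approach, not the repo's run_multiple_choice.py pipeline, and the exact-prefix tokenization assumption in the comment is a simplification:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = GPT2LMHeadModel.from_pretrained("stanford-crfm/BioMedLM")
model.eval()

def option_logprob(question, option):
    # Log-probability of the option tokens given the question prefix.
    # Assumes the question's tokens form an exact prefix of the combined
    # sequence, which is usually (not always) true for GPT-2-style tokenizers.
    q_ids = tokenizer.encode(question)
    full_ids = tokenizer.encode(question + " " + option)
    with torch.no_grad():
        logits = model(torch.tensor([full_ids])).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for pos in range(len(q_ids), len(full_ids)):
        # The token at `pos` is predicted by the logits at `pos - 1`.
        total += log_probs[0, pos - 1, full_ids[pos]].item()
    return total

def predict(question, options):
    # Return the index of the highest-scoring answer option.
    scores = [option_logprob(question, opt) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])
```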
Also, I am trying to run evaluation on a MedQA question via the model, as in:

```python
# `prompt` stands in for the question text, which was elided in the original comment.
input_ids = tokenizer.encode(prompt, return_tensors="pt")
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)
print("Output:\n" + 100 * "-")
```

Here is the output: (output omitted)
I was able to run the following command in the terminal:

```bash
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 ...
```

I see all THREE flags: --do_train, --do_eval, and --do_predict. Should I be able to use just --do_eval for my evaluation?
I've found this tutorial on multiple-choice inference: https://huggingface.co/docs/transformers/tasks/multiple_choice#inference
Thank you.
The results will be printed out after the training is complete. I think --do_eval will just work for eval. That command is running fine-tuning for multiple choice, and at the end it prints out the results and puts the predictions in the output directory.
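For concreteness, an eval-only invocation would presumably look something like the sketch below; the checkpoint path is hypothetical, and the flags are taken from the commands elsewhere in this thread:

```bash
# Hedged sketch: evaluate a previously fine-tuned checkpoint without training.
# runs/medqa_usmle_hf/GPT2/checkpoint-final is an illustrative path.
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 \
  run_multiple_choice.py \
  --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer \
  --model_name_or_path runs/medqa_usmle_hf/GPT2/checkpoint-final \
  --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json \
  --do_eval --do_predict \
  --per_device_eval_batch_size 1 --max_seq_length 512 \
  --output_dir eval-out/ --overwrite_output_dir
```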
Thank you, @J38. Appreciate your response. I am running the following command on a single GPU (on https://colab.research.google.com/ using a Pro+ GPU):

```bash
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 ...
```

I've been experimenting with the parameters but keep getting the same error. Do you have recommendations for parameters when running train/eval on a single GPU? Thanks
You're going to have to use cpu_offloading if you're trying to train this on a single GPU.
Here is a thread where I got it working on 1 GPU for sequence classification:
I think it may be sufficient to just update the deepspeed config to use cpu_offloading ... there is an example deepspeed config in the thread I shared in the previous comment.
What this will do is offload data to machine RAM, allowing you to work with much larger models at the cost of running much more slowly. But it is the only option for a model this large when you don't have a lot of GPU memory ...
You will need to use DeepSpeed rather than the torch distributed launch ... I will see if I can get an example for the multiple choice code working. It should be similar to what I posted for the sequence classification example.
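To make the cpu_offloading idea concrete, here is a minimal sketch of a DeepSpeed config written from Python (handy in a notebook). It follows the Hugging Face DeepSpeed integration's "auto" convention rather than reproducing the exact file from issue #9; the "auto" values are resolved from the Trainer's command-line flags, which also avoids the learning-rate mismatch discussed a few comments below:

```python
import json

# Minimal ZeRO stage-2 config with optimizer state offloaded to CPU RAM.
# "auto" values are filled in from the HF Trainer's command-line arguments.
# This is a sketch based on the HF DeepSpeed integration docs, not the
# exact deepspeed_config.json shared in issue #9.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto",
                   "weight_decay": "auto"},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

with open("deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```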
@J38 Thank you for the guidance. We've just got DeepSpeed to work! Here is the command line that worked in a Jupyter notebook (but it ran VERY slowly):

```bash
deepspeed --num_gpus 1 --num_nodes 1 run_multiple_choice.py \
  --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path stanford-crfm/BioMedLM \
  --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json \
  --do_train --do_eval --do_predict \
  --per_device_train_batch_size 1 --gradient_accumulation_steps 2 \
  --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 \
  --logging_steps 100 --save_strategy no --evaluation_strategy no \
  --output_dir medqa-finetune-demo --overwrite_output_dir \
  --fp16 --seed 1 --run_name medqa-finetune-demo \
  --deepspeed deepspeed_config.json
```

The deepspeed_config.json was taken from this thread: #9
Just to summarize, there are several ways to run the fine-tuning process, including torch.distributed.launch and DeepSpeed (the two shown in this thread).
If you use DeepSpeed, the DeepSpeed config will determine the optimizer settings. For instance, that config sets the learning rate, so make sure you review the DeepSpeed config and set the training parameters the way you want for the experiment. I think the learning rate in the config can conflict with the --learning_rate flag in your command. It's possible DeepSpeed will just notice this, but I would advise carefully reviewing the config to make sure all of the settings are what you want.
Now, it happens that the DeepSpeed config I showed had a learning rate set in it as well, so double-check that it matches the one you intend to use.
It is expected to be really slow, sorry; training a model this large on 1 GPU is going to take a while compared with using multiple GPUs. I think 8 GPUs take about 1.5 hours to fine-tune on this training set, so it will be substantially slower with 1 GPU and cpu_offloading.
I will take notes from these issues and update the documentation to include some clear fine-tune-on-1-GPU examples ... I think 1 GPU with cpu_offloading is going to be a common use case for a lot of users.
The PubMedQA task should only take about 4 hours, but that is a much smaller training set ...
We were able to run preprocess_medqa.py based on the steps in https://github.com/stanford-crfm/BioMedLM/tree/main/finetune/mc
Next we wanted to run the evaluator, as we had already downloaded the questions and answers.
We went to https://github.com/stanford-crfm/BioMedLM/tree/main/finetune and ran:
```bash
task=medqa_usmle_hf
datadir=data/$task
outdir=runs/$task/GPT2
mkdir -p $outdir
python -m torch.distributed.launch --nproc_per_node={num_devices} --nnodes=1 --node_rank=0 \
  run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path \
  {checkpoint} --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict --per_device_train_batch_size \
  {train_per_device_batch_size} --per_device_eval_batch_size 1 --gradient_accumulation_steps {grad_accum} \
  --learning_rate {lr} --warmup_ratio 0.5 --num_train_epochs {epochs} --max_seq_length 512 \
  --{numerical_format} --seed {seed} --data_seed {seed} --logging_first_step --logging_steps 20 \
  --save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name {run_name} \
  --output_dir trash/ \
  --overwrite_output_dir
```
It asks for various arguments that are missing, e.g. {num_devices}, {checkpoint}, {train_per_device_batch_size}, etc.
Can someone give us the exact command to execute run_multiple_choice.py, with all the arguments filled in?
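Based on the maintainer's replies above, one plausible way to fill in the placeholders for an 8-GPU machine with effective batch size 32 is sketched below. This is an assumption assembled from values mentioned in this thread (per-device batch size, grad accum, learning rate, warmup, epochs), not a confirmed recipe for the 50.3% result; the run name is illustrative:

```bash
# Sketch: placeholders filled with values discussed in this thread.
# batch_size = 1 (per device) x 8 (GPUs) x 4 (grad accum) = 32
task=medqa_usmle_hf
datadir=data/$task
outdir=runs/$task/GPT2
mkdir -p $outdir
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 \
  run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer \
  --model_name_or_path stanford-crfm/BioMedLM \
  --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict \
  --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 \
  --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 512 \
  --fp16 --seed 1 --data_seed 1 --logging_first_step --logging_steps 20 \
  --save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name medqa-8gpu-demo \
  --output_dir trash/ --overwrite_output_dir
```

For a single GPU, use the DeepSpeed command with cpu_offloading shown earlier in the thread instead.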