How to run the evaluator for MedQA-USMLE #12

Open
manusikka opened this issue Mar 10, 2023 · 20 comments

Comments

@manusikka

We were able to run preprocess_medqa.py based on the steps in https://github.com/stanford-crfm/BioMedLM/tree/main/finetune/mc

Next, we wanted to run the evaluator, since we had already downloaded the questions and answers.

We went to https://github.com/stanford-crfm/BioMedLM/tree/main/finetune and ran:
task=medqa_usmle_hf
datadir=data/$task
outdir=runs/$task/GPT2
mkdir -p $outdir
python -m torch.distributed.launch --nproc_per_node={num_devices} --nnodes=1 --node_rank=0 \
  run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer \
  --model_name_or_path {checkpoint} \
  --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict \
  --per_device_train_batch_size {train_per_device_batch_size} --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps {grad_accum} \
  --learning_rate {lr} --warmup_ratio 0.5 --num_train_epochs {epochs} --max_seq_length 512 \
  --{numerical_format} --seed {seed} --data_seed {seed} --logging_first_step --logging_steps 20 \
  --save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name {run_name} \
  --output_dir trash/ \
  --overwrite_output_dir

It asks for various arguments that are missing, e.g. {num_devices}, {checkpoint}, {train_per_device_batch_size}, etc.

Can someone give us the exact command to execute "run_multiple_choice.py", with all of its arguments?

@J38
Contributor

J38 commented Mar 11, 2023

Do you want to fine-tune on MedQA or just run evaluation of a model?

@githubusera

I work with @manusikka on a class research project. We are looking to evaluate a model, establish a baseline, and confirm the reported result, a "new state of the art for the MedQA task of 50.3%". Any guidance would be appreciated. Thanks.

@J38
Contributor

J38 commented Mar 12, 2023

num_devices = number of GPUs
checkpoint = file path of the Hugging Face model checkpoint directory

These two settings are related and depend on the number of GPUs and how much memory the GPUs have:

train_per_device_batch_size = examples per device
grad_accum = number of steps to accumulate gradient

batch_size = train_per_device_batch_size x num_devices x grad_accum

So for example if you want batch_size=8, you'd set
train_per_device_batch_size=1, num_devices=8, grad_accum=1

(assuming you have 8 GPUs)

If you want batch_size=32 you might do:

train_per_device_batch_size=1, num_devices=8, grad_accum=4

You could try train_per_device_batch_size=2, but you may run out of GPU memory.
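The batch-size relationship above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `effective_batch_size` is made up for this example and is not part of the BioMedLM code.

```python
def effective_batch_size(train_per_device_batch_size: int,
                         num_devices: int,
                         grad_accum: int) -> int:
    """Number of examples that contribute to one optimizer step."""
    return train_per_device_batch_size * num_devices * grad_accum


# The two configurations from the comment above (8 GPUs assumed).
print(effective_batch_size(1, 8, 1))  # batch_size=8
print(effective_batch_size(1, 8, 4))  # batch_size=32
```

On a machine with fewer GPUs, the same effective batch size can be reached by raising grad_accum instead, at the cost of more forward/backward passes per optimizer step.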

lr = learning rate, for example 2e-06
epochs = number of epochs (passed as --num_train_epochs), for example 10
numerical_format = bf16
seed = random seed; set this differently for each experiment, to something like 1, 2, or 3
data_seed = optional; you can remove the data_seed option
run_name = name for your experiment
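As a rough sketch of how these values slot into the template command, the snippet below fills the placeholders and prints a command line that can be pasted into a shell. The helper `build_command`, the example values, and the run name `medqa-seed1` are hypothetical and chosen only for illustration; the checkpoint is assumed to be the Hugging Face model ID `stanford-crfm/BioMedLM`, though a local checkpoint directory should work the same way. The `--data_seed` option is dropped, as suggested above.

```python
import shlex


def build_command(params: dict) -> list:
    """Fill the template from the finetune README with concrete values."""
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={params['num_devices']}", "--nnodes=1", "--node_rank=0",
        "run_multiple_choice.py",
        "--tokenizer_name", "stanford-crfm/pubmed_gpt_tokenizer",
        "--model_name_or_path", params["checkpoint"],
        "--train_file", "data/medqa_usmle_hf/train.json",
        "--validation_file", "data/medqa_usmle_hf/dev.json",
        "--test_file", "data/medqa_usmle_hf/test.json",
        "--do_train", "--do_eval", "--do_predict",
        "--per_device_train_batch_size", str(params["train_per_device_batch_size"]),
        "--per_device_eval_batch_size", "1",
        "--gradient_accumulation_steps", str(params["grad_accum"]),
        "--learning_rate", str(params["lr"]),
        "--warmup_ratio", "0.5",
        "--num_train_epochs", str(params["epochs"]),
        "--max_seq_length", "512",
        f"--{params['numerical_format']}",
        "--seed", str(params["seed"]),
        "--logging_first_step", "--logging_steps", "20",
        "--save_strategy", "no",
        "--evaluation_strategy", "steps", "--eval_steps", "500",
        "--run_name", params["run_name"],
        "--output_dir", "trash/",
        "--overwrite_output_dir",
    ]


# Example values only (assumptions, not a recommended configuration).
example = {
    "num_devices": 8,
    "checkpoint": "stanford-crfm/BioMedLM",
    "train_per_device_batch_size": 1,
    "grad_accum": 4,
    "lr": "2e-06",
    "epochs": 10,
    "numerical_format": "bf16",
    "seed": 1,
    "run_name": "medqa-seed1",
}
print(shlex.join(build_command(example)))
```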

Let me know if that clarifies things and if you have any other questions.

One note: the 50.3% is an average over seed=1, seed=2, and seed=3, so any given experiment won't yield that exact number. Experiments on your machine will probably yield different results, since the randomness will be different. Don't expect to land on 50.3% exactly, even on average, but hopefully your average should be close to it.
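To compare against a figure that is itself an average over seeds, it helps to report the mean and spread of your own runs. A minimal sketch follows; `summarize_runs` is a hypothetical helper, and the accuracies are obviously fake placeholders, not measured results.

```python
from statistics import mean, stdev


def summarize_runs(accuracy_by_seed: dict) -> tuple:
    """Return (mean, sample standard deviation) of accuracy across seeds."""
    values = list(accuracy_by_seed.values())
    return mean(values), stdev(values)


# Placeholder numbers for illustration only, NOT real MedQA results.
placeholder_runs = {1: 0.40, 2: 0.50, 3: 0.60}
avg, spread = summarize_runs(placeholder_runs)
print(f"mean accuracy = {avg:.3f}, std across seeds = {spread:.3f}")
```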

@githubusera

githubusera commented Mar 12, 2023

Thank you @J38 for such a detailed explanation. It appears that many of the parameters you've mentioned are needed for training. I am confused: why are we training a model if we are only trying to run evaluation on an existing model? Or are we first building a model AND then running eval on it, all in one batch command?

Also, can you clarify a conceptual question and let me know if I am thinking about this correctly:
BioMedLM is a model that has already been trained on data and is saved to Hugging Face: https://huggingface.co/stanford-crfm/BioMedLM.

I should be able to just download the BioMedLM model and run evaluation on MedQA WITHOUT training, right?
For example, I would do something like this:


import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2Tokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = GPT2LMHeadModel.from_pretrained("stanford-crfm/BioMedLM").to(device)
input_ids = tokenizer.encode(
    "A 20-year-old woman presents with menorrhagia for the past several years..... Which of the following is the most likely cause of this patient’s symptoms? A: Factor V Leiden ...", return_tensors="pt"
).to(device)

# Sample a continuation of the prompt.
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)
#TODO: get output from tokenizer.decode(sample_output[0], skip_special_tokens=True)

Then I would compare the output to the correct label and check whether there is an exact match with the answer.
Is this evaluation method appropriate?
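For reference, the exact-match idea described here could be sketched as below. This illustrates only the free-generation approach from this comment and is not how run_multiple_choice.py scores answers; the helper names and the A-E letter pattern are assumptions made for the example.

```python
import re

# Matches a standalone option letter such as "E" or "B:".
# Known weakness: the English article "A" also matches.
CHOICE_PATTERN = re.compile(r"\b([A-E])\b")


def extract_choice(generated: str, prompt: str):
    """Return the first standalone option letter found after the prompt, or None."""
    continuation = generated[len(prompt):] if generated.startswith(prompt) else generated
    match = CHOICE_PATTERN.search(continuation)
    return match.group(1) if match else None


def exact_match_accuracy(predictions, labels) -> float:
    """Fraction of predictions equal to the gold label (None never matches)."""
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)


prompt = "Which of the following is most likely? A: x B: y C: z"
print(extract_choice(prompt + " B", prompt))
print(extract_choice(prompt + " An", prompt))
```

A base language model is not guaranteed to answer with an option letter at all, so a heuristic like this tends to under-count; that is one reason a fine-tuned multiple-choice setup is usually preferred for this benchmark.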

@githubusera

Also, I am trying to run evaluation on a MedQA question with the model, as in:
`
question = ("A 20-year-old woman presents with menorrhagia for the past several years."
"She says that her menses “have always been heavy”, and she has experienced easy bruising for as long as she can remember."
"Family history is significant for her mother, who had similar problems with bruising easily. "
"The patient's vital signs include: heart rate 98/min, respiratory rate 14/min, temperature 36.1°C (96.9°F),"
" and blood pressure 110/87 mm Hg. Physical examination is unremarkable. "
" Laboratory tests show the following: platelet count 200,000/mm3, PT 12 seconds,"
" and PTT 43 seconds. Which of the following is the most likely cause of this patient’s symptoms?"
"A: Factor V Leiden B: Hemophilia A C: Lupus anticoagulant D: Protein C deficiency E Von Willebrand disease"
)

input_ids = tokenizer.encode(
question, return_tensors="pt"
).to(device)

sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)

print("Output:\n" + 100 * "-")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
`

Here is an output:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:28895 for open-end generation.
Input length of input_ids is 209, but max_length is set to 50. This can lead to unexpected behavior. You should consider increasing max_new_tokens.
Output:

A 20-year-old woman presents with menorrhagia for the past several years.She says that her menses “have always been heavy”, and she has experienced easy bruising for as long as she can remember.Family history is significant for her mother, who had similar problems with bruising easily. The patient's vital signs include: heart rate 98/min, respiratory rate 14/min, temperature 36.1°C (96.9°F), and blood pressure 110/87 mm Hg. Physical examination is unremarkable. Laboratory tests show the following: platelet count 200,000/mm3, PT 12 seconds, and PTT 43 seconds. Which of the following is the most likely cause of this patient’s symptoms?A: Factor V Leiden B: Hemophilia A C: Lupus anticoagulant D: Protein C deficiency E Von Willebrand disease An

Notice that the answer seems to be truncated (the very last "An").
Is there a way to use the above code snippet to display the answer to the multiple-choice MedQA question? Thanks!
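The warnings in the log point at the likely cause of the truncation: `max_length=50` counts the prompt tokens too, and the prompt alone is 209 tokens long. A possible adjustment, sketched below, is to pass `max_new_tokens` instead, supply the attention mask, and decode only the newly generated tokens. The function name `generate_continuation` is hypothetical, and `model` and `tokenizer` are assumed to be loaded as in the snippet above. Since the base model has not been fine-tuned for multiple choice, the continuation may still not be a clean answer letter.

```python
def generate_continuation(model, tokenizer, prompt, max_new_tokens=20):
    """Greedy-decode up to max_new_tokens tokens and return only the new text."""
    # Calling the tokenizer returns both input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_length = inputs["input_ids"].shape[1]
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,        # budget for new tokens only
        do_sample=False,                      # deterministic, easier to evaluate
        pad_token_id=tokenizer.eos_token_id,  # avoids the pad-token warning
    )
    new_tokens = output_ids[0][prompt_length:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Usage, assuming `model`, `tokenizer`, and `question` exist as above:
# print(generate_continuation(model, tokenizer, question))
```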

@githubusera

I was able to run the following command in the terminal:

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 \
  run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer \
  --model_name_or_path /root/.cache/huggingface/hub/models--stanford-crfm--BioMedLM --stanford-crfm--BioMedLM \
  --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict \
  --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 \
  --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 10 --max_seq_length 512 \
  --bf16 --seed 1 --data_seed 1 --logging_first_step --logging_steps 20 \
  --save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name alex \
  --output_dir trash/ \
  --overwrite_output_dir

I see all THREE flags: --do_train, --do_eval, and --do_predict. Should I be able to use just --do_eval for my evaluation?
Where should I be able to see the results from eval? Thank you.

@githubusera

I've found this tutorial on multiple-choice inference: https://huggingface.co/docs/transformers/tasks/multiple_choice#inference
Are we supposed to train BioMedLM on the multiple-choice task before running inference, as in this example: https://huggingface.co/docs/transformers/tasks/multiple_choice#train ?

Thank you.

@J38
Contributor

J38 commented Mar 12, 2023

The results will be printed out after the training is complete. I think using just do_eval will work for eval. That command runs fine-tuning for multiple choice and, at the end, prints out the results and puts .json files in the output directory for the fine-tuned model.
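One way to inspect those .json files after a run is sketched below. The exact file names written by the script are not stated in this thread, so the helper (a hypothetical name, `load_json_results`) simply reads every top-level `*.json` file in the output directory rather than assuming particular names.

```python
import json
from pathlib import Path


def load_json_results(output_dir: str) -> dict:
    """Map each top-level *.json file name in output_dir to its parsed content."""
    results = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open() as handle:
            results[path.name] = json.load(handle)
    return results


# Usage, with the --output_dir value from the command in this thread:
# for name, content in load_json_results("trash/").items():
#     print(name, content)
```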

@githubusera

Thank you, @J38. I appreciate your response.

I am running the following command on a single GPU (on https://colab.research.google.com/ using a Pro+ GPU):
task=medqa_usmle_hf
datadir=data/$task
outdir=runs/$task/GPT2
mkdir -p $outdir

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 \
  run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer \
  --model_name_or_path stanford-crfm/BioMedLM \
  --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict \
  --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 \
  --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 10 --max_seq_length 512 \
  --fp16 --seed 1 --data_seed 1 --logging_first_step --logging_steps 20 \
  --save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name alex \
  --output_dir trash/ \
  --overwrite_output_dir

I am getting a GPU error:
[image: screenshot of the error]

I've been experimenting with
export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128'

but I am getting the same error.

Do you have recommendations for parameters when running train/eval on a single GPU?

Thanks

@J38
Contributor

J38 commented Mar 13, 2023

You're going to have to use cpu_offloading if you're trying to train this on a single GPU.

@J38
Contributor

J38 commented Mar 13, 2023

Here is a thread where I got it working on 1 GPU for sequence classification:

#9

@J38
Contributor

J38 commented Mar 13, 2023

I think it may be sufficient to just update the deepspeed config to use cpu_offloading ... there is an example deepspeed config in that thread I shared in the previous comment.

@J38
Contributor

J38 commented Mar 13, 2023

What this will do is offload state from GPU memory to machine RAM, allowing you to work with much larger models at the cost of running much more slowly. But it is the only option for a model this large when you don't have a lot of GPU memory.
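To make the idea concrete, here is a generic sketch of a DeepSpeed config with ZeRO stage 2 and optimizer offloading to CPU, written out from Python. This is NOT the config from the linked thread (#9); it is an assumption-laden illustration. The "auto" values rely on the Hugging Face Trainer DeepSpeed integration filling them in from the command-line arguments, which assumes run_multiple_choice.py is built on the Trainer. The helper name and output file name are hypothetical.

```python
import json

# Generic ZeRO stage 2 config with the optimizer state offloaded to CPU RAM.
OFFLOAD_CONFIG = {
    "fp16": {"enabled": "auto"},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"},
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}


def write_offload_config(path: str) -> None:
    """Write the sketch config to `path` as JSON."""
    with open(path, "w") as handle:
        json.dump(OFFLOAD_CONFIG, handle, indent=2)


# Usage (hypothetical file name), then pass the file via --deepspeed:
# write_offload_config("deepspeed_config_sketch.json")
```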

@J38
Contributor

J38 commented Mar 13, 2023

You will need to use DeepSpeed rather than the torch distributed launch, so I will see if I can get an example working for the multiple-choice code. It should be similar to what I posted for the sequence classification example.

@githubusera

githubusera commented Mar 14, 2023

@J38 Thank you for the guidance. We've just got deepspeed to work!

Here is the setup we ran in the Jupyter notebook:
!pip install fairscale
!pip install accelerate
!pip install deepspeed

Here is the command line that worked (but ran VERY slowly):
task=medqa_usmle_hf ; datadir=data/$task ; export WANDB_PROJECT=biomedical-nlp-eval

deepspeed --num_gpus 1 --num_nodes 1 run_multiple_choice.py \
  --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path stanford-crfm/BioMedLM \
  --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json \
  --do_train --do_eval --do_predict \
  --per_device_train_batch_size 1 --gradient_accumulation_steps 2 \
  --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 \
  --logging_steps 100 --save_strategy no --evaluation_strategy no \
  --output_dir medqa-finetune-demo --overwrite_output_dir \
  --fp16 --seed 1 --run_name medqa-finetune-demo \
  --deepspeed deepspeed_config.json

The deepspeed_config.json was taken from this thread: #9

@J38
Contributor

J38 commented Mar 14, 2023

Just to summarize, there are several ways to run a fine-tuning process, including:

  • plain on 1 GPU
  • torch.distributed on multiple GPUs
  • deepspeed on multiple GPUs

If you use deepspeed, the deepspeed config will determine the optimizer settings. For instance, that config sets the learning rate, so make sure you review the deepspeed config and set the training parameters the way you want for the experiment.

I think the --learning_rate 2e-06 in your command may be overridden by the config. It's possible deepspeed will just notice this, but I would advise carefully reviewing the config to make sure all of the settings are what you want.

@J38
Contributor

J38 commented Mar 14, 2023

As it happens, the deepspeed config I showed has learning rate 2e-06, but I just wanted to let you know that the config will influence the optimizer settings, because deepspeed executes the optimization.
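A quick sanity check along these lines is to compare the learning rate in the DeepSpeed config with the one on the command line before launching. The sketch below assumes the config keeps the value under `optimizer.params.lr` (the usual DeepSpeed layout) and treats "auto", a Hugging Face Trainer convention, as "no conflict". Both helper names are hypothetical.

```python
def configured_learning_rate(ds_config: dict):
    """Return optimizer.params.lr from a DeepSpeed config dict, or None if absent."""
    return ds_config.get("optimizer", {}).get("params", {}).get("lr")


def learning_rate_conflict(ds_config: dict, cli_learning_rate) -> bool:
    """True when the config pins a learning rate that differs from the CLI value."""
    config_lr = configured_learning_rate(ds_config)
    if config_lr is None or config_lr == "auto":
        return False
    return float(config_lr) != float(cli_learning_rate)


# Usage with a config loaded from disk:
# import json
# with open("deepspeed_config.json") as handle:
#     print(learning_rate_conflict(json.load(handle), "2e-06"))
```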

@J38
Contributor

J38 commented Mar 14, 2023

It is expected to be really slow, sorry. Training a model this large on 1 GPU is going to take a while compared with using multiple GPUs. I think 8 GPUs take 1.5h to fine-tune on this set, so it will be substantially slower with 1 GPU and cpu_offloading.
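As a back-of-the-envelope figure, the 8-GPU timing can be converted to GPU-hours. The sketch below assumes ideal linear scaling, which is a simplification, and ignores the extra slowdown from CPU offloading, so treat the result as a rough order of magnitude only. The function name is hypothetical.

```python
def gpu_hours(num_gpus: int, wall_clock_hours: float) -> float:
    """Total GPU-hours consumed by a run on num_gpus devices."""
    return num_gpus * wall_clock_hours


# 8 GPUs for roughly 1.5 hours, as quoted above.
print(gpu_hours(8, 1.5))  # 12.0
```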

@J38
Contributor

J38 commented Mar 14, 2023

I will work to take notes from these issues and update the documentation with some clear examples of fine-tuning on 1 GPU. I think 1 GPU with cpu_offloading is going to be a common use case for a lot of users.

@J38
Contributor

J38 commented Mar 14, 2023

The PubMedQA task should only take about 4 hours, but that is a much smaller training set.


3 participants