multi gpu inference with run_rm.py #95

Closed
SeungoneKim opened this issue Apr 1, 2024 · 3 comments · Fixed by #144 · May be fixed by #125
Labels
enhancement New feature or request

Comments

@SeungoneKim

SeungoneKim commented Apr 1, 2024

Hello Nathan,

Thank you for this valuable resource! I strongly think that we needed more standardized benchmarks to evaluate reward/evaluator models.

I think submit_eval_jobs.py (using AI2's beaker) supports multi gpu inference but run_rm.py doesn't at the moment.
I was wondering if this is intended (correct me if I'm wrong)!

Best,
Seungone

@natolambert
Collaborator

Hey @SeungoneKim -- we just haven't needed it yet (the biggest classifiers are 34B). Happy to add it.

run_dpo.py works nicely with 2, 4, 6, or 8 GPUs. That's why it's included.
Lmk if you want to open a PR :)

@SeungoneKim
Author

SeungoneKim commented Apr 3, 2024

Thanks for your response @natolambert!

I was trying to test generative reward modeling (with GPT-4, Prometheus, Auto-J) and it seems like run_dpo.py has a slightly different functionality than what I need.

Considering that generative RMs have to generate CoT-style feedback before their scoring decision, I think it would be best to integrate vLLM and add a separate run_generative_rm.py script. Users could add further generative RMs by implementing the code that parses the output (reward).
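
To make the "parse the output (reward)" part concrete, here is a minimal sketch of what a pluggable parser could look like. Everything in it is an assumption for illustration: the registry, the function names, and the two output formats (`[RESULT] 4` and `[[7]]`) are hypothetical and not taken from run_rm.py or from any particular model's documented format.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical registry: maps an output-format name to a function that
# extracts a numeric reward from the generated feedback text.
PARSERS: Dict[str, Callable[[str], Optional[float]]] = {}


def register_parser(name: str):
    """Decorator that adds a parser function to the registry under `name`."""
    def decorator(fn: Callable[[str], Optional[float]]):
        PARSERS[name] = fn
        return fn
    return decorator


@register_parser("result_tag")
def parse_result_tag(text: str) -> Optional[float]:
    # Assumed format: "<free-form feedback> [RESULT] 4"
    match = re.search(r"\[RESULT\]\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


@register_parser("double_bracket")
def parse_double_bracket(text: str) -> Optional[float]:
    # Assumed format: "<free-form feedback> Rating: [[7]]"
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else None


def chosen_is_preferred(
    parser_name: str, chosen_output: str, rejected_output: str
) -> Optional[bool]:
    """Compare parsed rewards for the chosen and rejected completions.

    Returns None when either output cannot be parsed, so the caller can
    decide how to count unparseable generations. A tie returns False.
    """
    parse = PARSERS[parser_name]
    chosen, rejected = parse(chosen_output), parse(rejected_output)
    if chosen is None or rejected is None:
        return None
    return chosen > rejected
```

With something like this, supporting a new generative RM would mostly mean registering one more parser function, and the generation backend stays separate from the scoring logic.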

If this makes sense to you, I'll open a pull request for this and try to keep the code style as close to run_rm.py as possible!

@natolambert
Collaborator

@SeungoneKim generative RMs (via API) are being added in #86, but adding the full generation thing is another can of worms. I agree with your path, I just worry a bit about complexity. It's prolly worth having though.

The API implementation should be closer to what you want to build off of.

Here are preliminary results:

Claude results:
Haiku {'Chat': 0.9273743016759777, 'Chat Hard': 0.5197368421052632, 'Safety': 0.8210275184275184, 'Reasoning': 0.7060194658154636}
Sonnet {'Chat': 0.9343575418994413, 'Chat Hard': 0.5657894736842105, 'Safety': 0.8367826605826606, 'Reasoning': 0.6907005374583948}
Opus {'Chat': 0.946927374301676, 'Chat Hard': 0.6030701754385965, 'Safety': 0.8905447525447526, 'Reasoning': 0.7868223795492989}

(reminder) OpenAI results:
GPT 3.5 {'Chat': 0.9217877094972067, 'Chat Hard': 0.4451754385964912, 'Safety': 0.6229577395577396, 'Reasoning': 0.5912315163420091}
GPT 4.Turbo {'Chat': 0.952513966480447, 'Chat Hard': 0.743421052631579, 'Safety': 0.8719219375219376, 'Reasoning': 0.8692366453865881}
