Run llama2 model on inferentia 2 #1615
Conversation
cool post! I'm quite excited about Inferentia
inferentia-llama2.md (Outdated)

# Make your llama generation time fly with AWS Inferentia2

In a [previous post on the Hugging Face blog](https://huggingface.co/blog/accelerate-transformers-with-inferentia2), we introduced [AWS Inferentia 2](https://aws.amazon.com/ec2/instance-types/inf2/), the second-generation AWS Inferentia accelerator, and explained how you could use [optimum-neuron](https://huggingface.co/docs/optimum-neuron/index) to quickly deploy Hugging Face models for standard text and vision tasks on AWS Inferentia 2 instances.
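For readers who want a concrete starting point, here is a minimal export sketch using the `NeuronModelForCausalLM` class from `optimum-neuron`; the model ID, shapes, and core count below are illustrative assumptions, not necessarily the exact settings used for the post:

```python
# A minimal sketch of compiling a Llama 2 checkpoint for Inferentia2 with
# optimum-neuron. The model ID, shapes, and core count are illustrative
# assumptions; Neuron requires static shapes, so they are fixed at export.
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,              # compile the model for Neuron
    batch_size=1,             # static batch size
    sequence_length=2048,     # static maximum sequence length
    num_cores=2,              # NeuronCores to shard the model across
    auto_cast_type="f16",     # precision used for weights/activations
)
model.save_pretrained("llama2-7b-neuron")  # save the compiled artifacts for reuse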
Maybe mention the biggest advantage of Inferentia here (is it $/inference?).
The primary advantage should be the price, yes. But since the best numbers here are obtained using the high-end EC2 instances (as recommended by AWS), I am not so sure. This is a sensitive topic, actually...
Very interesting!
Didn't we want to have it as a documentation guide?
I think a blog post is better for communication. The guide for generation actually already exists here: https://huggingface.co/docs/optimum-neuron/guides/models#generative-nlp-models, but I plan to improve it.
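As a quick illustration of the pattern that guide covers, a generation call with an already-exported model might look like the sketch below ("llama2-7b-neuron" is a placeholder path, not a real artifact from this PR):

```python
# A minimal generation sketch following the linked optimum-neuron guide;
# "llama2-7b-neuron" is a placeholder path to a previously exported model.
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

model = NeuronModelForCausalLM.from_pretrained("llama2-7b-neuron")
tokenizer = AutoTokenizer.from_pretrained("llama2-7b-neuron")

inputs = tokenizer("What is AWS Inferentia2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```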
I am off next week, and @philschmid will also be absent most of the week. I think the post could be released after the 6th, unless we need to communicate earlier.
Force-pushed from fc7ff5b to c2b62dd.
LGTM, feel free to merge when you are ready.
Force-pushed from cf67dd7 to d26720c.
Left some comments. Looking forward to this being published.
inferentia-llama2.md (Outdated)

Our recommendation is to use the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) (DLAMI). The DLAMI comes with all required libraries pre-packaged for you, including Optimum Neuron, Neuron Drivers, Transformers, Datasets, and Accelerate.

These components can also be installed manually on a fresh Inferentia2 instance following the `optimum-neuron` [installation instructions](https://huggingface.co/docs/optimum-neuron/installation).
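If installing manually, a quick sanity check that the stack is usable could look like the sketch below; the pip extra name mentioned in the comment is an assumption, and the linked installation instructions remain authoritative:

```python
# Hypothetical post-install sanity check. The install command is an assumption
# (something like: pip install "optimum[neuronx]"); follow the linked
# installation instructions for the authoritative steps. If everything is set
# up, the Neuron model classes should import cleanly:
from optimum.neuron import NeuronModelForCausalLM  # noqa: F401

print("optimum-neuron is importable")
```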
Can we add a section about SageMaker? There will soon be a dedicated example/guide.
Done
Where?
OK, I had not pushed the change yet. Now it is really done.
inferentia-llama2.md (Outdated)

| new tokens | Llama2 7B latency | Llama2 7B throughput | Llama2 13B latency | Llama2 13B throughput | Llama2 7B budget |
|------------|-------------------|----------------------|--------------------|-----------------------|------------------|
| 256        | 2,3               | 2,7                  | 3,5                | 4,1                   | 15,9             |
| 512        | 4,4               | 5,3                  | 6,9                | 7,8                   | 31,7             |
| 768        | 6,2               | 7,7                  | 10,2               | 11,1                  | 47,3             |
What instances were used? Is this in seconds or milliseconds?
inferentia-llama2.md (Outdated)

| new tokens | Llama2 7B latency | Llama2 7B throughput | Llama2 13B latency | Llama2 13B throughput | Llama2 7B budget |
|------------|-------------------|----------------------|--------------------|-----------------------|------------------|
| 256        | 227               | 750                  | 145                | 504                   | 32               |
| 512        | 177               | 579                  | 111                | 394                   | 24               |
| 768        | 164               | 529                  | 101                | 370                   | 22               |
What instances were used?
I added a sentence before the table.
Ah okay, wow, I didn't understand that. I thought "Latency" meant actual latency, not a model configuration description.
Okay, I used codenames so that it is less confusing (I hope).
👍🏻
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Julien Chaumond <[email protected]>
Co-authored-by: Philipp Schmid <[email protected]>
Force-pushed from 9dcc8b3 to 78ced07.
This blog post explains how to deploy and run llama2 models on AWS Inferentia2 instances using `optimum-neuron`. For now, I have included the images directly in the asset directory, as I need another pull request to be merged in the documentation-images repo: https://huggingface.co/datasets/huggingface/documentation-images/discussions/212.