
Run llama2 model on inferentia 2 #1615

Merged (17 commits) on Nov 7, 2023
Conversation

@dacorvo (Contributor) commented Oct 27, 2023:

This blog post explains how to deploy and run llama2 models on AWS Inferentia2 instances using optimum-neuron.

For now I have included the images directly in the assets directory, as I need another pull request to be merged first in the documentation-images repo: https://huggingface.co/datasets/huggingface/documentation-images/discussions/212.
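
For context, here is a minimal sketch of the workflow the post describes, assuming the `NeuronModelForCausalLM` API from `optimum-neuron`; the model id, input shapes, and compiler arguments below are illustrative, and the post itself is the authoritative reference:

```python
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# Neuron requires static shapes, so they are fixed at export/compile time.
# These values are illustrative, not a recommendation.
compiler_args = {"num_cores": 2, "auto_cast_type": "f16"}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

# export=True compiles the checkpoint for Inferentia2 on first load.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    export=True,
    **compiler_args,
    **input_shapes,
)

# Generation then goes through the familiar transformers API.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("What is deep learning?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```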

@julien-c (Member) left a comment:

cool post! I'm quite excited about Inferentia


# Make your llama generation time fly with AWS Inferentia2

In a [previous post on the Hugging Face blog](https://huggingface.co/blog/accelerate-transformers-with-inferentia2), we introduced [AWS Inferentia 2](https://aws.amazon.com/ec2/instance-types/inf2/), the second-generation AWS Inferentia accelerator, and explained how you could use [optimum-neuron](https://huggingface.co/docs/optimum-neuron/index) to quickly deploy Hugging Face models for standard text and vision tasks on AWS Inferentia 2 instances.
Member:

maybe mention the biggest advantage here (is it $/inference?) for Inferentia

Contributor Author (@dacorvo):

The primary advantage should be the price, yes. But since the best numbers here are obtained using the high-end EC2 instances (as recommended by AWS), I am not so sure. This is a sensitive topic, actually...

@dacorvo requested a review from @philschmid on October 27, 2023, 15:27
@pcuenca (Member) left a comment:

Very interesting!

@philschmid (Member):

Didn't we want to have it as a documentation guide?

@dacorvo (Contributor Author) commented Oct 27, 2023:

> Didn't we want to have it as a documentation guide?

I think a blog post is better for communication. The guide for generation actually already exists here: https://huggingface.co/docs/optimum-neuron/guides/models#generative-nlp-models, but I plan to improve it.

@dacorvo marked this pull request as ready for review on October 27, 2023, 16:13
@dacorvo (Contributor Author) commented Oct 27, 2023:

I am off next week, and @philschmid will also be absent most of the week. I think the post could be released after the 6th, unless we need to communicate earlier.
Feel free to review in the meantime and I will address comments when I come back.

@dacorvo force-pushed the DavidCorvoysier/llama2-inferentia branch from fc7ff5b to c2b62dd on November 6, 2023, 12:27
@pcuenca (Member) left a comment:

LGTM, feel free to merge when you are ready.

@dacorvo force-pushed the DavidCorvoysier/llama2-inferentia branch from cf67dd7 to d26720c on November 7, 2023, 13:23
@philschmid (Member) left a comment:

Left some comments. Looking forward to this being published.


Our recommendation is to use the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) (DLAMI). The DLAMI comes with all required libraries pre-packaged for you, including Optimum Neuron, Neuron drivers, Transformers, Datasets, and Accelerate.

These components can also be installed manually on a fresh Inferentia2 instance following the `optimum-neuron` [installation instructions](https://huggingface.co/docs/optimum-neuron/installation).
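
As a sketch of the manual route, assuming the documented `neuronx` extra and the AWS Neuron pip repository (the linked installation instructions remain the authoritative reference):

```bash
# Install optimum-neuron and its Neuron dependencies on a fresh instance.
# The extra index URL is the AWS-documented Neuron package repository.
python -m pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com "optimum[neuronx]"
```
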
Member:

Can we add a section about SageMaker? There will soon be a dedicated example/guide.

Contributor Author (@dacorvo):

Done

Member:

Where?

Contributor Author (@dacorvo):

OK, I had not pushed the change yet. Now it is really done.

Comment on lines 169 to 173
| new tokens | Llama2 7B latency | Llama2 7B throughput | Llama2 13B latency | Llama2 13B throughput | Llama2 7B budget |
|------------|-------------------|----------------------|--------------------|-----------------------|------------------|
| 256        | 2.3               | 2.7                  | 3.5                | 4.1                   | 15.9             |
| 512        | 4.4               | 5.3                  | 6.9                | 7.8                   | 31.7             |
| 768        | 6.2               | 7.7                  | 10.2               | 11.1                  | 47.3             |
Member:

What instances were used? Is this in seconds or milliseconds?

Comment on lines 187 to 191
| new tokens | Llama2 7B latency | Llama2 7B throughput | Llama2 13B latency | Llama2 13B throughput | Llama2 7B budget |
|---------------|----------------------|-------------------------|-----------------------|--------------------------|---------------------|
| 256 | 227 | 750 | 145 | 504 | 32 |
| 512 | 177 | 579 | 111 | 394 | 24 |
| 768 | 164 | 529 | 101 | 370 | 22 |
Member:

What instances were used?

Contributor Author (@dacorvo):

I added a sentence before the table.

Member:

Ah okay, wow, I didn't understand that. I thought "Latency" meant latency, not a model description.

@dacorvo (Contributor Author) commented Nov 7, 2023:

Okay, I used codenames so that it is less confusing (I hope).

Member:

👍🏻

@dacorvo requested a review from @philschmid on November 7, 2023, 15:17
@dacorvo force-pushed the DavidCorvoysier/llama2-inferentia branch from 9dcc8b3 to 78ced07 on November 7, 2023, 16:05
@dacorvo merged commit 7bff7d2 into main on November 7, 2023 (1 check passed)
@dacorvo deleted the DavidCorvoysier/llama2-inferentia branch on November 7, 2023, 16:19