
Run llama2 model on inferentia 2 #1615

Merged (17 commits) on Nov 7, 2023
Conversation

@dacorvo (Contributor) commented Oct 27, 2023:

This blog post explains how to deploy and run llama2 models on AWS Inferentia2 instances using optimum-neuron.

For now I have included the images directly in the assets directory, as I need another pull request to be merged first in the documentation-images repo: https://huggingface.co/datasets/huggingface/documentation-images/discussions/212.
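
For context, here is a minimal sketch of the workflow the post describes, assuming the `NeuronModelForCausalLM` API from `optimum-neuron`; the model id, input shapes, and compiler arguments below are illustrative, and the post itself is the authoritative reference:

```python
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

# Neuron requires static shapes, so they are fixed at export/compile time.
# These values are illustrative, not a recommendation.
compiler_args = {"num_cores": 2, "auto_cast_type": "f16"}
input_shapes = {"batch_size": 1, "sequence_length": 2048}

# export=True compiles the checkpoint for Inferentia2 on first load.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    export=True,
    **compiler_args,
    **input_shapes,
)

# Generation then goes through the familiar transformers API.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("What is deep learning?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```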

@julien-c (Member) left a comment:

cool post! I'm quite excited about Inferentia


# Make your llama generation time fly with AWS Inferentia2

In a [previous post on the Hugging Face blog](https://huggingface.co/blog/accelerate-transformers-with-inferentia2), we introduced [AWS Inferentia 2](https://aws.amazon.com/ec2/instance-types/inf2/), the second-generation AWS Inferentia accelerator, and explained how you could use [optimum-neuron](https://huggingface.co/docs/optimum-neuron/index) to quickly deploy Hugging Face models for standard text and vision tasks on AWS Inferentia 2 instances.
Member:

maybe mention the biggest advantage here (is it $/inference?) for Inferentia

Contributor Author (@dacorvo):

The primary advantage should be the price, yes. But since the best numbers here are obtained using the high-end EC2 instances (as recommended by AWS), I am not so sure. This is a sensitive topic, actually...

@dacorvo requested a review from @philschmid on October 27, 2023, 15:27
@pcuenca (Member) left a comment:

Very interesting!

@philschmid (Member):

Didn't we want to have it as a documentation guide?

@dacorvo (Contributor Author) commented Oct 27, 2023:

> Didn't we want to have it as a documentation guide?

I think a blog post is better for communication. The guide for generation actually already exists here: https://huggingface.co/docs/optimum-neuron/guides/models#generative-nlp-models, but I plan to improve it.

@dacorvo marked this pull request as ready for review on October 27, 2023, 16:13
@dacorvo (Contributor Author) commented Oct 27, 2023:

I am off next week, and @philschmid will also be absent most of the week. I think the post could be released after the 6th, unless we need to communicate earlier.
Feel free to review in the meantime and I will address comments when I come back.

@dacorvo force-pushed the DavidCorvoysier/llama2-inferentia branch from fc7ff5b to c2b62dd on November 6, 2023, 12:27
@pcuenca (Member) left a comment:

LGTM, feel free to merge when you are ready.

@dacorvo force-pushed the DavidCorvoysier/llama2-inferentia branch from cf67dd7 to d26720c on November 7, 2023, 13:23
@philschmid (Member) left a comment:

Left some comments. Looking forward to this being published.


Our recommendation is to use the [Hugging Face Neuron Deep Learning AMI](https://aws.amazon.com/marketplace/pp/prodview-gr3e6yiscria2) (DLAMI). The DLAMI comes with all required libraries pre-packaged for you, including Optimum Neuron, Neuron drivers, Transformers, Datasets, and Accelerate.

These components can also be installed manually on a fresh Inferentia2 instance following the `optimum-neuron` [installation instructions](https://huggingface.co/docs/optimum-neuron/installation).
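
As a sketch of the manual route, assuming the documented `neuronx` extra and the AWS Neuron pip repository (the linked installation instructions remain the authoritative reference):

```bash
# Install optimum-neuron and its Neuron dependencies on a fresh instance.
# The extra index URL is the AWS-documented Neuron package repository.
python -m pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com "optimum[neuronx]"
```
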
Member:

Can we add a section about SageMaker? There will soon be a dedicated example/guide.

Contributor Author (@dacorvo):

Done

Member:

Where?

Contributor Author (@dacorvo):

OK, I had not pushed the change yet. Now it is really done.

Comment on lines 169 to 173
| new tokens | Llama2 7B latency | Llama2 7B throughput | Llama2 13B latency | Llama2 13B throughput | Llama2 7B budget |
|------------|-------------------|----------------------|--------------------|-----------------------|------------------|
| 256        | 2.3               | 2.7                  | 3.5                | 4.1                   | 15.9             |
| 512        | 4.4               | 5.3                  | 6.9                | 7.8                   | 31.7             |
| 768        | 6.2               | 7.7                  | 10.2               | 11.1                  | 47.3             |
Member:

What instances were used? Is this in seconds or milliseconds?

Comment on lines 187 to 191
| new tokens | Llama2 7B latency | Llama2 7B throughput | Llama2 13B latency | Llama2 13B throughput | Llama2 7B budget |
|---------------|----------------------|-------------------------|-----------------------|--------------------------|---------------------|
| 256 | 227 | 750 | 145 | 504 | 32 |
| 512 | 177 | 579 | 111 | 394 | 24 |
| 768 | 164 | 529 | 101 | 370 | 22 |
Member:

What instances were used?

Contributor Author (@dacorvo):

I added a sentence before the table.

Member:

Ah okay, wow, I didn't understand that. I thought "Latency" meant latency, not a model description.

@dacorvo (Contributor Author) commented Nov 7, 2023:

Okay, I used codenames so that it is less confusing (I hope).

Member:

👍🏻

@dacorvo requested a review from @philschmid on November 7, 2023, 15:17
@dacorvo force-pushed the DavidCorvoysier/llama2-inferentia branch from 9dcc8b3 to 78ced07 on November 7, 2023, 16:05
@dacorvo merged commit 7bff7d2 into main on November 7, 2023 (1 check passed)
@dacorvo deleted the DavidCorvoysier/llama2-inferentia branch on November 7, 2023, 16:19