Compile and Ctranslate2 support #161

Merged: 19 commits into eole-nlp:main on Dec 18, 2024
Conversation

vince62s (Contributor) commented on Dec 12, 2024

Tested with torch 2.5.1.

PyTorch has made a lot of progress with torch.compile, especially with dynamic shapes (our case: seqlen varies across batches).

torch.nn.RMSNorm is not fully compatible with torch.compile.
Hence I kept our Python implementation but added a torch.compile decorator.
This makes RMSNorm as fast as awq_ext (so I removed the awq_ext path).
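Roughly, the decorated RMSNorm looks like the sketch below (class layout and eps value are illustrative, not the exact Eole code):

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Plain-Python RMSNorm kept compilable, instead of torch.nn.RMSNorm or awq_ext."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))

    @torch.compile(dynamic=True)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute the RMS statistic in float32 for stability, then rescale.
        variance = x.float().pow(2).mean(-1, keepdim=True)
        x_normed = x.float() * torch.rsqrt(variance + self.eps)
        return (self.weight * x_normed).type_as(x)
```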

Further improvements:

Training:
Added model = torch.compile(model, dynamic=True) in train_single (see the sketch below).
The gain is roughly +10% in tok/sec.
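
A minimal sketch of that change (the model builder and the opt-in flag are illustrative names, not the actual Eole config keys):

```python
import torch

model = build_model(config)  # placeholder for the actual model construction in train_single
if getattr(config, "compile_model", False):  # hypothetical opt-in flag
    # dynamic=True avoids recompiling when the sequence length changes across batches
    model = torch.compile(model, dynamic=True)
```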

Inference:
We never call model.forward() directly, so torch.compile(model) has no effect.
I tried some regional compilation (MLP, parts of MHA, etc.) without any improvement; see the sketch below.
The reason is that flash_attn_with_kvcache is already the best optimization we can get at inference.
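
For reference, the regional compilation I tried amounts to compiling selected submodules, along these lines (module attribute names are illustrative, not the exact Eole layer layout):

```python
import torch

# Compile only the feed-forward blocks instead of the whole model.
for layer in model.decoder.transformer_layers:  # illustrative attribute path
    layer.mlp = torch.compile(layer.mlp, dynamic=True)
```

In practice this did not help, because decoding time is dominated by the flash_attn_with_kvcache call, which torch.compile cannot improve.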

EDIT:
flash-attn is supposed to support torch.compile starting with versions >= 2.6.x.
Issue: flash_attn_with_kvcache (the only function we use) still does not support torch.compile as of 2.7.2.post1, see Dao-AILab/flash-attention#1386.

Also, due to some changes between 2.5.9.post1 and 2.6.x (probably because of the torch.compile support), 2.6 is slower.
We recommend flash-attn 2.5.9.post1 for now.
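
If you want a guard against accidentally running on a slower flash-attn, something like this works (this check is not part of the PR, just an illustrative snippet):

```python
import importlib.metadata

from packaging import version

# Warn when the installed flash-attn is one of the versions we found slower with Eole.
flash_version = version.parse(importlib.metadata.version("flash-attn"))
if flash_version >= version.parse("2.6.0"):
    print("flash-attn >= 2.6 detected; 2.5.9.post1 is currently faster for inference.")
```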

FLASH 2.6.1
[2024-12-16 10:13:34,736 INFO] Loading checkpoint from /mnt/InternalCrucial4/LLM_work/EuroLLM-9B-Instruct/estim/step_4000
[2024-12-16 10:13:34,822 INFO] Building model...
[2024-12-16 10:13:35,130 INFO] Loading data into the model
[2024-12-16 10:13:39,335 INFO] Transforms applied: ['onmt_tokenize']
[2024-12-16 10:14:34,741 INFO] PRED SCORE: -0.2343, PRED PPL: 1.26 NB SENTENCES: 100
[2024-12-16 10:14:34,741 INFO] ESTIM SCORE: 0.8440, ESTIM PPL: 0.43 NB SENTENCES: 100
[2024-12-16 10:14:34,741 INFO] Total prediction time (s): 55.4
[2024-12-16 10:14:34,741 INFO] Average prediction time (ms): 554.1
[2024-12-16 10:14:34,741 INFO] Tokens per second: 165.3
[2024-12-16 10:14:34,741 INFO] pred_words_total: 9158.0
Time w/o python interpreter load/terminate:  60.03075098991394

FLASH 2.5.9.post1
[2024-12-16 10:15:46,236 INFO] Loading checkpoint from /mnt/InternalCrucial4/LLM_work/EuroLLM-9B-Instruct/estim/step_4000
[2024-12-16 10:15:46,321 INFO] Building model...
[2024-12-16 10:15:46,626 INFO] Loading data into the model
[2024-12-16 10:15:50,151 INFO] Transforms applied: ['onmt_tokenize']
[2024-12-16 10:16:43,185 INFO] PRED SCORE: -0.2343, PRED PPL: 1.26 NB SENTENCES: 100
[2024-12-16 10:16:43,185 INFO] ESTIM SCORE: 0.8440, ESTIM PPL: 0.43 NB SENTENCES: 100
[2024-12-16 10:16:43,185 INFO] Total prediction time (s): 53.0
[2024-12-16 10:16:43,185 INFO] Average prediction time (ms): 530.3
[2024-12-16 10:16:43,185 INFO] Tokens per second: 172.7
[2024-12-16 10:16:43,185 INFO] pred_words_total: 9158.0
Time w/o python interpreter load/terminate:  56.975401639938354

CT2 support within Eole:

I opened a PR to convert Eole models to CT2 here: OpenNMT/CTranslate2#1832
For NMT (encoder/decoder) models there seem to be nice speed-ups, but they can vary a lot depending on the inference config (flash, batch_size, beam_size, ...).
For LLMs (decoder-only models) CT2 remains slower because it does not support left padding yet.
Even with batch size = 1, my tests show that Eole is faster at the moment (with EuroLLM-9B-Instruct).
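
Once a model is converted with that PR's converter, it runs through the standard CTranslate2 Python API, roughly like this (the model path and tokens are illustrative; actual speed depends heavily on beam_size, batch_size and flash attention):

```python
import ctranslate2

# Load an Eole NMT model converted to the CTranslate2 format.
translator = ctranslate2.Translator("eole_model_ct2", device="cuda")  # illustrative path

# translate_batch expects pre-tokenized sentences (lists of subword tokens).
results = translator.translate_batch(
    [["▁Hello", "▁world", "."]],
    beam_size=5,
    max_batch_size=32,
)
print(results[0].hypotheses[0])
```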

vince62s changed the title from "Compile" to "Compile and Ctranslate2 support" on Dec 18, 2024
vince62s merged commit 31e02f3 into eole-nlp:main on Dec 18, 2024 (2 checks passed)
vince62s deleted the compile branch on Jan 13, 2025