Compile and Ctranslate2 support #161

vince62s · 2024-12-12T16:08:13Z

Tested with torch 2.5.1

PyTorch has made a lot of progress with torch.compile, especially with dynamic shapes (our case: the sequence length varies across batches).

torch.nn.RMSNorm is not fully compatible with torch.compile.
Hence I kept our Python code but added a torch.compile decorator.
This makes RMSNorm as fast as the awq_ext version, so I removed the awq_ext one.
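
To illustrate the idea, here is a minimal sketch of a plain-Python RMSNorm whose normalization math can optionally be wrapped with torch.compile. It is not the code from this PR: the class layout, the `use_compile` flag and the `_norm_eager` helper are assumptions made for the example.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Plain-Python RMSNorm. `use_compile` is a hypothetical flag for this sketch."""

    def __init__(self, hidden_size: int, eps: float = 1e-6, use_compile: bool = False):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(hidden_size))
        if use_compile:
            # dynamic=True asks the compiler not to specialize on the sequence length.
            self._norm = torch.compile(self._norm_eager, dynamic=True)
        else:
            self._norm = self._norm_eager

    def _norm_eager(self, x: torch.Tensor) -> torch.Tensor:
        # Compute in float32 for stability, then cast back to the input dtype.
        x32 = x.float()
        inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x32 * inv_rms).to(x.dtype) * self.weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self._norm(x)


norm = RMSNorm(hidden_size=8)      # eager by default
out = norm(torch.randn(2, 5, 8))   # (batch, seqlen, hidden)
```

Passing `use_compile=True` triggers compilation on the first call, which needs a working compiler toolchain on the machine.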

Further improvement:
Training:
Added a model = torch.compile(model, dynamic=True) in train_single
Gain is more or less +10% in tok/sec
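
As a rough sketch of what this looks like in a training loop (a toy model, not Eole's train_single; the `maybe_compile` helper and its `enabled` flag are hypothetical):

```python
import torch
import torch.nn as nn


def maybe_compile(model: nn.Module, enabled: bool = False) -> nn.Module:
    """Return the model unchanged, or wrapped by torch.compile with dynamic shapes."""
    if not enabled:
        return model
    return torch.compile(model, dynamic=True)


# Toy stand-in for the real model; only the wrapping pattern matters here.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
model = maybe_compile(model, enabled=False)  # set enabled=True to compile

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for seq_len in (7, 12):  # the sequence length changes from batch to batch
    batch = torch.randn(4, seq_len, 16)
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The compiled wrapper shares its parameters with the original module, so the optimizer can be built from either one.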

Inference:
We never call model.forward() directly, so torch.compile(model) has no effect.
I tried regional compilation for the MLP, part of the MHA, etc., with no improvement.
The reason is that using flash_attn_with_kvcache is already the best optimization we can get at inference.

EDIT:
flash-attn is supposed to support torch.compile starting with versions >= 2.6.x.
Issue: flash_attn_with_kvcache (the only function we use) does not support torch.compile as of 2.7.2.post1, see:
Dao-AILab/flash-attention#1386

Also, due to some changes between 2.5.9.post1 and 2.6.x (probably because of torch.compile support), 2.6 is slower.
We recommend flash-attn 2.5.9.post1 for now.

FLASH 2.6.1
[2024-12-16 10:13:34,736 INFO] Loading checkpoint from /mnt/InternalCrucial4/LLM_work/EuroLLM-9B-Instruct/estim/step_4000
[2024-12-16 10:13:34,822 INFO] Building model...
[2024-12-16 10:13:35,130 INFO] Loading data into the model
[2024-12-16 10:13:39,335 INFO] Transforms applied: ['onmt_tokenize']
[2024-12-16 10:14:34,741 INFO] PRED SCORE: -0.2343, PRED PPL: 1.26 NB SENTENCES: 100
[2024-12-16 10:14:34,741 INFO] ESTIM SCORE: 0.8440, ESTIM PPL: 0.43 NB SENTENCES: 100
[2024-12-16 10:14:34,741 INFO] Total prediction time (s): 55.4
[2024-12-16 10:14:34,741 INFO] Average prediction time (ms): 554.1
[2024-12-16 10:14:34,741 INFO] Tokens per second: 165.3
[2024-12-16 10:14:34,741 INFO] pred_words_total: 9158.0
Time w/o python interpreter load/terminate:  60.03075098991394

FLASH 2.5.9.post1
[2024-12-16 10:15:46,236 INFO] Loading checkpoint from /mnt/InternalCrucial4/LLM_work/EuroLLM-9B-Instruct/estim/step_4000
[2024-12-16 10:15:46,321 INFO] Building model...
[2024-12-16 10:15:46,626 INFO] Loading data into the model
[2024-12-16 10:15:50,151 INFO] Transforms applied: ['onmt_tokenize']
[2024-12-16 10:16:43,185 INFO] PRED SCORE: -0.2343, PRED PPL: 1.26 NB SENTENCES: 100
[2024-12-16 10:16:43,185 INFO] ESTIM SCORE: 0.8440, ESTIM PPL: 0.43 NB SENTENCES: 100
[2024-12-16 10:16:43,185 INFO] Total prediction time (s): 53.0
[2024-12-16 10:16:43,185 INFO] Average prediction time (ms): 530.3
[2024-12-16 10:16:43,185 INFO] Tokens per second: 172.7
[2024-12-16 10:16:43,185 INFO] pred_words_total: 9158.0
Time w/o python interpreter load/terminate:  56.975401639938354
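
For reference, the gap between the two runs can be recomputed from the logged totals. This is only arithmetic on the figures printed above; the dictionary layout and function name are made up for the example.

```python
# Figures copied from the two log excerpts above (100 sentences each).
runs = {
    "2.6.1": {"seconds": 55.4, "tokens": 9158},
    "2.5.9.post1": {"seconds": 53.0, "tokens": 9158},
}


def tokens_per_second(run):
    return run["tokens"] / run["seconds"]


slow = tokens_per_second(runs["2.6.1"])
fast = tokens_per_second(runs["2.5.9.post1"])
gain_pct = (fast / slow - 1.0) * 100.0
print(f"2.6.1: {slow:.1f} tok/s, 2.5.9.post1: {fast:.1f} tok/s, gain: {gain_pct:.1f}%")
```

This gives a gain of about 4.5% for 2.5.9.post1. The recomputed 2.5.9.post1 throughput differs from the logged 172.7 in the last digit because the logged prediction time is rounded to one decimal.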

CT2 Support within Eole:

I opened a PR to convert Eole models to CT2 here: OpenNMT/CTranslate2#1832
For NMT (encoder/decoder models) there seem to be nice speed-ups, but they may vary a lot depending on the inference config file (flash, batch_size, beam_size, ...).
For LLMs (decoder-only models) CT2 remains slower because it does not support left padding yet.
Even with batch size = 1, my tests show that Eole is faster at the moment (with EuroLLM-9B-Instruct).
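
As background on the left-padding point, the sketch below shows what left and right padding do to a batch of prompts of different lengths. It only illustrates the concept; it is not how CT2 or Eole implement batching, and `pad_batch` and the token ids are hypothetical.

```python
def pad_batch(sequences, pad_id=0, side="left"):
    """Pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        padded.append(fill + list(seq) if side == "left" else list(seq) + fill)
    return padded


prompts = [[11, 12, 13], [21, 22]]
left = pad_batch(prompts, side="left")
right = pad_batch(prompts, side="right")

# With left padding the last column holds the last real token of every prompt,
# so generated tokens can be appended at the same position for the whole batch.
last_column_left = [row[-1] for row in left]
# With right padding the shorter prompt ends in a pad token instead.
last_column_right = [row[-1] for row in right]
```

Here `last_column_left` is `[13, 22]` while `last_column_right` is `[13, 0]`, which is why decoder-only generation usually prefers left padding for batched prompts.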

…(not applicable)

vince62s added 18 commits December 12, 2024 16:43

add compile + torch 2.5.1 changes

5abed3b

fix config

cbc1c98

torch 2.5.1

219368c

compile default False

41ef121

small chnages to rope + rmsnorm / remove compile global at inference …

889b680

…(not applicable)

Interleave false by default

9654e86

remove complex ops

8aa0f04

fix ct2 temporary inference

43746f9

black

d98c877

fix tests

4f4f4fb

typo

bed1343

fix test again

d6624de

damned

10e3e26

fix name mismatch

b710259

tests

72e2505

missing files

4204234

move output file writing to align with ct2 logic

e357b72

fix nbest > 1

298e289

vince62s changed the title ~~Compile~~ Compile and Ctranslate2 support Dec 18, 2024

readme torch version

3464d09

vince62s merged commit 31e02f3 into eole-nlp:main Dec 18, 2024
2 checks passed

vince62s deleted the compile branch January 13, 2025 08:37
