Int4 inference #45
Is there any plan to release an int4 version? Specifically, I'm interested in the video understanding part.

Comments
Yes! You can check out https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py, an HQQ 4-bit version. This implementation is about 4-6x faster and uses 3x less VRAM!
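For anyone who just wants the general loading pattern rather than the full multimodal script: below is a minimal sketch of loading a model in 4-bit with HQQ through transformers' built-in `HqqConfig`. This is not the linked example; the model id, group size, and dtype are assumptions to adjust, and it assumes a recent transformers release with HQQ support plus the `hqq` package installed.

```python
# Minimal sketch: HQQ 4-bit quantized loading via transformers.
# Assumptions: model id, group_size, and dtype are placeholders, not values
# taken from aria_multimodal.py.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, HqqConfig

model_id = "rhymes-ai/Aria"  # assumed HF repo id; substitute the one you use

quant_config = HqqConfig(nbits=4, group_size=64)  # 4-bit weights, grouped quantization

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True,          # Aria ships custom modeling code
    quantization_config=quant_config,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```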
I was looking into this as I need a local video vision solution. Is it correct to assume the model is 12 shards at ~4 GB a shard? Also, you mentioned 3x less VRAM: what are the upper limits of video vision you can achieve on 24 GB or less?

EDIT: I tested locally with a 4090. Single-image inference was fine, but any video OOMed, no matter the length or resolution.
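One workaround I'd try before giving up on 24 GB is capping the number of sampled frames and downscaling them before they reach the processor. The sketch below is generic frame sampling with OpenCV, not anything from this repo; the frame budget and target size are assumptions to tune.

```python
# Sketch: sample a fixed number of frames from a video and downscale them,
# so the vision encoder sees a bounded amount of pixels regardless of clip length.
# num_frames and max_side are guesses to tune, not documented limits.
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8, max_side=512):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        img = Image.fromarray(frame)
        img.thumbnail((max_side, max_side))  # in-place downscale, keeps aspect ratio
        frames.append(img)
    cap.release()
    return frames

# frames = sample_frames("clip.mp4")  # then pass `frames` to the model's processor
```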
Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run.
Stack trace: File ".../transformers/models/llama/modeling_llama.py", line 1190, in forward
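This error typically comes from torch.compile's CUDA graph reuse (e.g. `mode="reduce-overhead"`): the returned tensor lives in a graph-owned buffer that the next run overwrites. A few hedged workarounds, assuming `model` and `inputs` come from a setup like the sketch above and that torch.compile is used somewhere in the pipeline:

```python
import torch

# Option 1: tell CUDA graphs a new iteration is starting before each call.
torch.compiler.cudagraph_mark_step_begin()
out = model(**inputs)

# Option 2: clone graph-owned outputs before the next compiled call reuses the buffer.
logits = out.logits.clone()

# Option 3: compile without CUDA graphs entirely (trades some speed for safety).
model = torch.compile(model, mode="default")  # instead of "reduce-overhead"
```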