Int4 inference #45
Is there any plan to release an int4 version? Specifically, I'm interested in the video understanding part.

Comments
Yes! You can check out https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py, an HQQ 4-bit version. This implementation is about 4-6x faster and uses 3x less VRAM!
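For anyone who just wants the general loading pattern rather than the full multimodal script: below is a minimal sketch of loading a model in 4-bit with HQQ through transformers' built-in `HqqConfig`. This is not the linked example; the model id, group size, and dtype are assumptions to adjust, and it assumes a recent transformers release with HQQ support plus the `hqq` package installed.

```python
# Minimal sketch: HQQ 4-bit quantized loading via transformers.
# Assumptions: model id, group_size, and dtype are placeholders, not values
# taken from aria_multimodal.py.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, HqqConfig

model_id = "rhymes-ai/Aria"  # assumed HF repo id; substitute the one you use

quant_config = HqqConfig(nbits=4, group_size=64)  # 4-bit weights, grouped quantization

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    trust_remote_code=True,          # Aria ships custom modeling code
    quantization_config=quant_config,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```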
I was looking into this as I need a local video vision solution. Is it correct to assume the model is 12 shards at ~4 GB a shard? Also, you mentioned 3x less VRAM: what are the upper limits of video vision you can achieve on 24 GB or less?

EDIT: I tested locally with a 4090. Single-image inference was fine, but any video OOMed, no matter the length or resolution.
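One workaround I'd try before giving up on 24 GB is capping the number of sampled frames and downscaling them before they reach the processor. The sketch below is generic frame sampling with OpenCV, not anything from this repo; the frame budget and target size are assumptions to tune.

```python
# Sketch: sample a fixed number of frames from a video and downscale them,
# so the vision encoder sees a bounded amount of pixels regardless of clip length.
# num_frames and max_side are guesses to tune, not documented limits.
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8, max_side=512):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        img = Image.fromarray(frame)
        img.thumbnail((max_side, max_side))  # in-place downscale, keeps aspect ratio
        frames.append(img)
    cap.release()
    return frames

# frames = sample_frames("clip.mp4")  # then pass `frames` to the model's processor
```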
Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run.
Stack trace: File ".../transformers/models/llama/modeling_llama.py", line 1190, in forward
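This error typically comes from torch.compile's CUDA graph reuse (e.g. `mode="reduce-overhead"`): the returned tensor lives in a graph-owned buffer that the next run overwrites. A few hedged workarounds, assuming `model` and `inputs` come from a setup like the sketch above and that torch.compile is used somewhere in the pipeline:

```python
import torch

# Option 1: tell CUDA graphs a new iteration is starting before each call.
torch.compiler.cudagraph_mark_step_begin()
out = model(**inputs)

# Option 2: clone graph-owned outputs before the next compiled call reuses the buffer.
logits = out.logits.clone()

# Option 3: compile without CUDA graphs entirely (trades some speed for safety).
model = torch.compile(model, mode="default")  # instead of "reduce-overhead"
```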