Memory Issues when Attempting to Load GGUF Tensors in transformers #34417
Comments
Thanks for opening the issue. Would you be up to also try using …
@ydshieh Many thanks for your response. Sure! I have already tested it with llama.cpp and there were no memory problems. It seems the problem is the de-quantization and the memory it uses. The size of the model at inference time is "small", as expected for a Q2 model, and therefore it fits without problems in the memory of my GPU and CPU.
Hey! You are probably missing the …
@ArthurZucker Many thanks for your response. I ran the following code, and at 28% of de-quantizing the GGUF tensors the process had already used 104 GB of RAM.
I also tried torch.float8_e5m2 with the same result:
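(The exact snippet did not survive extraction; the call being described was presumably of this shape, with the dtype made explicit. The GGUF filename below is an assumption, not taken from the repository.)

import torch
from transformers import AutoModelForCausalLM

# Load the GGUF checkpoint; transformers dequantizes the tensors during loading.
model = AutoModelForCausalLM.from_pretrained(
    "Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF",
    gguf_file="meta-llama-3.1-70b-q2_k.gguf",  # assumed filename
    torch_dtype=torch.float16,  # torch.float8_e5m2 reportedly behaved the same
)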
Hi, I'm having the same issue, with my server suffering catastrophic crashes as a result (I'm forced to reboot, probably because many system apps were forced to crash?). Maybe this is because of this line that seems to copy tensors:
cc @SunMarc if you have some bandwidth to check this. Happy to help otherwise.
With transformers, we are dequantizing the gguf model, meaning that the model will take 280 GB if loaded in fp32 or 140 GB in bfloat16 (roughly 70B parameters × 4 bytes or × 2 bytes, respectively).
Could you share your specs and which model you are getting this issue with?
Doesn't that defeat the purpose of the quantized gguf in the first place? I'm surprised that I can load a given model using ollama just fine on my setup, but my server severely struggles when trying to load the same model using transformers. I suspect it's something that could be easily solved with the right arguments, but I haven't been able to find which ones so far.
My specs: 32 GB of CPU RAM, a GTX 1060 (6 GB of VRAM) and a GTX 1080 (8 GB of VRAM). Ollama gets by just fine even with models larger than Nemo 12B, but I'm having a lot of trouble getting a 7B to even load in my code; it usually crashes about halfway through, IIRC. An example model is model_name = "MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF" with fname = "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf". My code can be found in this fork: https://github.com/thiswillbeyourgithub/repeng/blob/1fe6316ba92b0d455ca44ad7d37e705f59d5a555/research/research.py#L56 As you can see, I tried pretty much every combination of arguments to get the model to load. Ideally I'd like to load the gguf in a similar way to ollama, but I'm not sure what that implies in terms of dtype. Also, ideally the model would be loaded onto the GPU, or both GPUs if needed, or a mix of GPU and CPU RAM (but maybe only ollama supports that? I don't know). Thanks a lot!
ollama is powered by llama.cpp, so it is running the quantized model as you would expect. The goal of this integration with transformers was to enable users to fine-tune their gguf model as described here. However, we are open to adding more support to the gguf integration and enabling inference with the quantized model (without dequantization). Could you share a bit about your goal in using transformers with gguf files? Do you want to use transformers as the inference engine instead of ollama?
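(For context, a rough sketch of the fine-tuning workflow described above, using the Mistral GGUF mentioned earlier in the thread; this is an illustration of the dequantized loading path, not code taken from the docs.)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF"
gguf_file = "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
# The GGUF tensors are dequantized into regular torch tensors here, which is
# why RAM usage is much higher than the size of the .gguf file itself.
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
# From here on it behaves like any other transformers model and can be fine-tuned.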
Thanks for the helpful answer. My current goal is to investigate a few questions I had about control vectors / representation engineering. @vgel made the repeng repository, which makes it intuitive and easy to create control vectors. But with my somewhat limited hardware, I have to either:
The preferred solution of course, given how fast my setup can do inference on 7B and even 12B models, would be to find a way to work with quantized gguf without dequantization.
Edit: Well, actually I just found out that not using gguf but using bitsandbytes is good enough for my use case:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
print("Initializing tokenizer...")
# The device/quantization arguments belong on the model, not the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
print("Initializing model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    low_cpu_mem_usage=True,
    quantization_config=bnb_config,
)
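(A hypothetical follow-up, not part of the original comment, showing how the 4-bit model loaded above could then be used for generation; the prompt is made up.)

# Tokenize a prompt, move it to the model's device, and generate a short completion.
inputs = tokenizer("What is a control vector?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))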
Nice, glad that it works with bnb!
I thought about it some more. Isn't it a bug that the dequantization step from gguf to HF weights does not apply the bnb arguments? What I mean is that if I load a non-gguf model with the … arg, the quantization is applied, yet it seems to be ignored when loading a gguf. Similarly, why is the np.copy needed in that line:
And why is it going through numpy by default, instead of being sent to the device specified by the user? If the user wants to load it onto cuda in the end, all that copying does not seem necessary, right? Overall, I don't understand why there's a dequantization step. Why is it needed? Thanks a lot!
Right now, we just don't support loading the gguf in its quantized format. As I said, this is something we are considering. As for the dequantization step, it can be useful for people who want to fine-tune the gguf model.
Thank you very much for your time.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Environment:
OS: Ubuntu 24.04
Python version: 3.11.8
Transformers version: transformers==4.45.2
Torch version: torch==2.3.0
Model: Meta-Llama-3.1-70B-Q2_K-GGUF - https://huggingface.co/Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF
Who can help?
text models: @ArthurZucker
generate: @zucchini-nlp
Information
Reproduction
Description:
I am attempting to load a quantized GGUF model (Meta-Llama-3.1-70B-Q2_K-GGUF) using the AutoTokenizer and AutoModel classes from Hugging Face transformers, but I am running into severe RAM usage during the de-quantization process (more than 90 GB).
Steps to Reproduce:
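(The original reproduction snippet is not included in this extract; based on the description, the loading call was of roughly the following form. The GGUF filename below is an assumption, not taken from the repository.)

from transformers import AutoModel, AutoTokenizer

model_id = "Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF"
gguf_file = "meta-llama-3.1-70b-q2_k.gguf"  # assumed filename

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
# RAM usage spikes here while the GGUF tensors are de-quantized.
model = AutoModel.from_pretrained(model_id, gguf_file=gguf_file)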
Observed Behavior:
Memory Issues: When attempting to load the GGUF model (Meta-Llama-3.1-70B-Q2_K), the system quickly exhausts available memory during de-quantization, using more than 90 GB. I also tried other models such as https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-GGUF
Expected behavior
Efficient Loading: I expect the GGUF model (like Meta-Llama-3.1-70B-Q2_K-GGUF) to load correctly using the transformers library, with a clear and efficient de-quantization and loading process that does not cause memory exhaustion, even on systems with less than 100 GB of RAM.
Support for Lower RAM Systems: Given that GGUF is a quantized format designed for efficiency, it would be ideal if transformers could either support a more memory-optimized loading process or allow partial model loading, enabling users with lower RAM systems (e.g., < 100GB) to load and run these models effectively.
Alternative Loading Method: If loading such models directly with AutoModel and AutoTokenizer is not possible due to GGUF-specific constraints, it would be helpful to have documentation or tools to convert these models into a compatible format (such as PyTorch) that can be handled within transformers.
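(As a rough illustration of such an alternative path, assuming a high-RAM machine is available for a one-time dequantization: the GGUF could be loaded once, saved as a regular transformers checkpoint, and later reloaded with a quantization config. The output directory name is made up, and the GGUF filename is assumed.)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF"
gguf_file = "meta-llama-3.1-70b-q2_k.gguf"  # assumed filename

# One-time conversion on a high-RAM machine: dequantize and save as a regular checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model.save_pretrained("llama-3.1-70b-dequantized")
tokenizer.save_pretrained("llama-3.1-70b-dequantized")

# Later, reload the converted checkpoint with 4-bit quantization to reduce memory needs.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "llama-3.1-70b-dequantized",
    device_map="auto",
    quantization_config=bnb_config,
)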