Memory Issues when Attempting to Load GGUF Tensors in transformers #34417
Comments
Thanks for opening the issue. Would you be up to also try using …
@ydshieh Many thanks for your response. Sure! I have already tested it with llama.cpp and there were no memory problems. It seems the problem is the de-quantization and the memory it uses. The size of the model at inference time is "small", as expected for a Q2 model, and therefore it fits without problems in the memory of my GPU and CPU.
Hey! You are probably missing the …
@ArthurZucker Many thanks for your response. I ran the following code, and at 28% of de-quantizing the GGUF tensors the process had already used 104 GB of RAM.
I also tried torch.float8_e5m2 with the same result:
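(The exact snippet did not survive extraction; the call being described was presumably of this shape, with the dtype made explicit. The GGUF filename below is an assumption, not taken from the repository.)

import torch
from transformers import AutoModelForCausalLM

# Load the GGUF checkpoint; transformers dequantizes the tensors during loading.
model = AutoModelForCausalLM.from_pretrained(
    "Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF",
    gguf_file="meta-llama-3.1-70b-q2_k.gguf",  # assumed filename
    torch_dtype=torch.float16,  # torch.float8_e5m2 reportedly behaved the same
)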
Hi, I'm having the same issue, with my server suffering catastrophic crashes as a result (I'm forced to reboot, probably because many system apps were forced to crash?). Maybe this is because of this line that seems to copy tensors:
cc @SunMarc if you have some bandwidth to check this. Happy to help otherwise.
With transformers, we are dequantizing the gguf model, meaning that the model will take 280 GB if loaded in fp32 or 140 GB in bfloat16 (roughly 70B parameters × 4 bytes or × 2 bytes, respectively).
Could you share your specs and which model you are getting this issue with?
Doesn't that defeat the purpose of the quantized gguf in the first place? I'm surprised that I can load a given model using ollama just fine on my setup, but my server severely struggles when trying to load the same model using transformers. I suspect it's something that could be easily solved with the right arguments, but I haven't been able to find which ones so far.
My specs: 32 GB of CPU RAM, a GTX 1060 (6 GB of VRAM) and a GTX 1080 (8 GB of VRAM). Ollama gets by just fine even with models larger than Nemo 12B, but I'm having a lot of trouble getting a 7B to even load in my code; it usually crashes about halfway through, IIRC. An example model is model_name = "MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF" with fname = "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf". My code can be found in this fork: https://github.com/thiswillbeyourgithub/repeng/blob/1fe6316ba92b0d455ca44ad7d37e705f59d5a555/research/research.py#L56 As you can see, I tried pretty much every combination of arguments to get the model to load. Ideally I'd like to load the gguf in a similar way to ollama, but I'm not sure what that implies in terms of dtype. Also, ideally the model would be loaded onto the GPU, or both GPUs if needed, or a mix of GPU and CPU RAM (but maybe only ollama supports that? I don't know). Thanks a lot!
ollama is powered by llama.cpp, so it is running the quantized model as you would expect. The goal of this integration with transformers was to enable users to fine-tune their gguf model as described here. However, we are open to adding more support to the gguf integration and enabling inference with the quantized model (without dequantization). Could you share a bit about your goal in using transformers with gguf files? Do you want to use transformers as the inference engine instead of ollama?
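(For context, a rough sketch of the fine-tuning workflow described above, using the Mistral GGUF mentioned earlier in the thread; this is an illustration of the dequantized loading path, not code taken from the docs.)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF"
gguf_file = "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
# The GGUF tensors are dequantized into regular torch tensors here, which is
# why RAM usage is much higher than the size of the .gguf file itself.
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
# From here on it behaves like any other transformers model and can be fine-tuned.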
Thanks for the helpful answer. My current goal is to investigate a few questions I had about control vectors / representation engineering. @vgel made the repeng repository, which makes it intuitive and easy to create control vectors. But with my somewhat limited hardware, I have to either:
The preferred solution of course, given how fast my setup can do inference on 7B and even 12B models, would be to find a way to work with quantized gguf without dequantization.
Edit: Well, actually I just found out that not using gguf but using bitsandbytes is good enough for my use case:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
print("Initializing tokenizer...")
# The device/quantization arguments belong on the model, not the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
print("Initializing model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    low_cpu_mem_usage=True,
    quantization_config=bnb_config,
)
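(A hypothetical follow-up, not part of the original comment, showing how the 4-bit model loaded above could then be used for generation; the prompt is made up.)

# Tokenize a prompt, move it to the model's device, and generate a short completion.
inputs = tokenizer("What is a control vector?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))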
Nice, glad that it works with bnb!
I thought about it some more. Isn't it a bug that the dequantization step from gguf to HF weights does not apply the bnb arguments? What I mean is that if I load a non-gguf model with the … arg, the quantization is applied, yet it seems to be ignored when loading a gguf. Similarly, why is the np.copy needed in that line:
And why is it going through numpy by default, instead of being sent to the device specified by the user? If the user wants to load it onto cuda in the end, all that copying does not seem necessary, right? Overall, I don't understand why there's a dequantization step. Why is it needed? Thanks a lot!
Right now, we just don't support loading the gguf in its quantized format. As I said, this is something we are considering. As for the dequantization step, it can be useful for people who want to fine-tune the gguf model.
Thank you very much for your time.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Environment:
OS: Ubuntu 24.04
Python version: 3.11.8
Transformers version: transformers==4.45.2
Torch version: torch==2.3.0
Model: Meta-Llama-3.1-70B-Q2_K-GGUF - https://huggingface.co/Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF
Who can help?
text models: @ArthurZucker
generate: @zucchini-nlp
Information
Reproduction
Description:
I am attempting to load a quantized GGUF model (Meta-Llama-3.1-70B-Q2_K-GGUF) using the AutoTokenizer and AutoModel classes from Hugging Face transformers, but I am running into severe RAM usage during the de-quantization process (more than 90 GB).
Steps to Reproduce:
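(The original reproduction snippet is not included in this extract; based on the description, the loading call was of roughly the following form. The GGUF filename below is an assumption, not taken from the repository.)

from transformers import AutoModel, AutoTokenizer

model_id = "Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF"
gguf_file = "meta-llama-3.1-70b-q2_k.gguf"  # assumed filename

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
# RAM usage spikes here while the GGUF tensors are de-quantized.
model = AutoModel.from_pretrained(model_id, gguf_file=gguf_file)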
Observed Behavior:
Memory Issues: When attempting to load the GGUF model (Meta-Llama-3.1-70B-Q2_K), the system quickly exhausts available memory during de-quantization, using more than 90 GB. I also tried other models such as https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-GGUF
Expected behavior
Efficient Loading: I expect the GGUF model (like Meta-Llama-3.1-70B-Q2_K-GGUF) to load correctly using the transformers library, with a clear and efficient de-quantization and loading process that does not cause memory exhaustion, even on systems with less than 100 GB of RAM.
Support for Lower RAM Systems: Given that GGUF is a quantized format designed for efficiency, it would be ideal if transformers could either support a more memory-optimized loading process or allow partial model loading, enabling users with lower RAM systems (e.g., < 100GB) to load and run these models effectively.
Alternative Loading Method: If loading such models directly with AutoModel and AutoTokenizer is not possible due to GGUF-specific constraints, it would be helpful to have documentation or tools to convert these models into a compatible format (such as PyTorch) that can be handled within transformers.
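(As a rough illustration of such an alternative path, assuming a high-RAM machine is available for a one-time dequantization: the GGUF could be loaded once, saved as a regular transformers checkpoint, and later reloaded with a quantization config. The output directory name is made up, and the GGUF filename is assumed.)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF"
gguf_file = "meta-llama-3.1-70b-q2_k.gguf"  # assumed filename

# One-time conversion on a high-RAM machine: dequantize and save as a regular checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model.save_pretrained("llama-3.1-70b-dequantized")
tokenizer.save_pretrained("llama-3.1-70b-dequantized")

# Later, reload the converted checkpoint with 4-bit quantization to reduce memory needs.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "llama-3.1-70b-dequantized",
    device_map="auto",
    quantization_config=bnb_config,
)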