
Memory Issues when Attempting to Load GGUF Tensors in transformers #34417

Closed
mpperez3 opened this issue Oct 25, 2024 · 15 comments

mpperez3 commented Oct 25, 2024

System Info

Environment:
OS: Ubuntu 24.04
Python version: 3.11.8
Transformers version: transformers==4.45.2
Torch version: torch==2.3.0
Model: Meta-Llama-3.1-70B-Q2_K-GGUF - https://huggingface.co/Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF

Who can help?

text models: @ArthurZucker
generate: @zucchini-nlp

Information

  • The official example scripts

Reproduction

Description:
I am attempting to load a quantized GGUF model (Meta-Llama-3.1-70B-Q2_K-GGUF) using the AutoTokenizer and AutoModel classes from Hugging Face transformers, but I am encountering severe RAM usage (more than 90 GB) during the de-quantization process.

Steps to Reproduce:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF", trust_remote_code=True, gguf_file="Meta-Llama-3.1-70B-Q2_K-GGUF")
model = AutoModel.from_pretrained("Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF", device_map='auto', trust_remote_code=True, gguf_file="Meta-Llama-3.1-70B-Q2_K-GGUF")

Observed Behavior:
Memory Issues: When attempting to load the GGUF model (Meta-Llama-3.1-70B-Q2_K), the system quickly exhausts available memory during de-quantization, using more than 90 GB. I also tried other models, such as https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-GGUF

Expected behavior

  • Efficient Loading: I expect the GGUF model (like Meta-Llama-3.1-70B-Q2_K-GGUF) to be loaded correctly using the transformers library, with a clear and efficient process for de-quantizing and loading the model without causing memory exhaustion, even on systems with less than 100 GB of RAM.

  • Support for Lower-RAM Systems: Given that GGUF is a quantized format designed for efficiency, it would be ideal if transformers could either support a more memory-optimized loading process or allow partial model loading, enabling users with lower-RAM systems (e.g., < 100 GB) to load and run these models effectively.

  • Alternative Loading Method: If loading such models directly with AutoModel and AutoTokenizer is not possible due to GGUF-specific constraints, it would be helpful to have documentation or tools to convert these models into a compatible format (such as PyTorch) that can be handled within transformers (a sketch of such a conversion follows below).
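For what it's worth, one workaround along the lines of the last point is a one-time dequantization pass on a machine with enough RAM, saving the result as a regular transformers checkpoint so that later loads go through the standard sharded code path instead of the GGUF loader. A minimal sketch, using the repo and file name reported above (the output directory is a hypothetical local path):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Ffftdtd5dtft/Meta-Llama-3.1-70B-Q2_K-GGUF"
gguf_file = "Meta-Llama-3.1-70B-Q2_K-GGUF"  # repo and file name copied from the report above

# One-time conversion: dequantize the GGUF weights, then save them as safetensors.
tokenizer = AutoTokenizer.from_pretrained(repo, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo, gguf_file=gguf_file, torch_dtype=torch.bfloat16)

out_dir = "meta-llama-3.1-70b-dequantized"  # hypothetical output directory
model.save_pretrained(out_dir, safe_serialization=True)
tokenizer.save_pretrained(out_dir)

# Later loads can use the converted checkpoint directly, without the GGUF path:
model = AutoModelForCausalLM.from_pretrained(out_dir, torch_dtype=torch.bfloat16, device_map="auto")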

mpperez3 added the bug label Oct 25, 2024
mpperez3 changed the title from "Incompatibility and Memory Issues when Attempting to Load GGUF Tensors in transformers" to "Memory Issues when Attempting to Load GGUF Tensors in transformers" Oct 25, 2024
@ydshieh
Collaborator

ydshieh commented Oct 28, 2024

@mpperez3

Thanks for opening the issue.

Would you be up for also trying llama.cpp and reporting here what the memory usage looks like with it?

@mpperez3
Author

@ydshieh Many thanks for your response. Sure! I have already tested it with llama.cpp and there were no memory problems. It seems that the problem is the de-quantization and the memory usage in that process. The size of the model at inference time is "small", as expected of a Q2 model, and therefore it fits without problems in the memory of my GPU and CPU.
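For comparison, a minimal llama-cpp-python sketch that keeps the Q2_K weights quantized at inference time (the local file path, context size and prompt are illustrative):

from llama_cpp import Llama

# The GGUF weights stay quantized in memory; no dequantization pass is needed.
llm = Llama(
    model_path="meta-llama-3.1-70b-q2_k.gguf",  # illustrative local path to the downloaded file
    n_ctx=2048,
    n_gpu_layers=-1,  # offload as many layers as fit on the GPU
)
out = llm("Hello, how are you?", max_tokens=32)
print(out["choices"][0]["text"])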

ydshieh self-assigned this Oct 28, 2024
@ArthurZucker
Collaborator

Hey! You are probably missing the torch_dtype flag when initializing the model! By default it will be torch's default, i.e. float32.

@mpperez3
Author

@ArthurZucker Many thanks for your response. I ran the following code, and at 28% of de-quantizing the GGUF tensors the process had already used 104 GB of RAM.

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=model_dir, trust_remote_code=True,
                                              gguf_file=gguf_file, torch_dtype=torch.bfloat16)
model = AutoModel.from_pretrained(model_name, cache_dir=model_dir, device_map='auto', trust_remote_code=True,
                                      gguf_file=gguf_file, torch_dtype=torch.bfloat16)

I also tried torch.float8_e5m2 with the same result:

  File "lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3963, in from_pretrained
    state_dict = load_gguf_checkpoint(gguf_path, return_tensors=True)["tensors"]
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 167, in load_gguf_checkpoint
    weights = dequantize(tensor.data, tensor.tensor_type)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/gguf/quants.py", line 73, in dequantize
    return q.dequantize(data)
           ^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/gguf/quants.py", line 201, in dequantize
    return cls.__dequantize_array(tensor)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/gguf/quants.py", line 173, in __dequantize_array
    return _apply_over_grouped_rows(cls.dequantize_rows, arr=array, otype=np.float32, oshape=cls.__shape_from_bytes(array.shape))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/python3.11/site-packages/gguf/quants.py", line 34, in _apply_over_grouped_rows
    out = np.empty(shape=osize, dtype=otype)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 896. MiB for an array with shape (234881024,) and data type float32
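The failing allocation is consistent with a single fp32 buffer for one dequantized tensor: 234,881,024 elements at 4 bytes each is exactly 896 MiB, which would also explain why torch_dtype alone does not cap peak RAM if the GGUF loader dequantizes to float32 first and only casts afterwards. A quick check of the arithmetic:

elements = 234_881_024       # shape reported in the traceback above
print(elements * 4 / 2**20)  # 896.0 MiB for a float32 buffer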

@thiswillbeyourgithub

Hi, I'm having the same issue, with my server suffering catastrophic crashes as a result (I'm forced to reboot, probably because many system apps were forced to crash?).

Maybe this is because of this line that seems to copy tensors:

parsed_parameters["tensors"][name] = torch.from_numpy(np.copy(weights))
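For what it's worth, torch.from_numpy shares memory with its input rather than copying it, so the explicit np.copy materializes a second buffer on top of the dequantized array. A small self-contained illustration of that sharing behaviour (unrelated to GGUF, just numpy/torch semantics):

import numpy as np
import torch

weights = np.arange(4, dtype=np.float32)
shared = torch.from_numpy(weights)            # no copy: tensor and array share storage
copied = torch.from_numpy(np.copy(weights))   # independent buffer, as in the line above

print(np.shares_memory(weights, shared.numpy()))  # True
print(np.shares_memory(weights, copied.numpy()))  # False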

@ydshieh
Collaborator

ydshieh commented Nov 18, 2024

cc @SunMarc if you have some bandwidth to check this. Happy to help otherwise.

@SunMarc
Member

SunMarc commented Nov 18, 2024

@ArthurZucker Many thanks for your response. I ran the following code, and at 28% of de-quantizing the GGUF tensors the process had already used 104 GB of RAM.

With transformers, we are dequantizing the GGUF model, meaning that this 70B GGUF model will take about 280 GB if loaded in fp32 or about 140 GB in bfloat16.
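A quick back-of-the-envelope check of those numbers (parameter count times bytes per element):

params = 70e9            # Meta-Llama-3.1-70B
print(params * 4 / 1e9)  # ~280 GB in float32
print(params * 2 / 1e9)  # ~140 GB in bfloat16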

Hi, I'm having the same issue, with my server suffering catastrophic crashes as a result (I'm forced to reboot, probably because many system apps were forced to crash?).

Maybe this is because of this line that seems to copy tensors:

Could you share your specs and which model you are getting this issue with?

@thiswillbeyourgithub

With transformers, we are dequantizing the GGUF model, meaning that this 70B GGUF model will take about 280 GB if loaded in fp32 or about 140 GB in bfloat16.

Doesn't that defeat the purpose of the quantized GGUF in the first place? I'm surprised that I can load a given model using ollama just fine on my setup, but my server severely struggles when trying to load the same model using transformers. I suspect it's something that could be easily solved with some arguments, but I've struggled to find which ones so far.

Could you share your specs and which model you are getting this issue with?

My specs: 32 GB of RAM for my CPU, a GTX 1060 (6 GB of VRAM) and a GTX 1080 (8 GB of VRAM). Ollama gets by just fine with models even bigger than Nemo 12B. But I'm having a lot of trouble getting a 7B to even load with my code. It usually crashes halfway through, IIRC.

An example model is this one:

model_name = "MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF"
fname = "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf"

My code can be found in that fork: https://github.com/thiswillbeyourgithub/repeng/blob/1fe6316ba92b0d455ca44ad7d37e705f59d5a555/research/research.py#L56

As you can see, I tried pretty much anything and everything in terms of arguments to get the model to load. What I'm after, ideally, is to load the GGUF in a similar way to ollama, but I'm not sure what that implies in terms of dtype. Also, ideally, the model would be loaded to the GPU, or both GPUs if needed, or a mix of GPU and CPU RAM (but maybe only ollama supports that? I don't know).
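For the GPU + CPU split, transformers' device_map="auto" can be combined with a max_memory map that caps how much each device receives, spilling the rest to CPU RAM. A minimal sketch with caps loosely matching the hardware above (the values are illustrative, and this does not avoid the dequantization spike for GGUF files):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF",
    gguf_file="Mistral-7B-Instruct-v0.3.Q4_K_M.gguf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "5GiB", 1: "7GiB", "cpu": "24GiB"},  # GTX 1060, GTX 1080, CPU RAM (illustrative caps)
)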

Thanks a lot!

@SunMarc
Member

SunMarc commented Nov 19, 2024

Doesn't that defeat the purpose of the quantized GGUF in the first place? I'm surprised that I can load a given model using ollama just fine on my setup, but my server severely struggles when trying to load the same model using transformers. I suspect it's something that could be easily solved with some arguments, but I've struggled to find which ones so far.

ollama is powered by llama.cpp, so it is running the quantized model as you would expect. The goal of this integration with transformers was to enable users to fine-tune their GGUF models, as described here. However, we are open to adding more support to the GGUF integration and enabling inference with the quantized model (without dequantization). Could you share a bit about your goal with using transformers with GGUF files? Do you want to use transformers as the inference engine instead of ollama?
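For reference, a minimal sketch of that fine-tuning path, using the Mistral GGUF mentioned above and peft for a LoRA adapter (the hyperparameters are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

repo = "MaziyarPanahi/Mistral-7B-Instruct-v0.3-GGUF"
gguf_file = "Mistral-7B-Instruct-v0.3.Q4_K_M.gguf"

# The GGUF weights are dequantized on load, after which the model trains like any other checkpoint.
tokenizer = AutoTokenizer.from_pretrained(repo, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo, gguf_file=gguf_file, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()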

@thiswillbeyourgithub

Thanks for the helpful answer.

My current goal is to investigate a few questions I had about control vectors / representation engineering. @vgel made the repeng repository, which makes it intuitive and easy to create control vectors.

But with my somewhat limited hardware, I have to either:

  • try to make my tests on very small LLMs (3B for example), which makes the conclusions less universal
  • or pay for compute (I'm thinking modal.com might be the most flexible option, since it allows running custom code?)

The preferred solution, of course, given how fast my setup can do inference on 7B and even 12B models, would be to find a way to work with quantized GGUF without dequantization.

Edit: Well, actually, I just found out that not using GGUF but using bitsandbytes is good enough for my use case:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# model_name is defined elsewhere in the script (see the fork linked above).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

print("Initializing tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    device_map="cuda",
    low_cpu_mem_usage=True,
    quantization_config=bnb_config,
)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

print("Initializing model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    low_cpu_mem_usage=True,
    quantization_config=bnb_config,
)

@SunMarc
Member

SunMarc commented Nov 20, 2024

Nice, glad that it works with bnb!

@thiswillbeyourgithub

Nice, glad that it works with bnb!

I thought about it some more. Isn't it a bug that the dequantization step from GGUF to HF weights does not apply the bnb arguments? What I mean is that if I load a non-GGUF model with the arg load_in_4bit=True it will be applied, but if I load a GGUF, load_in_4bit=True seems to be either ignored or possibly applied only after the model is fully dequantized. The latter would defeat the purpose, because the reason I want to use a GGUF model IS that I don't have the memory resources to load it in full.

Similarly, why is the np.copy needed in that line:

parsed_parameters["tensors"][name] = torch.from_numpy(np.copy(weights))

And why does it go through numpy by default, instead of being sent to the device specified by the user? Because if the user wants to load it to cuda in the end, all that copying does not seem necessary, right?

Overall, I don't understand why there's a dequantization step. Why is that needed?

Thanks a lot!

@SunMarc
Member

SunMarc commented Nov 22, 2024

Right now, we just don't support loading the GGUF in its quantized format. As I said, this is something we are considering. As for the dequantization part, it can be useful for people who want to fine-tune the GGUF model.

@thiswillbeyourgithub

Thank you very much for your time.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
