How to quantize the model? #32

Open
iamthemulti opened this issue Oct 15, 2024 · 9 comments

Comments

@iamthemulti

Currently having issues attempting to quantize, save, then load the model using HF Transformers.

Is there any known working method for quantizing Aria (preferably to 4bit)?

@aria-hacker
Collaborator

@iamthemulti Quantizing the Aria model presents challenges due to its use of grouped-gemm for efficient inference and training with bfloat16, rather than standard nn.Linear layers. The grouped-gemm implementation can be found in the Aria repository:

Aria/aria/model/moe_lm.py

Lines 444 to 482 in 719ff4e

class GroupedGEMM(nn.Module):
    """
    Grouped GEMM (General Matrix Multiplication) module for efficient expert computation.
    This module utilizes the grouped_gemm library (https://github.com/fanshiqing/grouped_gemm)
    for optimized performance. If the grouped_gemm library is not installed, it gracefully
    falls back to a sequential GEMM implementation, which may be slower but ensures
    functionality.

    Args:
        in_features (int): Number of input features.
        out_features (int): Number of output features.
        groups (int): Number of expert groups.
    """

    def __init__(self, in_features, out_features, groups):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.groups = groups
        self.weight = nn.Parameter(torch.empty(groups, in_features, out_features))

    def forward(self, input, tokens_per_expert):
        """
        Perform grouped matrix multiplication.

        Args:
            input (torch.Tensor): Input tensor of shape (num_tokens, in_features).
            tokens_per_expert (torch.Tensor): Number of tokens assigned to each expert.

        Returns:
            torch.Tensor: Output tensor of shape (num_tokens, out_features).
        """
        tokens_per_expert = tokens_per_expert.cpu()

        # Ensure the CUDA device matches the input tensor's device.
        # This mismatch can occur when using `transformers.AutoModel.from_pretrained`
        # with `device_map="auto"` on a multi-GPU setup.
        torch.cuda.set_device(input.device)
        return experts_gemm(input, self.weight, tokens_per_expert)
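For readers unfamiliar with grouped GEMM, the following is a minimal, framework-free sketch of what the sequential fallback mentioned in the docstring computes. It assumes the tokens are already sorted so that each expert owns a consecutive slice of rows. The function name `sequential_grouped_matmul` and the nested-list representation are illustrative assumptions, not code from the Aria repository.

```python
def sequential_grouped_matmul(tokens, weights, tokens_per_expert):
    """Multiply consecutive token slices by their expert's weight matrix.

    tokens: list of rows (each of length in_features), sorted by expert.
    weights: list of per-expert matrices, each in_features x out_features.
    tokens_per_expert: how many consecutive rows belong to each expert.
    """
    if sum(tokens_per_expert) != len(tokens):
        raise ValueError("tokens_per_expert must sum to the number of tokens")

    output = []
    start = 0
    for expert_weight, count in zip(weights, tokens_per_expert):
        out_features = len(expert_weight[0])
        # Each row in this slice is multiplied by the same expert matrix.
        for row in tokens[start:start + count]:
            output.append([
                sum(x * expert_weight[i][j] for i, x in enumerate(row))
                for j in range(out_features)
            ])
        start += count
    return output


# Hypothetical toy data: 3 tokens, 2 experts, 2 input and 2 output features.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scales by 2
]
result = sequential_grouped_matmul(tokens, weights, [1, 2])
```

Because all experts live in one 3-D `weight` parameter instead of separate `nn.Linear` modules, quantization tools that look for `nn.Linear` layers find nothing to replace, which is the difficulty described above.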

I'm currently working on a custom solution to address this quantization issue.

@aria-hacker
Collaborator

@iamthemulti I've uploaded a fork of the Aria model that replaces the grouped GEMM with a sequential MLP, in which each expert is implemented as a torch.nn.Linear layer and the experts are executed one after another. This makes quantization easier with current open-source libraries, which are optimized for nn.Linear layers.

If you want to quantize an Aria model, please use rhymes-ai/Aria-sequential_mlp
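To illustrate the layout change, here is a small framework-free sketch. GroupedGEMM stores one tensor of shape (groups, in_features, out_features), while nn.Linear stores each weight as (out_features, in_features), so each expert slice has to be transposed when it is split out. The helper names below are hypothetical, and the sketch assumes bias-free experts; how the fork actually performs the conversion is not shown in this thread.

```python
def grouped_to_linear_weights(grouped_weight):
    """Split a (groups, in_features, out_features) nested list into one
    (out_features, in_features) matrix per expert, the layout nn.Linear uses."""
    return [
        [list(column) for column in zip(*expert)]  # transpose: in x out -> out x in
        for expert in grouped_weight
    ]


def linear_forward(x, weight):
    """Bias-free linear layer: y = x @ weight.T, with weight stored as (out, in)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


# Hypothetical toy weights: 2 experts, 2 input features, 3 output features.
grouped = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],  # expert 0
    [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]],  # expert 1
]
per_expert = grouped_to_linear_weights(grouped)
y = linear_forward([1.0, 1.0], per_expert[0])
```

Once every expert is its own linear layer, a quantization library can swap each one for a quantized equivalent without knowing anything about the mixture-of-experts routing.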

I am also trying to use some open-source tools to quantize the Aria model, but I'm encountering some issues on the H100. Currently, I don't have access to other GPUs for quantization.

@DenisSergeevitch

Any updates on quants would be highly valuable @aria-hacker! Please keep us posted about your progress

@leon-seidel

leon-seidel commented Oct 23, 2024

I got a BitsAndBytes NF4 quant working based on Aria-sequential_mlp here. It requires less than 16 GB of VRAM and runs on an RTX 3090.
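As a rough sanity check on that figure, the weight footprint can be estimated from the parameter count and the bits per weight. The sketch below assumes about 25 billion total parameters for Aria, which is an approximation not stated in this thread, and it ignores quantization metadata, any layers left unquantized, activations, and the KV cache.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


# Assumption: approximate total parameter count for Aria (not from this thread).
ASSUMED_PARAMS = 25e9

bf16_gib = weight_memory_gib(ASSUMED_PARAMS, 16)  # original bfloat16 weights
nf4_gib = weight_memory_gib(ASSUMED_PARAMS, 4)    # 4-bit weights such as NF4

print(f"bf16: {bf16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights come to roughly 11.6 GiB, which leaves a few GiB of headroom for overhead and is consistent with the model fitting on a 24 GB RTX 3090.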

@aria-hacker
Collaborator

I've uploaded an int8 weight-only model that has been quantized using torchao. It's also compatible with grouped-gemm. Feel free to try it out if you're interested!
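For context, "weight-only" int8 quantization stores the weights as 8-bit integers plus a floating-point scale and leaves the activations in their original precision. The sketch below shows one common scheme, symmetric quantization with one scale per row, in plain Python. It is a conceptual illustration only: the function names are hypothetical, and it does not reproduce torchao's actual implementation or kernels.

```python
def quantize_rows_int8(weight):
    """Symmetric per-row int8 quantization. Returns (int_rows, scales)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # One scale per row maps the largest magnitude to 127.
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Recover approximate float weights from int8 values and per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Hypothetical toy weight matrix with rows of very different magnitude.
weight = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.04]]
q_rows, scales = quantize_rows_int8(weight)
restored = dequantize_rows(q_rows, scales)
```

Using a separate scale per row keeps the rounding error of each row proportional to that row's own magnitude, so small-valued rows are not swamped by large-valued ones.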

@ntoxeg

ntoxeg commented Nov 4, 2024

Anyone else getting [ERROR|vllm_server.py:212:3614300] 2024-11-04 11:21:12,223 >> KeyError: 'language_model.layers.27.mlp.experts.experts.61.down_proj.weight' while loading the MLP model via vLLM?

@mobicham

mobicham commented Nov 8, 2024

We have an HQQ 4-bit version working well with just 15GB of VRAM: https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py

@h123fire

h123fire commented Jan 9, 2025

Quoting @mobicham: We have an HQQ 4-bit version working well with just 15GB of VRAM: https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py

Video inference fails with this error:
RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run
self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation

@mobicham

mobicham commented Jan 9, 2025

Quoting @h123fire: We have an HQQ 4-bit version working well with just 15GB of VRAM: https://github.com/mobiusml/hqq/blob/master/examples/hf/aria_multimodal.py

Video inference fails with this error: RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation

Can you try an older version of transformers, from about two months ago?


7 participants