
[Suggestion] Guide to Access the Best Quantized LLM Models on Hugging Face #193

Open
Greatz08 opened this issue Dec 31, 2024 · 3 comments

@Greatz08

I think we should give users clear guidance on where to find the best optimized and quantized versions of open-weights large language models (LLMs) on Hugging Face. Many developers release quantized versions of popular LLMs to improve performance and efficiency, and it's crucial to help users find these resources easily.

Not many people have GPUs powerful enough to run even an fp16 8B model, so they will be truly disappointed if they don't know that quantized models exist, what they are, and exactly where to download them. The guide could point them to a video and suggest some popular Hugging Face profiles that publish quantized models; for example, I mostly use bartowski's quants - https://huggingface.co/bartowski
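As a minimal sketch of what that part of the guide could show, the snippet below pulls a single GGUF quant file with the official huggingface_hub library. The repo_id and filename are illustrative assumptions, not a specific recommendation; readers would browse the uploader's repo and pick whichever quant level fits their VRAM.

```python
# Sketch: download one GGUF quant file from a Hugging Face repo.
# repo_id and filename below are assumed examples for illustration only --
# substitute the repo and quant level (e.g. Q4_K_M, IQ2_XS) that fits your GPU.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",   # assumed example repo
    filename="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",     # assumed example quant file
)
print(f"Downloaded to: {model_path}")
```

The same file can also be fetched from a terminal with the `huggingface-cli download` command if someone prefers not to write Python.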

@lingeshsathyanarayanacm

I'm a fresher and this is my first time contributing to a project on GitHub; I'd like to take this issue.

@lingeshsathyanarayanacm

What topics do you expect the guidelines to cover?

@Greatz08
Author

Greatz08 commented Jan 8, 2025

@lingeshsathyanarayanacm You could start by guiding users to understand their own system specs and what level of quantized and non-quantized models their hardware can actually run. For that you will need contributions from multiple people with different hardware, sharing their experience running models.

In my case, I have the RTX 4060 8GB VRAM variant, so I can run several model sizes (7B, 8B, 13B, even 22B), but the level of quantization matters a lot. To run a 22B model on my system I have to use the lowest quant available: I tried Codestral 22B with the IQ2_XS quant and it could barely fit in my VRAM (as far as I remember it also used a small amount of my 780M iGPU's memory). It was a little slow, but it ran and, surprisingly, gave correct answers to a lot of coding questions :-). Smaller models like 7B and 13B I can run up to Q5_K_M or Q5_K_S easily, which is honestly good quality.

Similarly, you would need to gather other people's experience: up to what quant level a given model still fits on their hardware and remains usable, rather than generating at 1-2 t/s. You can find a lot of this on Reddit, where many people share their specs and which models they can run comfortably. You could create categories like fast / usable / not usable, and then build a guide showing how much VRAM is needed for each model size and quant level, so people can understand their limitations and download models accordingly.

The main purpose of this guide would be to collect as much experience from different people as possible and present it in a way that is easy for everyone to understand (technical and non-technical), so readers can save the time we spent testing different quant levels ourselves :-) and know exactly where to download those models from. You can add much more, but even this much would help a lot of people in my opinion.
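To make the "what fits in my VRAM" part concrete, here is a rough back-of-the-envelope sketch I'd suggest as a starting point (the bits-per-weight values and the overhead allowance are my own approximations, not figures from any particular runtime): weight memory is roughly parameter count times bits per weight divided by 8, plus some headroom for the KV cache and runtime overhead. Actual usage varies with context length and the inference backend.

```python
# Rough sketch: estimate whether a quantized model's weights fit in VRAM.
# The bits-per-weight numbers and overhead_gb are approximations / assumptions,
# meant only to give readers a ballpark before they download anything.

BITS_PER_WEIGHT = {   # approximate effective bits per weight for common GGUF quants
    "fp16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "IQ2_XS": 2.3,
}

def estimated_vram_gb(params_billions: float, quant: str, overhead_gb: float = 1.5) -> float:
    """Very rough VRAM estimate: weight size plus a flat allowance for KV cache/overhead."""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return weights_gb + overhead_gb

# Example: a 22B model at different quant levels vs. an 8 GB card (like an RTX 4060)
for quant in ("Q5_K_M", "Q4_K_M", "IQ2_XS"):
    need = estimated_vram_gb(22, quant)
    verdict = "fits" if need <= 8 else "does not fit"
    print(f"22B @ {quant}: ~{need:.1f} GB needed -> {verdict} in 8 GB")
```

A table built from real user reports (the fast / usable / not usable categories above) would be more trustworthy than this formula, but a calculator like this could sit alongside it so readers can sanity-check models and quants that nobody has reported on yet.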
