Is the project not implemented for 70B Llama? #62
Hi, the modeling file currently does not support GQA, but it should require only minimal changes to add. What you described should work perfectly :)
It seems we need a hierarchical pruning scheme for GQA: group pruning, plus head pruning inside each group, since the number of heads in each group has to stay the same.
In order to let the pruned model still run with tensor parallelism (TP), it would be better to keep the number of groups unchanged.
Pruning query heads independently might leave a different number of queries in different groups. So maybe group-based pruning is more reasonable? @zhangzhenyu13
Could we share the mask of query heads among the different groups?
Yes, your settings are right. We need to share z across groups.
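A minimal sketch of what sharing z across groups could look like (the shapes mimic Llama-2-70B's 64 query heads / 8 KV groups; `z_per_group` and the 0.5 threshold are illustrative, not from this repo):

```python
import torch

# Hypothetical GQA layout: 8 KV groups, 8 query heads per group (64 total).
num_groups = 8
heads_per_group = 8

# One mask over the head slots *inside* a group, shared by all groups
# (values in [0, 1] during mask training).
z_per_group = torch.rand(heads_per_group)

# Broadcast to a full per-query-head mask: (8, 8) -> (64,).
head_z = z_per_group.unsqueeze(0).expand(num_groups, -1).reshape(-1)

# Because every group sees the same mask, every group keeps the same
# number of query heads, so the grouped layout (and TP) survives pruning.
kept_per_group = (head_z.view(num_groups, heads_per_group) > 0.5).sum(dim=-1)
assert torch.all(kept_per_group == kept_per_group[0])
```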
Hi @zhangzhenyu13, I have some confusion. The author's composer llama file does not implement any GQA functionality. Did you implement the GQA forward yourself? Which Llama implementation would be best to refer to?
No GQA implementation is found, so the model cannot scale to 70B for composerLLAMA. Maybe we need to design GQA support and introduce head_z for wq and head_z_kv for wk and wv?
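As a rough sketch of that idea (assuming PyTorch; the class, the 70B-like shapes, and the exact mask placement are illustrative, not the repo's actual code), a GQA forward could take `head_z` for the query heads and `head_z_kv` for the key/value heads:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAAttentionSketch(nn.Module):
    """Hypothetical GQA attention with pruning masks: `head_z` scales
    query heads, `head_z_kv` scales KV heads. Not the repo's code."""

    def __init__(self, d_model=8192, n_heads=64, n_kv_heads=8):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.wq = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x, head_z=None, head_z_kv=None):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim)
        k = self.wk(x).view(B, T, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim)

        if head_z is not None:       # (n_heads,) mask over query heads
            q = q * head_z.view(1, 1, -1, 1)
        if head_z_kv is not None:    # (n_kv_heads,) mask over KV heads
            k = k * head_z_kv.view(1, 1, -1, 1)
            v = v * head_z_kv.view(1, 1, -1, 1)

        # Repeat each KV head across its group of query heads.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=2)
        v = v.repeat_interleave(rep, dim=2)

        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, H, T, D)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.wo(out)
```

If `head_z` is built by broadcasting a shared per-group mask, as in the earlier sketch, the pruned model keeps the same number of query heads in every group and the group count itself is untouched, which is what the TP constraint above requires.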