Will this work with Qlora and 4bit inference? #1
Comments
yes currently it relies on standard finetuning to transform an existing model to longer context. conceptually, it should be perfectly reasonable to try LoRA-finetuning instead for more memory efficiency, or even Q-LoRA. we're playing with LoRA at the moment. one change you'll have to make is to also unfreeze the initial embedding layer, to allow it to learn the new landmark tokens. otherwise, should work out of the box. let us know if you manage to get it running by combining the two codebases. we can keep the issue open if other people share their experience as well |
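A minimal sketch of the "unfreeze the embedding layer" step described above, assuming PyTorch and a toy stand-in model rather than the real LLaMA + LoRA stack. `TinyLM`, its sizes, and `freeze_all_but_embeddings` are hypothetical names used only for illustration, and the vocabulary is assumed to already include the landmark token. In a real LoRA run the adapter weights would stay trainable as well.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Toy stand-in for a causal LM (hypothetical; not the repo's model)."""

    def __init__(self, vocab_size: int = 10, hidden: int = 8):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.proj = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = torch.tanh(self.proj(self.embed_tokens(input_ids)))
        return self.lm_head(hidden)


def freeze_all_but_embeddings(model: TinyLM) -> None:
    """Freeze every parameter, then re-enable gradients for the input embeddings."""
    for param in model.parameters():
        param.requires_grad_(False)
    model.embed_tokens.weight.requires_grad_(True)


model = TinyLM()
freeze_all_but_embeddings(model)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# One backward pass: only the embedding table should receive a gradient.
loss = model(torch.tensor([[1, 2, 9]])).sum()
loss.backward()
```

After the backward pass, `model.embed_tokens.weight.grad` is populated while `model.proj.weight.grad` stays `None`, which is a quick way to check that the new token's embedding can actually learn.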
Do you think it should be doable with 4 bit finetuning as implemented on this repo? https://github.com/johnsmith0031/alpaca_lora_4bit How big of a dataset would be necessary for the finetuning to work past the 2048 token mark, assuming a LoRA approach? |
On my server I have a system set up, and this is one of the active problems we're working on. I plan on figuring out how to alter this to pass custom datasets through the tokenization system here and run it with axolotl, which has built-in QLoRA functionality. It's actually a very cool development stack that involves Samantha, a new model developed by ehartford. The server has some of the more notable open-source researchers on it, and if you think you can figure out how to combine the requisite elements, we're paying for compute to run experiments and make attempts, including H100s or clusters of A100s if needed. Honestly, I'd happily pay for you to train a 65B LLaMA fine-tune if you can figure out how to do it with custom datasets on this repo. |
Thank you very much. I don't feel qualified enough to take on such a challenge, maybe a more knowledgeable person can do this. I joined the discord you sent and will be paying attention to any developments around increased context length on LLaMA models. |
I am currently training a 7B TheBloke-WizardLM-7B-HF model. I worked around all of the issues to get it up and running. It's presently generating the tokens for the original dataset. If all goes well and I'm confident in the results, I'll send you a message about seeing if I can work with you to get a 65B model. |
keep me posted! |
Just wanted to give a status update. I've been able to create a QLoRA but haven't seen any improvement as of yet. There were a lot of settings I had to tweak to get it running, and I'm not sure if the issue is the settings I've tweaked or just not training long enough. I will release a fork later today with what I've done and an explanation of the issues I'm running into, if anyone else wants to take a crack at it as well. |
I'll happily take a look, maybe we can crack it if we work at it as a unit |
I could train 65b on 4 node 32GPU if someone wants to help me with the training script |
I worked with @Alignment-Lab-AI and we were able to reproduce the results as described in the paper. Here is a guide if you just want to test out landmark attention (not QLoRA yet, that's still in the works). This guide is for spinning up the model on Lambda Labs, so you might be able to skip some of the steps. I'm leaving all of the information in here just in case. This was done at 3 AM yesterday, so there may be an issue you run into when setting up that I forgot about. If you have any issues, let me know.
Python version: 3.11
Getting the models: mkdir models
Installing the required libraries (if you don't have conda)
Creating the tuned llama file (this is how the merged model is created): cd /landmark-attention/llama
Running inference (run_test.py): cd /landmark-attention/llama
We were able to get context up to 25k working and getting the correct answer, but it is SLOW. I think there are some optimizations that still need to be made to improve the performance. More comprehensive evaluation testing needs to be completed as well. |
Yes, I already did the 7B reproduction with 8 A100 80GB GPUs, but I also want to try larger models, up to 65B, running on 4 nodes. I guess I have to use DeepSpeed as shown in Llama-X. Is anybody up for some collaboration? |
You can join our discord and we can work out the details. We are trying to get this in the hands of people as soon as possible. |
There's a lot of optimization left to do, you can reach us at toasts discord or at mine |
Just giving a status update, we've been able to train a 3B model using 20GB of VRAM, and 7B model using 29GB of VRAM, but have not trained long enough to get results and are still working out how it will get merged with the original weights. I've made a fork that will keep up with my progress on this. |
Hey everyone, I got it working! We are running a longer training overnight, but from training 500 steps on the 7B model, we were getting up to 7k tokens (I tried 32k but got OOM). The accuracy wasn't great (60% accurate at 7k tokens) from lack of training so we haven't released the model yet. We should have something released tomorrow though. Exciting stuff! |
Hi. Thank you for following up on this. Looking forward to hearing more about the final accuracy. Regarding OOM, are you using offloading? |
Very exciting indeed! Is this a full finetuning or QLora ? |
QLoRA. We weren't using offloading, so that's probably why we were getting OOM. We haven't had much luck getting better performance than the initial 60% accuracy. We trained a 13B, a 7B, and a 3B, and all of them had the same issue. Still in the process of tweaking all of the LoRA settings to see if we can improve the results. The goal was to make it so larger models, i.e. 30B or 65B, could be trained as well at a low cost. For doing all 3 of those trainings today, we've only spent about $50 on an H100. |
Thanks for the update. I took a quick pass at your code and it seems the embedding layer is frozen during training. This can be a problem since we are adding a new token and at least this token's embedding needs to be trained. |
Would you be able to provide some example code or another repo on how that is done? I'll take a look myself but this is all still very new to me. I'm just a guy with some free time and used to staring at things until I can get them working. |
I don't have an example unfortunately (but if someone else does, please share). I also have not done this before myself. But, as a guess, I think you should be able to achieve this by updating the |
Okay thanks for the pointer, we'll see how it goes. Took me 3 days to get to this point so hopefully I'll have it figured out this weekend. |
hey man it's ehartford not ehartman :-) hit me up if you need help. |
whoops! sorry about that, i was having a really late night that night |
It's all done now! You can check out my repo here: https://github.com/eugenepentland/landmark-attention-qlora We trained a 7B and 13B model, and the 13B model appears to have equal or better performance than the fully fine tuned 7B base model. We tested each step 20 times. The majority of the work now is just properly evaluating the model beyond the test provided in the paper. All of the models can handle larger context than shown, we just ran out of memory on our GPU (Still haven't tried the CPU offloading). @mkrima I would love to get in contact with you guys to talk about your work and see if there is anything we can do to help. I already have a few people that are evaluating the models now and will be providing some feedback. I also have lots of questions about possible improvements for the future! |
Hi! I am currently running MMLU (soon BBH and HumanEval) on the WizardLM-Landmark model. Will report back once I get numbers! I'm currently using https://github.com/declare-lab/instruct-eval, wired up to code based on @eugenepentland's qlora evaluation code. I have found that it runs much slower, which I guess is to be expected. |
have you switched yet to the new triton kernel which we posted? |
Our most recent progress hit a few roadblocks with regard to Triton. Eugene and the other person primarily working on it have also been super busy for the last few days, though interest is ongoing. |
I haven't had any luck just trying to get your base landmark training running on my local machine yet. The first issue is that the block size is defined as 63 and the max_model_size is 512, but I get an error saying the max model size has to be divisible by the block size, so I dropped the max_model_size down to 504. The issue I have not been able to resolve comes up when running the fused_landmark_attention function, where I get the following error: error: Number of elements must be power-of-two, but "tt.return"(%0) : (tensor<64x100xf32>) -> () doesn't follow the rule (6400) elements So my query tensor is size 100, which is making it fail. The only thing I've changed is that I'm running it on a smaller dataset for the sake of testing faster (I also changed to open_llama_3b so the RTX Quadro 8000 could run without issues). Any help would be great. python3 train.py |
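As a rough illustration of the error quoted above, here is a small check of why a 64x100 tensor trips the power-of-two rule while a head dimension of 128 would not. The helper names are made up for this sketch and are not part of Triton or this repo.

```python
def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


rows, head_dim = 64, 100  # shape taken from the error message: tensor<64x100xf32>

elements = rows * head_dim                 # 6400
padded_head_dim = next_power_of_two(head_dim)  # 128
padded_elements = rows * padded_head_dim   # 8192

print(elements, is_power_of_two(elements))
print(padded_elements, is_power_of_two(padded_elements))
```

6400 factors as 2^8 * 25, so it fails the check, whereas 64 * 128 = 8192 = 2^13 passes.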
Regarding the max_model_size I understand this is a bit confusing since max_model_size corresponds to the context length when use_flash is False whereas when use_flash is True, it corresponds to the number of non-landmark tokens (so for example in your settings the context size is actually 1008 + 1008/63 = 1024). I'll work on a patch so max_model_size will always be the context length which should resolve the first issue but in the meantime you are using the correct value. Regarding the second issue, when using Triton, head dimension needs to be a power of two. Can you test with using 128 head dimension? |
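The arithmetic in the comment above can be sketched as follows, assuming one landmark token is appended per block of `block_size` regular tokens. `total_context_length` is a hypothetical helper, not a function from the repo.

```python
def total_context_length(non_landmark_tokens: int, block_size: int) -> int:
    """Context length when one landmark token follows every block of regular tokens."""
    if non_landmark_tokens % block_size != 0:
        raise ValueError(
            f"{non_landmark_tokens} is not divisible by block size {block_size}"
        )
    num_landmarks = non_landmark_tokens // block_size
    return non_landmark_tokens + num_landmarks


print(total_context_length(1008, 63))  # 1008 + 16 landmarks
print(total_context_length(504, 63))   # 504 + 8 landmarks

try:
    total_context_length(512, 63)  # the setting that raised the divisibility error
    divisible = True
except ValueError:
    divisible = False
```

With a block size of 63, each block plus its landmark occupies 64 positions, so 1008 regular tokens give a context of 1024 and 504 give 512, while 512 regular tokens cannot be split into whole blocks.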
Since you are using a pretrained model, you probably can not increase the head dimension directly. One solution is padding your key, query, and value vectors with zeros before passing it to the fused attention (which should not affect the output) and then dropping the additional dimension after the attention is done. (By the way it's better to move this discussion to a new issue thread at some point since it's no longer about QLORA) |
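A small pure-Python sketch of why zero padding should leave the attention output unchanged, using a single query and tiny made-up vectors (head dimension 3 padded to 4) instead of the real fused kernel. One assumption worth flagging: the softmax scale here is computed from the original head dimension. If a kernel derives the scale from the padded size, the outputs would differ.

```python
import math


def attention(q, keys, values, scale):
    """Single-query softmax attention over plain Python lists."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def pad(vec, target_dim):
    """Append zeros until the vector has target_dim entries."""
    return list(vec) + [0.0] * (target_dim - len(vec))


head_dim, padded_dim = 3, 4  # illustrative sizes only
q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0]]
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
scale = 1.0 / math.sqrt(head_dim)  # scale from the ORIGINAL head dimension

plain = attention(q, keys, values, scale)
padded_full = attention(
    pad(q, padded_dim),
    [pad(k, padded_dim) for k in keys],
    [pad(v, padded_dim) for v in values],
    scale,
)
padded = padded_full[:head_dim]  # drop the extra dimension afterwards
```

The zero entries add nothing to the query-key dot products, and the padded value column only produces an extra output entry of zero, which is sliced off at the end.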
I was able to fix that issue by padding my q, k, v tensors but I'm just getting an OOM error from triton now. I'll take a look into it later but I'm not sure this is something I will be able to fix in any kind of simple fashion. This is after I had already reduced the num_stages to 1. triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 81920, Hardware limit: 65536. Reducing block sizes or (And if I run into further issues I will open a new issue) |
This can be solved by lowering the block size to 31. Alternatively, if possible, using bf16 instead of tf32 should also fix it (possibly fp16 might also work). |
I got training running once I set the block size to 15 (I'm on an older GPU that doesn't support bf16/tf32). Also, the head dimension was only 100 for OpenLLaMA 3B; when I reran it on WizardLM 7B, the head dimension was already 128. Training a 7B model used 25GB of VRAM with my QLoRA repo, but still required 42GB of VRAM at the beginning of training because of Triton. I'll need to look into it further, but I'll push the updates to my repo later tonight. |
@ethanhs Hi! Just wanted to follow up on this. Have you got any results for qlora on BBH? |
Yeah, I'm pretty sure the results were significantly worse than the base model. I don't have the numbers anymore. |
Super excited about this project! I'm in the process of reading the paper now! But just curious, are there any plans to make this work for 4bit or 8bit finetuning, so it can be applied to the larger opensource models?