Adding simple batch example #1038
Conversation
FWIW, here's the GGUF that was giving me fits: gemma-2-9b-it. The model card says their prompt template is one thing, but looking at the metadata in the GGUF, that's not what's there, which clearly isn't right. Looks like I'm not the only one - https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/discussions/12. No idea why the weird template polluted the other conversations, though. Theoretically there could be an actual bug in there, and this template just triggers the race condition quickly thanks to it being a bit of a chaos monkey.
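(Side note, not from the original comment: one way to see which template is actually baked into a GGUF is to read the standard `tokenizer.chat_template` metadata key. The sketch below assumes LLamaSharp exposes the loaded model's metadata as a string dictionary; the property name is an assumption, so check the current API.)

```csharp
// Sketch: inspect the chat template stored in the GGUF's metadata.
// "tokenizer.chat_template" is the standard GGUF key; the Metadata property
// on the loaded weights is assumed here and may be named differently.
using var weights = LLamaWeights.LoadFromFile(parameters);
if (weights.Metadata.TryGetValue("tokenizer.chat_template", out var template))
    Console.WriteLine(template);   // compare this against the model card's template
else
    Console.WriteLine("No chat template embedded in this GGUF.");
```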
Actually, looking at my video, it is curious that the Paris answer mentioned using arithmetic to get the capital of France...
Those replies with the bad template look a lot like there's some kind of bug leaking context from one conversation to another!
Yeah, that's why I had to get about as simple as possible when walking through the code. One thing I noticed was that I'd sometimes see the same id for the chain being used in the sampler, even though they are all created independently. But I'll admit total ignorance about what I'm looking at. Not gonna stop me from poking around more tonight after the kids go to bed, though.
I'm happy to merge this as-is. There's just one thing I thought I'd mention, and you can add it if you like - I'm mostly mentioning it since you're learning things and it'll help to know!

Batching can share tokens (e.g. if you prompt two sequences with the same text), so a shared prompt only has to be evaluated once.
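(A minimal sketch of that idea, not part of the original comment: it assumes LLamaSharp's batched Conversation API with Create, Prompt, Infer and Fork; the exact calls may differ, so check the batched executor examples in the repo.)

```csharp
// Sketch only: evaluate a shared prompt once, then fork so two conversations
// reuse the same cached prompt tokens instead of decoding the prompt twice.
// Create/Prompt/Infer/Fork are assumed from LLamaSharp's batched API.
using var conversation = executor.Create();
conversation.Prompt(executor.Context.Tokenize(sharedPrompt));
await executor.Infer();

using var forked = conversation.Fork();   // both now share the prompt's tokens
// From here, conversation and forked can be prompted and sampled independently.
```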
I've cleaned up my code and added some better checking on your suggestion. Ready to merge if you don't see anything else. Once it is in there, and if you get an itch to look at the bug, it does seem to be quite reproducible with this Gemma model and the first three prompts from this sample. I'll create a new issue, although with it theoretically being low-level, it could be something that has already been caught in the llama.cpp updates since the last sync.
I'll try to make some time to look into the issue this weekend. If it is a bug, I think it's most likely a bug on our end inside the batching code.
I've been investigating this issue today; it reproduces locally by running this example, so that's been very helpful! I can work around the issue by disabling one of the features of LLamaBatch. For each token the batch stores the token itself, its position, and the set of sequences it belongs to.

The C# LLamaBatch automatically finds identical tokens and shares them, so if you add the same token at the same position for two sequences, it will automatically add 2 sequences for the same entry, instead of 2 completely independent entries. If I disable that feature, so it doesn't share any tokens between sequences, it works around the issue. As I understand it, though, that shouldn't be necessary as long as the very last token (i.e. the one that produces logits) is not shared. I haven't worked out what the issue is any further than that. It may be a bug in llama.cpp, so I'm hoping to test this out again once the next binary update is done to see if we can reproduce it.
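(To make the sharing behaviour concrete, here's a small sketch that was not part of the original comment. It assumes an `LLamaBatch.Add(token, position, sequence, logits)` overload roughly like LLamaSharp's; treat the exact signature and the int conversions as assumptions.)

```csharp
// Assumed signature: LLamaBatch.Add(LLamaToken token, LLamaPos pos, LLamaSeqId seq, bool logits).
var seqA = (LLamaSeqId)0;
var seqB = (LLamaSeqId)1;

// Same token at the same position for both sequences: LLamaBatch can collapse
// these into a single entry that lists both sequence ids, rather than keeping
// two independent entries.
for (var i = 0; i < promptTokens.Length; i++)
{
    batch.Add(promptTokens[i], i, seqA, false);
    batch.Add(promptTokens[i], i, seqB, false);
}

// The final token of each sequence is the one that requests logits, so it is
// added per sequence and should not be shared.
batch.Add(lastTokenA, promptTokens.Length, seqA, true);
batch.Add(lastTokenB, promptTokens.Length, seqB, true);
```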
I tested this with another model, Llama-3.2-3B, and it did not have any issues. It really does seem to be something wrong with that model's template, which is bizarre!
Interesting. I can get Llama 3.2 3B (Q8) to start getting confused, but it takes a bit more work. With these questions:

var messages = new[]
{
"What's 2+2?",
"Where is the coldest part of Texas?",
"What's the capital of France?",
"What's a one word name for a food item with ground beef patties on a bun?",
"What are two toppings for a pizza?",
"What american football play are you calling on a 3rd and 8 from our own 25?",
"What liquor should I add to egg nog?",
"I have two sons, Bert and Ernie. What should I name my daughter?",
"What day comes after Friday?",
"What color shoes should I wear with dark blue pants?",
"What's 2 * 3?",
"Should I order the mozzarella sticks or the potato wedges?",
"What liquor should I add to sprite?",
"What's the recipe for an old fashioned?",
"Does miller lite taste great or less filling?"
};
It's not going off the rails quite as badly without throwing a lot at it at once, but it does stay in crazy land, ordering pepperoni-flavored vodka and insisting on telling me about the weather in Texas when I ask about mixed drinks. Here are the results from Qwen 2.5.1 Coder 7B Instruct. Far less crazy, but it becomes insistent on answering what the coldest part of something is here and there.

As you said, this might be low-level enough that just getting llama.cpp updated sorts things out. If not, I might go down the route of trying to also reproduce this in raw llama.cpp, if my brain still remembers any C++ from 25 years ago. Oh, and thanks again for taking the time to look.
I was really struggling to wrap my head around batched execution and got caught up in the weeds on an issue that turned out to be a bad prompt template in a GGUF I was using. Anyway, I created a straightforward "run these prompts as a batch" sample that was quite helpful to me while debugging (a rough outline of the idea is sketched below), so I figured I'd toss it out there for inclusion with the samples if you see fit. If not, no worries, it has already served its purpose for me.
Recording.2025-01-08.140929.mp4
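(A rough outline of what such a sample might look like, added here only for orientation. It assumes LLamaSharp's batched Conversation API - Create, Prompt, Infer - and uses a hypothetical SampleNextToken helper as a stand-in for the real sampling calls; see the batched executor examples in the repo for the exact API.)

```csharp
// Hypothetical outline of a "run these prompts as a batch" sample.
// Create/Prompt/Infer are assumed from LLamaSharp's batched API;
// SampleNextToken is a made-up placeholder for the real sampler calls.
var conversations = prompts
    .Select(p =>
    {
        var c = executor.Create();                  // one sequence per prompt
        c.Prompt(executor.Context.Tokenize(p));     // queue the prompt tokens
        return c;
    })
    .ToList();

for (var i = 0; i < maxTokens; i++)
{
    await executor.Infer();                         // one decode advances every conversation

    foreach (var c in conversations)
    {
        var token = SampleNextToken(c);             // placeholder: sample from this conversation's logits
        c.Prompt(token);                            // feed the sampled token back in for the next step
    }
}
```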