Route requests to take advantage of prefix caching #266
we could send a UID, for example in the
My thinking is we could treat e.g. the
vLLM issue to watch: vllm-project/vllm#8523
I see 3 main options:

1. Route based on the prefix-cache state reported by the vLLM instances themselves (depends on the vLLM issue linked above).
2. Route based on a key supplied by the client, e.g. a session header.
3. Have KubeAI compute prefix hashes from the prompts and route on those.
Option 1 is not currently possible from my understanding (see the issue linked in the previous comment). Option 2 is probably the simplest to implement in KubeAI, but harder for clients to take full advantage of. The simplest approach would be to use a "user session" as the routing key. This might work well in chat scenarios but would not take advantage of scenarios where large prefixes transcend user sessions. Option 3 could be implemented fairly simply as a short-term solution: consider an implementation where we define a static prefix length that KubeAI uses to calculate prefix hashes. We could evolve that technique to be more sophisticated over time. It is possible that Option 3 might even prove more advantageous than Option 1 in the longer term, due to the amount/frequency of communication that might be required to support Option 1.
I think it's important that the user has control over the behavior, so I see a future where we do both options 2 and 3. Option 3 would be nice because everyone gets the benefit out of the box. I suggest starting with option 2, so only requests that specify a session header will enable this behavior. Option 3 could be done by hashing the first X (e.g. 100) characters of each prompt and then using a hash table to send requests to different backends, while still respecting target concurrent request counts (see the sketch below). The tricky part may be coming up with a value of X that makes sense as a default.
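To make that concrete, here is a minimal Go sketch of the option 3 routing rule, assuming a fixed X of 100 and a plain modulo mapping from hash to backend; the names (`Backend`, `pickBackend`, `prefixLength`) are hypothetical and not KubeAI's actual types or settings:

```go
package router

import "hash/fnv"

// prefixLength is the "X" discussed above; a tunable default, not a real KubeAI setting.
const prefixLength = 100

// Backend holds the load state the router needs for a decision (hypothetical type).
type Backend struct {
	Name        string
	InFlight    int // current concurrent requests
	MaxInFlight int // target concurrent request count
}

// prefixHash hashes the first prefixLength bytes of the prompt
// (bytes rather than characters, to keep the sketch simple).
func prefixHash(prompt string) uint64 {
	if len(prompt) > prefixLength {
		prompt = prompt[:prefixLength]
	}
	h := fnv.New64a()
	h.Write([]byte(prompt))
	return h.Sum64()
}

// pickBackend prefers the backend that owns the prefix hash, but spills over
// to the least-loaded backend when the preferred one is already at its
// target concurrent-request count.
func pickBackend(prompt string, backends []Backend) *Backend {
	if len(backends) == 0 {
		return nil
	}
	idx := int(prefixHash(prompt) % uint64(len(backends)))
	preferred := &backends[idx]
	if preferred.InFlight < preferred.MaxInFlight {
		return preferred
	}
	least := &backends[0]
	for i := range backends {
		if backends[i].InFlight < least.InFlight {
			least = &backends[i]
		}
	}
	return least
}
```

Option 2 would be the same mechanism with a different key: hash the client-supplied session header value instead of the prompt prefix. A real implementation would presumably also want consistent hashing so that scaling backends up or down does not reshuffle every prefix, but the modulo version keeps the sketch short.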
One more thought that came to mind for option 3: we could take the first 100, 500, and 1000 characters and hash each of those.
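Continuing the same hypothetical sketch, the multi-length idea could look roughly like this: compute one hash per tier and consult an affinity table from the longest prefix down, so a request that shares 1000 characters with earlier traffic is matched more precisely than one that only shares the first 100:

```go
package router

import "hash/fnv"

// prefixLengths are the tiers mentioned above, checked longest-first so the
// most specific shared prefix wins.
var prefixLengths = []int{1000, 500, 100}

// tieredHashes returns one hash per tier the prompt is long enough to fill.
func tieredHashes(prompt string) []uint64 {
	var hashes []uint64
	for _, n := range prefixLengths {
		if len(prompt) < n {
			continue
		}
		h := fnv.New64a()
		h.Write([]byte(prompt[:n]))
		hashes = append(hashes, h.Sum64())
	}
	return hashes
}

// lookupBackend consults an affinity table (prefix hash -> backend name),
// longest prefix first. The caller would record new entries after routing
// and still apply its concurrency limits before using the result.
func lookupBackend(prompt string, affinity map[uint64]string) (string, bool) {
	for _, h := range tieredHashes(prompt) {
		if name, ok := affinity[h]; ok {
			return name, true
		}
	}
	return "", false
}
```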
See: https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html