Route requests to take advantage of prefix caching #266

Open
nstogner opened this issue Oct 8, 2024 · 6 comments

@nstogner (Contributor) commented Oct 8, 2024

See: https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html

@sam-huang1223 commented:

We could send a UID, for example in the extra_body of a given request:

from openai import OpenAI

client = OpenAI()

# First request: tag it with a caller-chosen decode_id so the router can
# group requests that share a prefix.
response = client.chat.completions.create(
  model="model",
  messages=[
    {
      "role": "user",
      "content": "heyyy"
    }
  ],
  temperature=0.2,
  extra_body={"decode_id": "SOME_HASH"},
)

# Second request: same decode_id, so it can be routed to the same backend
# and reuse the warm prefix cache.
response = client.chat.completions.create(
  model="model",
  messages=[
    {
      "role": "user",
      "content": "heyyy there"
    }
  ],
  temperature=0.2,
  extra_body={"decode_id": "SOME_HASH"},
)

@samos123 (Contributor) commented:

My thinking is that we could treat, e.g., the X-Session-ID HTTP header as a way to tell us that a request belongs to the same session. You can set custom HTTP headers in the Python openai client as well.
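
For illustration, a client could pin a session with such a header like this (a minimal sketch using the openai Python client; the base URL, API key, and session value are placeholders, and X-Session-ID is only the header name proposed above, not existing KubeAI behavior):

from openai import OpenAI

# Placeholder endpoint and key for a KubeAI-style proxy; not real configuration.
client = OpenAI(
    base_url="http://localhost:8000/openai/v1",
    api_key="not-needed",
    default_headers={"X-Session-ID": "user-1234-session-5678"},
)

response = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "heyyy"}],
    temperature=0.2,
    # The header can also be set per request instead of per client:
    extra_headers={"X-Session-ID": "user-1234-session-5678"},
)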

@nstogner (Contributor, Author) commented:

vLLM issue to watch: vllm-project/vllm#8523

@nstogner (Contributor, Author) commented:

I see 3 main options:

  1. Integrate with vLLM to ask/be-told what the state of the cache is.
  2. Sticky sessions based on request attributes (HTTP headers, etc).
  3. Calculate an approximation of the cache state of the backend engines within KubeAI.

Option 1 is not currently possible, from my understanding (see the issue linked in the previous comment).

Option 2 is probably the simplest to implement in KubeAI, but harder for clients to take full advantage of. The most straightforward approach would be to use a "user session" as the routing key. This might work well in chat scenarios, but it would not take advantage of cases where large prefixes transcend user sessions.

Option 3 could be implemented fairly simply as a short-term solution: define a static prefix length that KubeAI uses to calculate prefix hashes. We could make that technique more sophisticated over time. Option 3 might even prove more advantageous than Option 1 in the longer term, given the amount and frequency of communication that might be required to support Option 1.
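
As a rough sketch of that short-term version of Option 3 (illustrative only; the prefix length, function name, and backend list are placeholders, not KubeAI code):

import hashlib

PREFIX_LENGTH = 1000  # static prefix length used to compute the routing hash

def backend_for_prompt(prompt: str, backends: list[str]) -> str:
    # Hash only the first PREFIX_LENGTH characters, so prompts that share a
    # long prefix hash to the same backend and can reuse its prefix cache.
    prefix = prompt[:PREFIX_LENGTH]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]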

@samos123 (Contributor) commented:

I think it's important that the user has control over the behavior, so I see a future where we do both option 2 and option 3. Option 3 would be nice because everyone gets the benefits out of the box.

I suggest starting with option 2, so only requests that specify a session header will enable this behavior.

Option 3 could be done by hashing the first X (e.g. 100) characters of each prompt. Afterwards, use a hash table to send requests to different backends while still respecting target concurrent-request counts. The tricky part may be coming up with a value of X that makes sense as a default.
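
A sketch of that refinement (again illustrative; the concurrency limit, in-flight counts, and default X are made up):

import hashlib

def route(prompt: str, backends: list[str], in_flight: dict[str, int],
          x: int = 100, max_concurrency: int = 8) -> str:
    # Prefer the backend selected by hashing the first X characters...
    digest = hashlib.sha256(prompt[:x].encode("utf-8")).digest()
    preferred = backends[int.from_bytes(digest[:8], "big") % len(backends)]
    if in_flight.get(preferred, 0) < max_concurrency:
        return preferred
    # ...but spill over to the least-loaded backend if it is saturated.
    return min(backends, key=lambda b: in_flight.get(b, 0))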

@samos123 (Contributor) commented:

One more thought that came to mind for option 3: we could hash the first 100, 500, and 1000 characters of the prompt and route based on those hashes.
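
A sketch of how that multi-length variant could look (the lengths, names, and in-memory table are illustrative only):

import hashlib

PREFIX_LENGTHS = (1000, 500, 100)  # checked longest-first
seen_prefixes: dict[str, str] = {}  # prefix hash -> backend that last served it

def hash_prefix(prompt: str, length: int) -> str:
    return hashlib.sha256(prompt[:length].encode("utf-8")).hexdigest()

def route_multi(prompt: str, backends: list[str]) -> str:
    # Prefer the backend that already served the longest matching prefix.
    for length in PREFIX_LENGTHS:
        if len(prompt) >= length and hash_prefix(prompt, length) in seen_prefixes:
            return seen_prefixes[hash_prefix(prompt, length)]
    # No match yet: fall back to any placement policy and record the hashes
    # so later requests sharing these prefixes follow the same backend.
    backend = backends[0]
    for length in PREFIX_LENGTHS:
        if len(prompt) >= length:
            seen_prefixes[hash_prefix(prompt, length)] = backend
    return backend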
