Route requests to take advantage of prefix caching #266

Open
nstogner opened this issue Oct 8, 2024 · 6 comments

@nstogner (Contributor) commented Oct 8, 2024

See: https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html

@sam-huang1223 commented:

We could send a UID, for example in the extra_body of a given request:

from openai import OpenAI

client = OpenAI()

# First request: tag it with a caller-chosen decode_id so the router can
# group requests that share a prefix.
response = client.chat.completions.create(
  model="model",
  messages=[
    {
      "role": "user",
      "content": "heyyy"
    }
  ],
  temperature=0.2,
  extra_body={"decode_id": "SOME_HASH"},
)

# Second request: same decode_id, so it can be routed to the same backend
# and reuse the warm prefix cache.
response = client.chat.completions.create(
  model="model",
  messages=[
    {
      "role": "user",
      "content": "heyyy there"
    }
  ],
  temperature=0.2,
  extra_body={"decode_id": "SOME_HASH"},
)

@samos123 (Contributor) commented:

My thinking is that we could treat, e.g., the X-Session-ID HTTP header as a way to tell us that a request belongs to the same session. You can set custom HTTP headers in the Python openai client as well.
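
For illustration, a client could pin a session with such a header like this (a minimal sketch using the openai Python client; the base URL, API key, and session value are placeholders, and X-Session-ID is only the header name proposed above, not existing KubeAI behavior):

from openai import OpenAI

# Placeholder endpoint and key for a KubeAI-style proxy; not real configuration.
client = OpenAI(
    base_url="http://localhost:8000/openai/v1",
    api_key="not-needed",
    default_headers={"X-Session-ID": "user-1234-session-5678"},
)

response = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "heyyy"}],
    temperature=0.2,
    # The header can also be set per request instead of per client:
    extra_headers={"X-Session-ID": "user-1234-session-5678"},
)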

@nstogner (Contributor, Author) commented:

vLLM issue to watch: vllm-project/vllm#8523

@nstogner (Contributor, Author) commented:

I see 3 main options:

  1. Integrate with vLLM to ask/be-told what the state of the cache is.
  2. Sticky sessions based on request attributes (HTTP headers, etc).
  3. Calculate an approximation of the cache state of the backend engines within KubeAI.

Option 1 is not currently possible, from my understanding (see the issue linked in the previous comment).

Option 2 is probably the simplest to implement in KubeAI, but harder for clients to take full advantage of. The most straightforward approach would be to use a "user session" as the routing key. This might work well in chat scenarios, but it would not take advantage of cases where large prefixes transcend user sessions.

Option 3 could be implemented fairly simply as a short-term solution: define a static prefix length that KubeAI uses to calculate prefix hashes. We could make that technique more sophisticated over time. Option 3 might even prove more advantageous than Option 1 in the longer term, given the amount and frequency of communication that might be required to support Option 1.
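
As a rough sketch of that short-term version of Option 3 (illustrative only; the prefix length, function name, and backend list are placeholders, not KubeAI code):

import hashlib

PREFIX_LENGTH = 1000  # static prefix length used to compute the routing hash

def backend_for_prompt(prompt: str, backends: list[str]) -> str:
    # Hash only the first PREFIX_LENGTH characters, so prompts that share a
    # long prefix hash to the same backend and can reuse its prefix cache.
    prefix = prompt[:PREFIX_LENGTH]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]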

@samos123 (Contributor) commented:

I think it's important that the user has control over the behavior, so I see a future where we do both option 2 and option 3. Option 3 would be nice because everyone gets the benefits out of the box.

I suggest starting with option 2, so only requests that specify a session header will enable this behavior.

Option 3 could be done by hashing the first X (e.g. 100) characters of each prompt. Afterwards, use a hash table to send requests to different backends while still respecting target concurrent-request counts. The tricky part may be coming up with a value of X that makes sense as a default.
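
A sketch of that refinement (again illustrative; the concurrency limit, in-flight counts, and default X are made up):

import hashlib

def route(prompt: str, backends: list[str], in_flight: dict[str, int],
          x: int = 100, max_concurrency: int = 8) -> str:
    # Prefer the backend selected by hashing the first X characters...
    digest = hashlib.sha256(prompt[:x].encode("utf-8")).digest()
    preferred = backends[int.from_bytes(digest[:8], "big") % len(backends)]
    if in_flight.get(preferred, 0) < max_concurrency:
        return preferred
    # ...but spill over to the least-loaded backend if it is saturated.
    return min(backends, key=lambda b: in_flight.get(b, 0))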

@samos123 (Contributor) commented:

One more thought that came to mind for option 3: we could hash the first 100, 500, and 1000 characters of the prompt and route based on those hashes.
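
A sketch of how that multi-length variant could look (the lengths, names, and in-memory table are illustrative only):

import hashlib

PREFIX_LENGTHS = (1000, 500, 100)  # checked longest-first
seen_prefixes: dict[str, str] = {}  # prefix hash -> backend that last served it

def hash_prefix(prompt: str, length: int) -> str:
    return hashlib.sha256(prompt[:length].encode("utf-8")).hexdigest()

def route_multi(prompt: str, backends: list[str]) -> str:
    # Prefer the backend that already served the longest matching prefix.
    for length in PREFIX_LENGTHS:
        if len(prompt) >= length and hash_prefix(prompt, length) in seen_prefixes:
            return seen_prefixes[hash_prefix(prompt, length)]
    # No match yet: fall back to any placement policy and record the hashes
    # so later requests sharing these prefixes follow the same backend.
    backend = backends[0]
    for length in PREFIX_LENGTHS:
        if len(prompt) >= length:
            seen_prefixes[hash_prefix(prompt, length)] = backend
    return backend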
