Add vLLM raw completions API #823

aidando73 · 2025-01-18T08:20:11Z

What does this PR do?

Adds raw completions API to vLLM

Test Plan

Setup

# Run vllm server
conda create -n vllm python=3.12 -y
conda activate vllm
pip install vllm

# Run llamastack
conda create --name llamastack-vllm python=3.10
conda activate llamastack-vllm

export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct && \
pip install -e . && \
pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==0.1.0rc7 && \
llama stack build --template remote-vllm --image-type conda && \
llama stack run ./distributions/remote-vllm/run.yaml \
  --port 5000 \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env VLLM_URL=http://localhost:8000/v1 | tee -a llama-stack.log

Integration

# Run
conda activate llamastack-vllm
export VLLM_URL=http://localhost:8000/v1
pip install pytest pytest_html pytest_asyncio aiosqlite
pytest llama_stack/providers/tests/inference/test_text_inference.py -v -k vllm

# Results
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm_remote] PASSED            [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm_remote] PASSED            [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm_remote] SKIPPED  [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm_remote] SKIPPED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm_remote] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm_remote] PASSED     [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm_remote] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm_remote] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm_remote] PASSED [100%]

====================================== 7 passed, 2 skipped, 99 deselected, 1 warning in 9.80s ======================================

Manual

# Install
pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==0.1.0rc7

Apply this diff

diff --git a/llama_stack/distribution/server/server.py b/llama_stack/distribution/server/server.py
index 8dbb193..95173e2 100644
--- a/llama_stack/distribution/server/server.py
+++ b/llama_stack/distribution/server/server.py
@@ -250,7 +250,7 @@ class ClientVersionMiddleware:
                     server_version_parts = tuple(
                         map(int, self.server_version.split(".")[:2])
                     )
-                    if client_version_parts != server_version_parts:
+                    if False and client_version_parts != server_version_parts:
 
                         async def send_version_error(send):
                             await send(
diff --git a/llama_stack/templates/remote-vllm/run.yaml b/llama_stack/templates/remote-vllm/run.yaml
index 4eac4da..32eb50e 100644
--- a/llama_stack/templates/remote-vllm/run.yaml
+++ b/llama_stack/templates/remote-vllm/run.yaml
@@ -94,7 +94,8 @@ metadata_store:
   type: sqlite
   db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/remote-vllm}/registry.db
 models:
-- metadata: {}
+- metadata: 
+    llama_model: meta-llama/Llama-3.2-3B-Instruct
   model_id: ${env.INFERENCE_MODEL}
   provider_id: vllm-inference
   model_type: llm

Test 1:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:5000",
)

response = client.inference.completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    content="Hello, world client!",
)

print(response)

Test 2

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:5000",
)

response = client.inference.completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    content="Hello, world client!",
    stream=True,
)

for chunk in response:
    print(chunk.delta, end="", flush=True)

 I'm excited to introduce you to our latest project, a comprehensive guide to the best coffee shops in [City]. As a coffee connoisseur, you're in luck because we've scoured the city to bring you the top picks for the perfect cup of joe.

In this guide, we'll take you on a journey through the city's most iconic coffee shops, highlighting their unique features, must-try drinks, and insider tips from the baristas themselves. From cozy cafes to trendy cafes, we've got you covered.

**Top 5 Coffee Shops in [City]**

1. **The Daily Grind**: This beloved institution has been serving up expertly crafted pour-overs and lattes for over 10 years. Their expert baristas are always happy to guide you through their menu, which features a rotating selection of single-origin beans from around the world...

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Ran pre-commit to handle lint / formatting issues.
Read the contributor guideline,
Pull Request section?
Updated relevant documentation.
Wrote necessary unit or integration tests.

aidando73 · 2025-01-18T09:39:58Z

cc @terrytangyuan

ashwinb · 2025-01-22T05:10:01Z

llama_stack/providers/tests/inference/test_text_inference.py

@@ -118,6 +118,7 @@ async def test_completion(self, inference_model, inference_stack):
            "remote::fireworks",
            "remote::nvidia",
            "remote::cerebras",
+            "remote::vllm",


can you just remove this block now entirely? whichever provider fails the test better just show up as a Failure.

ashwinb

thank you!

terrytangyuan

Thank you!

aidando73 requested review from ashwinb, yanxi0830, hardikjshah, dltn, raghotham, dineshyv, vladimirivic and sixianyi0721 as code owners January 18, 2025 08:20

aidando73 marked this pull request as draft January 18, 2025 08:20

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 18, 2025

Add vllm completions

09fc380

aidando73 force-pushed the aidand-vllm-completions branch from 6823a60 to 09fc380 Compare January 18, 2025 08:35

aidando73 changed the title ~~Add vllm raw completions API~~ Add vLLM raw completions API Jan 18, 2025

add test

7e667b6

aidando73 marked this pull request as ready for review January 18, 2025 09:39

ashwinb reviewed Jan 22, 2025

View reviewed changes

ashwinb approved these changes Jan 22, 2025

View reviewed changes

terrytangyuan reviewed Jan 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add vLLM raw completions API #823

Add vLLM raw completions API #823

aidando73 commented Jan 18, 2025 •

edited

Loading

aidando73 commented Jan 18, 2025

ashwinb Jan 22, 2025

ashwinb left a comment

terrytangyuan left a comment

Add vLLM raw completions API #823

Are you sure you want to change the base?

Add vLLM raw completions API #823

Conversation

aidando73 commented Jan 18, 2025 • edited Loading

What does this PR do?

Test Plan

Before submitting

aidando73 commented Jan 18, 2025

ashwinb Jan 22, 2025

Choose a reason for hiding this comment

ashwinb left a comment

Choose a reason for hiding this comment

terrytangyuan left a comment

Choose a reason for hiding this comment

aidando73 commented Jan 18, 2025 •

edited

Loading