Cerebras Inference Integration #265

henrytwo · 2024-10-17T23:24:03Z

Adding Cerebras Inference as an API provider.

Testing

Conda

$ llama stack build --template cerebras --image-type conda
$ llama stack run ~/.llama/distributions/llamastack-cerebras/cerebras-run.yaml
...
Listening on ['::', '0.0.0.0']:5000
INFO:     Started server process [12443]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)

Chat Completion

$ curl --location 'http://localhost:5000/alpha/inference/chat-completion' --header 'Content-Type: application/json' --data '{
    "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": "What is the temperature in Seattle right now?"
        }
    ],
    "stream": false,
    "sampling_params": {
        "strategy": "top_p",
        "temperature": 0.5,
        "max_tokens": 100
    },
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "tools": [
        {
            "tool_name": "getTemperature",
            "description": "Gets the current temperature of a location.",
            "parameters": {
                "location": {
                    "param_type": "string",
                    "description": "The name of the place to get the temperature from in degrees Celsius.",
                    "required": true
                }
            }
        }
    ]
}'
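
For anyone who would rather script this than use curl, here is a rough Python equivalent of the request above. It is only a sketch: the helper name `build_chat_request` is made up for illustration, it uses the standard library rather than any Llama Stack client, and it only builds the request so it can be inspected without a running server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model_id: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the same chat-completion request as the curl call."""
    payload = {
        "model_id": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "sampling_params": {"strategy": "top_p", "temperature": 0.5, "max_tokens": 100},
        "tool_choice": "auto",
        "tool_prompt_format": "json",
        "tools": [
            {
                "tool_name": "getTemperature",
                "description": "Gets the current temperature of a location.",
                "parameters": {
                    "location": {
                        "param_type": "string",
                        "description": "The name of the place to get the temperature from in degrees Celsius.",
                        "required": True,
                    }
                },
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/alpha/inference/chat-completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:5000",
    "meta-llama/Llama-3.1-8B-Instruct",
    "What is the temperature in Seattle right now?",
)

# Sending it requires the server from the Conda step to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```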

Non-Streaming Response

{
  "completion_message": {
    "role": "assistant",
    "content": "",
    "stop_reason": "end_of_message",
    "tool_calls": [
      {
        "call_id": "6f42fdcc-6cbb-46ad-a17b-5d20ac64b678",
        "tool_name": "getTemperature",
        "arguments": {
          "location": "Seattle"
        }
      }
    ]
  },
  "logprobs": null
}
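
To make the shape of that response concrete, here is a small sketch of pulling the tool call out of it. The function name `extract_tool_calls` is hypothetical and the field names are taken only from the response shown above, so treat it as an illustration rather than a client API.

```python
import json

# Response body copied from the non-streaming example above.
RESPONSE = """
{
  "completion_message": {
    "role": "assistant",
    "content": "",
    "stop_reason": "end_of_message",
    "tool_calls": [
      {
        "call_id": "6f42fdcc-6cbb-46ad-a17b-5d20ac64b678",
        "tool_name": "getTemperature",
        "arguments": {"location": "Seattle"}
      }
    ]
  },
  "logprobs": null
}
"""


def extract_tool_calls(body: str) -> list:
    """Return (tool_name, arguments) pairs from a chat-completion response body."""
    message = json.loads(body)["completion_message"]
    return [(call["tool_name"], call["arguments"]) for call in message.get("tool_calls", [])]


print(extract_tool_calls(RESPONSE))
# → [('getTemperature', {'location': 'Seattle'})]
```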

Streaming Response

data: {"event":{"event_type":"start","delta":"","logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"","parse_status":"started"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"{\"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"type","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"function","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\",","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"name","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"get","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"Temperature","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\",","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"parameters","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" {\"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"location","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"Seattle","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":"\"}}","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}
data: {"event":{"event_type":"progress","delta":{"content":{"call_id":"e742df1f-0ae9-40ad-a49e-18e5c905484f","tool_name":"getTemperature","arguments":{"location":"Seattle"}},"parse_status":"success"},"logprobs":null,"stop_reason":"end_of_message"}}
data: {"event":{"event_type":"complete","delta":"","logprobs":null,"stop_reason":"end_of_message"}}
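
As a reading aid for the stream above: the `in_progress` events carry raw text fragments, and the final `success` event carries the parsed tool call. Below is a minimal sketch of consuming such a stream, assuming each line is a `data: ` prefixed JSON event exactly as printed. The helper name `collect_chat_stream` is hypothetical and the sample is abbreviated to a few of the events.

```python
import json

# Abbreviated sample of the events shown above.
STREAM = [
    'data: {"event":{"event_type":"start","delta":"","logprobs":null,"stop_reason":null}}',
    'data: {"event":{"event_type":"progress","delta":{"content":"","parse_status":"started"},"logprobs":null,"stop_reason":null}}',
    'data: {"event":{"event_type":"progress","delta":{"content":"get","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}',
    'data: {"event":{"event_type":"progress","delta":{"content":"Temperature","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}}',
    'data: {"event":{"event_type":"progress","delta":{"content":{"call_id":"e742df1f-0ae9-40ad-a49e-18e5c905484f","tool_name":"getTemperature","arguments":{"location":"Seattle"}},"parse_status":"success"},"logprobs":null,"stop_reason":"end_of_message"}}',
    'data: {"event":{"event_type":"complete","delta":"","logprobs":null,"stop_reason":"end_of_message"}}',
]


def collect_chat_stream(lines):
    """Return (raw_text, parsed_tool_call, stop_reason) from chat-completion stream lines."""
    text_parts = []
    tool_call = None
    stop_reason = None
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore blank lines or anything that is not an event
        event = json.loads(line[len("data: "):])["event"]
        delta = event["delta"]
        if isinstance(delta, dict):
            content = delta["content"]
            if isinstance(content, str):
                text_parts.append(content)  # raw fragment of the tool-call text
            elif delta.get("parse_status") == "success":
                tool_call = content  # fully parsed tool call
        if event["stop_reason"] is not None:
            stop_reason = event["stop_reason"]
    return "".join(text_parts), tool_call, stop_reason


raw_text, tool_call, stop_reason = collect_chat_stream(STREAM)
print(raw_text, tool_call["tool_name"], stop_reason)
# → getTemperature getTemperature end_of_message
```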

Completion

$ curl --location 'http://localhost:5000/alpha/inference/completion' --header 'Content-Type: application/json' --data '{
    "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "content": "1,2,3,",
    "stream": true,
    "sampling_params": {
        "strategy": "top_p",
        "temperature": 0.5,
        "max_tokens": 10
    },
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "tools": [
        {
            "tool_name": "getTemperature",
            "description": "Gets the current temperature of a location.",
            "parameters": {
                "location": {
                    "param_type": "string",
                    "description": "The name of the place to get the temperature from in degrees Celsius.",
                    "required": true
                }
            }
        }
    ]
}'

Non-Streaming Response

{
  "content": "4,5,6,7,8,",
  "stop_reason": "out_of_tokens",
  "logprobs": null
}

Streaming Response

data: {"delta":"4","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"5","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"6","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"7","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"8","stop_reason":null,"logprobs":null}
data: {"delta":",","stop_reason":null,"logprobs":null}
data: {"delta":"","stop_reason":null,"logprobs":null}
data: {"delta":"","stop_reason":"out_of_tokens","logprobs":null}
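
A quick way to check that the streamed chunks match the non-streaming result is to join the deltas. This is a sketch that assumes the `data: ` line format printed above; `join_completion_stream` is a hypothetical helper name.

```python
import json

# Stream lines copied from the streaming example above.
STREAM = [
    'data: {"delta":"4","stop_reason":null,"logprobs":null}',
    'data: {"delta":",","stop_reason":null,"logprobs":null}',
    'data: {"delta":"5","stop_reason":null,"logprobs":null}',
    'data: {"delta":",","stop_reason":null,"logprobs":null}',
    'data: {"delta":"6","stop_reason":null,"logprobs":null}',
    'data: {"delta":",","stop_reason":null,"logprobs":null}',
    'data: {"delta":"7","stop_reason":null,"logprobs":null}',
    'data: {"delta":",","stop_reason":null,"logprobs":null}',
    'data: {"delta":"8","stop_reason":null,"logprobs":null}',
    'data: {"delta":",","stop_reason":null,"logprobs":null}',
    'data: {"delta":"","stop_reason":null,"logprobs":null}',
    'data: {"delta":"","stop_reason":"out_of_tokens","logprobs":null}',
]


def join_completion_stream(lines):
    """Concatenate the deltas and return (text, last non-null stop_reason)."""
    parts = []
    stop_reason = None
    for line in lines:
        if not line.startswith("data: "):
            continue
        chunk = json.loads(line[len("data: "):])
        parts.append(chunk["delta"])
        if chunk["stop_reason"] is not None:
            stop_reason = chunk["stop_reason"]
    return "".join(parts), stop_reason


print(join_completion_stream(STREAM))
# → ('4,5,6,7,8,', 'out_of_tokens')
```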

Pre-Commit Checks

trim trailing whitespace.................................................Passed
check python ast.........................................................Passed
check for merge conflicts................................................Passed
check for added large files..............................................Passed
fix end of files.........................................................Passed
Insert license in comments...............................................Passed
flake8...................................................................Passed
Format files with µfmt...................................................Passed

Testing with `test_text_inference.py`

$ export CEREBRAS_API_KEY=<insert API key here>
$ pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -m "cerebras and llama_8b" 
/net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=================================================== test session starts ===================================================
platform linux -- Python 3.12.3, pytest-8.3.3, pluggy-1.5.0 -- /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/bin/python3.12
cachedir: .pytest_cache
rootdir: /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, asyncio-0.24.0
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 128 items / 120 deselected / 8 selected                                                                         

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_8b-cerebras] Resolved 4 providers
 inner-inference => cerebras
 models => __routing_table__
 inference => __autorouted__
 inspect => __builtin__

Models: meta-llama/Llama-3.1-8B-Instruct served by cerebras

PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_8b-cerebras] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-cerebras] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_8b-cerebras] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_8b-cerebras] PASSED

================================ 6 passed, 2 skipped, 120 deselected, 6 warnings in 3.95s =================================

I also ran `python llama_stack/scripts/distro_codegen.py` to regenerate the distro codegen output.

henrytwo · 2024-10-18T01:40:33Z

One thing I am unsure about is whether "Agent" support is available out of the box just from implementing "Inference". I did notice other API vendors advertising on README.md that they have Agent support, but I could not find the corresponding implementation in code.

henrytwo · 2024-10-21T13:48:44Z

@ashwinb friendly bump on this PR :) Please allow the CI to run for this PR

ashwinb · 2024-10-21T21:06:48Z

> I did notice other API vendors advertising on README.md that they have Agent support

Which other vendors? If you mean Fireworks and Together, that is because they both have Llama Stack distribution endpoints, so they make the entirety of the Llama Stack APIs available on their end. That includes Agents, Memory, etc.

henrytwo · 2024-10-21T21:23:51Z

> > I did notice other API vendors advertising on README.md that they have Agent support
>
> Which other vendors? If you mean Fireworks and Together, that is because they both have Llama Stack distribution endpoints, so they make the entirety of the Llama Stack APIs available on their end. That includes Agents, Memory, etc.

Ah I see, I didn't know about the distribution endpoints. I'll take out the ✅ in the Agents column.

llama_stack/providers/remote/inference/cerebras/__init__.py

ashwinb · 2024-11-13T19:39:29Z

@henrytwo

Thanks for sharing the instructions for reproducing. Well, here's what the server is returning:

<|python_tag|>{"type": "function", "name": "get_weather", "parameters": {"location": "San Francisco, CA"}}<|eom_id|><|start_header_id|>assistant<|end_header_id|>

<|python_tag|>{"type": "function", "name": "get_weather", "parameters": {"location": "San Francisco, CA"}}<|eom_id|><|start_header_id|>assistant<|end_header_id|>

<|python_tag|><|python_tag|>{"type": "function", "name": "get_weather", "parameters": {"location": "San Francisco, CA"}}<|eom_id|><|start_header_id|>assistant<|end_header_id|>

This is clearly a malformed message. Why is it doing that? Because you aren't stopping generation on <|eom_id|>, which should be a stop token.
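
To illustrate what stopping on <|eom_id|> would change, here is a small sketch that cuts one of the outputs above at the first stop token. It is a client-side illustration only, not the change made in this PR: the real fix belongs in the generation request itself, and the helper name `truncate_at_stop_token` and the `STOP_TOKENS` tuple are assumptions made for the example.

```python
# Assumed stop tokens for illustration; <|eom_id|> is the one named in the comment above.
STOP_TOKENS = ("<|eom_id|>", "<|eot_id|>")


def truncate_at_stop_token(text: str, stop_tokens=STOP_TOKENS) -> str:
    """Return the text up to (not including) the earliest stop token, if any."""
    cut = len(text)
    for token in stop_tokens:
        index = text.find(token)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


raw = (
    '<|python_tag|>{"type": "function", "name": "get_weather", '
    '"parameters": {"location": "San Francisco, CA"}}'
    "<|eom_id|><|start_header_id|>assistant<|end_header_id|>"
)

print(truncate_at_stop_token(raw))
# → <|python_tag|>{"type": "function", "name": "get_weather", "parameters": {"location": "San Francisco, CA"}}
```

With generation halted at the stop token, the trailing `<|start_header_id|>assistant<|end_header_id|>` header and the repeated tool calls that follow it would never be produced.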

distributions/cerebras/build.yaml

ashwinb

This looks good, happy to get this in. Could you just fix the conflicts with the documentation files now?

henrytwo · 2024-11-25T16:21:46Z

@ashwinb please have another look. I just rebased to latest main.

Also, this might be a bug, but it seems that test_text_inference.py doesn't run any tests on latest main, even with the example command from the docs:

llama-stack (main) $ pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -m "(fireworks or ollama) and llama_3b"
/net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=================================================================================== test session starts ====================================================================================
platform linux -- Python 3.12.3, pytest-8.3.3, pluggy-1.5.0 -- /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/bin/python3.12
cachedir: .pytest_cache
rootdir: /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, asyncio-0.24.0
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 128 items / 128 deselected / 0 selected                                                                                                                                          

=========================================================================== 128 deselected, 5 warnings in 0.15s ============================================================================

As a result, I haven't been able to re-verify that the E2E tests for this integration still pass.

ashwinb

Let's get this in. It has been forever. Thank you for all the iteration!

ashwinb · 2024-12-04T05:16:09Z

@henrytwo We will check why tests are suddenly not getting picked up.

facebook-github-bot added the CLA Signed label Oct 17, 2024

henrytwo force-pushed the henrytu/cerebras-integration branch from b77aac4 to 5b02f89 October 17, 2024 23:25

henrytwo marked this pull request as ready for review October 17, 2024 23:27

henrytwo requested review from ashwinb, yanxi0830, hardikjshah, dltn and raghotham as code owners October 17, 2024 23:27

henrytwo force-pushed the henrytu/cerebras-integration branch 2 times, most recently from 89bc01a to f07c6f2 November 13, 2024 15:12

ashwinb reviewed Nov 13, 2024

llama_stack/providers/remote/inference/cerebras/__init__.py (outdated, resolved)

ashwinb mentioned this pull request Nov 13, 2024

feat: azure ai inference support #364

ashwinb reviewed Nov 13, 2024

distributions/cerebras/build.yaml (outdated, resolved)

henrytwo force-pushed the henrytu/cerebras-integration branch 5 times, most recently from f536c84 to 9e53ebc November 20, 2024 18:20

henrytwo requested a review from ashwinb November 20, 2024 21:03

ashwinb approved these changes Nov 23, 2024

henrytwo added 4 commits November 25, 2024 07:56

Cerebras Integration

3838bd1

Regenerate distro codegen

5de4c8b

Regenerate distro codegen

db9c28b

Update documentation

659764b

henrytwo force-pushed the henrytu/cerebras-integration branch from 9e53ebc to 659764b November 25, 2024 16:12

Merge branch 'main' into henrytu/cerebras-integration

c29e327

ashwinb approved these changes Dec 4, 2024

ashwinb merged commit 64c6df8 into meta-llama:main Dec 4, 2024
2 checks passed
