
LoRA Adapters for vLLM & support for s3, gs, oss for pulling adapters and models (to cache) from buckets #304

Merged: 34 commits merged into main from the lora-adapters branch on Nov 24, 2024

Conversation

@nstogner (Contributor) commented Nov 4, 2024

  • Add .spec.adapters
  • Support requesting adapters using the pattern {"model": "<model>_<adapter>", ...} (see the sketch after this list)
  • Load LoRA adapters into running vLLM containers
  • Support updating LoRA adapters without needing to restart vLLM
  • Rewrite .model to the adapter name in the chat request body when proxying to vLLM
  • Add adapters to the model list
  • Add support for s3://, gs://, and oss:// URLs (for adapters and cache loading)
  • Add new cloud credentials to support the new URL schemes
  • Update docs
  • Update Model validation
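
To make the bullets above concrete, here is a minimal sketch (see also the diagrams). Only the adapters / id / url fields come directly from this PR; the model name, adapter ID, bucket path, and the remaining Model fields are illustrative and follow the existing examples in the repo:

```yaml
# Hypothetical Model spec. The .spec.adapters list with id/url is what this PR
# adds; everything else (names, base model, s3:// path) is made up for illustration.
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: tinyllama-chat
spec:
  features: [TextGeneration]
  engine: VLLM
  url: hf://TinyLlama/TinyLlama-1.1B-Chat-v1.0
  adapters:
    - id: colorist
      url: s3://my-bucket/adapters/tinyllama-colorist-lora
```

A chat request would then address the adapter as `<model>_<adapter>`, and the proxy rewrites .model to the adapter before forwarding to vLLM:

```bash
# Hypothetical request; host and endpoint path assume KubeAI's in-cluster
# OpenAI-compatible proxy and should be adjusted for your setup.
curl http://kubeai/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tinyllama-chat_colorist", "messages": [{"role": "user", "content": "What color is the sky?"}]}'
```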

NOTE:

  • Was unable to test oss:// URLs (had issues opening an account).

FOLLOWUP:

  • Need to add adapter e2e tests (have not found a small enough model with adapters for use in a kind cluster)
  • Need to update chart values.yaml to include the GH-actions-built image for the model loader after merge!

Fixes #132, #303

@nstogner requested a review from @samos123 on November 4, 2024 14:40
@alpe (Contributor) commented Nov 5, 2024

Nice drawing. This is very helpful! 🌻
I am not super familiar with LoRA adapters, but from what I have seen they can be quite large, so caching seems like a good idea. For the non-cache scenario, I would suggest a no-cache or container-managed profile so that skipping the cache does not look like the default.
With on-demand LoRA, disk size may become a problem at some point. This is out of scope here, but a purge job or a retention time may eventually need to be configurable in the profile.

@samos123 (Contributor) commented Nov 5, 2024

Can you show an example that has the url field? I'm assuming the url field must be used to specify the base model?

@nstogner (Contributor, Author) commented Nov 5, 2024

@samos123 I currently have all examples in the diagrams

@samos123 (Contributor) commented Nov 5, 2024

That's where I looked but none of them have the base model URL set?

@nstogner (Contributor, Author) commented Nov 8, 2024

Model .spec.url would be the same as normal.

@nstogner (Contributor, Author) commented Nov 8, 2024

Note, it looks like vLLM supports loading adapters from huggingface: vllm-project/vllm#6234
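
For context, a rough sketch of what registering a Hugging Face adapter at startup looks like when running vLLM directly. The flags are vLLM's standard LoRA options; the base model is assumed, and the adapter is just the one referenced elsewhere in this PR:

```bash
# Sketch: start vLLM with LoRA enabled and an adapter registered by name.
# With the linked change, the adapter path can be a Hugging Face repo ID
# instead of a pre-downloaded local directory.
vllm serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --enable-lora \
  --lora-modules colorist=jashing/tinyllama-colorist-lora
```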

@nstogner (Contributor, Author) commented Nov 8, 2024

Note, vLLM has an endpoint to support dynamic loading/unloading of adapters: vllm-project/vllm#6566
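
Roughly, that works like this (a sketch based on the linked vLLM PR; the server has to be started with runtime LoRA updating allowed, and the adapter name and path below are placeholders):

```bash
# Sketch: dynamically register a LoRA adapter with a running vLLM server.
# Assumes the server runs with VLLM_ALLOW_RUNTIME_LORA_UPDATING=True.
curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "colorist", "lora_path": "/adapters/colorist"}'

# ...and unregister it again:
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "colorist"}'
```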

#url: hf://meta-llama/Llama-2-7b
adapters:
- id: test
  url: hf://jashing/tinyllama-colorist-lora
A reviewer (Contributor) asked about the adapter snippet above:
does vLLM support directly loading this adapter from HF or is it a hard requirement to download the lora adapter first?

@nstogner (Contributor, Author) replied:

vLLM can load it from HF but not S3

@nstogner changed the title from "WIP: LoRA Adapters" to "WIP: LoRA Adapters for vLLM" on Nov 13, 2024
@nstogner changed the title from "WIP: LoRA Adapters for vLLM" to "LoRA Adapters for vLLM" on Nov 19, 2024
@nstogner changed the title from "LoRA Adapters for vLLM" to "LoRA Adapters for vLLM + more URLs" on Nov 22, 2024
@nstogner changed the title from "LoRA Adapters for vLLM + more URLs" to "LoRA Adapters for vLLM & support for pulling models and adapters from buckets" on Nov 22, 2024
@nstogner changed the title from "LoRA Adapters for vLLM & support for pulling models and adapters from buckets" to "LoRA Adapters for vLLM & support for s3, gs, oss for pulling adapters and models (to cache) from buckets" on Nov 24, 2024
@samos123 (Contributor) commented:

Not sure whether this should work or not. So sharing what I did and the error message I get:

git checkout $THIS_PR
helm install kubeai ./charts/kubeai

k get pods
NAME                         READY   STATUS             RESTARTS     AGE
kubeai-6bf98d5b77-txmpb      0/1     CrashLoopBackOff   1 (2s ago)   5s
openwebui-69ffb7dbb4-xcvlj   0/1     Running            0            5s

k logs -f kubeai-6bf98d5b77-txmpb 
2024-11-24T06:37:29Z    INFO    manager run finished
2024-11-24T06:37:29Z    ERROR   manager failed to run command   {"error": "invalid config: Key: 'System.ModelLoaders' Error:Field validation for 'ModelLoaders' failed on the 'required' tag"}
main.main
        /workspace/cmd/main.go:50
runtime.main
        /usr/local/go/src/runtime/proc.go:271

I was trying to play around with the branch and see how the helm validation worked.

@samos123 (Contributor) left a review

I didn't finish reviewing everything. Leaving partial feedback; I will continue tomorrow.

Resolved review threads:
  • api/v1/model_types.go
  • internal/apiutils/requests.go
  • internal/modelcontroller/engine_vllm.go
  • internal/modelcontroller/adapters.go
  • internal/modelcontroller/pod_utils.go (outdated)
  • internal/modelcontroller/engine_vllm.go
@nstogner (Contributor, Author) commented Nov 24, 2024

> Not sure whether this should work or not. So sharing what I did and the error message I get:
>
> I was trying to play around with the branch and see how the helm validation worked.

Here is what I used to make sure that the most recent image was running:

gcloud container clusters create-auto cluster-1 \
    --location=us-central1

skaffold run -f ./skaffold.yaml --profile kubeai-only-gke --default-repo us-central1-docker.pkg.dev/substratus-dev

@nstogner merged commit b12f811 into main on Nov 24, 2024 (11 checks passed)
@nstogner deleted the lora-adapters branch on November 24, 2024 at 18:21
@alpe (Contributor) commented Nov 26, 2024

nice work!

Successfully merging this pull request may close the following issue: Support dynamic LoRA serving.

3 participants