Issue in Lora Adapter Test #347

Open
shawnho1018 opened this issue Dec 11, 2024 · 6 comments

@shawnho1018

shawnho1018 commented Dec 11, 2024

I followed this tutorial to deploy a TinyLlama base model with the colorist adapter. However, after the deployment, I found the following error in kubectl logs -f. Everything else seems to be working.

INFO:     192.168.16.2:44452 - "POST /v1/load_lora_adapter HTTP/1.1" 400 Bad Request
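
For reference, the failing call is vLLM's dynamic LoRA-load endpoint, which KubeAI hits when it attaches the adapter. A rough sketch of an equivalent manual request (the pod address and adapter path are assumptions, based on where KubeAI downloads adapters):

# body fields per vLLM's dynamic LoRA API: adapter name + local path on the pod
curl -X POST http://<vllm-pod-ip>:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "colorist", "lora_path": "/adapters/colorist"}'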

Both containers run successfully. However, when testing from the UI, if I choose TinyLlamaModel instead of colorist, I get the following error.

ERROR 12-11 01:45:12 serving_chat.py:170] Error in preprocessing prompt inputs
ERROR 12-11 01:45:12 serving_chat.py:170] Traceback (most recent call last):
ERROR 12-11 01:45:12 serving_chat.py:170]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 155, in create_chat_completion
ERROR 12-11 01:45:12 serving_chat.py:170]     ) = await self._preprocess_chat(
ERROR 12-11 01:45:12 serving_chat.py:170]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-11 01:45:12 serving_chat.py:170]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_engine.py", line 464, in _preprocess_chat
ERROR 12-11 01:45:12 serving_chat.py:170]     request_prompt = apply_hf_chat_template(
/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py:171: RuntimeWarning: coroutine 'AsyncMultiModalItemTracker.all_mm_data' was never awaited
  return self.create_error_response(str(e))
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
ERROR 12-11 01:45:12 serving_chat.py:170]                      ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 12-11 01:45:12 serving_chat.py:170]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 736, in apply_hf_chat_template
ERROR 12-11 01:45:12 serving_chat.py:170]     raise ValueError(
ERROR 12-11 01:45:12 serving_chat.py:170] ValueError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
INFO:     192.168.16.2:41630 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

If I choose colorist, the prompt freezes but there is no log output.
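
(For reference, the UI request boils down to a plain chat completion against KubeAI's OpenAI-compatible endpoint; the service hostname below is a guess, and the adapter variant is whatever name shows up in the UI, e.g. the -colorist entry:)

curl -s http://kubeai/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tinyllama-chat", "messages": [{"role": "user", "content": "give me a warm color"}]}'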


My YAML is shown below:

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: tinyllama-chat
spec:
  features: [TextGeneration]
  owner: meta-llama
  url: hf://TinyLlama/TinyLlama-1.1B-Chat-v0.3
  adapters:
  - name: colorist
    url: hf://mychen76/tinyllama-colorist-v2
  engine: VLLM
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 1
@nstogner
Contributor

Will look to reproduce this

@samos123
Contributor

samos123 commented Dec 11, 2024

It seems the newer version of TinyLlama does have a chat template in tokenizer_config.json: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/blob/main/tokenizer_config.json#L29

However, the v0.3 version is missing the chat template: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.3/blob/main/tokenizer_config.json
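
A quick way to double-check this for any base model (plain curl + jq against the raw files):

curl -s https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/raw/main/tokenizer_config.json | jq 'has("chat_template")'
# true
curl -s https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.3/raw/main/tokenizer_config.json | jq 'has("chat_template")'
# false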

The easiest approach for now would be to switch to a base model that has a chat_template specified.

I suspect this would just work:

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: tinyllama-chat
spec:
  features: [TextGeneration]
  owner: meta-llama
  url: hf://TinyLlama/TinyLlama-1.1B-Chat-v1.0
  adapters:
  - name: colorist
    url: hf://BOT365/my-tinyllama-colorist-v1
  engine: VLLM
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 1

Let me see if we already have an open issue for the ability to provide a chat template by passing a ConfigMap or simply inline with the model definition.

Edit: here is the issue: #243
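
For context, vLLM itself already accepts a chat template at startup (either a file path or the template inline), so #243 would presumably just plumb that through the Model definition. A rough sketch of the vLLM side only, with a made-up template file name:

# --chat-template takes a Jinja template file (or inline template string)
python -m vllm.entrypoints.openai.api_server \
  --model TinyLlama/TinyLlama-1.1B-Chat-v0.3 \
  --chat-template ./tinyllama-chat.jinja \
  --enable-lora \
  --lora-modules colorist=/adapters/colorist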

@shawnho1018
Author

hf://BOT365/my-tinyllama-colorist-v1

I tested the new YAML. I think the TinyLlama layer worked, but the colorist adapter still seems problematic.
There is one 400 response:

INFO:     192.168.16.2:53676 - "POST /v1/load_lora_adapter HTTP/1.1" 200 OK
INFO:     192.168.16.2:39432 - "POST /v1/load_lora_adapter HTTP/1.1" 400 Bad Request

If I specify the base model in Open WebUI, the chatbot works. If I specify -colorist, it responds with nothing and gets stuck.

INFO 12-11 09:06:39 metrics.py:449] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
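
One way to narrow this down is to check whether the adapter actually registered with vLLM by querying the pod's /v1/models endpoint directly (pod name is a placeholder; loaded LoRA adapters should show up alongside the base model):

kubectl port-forward pod/<model-pod-name> 8000:8000
curl -s localhost:8000/v1/models | jq '.data[].id'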

@samos123
Contributor

samos123 commented Dec 12, 2024

Summary: It seems KubeAI doesn't always correctly update the adapters for an endpoint. Restarting KubeAI works around the issue. Next step: figure out why endpoints aren't being updated correctly. Shouldn't the reconcile loop be triggered after the pod labels are updated?

I was able to reproduce this. It seems the request is never passed from KubeAI to the vLLM backend. I see this in the KubeAI logs:

2024-12-12T05:15:40Z    INFO    Reconciling Model       {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"tinyllama-chat","namespace":"default"}, "namespace": "default", "name": "tinyllama-chat", "reconcileID": "6ef32200-1e66-4ea5-ba0a-84826cfa1887"}
2024-12-12T05:15:40Z    INFO    Reconciling Model       {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"tinyllama-chat","namespace":"default"}, "namespace": "default", "name": "tinyllama-chat", "reconcileID": "769aea64-9b9e-40f5-b6a3-a5aa5e4d371c"}
2024/12/12 05:15:46 url: /v1/chat/completions
2024/12/12 05:15:46 model: tinyllama-chat adapter: colorist
2024/12/12 05:15:46 Waiting for host: 84c50901-f060-4e61-84dc-672ec0ad9317

2024/12/12 05:31:37 url: /v1/chat/completions
2024/12/12 05:31:37 model: tinyllama-chat adapter: colorist
2024/12/12 05:31:37 Waiting for host: 9b7df8d2-8dfb-4f96-9530-06fa6b0d4510
2024/12/12 05:31:47 Is leader, autoscaling                                                                               
2024/12/12 05:31:47 Aggregating metrics from KubeAI addresses [10.92.128.7:8080]         
2024/12/12 05:31:47 No metrics found for model "tinyllama-chat", skipping
2024/12/12 05:31:55 url: /v1/chat/completions                                                                            
2024/12/12 05:31:55 model: tinyllama-chat adapter:
2024/12/12 05:31:55 Waiting for host: a2b452a9-1c27-490a-a4c6-a3ba82a3ed15
2024/12/12 05:31:55 Proxying request to ip 10.92.128.133:8000: a2b452a9-1c27-490a-a4c6-a3ba82a3ed15

Even though the host is already up and running, it seems KubeAI is waiting for the host?

Some more relevant KubeAI logs:

2024-12-12T05:15:37Z    INFO    Reconciling Model       {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"tinyllama-chat","namespace":"default"}, "namespace": "default", "name": "tinyllama-chat", "reconcileID": "d14bacb1-4355-4f30-8755-5bcde0f4977f"}
+ src=hf://BOT365/my-tinyllama-colorist-v1
+ dest=/adapters/colorist
+ dest_type=
+ [[ /adapters/colorist == *\:\/\/* ]]
+ dir=/adapters/colorist
+ dest_type=dir
+ mkdir -p /adapters/colorist
+ case $src in
+ repo=BOT365/my-tinyllama-colorist-v1
+ huggingface-cli download --local-dir /adapters/colorist BOT365/my-tinyllama-colorist-v1
2024/12/12 05:15:37 Is leader, autoscaling
2024/12/12 05:15:37 Aggregating metrics from KubeAI addresses [10.92.128.7:8080]
2024/12/12 05:15:37 No metrics found for model "tinyllama-chat", skipping
Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]Downloading 'adapter_config.json' to '/adapters/colorist/.cache/huggingface/download/adapter_config.json.007830e6fe44cbdd5557345bc359ae80ef42e3ed.incomplete'
Downloading 'special_tokens_map.json' to '/adapters/colorist/.cache/huggingface/download/special_tokens_map.json.72ecfeeb7e14d244c936169d2ed139eeae235ef1.incomplete'
Download complete. Moving file to /adapters/colorist/adapter_config.json
Downloading 'README.md' to '/adapters/colorist/.cache/huggingface/download/README.md.d4d126660d6d189acfa5639407beed226a8a29e5.incomplete'
Download complete. Moving file to /adapters/colorist/special_tokens_map.json
Downloading 'tokenizer.json' to '/adapters/colorist/.cache/huggingface/download/tokenizer.json.6696342520883a3b5c01adf8c5ca8c5fc7ab4f17.incomplete'
Downloading 'tokenizer_config.json' to '/adapters/colorist/.cache/huggingface/download/tokenizer_config.json.fa96b85858f4053b0142a18e2b09dbe94e3fae46.incomplete'
Downloading '.gitattributes' to '/adapters/colorist/.cache/huggingface/download/.gitattributes.a6344aac8c09253b3b630fb776ae94478aa0275b.incomplete'
Download complete. Moving file to /adapters/colorist/README.md
Download complete. Moving file to /adapters/colorist/tokenizer_config.json
Download complete. Moving file to /adapters/colorist/.gitattributes
Fetching 11 files:   9%|▉         | 1/11 [00:00<00:04,  2.36it/s]Download complete. Moving file to /adapters/colorist/tokenizer.json
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 23.56it/s]
/adapters/colorist
+ rm -rf /adapters/colorist/.cache
+ [[ dir == \u\r\l ]]
2024-12-12T05:15:38Z    ERROR   Reconciler error        {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"tinyllama-chat","namespace":"default"}, "namespace": "default", "name": "tinyllama-chat", "reconcileID": "d14bacb1-4355-4f30-8755-5bcde0f4977f", "error": "reconciling adapters: update pod labels for pod \"default/model-tinyllama-chat-5bc59449b6-8sjv5\": update pod labels: Operation cannot be fulfilled on pods \"model-tinyllama-chat-5bc59449b6-8sjv5\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222

Eventually the adapter seems to load correctly though.

Checking the pod directly, I do see the adapter label on the pod:

  labels:
    adapter.kubeai.org/colorist: 578b9cf9
    app: model
    app.kubernetes.io/instance: vllm-tinyllama-chat
    app.kubernetes.io/managed-by: kubeai
    app.kubernetes.io/name: vllm
    model: tinyllama-chat
    pod-hash: 5bc59449b6

@nstogner
Contributor

Did you see if the label was present on the Pod before the restart?

@samos123
Contributor

Yes, it was already there before the restart.
