I tried creating a Model with a `cacheProfile` whose StorageClass doesn't exist in the cluster. Deleting the Model afterwards never completes.
Steps to reproduce (commands are sketched below the results):
1. Configure a cacheProfile whose `sharedFilesystem.storageClassName` references a StorageClass that does not exist in the cluster (here `efs-dynamic`, which points at the missing `efs-sc`).
2. Create a Model that uses that cacheProfile (spec below).
3. Delete the Model.
Expected result: The model gets cleaned up automatically.
Current result: The Model is stuck on its `kubeai.org/cache-eviction` finalizer, and the evict-cache and load-cache pods stay Pending forever.
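A minimal reproduction sketch, assuming the `efs-dynamic` cacheProfile from the config in the logs below (its `storageClassName: efs-sc` was never created) and a hypothetical `model.yaml` containing the spec shown under "Model spec":

```bash
# 1. Confirm the StorageClass referenced by the cacheProfile is missing.
kubectl get storageclass efs-sc   # expect: Error from server (NotFound)

# 2. Create the Model (model.yaml is a placeholder for the spec below).
kubectl apply -f model.yaml

# 3. Delete it; the delete hangs on the cache-eviction finalizer.
kubectl delete models.kubeai.org llama-3.1-8b-instruct-fp8-l4 --wait=false
kubectl get models.kubeai.org llama-3.1-8b-instruct-fp8-l4 -o yaml
# deletionTimestamp is set, but kubeai.org/cache-eviction remains in metadata.finalizers
```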
Pods:
```
NAME                                             READY   STATUS    RESTARTS   AGE
evict-cache-llama-3.1-8b-instruct-fp8-l4-ghlmq   0/1     Pending   0          70s
kubeai-794576b9f-jt5p5                           1/1     Running   0          106s
load-cache-llama-3.1-8b-instruct-fp8-l4-nzpdm    0/1     Pending   0          87s
openwebui-69ffb7dbb4-hb2lk                       1/1     Running   0          106s
```
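The cache pods presumably stay Pending because the cache PVC can never bind without its StorageClass. One way to confirm (the cache PVC's exact name is not shown in this report, so `<cache-pvc>` is a placeholder):

```bash
# Scheduling events on the stuck cache-loader pod usually point at an
# unbound PersistentVolumeClaim.
kubectl describe pod load-cache-llama-3.1-8b-instruct-fp8-l4-nzpdm

# The PVC itself should also be Pending, with events referencing the
# missing StorageClass "efs-sc".
kubectl get pvc
kubectl describe pvc <cache-pvc>
```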
Model spec:
```yaml
apiVersion: kubeai.org/v1
kind: Model
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeai.org/v1","kind":"Model","metadata":{"annotations":{},"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"},"spec":{"args":["--max-model-len=16384","--max-num-batched-token=16384","--gpu-memory-utilization=0.9","--disable-log-requests"],"cacheProfile":"efs-dynamic","engine":"VLLM","features":["TextGeneration"],"minReplicas":1,"owner":"neuralmagic","resourceProfile":"nvidia-gpu-l4:1","url":"hf://neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8"}}
  creationTimestamp: "2024-10-24T14:00:13Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-10-24T14:00:30Z"
  finalizers:
  - kubeai.org/cache-eviction
  generation: 3
  labels:
    features.kubeai.org/TextGeneration: "true"
  name: llama-3.1-8b-instruct-fp8-l4
  namespace: default
  resourceVersion: "7101"
  uid: 44ecf5d1-7437-43d4-ad66-6646963eab4a
spec:
  args:
  - --max-model-len=16384
  - --max-num-batched-token=16384
  - --gpu-memory-utilization=0.9
  - --disable-log-requests
  cacheProfile: efs-dynamic
  engine: VLLM
  features:
  - TextGeneration
  minReplicas: 1
  owner: neuralmagic
  replicas: 1
  resourceProfile: nvidia-gpu-l4:1
  scaleDownDelaySeconds: 30
  targetRequests: 100
  url: hf://neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
status:
  cache:
    loaded: false
  replicas:
    all: 0
    ready: 0
```
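Note that `deletionTimestamp` is set while `kubeai.org/cache-eviction` is still listed under `finalizers`; that combination is what keeps the object alive. A quick way to check just those two fields (a jsonpath sketch):

```bash
kubectl get models.kubeai.org llama-3.1-8b-instruct-fp8-l4 \
  -o jsonpath='{.metadata.deletionTimestamp}{"\t"}{.metadata.finalizers}{"\n"}'
```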
Logs in KubeAI:
```
2024-10-24T13:59:57Z INFO manager loaded config {"config": "allowPodAddressOverride: false\ncacheProfiles:\n efs-dynamic:\n sharedFilesystem:\n storageClassName: efs-sc\n efs-static:\n sharedFilesystem:\n persistentVolumeName: efs-pv\nhealthAddress: :8081\nleaderElection:\n leaseDuration: 15s\n renewDeadline: 10s\n retryPeriod: 2s\nmessaging:\n errorMaxBackoff: 30s\n streams: []\nmetricsAddr: :8080\nmodelAutoscaling:\n interval: 10s\n stateConfigMapName: kubeai-autoscaler-state\n timeWindow: 10m0s\nmodelLoaders:\n huggingface:\n image: substratusai/huggingface-model-loader:v0.9.0\nmodelRollouts:\n surge: 1\nmodelServerPods:\n securityContext:\n allowPrivilegeEscalation: false\n capabilities:\n drop:\n - ALL\n readOnlyRootFilesystem: false\n runAsUser: 0\n serviceAccountName: kubeai-models\nmodelServers:\n FasterWhisper:\n images:\n default: fedirz/faster-whisper-server:latest-cpu\n nvidia-gpu: fedirz/faster-whisper-server:latest-cuda\n Infinity:\n images:\n default: michaelf34/infinity:latest\n OLlama:\n images:\n default: ollama/ollama:latest\n VLLM:\n images:\n cpu: substratusai/vllm:v0.6.3.post1-cpu\n default: vllm/vllm-openai:v0.6.3.post1\n google-tpu: substratusai/vllm:v0.6.3.post1-tpu\nresourceProfiles:\n cpu:\n imageName: cpu\n requests:\n cpu: \"1\"\n memory: 2Gi\n nvidia-gpu-a100-40gb:\n imageName: nvidia-gpu\n limits:\n nvidia.com/gpu: \"1\"\n nodeSelector:\n node.kubernetes.io/instance-type: p4de.24xlarge\n tolerations:\n - effect: NoSchedule\n key: nvidia.com/gpu\n operator: Equal\n value: present\n nvidia-gpu-a100-80gb:\n imageName: nvidia-gpu\n limits:\n nvidia.com/gpu: \"1\"\n nodeSelector:\n node.kubernetes.io/instance-type: p4d.24xlarge\n tolerations:\n - effect: NoSchedule\n key: nvidia.com/gpu\n operator: Equal\n value: present\n nvidia-gpu-h100:\n imageName: nvidia-gpu\n limits:\n nvidia.com/gpu: \"1\"\n nodeSelector:\n node.kubernetes.io/instance-type: p5.48xlarge\n tolerations:\n - effect: NoSchedule\n key: nvidia.com/gpu\n operator: Equal\n value: present\n nvidia-gpu-l4:\n imageName: nvidia-gpu\n limits:\n nvidia.com/gpu: \"1\"\n nodeSelector:\n karpenter.k8s.aws/instance-gpu-name: l4\n requests:\n cpu: \"6\"\n memory: 24Gi\n nvidia.com/gpu: \"1\"\n tolerations:\n - effect: NoSchedule\n key: nvidia.com/gpu\n operator: Equal\n value: present\n nvidia-gpu-l40s:\n imageName: \"\"\n nodeSelector:\n karpenter.k8s.aws/instance-gpu-name: l40s\nsecretNames:\n huggingface: kubeai-huggingface\n"}
2024/10/24 13:59:57 Autoscaler state ConfigMap "models" has no key "default/kubeai-autoscaler-state", state not loaded
2024/10/24 13:59:57 Loaded last state of models: 0 total, last calculated on 0001-01-01 00:00:00 +0000 UTC
2024-10-24T13:59:57Z INFO manager starting controller-manager
2024-10-24T13:59:57Z INFO manager run launched all goroutines
2024-10-24T13:59:57Z INFO starting server {"name": "health probe", "addr": "[::]:8081"}
2024-10-24T13:59:57Z INFO manager starting api server {"addr": ":8000"}
2024-10-24T13:59:57Z INFO manager starting metrics server {"addr": ":8080"}
2024-10-24T13:59:57Z INFO manager starting leader election
I1024 13:59:57.951075 1 leaderelection.go:250] attempting to acquire leader lease default/kubeai.org...
2024-10-24T13:59:57Z INFO Starting EventSource {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "source": "kind source: *v1.Pod"}
2024-10-24T13:59:57Z INFO Starting Controller {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod"}
I1024 13:59:57.951759 1 leaderelection.go:250] attempting to acquire leader lease default/cc6bca10.substratus.ai...
I1024 13:59:57.964735 1 leaderelection.go:260] successfully acquired lease default/kubeai.org
2024/10/24 13:59:57 "kubeai-794576b9f-jt5p5" started leading
I1024 13:59:57.965962 1 leaderelection.go:260] successfully acquired lease default/cc6bca10.substratus.ai
2024-10-24T13:59:57Z DEBUG events kubeai-794576b9f-jt5p5_8fb2aee3-97c9-4306-a7c4-314f0889f83e became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"default","name":"cc6bca10.substratus.ai","uid":"44c2a5a3-e00b-47bb-aecc-18dff348a894","apiVersion":"coordination.k8s.io/v1","resourceVersion":"6934"}, "reason": "LeaderElection"}
2024-10-24T13:59:57Z INFO Starting EventSource {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "source": "kind source: *v1.Model"}
2024-10-24T13:59:57Z INFO Starting EventSource {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "source": "kind source: *v1.Pod"}
2024-10-24T13:59:57Z INFO Starting EventSource {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "source": "kind source: *v1.PersistentVolumeClaim"}
2024-10-24T13:59:57Z INFO Starting EventSource {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "source": "kind source: *v1.Job"}
2024-10-24T13:59:57Z INFO Starting Controller {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model"}
2024-10-24T13:59:58Z INFO Starting workers {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "worker count": 1}
2024-10-24T13:59:58Z INFO Starting workers {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "worker count": 1}
2024/10/24 14:00:07 Is leader, autoscaling
2024/10/24 14:00:07 Aggregating metrics from KubeAI addresses [192.168.67.206:8080]
2024-10-24T14:00:13Z INFO Reconciling Model {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "c383c9be-7ee8-40fe-949a-b19aa94704e1"}
2024-10-24T14:00:13Z INFO KubeAPIWarningLogger metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-10-24T14:00:13Z INFO Reconciling Model {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "0d012363-27c3-41ab-befc-53a559a6ee21"}
2024-10-24T14:00:13Z INFO Reconciling Model {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "e580de7b-d4d9-4fcc-a0eb-ddb631d5ae80"}
2024-10-24T14:00:13Z INFO Reconciling Model {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "61c706dc-6e2e-4d97-83ff-600e80f45efa"}
2024/10/24 14:00:17 Is leader, autoscaling
2024/10/24 14:00:17 Aggregating metrics from KubeAI addresses [192.168.67.206:8080]
2024/10/24 14:00:17 No metrics found for model "llama-3.1-8b-instruct-fp8-l4", skipping
2024/10/24 14:00:27 Is leader, autoscaling
2024/10/24 14:00:27 Aggregating metrics from KubeAI addresses [192.168.67.206:8080]
2024/10/24 14:00:27 No metrics found for model "llama-3.1-8b-instruct-fp8-l4", skipping
2024-10-24T14:00:30Z INFO Reconciling Model {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "cce6ae02-b08c-4dd5-b13f-72be8916d22e"}
2024-10-24T14:00:30Z INFO Reconciling Model {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "cfcf5f39-8516-4522-917a-84d884869f98"}
2024-10-24T14:00:30Z INFO Reconciling Model {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "9a7498a8-09a8-4a02-8703-6941edcf64fc"}
```
Workaround: manually remove the `kubeai.org/cache-eviction` finalizer from the Model object (sketch below).
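A sketch of that workaround; this clears all finalizers on the object, which is fine here since `kubeai.org/cache-eviction` is the only one:

```bash
# Clear metadata.finalizers so the pending deletion can complete.
kubectl patch models.kubeai.org llama-3.1-8b-instruct-fp8-l4 \
  --type merge -p '{"metadata":{"finalizers":null}}'
```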