
KubeAI stuck when configuring model caching with a missing StorageClass #286

Open · samos123 opened this issue on Oct 24, 2024 · 1 comment

samos123 (Contributor) commented on Oct 24, 2024:

I tried creating a Model with a cacheProfile whose storageClassName points to a StorageClass that doesn't exist in the cluster.
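For reference, this is the relevant cacheProfiles entry, unescaped from the config dump in the logs below (efs-sc is the StorageClass that is missing):

cacheProfiles:
  efs-dynamic:
    sharedFilesystem:
      storageClassName: efs-sc   # this StorageClass does not exist in the cluster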

Steps to reproduce:

  1. Create a Model that references a cache profile whose StorageClass doesn't exist
  2. Delete that Model

Expected result: The Model gets cleaned up automatically.

Current result: The Model is stuck on its kubeai.org/cache-eviction finalizer, and the evict-cache and load-cache pods stay Pending forever.

Pods:

NAME                                             READY   STATUS    RESTARTS   AGE
evict-cache-llama-3.1-8b-instruct-fp8-l4-ghlmq   0/1     Pending   0          70s
kubeai-794576b9f-jt5p5                           1/1     Running   0          106s
load-cache-llama-3.1-8b-instruct-fp8-l4-nzpdm    0/1     Pending   0          87s
openwebui-69ffb7dbb4-hb2lk                       1/1     Running   0          106s
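The cache pods are presumably Pending because their PVC can never bind without the efs-sc StorageClass. I didn't capture the describe output, but something like the following should confirm that (assumed diagnostic, not from the original report):

# should return "NotFound" since the StorageClass was never created
kubectl get storageclass efs-sc

# events should show the pod blocked on an unbound PersistentVolumeClaim
kubectl describe pod load-cache-llama-3.1-8b-instruct-fp8-l4-nzpdm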

Model spec:

apiVersion: kubeai.org/v1
kind: Model
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeai.org/v1","kind":"Model","metadata":{"annotations":{},"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"},"spec":{"args":["--max-model-len=16384","--max-num-batched-token=16384","--gpu-memory-utilization=0.9","--disable-log-requests"],"cacheProfile":"efs-dynamic","engine":"VLLM","features":["TextGeneration"],"minReplicas":1,"owner":"neuralmagic","resourceProfile":"nvidia-gpu-l4:1","url":"hf://neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8"}}
  creationTimestamp: "2024-10-24T14:00:13Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-10-24T14:00:30Z"
  finalizers:
  - kubeai.org/cache-eviction
  generation: 3
  labels:
    features.kubeai.org/TextGeneration: "true"
  name: llama-3.1-8b-instruct-fp8-l4
  namespace: default
  resourceVersion: "7101"
  uid: 44ecf5d1-7437-43d4-ad66-6646963eab4a
spec:
  args:
  - --max-model-len=16384
  - --max-num-batched-token=16384
  - --gpu-memory-utilization=0.9
  - --disable-log-requests
  cacheProfile: efs-dynamic
  engine: VLLM
  features:
  - TextGeneration
  minReplicas: 1
  owner: neuralmagic
  replicas: 1
  resourceProfile: nvidia-gpu-l4:1
  scaleDownDelaySeconds: 30
  targetRequests: 100
  url: hf://neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
status:
  cache:
    loaded: false
  replicas:
    all: 0
    ready: 0

KubeAI controller logs:

2024-10-24T13:59:57Z    INFO    manager loaded config   {"config": "allowPodAddressOverride: false\ncacheProfiles:\n  efs-dynamic:\n    sharedFilesystem:\n      storageClassName: efs-sc\n  efs-static:\n    sharedFilesystem:\n      persistentVolumeName: efs-pv\nhealthAddress: :8081\nleaderElection:\n  leaseDuration: 15s\n  renewDeadline: 10s\n  retryPeriod: 2s\nmessaging:\n  errorMaxBackoff: 30s\n  streams: []\nmetricsAddr: :8080\nmodelAutoscaling:\n  interval: 10s\n  stateConfigMapName: kubeai-autoscaler-state\n  timeWindow: 10m0s\nmodelLoaders:\n  huggingface:\n    image: substratusai/huggingface-model-loader:v0.9.0\nmodelRollouts:\n  surge: 1\nmodelServerPods:\n  securityContext:\n    allowPrivilegeEscalation: false\n    capabilities:\n      drop:\n      - ALL\n    readOnlyRootFilesystem: false\n    runAsUser: 0\n  serviceAccountName: kubeai-models\nmodelServers:\n  FasterWhisper:\n    images:\n      default: fedirz/faster-whisper-server:latest-cpu\n      nvidia-gpu: fedirz/faster-whisper-server:latest-cuda\n  Infinity:\n    images:\n      default: michaelf34/infinity:latest\n  OLlama:\n    images:\n      default: ollama/ollama:latest\n  VLLM:\n    images:\n      cpu: substratusai/vllm:v0.6.3.post1-cpu\n      default: vllm/vllm-openai:v0.6.3.post1\n      google-tpu: substratusai/vllm:v0.6.3.post1-tpu\nresourceProfiles:\n  cpu:\n    imageName: cpu\n    requests:\n      cpu: \"1\"\n      memory: 2Gi\n  nvidia-gpu-a100-40gb:\n    imageName: nvidia-gpu\n    limits:\n      nvidia.com/gpu: \"1\"\n    nodeSelector:\n      node.kubernetes.io/instance-type: p4de.24xlarge\n    tolerations:\n    - effect: NoSchedule\n      key: nvidia.com/gpu\n      operator: Equal\n      value: present\n  nvidia-gpu-a100-80gb:\n    imageName: nvidia-gpu\n    limits:\n      nvidia.com/gpu: \"1\"\n    nodeSelector:\n      node.kubernetes.io/instance-type: p4d.24xlarge\n    tolerations:\n    - effect: NoSchedule\n      key: nvidia.com/gpu\n      operator: Equal\n      value: present\n  nvidia-gpu-h100:\n    imageName: nvidia-gpu\n    limits:\n      nvidia.com/gpu: \"1\"\n    nodeSelector:\n      node.kubernetes.io/instance-type: p5.48xlarge\n    tolerations:\n    - effect: NoSchedule\n      key: nvidia.com/gpu\n      operator: Equal\n      value: present\n  nvidia-gpu-l4:\n    imageName: nvidia-gpu\n    limits:\n      nvidia.com/gpu: \"1\"\n    nodeSelector:\n      karpenter.k8s.aws/instance-gpu-name: l4\n    requests:\n      cpu: \"6\"\n      memory: 24Gi\n      nvidia.com/gpu: \"1\"\n    tolerations:\n    - effect: NoSchedule\n      key: nvidia.com/gpu\n      operator: Equal\n      value: present\n  nvidia-gpu-l40s:\n    imageName: \"\"\n    nodeSelector:\n      karpenter.k8s.aws/instance-gpu-name: l40s\nsecretNames:\n  huggingface: kubeai-huggingface\n"}
2024/10/24 13:59:57 Autoscaler state ConfigMap "models" has no key "default/kubeai-autoscaler-state", state not loaded
2024/10/24 13:59:57 Loaded last state of models: 0 total, last calculated on 0001-01-01 00:00:00 +0000 UTC
2024-10-24T13:59:57Z    INFO    manager starting controller-manager
2024-10-24T13:59:57Z    INFO    manager run launched all goroutines
2024-10-24T13:59:57Z    INFO    starting server {"name": "health probe", "addr": "[::]:8081"}
2024-10-24T13:59:57Z    INFO    manager starting api server     {"addr": ":8000"}
2024-10-24T13:59:57Z    INFO    manager starting metrics server {"addr": ":8080"}
2024-10-24T13:59:57Z    INFO    manager starting leader election
I1024 13:59:57.951075       1 leaderelection.go:250] attempting to acquire leader lease default/kubeai.org...
2024-10-24T13:59:57Z    INFO    Starting EventSource    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "source": "kind source: *v1.Pod"}
2024-10-24T13:59:57Z    INFO    Starting Controller     {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod"}
I1024 13:59:57.951759       1 leaderelection.go:250] attempting to acquire leader lease default/cc6bca10.substratus.ai...
I1024 13:59:57.964735       1 leaderelection.go:260] successfully acquired lease default/kubeai.org
2024/10/24 13:59:57 "kubeai-794576b9f-jt5p5" started leading
I1024 13:59:57.965962       1 leaderelection.go:260] successfully acquired lease default/cc6bca10.substratus.ai
2024-10-24T13:59:57Z    DEBUG   events  kubeai-794576b9f-jt5p5_8fb2aee3-97c9-4306-a7c4-314f0889f83e became leader       {"type": "Normal", "object": {"kind":"Lease","namespace":"default","name":"cc6bca10.substratus.ai","uid":"44c2a5a3-e00b-47bb-aecc-18dff348a894","apiVersion":"coordination.k8s.io/v1","resourceVersion":"6934"}, "reason": "LeaderElection"}
2024-10-24T13:59:57Z    INFO    Starting EventSource    {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "source": "kind source: *v1.Model"}
2024-10-24T13:59:57Z    INFO    Starting EventSource    {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "source": "kind source: *v1.Pod"}
2024-10-24T13:59:57Z    INFO    Starting EventSource    {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "source": "kind source: *v1.PersistentVolumeClaim"}
2024-10-24T13:59:57Z    INFO    Starting EventSource    {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "source": "kind source: *v1.Job"}
2024-10-24T13:59:57Z    INFO    Starting Controller     {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model"}
2024-10-24T13:59:58Z    INFO    Starting workers        {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "worker count": 1}
2024-10-24T13:59:58Z    INFO    Starting workers        {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "worker count": 1}
2024/10/24 14:00:07 Is leader, autoscaling
2024/10/24 14:00:07 Aggregating metrics from KubeAI addresses [192.168.67.206:8080]
2024-10-24T14:00:13Z    INFO    Reconciling Model       {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "c383c9be-7ee8-40fe-949a-b19aa94704e1"}
2024-10-24T14:00:13Z    INFO    KubeAPIWarningLogger    metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
2024-10-24T14:00:13Z    INFO    Reconciling Model       {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "0d012363-27c3-41ab-befc-53a559a6ee21"}
2024-10-24T14:00:13Z    INFO    Reconciling Model       {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "e580de7b-d4d9-4fcc-a0eb-ddb631d5ae80"}
2024-10-24T14:00:13Z    INFO    Reconciling Model       {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "61c706dc-6e2e-4d97-83ff-600e80f45efa"}
2024/10/24 14:00:17 Is leader, autoscaling
2024/10/24 14:00:17 Aggregating metrics from KubeAI addresses [192.168.67.206:8080]
2024/10/24 14:00:17 No metrics found for model "llama-3.1-8b-instruct-fp8-l4", skipping
2024/10/24 14:00:27 Is leader, autoscaling
2024/10/24 14:00:27 Aggregating metrics from KubeAI addresses [192.168.67.206:8080]
2024/10/24 14:00:27 No metrics found for model "llama-3.1-8b-instruct-fp8-l4", skipping
2024-10-24T14:00:30Z    INFO    Reconciling Model       {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "cce6ae02-b08c-4dd5-b13f-72be8916d22e"}
2024-10-24T14:00:30Z    INFO    Reconciling Model       {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "cfcf5f39-8516-4522-917a-84d884869f98"}
2024-10-24T14:00:30Z    INFO    Reconciling Model       {"controller": "model", "controllerGroup": "kubeai.org", "controllerKind": "Model", "Model": {"name":"llama-3.1-8b-instruct-fp8-l4","namespace":"default"}, "namespace": "default", "name": "llama-3.1-8b-instruct-fp8-l4", "reconcileID": "9a7498a8-09a8-4a02-8703-6941edcf64fc"}

samos123 (Contributor, Author) commented:

Workaround: remove the finalizer from the Model object, for example as sketched below.
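One way to clear it (untested sketch; assumes the Model CRD is addressable as models.kubeai.org, which matches the kubeai.org/v1 apiVersion above):

kubectl patch models.kubeai.org llama-3.1-8b-instruct-fp8-l4 \
  --type=merge -p '{"metadata":{"finalizers":null}}'

After the finalizer is removed, the pending deletion from the deletionTimestamp proceeds and the Model object goes away; the stuck load-cache/evict-cache pods may still need to be deleted separately.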
