Add AI model demo Helm chart and Rancher prime installation script #29

Open
wants to merge 17 commits into base: develop
16 changes: 16 additions & 0 deletions assets/fleet/clustergroup.yaml
@@ -0,0 +1,16 @@
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
name: build-a-dino
annotations:
{}
# key: string
labels:
{}
# key: string
namespace: fleet-default
spec:
selector:
matchLabels:
gpu-enabled: 'true'
app: build-a-dino
24 changes: 24 additions & 0 deletions assets/fleet/gitrepo.yaml
@@ -0,0 +1,24 @@
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
name: build-a-dino
annotations:
{}
# key: string
labels:
{}
# key: string
namespace: fleet-default
spec:
branch: main
correctDrift:
enabled: true
# force: boolean
# keepFailHistory: boolean
insecureSkipTLSVerify: false
paths:
- /fleet/build-a-dino
# - string
repo: https://github.com/wiredquill/prime-rodeo
targets:
- clusterGroup: build-a-dino
48 changes: 48 additions & 0 deletions assets/monitors/certificate-expiration.yaml
@@ -0,0 +1,48 @@
nodes:
- _type: Monitor
arguments:
criticalThreshold: 1w
deviatingThreshold: 30d
query: type = "secret" AND label = "secret-type:certificate"
resourceName: Certificate
timestampProperty: certificateExpiration
description: Verify certificates that are close to their expiration date
function: {{ get "urn:stackpack:common:monitor-function:topology-timestamp-threshold-monitor" }}
id: -12
identifier: urn:custom:monitor:certificate-expiration-v2
intervalSeconds: 30
name: Certificate Expiration V2
remediationHint: |

Certificate expiration date `\{{certificateExpiration\}}`.

1. **Obtain new TLS certificates**:

If you're using a Certificate Authority (CA) or a third-party provider, follow their procedures to obtain a new TLS certificate.
Once validated, download the new TLS certificate and the corresponding private key from the third-party provider's dashboard or via their API.
When you have downloaded these two files, you can update the Secret with the new certificate and key data.

```
kubectl create secret tls \{{name\}} --cert=path/to/new/certificate.crt --key=path/to/new/private.key
```

2. **Generate new self-signed certificates**:

If you're using self-signed certificates, you can generate new ones locally and update the Secret with the new certificate and key data.
Use tools like OpenSSL to generate new self-signed certificates.

```
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout path/to/new/private.key -out path/to/new/certificate.crt
```

Update the Secret with the new certificate and key data.

```
kubectl create secret tls \{{name\}} --cert=path/to/new/certificate.crt --key=path/to/new/private.key
```

Alternatively, you can edit the existing secret with **`kubectl edit secret \{{name\}}`** and replace the certificate and key data with the new ones obtained from the third-party provider or generated locally.
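
If the Secret already exists, a `kubectl create ... | kubectl apply` round trip avoids editing the base64 data by hand (a sketch; the certificate and key paths are placeholders):

```
kubectl create secret tls \{{name\}} --cert=path/to/new/certificate.crt --key=path/to/new/private.key --dry-run=client -o yaml | kubectl apply -f -
```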
status: ENABLED
tags:
- certificate
- secret
81 changes: 81 additions & 0 deletions assets/monitors/http-error-ratio-for-service.yaml
@@ -0,0 +1,81 @@
_version: 1.0.85
nodes:
- _type: Monitor
arguments:
deviatingThreshold: 0.05
loggingLevel: WARN
timeWindow: 2 minutes
description: |-
HTTP responses with a status code in the 5xx range indicate server-side errors such as a misconfiguration, overload or internal server errors.
To ensure a good user experience, the percentage of 5xx responses should be less than the configured percentage (5% is the default) of the total HTTP responses for a Kubernetes (K8s) service.
To understand the full monitor definition, check the details.
Because the exact threshold and severity might be application dependent, the thresholds can be overridden via a Kubernetes annotation on the service. For example, to override the pre-configured deviating threshold and instead only have a critical threshold at 6%, put this annotation on your service:
```
monitor.kubernetes-v2.stackstate.io/http-error-ratio-for-service: |
{
"criticalThreshold": 0.06,
"deviatingThreshold": null
}
```
Omitting the deviating threshold from this JSON snippet would keep it at the configured 5%; with the critical threshold at 6%, the monitor would then report a deviating state only for an error ratio between 5% and 6%.
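The same override can be applied with `kubectl annotate`, for example (a sketch; the service and namespace names are placeholders):
```
kubectl annotate service <service-name> -n <namespace> \
  'monitor.kubernetes-v2.stackstate.io/http-error-ratio-for-service={"criticalThreshold": 0.06, "deviatingThreshold": null}' \
  --overwrite
```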
function: {{ get "urn:stackpack:prime-kubernetes:shared:monitor-function:http-error-ratio-for-service" }}
id: -8
identifier: urn:stackpack:custom:shared:monitor:http-error-ratio-for-service-v2
intervalSeconds: 10
name: HTTP - 5xx error ratio
remediationHint: |-
We have detected that more than 5% of the total responses from your Kubernetes service have a 5xx status code.
This signals that a significant number of users are experiencing downtime and service interruptions.
Take the following steps to diagnose the problem:

## Possible causes
- Slow dependency or dependency serving errors
- Recent update of the application
- Load on the application has increased
- Code has memory leaks
- Environment issues (e.g. certain nodes, database or services that the service depends on)

### Slow dependency or dependency serving errors
Check the related health violations of this monitor (available in the expanded view if you are reading this in the pinned, minimised version) to see whether there are health violations on any of the services or pods that this service depends on (focus on the lowest dependency). If you find a violation (deviating or critical health), click on that component to see its related health violations in the table next to it. You can then click on those health violations and follow the instructions to resolve the issue.

### New behavior of the service
If there are no dependencies that have health violations, it could be that the pod backing this service is returning errors. If this behavior is new, it could be caused by a recent deployment.

This can be checked by looking at the Events shown on the [service highlights page](/#/components/\{{ componentUrnForUrl \}}/highlights) and checking whether a `Deployment` event happened recently after which the HTTP Error ratio behaviour changed.

To troubleshoot further, you can have a look at the pod(s) backing this service.
- Click on the "Pods of this service" in the "Related resource" section of the [service highlight page](/#/components/\{{ componentUrnForUrl \}})
- Click on the pod name(s) to go to their highlights pages
- Check the logs of the pod(s) to see if they're returning any errors.
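
For example, a quick way to scan the most recent pod logs for errors (a sketch; pod and namespace names are placeholders):

```
kubectl logs <pod-name> -n <namespace> --since=15m | grep -iE "error|exception"
```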

### Recent update of the service
Check if the service was recently updated:
- Check the Age in the "About" section of the [service highlight page](/#/components/\{{ componentUrnForUrl \}})
  to identify whether it was recently deployed.
- Check if any of the pods were recently updated by clicking on "Pods of this service" in the "Related resource" section of
  the [service highlight page](/#/components/\{{ componentUrnForUrl \}}) and looking at whether their Age is recent.
- If the application has just started, it might be that the service has not warmed up yet. Compare the response time metrics
  for the current deployment with the previous deployment by checking the response time metric chart with a time interval including both.
- Check if the application is using more resources than before; if so, consider scaling it up or giving it more resources.
- If the increased error ratio is critical, consider rolling back the service to the previous version (see the sketch below):
  - if that helps, then the issue is likely with the new deployment
  - if that does not help, then the issue may be in the environment (e.g. network issues or issues with the underlying infrastructure, such as a database)
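
A minimal rollback sketch, assuming the service is backed by a Deployment (deployment and namespace names are placeholders):

```
# Inspect the revision history of the Deployment backing the service
kubectl rollout history deployment/<deployment-name> -n <namespace>

# Roll back to the previous revision
kubectl rollout undo deployment/<deployment-name> -n <namespace>
```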
### Load on the service has increased
- Check if the number of requests to the service has increased by looking at the "Throughput (HTTP responses/s)" chart for the "HTTP response metrics for all clients (incoming requests)" on the [service highlight page](/#/components/\{{ componentUrnForUrl \}}).
  If so, consider scaling up the service or giving it more resources (see the sketch below).
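
A minimal scaling sketch, assuming the service is backed by a Deployment (names and replica count are placeholders):

```
# Increase the number of replicas serving the traffic
kubectl scale deployment/<deployment-name> -n <namespace> --replicas=3
```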
### Code has memory leaks
- Check if memory or CPU usage has been increasing over time. If so, there might be a memory leak.
  You can find the pods backing this service by clicking on "Pods of this service" in the "Related resource"
  section of the [service highlight page](/#/components/\{{ componentUrnForUrl \}}).
- Check which pods are using the most memory by clicking on "Pods of this service" on the left side of the [service highlight page](/#/components/\{{ componentUrnForUrl \}}).
- Inspect each of these pods by clicking on the pod name.
- Check the resource usage in the "Resource usage" section.
- Restart the pod(s) of this service that are having the issue, or add more memory/CPU (see the sketch below).
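
A minimal restart sketch, assuming the pods are managed by a Deployment (names are placeholders):

```
# Recreate the pods of the Deployment one by one
kubectl rollout restart deployment/<deployment-name> -n <namespace>
```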
### Environment issues
- Check the latency of individual pods of the service. If only certain pods are having issues, there might be an issue with the node the pod is running on:
  - Try to move the pod to another node.
  - Check if pods of other services on that node also show increased latency. If that is the case, drain the node (see the sketch below).
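
A minimal sketch for taking a suspect node out of rotation (the node name is a placeholder):

```
# Stop new pods from being scheduled on the node
kubectl cordon <node-name>

# Evict the pods currently running on the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```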
status: ENABLED
tags:
- services
timestamp: 2025-01-16T13:16:53.208687Z[Etc/UTC]
81 changes: 81 additions & 0 deletions assets/monitors/out-of-memory-containers.yaml
@@ -0,0 +1,81 @@
nodes:
- _type: Monitor
arguments:
comparator: GTE
failureState: DEVIATING
metric:
aliasTemplate: OOM Killed count
query: max(increase(kubernetes_containers_last_state_terminated{reason="OOMKilled"}[10m]))
by (cluster_name, namespace, pod_name, container)
unit: short
threshold: 1.0
urnTemplate: urn:kubernetes:/${cluster_name}:${namespace}:pod/${pod_name}
description: |-
It is important to ensure that the containers running in your Kubernetes cluster have enough memory to function properly. Out of memory (OOM) conditions can cause containers to crash or become unresponsive, leading to restarts and potential data loss.
To monitor for these conditions, we set up a check that detects and reports OOM events in the containers running in the cluster. This check will help you identify any containers that are running out of memory and allow you to take action to prevent issues before they occur.
To understand the full monitor definition, check the details.
function: {{ get "urn:stackpack:common:monitor-function:threshold" }}
id: -13
identifier: urn:custom:monitor:out-of-memory-containers-v2
intervalSeconds: 30
name: Out of memory for containers V2
remediationHint: |-
An Out of Memory (OOM) event in Kubernetes occurs when a container's memory usage exceeds the limit set for it.
The Linux kernel's OOM killer process is triggered, which attempts to free up memory by killing one or more processes.
This can cause the container to terminate, leading to issues such as lost data, service interruption, and increased
resource usage.

Check the container [Logs](/#/components/\{{ componentUrnForUrl \}}#logs) for any hints on how the application is behaving.
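
To confirm that a container was terminated by the OOM killer, you can also check its last state (a sketch; pod and namespace names are placeholders):

```
# The "Last State" of an OOM-killed container shows Reason: OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"
```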

### Recognize a memory leak

A memory leak can be recognized by looking at the "Memory Usage" metric on the [pod metrics page](/#/components/\{{ componentUrnForUrl \}}/metrics).

If the metric resembles a `saw-tooth` pattern, that is a clear indication of a slow memory leak in your application.
The memory usage increases over time, but the memory is not released until the container is restarted.

If the metric resembles a `dash` pattern, that is an indication of a memory leak caused by a spike:
the memory usage suddenly increases, which causes the limit to be exceeded and the container to be killed.

You will notice that the container continually restarts.

Common issues that can cause this problem include:
1. New deployments that introduce a memory leak.
2. Elevated traffic that causes a temporary increase of memory usage.
3. Incorrectly configured memory limits.

### 1. New deployments that introduce a memory leak

If the memory leak behaviour is new, it is likely that a new deployment introduced a memory leak.

This can be checked by looking at the Events shown on the [pod highlights page](/#/components/\{{ componentUrnForUrl \}}/highlights) and checking whether a `Deployment` event happened recently after which the memory usage behaviour changed.

If the memory leak is caused by a deployment, you can investigate which change led to the memory leak by checking the [Show last change](/#/components/\{{ componentUrnForUrl \}}#lastChange), which will highlight the latest changeset for the deployment. You can then revert the change or fix the memory leak.

### 2. Elevated traffic that causes a temporary increase of memory usage
This can be checked by looking at the "Network Throughput for pods (received)" metric on the [pod metrics page](/#/components/\{{ componentUrnForUrl \}}/metrics) and comparing the usage to the "Memory Usage" metric. If the memory usage increases at the same time as the network throughput, it is likely that the memory usage is caused by the increased traffic.

As a temporary fix, you can raise the memory limit for the container (see the sketch below). However, this is not a long-term solution, as the memory usage will likely increase again in the future. You can also consider using the Kubernetes autoscaling features to scale the number of replicas up and down based on resource usage.
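
A minimal sketch for raising the memory limit, assuming the container is managed by a Deployment (names and sizes are placeholders):

```
kubectl set resources deployment <deployment-name> -n <namespace> -c <container-name> --limits=memory=3Gi --requests=memory=2Gi
```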

### 3. Incorrectly configured memory limits
This can be checked by looking at the "Memory Usage" metric on the [pod metrics page](/#/components/\{{ componentUrnForUrl \}}/metrics) and comparing the usage to the requests and limits set for the pod. If the memory usage is higher than the limit set for the pod, the container will be terminated by the OOM killer.

To fix this issue, you can increase the memory limit for the pod by changing the Kubernetes resource YAML and increasing the memory limit values, e.g.
```
spec:
  containers:
    - name: app   # container name is illustrative
      resources:
        limits:
          cpu: "2"
          memory: "3Gi"
        requests:
          cpu: "2"
          memory: "3Gi"
```
status: ENABLED
tags:
- containers
- pods
85 changes: 85 additions & 0 deletions assets/monitors/pod-cpu-throttling.yaml
@@ -0,0 +1,85 @@
nodes:
- _type: Monitor
arguments:
comparator: GT
failureState: DEVIATING
metric:
aliasTemplate: CPU Throttling for ${container} of ${pod_name}
query: 100 * sum by (cluster_name, namespace, pod_name, container) (container_cpu_throttled_periods{})
/ sum by (cluster_name, namespace, pod_name, container) (container_cpu_elapsed_periods{})
unit: percent
threshold: 95.0
urnTemplate: urn:kubernetes:/${cluster_name}:${namespace}:pod/${pod_name}
description: |-
In Kubernetes, CPU throttling refers to the process where limits are applied to the amount of CPU resources a container can use.
This typically occurs when a container approaches the maximum CPU resources allocated to it, causing the system to throttle or restrict
its CPU usage to prevent a crash.

While CPU throttling can help maintain system stability by avoiding crashes due to CPU exhaustion, it can also significantly slow down workload
performance. Ideally, CPU throttling should be avoided by ensuring that containers have access to sufficient CPU resources.
This proactive approach helps maintain optimal performance and prevents the slowdown associated with throttling.
function: {{ get "urn:stackpack:common:monitor-function:threshold" }}
id: -13
identifier: urn:custom:monitor:pod-cpu-throttling-v2
intervalSeconds: 60
name: CPU Throttling V2
remediationHint: |-

### Application behaviour

Check the container [Logs](/#/components/\{{ componentUrnForUrl \}}#logs) for any hints on how the application is behaving under CPU throttling.

### Understanding CPU Usage and CPU Throttling

On the [pod metrics page](/#/components/\{{ componentUrnForUrl \}}/metrics) you will find the CPU Usage and CPU Throttling charts.

#### CPU Throttling

The percentage of CPU throttling over time. CPU throttling occurs when a container reaches its CPU limit, restricting its CPU usage to
prevent it from exceeding the specified limit. The higher the percentage, the more throttling is occurring, which means the container's
performance is being constrained.

#### CPU Usage

This chart shows three key CPU metrics over time:

1. Request: The amount of CPU the container requests as its minimum requirement. This sets the baseline CPU resources the container is guaranteed to receive.
2. Limit: The maximum amount of CPU the container can use. If the container's usage reaches this limit, throttling will occur.
3. Current: The actual CPU usage of the container in real-time.

The `Request` and `Limit` settings for the container can be seen in the `Resource` section of the [configuration](/#/components/\{{ componentUrnForUrl \}}#configuration).
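
You can also read the configured requests and limits directly from the pod spec (a sketch; pod and namespace names are placeholders):

```
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
```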

#### Correlation

The two charts are correlated in the following way:

- As the `Current` CPU usage approaches the CPU `Limit`, the CPU throttling percentage increases. This is because the container tries to use more CPU than it is allowed, and the system restricts it, causing throttling.
- The aim is to keep the `Current` usage below the `Limit` to minimize throttling. If you see frequent high percentages in the CPU throttling chart, it suggests that you may need to adjust the CPU limits or optimize the container's workload to reduce CPU demand.


### Adjust CPU Requests and Limits

Check the Events shown on the [pod highlights page](/#/components/\{{ componentUrnForUrl \}}/highlights) to see whether a `Deployment` event happened recently after which the CPU usage behaviour changed.

You can investigate which change led to the CPU throttling by checking the [Show last change](/#/components/\{{ componentUrnForUrl \}}#lastChange),
which will highlight the latest changeset for the deployment. You can then revert the change or fix the CPU request and limit.


Review the pod's resource requests and limits to ensure they are set appropriately.
Show component [configuration](/#/components/\{{ componentUrnForUrl \}}#configuration)

If the CPU usage consistently hits the limit, consider increasing the CPU limit of the pod. <br/>
Edit the pod or deployment configuration file to modify the `resources.limits.cpu` and `resources.requests.cpu` as needed.
```
resources:
requests:
cpu: "500m" # Adjust this value based on analysis
limits:
cpu: "1" # Adjust this value based on analysis
```
If CPU throttling persists, consider horizontal pod autoscaling to distribute the workload across more pods, or adjust the cluster's node resources to meet the demands. Continuously monitor and fine-tune resource settings to optimize performance and prevent further throttling issues.
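
A minimal autoscaling sketch, assuming the pods are managed by a Deployment (names and thresholds are placeholders):

```
# Scale between 2 and 5 replicas, targeting 80% average CPU utilisation
kubectl autoscale deployment <deployment-name> -n <namespace> --cpu-percent=80 --min=2 --max=5
```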
status: ENABLED
tags:
- cpu
- performance
- pod