Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scheduled delete #3

Merged
merged 38 commits into from
Mar 19, 2024
Merged
Show file tree
Hide file tree
Changes from 22 commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
c672ed0
Add scheduled delete job
JohnGarbutt Mar 12, 2024
116e073
Make the fuctional test fire immediately
JohnGarbutt Mar 12, 2024
157d41b
Make status optional
JohnGarbutt Mar 12, 2024
0ec97b8
Default to an empty status, its not nullable
JohnGarbutt Mar 12, 2024
c67caca
Update schedule crd to report progress
JohnGarbutt Mar 12, 2024
454eb5b
Remove not_before from CRD
JohnGarbutt Mar 12, 2024
9deb2c5
Move functional test to configmap delete
JohnGarbutt Mar 12, 2024
c2bc0dc
Fix up tests
JohnGarbutt Mar 12, 2024
cc894d9
Fix typo in delete and get code
JohnGarbutt Mar 12, 2024
41d60c9
Fix permissions for delete of configmaps
JohnGarbutt Mar 12, 2024
83a120c
Fix timezone aware subtraction issues
JohnGarbutt Mar 12, 2024
a61ca93
Fix format of the status updates
JohnGarbutt Mar 12, 2024
c46d3dd
Add wait for delete in the functional test
JohnGarbutt Mar 12, 2024
f5c0246
Fix whitespace in functional test
JohnGarbutt Mar 12, 2024
2d67379
Remove accidential nullable fields
JohnGarbutt Mar 14, 2024
645a9c2
Move to timer approach
JohnGarbutt Mar 14, 2024
bb88297
Merge remote-tracking branch 'origin/main' into add-schedule-delete
JohnGarbutt Mar 14, 2024
a978ef0
Make the check interval tunable
JohnGarbutt Mar 14, 2024
6b2a32c
Fix a units issue with the check interval
JohnGarbutt Mar 14, 2024
c88bf8a
Attempt to fix up metrics
JohnGarbutt Mar 14, 2024
46f526a
Remove unused config
JohnGarbutt Mar 14, 2024
c396b3b
Add placeholder grafana dashboard
JohnGarbutt Mar 14, 2024
5be6bb4
Update metrics with delete_triggered
JohnGarbutt Mar 15, 2024
0248041
Update prometheusrule.yaml
JohnGarbutt Mar 15, 2024
01502a4
fix typo on deleteTriggered
JohnGarbutt Mar 15, 2024
8818e73
Update prometheusrule.yaml
JohnGarbutt Mar 15, 2024
ce6dedb
Update prometheusrule.yaml
JohnGarbutt Mar 15, 2024
90f36a6
Merge remote-tracking branch 'origin/main' into add-schedule-delete
JohnGarbutt Mar 18, 2024
e99f874
Simplify check_for_delete
JohnGarbutt Mar 18, 2024
0d31e8c
Rename ref_found
JohnGarbutt Mar 18, 2024
6ed346c
Fix metrics status paths
JohnGarbutt Mar 18, 2024
a6cd4ec
Wait for updatedAt to be added
JohnGarbutt Mar 18, 2024
c131dbb
Fix up updatedAt
JohnGarbutt Mar 18, 2024
05f11e9
Add test for get_reference
JohnGarbutt Mar 18, 2024
4c70419
Improve test coverage
JohnGarbutt Mar 18, 2024
f2a2bbc
Make name vs namespace ordering consistent
JohnGarbutt Mar 18, 2024
e8f722e
Fix check for delete
JohnGarbutt Mar 18, 2024
f5e3bc4
Don't call delete mulitple times
JohnGarbutt Mar 18, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 4 additions & 6 deletions azimuth_schedule_operator/models/v1alpha1/schedule.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,18 +10,18 @@ class ScheduleStatus(schema.BaseModel):
ref_found: bool = False
JohnGarbutt marked this conversation as resolved.
Show resolved Hide resolved
# updated when delete has been triggered
delete_triggered: bool = False
JohnGarbutt marked this conversation as resolved.
Show resolved Hide resolved
updated_at: schema.Optional[datetime.datetime] = None
JohnGarbutt marked this conversation as resolved.
Show resolved Hide resolved


class ScheduleRef(schema.BaseModel):
apiVersion: str
api_version: str
kind: str
name: str


class ScheduleSpec(schema.BaseModel):
ref: ScheduleRef
notBefore: datetime.datetime
notAfter: datetime.datetime
not_after: datetime.datetime


class Schedule(
Expand All @@ -44,8 +44,6 @@ def get_fake_dict():
metadata=dict(name="test1", uid="fakeuid1", namespace="ns1"),
spec=dict(
ref=dict(apiVersion="v1", kind="Pod", name="test1"),
notBefore=datetime.datetime.now(),
notAfter=datetime.datetime.now() + datetime.timedelta(days=1),
notAfter=datetime.datetime.now(datetime.timezone.utc),
),
status=dict(ref_found=True, delete_triggered=False),
)
73 changes: 67 additions & 6 deletions azimuth_schedule_operator/operator.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
import asyncio
import datetime
import logging
import os
import sys
Expand All @@ -12,6 +13,8 @@
LOG = logging.getLogger(__name__)
K8S_CLIENT = None

CHECK_INTERVAL_SECONDS = int(os.environ.get("AZIMUTH_SCHEDULE_CHECK_INTERVAL", "120"))
JohnGarbutt marked this conversation as resolved.
Show resolved Hide resolved


@kopf.on.startup()
async def startup(settings, **kwargs):
Expand Down Expand Up @@ -46,15 +49,73 @@ async def startup(settings, **kwargs):


@kopf.on.cleanup()
async def cleanup(**kwargs):
async def cleanup(**_):
if K8S_CLIENT:
await K8S_CLIENT.aclose()
LOG.info("Cleanup complete.")


@kopf.on.create(registry.API_GROUP, "schedule")
@kopf.on.update(registry.API_GROUP, "schedule")
@kopf.on.resume(registry.API_GROUP, "schedule")
async def schedule_changed(body, name, namespace, labels, **kwargs):
async def get_reference(namespace: str, ref: schedule_crd.ScheduleRef):
resource = await K8S_CLIENT.api(ref.api_version).resource(ref.kind)
object = await resource.fetch(ref.name, namespace=namespace)
return object


async def delete_reference(namespace: str, ref: schedule_crd.ScheduleRef):
resource = await K8S_CLIENT.api(ref.api_version).resource(ref.kind)
await resource.delete(ref.name, namespace=namespace)


async def check_for_delete(namespace, schedule: schedule_crd.Schedule):
# TODO(johngarbutt): add some config in helm to set this?
max_delete_duration_int = int(
os.environ.get("AZIMUTH_SCHEDULE_MAX_DELETE_DURATION_MINUTES", "15")
) + int(CHECK_INTERVAL_SECONDS / 60)
max_delete_duration = datetime.timedelta(minutes=max_delete_duration_int)
scheduled_at = schedule.spec.not_after - max_delete_duration
now = datetime.datetime.now(datetime.timezone.utc)

if now >= scheduled_at:
LOG.info(f"Attempting delete for {namespace} and {schedule.metadata.name}.")
await delete_reference(namespace, schedule.spec.ref)
await update_schedule(schedule.metadata.name, namespace, delete_triggered=True)
else:
LOG.info(f"No delete for {namespace} and {schedule.metadata.name}.")
JohnGarbutt marked this conversation as resolved.
Show resolved Hide resolved


async def update_schedule(
name: str,
namespace: str,
ref_found: bool = None,
delete_triggered: bool = None,
):
now = datetime.datetime.now(datetime.timezone.utc)
now_string = now.strftime("%Y-%m-%dT%H:%M:%SZ")
status_updates = dict(updated_at=now_string)
JohnGarbutt marked this conversation as resolved.
Show resolved Hide resolved

if ref_found is not None:
status_updates["refFound"] = ref_found
if delete_triggered is not None:
status_updates["deleteTriggered"] = delete_triggered

status_resource = await K8S_CLIENT.api(registry.API_VERSION).resource(
"schedules/status"
)
LOG.info(f"Updating status for {name} in {namespace} with: {status_updates}")
await status_resource.patch(
name,
dict(status=status_updates),
namespace=namespace,
)


# check every two minutes
@kopf.timer(registry.API_GROUP, "schedule", interval=CHECK_INTERVAL_SECONDS)
async def schedule_check(body, namespace, **_):
schedule = schedule_crd.Schedule(**body)
LOG.error(f"seen schedule changed {schedule.spec}")

if not schedule.status.ref_found:
await get_reference(namespace, schedule.spec.ref)
await update_schedule(schedule.metadata.name, namespace, ref_found=True)

await check_for_delete(namespace, schedule)
10 changes: 5 additions & 5 deletions azimuth_schedule_operator/tests/models/test_crds.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,18 +62,13 @@ def test_cluster_type_crd_json(self):
],
"type": "object"
},
"notBefore": {
"format": "date-time",
"type": "string"
},
"notAfter": {
"format": "date-time",
"type": "string"
}
},
"required": [
"ref",
"notBefore",
"notAfter"
],
"type": "object"
Expand All @@ -85,6 +80,11 @@ def test_cluster_type_crd_json(self):
},
"deleteTriggered": {
"type": "boolean"
},
"updatedAt": {
"format": "date-time",
"nullable": true,
"type": "string"
}
},
"type": "object"
Expand Down
33 changes: 31 additions & 2 deletions azimuth_schedule_operator/tests/test_operator.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,5 +48,34 @@ async def test_cleanup_calls_aclose(self, mock_client):
await operator.cleanup()
mock_client.aclose.assert_awaited_once_with()

async def test_cluster_type_create_success(self):
await operator.schedule_changed(schedule_crd.get_fake_dict(), "type1", "ns", {})
@mock.patch.object(operator, "update_schedule")
@mock.patch.object(operator, "check_for_delete")
@mock.patch.object(operator, "get_reference")
async def test_schedule_check(
self, mock_get_reference, mock_check_for_delete, mock_update_schedule
):
body = schedule_crd.get_fake_dict()
fake = schedule_crd.Schedule(**body)
namespace = "ns1"

await operator.schedule_check(body, namespace)

# Assert the expected behavior
mock_get_reference.assert_awaited_once_with(namespace, fake.spec.ref)
mock_check_for_delete.assert_awaited_once_with(namespace, fake)
mock_update_schedule.assert_awaited_once_with(
fake.metadata.name, namespace, ref_found=True
)

@mock.patch.object(operator, "update_schedule")
@mock.patch.object(operator, "delete_reference")
async def test_check_for_delete(self, mock_delete_reference, mock_update_schedule):
namespace = "ns1"
schedule = schedule_crd.get_fake()

await operator.check_for_delete(namespace, schedule)

mock_delete_reference.assert_awaited_once_with(namespace, schedule.spec.ref)
mock_update_schedule.assert_awaited_once_with(
schedule.metadata.name, namespace, delete_triggered=True
)
Empty file.
6 changes: 5 additions & 1 deletion charts/operator/templates/clusterrole-operator.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,12 @@ rules:
- apiGroups: ["scheduling.azimuth.stackhpc.com"]
resources: ["*"]
verbs: ["*"]
# allow these things to be deleted by the operator
- apiGroups: ["caas.azimuth.stackhpc.com"]
resources: ["*"]
resources: ["clusters"]
verbs: ["*"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["*"]
- apiGroups: ["rbac.authorization.k8s.io"]
resources: ["clusterrolebindings"]
Expand Down
10 changes: 5 additions & 5 deletions charts/operator/templates/prometheusrule.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -10,14 +10,14 @@ spec:
rules:
- alert: AzimuthCaasClusterNotReady
JohnGarbutt marked this conversation as resolved.
Show resolved Hide resolved
expr: >-
sum(azimuth_schedule_clusters_phase{phase!="Ready"}) by(cluster_namespace, cluster_name) > 0
for: 1h
sum(azimuth_schedule_delete_triggered{refFound!="True"}) by(schedule_namespace, schedule_name) > 0
for: 15m
annotations:
description: >-
Azimuth schedule
{{ "{{" }} $labels.cluster_namespace {{ "}}" }}/{{ "{{" }} $labels.cluster_name {{ "}}" }}
has been in a non-ready state for longer than one hour.
summary: Azimuth schedule has been in a non-ready state for more than one hour.
{{ "{{" }} $labels.schedule_namespace {{ "}}" }}/{{ "{{" }} $labels.schedule_name {{ "}}" }}
has been in a not found state for longer than 15 mins.
summary: Azimuth schedule has not found its ref for longer than 15 mins.
labels:
severity: warning
{{- end }}
23 changes: 10 additions & 13 deletions charts/operator/values.yaml
Original file line number Diff line number Diff line change
@@ -1,7 +1,3 @@
# Config for the operator
config:
placeholder: "asdf"

# The operator image to use
image:
repository: ghcr.io/stackhpc/azimuth-schedule-operator
Expand Down Expand Up @@ -70,7 +66,7 @@ metrics:
create: true
extraRules:
- apiGroups:
- schedule.azimuth.stackhpc.com
- scheduling.azimuth.stackhpc.com
resources:
- schedules
verbs:
Expand All @@ -87,20 +83,21 @@ metrics:
spec:
resources:
- groupVersionKind:
group: schedule.azimuth.stackhpc.com
group: scheduling.azimuth.stackhpc.com
version: v1alpha1
kind: Schedule
metricNamePrefix: azimuth_schedule_clusters
metricNamePrefix: azimuth_schedule
labelsFromPath:
cluster_namespace: [metadata, namespace]
cluster_name: [metadata, name]
cluster_type_name: [spec, clusterTypeName]
cluster_type_version: [spec, clusterTypeVersion]
schedule_namespace: [metadata, namespace]
schedule_name: [metadata, name]
schedule_ref_kind: [spec, ref, kind]
schedule_ref_name: [spec, ref, name]
schedule_ref_found: [status, refFound]
metrics:
- name: phase
- name: delete_triggered
help: "Schedule phase"
each:
type: Info
info:
labelsFromPath:
phase: [status, phase]
phase: [status, delete_triggered]
9 changes: 6 additions & 3 deletions tools/functional_test.sh
Original file line number Diff line number Diff line change
Expand Up @@ -19,9 +19,12 @@ until [ `kubectl get crds | grep schedules.scheduling.azimuth.stackhpc.com | wc
kubectl get crds


export BEFORE=$(date --date="-1 hour" +"%Y-%m-%dT%H:%M:%SZ")
export AFTER=$(date --date="+2 hour" +"%Y-%m-%dT%H:%M:%SZ")
export AFTER=$(date --date="-1 hour" +"%Y-%m-%dT%H:%M:%SZ")
envsubst < $SCRIPT_DIR/test_schedule.yaml | kubectl apply -f -

# until kubectl wait --for=jsonpath='{.status.phase}'=Available clustertype quick-test; do echo "wait for status to appear"; sleep 5; done
until kubectl wait --for=jsonpath='{.status.refFound}'=true schedule caas-mycluster; do echo "wait for refFound"; sleep 5; done
until kubectl wait --for=jsonpath='{.status.deleteTriggered}'=true schedule caas-mycluster; do echo "wait for deleteTriggered"; sleep 5; done
kubectl get schedule caas-mycluster -o yaml

# for debugging get the logs from the operator
kubectl logs -n azimuth-schedule-operator deployment/azimuth-schedule-operator
15 changes: 11 additions & 4 deletions tools/test_schedule.yaml
Original file line number Diff line number Diff line change
@@ -1,13 +1,20 @@
---
apiVersion: v1
kind: ConfigMap
metadata:
name: caas-mycluster
data:
key1: value1
key2: value2
---
apiVersion: scheduling.azimuth.stackhpc.com/v1alpha1
kind: Schedule
metadata:
name: caas-mycluster
ownerReferences: []
spec:
ref:
apiVersion: caas.azimuth.stackhpc.com/v1alpha1
kind: Cluster
name: mycluster
notBefore: $BEFORE
apiVersion: v1
kind: ConfigMap
name: caas-mycluster
notAfter: $AFTER
2 changes: 1 addition & 1 deletion tox.ini
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,7 @@ commands =
# E123, E125 skipped as they are invalid PEP-8.
show-source = True
# TODO add headers and remove H102
ignore = E123,E125,H102
ignore = E123,E125,H102,W503
builtins = _
exclude=.venv,.git,.tox,dist,doc,*lib/python*,*egg,build
# match black
Expand Down
Loading