Backup failed due to throttling/OOM? #401

Open
toutas opened this issue Dec 18, 2024 · 13 comments
Labels
exempt This issue is never marked as stale question Further information is requested

Comments

@toutas
Contributor

toutas commented Dec 18, 2024

Environment

  • OS: Ubuntu 24.04 LTS
  • k3s version: v1.29.5+k3s1 (4e53a323)
  • AWX Operator: 2.19.0

Description

I have had a backup script that follows the backup guide running for a long time. It had been working flawlessly, but it has failed on every run since 2024-11-27. Nothing was modified between the last successful backup and the first failure, and every attempt to make a backup now fails.

Steps to Reproduce

Unsure, have not attempted to reproduce on a clean setup as daily backups have worked fine.
The command that fails is kubectl apply -f "{{ awx_k3s_repo_dir }}/awx-on-k3s/backup/awxbackup.yaml"

Logs

When running kubectl apply -f "/backup/awxbackup.yaml" it finishes immediately instead of doing the backup, and logs displayed by kubectl -n awx logs -f deployments/awx-operator-controller-manager do not tell me anything I can make sense of.

kubectl -n awx logs -f deployments/awx-operator-controller-manager
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"cmd","msg":"Version","Go Version":"go1.20.12","GOOS":"linux","GOARCH":"amd64","ansible-operator":"v1.34.0","commit":"d26c43bf94960d292152862a6685696be33190fb"}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"cmd","msg":"Watching namespaces","namespaces":["awx"]}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"watches","msg":"Environment variable not set; using default value","envVar":"ANSIBLE_VERBOSITY_AWX_AWX_ANSIBLE_COM","default":2}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"watches","msg":"Environment variable not set; using default value","envVar":"ANSIBLE_VERBOSITY_AWXBACKUP_AWX_ANSIBLE_COM","default":2}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"watches","msg":"Environment variable not set; using default value","envVar":"ANSIBLE_VERBOSITY_AWXRESTORE_AWX_ANSIBLE_COM","default":2}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"watches","msg":"Environment variable not set; using default value","envVar":"ANSIBLE_VERBOSITY_AWXMESHINGRESS_AWX_ANSIBLE_COM","default":2}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"ansible-controller","msg":"Watching resource","Options.Group":"awx.ansible.com","Options.Version":"v1beta1","Options.Kind":"AWX"}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"ansible-controller","msg":"Watching resource","Options.Group":"awx.ansible.com","Options.Version":"v1beta1","Options.Kind":"AWXBackup"}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"ansible-controller","msg":"Watching resource","Options.Group":"awx.ansible.com","Options.Version":"v1beta1","Options.Kind":"AWXRestore"}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"ansible-controller","msg":"Watching resource","Options.Group":"awx.ansible.com","Options.Version":"v1alpha1","Options.Kind":"AWXMeshIngress"}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"proxy","msg":"Starting to serve","Address":"127.0.0.1:8888"}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"apiserver","msg":"Starting to serve metrics listener","Address":"localhost:5050"}
{"level":"info","ts":"2024-12-18T10:31:48Z","msg":"starting server","kind":"health probe","addr":"[::]:6789"}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":"2024-12-18T10:31:48Z","logger":"controller-runtime.metrics","msg":"Serving metrics server","bindAddress":"127.0.0.1:8080","secure":false}
I1218 10:31:48.821221       7 leaderelection.go:250] attempting to acquire leader lease awx/awx-operator...
I1218 10:32:05.084628       7 leaderelection.go:260] successfully acquired lease awx/awx-operator
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting EventSource","controller":"awx-controller","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting EventSource","controller":"awxrestore-controller","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting EventSource","controller":"awxmeshingress-controller","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting Controller","controller":"awx-controller"}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting Controller","controller":"awxmeshingress-controller"}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting Controller","controller":"awxrestore-controller"}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting EventSource","controller":"awxbackup-controller","source":"kind source: *unstructured.Unstructured"}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting Controller","controller":"awxbackup-controller"}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting workers","controller":"awx-controller","worker count":32}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting workers","controller":"awxmeshingress-controller","worker count":32}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting workers","controller":"awxrestore-controller","worker count":32}
{"level":"info","ts":"2024-12-18T10:32:05Z","msg":"Starting workers","controller":"awxbackup-controller","worker count":32}
{"level":"info","ts":"2024-12-18T10:32:05Z","logger":"KubeAPIWarningLogger","msg":"unknown field \"status.conditions[0].message\""}
{"level":"info","ts":"2024-12-18T10:32:05Z","logger":"KubeAPIWarningLogger","msg":"unknown field \"status.conditions[1].message\""}
{"level":"info","ts":"2024-12-18T10:32:05Z","logger":"KubeAPIWarningLogger","msg":"unknown field \"status.conditions[2].message\""}
I1218 10:32:06.277917       7 request.go:697] Waited for 1.028919415s due to client-side throttling, not priority and fairness, request: PUT:https://10.43.0.1:443/apis/awx.ansible.com/v1beta1/namespaces/awx/awxbackups/awx-backup-20240716080901/status

When I check the awxbackup objects I see a long list of old backup objects, and it does create a new one every time I apply the backup:

kubectl -n awx get awxbackup
NAME                        AGE
<100s of other awx-backup-202x objects>
awx-backup-20241218100946   22m
awx-backup-20241218102752   4m40s
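
To see how many AWXBackup objects have piled up and which ones are oldest, something like the following may help. It is only a sketch: the function names are made up for illustration, the `awx` namespace default is taken from this issue, and it relies on the standard kubectl flags `--no-headers`, `--sort-by` and `-o name`.

```shell
# Count AWXBackup objects in a namespace (defaults to "awx", as in this issue).
# stderr is discarded so "No resources found" does not clutter the output.
count_awxbackups() {
  ns="${1:-awx}"
  kubectl -n "$ns" get awxbackup --no-headers 2>/dev/null | grep -c .
}

# List AWXBackup objects oldest first, one "<resource type>/<name>" per line.
list_awxbackups_oldest_first() {
  ns="${1:-awx}"
  kubectl -n "$ns" get awxbackup --sort-by=.metadata.creationTimestamp -o name
}

# Usage (not run here):
#   count_awxbackups
#   list_awxbackups_oldest_first | head -n 20
```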

Files

My awxbackup.yml is defined as:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: {{ filename_startswith }}-{{ current_date_utc }}
  namespace: awx
spec:
  deployment_name: awx
  backup_pvc: awx-backup-claim
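
The `{{ filename_startswith }}` and `{{ current_date_utc }}` placeholders are filled in by the backup script's own templating. For readers without that tooling, a rough shell equivalent could look like the sketch below. The function name is hypothetical, and the `%Y%m%d%H%M%S` timestamp format is only inferred from object names such as `awx-backup-20241218100946`.

```shell
# Print an AWXBackup manifest with a UTC-timestamped name.
# $1: name prefix (default "awx-backup"); $2: timestamp (default: current UTC time).
render_awxbackup() {
  prefix="${1:-awx-backup}"
  stamp="${2:-$(date -u +%Y%m%d%H%M%S)}"
  cat <<EOF
---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: ${prefix}-${stamp}
  namespace: awx
spec:
  deployment_name: awx
  backup_pvc: awx-backup-claim
EOF
}

# Usage (not run here): render_awxbackup | kubectl apply -f -
```
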
@toutas toutas added the question Further information is requested label Dec 18, 2024
@toutas
Contributor Author

toutas commented Dec 18, 2024

I came upon https://stackoverflow.com/questions/71596906/client-side-throttling-response-from-kubernetes-kubectl-command when looking around for similar issues, and I have attempted to clear the kube cache: rm -rf .kube/cache but that did nothing.

@toutas
Contributor Author

toutas commented Dec 18, 2024

I also attempted to restart the awx-operator-controller-manager:

➜  ~ kubectl -n awx rollout restart deployment awx-operator-controller-manager
deployment.apps/awx-operator-controller-manager restarted

➜  ~ kubectl -n awx get pods -w

NAME                                               READY   STATUS             RESTARTS         AGE
awx-migration-24.6.0-pfgpq                         0/1     Completed          0                164d
awx-postgres-15-0                                  1/1     Running            2 (23d ago)      164d
awx-web-66fb5bc94d-jd229                           3/3     Running            6 (23d ago)      164d
awx-task-69b9f79cc9-gzqgn                          4/4     Running            8 (23d ago)      164d
automation-job-252677-lt4lr                        1/1     Running            0                18m
awx-operator-controller-manager-7875f768df-2hgcn   1/2     CrashLoopBackOff   6502 (31s ago)   164d
awx-operator-controller-manager-774894966b-qnjwv   1/2     Running            0                6s
awx-operator-controller-manager-774894966b-qnjwv   2/2     Running            0                11s
awx-operator-controller-manager-7875f768df-2hgcn   1/2     Terminating        6502 (36s ago)   164d
awx-operator-controller-manager-7875f768df-2hgcn   1/2     Terminating        6502 (23d ago)   164d
awx-operator-controller-manager-7875f768df-2hgcn   0/2     Terminating        6502 (23d ago)   164d
awx-operator-controller-manager-7875f768df-2hgcn   0/2     Terminating        6502 (23d ago)   164d
awx-operator-controller-manager-7875f768df-2hgcn   0/2     Terminating        6502 (23d ago)   164d
awx-operator-controller-manager-7875f768df-2hgcn   0/2     Terminating        6502 (23d ago)   164d
awx-operator-controller-manager-774894966b-qnjwv   1/2     OOMKilled          0                30s
awx-operator-controller-manager-774894966b-qnjwv   1/2     Running            1 (1s ago)       31s
awx-operator-controller-manager-774894966b-qnjwv   2/2     Running            1 (11s ago)      41s
awx-operator-controller-manager-774894966b-qnjwv   1/2     OOMKilled          1 (28s ago)      58s
awx-operator-controller-manager-774894966b-qnjwv   1/2     CrashLoopBackOff   1 (3s ago)       61s
awx-operator-controller-manager-774894966b-qnjwv   1/2     Running            2 (15s ago)      73s
awx-operator-controller-manager-774894966b-qnjwv   2/2     Running            2 (23s ago)      81s
awx-operator-controller-manager-774894966b-qnjwv   1/2     OOMKilled          2 (43s ago)      101s
awx-operator-controller-manager-774894966b-qnjwv   1/2     CrashLoopBackOff   2 (10s ago)      111s
awx-operator-controller-manager-774894966b-qnjwv   1/2     Running            3 (33s ago)      2m14s

The new pod ends up alternating between OOMKilled and CrashLoopBackOff, which I assume is also why the controller manager started failing in the first place.
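
To confirm which container is being OOM-killed and what memory limits it runs with, a sketch along these lines might be useful. The function names are invented for illustration; the deployment name comes from the commands above, and the jsonpath fields are standard pod and deployment fields. This only inspects the state and does not by itself suggest that raising limits is the right fix.

```shell
# Print "<container>\t<last termination reason>" for each container in a pod,
# e.g. a reason of "OOMKilled".
last_termination_reasons() {
  pod="$1"; ns="${2:-awx}"
  kubectl -n "$ns" get pod "$pod" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Print the resource requests/limits configured on the Operator's containers.
operator_resources() {
  ns="${1:-awx}"
  kubectl -n "$ns" get deployment awx-operator-controller-manager \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
}

# Usage (not run here):
#   last_termination_reasons awx-operator-controller-manager-774894966b-qnjwv
#   operator_resources
```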

@toutas toutas changed the title Backup failed due to throttling? Backup failed due to throttling/OOM? Dec 18, 2024
@kurokobo
Owner

Based on the results of kubectl -n awx get pods -w, it seems that the AWX Operator pods are not starting normally in the K3s cluster. I believe there is insufficient memory, and kubelet is unable to manage the pod startup status correctly.

When there are a large number of awxbackup resources, the AWX Operator will try to manage all of their states, which can consume a significant amount of computing resources. It is not recommended to keep too many awxbackup resources.

First, stop the AWX Operator with the following commands:

kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=0
kubectl -n awx get pods -w

Next, delete unnecessary awxbackup resources:

kubectl -n awx delete awxbackup awx-backup-........

If possible, I recommend restarting the K3s host before starting the AWX Operator again:

kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=1

If memory is still insufficient, the DB could be corrupted when the K3s host is restarted. To be safe, please confirm that the pods for web, task, and postgres are also scaled down to 0 replicas and stopped before the restart.
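
The stop sequence described above could be scripted roughly as follows. This is a sketch under assumptions: the workload names `awx-web`, `awx-task` and `awx-postgres-15` are inferred from the pod names shown earlier in this issue and should be checked with `kubectl -n awx get deployment,statefulset` first. The function name is hypothetical. The Operator is scaled down first so that it cannot act on the other workloads while they are being stopped.

```shell
# Scale the Operator, then web/task, then postgres down to 0 replicas.
# Workload names are assumptions inferred from pod names; verify before use.
stop_awx() {
  ns="${1:-awx}"
  kubectl -n "$ns" scale deployment/awx-operator-controller-manager --replicas=0 &&
  kubectl -n "$ns" scale deployment/awx-web deployment/awx-task --replicas=0 &&
  kubectl -n "$ns" scale statefulset/awx-postgres-15 --replicas=0
}

# Usage (not run here):
#   stop_awx
#   kubectl -n awx get pods -w   # wait until the pods are gone before rebooting
```

After the reboot, the Operator is started again with the `--replicas=1` command shown above. Check that the other workloads come back as well, and scale them up by hand if they do not.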

@toutas
Contributor Author

toutas commented Dec 19, 2024

@kurokobo thank you for your quick response!

I forgot to add that I attempted to run kubectl -n awx delete awxbackup --all but it hangs after having output many lines:

awxbackup.awx.ansible.com "awx-backup-20241005183042" deleted
awxbackup.awx.ansible.com "awx-backup-20241022100054" deleted
awxbackup.awx.ansible.com "awx-backup-20240916183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240725183054" deleted
awxbackup.awx.ansible.com "awx-backup-20240801183049" deleted
awxbackup.awx.ansible.com "awx-backup-20241119193056" deleted
awxbackup.awx.ansible.com "awx-backup-20240819183054" deleted
awxbackup.awx.ansible.com "awx-backup-20240829183042" deleted
awxbackup.awx.ansible.com "awx-backup-20241006183046" deleted
awxbackup.awx.ansible.com "awx-backup-20241012183042" deleted
awxbackup.awx.ansible.com "awx-backup-20240718183051" deleted
awxbackup.awx.ansible.com "awx-backup-20240927183046" deleted
awxbackup.awx.ansible.com "awx-backup-20240919183049" deleted
awxbackup.awx.ansible.com "awx-backup-20241120193103" deleted
awxbackup.awx.ansible.com "awx-backup-20241019183043" deleted
awxbackup.awx.ansible.com "awx-backup-20240727183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240923183101" deleted
awxbackup.awx.ansible.com "awx-backup-20240912183043" deleted
awxbackup.awx.ansible.com "awx-backup-20240904183041" deleted
awxbackup.awx.ansible.com "awx-backup-20240810183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240822183048" deleted
awxbackup.awx.ansible.com "awx-backup-20241103193039" deleted
awxbackup.awx.ansible.com "awx-backup-20240902183054" deleted
awxbackup.awx.ansible.com "awx-backup-20240716183121" deleted
awxbackup.awx.ansible.com "awx-backup-20241112193041" deleted
awxbackup.awx.ansible.com "awx-backup-20241025183057" deleted
awxbackup.awx.ansible.com "awx-backup-20240723183052" deleted
awxbackup.awx.ansible.com "awx-backup-20240719183047" deleted
awxbackup.awx.ansible.com "awx-backup-20241108193040" deleted
awxbackup.awx.ansible.com "awx-backup-20240814183052" deleted
awxbackup.awx.ansible.com "awx-backup-20241122193049" deleted
awxbackup.awx.ansible.com "awx-backup-20240908183044" deleted
awxbackup.awx.ansible.com "awx-backup-20240716080901" deleted
awxbackup.awx.ansible.com "awx-backup-20240809183047" deleted
awxbackup.awx.ansible.com "awx-backup-20241124193046" deleted
awxbackup.awx.ansible.com "awx-backup-20240713183116" deleted
awxbackup.awx.ansible.com "awx-backup-20241015183039" deleted
awxbackup.awx.ansible.com "awx-backup-20240731183048" deleted
awxbackup.awx.ansible.com "awx-backup-20240903183042" deleted
awxbackup.awx.ansible.com "awx-backup-20241022183042" deleted
awxbackup.awx.ansible.com "awx-backup-20240714183057" deleted
awxbackup.awx.ansible.com "awx-backup-20240811183048" deleted
awxbackup.awx.ansible.com "awx-backup-20240924183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240827183048" deleted
awxbackup.awx.ansible.com "awx-backup-20241010183039" deleted
awxbackup.awx.ansible.com "awx-backup-20240825183041" deleted
awxbackup.awx.ansible.com "awx-backup-20241113193039" deleted
awxbackup.awx.ansible.com "awx-backup-20240824183042" deleted
awxbackup.awx.ansible.com "awx-backup-20241008183042" deleted
awxbackup.awx.ansible.com "awx-backup-20240803183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240925183043" deleted
awxbackup.awx.ansible.com "awx-backup-20241031193043" deleted
awxbackup.awx.ansible.com "awx-backup-20240812183135" deleted
awxbackup.awx.ansible.com "awx-backup-20240920183053" deleted
awxbackup.awx.ansible.com "awx-backup-20241123193048" deleted
awxbackup.awx.ansible.com "awx-backup-20240726183049" deleted
awxbackup.awx.ansible.com "awx-backup-20241115193052" deleted
awxbackup.awx.ansible.com "awx-backup-20241105193043" deleted
awxbackup.awx.ansible.com "awx-backup-20241026183038" deleted
awxbackup.awx.ansible.com "awx-backup-20240712183049" deleted
awxbackup.awx.ansible.com "awx-backup-20241118193057" deleted
awxbackup.awx.ansible.com "awx-backup-20240730183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240823183131" deleted
awxbackup.awx.ansible.com "awx-backup-20241107193042" deleted
awxbackup.awx.ansible.com "awx-backup-20240802183050" deleted
awxbackup.awx.ansible.com "awx-backup-20240901183043" deleted
awxbackup.awx.ansible.com "awx-backup-20240826183041" deleted
awxbackup.awx.ansible.com "awx-backup-20241009183042" deleted
awxbackup.awx.ansible.com "awx-backup-20241014183041" deleted
awxbackup.awx.ansible.com "awx-backup-20241003183206" deleted
awxbackup.awx.ansible.com "awx-backup-20240808183049" deleted
awxbackup.awx.ansible.com "awx-backup-20241017183041" deleted
awxbackup.awx.ansible.com "awx-backup-20240721183049" deleted
awxbackup.awx.ansible.com "awx-backup-20241030193043" deleted
awxbackup.awx.ansible.com "awx-backup-20241028193041" deleted
awxbackup.awx.ansible.com "awx-backup-20241110193043" deleted
awxbackup.awx.ansible.com "awx-backup-20240905183042" deleted
awxbackup.awx.ansible.com "awx-backup-20241114193041" deleted
awxbackup.awx.ansible.com "awx-backup-20240710183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240806183047" deleted
awxbackup.awx.ansible.com "awx-backup-20240817183051" deleted
awxbackup.awx.ansible.com "awx-backup-20240706194812" deleted
awxbackup.awx.ansible.com "awx-backup-20241101193041" deleted
awxbackup.awx.ansible.com "awx-backup-20240914183047" deleted
awxbackup.awx.ansible.com "awx-backup-20240804183048" deleted
awxbackup.awx.ansible.com "awx-backup-20241104193042" deleted
awxbackup.awx.ansible.com "awx-backup-20241007183043" deleted
awxbackup.awx.ansible.com "awx-backup-20241022105528" deleted
awxbackup.awx.ansible.com "awx-backup-20241027193047" deleted
awxbackup.awx.ansible.com "awx-backup-20240816183051" deleted
awxbackup.awx.ansible.com "awx-backup-20241116193042" deleted
awxbackup.awx.ansible.com "awx-backup-20241111193041" deleted
awxbackup.awx.ansible.com "awx-backup-20240724183049" deleted
awxbackup.awx.ansible.com "awx-backup-20241029193042" deleted
awxbackup.awx.ansible.com "awx-backup-20241022084940" deleted
awxbackup.awx.ansible.com "awx-backup-20240717183107" deleted
awxbackup.awx.ansible.com "awx-backup-20240921183048" deleted
awxbackup.awx.ansible.com "awx-backup-20240828183041" deleted
awxbackup.awx.ansible.com "awx-backup-20240707183103" deleted
awxbackup.awx.ansible.com "awx-backup-20241016183044" deleted
awxbackup.awx.ansible.com "awx-backup-20240831183040" deleted
awxbackup.awx.ansible.com "awx-backup-20240807183104" deleted
awxbackup.awx.ansible.com "awx-backup-20241121193057" deleted
awxbackup.awx.ansible.com "awx-backup-20241004183054" deleted
awxbackup.awx.ansible.com "awx-backup-20240910183044" deleted
awxbackup.awx.ansible.com "awx-backup-20240906183041" deleted
awxbackup.awx.ansible.com "awx-backup-20240911183039" deleted
awxbackup.awx.ansible.com "awx-backup-20240729183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240820183051" deleted
awxbackup.awx.ansible.com "awx-backup-20241023183039" deleted
awxbackup.awx.ansible.com "awx-backup-20240913183045" deleted
awxbackup.awx.ansible.com "awx-backup-20241002183100" deleted
awxbackup.awx.ansible.com "awx-backup-20240918183110" deleted
awxbackup.awx.ansible.com "awx-backup-20240928183053" deleted
awxbackup.awx.ansible.com "awx-backup-20241117193047" deleted
awxbackup.awx.ansible.com "awx-backup-20241106193040" deleted
awxbackup.awx.ansible.com "awx-backup-20240715183133" deleted
awxbackup.awx.ansible.com "awx-backup-20241022074300" deleted
awxbackup.awx.ansible.com "awx-backup-20241109193042" deleted
awxbackup.awx.ansible.com "awx-backup-20240706194147" deleted
awxbackup.awx.ansible.com "awx-backup-20240708183122" deleted
awxbackup.awx.ansible.com "awx-backup-20240830183042" deleted
awxbackup.awx.ansible.com "awx-backup-20240711183047" deleted
awxbackup.awx.ansible.com "awx-backup-20240805183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240926183118" deleted
awxbackup.awx.ansible.com "awx-backup-20240815183050" deleted
awxbackup.awx.ansible.com "awx-backup-20240909183042" deleted
awxbackup.awx.ansible.com "awx-backup-20240813183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240907183041" deleted
awxbackup.awx.ansible.com "awx-backup-20240728183049" deleted
awxbackup.awx.ansible.com "awx-backup-20240930183047" deleted
awxbackup.awx.ansible.com "awx-backup-20241011183041" deleted
awxbackup.awx.ansible.com "awx-backup-20240720183050" deleted
awxbackup.awx.ansible.com "awx-backup-20240818183053" deleted
awxbackup.awx.ansible.com "awx-backup-20241013183039" deleted
awxbackup.awx.ansible.com "awx-backup-20240922183133" deleted
awxbackup.awx.ansible.com "awx-backup-20240709183049" deleted
awxbackup.awx.ansible.com "awx-backup-20241024183046" deleted
awxbackup.awx.ansible.com "awx-backup-20241020183039" deleted
awxbackup.awx.ansible.com "awx-backup-20241102193041" deleted
awxbackup.awx.ansible.com "awx-backup-20240917183053" deleted
awxbackup.awx.ansible.com "awx-backup-20240915183128" deleted
awxbackup.awx.ansible.com "awx-backup-20240722183104" deleted
awxbackup.awx.ansible.com "awx-backup-20240821183124" deleted
awxbackup.awx.ansible.com "awx-backup-20240929183053" deleted
awxbackup.awx.ansible.com "awx-backup-20241001183046" deleted
awxbackup.awx.ansible.com "awx-backup-20241125193023" deleted
awxbackup.awx.ansible.com "awx-backup-20241126193022" deleted
awxbackup.awx.ansible.com "awx-backup-20241127193023" deleted
awxbackup.awx.ansible.com "awx-backup-20241128193023" deleted
awxbackup.awx.ansible.com "awx-backup-20241129193023" deleted
awxbackup.awx.ansible.com "awx-backup-20241130193023" deleted
awxbackup.awx.ansible.com "awx-backup-20241201193033" deleted
awxbackup.awx.ansible.com "awx-backup-20241202193022" deleted
awxbackup.awx.ansible.com "awx-backup-20241203193022" deleted
awxbackup.awx.ansible.com "awx-backup-20241204193023" deleted
awxbackup.awx.ansible.com "awx-backup-20241205193022" deleted
awxbackup.awx.ansible.com "awx-backup-20241206193026" deleted
awxbackup.awx.ansible.com "awx-backup-20241207193026" deleted
awxbackup.awx.ansible.com "awx-backup-20241208193023" deleted
awxbackup.awx.ansible.com "awx-backup-20241209193029" deleted
awxbackup.awx.ansible.com "awx-backup-20241210193022" deleted
awxbackup.awx.ansible.com "awx-backup-20241211193024" deleted
awxbackup.awx.ansible.com "awx-backup-20241212193023" deleted
awxbackup.awx.ansible.com "awx-backup-20241213193026" deleted
awxbackup.awx.ansible.com "awx-backup-20241214193023" deleted
awxbackup.awx.ansible.com "awx-backup-20241215193027" deleted
awxbackup.awx.ansible.com "awx-backup-20241216193024" deleted
awxbackup.awx.ansible.com "awx-backup-20241217081251" deleted
awxbackup.awx.ansible.com "awx-backup-20241217105816" deleted
awxbackup.awx.ansible.com "awx-backup-20241217193023" deleted
awxbackup.awx.ansible.com "awx-backup-20241218100946" deleted
awxbackup.awx.ansible.com "awx-backup-20241218102752" deleted
awxbackup.awx.ansible.com "awx-backup-20241218111043" deleted
awxbackup.awx.ansible.com "awx-backup-20241218193022" deleted
...

And since you pointed out that having many of these backup objects could be the issue, I will try to figure out how I can get around this hang. It seems like it will not delete any of the objects when it hangs using --all.

Will get back with an update once I have played around with it a bit.

On another note, do you recommend running

kubectl -n awx delete awxbackup <newly-created-backup-object>

after having uploaded my backup? I may have wrongly assumed it would not be an issue to leave the objects lying around if no data is present or the backup dir is cleaned.

@toutas
Contributor Author

toutas commented Dec 19, 2024

Okay I fixed the issue. It seems kind of weird to me that it ended up in this state by itself, but maybe you have an idea of things I can do to prevent this from happening again?

Running the following commands helped fix the deletion hanging issue:

kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=0

kubectl get awxbackup -n awx -o json | jq '.items[] | .metadata.name' | xargs -I{} kubectl patch awxbackup {} -n awx --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

kubectl get awxbackup -n awx
then shows No resources found in awx namespace. instead of the long list of awx-backup- resources

kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=1

My backup script now works as expected.

I assume I need to ensure those resources are cleaned up after every backup, or is there anything I am missing that is not documented?
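
A slightly more defensive variant of the finalizer removal above is sketched below. It uses `-o name` instead of `jq`, and a merge patch that sets `metadata.finalizers` to `null`, which also succeeds on objects that have no finalizers (a JSON patch `remove` fails when the path is missing). The function name is made up for illustration. Note that clearing finalizers bypasses the Operator's cleanup for those objects, so any backup data they refer to is left in place.

```shell
# Strip finalizers from every AWXBackup in a namespace so pending deletions can finish.
# WARNING: this skips the Operator's finalizer playbook for these objects.
strip_awxbackup_finalizers() {
  ns="${1:-awx}"
  kubectl -n "$ns" get awxbackup -o name |
  while IFS= read -r obj; do
    kubectl -n "$ns" patch "$obj" --type=merge -p '{"metadata":{"finalizers":null}}'
  done
}

# Usage (not run here), with the Operator scaled to 0 as in the steps above:
#   strip_awxbackup_finalizers
```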

@kurokobo
Owner

Every time an AWXBackup resource is deleted, the AWX Operator attempts to execute a playbook, called a finalizer, for that resource. This is why your deletion hung. My bad, I did not comment on this point earlier.

If you run backups regularly in some automated way, it is generally considered best practice to delete older backups automatically. Whether the actual backup data in the PVC is deleted when the AWXBackup resource is deleted can be controlled by the clean_backup_on_delete parameter: https://github.com/ansible/awx-operator/tree/2.19.1/roles/backup
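
One way to automate that pruning is sketched below: keep the newest N AWXBackup objects and delete the rest. The function names and the default of 7 are hypothetical, and the selection assumes every object shares one name prefix followed by a `YYYYmmddHHMMSS` timestamp, so that a plain sort is chronological. The Operator has to be running while this executes, since it is the Operator that completes the finalizer for each deleted object. The `--timeout` flag keeps a stuck deletion from blocking forever.

```shell
# Read AWXBackup names on stdin and print all but the newest $1 (default 7).
# Relies on the timestamp suffix sorting lexicographically.
select_old_backups() {
  keep="${1:-7}"
  LC_ALL=C sort |
  awk -v keep="$keep" '{ names[NR] = $0 } END { for (i = 1; i <= NR - keep; i++) print names[i] }'
}

# Delete every AWXBackup that select_old_backups prints.
# Whether the data in the PVC is also removed depends on clean_backup_on_delete.
prune_awxbackups() {
  keep="${1:-7}"; ns="${2:-awx}"
  kubectl -n "$ns" get awxbackup -o name | sed 's|.*/||' | select_old_backups "$keep" |
  while IFS= read -r name; do
    kubectl -n "$ns" delete awxbackup "$name" --timeout=300s
  done
}

# Usage (not run here), e.g. at the end of the backup script:
#   prune_awxbackups 7
```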

@toutas
Contributor Author

toutas commented Dec 21, 2024

@kurokobo looking at that role, from what I understand it is just clearing the backup directory from inside the container and not exactly removing the backup objects in the awx namespace or am I misunderstanding it?

If it is just deleting the backup folder that was created, is deleting the backup from my host's /data/backup directory not enough? Do I have to delete it from inside the management pod as well?

@kurokobo
Owner

The AWX Operator does not delete AWXBackup objects. The role of the finalizer is to detect when an AWXBackup object has been deleted by a user and then remove the actual data (i.e., the data within the PVC, which corresponds to the data in /data/backup). This finalizer functionality is handled by the AWX Operator.

The AWX Operator monitors all AWXBackup objects. When the state of an object changes, the Operator runs a playbook once to maintain and manage that object.

If there are a large number of AWXBackup objects, the number of playbook executions increases, consuming more resources. This is especially noticeable right after the Operator starts. For instance, if there are 100 AWXBackup objects, the playbook will be executed 100 times, and the system will remain unstable until all executions are completed.

The fewer AWXBackup objects there are, the lower the resource consumption of the Operator, and the more stable the cluster will be.

@toutas
Contributor Author

toutas commented Dec 27, 2024

> The fewer AWXBackup objects there are, the lower the resource consumption of the Operator, and the more stable the cluster will be.

Great to know! Would it make sense if I create a small documentation PR for the backup readme that goes over this?

@kurokobo
Owner

> Would it make sense if I create a small documentation PR for the backup readme that goes over this?

Yes, thanks :)

@kurokobo
Owner

#402


This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 4 days.

@github-actions github-actions bot added the stale This issue has no activity label Jan 10, 2025
@toutas
Contributor Author

toutas commented Jan 10, 2025

> This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 4 days.

not stale

@kurokobo kurokobo added exempt This issue is never marked as stale and removed stale This issue has no activity labels Jan 11, 2025