Backup failed due to throttling/OOM? #401
I came across https://stackoverflow.com/questions/71596906/client-side-throttling-response-from-kubernetes-kubectl-command while looking for similar issues, and I have attempted to clear the kube cache:
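(The exact command was not preserved in this thread; a typical way to clear kubectl's client-side cache, assuming the default cache locations, is the sketch below.)

```bash
# Hypothetical sketch: remove kubectl's discovery and HTTP caches
# (default locations under ~/.kube; adjust if you use a custom cache dir).
rm -rf ~/.kube/cache ~/.kube/http-cache
```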
I also attempted to restart the awx-operator-controller-manager:
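(Again, the original command is not preserved here; one common way to restart the Operator deployment, assuming a standard awx-on-k3s install in the awx namespace, is sketched below.)

```bash
# Hypothetical sketch: restart the AWX Operator controller manager,
# either via a rolling restart or by scaling it down and back up.
kubectl -n awx rollout restart deployment awx-operator-controller-manager
# or:
kubectl -n awx scale deployment awx-operator-controller-manager --replicas=0
kubectl -n awx scale deployment awx-operator-controller-manager --replicas=1
```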
It ends up in CrashLoopBackOff with OOMKilled, which I assume is the same reason the controller manager started failing in the first place.
Based on the results you shared, it looks like there are a large number of AWXBackup resources, and that is what drives up the Operator's memory usage.

First, stop the AWX Operator with the following commands:

kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=0
kubectl -n awx get pods -w

Next, delete unnecessary AWXBackup resources:

kubectl -n awx delete awxbackup awx-backup-........

If possible, I recommend restarting the K3s host before starting the AWX Operator again:

kubectl -n awx scale deployment/awx-operator-controller-manager --replicas=1

If memory is still insufficient when the K3s host is restarted, there is a possibility that the DB could be corrupted. For safer operation, please confirm that the pods for web, task, and postgres are also scaled down to 0 replicas and stopped before the restart.
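(Not part of the original reply: the scale-down step above could look roughly like the sketch below. The deployment and statefulset names are assumptions and vary by AWX version and instance name, so confirm them first.)

```bash
# Hypothetical sketch; confirm the actual resource names in your cluster first.
kubectl -n awx get deployments,statefulsets
kubectl -n awx scale deployment awx-web awx-task --replicas=0
kubectl -n awx scale statefulset awx-postgres-15 --replicas=0
```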
@kurokobo thank you for your quick response! I forgot to add that I had already attempted to delete the backup objects with --all, but that command hangs.
And since you pointed out that having many of these backup objects could be the issue, I will try to figure out how I can get around this hang. It seems like it will not delete any of the objects when it hangs using --all. Will get back with an update once I have played around with it a bit. On another note, do you recommend deleting the AWXBackup objects after having uploaded my backup? I may have wrongfully assumed it would not be an issue to have the objects lying around if there is no data present / the backup dir is cleaned.
Okay, I fixed the issue. It seems odd to me that it ended up in this state by itself, but maybe you have an idea of what I can do to prevent this from happening again? Running the following commands fixed the hanging deletions:
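(The commands themselves were not preserved in this thread. When a custom-resource deletion hangs on a pending finalizer, a commonly used workaround, shown here only as a hedged sketch and not necessarily what was run in this case, is to clear the finalizer so the deletion can complete.)

```bash
# Hypothetical sketch: find the stuck AWXBackup objects, then clear the
# finalizer on a stuck object so its pending deletion can complete.
kubectl -n awx get awxbackup
kubectl -n awx patch awxbackup <backup-name> --type=merge -p '{"metadata":{"finalizers":[]}}'
```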
My backup script now works as expected. I assume I need to make sure those resources are cleaned up after every backup, or is there anything I am missing that is not documented?
Every time an AWXBackup resource is deleted, the AWX Operator runs a playbook called a finalizer for each deleted resource. This is why your process hung; my bad, I didn't mention this point earlier. If you are running backups regularly in some automated way, it is generally considered best practice to automatically delete older backups. Whether the actual backup data in the PVC is also deleted when the AWXBackup resource is deleted can be controlled by an option on the AWXBackup spec.
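As an illustration of that practice, a minimal pruning sketch (assuming the awx namespace, GNU head/xargs, and that keeping the seven newest backups is acceptable) could look like this:

```bash
# Hypothetical sketch: delete all but the 7 newest AWXBackup objects.
kubectl -n awx get awxbackup --sort-by=.metadata.creationTimestamp -o name \
  | head -n -7 \
  | xargs -r kubectl -n awx delete
```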
@kurokobo looking at that role, from what I understand it just clears the backup directory from inside the container and does not actually remove the backup objects in the awx namespace, or am I misunderstanding it? If it only deletes the backup folder that was created, is deleting the backup from my host's /data/backup directory not enough? Do I have to delete it from inside the management pod as well?
The AWX Operator does not delete AWXBackup objects. The role of the finalizer is to detect when an AWXBackup object has been deleted by a user and then remove the actual data (i.e., the data within the PVC, which corresponds to the data in /data/backup). This finalizer functionality is handled by the AWX Operator.

The AWX Operator monitors all AWXBackup objects. When the state of an object changes, the Operator runs a playbook once to maintain and manage that object. If there are a large number of AWXBackup objects, the number of playbook executions increases, consuming more resources. This is especially noticeable right after the Operator starts. For instance, if there are 100 AWXBackup objects, the playbook will be executed 100 times, and the system will remain unstable until all executions are completed. The fewer AWXBackup objects there are, the lower the resource consumption of the Operator, and the more stable the cluster will be.
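A quick way to see how many AWXBackup objects the Operator is currently tracking (a trivial check, assuming the awx namespace):

```bash
kubectl -n awx get awxbackup --no-headers | wc -l
```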
Great to know! Would it make sense if I create a small documentation PR for the backup README that goes over this?
Yes, thanks :)
This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 4 days.
not stale
Environment
Description
I have had a backup script running for a long time that follows the backup guide. It had been working flawlessly, but it has failed every time since 2024-11-27. No modifications have been made since the time the backup job was functioning perfectly, yet it now fails every time I attempt to make a backup.

Step to Reproduce
Unsure; I have not attempted to reproduce this on a clean setup, as the daily backups had been working fine.
The command that fails is
kubectl apply -f "{{ awx_k3s_repo_dir }}/awx-on-k3s/backup/awxbackup.yaml"
Logs
When running
kubectl apply -f "/backup/awxbackup.yaml"
it finishes immediately instead of doing the backup, and logs displayed by
kubectl -n awx logs -f deployments/awx-operator-controller-manager
do not tell me anything I can make sense of.

When I check the awxbackup objects I see a long list of old backup objects, and it does create a new one every time I apply the backup:
Files
My awxbackup.yml is defined as follows:
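(The file contents were not included in the issue. A minimal AWXBackup manifest along the lines of the awx-on-k3s backup guide, with the name, namespace, and PVC below being assumptions, typically looks like this.)

```yaml
# Hypothetical sketch of awxbackup.yml; name, namespace, deployment_name,
# and backup_pvc must match your own deployment.
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-2024-11-27
  namespace: awx
spec:
  deployment_name: awx
  backup_pvc: awx-backup-claim
```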