Fixes random job failures in kubernetes #19001
Conversation
This is a great find, and ultimately also a simple fix!
Makes sense, thanks for figuring this out!
Nice work! It would have taken some doing to figure this out!
Co-authored-by: Marius van den Beek <[email protected]>
Co-authored-by: Nuwan Goonasekera <[email protected]>
This PR was merged without a "kind/" label, please correct.
Thank you everyone for your support. I tested it now.
Hi @mvdbeek, when do you think this fix will be pushed into the latest Galaxy and Helm chart builds for production?
As you can see, the milestone is tagged to 24.2, so when that release happens.
Thank you @mvdbeek, sorry, what I should have asked is: when do you expect 24.2 to be released?
[24.1] Backport #19001 kubernetes api client fix
When there are no new or critical bugs, we don't set this in advance. I've backported the fix to 24.1 in #19338, so if you update to the head of that branch you will get the fix.
This fix addresses the random failures of k8s jobs in Galaxy described in galaxyproject/galaxy-helm#490 and the issues mentioned therein. The problem is that the k8s job status may not yet be final while already providing some information:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.30/#jobstatus-v1-batch
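For illustration, a status payload in that intermediate state might look roughly like the following (hypothetical values, not taken from the PR; `job` is assumed to be a pykube Job object as used by Galaxy's kubernetes runner):

```python
# Hypothetical snapshot of job.obj["status"] while k8s is still reconciling:
# the dict is non-empty, so len(job.obj["status"]) == 0 is False, yet the
# succeeded/active/failed counters have not been updated because the
# terminated pod is still listed under uncountedTerminatedPods.
job_status_example = {
    "startTime": "2024-10-01T12:00:00Z",
    "uncountedTerminatedPods": {
        "succeeded": ["8f1c2a5e-pod-uid"],  # pod finished, not yet counted
    },
    # "succeeded", "active" and "failed" are absent or 0 at this point
}
```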
In the previous code, the check
if len(job.obj["status"]) == 0:
only tested whether some status information was present; if so, it was treated as the "final" state of the job and processing continued. However, when the job status contains the field `uncountedTerminatedPods`, k8s is not yet done determining whether the job failed or succeeded. The code then used this intermediate information (for me: 0 succeeded, 0 active and 0 failed) and went through the decision tree to decide what to do. My suggestion is to instead wait for k8s to determine the status of the terminated pods and only then decide how to proceed. This reduced the failure rate from 2-5% to 0% :)
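A minimal sketch of the idea behind the change (not the exact merged code; `job` is again assumed to be a pykube Job, and the helper name is made up for illustration):

```python
def job_status_is_final(job):
    """Return True only once k8s has finished counting terminated pods.

    Sketch of the fix's logic: a non-empty status dict is not enough;
    while uncountedTerminatedPods still lists pod UIDs, the
    succeeded/failed counters are not trustworthy yet.
    """
    status = job.obj.get("status", {})
    if not status:
        # No status reported yet: nothing to evaluate.
        return False
    uncounted = status.get("uncountedTerminatedPods") or {}
    if uncounted.get("succeeded") or uncounted.get("failed"):
        # k8s is still moving pod UIDs from "uncounted" into the
        # succeeded/failed counters; wait for the next status poll.
        return False
    return True
```

Only once such a check passes would the succeeded/active/failed counters be fed into the existing decision tree that marks the Galaxy job as finished or failed.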
How to test the changes?
License