Fixes random job failures in kubernetes #19001

Merged: 4 commits, Oct 17, 2024

Conversation

mapk-amazon
Contributor

This fix addresses the random failures of k8s jobs in Galaxy reported in galaxyproject/galaxy-helm#490 and the issues mentioned therein. The problem is that the k8s job status may not yet be final even though it already contains some information:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.30/#jobstatus-v1-batch

In the previous code, if len(job.obj["status"]) == 0: checked whether any status information was present; if so, the status was treated as the "final" state of the job and processing continued. However, when the job status contains the field uncountedTerminatedPods, k8s has not yet finished determining whether the job failed or succeeded. The code nevertheless used this incomplete information (in my case 0 succeeded, 0 active and 0 failed) and went through the decision tree to decide what to do.
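To illustrate the race described above, here is a minimal sketch, not the actual runner code. The helper read_counts, the snapshot partial_status, and its values are hypothetical; only the field names come from the k8s JobStatus reference linked above. It assumes the counters default to 0 when they are missing from the status.

```python
def read_counts(status):
    """Read the pod counters, defaulting to 0 when a field is absent (assumption)."""
    return (
        status.get("succeeded", 0),
        status.get("active", 0),
        status.get("failed", 0),
    )


# Hypothetical snapshot: the pod has terminated, but k8s has not yet moved it
# from uncountedTerminatedPods into the succeeded/failed counters.
partial_status = {
    "startTime": "2024-10-15T00:00:00Z",
    "uncountedTerminatedPods": {"succeeded": ["hypothetical-pod-uid"]},
}

# The emptiness check alone lets this snapshot through as if it were final...
old_check_passes = len(partial_status) != 0

# ...and every counter then reads as 0, which matches no real outcome.
counts = read_counts(partial_status)
```

In this sketch the status is non-empty, so an emptiness check alone does not stop processing, yet none of the counters reflects the terminated pod.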

My suggestion is to instead wait for k8s to determine the status of the terminated pods and only then decide what to do. This reduced the failure rate from 2-5% to 0% :)
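A minimal sketch of the "wait" idea, under the assumption that the presence of uncountedTerminatedPods in the status means k8s is still accounting for terminated pods, as described in this PR. The function name status_is_settled is hypothetical, and the exact condition in the merged commits may differ.

```python
from typing import Any, Mapping


def status_is_settled(status: Mapping[str, Any]) -> bool:
    """Return True only if the job status looks safe to act on.

    Assumption (taken from the PR description, not from the merged code):
    an empty status, or one that still carries uncountedTerminatedPods,
    should be skipped and re-checked on the next monitoring cycle.
    """
    if len(status) == 0:
        # k8s has not reported anything yet.
        return False
    if "uncountedTerminatedPods" in status:
        # k8s is still deciding whether terminated pods succeeded or failed.
        return False
    return True
```

A caller would return early and poll again whenever status_is_settled(job.obj["status"]) is False, and only enter the succeeded/active/failed decision tree once it is True.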

How to test the changes?

  • Instructions for manual testing are as follows:
    1. The change is difficult to test. It covers an edge case where (a) the job has completed and (b) k8s has yet to determine the status of the job. For me, 2-5% of jobs fail without the fix.

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@afgane
Contributor

afgane commented Oct 15, 2024

This is a great find, and ultimately also a simple fix!

Member

@mvdbeek mvdbeek left a comment

Makes sense, thanks for figuring this out!

@nuwang
Member

nuwang commented Oct 16, 2024

Nice work! It would have taken some doing to figure this out!

@mvdbeek mvdbeek merged commit 409b790 into galaxyproject:dev Oct 17, 2024
48 of 53 checks passed

This PR was merged without a "kind/" label, please correct.

@mapk-amazon
Contributor Author

Thank you everyone for your support. I have now tested quay.io/galaxyproject/galaxy-min:dev and it works as expected :)

@charesredhat
Contributor

Hi @mvdbeek, when do you think this fix will make it into the latest Galaxy and Helm chart builds for production?
We are waiting on this so we can include it in our own Galaxy build running on Kubernetes.

@mvdbeek
Member

mvdbeek commented Dec 17, 2024

As you can see the milestone is tagged to 24.2, so when that release happens.

@charesredhat
Contributor

Thank you @mvdbeek. Sorry, what I should have asked is: when do you expect 24.2 to be released?

mvdbeek added a commit that referenced this pull request Dec 17, 2024
@mvdbeek
Member

mvdbeek commented Dec 17, 2024

When there are no new or critical bugs, we don't set this in advance. I've backported the fix to 24.1 in #19338 so if you update to the head of the branch you will get the fix.

5 participants