Fixes random job failures in kubernetes #19001
Conversation
This is a great find, and ultimately also a simple fix!
Makes sense, thanks for figuring this out!
Nice work! It would have taken some doing to figure this out!
Co-authored-by: Marius van den Beek <[email protected]>
Co-authored-by: Nuwan Goonasekera <[email protected]>
This PR was merged without a "kind/" label, please correct.
Thank you everyone for your support. I tested it now.
Hi @mvdbeek, when do you think this fix will be pushed into the latest Galaxy and Helm chart builds for production?
As you can see, the milestone is tagged to 24.2, so when that release happens.
Thank you @mvdbeek, sorry, what I should have asked is: when do you expect 24.2 to be released?
[24.1] Backport #19001 kubernetes api client fix
When there are no new or critical bugs, we don't set this in advance. I've backported the fix to 24.1 in #19338, so if you update to the head of that branch you will get the fix.
This fix addresses the random failures of k8s jobs in Galaxy described in galaxyproject/galaxy-helm#490 and the issues mentioned therein. The problem is that the k8s job status may not yet be final while already providing some information:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.30/#jobstatus-v1-batch
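For illustration, a status payload in that intermediate state might look roughly like the following (hypothetical values, not taken from the PR; `job` is assumed to be a pykube Job object as used by Galaxy's kubernetes runner):

```python
# Hypothetical snapshot of job.obj["status"] while k8s is still reconciling:
# the dict is non-empty, so len(job.obj["status"]) == 0 is False, yet the
# succeeded/active/failed counters have not been updated because the
# terminated pod is still listed under uncountedTerminatedPods.
job_status_example = {
    "startTime": "2024-10-01T12:00:00Z",
    "uncountedTerminatedPods": {
        "succeeded": ["8f1c2a5e-pod-uid"],  # pod finished, not yet counted
    },
    # "succeeded", "active" and "failed" are absent or 0 at this point
}
```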
In the previous code, the check
if len(job.obj["status"]) == 0:
only tested whether some status information was present; if so, it was treated as the "final" state of the job and processing continued. However, when the job status contains the field `uncountedTerminatedPods`, k8s is not yet done determining whether the job failed or succeeded. The code then used this intermediate information (for me: 0 succeeded, 0 active and 0 failed) and went through the decision tree to decide what to do. My suggestion is to instead wait for k8s to determine the status of the terminated pods and only then decide how to proceed. This reduced the failure rate from 2-5% to 0% :)
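A minimal sketch of the idea behind the change (not the exact merged code; `job` is again assumed to be a pykube Job, and the helper name is made up for illustration):

```python
def job_status_is_final(job):
    """Return True only once k8s has finished counting terminated pods.

    Sketch of the fix's logic: a non-empty status dict is not enough;
    while uncountedTerminatedPods still lists pod UIDs, the
    succeeded/failed counters are not trustworthy yet.
    """
    status = job.obj.get("status", {})
    if not status:
        # No status reported yet: nothing to evaluate.
        return False
    uncounted = status.get("uncountedTerminatedPods") or {}
    if uncounted.get("succeeded") or uncounted.get("failed"):
        # k8s is still moving pod UIDs from "uncounted" into the
        # succeeded/failed counters; wait for the next status poll.
        return False
    return True
```

Only once such a check passes would the succeeded/active/failed counters be fed into the existing decision tree that marks the Galaxy job as finished or failed.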
How to test the changes?
License