Log job failure even when there are retries configured #6169
base: 8.4.x
73714c8 to 8f20ab0
Unfortunately, I don't think it's this simple.
- I think this diff means that polled log messages for task failure will go back to being duplicated.
- This only covers failure, but submission failure also has retries so may require similar treatment.
- I think the failures before retries are exhausted will now get logged at CRITICAL level rather than INFO.
I think that you have a particular closed issue in mind, but I can't find it... Can you point it out to me?
I think that submission failure is already handled correctly - it certainly is in the simplistic case where you feed it
These are logged at critical - and I think they should be?
This would be consistent with submit failure...
2c7e480 to 3cedf2f
No, I'm not thinking of the other log message duplication issue. The change made here bypassed logic that was used for suppressing duplicate log messages (the 8f20ab0#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eL930
However, in your more recent "fix" commit, you have put this back the way it was before: 3cedf2f#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eR930
This does not apply to submit failure, because submit failure will always log a critical warning through the
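For readers following the duplicate-suppression point, here is a minimal sketch of the general idea: remember which (job, message) pairs have already been logged, so a polled copy of a message that was already received does not produce a second log line. This is not the cylc-flow implementation; `OnceLogger` and its method are hypothetical names used only for illustration.

```python
import logging


class OnceLogger:
    """Hypothetical helper: log each (job, message) pair at most once."""

    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger
        self._seen = set()

    def log(self, level: int, job_id: str, message: str) -> bool:
        """Log the message unless this exact pair was logged before.

        Returns True if a log line was emitted, False if it was suppressed.
        """
        key = (job_id, message)
        if key in self._seen:
            # Same message for the same job (e.g. re-delivered by polling).
            return False
        self._seen.add(key)
        self._logger.log(level, "[%s] %s", job_id, message)
        return True
```

A retry runs as a new job (`1/foo/02` rather than `1/foo/01` in this sketch), so its failure is still logged; only repeats for the same job are dropped.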
3cedf2f to 1341355
(Test failures)
1341355 to ce0498e
These test failures came from a slight change in the nature of the message, caused by moving it: by the time the code reaches
One small typo found.
@wxtim this seems to have diverged a bit from what I thought was agreed above, which was:
Now, if there are retries we only get the retry warning. (Which I think is back to the problem we were trying to fix here, although the logging location has changed to the methods that would support the fix.)
Co-authored-by: Hilary James Oliver <[email protected]> response to review
fc12804 to 5453cac
The execution retry log message is now:
WARNING - [1/foo:waiting] failed/ERR - retrying in ...
Whereas the submission retry log message is still just:
WARNING - [1/foo:waiting] retrying in ...
I think it would make sense to update this too.
one.task_job_mgr._set_retry_timers(
    fail_once, {
        'execution retry delays': [1],
        'submission retry delays': [1]
    })
Less fragile if this was just set in the workflow config for the test?
Does it really make any difference? The aim was to avoid fiddling with one. Can do if you insist.
Surely creating a workflow config with these retry delays set is not any more involved than fiddling the internal retry timers?
Not going to argue. Will change.
Went away and had a look at it - essentially this is more unit-testy than integration test-like: If you set these timers in the config you still need to run this function here to set the retry timers. So I think that I'll leave it.
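To illustrate the kind of check being debated in this thread, here is a self-contained sketch of a test asserting that a failure is logged even when retry delays are configured. It uses only the standard library. `handle_failure` and `ListHandler` are hypothetical stand-ins, not cylc-flow functions or fixtures. The config keys are the ones from the snippet above, and treating the delay values as whole seconds is an assumption.

```python
import logging


class ListHandler(logging.Handler):
    """Collect (level, message) pairs so a test can assert on them."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append((record.levelno, record.getMessage()))


def handle_failure(logger, task, config, submission=False):
    """Hypothetical stand-in for failure handling; True if a retry is set."""
    key = 'submission retry delays' if submission else 'execution retry delays'
    reason = 'submission failed' if submission else 'failed'
    # The failure itself is always logged, whether or not retries remain.
    logger.error('[%s] %s', task, reason)
    delays = config.get(key, [])
    if delays:
        # Assumption: delays are whole seconds, shown in ISO 8601 style.
        logger.warning('[%s] retrying in PT%dS', task, delays[0])
        return True
    return False


def test_failure_logged_when_retries_configured():
    logger = logging.getLogger('retry-sketch')
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        config = {
            'execution retry delays': [1],
            'submission retry delays': [1],
        }
        assert handle_failure(logger, '1/foo', config)
        assert handler.records == [
            (logging.ERROR, '[1/foo] failed'),
            (logging.WARNING, '[1/foo] retrying in PT1S'),
        ]
    finally:
        logger.removeHandler(handler)
```

Passing the delays in as a plain config mapping keeps the test independent of how the timers are stored internally, which is the fragility concern raised above. Whether that is practical in the real integration test depends on the point made in the last comment.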
Co-authored-by: Ronnie Dutta <[email protected]>
Here's a screenshot comparing two execution failures (one retry) with two submission failures (one retry):
Contrary to @MetRonnie's suggestion (I think) you just need to get rid of the
Not really necessary because both entail a full "retry" (i.e., submit the job, then execute it).
Still waiting on my suggestion above, I think. In case it got a bit lost in the log lines:
# submission retry (good)
ERROR - [1/foo/01:preparing] submission failed
INFO - [1/foo/01:preparing] => waiting
WARNING - [1/foo:waiting] retrying in PT5S (after 2025-01-09T11:08:20+13:00)
# execution retry (needs tweak)
ERROR - [1/foo/01:running] failed/ERR
INFO - [1/foo/01:running] => waiting
WARNING - [1/foo:waiting] failed/ERR - retrying in PT5S (after 2025-01-09T11:05:49+13:00)
(Pretty minor, but there's no need to double-log the
Done, but not pushed, because I was also looking at Ronnie's comments about the tests.
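As a rough illustration of the log shape requested above, the sketch below builds the records for one failure: the failure reason appears once at ERROR level, and the retry warning carries only the delay. `failure_log_records` is a hypothetical helper, not part of cylc-flow, and it leaves out the INFO state-change line (`=> waiting`) that sits between the two.

```python
import logging
from typing import List, Optional, Tuple


def failure_log_records(
    job: str,
    task: str,
    reason: str,
    retry_delay: Optional[str] = None,
    retry_after: Optional[str] = None,
) -> List[Tuple[int, str]]:
    """Return (level, message) pairs for one failure (hypothetical helper)."""
    # The failure reason is logged exactly once, against the job.
    records = [(logging.ERROR, f"[{job}] {reason}")]
    if retry_delay is not None:
        # The retry warning does not repeat the reason.
        message = f"[{task}] retrying in {retry_delay}"
        if retry_after is not None:
            message += f" (after {retry_after})"
        records.append((logging.WARNING, message))
    return records
```

With this shape, execution and submission retries differ only in the reason on the ERROR line, for example `failed/ERR` versus `submission failed`.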
Closes #6151
Check List
- CONTRIBUTING.md and added my name as a Code Contributor.
- setup.cfg (and conda-environment.yml if present).
- CHANGES.md entry included if this is a change that can affect users
- ?.?.x branch.