
Log job failure even when there are retries configured #6169

Open

wants to merge 13 commits into base: 8.4.x from fix.task_fail_not_logged_if_retries
Conversation

@wxtim wxtim commented Jun 25, 2024

Closes #6151

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users.
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from 73714c8 to 8f20ab0 Compare June 25, 2024 13:57
@wxtim wxtim self-assigned this Jun 25, 2024
@wxtim wxtim added bug Something is wrong :( small labels Jun 25, 2024
@wxtim wxtim added this to the 8.3.1 milestone Jun 25, 2024
@wxtim wxtim linked an issue Jun 25, 2024 that may be closed by this pull request

@oliver-sanders oliver-sanders left a comment

Unfortunately, I don't think it's this simple.

  • I think this diff means that polled log messages for task failure will go back to being duplicated.
  • This only covers failure, but submission failure also has retries so may require similar treatment.
  • I think the failures before retries are exhausted will now get logged at CRITICAL level rather than INFO.

@wxtim wxtim marked this pull request as draft June 26, 2024 14:45
@oliver-sanders oliver-sanders modified the milestones: 8.3.1, 8.3.x Jun 27, 2024

wxtim commented Jul 15, 2024

I think this diff means that polled log messages for task failure will go back to being duplicated.

I think that you have a particular closed issue in mind, but I can't find it... Can you point it out to me?

This only covers failure, but submission failure also has retries so may require similar treatment.

I think that submission failure is already handled correctly - it certainly is in the simplistic case where you feed it platform=garbage: you get a submission failed message at CRITICAL level:

CRITICAL - [1/bar/01:preparing] submission failed
CRITICAL - [1/bar/02:preparing] submission failed
CRITICAL - [1/bar/03:preparing] submission failed

These are logged at critical - and I think they should be?

I think the failures before retries are exhausted will now get logged at CRITICAL level rather than INFO.

This would be consistent with submit failure...

@wxtim wxtim requested a review from oliver-sanders July 15, 2024 10:17
@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from 2c7e480 to 3cedf2f Compare July 15, 2024 10:19
@wxtim wxtim marked this pull request as ready for review July 15, 2024 10:20
@oliver-sanders

I think this diff means that polled log messages for task failure will go back to being duplicated.

I think that you have a particular closed issue in mind, but I can't find it... Can you point it out to me?

No, I'm not thinking of the other log message duplication issue.

The change made here bypassed logic that was used for suppressing duplicate log messages (the f'{FAIL_MESSAGE_PREFIX}ERR' bit):

8f20ab0#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eL930

However, in your more recent "fix" commit, you have put this back the way it was before:

3cedf2f#diff-d6de42ef75ecc801c26a6be3a9dc4885b64db89d32bce6f07e319595257b9b2eR930
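
For context, a minimal sketch of the kind of duplicate-suppression check being discussed (FAIL_MESSAGE_PREFIX is named in the diff; the function, its arguments and the prefix value here are illustrative, not the actual cylc-flow code):

import logging

LOG = logging.getLogger(__name__)

# Illustrative value only - in cylc-flow the real messages look like 'failed/ERR'.
FAIL_MESSAGE_PREFIX = 'failed/'

def process_failure_message(itask_id, message, seen_messages):
    """Log a task failure unless the same message has already been seen.

    Polled messages can repeat what the job already reported, so the
    f'{FAIL_MESSAGE_PREFIX}ERR' style message should only be logged once.
    """
    if message in seen_messages:
        # Duplicate (e.g. a polled copy of the job's own message): suppress it.
        return False
    seen_messages.add(message)
    LOG.critical('[%s] %s', itask_id, message)
    return True

# The second (polled) copy of the message is suppressed:
seen = set()
process_failure_message('1/foo/01', f'{FAIL_MESSAGE_PREFIX}ERR', seen)  # logged
process_failure_message('1/foo/01', f'{FAIL_MESSAGE_PREFIX}ERR', seen)  # suppressed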

@oliver-sanders oliver-sanders dismissed their stale review July 15, 2024 12:32

because I'm blind

@wxtim wxtim marked this pull request as draft July 16, 2024 15:36

wxtim commented Jul 22, 2024

This does not apply to submit failure, because submit failure will always log a critical warning through the jobs-submit command.

WARNING - platform: mulberry - Could not connect to mymachine.
    * mymachine has been added to the list of unreachable hosts
    * remote-init will retry if another host is available.
ERROR - [jobs-submit cmd] (remote init)
    [jobs-submit ret_code] 1
CRITICAL - [1/bar/02:preparing] submission failed
INFO - [1/bar/02:preparing] => waiting
WARNING - [1/bar:waiting] retrying in PT1S (after 2024-07-22T14:51:37+01:00)

@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from 3cedf2f to 1341355 Compare July 22, 2024 13:59
@wxtim wxtim marked this pull request as ready for review July 22, 2024 13:59

@MetRonnie MetRonnie left a comment

(Test failures)

@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from 1341355 to ce0498e Compare July 26, 2024 09:50
@wxtim wxtim requested a review from MetRonnie July 26, 2024 10:14
changes.d/fix.6169.md (outdated review thread, resolved)
cylc/flow/task_events_mgr.py (review thread, resolved)
cylc/flow/task_events_mgr.py (outdated review thread, resolved)
@MetRonnie MetRonnie removed the request for review from markgrahamdawson July 31, 2024 11:39

wxtim commented Nov 6, 2024

These test failures were caused by a slight change in the message that resulted from moving it: by the time the code reaches _process_message_failed, some of the information required for the original error messages (the task state) is no longer available, or has changed.

@wxtim wxtim marked this pull request as ready for review November 6, 2024 14:35

@hjoliver hjoliver left a comment

One small typo found.

cylc/flow/task_events_mgr.py (outdated review thread, resolved)
@oliver-sanders oliver-sanders requested review from hjoliver and removed request for oliver-sanders November 8, 2024 16:17

hjoliver commented Dec 5, 2024

@wxtim this seems to have diverged a bit from what I thought was agreed above, which was:

  • log the job (or submit) failure as ERROR whether there are retries or not
  • then log the retry message at WARNING level, if there are retries

Now, if there are retries we only get the retry warning. (Which I think is back to the problem we were trying to fix here, although the logging location has changed to the methods that would support the fix).
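
For reference, a minimal sketch of the behaviour described in those two bullet points (the function and its arguments are illustrative, not the actual _process_message_failed signature):

import logging

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)

def log_failure_and_retry(itask_id, failure_message, retry_delay=None):
    # Always log the failure itself, whether or not there are retries.
    LOG.error('[%s] %s', itask_id, failure_message)
    if retry_delay is not None:
        # Then, if there are retries, log the retry separately at WARNING level.
        LOG.warning('[%s] retrying in %s', itask_id, retry_delay)

# Expected to produce both an ERROR and a WARNING line when a retry is configured:
log_failure_and_retry('1/foo/01:running', 'failed/ERR', retry_delay='PT5S')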

Co-authored-by: Hilary James Oliver <[email protected]>

response to review
@wxtim wxtim force-pushed the fix.task_fail_not_logged_if_retries branch from fc12804 to 5453cac Compare December 6, 2024 09:34
@wxtim wxtim requested a review from oliver-sanders December 6, 2024 10:01

@MetRonnie MetRonnie left a comment

The execution retry log message is now:

WARNING - [1/foo:waiting] failed/ERR - retrying in ...

Whereas the submission retry log message is still just:

WARNING - [1/foo:waiting] retrying in ...

I think it would make sense to update this too.

Comment on lines +187 to +191
one.task_job_mgr._set_retry_timers(
    fail_once, {
        'execution retry delays': [1],
        'submission retry delays': [1]
    })
Member

Less fragile if this was just set in the workflow config for the test?

Member Author

Does it really make any difference? - the aim was to avoid fiddling with "one". Can do if you insist.

Member

Surely creating a workflow config with these retry delays set is not any more involved than fiddling the internal retry timers?

Member Author

Not going to argue. Will change.

Member Author

Went away and had a look at it - essentially this is more unit-test-like than integration-test-like: if you set these delays in the workflow config you still need to run this function here to set the retry timers. So I think that I'll leave it.
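
For illustration, the config-based alternative discussed above would look roughly like the following (a sketch only: the retry-delay keys match the runtime settings named in the snippet above, but the graph and task name are assumptions, and the test fixture plumbing that consumes this config is not shown):

# Hypothetical workflow config for the test, with the retry delays set in
# the runtime section instead of via _set_retry_timers:
workflow_config = {
    'scheduling': {'graph': {'R1': 'fail_once'}},
    'runtime': {
        'fail_once': {
            'execution retry delays': 'PT1S',
            'submission retry delays': 'PT1S',
        },
    },
}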

tests/integration/test_task_events_mgr.py (outdated review thread, resolved)
@wxtim wxtim requested a review from MetRonnie January 7, 2025 14:03

hjoliver commented Jan 8, 2025

Here's a screenshot comparing two execution failures (one retry) with two submission failures (one retry):

[screenshot: frap-2]

Contrary to @MetRonnie's suggestion (I think), you just need to get rid of the failed/ERR bit in the execution retry message. That information is already logged as the error. The warning is just about the resulting retry.

@oliver-sanders oliver-sanders removed their request for review January 8, 2025 13:54

hjoliver commented Jan 8, 2025

Or have "execution retrying in..." and "submission retrying in..."?

Not really necessary because both entail a full "retry" (i.e., submit the job, then execute it).


hjoliver commented Jan 8, 2025

Still waiting on my suggestion above, I think. In case it got a bit lost in the log lines:

# submission retry (good)
ERROR - [1/foo/01:preparing] submission failed
INFO - [1/foo/01:preparing] => waiting
WARNING - [1/foo:waiting] retrying in PT5S (after 2025-01-09T11:08:20+13:00)
# execution retry (needs tweak)
ERROR - [1/foo/01:running] failed/ERR
INFO - [1/foo/01:running] => waiting
WARNING - [1/foo:waiting] failed/ERR - retrying in PT5S (after 2025-01-09T11:05:49+13:00)

(Pretty minor, but there's no need to double-log the failed/ERR bit).

@hjoliver hjoliver modified the milestones: 8.3.7, 8.4.1 Jan 8, 2025

wxtim commented Jan 9, 2025

Still waiting on my suggestion #6169 (comment), I think. In case it got a bit lost in the log lines:

Done, but not pushed, because I was also looking at Ronnie's comments about the tests.

@wxtim wxtim requested a review from oliver-sanders January 9, 2025 13:47
@MetRonnie MetRonnie changed the base branch from 8.3.x to 8.4.x January 9, 2025 18:33
@oliver-sanders oliver-sanders removed their request for review January 13, 2025 10:39
Labels: bug (Something is wrong :(), small

Successfully merging this pull request may close these issues:
scheduler: task failure not logged when retries are configured

4 participants