catching error codes under PBS & LSF / aprun & mpiexec #1071

Open
JanStreffing opened this issue Oct 17, 2023 · 5 comments
Assignees
Labels
bug Something isn't working documentation Improvements or additions to documentation error handling better error output required help wanted Extra attention is needed

Comments

@JanStreffing
Contributor

JanStreffing commented Oct 17, 2023

This issue is closely related to issues #142, #148, #262 & #365.

All of those issues deal with model crashes followed by esm_tools attempting to run the next leg of the experiment. This results in messed-up date files, output files in outdata that contain only half of the timesteps and have to be deleted manually, and so on.

With release 6, this was fixed for Slurm: esm_tools checks the Slurm exit code and, if it does not indicate a successful run, does not run post and resubmit.

We need that same check for PBS and LSF / aprun & mpiexec.

I had this error on aleph last week:

[OIFS]1280:  18:07:38 STEP     14845 H=1855:37 +CPU= 18.456     
[OIFS]1280:           STEP14845 :## EC_MEMINFO    1 nid00087    3363    3925       0      1381   14402      418    1790    167938    2852      2.2   0.0 s/p
[NID 00305] 2023-10-09 18:08:34 Apid 2397532 killed. Received node event ec_node_failed for nid 312 

real    126m34.158s
user    0m0.584s
sys 0m0.820s

==============================================================================
::: Executing the step:  _read_date_file    (step [1/20] of the job:  prepare)
==============================================================================

===================================================================================
::: Executing the step:  _update_run_in_chunk    (step [2/20] of the job:  prepare)

and esm_tools did not catch on to the fact that this run had crashed. I also saw this on the LSF machines while in Korea, where at times the old "first leg crashed, but esm_tools keeps resubmitting itself endlessly" behavior occurred.
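The check requested above can be sketched roughly as follows. This is only a minimal illustration, not the esm_tools implementation: `run_model`, `next_phases`, and the phase names are hypothetical, and it assumes that the launcher (aprun or mpiexec) actually propagates a non-zero exit code when the model dies, which may not hold for every node failure.

```python
import subprocess
import sys


def run_model(launcher_cmd):
    """Run the launcher command (hypothetical helper) and return its exit code."""
    completed = subprocess.run(launcher_cmd)
    return completed.returncode


def next_phases(exit_code):
    """Return the phases to run after the compute job (hypothetical phase names)."""
    if exit_code != 0:
        # A crashed leg must not be post-processed or resubmitted.
        return []
    return ["post", "resubmit"]


if __name__ == "__main__":
    # Stand-in for an aprun/mpiexec call: a child process that exits with code 3.
    code = run_model([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(code, next_phases(code))
```

If the launcher returns 0 despite a killed application, an exit-code gate like this is not sufficient on its own and the log file has to be inspected as well.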

@JanStreffing JanStreffing added bug Something isn't working help wanted Extra attention is needed error handling better error output required labels Oct 17, 2023
@JanStreffing JanStreffing changed the title from "Catching node failure under PBS (& LSF)" to "catching error codes under PBS & LSF / aprun & mpiexec" Oct 17, 2023
@mandresm
Contributor

These are the check_error commands I was talking about.

check_error:
    "exit signals: Killed":
        frequency: 30
        method: "kill"
        message: "PBS ERROR: pbs ended with an error, exiting."
        file: "${output_path}${expid}_${general.setup_name}_execution_${general.current_date!syear!smonth!sday}-${general.end_date!syear!smonth!sday}_@[email protected]"
    "Exiting due to errors. Application aborted":
        frequency: 30
        method: "kill"
        message: "PBS ERROR: pbs ended with an error, exiting."
        file: "${output_path}${expid}_${general.setup_name}_execution_${general.current_date!syear!smonth!sday}-${general.end_date!syear!smonth!sday}_@[email protected]"

You probably want to add an error that is triggered by the message Apid ${general.launcher_pid} killed.

@mandresm
Contributor

It's confusing because the message property is actually the string that triggers the error, while the second-level keys (e.g. "exit signals: Killed") are what ESM-Tools prints as the error message.
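To make the roles of the two strings concrete, here is a small sketch of the semantics described above. It is not the ESM-Tools code: `CHECK_ERROR` and `triggered_errors` are hypothetical, the "node failure: Apid killed" entry is an invented example built from the log line quoted earlier in this issue, and plain substring matching is an assumption (the real matcher may behave differently).

```python
# Hypothetical in-memory mirror of a check_error block.
# Key   = the text that is reported as the error.
# message = the string searched for in the log file (the trigger).
CHECK_ERROR = {
    "exit signals: Killed": {
        "frequency": 30,
        "method": "kill",
        "message": "PBS ERROR: pbs ended with an error, exiting.",
    },
    "node failure: Apid killed": {  # hypothetical entry, not in the original config
        "frequency": 30,
        "method": "kill",
        "message": "Apid 2397532 killed",
    },
}


def triggered_errors(log_text, checks):
    """Return the names of all checks whose trigger string occurs in log_text."""
    # Assumption: a trigger fires on a plain substring match.
    return [name for name, spec in checks.items() if spec["message"] in log_text]


if __name__ == "__main__":
    log = (
        "[NID 00305] 2023-10-09 18:08:34 Apid 2397532 killed. "
        "Received node event ec_node_failed for nid 312"
    )
    for name in triggered_errors(log, CHECK_ERROR):
        print("ERROR:", name)
```

In a real configuration the process ID would come from a variable such as ${general.launcher_pid} rather than being hard-coded as it is here.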

@mandresm
Contributor

The documentation is incomplete: https://esm-tools.readthedocs.io/en/latest/yaml.html#error-handling-and-warning-syntax

Tagging this as Documentation

@mandresm mandresm added the documentation Improvements or additions to documentation label Oct 17, 2023
@mandresm mandresm assigned mandresm and unassigned nwieters Nov 20, 2023

This issue has been inactive for the last 365 days. It will now be marked as stale and closed after 30 days of further inactivity. Please add a comment to reset this automatic closing of this issue or close it if solved.

@github-actions github-actions bot added the Stale label Nov 19, 2024
@mandresm
Contributor

mandresm commented Dec 2, 2024

Do not close!

@github-actions github-actions bot removed the Stale label Dec 2, 2024