catching error codes under PBS & LSF / aprun & mpiexec #1071
Comments
These are the relevant lines: esm_tools/configs/other_software/batch_system/pbs.yaml, lines 62 to 72 in bbfb07d
You probably want to add an error to be triggered with the message
It's confusing because the
The documentation is incomplete: https://esm-tools.readthedocs.io/en/latest/yaml.html#error-handling-and-warning-syntax
Tagging this as Documentation.
This issue has been inactive for the last 365 days. It will now be marked as stale and closed after 30 days of further inactivity. Please add a comment to reset this automatic closing of this issue or close it if solved.
Do not close!
This issue is closely related to issues #142, #148, #262 & #365.
All of those issues deal with model crashes followed by esm_tools attempting to run the next leg of the experiment. This results in messed-up date files, output files in outdata that contain only half the timesteps and need to be deleted manually, etc.
With release 6, this was fixed for Slurm: esm_tools checks the Slurm exit code and, if it does not indicate a successful run, skips post and resubmit.
We need the same check for PBS and LSF / aprun & mpiexec.
I had this error on aleph last week:
and esm_tools did not catch on to the fact that this run had crashed. I also saw this on the LSF machines while in Korea, where at times the old behavior of "first leg crashed, but esm_tools keeps resubmitting itself endlessly" was happening.
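To make the request concrete, here is a minimal sketch of the kind of check I mean. It is not the actual esm_tools code: `run_leg`, `finish_leg`, `post` and `resubmit` are hypothetical names, and it assumes the launcher (aprun, mpiexec, ...) returns a non-zero exit status when the model crashes, which may not hold on every system.

```python
import subprocess
import sys


def run_leg(launcher_cmd):
    """Run the launcher command (a list of arguments) and return its exit code."""
    result = subprocess.run(launcher_cmd)
    return result.returncode


def finish_leg(exit_code, post, resubmit):
    """Call post() and resubmit() only if the leg exited cleanly.

    post and resubmit are hypothetical callables standing in for the
    post-processing and resubmission steps.
    Returns True if the next leg was scheduled, False otherwise.
    """
    if exit_code != 0:
        # The leg crashed: leave date files and outdata untouched.
        print(f"Leg failed with exit code {exit_code}; skipping post and resubmit.",
              file=sys.stderr)
        return False
    post()
    resubmit()
    return True
```

For example, `finish_leg(run_leg(["mpiexec", "-n", "4", "./model"]), post, resubmit)` would stop the chain after a crashed leg instead of submitting the next one. The command line here is only a placeholder.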