GCHP 14.3.1 Disable writing final checkpoint file? #461
Hi @cbutenhoff, I think the only way to turn this off is to edit the MAPL source code and recompile. However, why would you want to do this? If gcchem_internal_checkpoint exists, then either GCHP is actively using it or a run did not complete properly, in which case the file was never renamed to the expected input restart format for later use. If you have multiple GCHP runs writing to the same Restarts folder, then you might write to the same restart file. If gcchem_internal_checkpoint is only still around because a previous run failed and you allow clobbering it, how would you know that a previous run failed? As you say, using a single run directory for multiple runs at the same time is bad practice. There are several ways things could go wrong, and this checkpoint issue is just one of them.
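One way to act on the point above is to check for a leftover checkpoint before submitting a new job, rather than clobbering it silently. The following is a minimal sketch, not part of GCHP: the function name `find_leftover_checkpoint` is hypothetical, and it assumes the file is named gcchem_internal_checkpoint and sits directly in the Restarts folder of the run directory, as described in this thread.

```python
import os

# File name taken from this thread; adjust if your setup differs (assumption).
CHECKPOINT_NAME = "gcchem_internal_checkpoint"


def find_leftover_checkpoint(restarts_dir):
    """Return the path of a leftover checkpoint file, or None if there is none.

    A leftover file suggests that another run is still using the directory
    or that a previous run did not finish and never renamed its checkpoint.
    """
    candidate = os.path.join(restarts_dir, CHECKPOINT_NAME)
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    # "Restarts" is assumed to be relative to the current run directory.
    leftover = find_leftover_checkpoint("Restarts")
    if leftover is not None:
        raise SystemExit(
            f"Found {leftover}: a run may be active or may have failed. "
            "Investigate before starting another job."
        )
    print("No leftover checkpoint found.")
```

Running a check like this at the top of a submission script makes a failed earlier run visible instead of hiding it behind an overwritten file.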
Thanks @lizziel. If I have Midrun_Checkpoint=OFF, then GCHP should only write gcchem_internal_checkpoint at the end of the run, correct? It seems like it is taking a long time for this file to write. By monitoring the size of the file with 'ls -l', I found it took at least 30 minutes for this file to write, which prevents the job from exiting the cluster queue. For a C24 run, do you know roughly how long it should take to write this file?
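The manual 'ls -l' monitoring described above can be automated. Here is a rough sketch that polls the file size and reports how long it takes to stop changing. The function name, the polling interval, and the "unchanged for N polls" rule are illustrative assumptions, and a stable size only suggests that writing has stopped; it does not prove the file is complete.

```python
import os
import time


def time_until_size_stable(path, interval=5.0, stable_polls=3,
                           max_wait=3600.0, sleep=time.sleep):
    """Poll the size of `path` until it is unchanged for `stable_polls` polls.

    Returns (elapsed_seconds, final_size_in_bytes). Raises TimeoutError if
    the size has not settled within `max_wait` seconds.
    """
    start = time.monotonic()
    last_size = -1
    unchanged = 0
    while unchanged < stable_polls:
        if time.monotonic() - start > max_wait:
            raise TimeoutError(f"{path} did not settle within {max_wait} s")
        # A missing file is reported as -1 and never counts as stable.
        size = os.path.getsize(path) if os.path.exists(path) else -1
        if size >= 0 and size == last_size:
            unchanged += 1
        else:
            unchanged = 0
            last_size = size
        if unchanged < stable_polls:
            sleep(interval)
    return time.monotonic() - start, last_size


if __name__ == "__main__":
    # Path assumed relative to the run directory, as in this thread.
    elapsed, size = time_until_size_stable("Restarts/gcchem_internal_checkpoint")
    print(f"Size settled at {size} bytes after about {elapsed:.0f} s")
```

Logging the elapsed time alongside the final size gives a concrete number to compare against the expected write time for a given resolution.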
The mid-run checkpoints write to different filenames since each one includes the date. The long restart write time could be caused by your MPI. Which one are you using? We recommend OpenMPI, since IntelMPI can sometimes cause problems like this. Also, how many cores are you using? You could try turning on the restart write O-server. It might also be helpful to look through the existing GCHP GitHub issues. Use the search bar to search for "restart".
To answer your question about writing the restart, it should take a few seconds or less for C24.
Thanks @lizziel. Something is definitely going on then, because it's taking about 40 minutes to write gcchem_internal_checkpoint. I also noticed that GCHP did not rename the file after writing, and it's over 600M (the C24 restart file from the GCHP distribution is 391M). I'm using OpenMPI 4.1.4 and 288 cores. I'll try turning on the O-server. This all may just be a consequence of running multiple jobs in the same run directory. I'll see if this repeats if I run a single job.
If gcchem_internal_checkpoint was not renamed at the end, then your job either timed out or failed. Do you have log files to share?
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the issue from closing.
@cbutenhoff, are you all set with this issue? It will be closed soon due to inactivity.
Thanks for pinging me @lizziel. I haven't had a chance to run the O-server test yet, but I plan to tomorrow. If you could leave this issue open until I get those results, I'd appreciate it!
Your name
Chris Butenhoff
Your affiliation
Portland State University
Please provide a clear and concise description of your question or discussion topic.
Although I realize this probably isn't best practice, I am running a number of GCHP jobs from the same run directory so they all write their checkpoint files to the same Restart directory.
If a gcchem_internal_checkpoint file already exists, a job hangs (is not removed from the SLURM job queue) when trying to write its final checkpoint with an error, apparently because ncwrite is set to NO CLOBBER in NetCDF4_FileFormatter.F90 (according to my print statements).
I can imagine workarounds, including changing the source code to allow clobbering, but is there a way to simply disable writing this final checkpoint file?