GCHP 14.3.1 Disable writing final checkpoint file? #461

Open
cbutenhoff opened this issue Nov 12, 2024 · 9 comments
Labels: category: Question (Further information is requested); topic: Restart Files (Related to GCHP restart files)

Comments

@cbutenhoff

Your name

Chris Butenhoff

Your affiliation

Portland State University

Please provide a clear and concise description of your question or discussion topic.

Although I realize this probably isn't best practice, I am running a number of GCHP jobs from the same run directory, so they all write their checkpoint files to the same Restarts directory.

If a gcchem_internal_checkpoint file already exists, a job hangs (it is not removed from the SLURM job queue) when trying to write its final checkpoint, and reports this error:

...

Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_FILE:Restarts/gcchem_internal_checkpoint
Character Resource Parameter: GCHPchem_INTERNAL_CHECKPOINT_TYPE:pnc4
Using parallel NetCDF for file: Restarts/gcchem_internal_checkpoint
CB NetCDF4_FileFormatter.F90 status=         -35
CB NetCDF4_FileFormatter.F90 IOR=        4100
CB NetCDF4_FileFormatter.F90 mode=           4
CB NetCDF4_FileFormatter.F90 NF90_=        4096
CB NetCDF4_FileFormatter.F90 file=Restarts/gcchem_internal_checkpoint
CB NetCDF4_FileFormatter.F90 err=NetCDF: File exists && NC_NOCLOBBER
pe=00000 FAIL at line=00181    NetCDF4_FileFormatter.F90                <status=-35>
pe=00000 FAIL at line=03828    NCIO.F90                                 <status=-35>
pe=00000 FAIL at line=04081    NCIO.F90                                 <status=-35>
pe=00000 FAIL at line=05807    MAPL_Generic.F90                         <status=-35>
pe=00000 FAIL at line=02124    MAPL_Generic.F90                         <status=-35>
pe=00000 FAIL at line=03535    Chem_GridCompMod.F90                     <status=-35>
pe=00000 FAIL at line=01807    MAPL_Generic.F90                         <status=-35>
pe=00000 FAIL at line=02053    MAPL_Generic.F90                         <status=-35>
pe=00000 FAIL at line=00779    GCHP_GridCompMod.F90                     <status=-35>
pe=00000 FAIL at line=01807    MAPL_Generic.F90                         <status=-35>
pe=00000 FAIL at line=00873    MAPL_CapGridComp.F90                     <status=-35>

This appears to happen because the file is opened in NO CLOBBER mode in NetCDF4_FileFormatter.F90 (the lines prefixed with CB are my own print statements).

I can imagine workarounds, including changing the source code to allow clobbering, but is there a way to simply disable writing this final checkpoint file?
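
For anyone decoding the debug output above, the printed values are consistent with the standard netCDF create-mode flags. The following is only an illustrative Python sketch, not MAPL code. The constant values are those of the netCDF Fortran 90 interface (NF90_NOCLOBBER = 4, NF90_NETCDF4 = 4096, NF90_EEXIST = -35). Reading the CB labels "mode", "NF90_" and "IOR" as those two flags and their bitwise OR is an assumption, and the helper name decode_create_mode is hypothetical.

```python
# Flag and error values from the netCDF Fortran 90 interface.
NF90_NOCLOBBER = 0x0004  # 4: fail instead of overwriting an existing file
NF90_NETCDF4 = 0x1000    # 4096: create a netCDF-4/HDF5 file
NF90_EEXIST = -35        # "NetCDF: File exists && NC_NOCLOBBER"


def decode_create_mode(mode: int) -> list[str]:
    """Hypothetical helper: name the known flags set in a create mode."""
    names = {NF90_NOCLOBBER: "NF90_NOCLOBBER", NF90_NETCDF4: "NF90_NETCDF4"}
    return [name for bit, name in names.items() if mode & bit]


# Assumed reading of the log: mode=4 OR'ed with NF90_=4096 gives IOR=4100.
ior = NF90_NOCLOBBER | NF90_NETCDF4
flags = decode_create_mode(ior)
print(ior, flags)
```

Under that reading, status -35 is exactly what a create call is expected to return when the no-clobber bit is set and Restarts/gcchem_internal_checkpoint already exists.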

@cbutenhoff cbutenhoff added the category: Question Further information is requested label Nov 12, 2024
@lizziel
Contributor

lizziel commented Nov 13, 2024

Hi @cbutenhoff, I think the only way to turn this off is to edit the MAPL source code and recompile. However, why would you want to do this? If gcchem_internal_checkpoint exists, it means either that GCHP is actively using it or that a run did not complete properly, in which case it does not get renamed to the expected input restart format for later use. If you have multiple GCHP runs writing to the same Restarts folder, they might write to the same restart file. If gcchem_internal_checkpoint is still around only because a previous run failed and you allow clobbering it, how would you know that the previous run failed? As you say, using a single run directory for multiple simultaneous runs is bad practice: there are several ways things could go wrong, and this checkpoint issue is just one of them.
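
A minimal sketch of treating the leftover file as a signal rather than clobbering it: check for it before launching and stop if it is there. This is not part of GCHP or its run scripts. The function name leftover_checkpoint is hypothetical, and the temporary directory only stands in for a run directory's Restarts folder.

```python
import tempfile
from pathlib import Path

# File name taken from the log earlier in this thread.
CHECKPOINT_NAME = "gcchem_internal_checkpoint"


def leftover_checkpoint(restart_dir):
    """Return the path of an existing checkpoint file, or None if absent."""
    candidate = Path(restart_dir) / CHECKPOINT_NAME
    return candidate if candidate.exists() else None


# Demo in a throwaway directory (stand-in for <run dir>/Restarts).
with tempfile.TemporaryDirectory() as tmp:
    restarts = Path(tmp) / "Restarts"
    restarts.mkdir()
    before = leftover_checkpoint(restarts)   # None: nothing left behind
    (restarts / CHECKPOINT_NAME).touch()     # simulate a failed or active run
    after = leftover_checkpoint(restarts)
    found_name = after.name
```

A submit wrapper could refuse to start a job whenever this returns a path, which keeps the "a previous run failed or is still running" information visible.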

@lizziel lizziel self-assigned this Nov 13, 2024
@cbutenhoff
Author

Thanks @lizziel.

If I have Midrun_Checkpoint=OFF, then GCHP should only write gcchem_internal_checkpoint at the end of the run, correct?

It seems like it is taking a long time for this file to write. By monitoring the size of the file with 'ls -l', I found it took at least 30 minutes to write, which prevents the job from exiting the cluster queue.

For a C24 run, do you know about how long it should take to write this file?

@lizziel
Contributor

lizziel commented Nov 14, 2024

The mid-run checkpoints write to different filenames since each one includes the date. gcchem_internal_checkpoint is reserved for end-of-run restart.

A slow restart write could be caused by your MPI. Which one are you using? We recommend OpenMPI since IntelMPI can sometimes cause problems like this. Also, how many cores are you using? You could try turning on the restart write O-server in GCHP.rc. Look for the entry WRITE_RESTART_BY_OSERVER.
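
To check what a run directory currently has for that entry, something like the sketch below could be used. It assumes GCHP.rc uses `KEY: VALUE` lines with `#` comments, which matches the resource parameters echoed in the log earlier in this thread. The helper name rc_value is hypothetical, and the sample text is an invented excerpt (including the NO value), not a copy of a real GCHP.rc.

```python
def rc_value(rc_text, key):
    """Return the value of the first `KEY: VALUE` entry for key, else None."""
    for line in rc_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = stripped.partition(":")
        if sep and name.strip() == key:
            return value.strip()
    return None


# Hypothetical excerpt; a real GCHP.rc contains many more entries.
sample = """\
# restart settings (illustrative values only)
WRITE_RESTART_BY_OSERVER: NO
GCHPchem_INTERNAL_CHECKPOINT_FILE: Restarts/gcchem_internal_checkpoint
GCHPchem_INTERNAL_CHECKPOINT_TYPE: pnc4
"""

oserver = rc_value(sample, "WRITE_RESTART_BY_OSERVER")
checkpoint_type = rc_value(sample, "GCHPchem_INTERNAL_CHECKPOINT_TYPE")
print(oserver, checkpoint_type)
```

In practice the text would come from reading the GCHP.rc file in the run directory instead of the inline sample.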

It might also be helpful to look through the existing GCHP GitHub issues. Use the search bar to search for "restart".

@lizziel
Contributor

lizziel commented Nov 14, 2024

To answer your question about writing the restart, it should take a few seconds or less for C24.

@cbutenhoff
Author

Thanks @lizziel.

Something is definitely going on then, because it's taking about 40 minutes to write gcchem_internal_checkpoint. I also noticed that GCHP did not rename the file after writing, and it's over 600M (the C24 restart file from the GCHP distribution is 391M).
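
For scale, here is the arithmetic on those numbers as a rough sketch. It treats "600M" as 600 MiB and "about 40 minutes" as exactly 40, both of which are approximations of what was reported.

```python
size_mib = 600       # "over 600M" from ls, taken as a lower bound in MiB
write_minutes = 40   # "about 40 minutes"

rate_mib_per_s = size_mib / (write_minutes * 60)
print(f"{rate_mib_per_s:.2f} MiB/s")
```

That works out to roughly a quarter of a MiB per second, far below what a write expected to finish in a few seconds would need.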

I'm using OpenMPI 4.1.4 and 288 cores. I'll try turning on the O-server. This all may just be a consequence of running multiple jobs in the same run directory. I'll see if this repeats if I run a single job.

@lizziel
Contributor

lizziel commented Nov 18, 2024

If gcchem_internal_checkpoint was not renamed at the end then your job either timed out or failed. Do you have log files to share?

@yantosca yantosca added the topic: Restart Files Related to GCHP restart files label Dec 2, 2024

github-actions bot commented Jan 2, 2025

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the issue from closing.

@github-actions github-actions bot added the stale No recent activity on this issue label Jan 2, 2025
@lizziel
Contributor

lizziel commented Jan 8, 2025

@cbutenhoff, are you all set with this issue? It will be closed soon due to inactivity.

@github-actions github-actions bot removed the stale No recent activity on this issue label Jan 9, 2025
@cbutenhoff
Author

Thanks for pinging me @lizziel. I haven't had a chance to run the O-server test yet, but I plan to tomorrow. If you could leave this issue open until I get those results, I'd appreciate it!


3 participants