Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report errors to BOTH ESMF Pet files AND log files #527

Open
ekluzek opened this issue Jan 18, 2025 · 8 comments · May be fixed by #532
Open

Report errors to BOTH ESMF Pet files AND log files #527

ekluzek opened this issue Jan 18, 2025 · 8 comments · May be fixed by #532
Labels
enhancement New feature or request

Comments

@ekluzek
Copy link
Collaborator

ekluzek commented Jan 18, 2025

When things go wrong (and in an experimental model they always will) good and graceful error messaging is the most helpful thing to figure out the problem and get through it quickly. When running the model, too much logging can be problematic, but when debugging an error -- there can almost never be too much messaging on the error.

I realized that most of the problem I had with #524 was that the error messages only went to the ESMF PET files. In some of my cases I didn't see the PET files with the error. In others when you write by changing CREATE_ESMF_PET_FILES to TRUE it writes out files for each task and in some cases the error only shows up in one of them, so it's hard to find it amongst the files. And since the files have to be opened when the error occurs, if anything goes wrong with that process -- it's possible the error won't be reported anywhere. Since that XML variable above is normally FALSE you also don't see PET files for a successful run -- so you don't get used to seeing them and working with them.

As such it would be helpful to not only report the error to ESMF -- but also to the log files. I think this is a good usability enhancement that would be useful especially for new users. New users learn quickly to look at the log files and especially the cesm.log file. When error messaging doesn't show up there -- it's more difficult to track down.

So my suggestion is that instead of having errors done something like this:

                 if (ierr < 0) then
                    rc = ESMF_FAILURE
                    call ESMF_LogWrite(trim(subname)//' ERROR rpointer file open returns error', &
                         ESMF_LOGMSG_ERROR, line=__LINE__, file=__FILE__)
                    return
                 end if

It include a write to the log file as so

                 if (ierr < 0) then
                    rc = ESMF_FAILURE
+                  write(logunit,*) " error opening rpointer file "
                    call ESMF_LogWrite(trim(subname)//' ERROR rpointer file open returns error', &
                         ESMF_LOGMSG_ERROR, line=__LINE__, file=__FILE__)
                    return
                 end if

And since this is something to do lots of places in the code, even better would be to make would be to make use of the new suggested arguments to shr_abort_abort (esmf_log_write, line and file) as follows. See the CESM_share issue about this.

                 if (ierr < 0) then
                    call shr_abort_abort( "rpointer file open returns error", esmf_log_write=.true., line=__LINE__, file=__FILE__)
                 end if

I hope others will agree with me on this, but I'd also like to hear divergent thoughts. And completely open to suggestions on names and where this should go and other such details.

@samsrabin
Copy link

This is a great idea that I think will help massively with debugging. Would it be possible for this write command to be added to ESMF_LogWrite? That would avoid needing to (a) add it everywhere in the code now and (b) remember to do it in the future.

@ekluzek
Copy link
Collaborator Author

ekluzek commented Jan 22, 2025

@samsrabin thanks for the suggestion. At the end of my introduction I suggest creating a small function to do this called

med_utils_ErrAbort

The ESMF_LogWrite is an ESMF function from the ESMF library and it writes to the ESMF PET files, so there isn't a way for it to also write to another outside log file. And doing so would require at least having a separate line for it.

But, replacing these abort lines with a call to

med_utils_ErrAbort

would allow a single call to do do both these things. And other functionality to do something to shutdown the model could be added if it were useful in the future. This is pretty much what we do with endrun in CTSM we always call it, so we can modify error reporting for everything at once.

@jedwards4b and @billsacks can you chime in on what you think of this idea in this issue?

@jedwards4b
Copy link
Collaborator

Might I suggest that we add an optional call to the ESMF_LogWrite function to the
shr_abort_abort routine in addition to adding a new routine shr_log_write that writes to
both the ESMF PET files and to the component log file? In any case this issue belongs in https://github.com/ESCOMP/CESM_share

@ekluzek
Copy link
Collaborator Author

ekluzek commented Jan 22, 2025

Thanks for the suggestion @jedwards4b. I think you are saying do something like this below in the share code and use those in the CMEPS call. Basically adding a few optional arguments to shr_abort_abort (esmf_log, line, and file).

So the calls might look like this...

call shr_abort_abort( string=errMsg, esmf_log=.true., line=__LINE__, file=__FILE__)

A note there is that will change the current behavior because the current behavior is to mark the current routine as an ESMF fail write to the PET files and then return to the caller.

So for example what happens in #524 it fails in the esm_clockInit, which returns to it's caller SetModelServices and then returns to it's caller which I think is in NUOPC ESMF world which must then do the shutdown. So calling shr_abort in CMEPS rather than the current return up the stack method -- would be a change in behavior. I suspect the current behavior is important in order for ESMF to be able to open and write to the PET files as it needs to as well as get them properly closed and all that. As such I still think we should do the med_utils_errAbort idea. But, maybe the name should changed to say it's not really aborting but doing error logging. So it'd be something like this...

@billsacks can you speak to the need for ESMF NUOPC to return all the way up the stack verses calling an explicit abort?

                 if (ierr < 0) then
                    call med_utils_errLogAborting( ErrMsg="rpointer file open returns error", line=__LINE__, file=__FILE__)
                    return
                 end if

If you do want to change the behavior and have an explicit shr_abort -- I can open an issue in CESM_share, but even then there would also need to be this issue here to change the CMEPS code to use those calls rather than the return up the stack method currently in place.

@jedwards4b
Copy link
Collaborator

You're point is well taken. The reason ESMF returns up the stack is so that the user can see the stack trace on abort, which is the same thing that shr_abort does. I'm not sure that it matters which way it's done.

@ekluzek
Copy link
Collaborator Author

ekluzek commented Jan 22, 2025

Ahh, that makes sense. It is very helpful to see the complete stack trace.

Sounds good. My thinking would be to leave it closer to what it's currently doing and do it inside CMEPS. But, as maintainer you can say which way it should happen.

@jedwards4b
Copy link
Collaborator

This same problem of sometimes aborting with a message only to the PET files and not to the component logs can happen in any of the component caps or in cdeps as well as in cmeps so I think that the place to put the new routine is share.

@ekluzek
Copy link
Collaborator Author

ekluzek commented Jan 22, 2025

OK, added an issue to CESM_share, and updated the top of this to show using it.

@jedwards4b jedwards4b linked a pull request Jan 24, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants