Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read and cache lightning freq data in ELM to improve I/O performance #6742

Open
jayeshkrishna opened this issue Nov 11, 2024 · 8 comments
Open
Assignees
Labels
elm land model Land Performance v3.0 Issues affecting v3.0

Comments

@jayeshkrishna
Copy link
Contributor

jayeshkrishna commented Nov 11, 2024

When looking at the I/O performance of E3SM v3 high resolution (ne120) case we found that the lightning data was being read by the land model a lot of times (~3K reads, every 3 hrs) slowing down the simulation (total read time = 1538.39s, close to 8% of the runtime, for a 1 yr run). Instead of reading the lightning data in multiple time steps it might be useful to read the entire variable (or multiple timesteps) in a single read and cache it.

The performance results of the case above from PACE is given below for reference (Please look at the "Read file I/O statistics" section to see the input file with the slow read times),

https://pace.ornl.gov/scorpio/198912/ELM-1601179

The file discussed above is /global/cfs/cdirs/e3sm/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2012_climo1995-2011.T62.lnfm_Total_c140423.nc .

$ du -sh  /global/cfs/cdirs/e3sm/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2012_climo1995-2011.T62.lnfm_Total_c140423.nc
202M	/global/cfs/cdirs/e3sm/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2012_climo1995-2011.T62.lnfm_Total_c140423.nc

The lightning data, variable lnfm, in the file is small (70K per timestep, 201MB for the entire variable), assuming that the variable is distributed across processes. Multiple timesteps can be cached by the model to improve read performance (by reducing the number of reads)

Since lnfm is a time dependent variable and time is an UNLIMITED dimension if you encounter issues (including performance issues) reading the entire variable in a single read you might want to convert the time dimension to a fixed size dimension first.

ncks --fix_rec_dmn time clmforc.Li_2012_climo1995-2011.T62.lnfm_Total_c140423.nc -o fixed_timedim_clmforc.Li_2012_climo1995-2011.T62.lnfm_Total_c140423.nc

After converting the UNLIMITED dimension (time) to a fixed size dimension you would need to modify the decomposition map passed to pio_read_darray() to read multiple (or all) time slices of the lightning data (lnfm variable)

@jayeshkrishna jayeshkrishna added Land Performance v3.0 Issues affecting v3.0 elm land model labels Nov 11, 2024
@rljacob
Copy link
Member

rljacob commented Nov 11, 2024

The compset is 2010_EAM%CMIP6_ELM%CNPRDCTCBCTOP_MPASSI%PRES_DOCN%DOM_MOSART_SGLC_SWAV_SIAC_SESP

@jayeshkrishna
Copy link
Contributor Author

FYI: @sarats , @dqwu

@rljacob
Copy link
Member

rljacob commented Nov 11, 2024

FYI: @thorntonpe . Not sure who would be best to fix this.

@glemieux
Copy link
Contributor

I just wanted to note quickly that I introduced some code back with #5369 that replicated the lightning read in for elm-fates usage. As such, I'll make sure to update that code with the recommended fix or we could refactor the elm fire code to integrate with the fates methods that were added with #5369. That would be a larger amount of work that is beyond scope here.

@ekluzek
Copy link
Contributor

ekluzek commented Nov 13, 2024

@jayeshkrishna I don't see any reason to change the time dimension away from unlimited. You certainly can read multiple time samples from a file that has the unlimited dimension. The advantage of the unlimited dimension is that it's easy to add new time slides on the end of the file.

Is there something else going on for your reasoning to change the file dimension?

@jayeshkrishna
Copy link
Contributor Author

@ekluzek : You are right, it should be possible to read the entire variable without converting the record dimension (Since typically these variables are read one timestep at a time you could encounter bugs that don't show up for typical runs. For files with multiple variables with unlimited dimensions the performance of reads would be improved by the conversion). I will update the conversion as a suggestion.

@peterdschwartz
Copy link
Contributor

Finally looking into this. There is an unused option to create a stream that reads the entire file, which seems like the simplest fix to pursue. I'll try to have a PR with a working version before end of the week.

If future lnfm files will be much larger than 200MB, then this will probably need to be adapted to calculate an appropriate span of years to read.

@peterdschwartz
Copy link
Contributor

The lnfm_init is actually duplicated: one in biogeochem/FireMod.F90 and one in biogeochem/FireDataBaseType.F90. At a glance, they seem identical but the latter is only used for FATES. After realizing I was modifying the wrong one, attempting to read the entire file crashes with a PIO error (i added the sprintf line):

 1: Assertion failed...
 1: PIO: FATAL ERROR: Aborting... Extra outermost dimension 2920 != 1 for index 0 (/global/u2/p/pschwar3/fresh_install/E3SM/externals/scorpio/src/clib/pio_darray_int.c: 1216)
/* Allow extra outermost dimensions in the decomposition */
                int num_extra_dims = (vdesc->record >= 0 && fndims > 1)? (ndims - (fndims - 1)) : (ndims - fndims);
                pioassert(num_extra_dims >= 0, "Unexpected num_extra_dims", __FILE__, __LINE__);
                char message[256];
                if (num_extra_dims > 0)
                {
                    for (int d = 0; d < num_extra_dims; d++){
                        sprintf(message, "Extra outermost dimension %d != 1 for index %d", iodesc->dimlen[d],d);
                        pioassert(iodesc->dimlen[d] == 1, message, __FILE__, __LINE__);
                        }
                }

The 2920 is the length of the time dimension in the lnfm file. I'm looking more into the stream creation parameters but the fix is not obvious to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
elm land model Land Performance v3.0 Issues affecting v3.0
Projects
None yet
Development

No branches or pull requests

5 participants