Add job submission for wcoss2 #32

Closed
EdwardSafford-NOAA opened this issue May 7, 2024 · 25 comments
@EdwardSafford-NOAA
Collaborator

Now that eva and emcpy are available on wcoss2 it should be possible to add the logic to run the obs-monitor J-job on wcoss2. Let's find out.

@EdwardSafford-NOAA EdwardSafford-NOAA self-assigned this May 7, 2024
EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue May 9, 2024
Save off work & try on hera
@EdwardSafford-NOAA
Collaborator Author

EdwardSafford-NOAA commented May 10, 2024

I've run into a problem here that I just can't figure out. On wcoss2 I can run eva and obs-monitor components from the command line without error. But when I try to run a job via the batch system I hit syntax errors in the python code. For example this is from a pared down test script that runs to completion without error from the command line, but produces this syntax error in batch mode:

File "/lfs/h2/emc/da/noscrub/Edward.Safford/git/obs-monitor/ush/test.py", line 36
    logger.info(f'cycle_tm: {cycle_tm}')
    ^
SyntaxError: invalid syntax

The location of the ^ didn't cut&paste well, it's actually under the final single quote on the line. Not sure that matters though; if I comment out that line then another bogus syntax error is generated from one of the wxflow components.

I've verified my module list, and the $PYTHONPATH and $PATH env vars are the same in both environments. What else could be different? @CoryMartin-NOAA , @kevindougherty-noaa , @ADCollard

@CoryMartin-NOAA
Contributor

This looks like it is running with python2 vs python3. How is it being executed in the batch job? is $APRUN_PY python? perhaps it needs to be python3 on that machine?

@EdwardSafford-NOAA
Collaborator Author

EdwardSafford-NOAA commented May 10, 2024

Hmm. $APRUN_PY was set to python, which is what I've used successfully from the command line. Changing $APRUN_PY in the batch job to python3 fails too but in a different place. It chokes when trying to import numpy:

File "/apps/prod/ve/intel/19.1.3.304/python/3.10.4/evs/1.0/lib/python3.10/site-packages/numpy/version.py", line 1
    from __future__ import annotations
    ^
SyntaxError: future feature annotations is not defined

So that was a good thought but I'm still not sure what the issue is apart from wcoss2 is a real pain at times.
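
Both tracebacks are consistent with an interpreter older than the expected 3.10.4: f-strings require Python 3.6 or later, and `from __future__ import annotations` requires 3.7 or later. A minimal sketch of a check that could run at the top of a batch job follows. The function names and the `MIN_VERSION` constant are illustrative and are not part of obs-monitor.

```python
import sys

# Minimum version implied by the two errors in this thread:
# f-strings need 3.6+, `from __future__ import annotations` needs 3.7+.
MIN_VERSION = (3, 7)


def interpreter_report(version_info=None, executable=None):
    """Return a one-line description of the running interpreter."""
    version_info = sys.version_info if version_info is None else version_info
    executable = sys.executable if executable is None else executable
    version = ".".join(str(part) for part in version_info[:3])
    return "python {0} at {1}".format(version, executable)


def is_new_enough(version_info=None, minimum=MIN_VERSION):
    """True when the interpreter meets the minimum (major, minor) version."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # str.format is used instead of an f-string so that an old interpreter
    # can still parse this file and print the report.
    print(interpreter_report())
    if not is_new_enough():
        sys.exit("interpreter is older than {0}.{1}".format(*MIN_VERSION))
```

Running this with the same command the job uses (for example whatever `$APRUN_PY` expands to) shows which interpreter the batch environment actually picks up.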

@CoryMartin-NOAA
Contributor

I think the second half of your last sentence is the key :-)

I suspect there is something in the login env that is different from the batch environment. I have no idea what, but something is different. Sounds like this is probably a next week problem at this point.

This does still somehow suggest that the python3 at batch runtime != the one on the login node...

@EdwardSafford-NOAA
Collaborator Author

Yes I've been trying to figure out what's different between login and batch environments. I've managed to rule out modules, $PYTHONPATH, and $PATH. I guess I should dump all the environment variables and sift through that happiness. That will be a real thrill.
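
Comparing two full environment dumps by eye is error prone, so a small diff helper may help. This is a sketch that assumes each dump was captured with `env` (one `NAME=value` per line) in the login shell and inside the batch job; the function names are hypothetical.

```python
def parse_env_dump(text):
    """Parse `env` output (NAME=value per line) into a dict.

    Lines without '=' are skipped, so multi-line values such as exported
    shell functions are not handled correctly by this sketch.
    """
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(login, batch):
    """Return (only in login, only in batch, present in both but different)."""
    only_login = sorted(set(login) - set(batch))
    only_batch = sorted(set(batch) - set(login))
    changed = sorted(
        name for name in set(login) & set(batch) if login[name] != batch[name]
    )
    return only_login, only_batch, changed
```

The third list is usually the interesting one: variables that exist in both environments but carry different values.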

@EdwardSafford-NOAA
Collaborator Author

I've verified that, with the prescribed modules loaded, both python and python3 resolve to /apps/prod/ve/intel/19.1.3.304/python/3.10.4/evs/1.0/bin/python on wcoss2. I've been sifting through the full env list in the batch and interactive environments but haven't yet seen anything that would suggest where the mismatch lies.

I do note that I can run a simple test python program with imports from the python standard library in the batch environment. I get syntax errors though when I try to import things from /apps/prod/ve/intel/19.1.3.304/python/3.10.4/evs/1.0/lib/python3.10/site-packages/, like numpy or yaml. I also get syntax errors from things in my local lib at /lfs/h2/emc/da/noscrub/edward.safford/eva/opt . The errors sure seem to suggest that the batch run-time version of python is something other than 3.10.4, but I haven't found any direct evidence of that.

@EdwardSafford-NOAA
Collaborator Author

EdwardSafford-NOAA commented May 14, 2024

Some good news. At long last I've gotten past the syntax errors. In my parm file I'd been declaring APRUN_PY=python, which both interactively and in batch queues evaluates to /apps/prod/ve/intel/19.1.3.304/python/3.10.4/evs/1.0/bin/python per my debug output. But when I make the declaration explicit in the parm file, APRUN_PY=/apps/prod/ve/intel/19.1.3.304/python/3.10.4/evs/1.0/bin/python, there are no more syntax errors. So somehow python doesn't really resolve to /apps/prod/ve/intel/19.1.3.304/python/3.10.4/evs/1.0/bin/python in the batch queues. Still no idea why/how, but at least that's in the rear view mirror now. On to the next problem.

@CoryMartin-NOAA
Contributor

@EdwardSafford-NOAA interesting. I guess it still must be a path issue somehow, but explicit works it seems, so that's good enough to proceed. Great!

@EdwardSafford-NOAA
Collaborator Author

EdwardSafford-NOAA commented May 14, 2024

I think I'll just log that under 'mysteries of wcoss2'. I was able to generalize the script by setting APRUN_PY=`which python`.
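
The same idea, resolving a bare command name to an absolute path before handing it to the job, can be sketched in Python with the standard library's `shutil.which`. The helper name `resolve_command` is hypothetical, and the sketch assumes the lookup runs in the environment where the intended modules are already loaded.

```python
import os
import shutil


def resolve_command(name, search_path=None):
    """Return the absolute path that `name` resolves to, or raise.

    search_path is a PATH-style string; None means the current $PATH.
    """
    found = shutil.which(name, path=search_path)
    if found is None:
        raise FileNotFoundError("{0!r} not found on the search path".format(name))
    return os.path.abspath(found)
```

For example, `resolve_command("python")` in the login shell returns the full path of whichever interpreter is first on `$PATH` there, which can then be written into the parm file instead of the bare name.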

@EdwardSafford-NOAA
Collaborator Author

EdwardSafford-NOAA commented May 15, 2024

Excelsior!

[Image: gfs_gdas summary gnorms(2)]

That's a minimization summary plot created via batch queue on wcoss2. (No logos because they don't yet work on wcoss2 -- same issue as tight layout.)

One of the complications/delays in getting this far has been that I pass specific environment variables in the job submission statement rather than using the -V option (which passes all env vars into the submitted job). Use of -V is strongly discouraged by ops because of the overhead, so it's taken me some time to figure out exactly which variables I need to run. The final missing piece was LD_LIBRARY_PATH to pick up libgeos_c.so.1.
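
One way to keep an explicit variable list manageable is to build the `qsub -v` argument from a short whitelist. The sketch below assumes the PBS form `-v NAME=value,NAME=value`; the function name is hypothetical, and because commas separate the entries it simply refuses values that contain a comma rather than trying to quote them.

```python
def build_qsub_v(names, env):
    """Build the argument for `qsub -v` from an explicit list of names.

    names: variable names to pass to the job, in order.
    env:   mapping to read values from (for example os.environ).
    """
    missing = [name for name in names if name not in env]
    if missing:
        raise KeyError("not set in the submitting environment: " + ", ".join(missing))
    with_commas = [name for name in names if "," in env[name]]
    if with_commas:
        raise ValueError("values contain commas: " + ", ".join(with_commas))
    return ",".join("{0}={1}".format(name, env[name]) for name in names)
```

Failing loudly on a missing name at submission time is cheaper than discovering inside the job that something like `LD_LIBRARY_PATH` never arrived.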

@EdwardSafford-NOAA
Collaborator Author

I've now got the radTime, radSummary, and radBcoef plots working. That's the good news. Bad news is wall time. These run way slower on wcoss2 than on hera. I've tried to speed things up by distributing the plot jobs across multiple nodes using the available command file processing (CFP). That runs ok, but I'm not seeing any speed up over a single serial job. I've submitted a question about that to the help desk; it's entirely possible I don't understand something and/or the wiki is out of date.

@EdwardSafford-NOAA
Collaborator Author

I finally figured out how the python syntax errors occurred. I wasn't exactly aware of how modules are handled on wcoss2, but (unwisely) assumed it was similar to hera. That is not so. Every submitted job starts with a stock set of modules and whatever modules have been loaded in the submitting script are ignored and must be reloaded if you want to use them. So that's why I kept picking up python2 when $APRUN_PY was defined as python, but worked when I specified the full path to python/3.10.4. Now I'm surprised I ever got it working at all.
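
Given that behavior, the job script itself has to load the modules before calling python. A minimal sketch of generating such a script is below; it assumes the `module` command is available in the job's shell, and the module names in the usage note are placeholders, not real wcoss2 modules.

```python
def render_job_script(modules, command, shell="/bin/bash"):
    """Return job script text that loads modules before running command.

    modules: module names to load inside the job, in order.
    command: the shell command to run once the modules are loaded.
    """
    lines = ["#!" + shell]
    lines.extend("module load {0}".format(name) for name in modules)
    # Log which interpreter the job resolves after the modules are loaded.
    lines.append("command -v python")
    lines.append(command)
    return "\n".join(lines) + "\n"
```

For instance, `render_job_script(["example_python/3.10"], "python test.py")` yields a script whose `module load` line runs inside the job, so the interpreter no longer depends on what was loaded in the submitting shell.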

I haven't yet gotten a response from the help desk about using CFP to run the jobs (similarly to how they run on hera). The module situation does complicate that strategy somewhat. ChatGPT gave me some other suggestions on how to proceed so I'll explore those ideas until I hear from the help desk.

@EdwardSafford-NOAA
Collaborator Author

Modules are now understood and are getting (re)loaded correctly. Next hurdle is running the [expletive deleted] serial batch jobs like on hera. PBS/qsub is quite different from slurm. To run a bunch of serial jobs from one submitted job you have to specify the number of cpus and the number of tasks (jobs) in the command file and uniquely identify them in the command file. I've done that and what I get now is every job running on every cpu. Quite the mess. Documentation on this is about nil in the wcoss2 wiki (a point which I've raised with the helpdesk) so I'm blindly flailing away at this with ChatGPT, so far to no avail.

@EdwardSafford-NOAA
Collaborator Author

EdwardSafford-NOAA commented May 31, 2024

Back to working with CFP and finally making real headway. I found the CFP example at /apps/docs/samples/intel/cfp/, and while it is somewhat out of date, it's together enough to demonstrate the basic principle. So I've cobbled together a job script that executes a series of commands from a separate command file. I put the command file together on-the-fly from the input yaml file (obs-monitor/parm/gfs/gfs_plots.yaml), with the number of requested cpus set by the number of separate eva plot jobs that are generated by slicing up the model yaml file. I'm stress testing now with the full set of Rad plots (time, summary, angle, bcoef).
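
The command-file approach described above can be sketched roughly as follows: one shell command per split yaml file, with the cpu request equal to the number of commands. The file names, the setup script, and the way the driver is invoked are assumptions for illustration only; the actual obs-monitor command lines may differ.

```python
import shlex


def build_command_file(yaml_files, setup_script, driver):
    """Return one shell command per plot yaml file.

    Each command sources the setup script (so modules get loaded inside
    the job) and then runs the driver on a single yaml file.
    """
    commands = []
    for yaml_file in yaml_files:
        commands.append(
            "source {0} && python {1} {2}".format(
                shlex.quote(setup_script),
                shlex.quote(driver),
                shlex.quote(yaml_file),
            )
        )
    return commands


def write_command_file(path, commands):
    """Write the commands one per line and return the cpu count to request."""
    with open(path, "w") as handle:
        handle.write("\n".join(commands) + "\n")
    return len(commands)
```

Deriving the cpu count from the command list keeps the resource request and the command file in step when plots are added or removed from the yaml.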

@EdwardSafford-NOAA
Collaborator Author

EdwardSafford-NOAA commented May 31, 2024

Looks good; the Rad plots were done in a shade over 50 min using 33 cpus. That's only slightly slower than hera and is within an acceptable range. I'll move on to the conventional plots on Monday.

@EdwardSafford-NOAA
Collaborator Author

I've added the ozone plots to the radiance plots with the CFP scheme and it's working fine. I've modified the minimization plot job to add module loading to the command file, and that's working correctly.

EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue Jun 5, 2024
Save work in progress.
@EdwardSafford-NOAA
Collaborator Author

With all things working on wcoss2 I'm retesting all changes on hera and running into new problems. All my plots on hera now throw this error:

Traceback (most recent call last):
0: File "/scratch1/NCEPDEV/da/Edward.Safford/noscrub/git/obs-monitor/ush/plotObsMon.py", line 193, in <module>
0: eva(plot_yaml)
0: File "/home/Edward.Safford/.local/lib/python3.9/site-packages/eva/eva_driver.py", line 235, in eva
0: figure_driver(eva_dict, data_collections, timing, logger)
0: File "/home/Edward.Safford/.local/lib/python3.9/site-packages/eva/plotting/batch/base/plot_tools/figure_driver.py", line 150, in figure_driver
0: make_figure(handler, figure_conf, plots_conf,
0: File "/home/Edward.Safford/.local/lib/python3.9/site-packages/eva/plotting/batch/base/plot_tools/figure_driver.py", line 235, in make_figure
0: fig.create_figure()
0: File "/home/Edward.Safford/.local/lib/python3.9/site-packages/emcpy/plots/create_plots.py", line 316, in create_figure
0: plot_dict[layer.plottype](layer, ax)
0: File "/home/Edward.Safford/.local/lib/python3.9/site-packages/emcpy/plots/create_plots.py", line 668, in _horizontalline
0: ax.axhline(plotobj.y, **inputs)
0: File "/scratch1/NCEPDEV/da/python/opt/core/miniconda3/4.6.14/envs/eva/lib/python3.9/site-packages/matplotlib/axes/_axes.py", line 737, in axhline
0: l = mlines.Line2D([xmin, xmax], [y, y], transform=trans, **kwargs)
0: File "/scratch1/NCEPDEV/da/python/opt/core/miniconda3/4.6.14/envs/eva/lib/python3.9/site-packages/matplotlib/lines.py", line 393, in __init__
0: self.update(kwargs)
0: File "/scratch1/NCEPDEV/da/python/opt/core/miniconda3/4.6.14/envs/eva/lib/python3.9/site-packages/matplotlib/artist.py", line 1067, in update
0: raise AttributeError(f"{type(self).__name__!r} object "
0: AttributeError: 'Line2D' object has no property 'type'
srun: error: h22c49: task 0: Exited with exit code 1

This worked as recently as 2 weeks ago. I hit this error in both the eva version in miniconda3 and my local install (~/.local/bin/eva built using the latest develop branch), so it seems eva isn't the issue. Is it possible that the matplotlib package in miniconda3 is out of date?

@CoryMartin-NOAA
Contributor

@EdwardSafford-NOAA can you try the new python install from when we moved to rocky8? see https://github.com/NOAA-EMC/GDASApp/blob/develop/modulefiles/EVA/hera.lua

@EdwardSafford-NOAA
Collaborator Author

@CoryMartin-NOAA that looks to have it. From the command line I was able to run eva on one of the component plots now. I'm retesting using the obs-monitor driver. Thanks very much; that would have taken me a little longer to figure out.

@EdwardSafford-NOAA
Collaborator Author

Next hurdle. When I run from obs-monitor, plotObsMon.py can't pick up the included wxflow components. I have wxflow installed locally and then included in my $PYTHONPATH as /home/Edward.Safford/.local/lib/python3.9/site-packages. What do I need to do differently?

@CoryMartin-NOAA
Contributor

we should probably include it as a submodule and use relative paths pointing to it. I think that's what GDASApp does.

@EdwardSafford-NOAA
Collaborator Author

So I found this in GDASApp/modulefiles/GDAS/hera.intel.lua:

-- hack for wxflow
prepend_path("PYTHONPATH", "/scratch1/NCEPDEV/da/python/gdasapp/wxflow/20240307/src")

Setting my PYTHONPATH similarly does work. That's probably good enough for this round of testing but we will need to do better when I/we implement a dedicated VE for obs-monitor.
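
When several copies of a package may be on the path (a local install, a shared install, a submodule), it can help to ask Python which one it would actually import. A minimal sketch using the standard library's `importlib.util.find_spec`; the helper name is hypothetical.

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be imported from.

    Returns None when the package cannot be found on sys.path. The
    origin is also None for namespace packages, which have no single file.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin
```

Calling `locate_package("wxflow")` from inside the job shows whether the `PYTHONPATH` entry is being honored, without importing the package or any of its dependencies.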

@CoryMartin-NOAA
Contributor

@EdwardSafford-NOAA works for me, yes we can revisit when it matures

@EdwardSafford-NOAA
Collaborator Author

EdwardSafford-NOAA commented Jun 7, 2024

On hera there are some new warnings as a result of the Rocky8 and associated python updates. I've opened https://github.com/JCSDA-internal/eva/issues/190 to address one in csv_space.py which I'll tackle when I have a moment.

A related eva issue: mon_data_space.py needs to exit gracefully if the control file for a requested plot isn't found. At present a missing control file for one plot results in the entire model plot job failing. Opened JCSDA-internal/eva#191 to correct that.
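
The desired behavior, skipping a plot whose control file is missing instead of failing the whole job, might look something like the sketch below. This is not eva's implementation; the function name and the shape of the `plots` mapping are assumptions made for illustration.

```python
import logging
import os


def split_by_control_file(plots, logger=None):
    """Separate plots into runnable and skipped by control file presence.

    plots: mapping of plot name -> control file path.
    Returns (runnable, skipped) as lists of plot names.
    """
    logger = logger or logging.getLogger(__name__)
    runnable, skipped = [], []
    for name, control_file in plots.items():
        if os.path.isfile(control_file):
            runnable.append(name)
        else:
            # Warn and continue so one missing file does not stop the rest.
            logger.warning(
                "control file %s not found; skipping plot %s", control_file, name
            )
            skipped.append(name)
    return runnable, skipped
```

The caller can then run only the `runnable` plots and report the `skipped` ones at the end of the job.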

EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue Jun 24, 2024
Add runWcoss.sh script, lump all plots into single job.
EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue Jun 24, 2024
Add some Hera items.
EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue Jun 24, 2024
Merge branch 'feature/wcoss2-32' of https://github.com/EdwardSafford-NOAA/obs-monitor into feature/wcoss2-32
Fix pycodestyle issues.
EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue Jun 24, 2024
Bit more cleanup.
EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue Jun 24, 2024
Update J-job.
EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue Jun 24, 2024
Clean up.
EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue Jun 25, 2024
Rm gfs_plots.yaml.orig from branch.  It slipped in by mistake.
EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue Jun 26, 2024
Add prefix to split yaml files, clean up J-job setup.
EdwardSafford-NOAA added a commit to EdwardSafford-NOAA/obs-monitor that referenced this issue Jun 26, 2024
Replace runWcoss.sh with a setup script that gets sourced by the cmdfile.
@EdwardSafford-NOAA
Collaborator Author

Closed by #34 .
