Add job submission for wcoss2 #32
Comments
Save off work & try on hera
I've run into a problem here that I just can't figure out. On wcoss2 I can run eva and obs-monitor components from the command line without error. But when I try to run a job via the batch system I hit syntax errors in the python code. For example this is from a pared down test script that runs to completion without error from the command line, but produces this syntax error in batch mode:
I've verified my module list, and the $PYTHONPATH and $PATH env vars are the same in both environments. What else could be different? @CoryMartin-NOAA , @kevindougherty-noaa , @ADCollard |
This looks like it is running with python2 vs python3. How is it being executed in the batch job? is |
Hmm.
So that was a good thought, but I'm still not sure what the issue is, apart from wcoss2 being a real pain at times. |
I think the second half of your last sentence is the key :-) I suspect there is something in the login env that is different from the batch environment. I have no idea what, but something is different. Sounds like this is probably a next week problem at this point. This does still somehow suggest that the python3 at batch runtime != the one on the login node... |
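One way to test the suspicion that the batch interpreter differs from the login one is to have the job report which interpreter it is actually running. The following is a minimal sketch, not part of obs-monitor; the function names are hypothetical. It deliberately avoids f-strings so that it still runs (and reports) under python2 instead of failing with its own syntax error.

```python
import sys


def interpreter_report():
    """Return a one-line description of the interpreter running this script."""
    v = sys.version_info
    return "{} (Python {}.{}.{})".format(sys.executable, v[0], v[1], v[2])


def require_python3(version_info=None):
    """Raise RuntimeError if version_info (default: this interpreter) is older than Python 3."""
    version_info = sys.version_info if version_info is None else version_info
    if version_info[0] < 3:
        found = ".".join(str(part) for part in version_info[:3])
        raise RuntimeError("Python 3 required, got " + found)


if __name__ == "__main__":
    # Run this both from the command line and from the batch job, then compare.
    print(interpreter_report())
    require_python3()
```

Running the same script from the login node and from the submitted job, and comparing the two printed lines, would show directly whether the two environments resolve to the same executable.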
Yes, I've been trying to figure out what's different between the login and batch environments. I've managed to rule out modules, $PYTHONPATH, and $PATH. I guess I should dump all the environment variables and sift through that happiness. That will be a real thrill. |
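Sifting through two full environment dumps is easier with a small diff helper. The sketch below assumes each dump was produced by running `env` (one `NAME=value` per line) in the login shell and in the batch job; the function names and the sample values are hypothetical. Multi-line values, such as exported shell functions, are not handled.

```python
def parse_env_dump(text):
    """Parse `env` output (one NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(login, batch):
    """Return (only_in_login, only_in_batch, changed) as sorted lists of names."""
    only_login = sorted(set(login) - set(batch))
    only_batch = sorted(set(batch) - set(login))
    changed = sorted(k for k in set(login) & set(batch) if login[k] != batch[k])
    return only_login, only_batch, changed


if __name__ == "__main__":
    # Made-up sample dumps, for illustration only.
    login = parse_env_dump("PATH=/a:/b\nLD_LIBRARY_PATH=/opt/lib\nHOME=/u/me\n")
    batch = parse_env_dump("PATH=/a:/b\nHOME=/u/me\nPBS_JOBID=123\n")
    for label, names in zip(("login only", "batch only", "changed"), diff_env(login, batch)):
        print(label, names)
```

Variables that appear only on the login side are the first candidates for anything the job silently depends on.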
I've verified that, with the prescribed specified modules loaded, both I do note that I can run a simple test python program with imports from the python standard library in the batch environment. I get syntax errors though when I try to import things from |
Some good news. At long last I've gotten past the syntax errors. In my parm file I'd been declaring |
@EdwardSafford-NOAA interesting. I guess it still must be a path issue somehow, but explicit works it seems, so that's good enough to proceed. Great! |
I think I'll just log that under 'mysteries of wcoss2'. I was able to generalize the script by setting |
Excelsior! That's a minimization summary plot created via batch queue on wcoss2. (No logos because they don't yet work on wcoss2 -- same issue as One of the complications/delays in getting this far has been that I pass specific environment variables in the job submission statement rather than using the -V option (which passes all env vars into the submitted job). Use of -V is strongly discouraged by ops because of the overhead, so it's taken me some time to figure out exactly what I need to run. The final missing piece was LD_LIBRARY_PATH to pick up libgeos_c.so.1. |
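As an illustration of passing only selected variables instead of using -V, the sketch below builds the comma-separated `NAME=value` list that PBS `qsub -v` accepts. The helper name and the job script name `job.sh` are hypothetical, and the variable names shown are just the ones mentioned in this thread, not a confirmed complete list. Values containing commas are rejected here rather than quoted, which is a simplifying assumption.

```python
import os


def qsub_var_list(names, env=None):
    """Build the NAME=value,NAME=value string for `qsub -v` from selected variables."""
    env = os.environ if env is None else env
    missing = [n for n in names if n not in env]
    if missing:
        raise KeyError("not set in the environment: " + ", ".join(missing))
    pairs = []
    for name in names:
        value = env[name]
        if "," in value:
            # Commas separate entries in the -v list; quoting is out of scope here.
            raise ValueError(name + " contains a comma")
        pairs.append("{}={}".format(name, value))
    return ",".join(pairs)


if __name__ == "__main__":
    sample_env = {"PYTHONPATH": "/x/lib", "LD_LIBRARY_PATH": "/opt/geos/lib"}
    var_list = qsub_var_list(["PYTHONPATH", "LD_LIBRARY_PATH"], sample_env)
    # Argument list suitable for subprocess.run; job.sh is a placeholder.
    print(["qsub", "-v", var_list, "job.sh"])
```

Failing on a missing variable at submission time is a design choice: it surfaces an incomplete list before the job reaches the queue rather than as a runtime import or linker error.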
I've now got the radTime, radSummary, and radBcoef plots working. That's the good news. Bad news is wall time. These run way slower on |
I finally figured out how the python syntax errors occurred. I wasn't exactly aware of how modules are handled on wcoss2, but (unwisely) assumed it was similar to hera. That is not so. Every submitted job starts with a stock set of modules and whatever modules have been loaded in the submitting script are ignored and must be reloaded if you want to use them. So that's why I kept picking up python2 when I haven't yet gotten a response from the help desk about using CFP to run the jobs (similarly to how they run on hera). The module situation does complicate that strategy somewhat. ChatGPT gave me some other suggestions on how to proceed so I'll explore those ideas until I hear from the help desk. |
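Since modules loaded in the submitting script are not carried into the job, a job could check early that what it needs is actually loaded. The sketch below reads the `LOADEDMODULES` variable, which the module system maintains as a colon-separated list of `name/version` entries. The function name and the module names in the example are hypothetical.

```python
import os


def missing_modules(required, loaded=None):
    """Return the required module names that do not appear in LOADEDMODULES."""
    if loaded is None:
        loaded = os.environ.get("LOADEDMODULES", "")
    entries = [entry for entry in loaded.split(":") if entry]
    return [
        name
        for name in required
        if not any(e == name or e.startswith(name + "/") for e in entries)
    ]


if __name__ == "__main__":
    # Hypothetical module names; substitute whatever the job really needs.
    absent = missing_modules(["python", "eva"], "craype/2.7.13:python/3.8.6")
    print("missing:", absent)
```

Calling this at the top of a job and exiting when the list is non-empty would turn a confusing syntax error into an explicit "module not loaded" message.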
Modules are now understood and are getting (re)loaded correctly. Next hurdle is running the [expletive deleted] serial batch jobs like on hera. PBS/qsub is quite different from slurm. To run a bunch of serial jobs from one submitted job you have to specify the number of cpus and the number of tasks (jobs) in the command file and uniquely identify them in the command file. I've done that and what I get now is every job running on every cpu. Quite the mess. Documentation on this is about nil in the wcoss2 wiki (a point which I've raised with the helpdesk) so I'm blindly flailing away at this with ChatGPT, so far to no avail. |
Back to working with CFP and finally making real headway. I found the CFP example at |
Looks good; the Rad plots were done in a shade over 50 min using 33 cpus. That's only slightly slower than hera and is within acceptable range. I'll move on to the conventional plots on Monday. |
I've added the ozone plots to the radiance plots with the CFP scheme and it's working fine. I've modified the minimization plot job to add module loading to the command file, and that's working correctly. |
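A command file of the kind described here could be generated along these lines. This is a sketch under several assumptions: each line of the command file is handed to a shell, the environment setup (including module loads) lives in a script that each line sources, and eva is invoked as `eva <yaml>`. The function name, the setup script name, and the yaml file names are all hypothetical.

```python
import shlex


def build_cmdfile_lines(setup_script, yaml_files, runner="eva"):
    """Return one command-file line per yaml file: source the setup, then run the plot."""
    lines = []
    for yaml_file in yaml_files:
        lines.append(
            "source {} && {} {}".format(
                shlex.quote(setup_script), runner, shlex.quote(yaml_file)
            )
        )
    return lines


if __name__ == "__main__":
    # Placeholder names; one line per serial plot task.
    for line in build_cmdfile_lines("setup.sh", ["rad_000.yaml", "rad_001.yaml"]):
        print(line)
```

Keeping the setup in one sourced script means every task gets the same modules and paths, and a change to the environment touches one file rather than every line of the command file.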
Save work in progress.
With all things working on wcoss2 I'm retesting all changes on hera and running into new problems. All my plots on hera now throw this error: Traceback (most recent call last): This worked as recently as 2 weeks ago. I hit this error in both the eva version in miniconda3 and my local install (~/.local/bin/eva built using the latest develop branch), so it seems eva isn't the issue. Is it possible that the matplotlib package in miniconda3 is out of date? |
@EdwardSafford-NOAA can you try the new python install from when we moved to rocky8? see https://github.com/NOAA-EMC/GDASApp/blob/develop/modulefiles/EVA/hera.lua |
@CoryMartin-NOAA that looks to have it. From the command line I was able to run eva on one of the component plots now. I'm retesting using the obs-monitor driver. Thanks very much; that would have taken me a little longer to figure out. |
Next hurdle. When I run from obs-monitor |
we should probably include it as a submodule and use relative paths pointing to it. I think that's what GDASApp does. |
So I found this in
Setting my PYTHONPATH similarly does work. That's probably good enough for this round of testing but we will need to do better when I/we implement a dedicated VE for obs-monitor. |
@EdwardSafford-NOAA works for me, yes we can revisit when it matures |
On hera there are some new warnings as a result of the Rocky8 and associated python updates. I've opened https://github.com/JCSDA-internal/eva/issues/190 to address one in A related eva issue, |
Add runWcoss.sh script, lump all plots into single job.
Add some Hera items.
Merge branch 'feature/wcoss2-32' of https://github.com/EdwardSafford-NOAA/obs-monitor into feature/wcoss2-32
Fix pycodestyle issues.
Bit more cleanup.
Update J-job.
Clean up.
Rm gfs_plots.yaml.orig from branch. It slipped in by mistake.
Add prefix to split yaml files, clean up J-job setup.
Replace runWcoss.sh with a setup script that gets sourced by the cmdfile.
Closed by #34 . |
Now that eva and emcpy are available on wcoss2 it should be possible to add the logic to run the obs-monitor J-job on wcoss2. Let's find out.