Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reinstate product groups #3208

Merged

Conversation

WalterKolczynski-NOAA
Copy link
Contributor

@WalterKolczynski-NOAA WalterKolczynski-NOAA commented Jan 6, 2025

Description

Testing with full-sized GEFS found that the sheer number of tasks overloads rocoto, resulting in rocotorun taking over 10 min to complete or hanging entirely. To reduce the number of tasks, product groups are reimplemented so that multiple forecast hour are processed in a single task. However, the implementation is a little different than previously.

The jobs where groups are enabled (atmos_products, oceanice_products, wavepostsbs, atmos_ensstat, and gempak) have a new variable, MAX_TASKS, that controls how many groups to use. This setting is currently per member. The forecast hours to be processed are then divided into this many groups as evenly as possible without crossing forecast segment boundaries. The walltime for those jobs is then multiplied by the number of times in the largest group. For the gridded wave post job, the dependencies were also updated to trigger off of either the data being available or the appropriate segment completing (the dependencies had not been updated when the job was initially broken into fhrs).

A number of helper methods are added to Tasks to determine these groups and make a standard metatask variable dict in a centralized location. There is also a function to multiply the walltime, but this may be better off relocated to wxflow with the other time functions.

As part of switching from a single value to a list, hours are no longer passed by rocoto as zero-padded values. The lists are comma-delimited (without spaces) and split apart in the job stub (jobs/rocoto/*), so each j-job call is still a single forecast hour.

The offline post (upp) job is not broken into groups, since it really isn't used outside the analysis anymore.
Resolves #2999
Resolves #3210

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO (If YES, please add a link to any PRs that are pending.)
    • EMC verif-global
    • GDAS
    • GFS-utils
    • GSI
    • GSI-monitor
    • GSI-utils
    • UFS-utils
    • UFS-weather-model
    • wxflow

How has this been tested?

  • Full-sized GEFS test on Hercules
  • Standard forecast-only test on Hercules
  • Standard atm-only cycled test on Hercules

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

Testing with full-sized GEFS found that the sheer number of tasks
overloads rocoto, resulting in `rocotorun` taking over 10 min to
complete or hanging entirely. To reduce the number of tasks, product
groups are reimplemented so that multiple forecast hour are processed
in a single task. However, the implementation is a little different
than previously.

The jobs where groups are enabled (atmos_products, oceanice_products,
and wavepostsbs) have a new variable, `MAX_TASKS`, that controls how
many groups to use. This setting is currently *per member*. The
forecast hours to be processed are then divided into this many groups
as evenly as possible without crossing forecast segment boundaries.
The walltime for those jobs is then multiplied by the number of times
in the largest group.

A number of helper methods are added to Tasks to determine these
groups and make a standard metatask variable dict in a centralized
location. There is also a function to multiply the walltime, but this
may be better off relocated to wxflow with the other time functions.

As part of switching from a single value to a list, hours are no longer
passed by rocoto as zero-padded values. The lists are comma-delimited
(without spaces) and split apart in the job stub (`jobs/rocoto/*`), so
each j-job call is still a single forecast hour.

The offline post (upp) job is not broken into groups, since it really
isn't used outside the analysis anymore. Gempak jobs that run over
multiple forecast hours also aren't broken into groups yet.

Resolves NOAA-EMC#2999
Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no major concerns with this PR. It looks good and clean to me. I have not created the XML and looked at it or run any tests.

A few comments inline with the review.

workflow/rocoto/gefs_tasks.py Show resolved Hide resolved
workflow/rocoto/tasks.py Outdated Show resolved Hide resolved
Copy link
Member

@KateFriedman-NOAA KateFriedman-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, just leaving a few suggested changes to resolve some typos I noticed.

workflow/rocoto/tasks.py Outdated Show resolved Hide resolved
workflow/rocoto/tasks.py Outdated Show resolved Hide resolved
workflow/rocoto/tasks.py Outdated Show resolved Hide resolved
jobs/rocoto/atmos_products.sh Show resolved Hide resolved
jobs/rocoto/atmos_products.sh Outdated Show resolved Hide resolved
Adds an alternative data dependency for gridded wave post so jobs can
begin before the entire forecast is complete. Uses the next file
existing to prevent reading incomplete files.

Resolves NOAA-EMC#3210
@WalterKolczynski-NOAA
Copy link
Contributor Author

All reviewer comments other than the location of test methods have been addressed, so I am going to go ahead and run CI for the first machine.

@WalterKolczynski-NOAA WalterKolczynski-NOAA added the CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera label Jan 8, 2025
@emcbot emcbot added CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress and removed CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera labels Jan 8, 2025
@emcbot emcbot added CI-Hercules-Building **Bot use only** CI testing is cloning/building on Hercules CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed and removed CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress labels Jan 10, 2025
@emcbot emcbot added the CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS label Jan 10, 2025
@WalterKolczynski-NOAA WalterKolczynski-NOAA removed the CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed label Jan 10, 2025
@emcbot emcbot added CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress and removed CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS labels Jan 10, 2025
@emcbot
Copy link

emcbot commented Jan 10, 2025

CI Tests set up to run in /lfs/h2/emc/ptmp/emc.global/PR/PR_3208/RUNTESTS on WCOSS

@emcbot emcbot added CI-Hercules-Running **Bot use only** CI testing on Hercules for this PR is in-progress and removed CI-Hercules-Building **Bot use only** CI testing is cloning/building on Hercules labels Jan 10, 2025
@WalterKolczynski-NOAA WalterKolczynski-NOAA added CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed and removed CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress labels Jan 10, 2025
@WalterKolczynski-NOAA
Copy link
Contributor Author

Need to look at the gempak jobs a little closer.

@emcbot emcbot added CI-Hercules-Passed **Bot use only** CI testing on Hercules for this PR has completed successfully and removed CI-Hercules-Running **Bot use only** CI testing on Hercules for this PR is in-progress labels Jan 10, 2025
@emcbot
Copy link

emcbot commented Jan 10, 2025

CI Passed on Hercules in Build# 3
Built and ran in directory /work2/noaa/global/CI/HERCULES/3208


Experiment C48_ATM_7f978e6a Completed 2 Cycles: *SUCCESS* at Fri Jan 10 03:16:39 CST 2025
Experiment C48mx500_3DVarAOWCDA_7f978e6a Completed 2 Cycles: *SUCCESS* at Fri Jan 10 03:22:45 CST 2025
Experiment C48mx500_hybAOWCDA_7f978e6a Completed 2 Cycles: *SUCCESS* at Fri Jan 10 03:34:55 CST 2025
Experiment C96_S2SWA_gefs_replay_ics_7f978e6a Completed 1 Cycles: *SUCCESS* at Fri Jan 10 03:53:00 CST 2025
Experiment C96_atm3DVar_7f978e6a Completed 3 Cycles: *SUCCESS* at Fri Jan 10 04:29:20 CST 2025
Experiment C96C48_hybatmDA_7f978e6a Completed 3 Cycles: *SUCCESS* at Fri Jan 10 04:41:33 CST 2025
Experiment C48_S2SW_7f978e6a Completed 2 Cycles: *SUCCESS* at Fri Jan 10 05:17:41 CST 2025
Experiment C48_S2SWA_gefs_7f978e6a Completed 1 Cycles: *SUCCESS* at Fri Jan 10 05:36:30 CST 2025

@WalterKolczynski-NOAA WalterKolczynski-NOAA added CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress CI-Wcoss2-Passed **Bot use only** CI testing on WCOSS for this PR has completed successfully and removed CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress labels Jan 14, 2025
Copy link
Member

@KateFriedman-NOAA KateFriedman-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @WalterKolczynski-NOAA !

@WalterKolczynski-NOAA WalterKolczynski-NOAA merged commit e27f5df into NOAA-EMC:develop Jan 15, 2025
5 checks passed
@WalterKolczynski-NOAA WalterKolczynski-NOAA deleted the feature/prod_groups branch January 15, 2025 00:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI-Hera-Passed **Bot use only** CI testing on Hera for this PR has completed successfully CI-Hercules-Passed **Bot use only** CI testing on Hercules for this PR has completed successfully CI-Wcoss2-Passed **Bot use only** CI testing on WCOSS for this PR has completed successfully
Projects
None yet
5 participants