
(SE) SLES15 optimization for Github Actions #453

Open · Dooruk opened this issue Nov 1, 2024 · 2 comments

Labels: core development (design related issues and improvements) · run-ci-discover (authorized users: run the continuous integration suite on Discover)

Dooruk (Collaborator) commented Nov 1, 2024

Tier 1 finished successfully but took too long, roughly an hour for almost every task (compare with previous Tier 1 runs). I had to make a few one-time Cylc-related changes in the gmao_ci account, to the ~/bin/cylc and ~/cylc/global-workflow.yaml files. The ~/bin/cylc wrapper chooses the correct Cylc installation depending on the OS, and ~/bin is added to $PATH.
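For context, the ~/bin/cylc wrapper amounts to a small OS-dispatch script along these lines (a sketch only; the installation paths are hypothetical placeholders, not the actual paths on gmao_ci):

#!/usr/bin/env bash
# Pick the Cylc installation that matches the running OS (SLES12 vs SLES15).
# The CYLC_HOME paths are illustrative placeholders; the real wrapper may differ.
. /etc/os-release
case "${VERSION_ID%%.*}" in
    15) CYLC_HOME=/path/to/cylc-sles15 ;;
    12) CYLC_HOME=/path/to/cylc-sles12 ;;
    *)  echo "cylc wrapper: unsupported OS version ${VERSION_ID}" >&2; exit 1 ;;
esac
exec "${CYLC_HOME}/bin/cylc" "$@"

Because ~/bin is on $PATH, plain cylc invocations resolve to this wrapper.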

See output here:

https://github.com/GEOS-ESM/swell/actions/runs/11633945304

Here are the steps I took to modify test_swell.yml to be able to run the Test CI Applications action:

  1. Update CI-workflows: modify GEOS-ESM/CI-workflows/.github/workflows/test_swell.yml in the feature/test_swell_application branch.

  2. In Swell -> Actions -> Test CI Applications, run any Swell branch (say we are testing different SLURM configs). Test CI only runs the particular CI-workflows branch referenced below:

uses: GEOS-ESM/CI-workflows/.github/workflows/test_swell.yml@feature/test_swell_application

To run this you need to be listed in @jardizzo's nams_check.py file.
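For reference, the Swell-side caller is a thin reusable-workflow shim roughly like this (only the uses: line is taken from above; the workflow name, trigger, and job id are illustrative assumptions):

# Sketch of the Swell caller workflow; everything except the uses: reference
# is an illustrative assumption.
name: Test CI Applications
on:
  workflow_dispatch:
jobs:
  test-swell:
    uses: GEOS-ESM/CI-workflows/.github/workflows/test_swell.yml@feature/test_swell_application
    secrets: inherit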

The slowdown is partly caused by these two lines for the variational tasks, since Milan nodes now have 126 cores available (I personally request 100 ntasks-per-node):

"RunJediVariationalExecutable": {"all": {"nodes": 3, "ntasks-per-node": 36}},
"RunJediUfoTestsExecutable": {"all": {"ntasks-per-node": 1}},

However, I'm not sure why the hofx suite would be slow, so it must be a combination of the Discover filesystem being slow and the Swell SLES15 SLURM settings.

Default ntasks-per-node is defined here for different platforms:

https://github.com/GEOS-ESM/swell/blob/develop/src/swell/deployment/platforms/nccs_discover_sles15/slurm.yaml

@rtodling and I can help with the proper node/ntasks combination, but could the filesystem issue be resolved with a $TSE_TMPDIR implementation?

Dooruk added the run-ci-discover (authorized users) and core development labels Nov 1, 2024
Dooruk (Collaborator, Author) commented Nov 10, 2024

This is a bit urgent, as our Tier 1 and Tier 2 tests won't work since we updated the runners to SLES15.

  • The time between submitting a job on a head node and its start is much longer than with the previous SLES12 runner setup. For example, in this action run for 3dfgat_atmos, the 20211212T0000Z/GetObservations-geos_atmosphere task takes 6 minutes from submission to start running. Is this because the runners can't handle the number of requested tasks? Compute-node tasks don't seem to have this issue.

  • 3dfgat_atmos wasn't working until I made this small fix to use more ntasks-per-node on Milan: develop...feature/tasks_per_node

  • 3dvar still fails; I have no idea why, but it always fails on the same task, GenerateBClimatology, for the same reason: it claims files are missing when they do exist. This task requires compute nodes. The exact same setup works on my local submission, and the 3dvar suite runs on a 5-degree setup, so the compute requirements are minimal. I can live with it for now while we update everything else, including the build.

  • @jardizzo, could you update swell-tier1_application_discover.yml in develop with .github/workflows/test_swell.yml from the feature/test_swell_application branch? That's what I've been testing with. (This has been updated now.)

Dooruk (Collaborator, Author) commented Jan 6, 2025

Two quick updates following the changes pertaining to discover36.

  1. I didn't see any performance increase in the time it takes to complete GitHub Actions after the change NCCS made. Do we need to adjust or reset anything for the runners, @jardizzo?

  2. The Ocean 3DVar suite is still failing. This suite works on local $NOBACKUP (both @mranst and I tested it) and on the gmao_ci account if installed and launched manually. The only time it doesn't work is when it stalls on GitHub Actions. I gradually bumped the stall timeout in ~/.cylc/flow/global.cylc to 15 minutes (the relevant section is sketched below), but GenerateBClimatology still stalls for that long. I'm at a loss on what to do; both Tier 1 and Tier 2 won't work because of it.
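For reference, in Cylc 8 the stall timeout lives in the scheduler events section of global.cylc; the 15-minute bump above corresponds to roughly this (a minimal sketch, assuming Cylc 8 syntax):

# ~/.cylc/flow/global.cylc -- minimal sketch of the bumped stall timeout
[scheduler]
    [[events]]
        stall timeout = PT15M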
