
(SE) SLES15 optimization for Github Actions #453

Open · Dooruk opened this issue Nov 1, 2024 · 2 comments

Labels: core development (design related issues and improvements) · run-ci-discover (authorized users: run the continuous integration suite on Discover)

Dooruk (Collaborator) commented Nov 1, 2024

Tier 1 finished successfully but took too long, roughly an hour for almost every task (compare with previous Tier 1 runs). I had to make a few one-time Cylc-related changes in the gmao_ci account, to the ~/bin/cylc and ~/cylc/global-workflow.yaml files. The ~/bin/cylc wrapper chooses the correct Cylc installation depending on the OS, and ~/bin is added to $PATH.
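For context, the ~/bin/cylc wrapper amounts to a small OS-dispatch script along these lines (a sketch only; the installation paths are hypothetical placeholders, not the actual paths on gmao_ci):

#!/usr/bin/env bash
# Pick the Cylc installation that matches the running OS (SLES12 vs SLES15).
# The CYLC_HOME paths are illustrative placeholders; the real wrapper may differ.
. /etc/os-release
case "${VERSION_ID%%.*}" in
    15) CYLC_HOME=/path/to/cylc-sles15 ;;
    12) CYLC_HOME=/path/to/cylc-sles12 ;;
    *)  echo "cylc wrapper: unsupported OS version ${VERSION_ID}" >&2; exit 1 ;;
esac
exec "${CYLC_HOME}/bin/cylc" "$@"

Because ~/bin is on $PATH, plain cylc invocations resolve to this wrapper.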

See output here:

https://github.com/GEOS-ESM/swell/actions/runs/11633945304

Here are the steps I took to modify test_swell.yml to be able to run the Test CI Applications action:

  1. Update CI-workflows: modify GEOS-ESM/CI-workflows/.github/workflows/test_swell.yml in the feature/test_swell_application branch.

  2. In Swell -> Actions -> Test CI Applications, run any Swell branch (say we are testing different SLURM configs). Test CI only runs the particular CI-workflows branch referenced below:

uses: GEOS-ESM/CI-workflows/.github/workflows/test_swell.yml@feature/test_swell_application

To run this you need to be listed in @jardizzo's nams_check.py file.
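For reference, the Swell-side caller is a thin reusable-workflow shim roughly like this (only the uses: line is taken from above; the workflow name, trigger, and job id are illustrative assumptions):

# Sketch of the Swell caller workflow; everything except the uses: reference
# is an illustrative assumption.
name: Test CI Applications
on:
  workflow_dispatch:
jobs:
  test-swell:
    uses: GEOS-ESM/CI-workflows/.github/workflows/test_swell.yml@feature/test_swell_application
    secrets: inherit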

The slowdown is partly caused by these two lines for the variational tasks, since Milan nodes now have 126 cores available (I personally request 100 ntasks-per-node):

"RunJediVariationalExecutable": {"all": {"nodes": 3, "ntasks-per-node": 36}},
"RunJediUfoTestsExecutable": {"all": {"ntasks-per-node": 1}},

However, I'm not sure why the hofx suite would be slow, so it must be a combination of the Discover filesystem being slow and the Swell SLES15 SLURM settings.

Default ntasks-per-node is defined here for different platforms:

https://github.com/GEOS-ESM/swell/blob/develop/src/swell/deployment/platforms/nccs_discover_sles15/slurm.yaml

@rtodling and I can help with the proper node/ntasks combination, but could the filesystem issue be resolved with a $TSE_TMPDIR implementation?

Dooruk added the run-ci-discover (authorized users) and core development labels Nov 1, 2024
Dooruk (Collaborator, Author) commented Nov 10, 2024

This is a bit urgent, as our Tier 1 and Tier 2 tests won't work since we updated the runners to SLES15.

  • The time between submitting a job on a head node and its start is much longer than with the previous SLES12 runner setup. For example, in this action run for 3dfgat_atmos, the 20211212T0000Z/GetObservations-geos_atmosphere task takes 6 minutes from submission to start running. Is this because the runners can't handle the number of requested tasks? Compute-node tasks don't seem to have this issue.

  • 3dfgat_atmos wasn't working until I made this small fix to use more ntasks-per-node on Milan: develop...feature/tasks_per_node

  • 3dvar still fails; I have no idea why, but it always fails on the same task, GenerateBClimatology, for the same reason: it claims files are missing when they do exist. This task requires compute nodes. The exact same setup works on my local submission, and the 3dvar suite runs on a 5-degree setup, so the compute requirements are minimal. I can live with it for now while we update everything else, including the build.

  • @jardizzo, could you update swell-tier1_application_discover.yml in develop with .github/workflows/test_swell.yml from the feature/test_swell_application branch? That's what I've been testing with. (This has been updated now.)

Dooruk (Collaborator, Author) commented Jan 6, 2025

Two quick updates following the changes pertaining to discover36.

  1. I didn't see any performance increase in the time it takes to complete GitHub Actions after the change NCCS made. Do we need to adjust or reset anything for the runners, @jardizzo?

  2. The Ocean 3DVar suite is still failing. This suite works on local $NOBACKUP (both @mranst and I tested it) and on the gmao_ci account if installed and launched manually. The only time it doesn't work is when it stalls on GitHub Actions. I gradually bumped the stall timeout in ~/.cylc/flow/global.cylc to 15 minutes (the relevant section is sketched below), but GenerateBClimatology still stalls for that long. I'm at a loss on what to do; both Tier 1 and Tier 2 won't work because of it.
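For reference, in Cylc 8 the stall timeout lives in the scheduler events section of global.cylc; the 15-minute bump above corresponds to roughly this (a minimal sketch, assuming Cylc 8 syntax):

# ~/.cylc/flow/global.cylc -- minimal sketch of the bumped stall timeout
[scheduler]
    [[events]]
        stall timeout = PT15M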
