(SE) SLES15 optimization for Github Actions #453
Labels
core development: design related issues and improvements
run-ci-discover (authorized users): Run the continuous integration suite on Discover
Tier 1 finished successfully but took too long: roughly 1 hour for almost every task (you can compare with previous Tier 1 runs). I had to make a few `cylc`-related changes in the `gmao_ci` account, to the `~/bin/cylc` and `~/cylc/global-workflow.yaml` files, which are one-time changes. Another change I had to make was to the `~/bin/cylc` file: it chooses the correct Cylc installation depending on the OS, and `~/bin` is added to `$PATH`.

See output here:
https://github.com/GEOS-ESM/swell/actions/runs/11633945304
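
For illustration, here is a minimal Python sketch of the kind of OS-based dispatch the `~/bin/cylc` wrapper performs. The install paths, the `CYLC_BY_OS` table, and the function names are all hypothetical; the actual wrapper in the `gmao_ci` account is not shown in this issue and may work differently.

```python
import re

# Hypothetical install locations keyed by (OS id, major version).
# The real Cylc paths on Discover are not given in this issue.
CYLC_BY_OS = {
    ("sles", "15"): "/path/to/sles15/cylc/bin/cylc",
    ("sles", "12"): "/path/to/sles12/cylc/bin/cylc",
}


def parse_os_release(text):
    """Parse the KEY=VALUE lines of an /etc/os-release style file into a dict."""
    info = {}
    for line in text.splitlines():
        match = re.match(r"^([A-Z0-9_]+)=(.*)$", line.strip())
        if match:
            # Values may be quoted; strip surrounding quotes.
            info[match.group(1)] = match.group(2).strip("\"'")
    return info


def select_cylc(os_release_text):
    """Return the Cylc executable configured for the OS described by the text."""
    info = parse_os_release(os_release_text)
    os_id = info.get("ID", "").lower()
    major = info.get("VERSION_ID", "").split(".")[0]
    try:
        return CYLC_BY_OS[(os_id, major)]
    except KeyError:
        raise RuntimeError(f"No Cylc installation configured for {os_id!r} {major!r}")
```

A real wrapper would read `/etc/os-release`, call something like `select_cylc`, and then hand over to the chosen binary (for example with `os.execv`) so that the original command-line arguments are passed through unchanged.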
Here are the steps I took to modify `test_swell.yml` to be able to run the Test CI Applications Action:
- Update CI-Workflows: modify the following file in the `feature/test_swell_application` branch:
GEOS-ESM/CI-workflows/.github/workflows/test_swell.yml
- In Swell -> Actions -> Test CI Applications, run any Swell branch (say we are testing different SLURM configs). Test CI only runs a particular CI-Workflows branch, linked below:
swell/.github/workflows/test_ci_application_discover.yml
Line 12 in 7812c41
To run this you need to be listed in @jardizzo's `nams_check.py` file.

The slowdown is caused partly by these two lines for variational tasks, since there are now 126 cores available on Milan nodes (I personally request 100 `ntasks-per-node`):
swell/src/swell/utilities/slurm.py
Lines 49 to 50 in 7812c41
However, I'm not sure why the hofx suite would be slow, so it must be a combination of the Discover filesystem being slow and the Swell SLES15 SLURM settings.
The default `ntasks-per-node` is defined here for different platforms: https://github.com/GEOS-ESM/swell/blob/develop/src/swell/deployment/platforms/nccs_discover_sles15/slurm.yaml
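
To make the node/ntasks trade-off concrete, here is a small sketch of the arithmetic, assuming the only constraint is the 126 cores per Milan node mentioned above. The helper name and the task count of 216 are made up for illustration; this is not how `slurm.py` computes its settings.

```python
import math

MILAN_CORES_PER_NODE = 126  # cores per Milan node, as stated in this issue


def nodes_needed(total_tasks, ntasks_per_node):
    """Number of nodes required to place total_tasks at ntasks_per_node each."""
    if not 1 <= ntasks_per_node <= MILAN_CORES_PER_NODE:
        raise ValueError("ntasks_per_node must be between 1 and 126 on Milan nodes")
    return math.ceil(total_tasks / ntasks_per_node)


# Illustrative only: 216 MPI tasks is an arbitrary example count.
print(nodes_needed(216, 100))  # 3 nodes when requesting 100 tasks per node
print(nodes_needed(216, 126))  # 2 nodes when filling every core
```

A lower `ntasks-per-node` leaves cores idle but gives each task more memory and bandwidth, so the best value probably has to be found by experiment for each suite.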
@rtodling and I can help with the proper node/ntasks combination, but could the filesystem issue be resolved with a `$TSE_TMPDIR` implementation?
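
One possible shape for such an implementation is sketched below. It assumes that `$TSE_TMPDIR`, when set inside a job, points to an existing scratch directory that is faster than the shared filesystem. The `scratch_dir` helper is hypothetical and not part of Swell.

```python
import os
import tempfile


def scratch_dir(env=None):
    """Prefer $TSE_TMPDIR for temporary files, falling back to the system default."""
    env = os.environ if env is None else env
    candidate = env.get("TSE_TMPDIR")
    # Only use the variable if it is set and names an existing directory.
    if candidate and os.path.isdir(candidate):
        return candidate
    return tempfile.gettempdir()
```

Tasks could then stage their temporary files under `scratch_dir()` and copy only the final outputs back to the experiment directory.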