Skip to content

Commit

Permalink
contrib/aws: Fix cluster resource leaks for superseded jobs
Browse files Browse the repository at this point in the history
If a user puts up a PR, and then proceeds to push to that PR while AWS CI is still running,
Jenkins will use a feature called milestones to cancel the old build. Jenkins will try to
nicely abort the stage with sig-term like behavior, wait 10 seconds and do another nice abort
with sig-term behavior, and finally wait 10 more seconds before killing the stage with sig-kill
like behavior. We have a race condition that sometimes leaks clusters if the cleanup didn't finish
before the stage got forcefully terminated. After Jenkins terminates the stages, it always runs the
post-build actions. Currently, the post build actions only call cleanup in one region, but the CI creates
clusters in multiple regions.

If/When resources are leaked, they get automatically cleaned up by the resource reaper in the account
after they are alive for a fixed time. This PR allows us to be more frugal by cleaning up correctly,
instead of relying on the resource reaper to clean up for us. It also prevent us from hitting our tight
AWS account level EC2 limits.

This PR fixes our cleanup race by calling the cleanup logic for every region in the ALWAYS post build stage.
If the cluster is already cleaned up (successful run), the cleanup call is a no-op.

Signed-off-by: Seth Zegelstein <[email protected]>
  • Loading branch information
a-szegel committed Jan 9, 2025
1 parent c8c36e4 commit 86066cd
Showing 1 changed file with 21 additions and 18 deletions.
39 changes: 21 additions & 18 deletions contrib/aws/Jenkinsfile
Original file line number Diff line number Diff line change
Expand Up @@ -61,17 +61,21 @@ def get_random_string(len) {
return s
}

def get_cluster_name_prefix(build_tag) {
return sh(
script: "echo ${build_tag} | sed \"s/^jenkins-//g\" | sed \"s/ //g\"",
returnStdout: true
)
}

def get_cluster_name(build_tag, os, instance_type) {
/*
* Compose the cluster name. Pcluster requires a cluster name under 60 characters.
* cluster name cannot have ".".
* Jenkins does not allow groovy to use the replace() method
* of string. Therefore we used shell command sed to replace "." with ""
*/
build_tag = sh(
script: "echo ${build_tag} | sed \"s/^jenkins-//g\" | sed \"s/ //g\"",
returnStdout: true
)
build_tag = get_cluster_name_prefix(build_tag)

def cluster_name = sh(
script: "echo '${build_tag.take(28)}-${os.take(10)}-${instance_type.take(10)}-'${get_random_string(8)} | tr -d '.\\n'",
Expand Down Expand Up @@ -133,10 +137,6 @@ pipeline {
timeout(time: 10, unit: 'HOURS')
skipDefaultCheckout()
}
environment {
// AWS region where the cluster is created
REGION="us-west-2"
}
stages {
// Cleanup workspace before job start.
stage("Clean up workspace") {
Expand Down Expand Up @@ -241,17 +241,20 @@ pipeline {
sh 'find PortaFiducia/tests/outputs -name "*.xml" | xargs du -shc'
junit testResults: 'PortaFiducia/tests/outputs/**/*.xml', keepLongStdio: false
archiveArtifacts artifacts: 'PortaFiducia/tests/outputs/**/*.*'
script {
// Try To Cleanup Resources
def regions = ["us-east-1", "eu-north-1", "us-west-2"]
cluster_name_prefix = get_cluster_name_prefix(env.BUILD_TAG)
regions.each { region ->
sh ". venv/bin/activate; ./PortaFiducia/scripts/delete_manual_cluster.py --cluster-name '${cluster_name_prefix}*' --region ${region}"
}
// Windows Cluster, has a different name
sh '''
. venv/bin/activate
./PortaFiducia/scripts/delete_manual_cluster.py --cluster-name WindowsLibfabricCi_${env.CHANGE_ID}_*
'''
}
}
failure {
sh '''
. venv/bin/activate
./PortaFiducia/scripts/delete_manual_cluster.py --cluster-name WindowsLibfabricCi_${env.CHANGE_ID}_*
'''
}
aborted {
sh '. venv/bin/activate; ./PortaFiducia/scripts/delete_manual_cluster.py --cluster-name "$BUILD_TAG"\'*\' --region $REGION'
}
// Cleanup workspace after job completes.
cleanup {
deleteDir()
}
Expand Down

0 comments on commit 86066cd

Please sign in to comment.