Commit
Remove ami pinning from scale-config.yml files (#6163)
On New Year's Eve we had a small CI outage. The queue started to grow because there was not enough capacity to create new Linux instances. After investigating, we found that the cause was the pinning of labels in the `ami` tag of `scale-config.yml` and its variants: for security reasons, Amazon removed the tag on its services, so we could not resolve the AMI ID from the tag search.

The goal of this tag was to allow migrating to a newer AMI type runner by runner, so that we could troubleshoot problems and avoid getting stuck in the migration because of a particular job that runs on a particular instance. It was introduced together with the concept of variants.

Now that the migration is complete, the correct approach is to **REMOVE** these labels and rely on the labels that are pinned at release/deploy time. They are safer for many reasons, some of which are:

* The AMI ID is pinned at release time, so if the label is no longer available, instances can still be created;
* At release time, we are immediately notified that the label is no longer available and that we need to upgrade, which also prevents moving forward and deploying a broken state;
* We are able to test the changes (honestly, many times we don't, but we can and should);
* Minor version upgrades, which are potentially problematic, can be rolled back faster;
* We have hands-on control and are aware of the releases, so we know when a release is triggered and can monitor it, rather than having the change take effect immediately when Amazon releases a newer minor version;
* Rolling forward to a newer release is faster and is guaranteed to be complete.

So, to avoid outages similar to the one we had, this action should be taken.
This is on top of the following changes, which correctly reflect the pinning we're using in the release:

* pytorch-labs/pytorch-gha-infra@ea8466e - released on: https://github.com/pytorch-labs/pytorch-gha-infra/actions/runs/12752938298
* pytorch/ci-infra@f564bcf - released on: https://github.com/pytorch/ci-infra/actions/runs/12753299999

cc @zxiiro @malfet @atalman @seemethere @ZainRizvi
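The difference between resolving an AMI label every time an instance is launched and pinning the AMI ID once at release time can be illustrated with a minimal sketch. This is not the autoscaler's code: the in-memory `CATALOG`, the image names and IDs, and the functions `resolve_ami`, `pin_at_release`, and `launch` are all hypothetical stand-ins for the real image lookup.

```python
from fnmatch import fnmatch

# Hypothetical catalog standing in for the provider's image lookup:
# image name -> (AMI id, creation date). All values are made up.
CATALOG = {
    "al2023-ami-2023.5.20240701-kernel-6.1-x86_64": ("ami-0aaa", "2024-07-01"),
    "al2023-ami-2023.6.20241212-kernel-6.1-x86_64": ("ami-0bbb", "2024-12-12"),
}


def resolve_ami(pattern, catalog):
    """Return the id of the newest image whose name matches the pattern."""
    matches = [
        (created, ami_id)
        for name, (ami_id, created) in catalog.items()
        if fnmatch(name, pattern)
    ]
    if not matches:
        # With runtime resolution this error surfaces while CI is running.
        raise LookupError(f"no image matches {pattern!r}")
    return max(matches)[1]


def pin_at_release(pattern, catalog):
    """Resolve the label once; a missing label fails the release itself."""
    return {"ami_id": resolve_ami(pattern, catalog)}


def launch(pinned):
    """Launch from the pinned id, with no name lookup involved."""
    return f"launching instance from {pinned['ami_id']}"


if __name__ == "__main__":
    pinned = pin_at_release("al2023-ami-2023.6.*-kernel-6.1-x86_64", CATALOG)
    CATALOG.clear()  # simulate the label disappearing after the release
    print(launch(pinned))  # still works, because the id was pinned earlier
```

In this sketch, a label that disappears after the release only affects the next call to `pin_at_release`, where it fails loudly, while `launch` keeps working from the ID that was already pinned.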
1 parent 80fbd83, commit d156788
Showing 3 changed files with 0 additions and 90 deletions.