Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ami pinning from scale-config.yml files #6163

Merged
merged 1 commit into from
Jan 13, 2025

Conversation

jeanschmidt
Copy link
Contributor

At news year eve, we had an small CI outage. Queue started to grow due to lack of capacity to create new linux instances. After investigating the issue we noticed that this is due to the pinning of labels in the ami tag of scale-config.yml and its variants. This is due to the fact that for security reasons, Amazon removed the tag on its services, so we could not resolve the AMI ID from the tag search.

The goal of this tag is to enable to migrate to a newer AMI type runner-by-runner, so we can troubleshoot problems and avoid the issue of being stuck in the migration because of a particular job that runs in a particular instance. This was included with the concept of variants.

Now that the migration is complete, the correct approach is to REMOVE these labels and rely on the labels that are pinned at release/deploy time. They are safer, for many reasons, somo of them:

  • The AMI id is pinned at release time, so if the label is not available anymore, the instances are still able to be created;
  • At the release time, we are immediately notified that the label is not available anymore and we need to upgrade, plus it prevents moving forward and deploying a broken state.
  • We are able to test the changes (honestly, many times we don't, but can and should)
  • Minor version upgrades, that are potentially problematic, can be rolled back faster;
  • We have "hands on controls" and are aware of the releases, so we know when a release is triggered and can monitor. Over have the change take immediate effect when Amazon releases a newer minor version;
  • Rolls forward is faster to a newer release and is guaranteed to be complete;

So, to avoid outages similar to what we had, this action should be taken.

This is on top of the following changes that correctly reflected the pinning we're using to the release:

cc @zxiiro @malfet @atalman @seemethere @ZainRizvi

Copy link

vercel bot commented Jan 13, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Updated (UTC)
torchci ⬜️ Ignored (Inspect) Jan 13, 2025 6:13pm

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 13, 2025
Copy link
Contributor

@huydhn huydhn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeanschmidt jeanschmidt merged commit d156788 into main Jan 13, 2025
5 checks passed
@jeanschmidt jeanschmidt deleted the jeanschmidt/update_ami_pinning branch January 13, 2025 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants