Remove ami pinning from scale-config.yml files (#6163)
On New Year's Eve, we had a small CI outage. The queue started to grow
because there was not enough capacity to create new Linux instances.
After investigating, we found that the cause was the pinning of labels in
the `ami` tag of `scale-config.yml` and its variants. For security
reasons, Amazon removed the tag on its services, so we could not resolve
the AMI ID from the tag search.

The goal of this tag was to let us migrate to a newer AMI type
runner by runner, so we could troubleshoot problems and avoid getting
stuck in the migration because of a particular job that runs on a
particular instance. It was introduced together with the concept of
variants.

Now that the migration is complete, the correct approach is to
**REMOVE** these labels and rely on the labels that are pinned at
release/deploy time. They are safer for many reasons, some of them:

* The AMI ID is pinned at release time, so if the label is no longer
available, instances can still be created;
* At release time, we are immediately notified that the label is no
longer available and that we need to upgrade, which also prevents us
from moving forward and deploying a broken state;
* We are able to test the changes (honestly, many times we don't, but we
can and should);
* Minor version upgrades, which are potentially problematic, can be
rolled back faster;
* We have hands-on control and are aware of the releases, so we know
when a release is triggered and can monitor it, instead of having the
change take immediate effect when Amazon releases a newer minor version;
* Rolling forward to a newer release is faster and is guaranteed to be
complete.
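
To make the release-time pinning idea concrete, here is a minimal sketch in Python. It is not the code used by the release pipelines referenced in this commit. The function name `pin_ami`, the dictionary keys, and the sample catalog are all hypothetical; the catalog stands in for whatever image listing the deploy step has access to. The point is only that the wildcard is resolved once, at deploy time, and that a missing match stops the deploy instead of surfacing later as a capacity problem.

```python
from fnmatch import fnmatchcase


def pin_ami(pattern, images):
    """Resolve a name pattern to one concrete AMI ID at deploy time.

    `images` is a list of dicts with the hypothetical keys "id", "name"
    and "created" (ISO 8601 timestamps in one consistent format).
    Raises LookupError when nothing matches, so the deploy fails loudly
    instead of shipping a configuration that cannot create instances.
    """
    matches = [img for img in images if fnmatchcase(img["name"], pattern)]
    if not matches:
        raise LookupError(f"no AMI matches {pattern!r}; refusing to deploy")
    # Same-format ISO 8601 strings sort chronologically.
    newest = max(matches, key=lambda img: img["created"])
    return newest["id"]


# Made-up sample data, for illustration only.
catalog = [
    {"id": "ami-0aaa", "created": "2024-12-12T00:00:00Z",
     "name": "al2023-ami-2023.6.20241212.0-kernel-6.1-x86_64"},
    {"id": "ami-0bbb", "created": "2025-01-07T00:00:00Z",
     "name": "al2023-ami-2023.6.20250107.0-kernel-6.1-x86_64"},
    {"id": "ami-0ccc", "created": "2025-01-07T00:00:00Z",
     "name": "al2023-ami-2023.6.20250107.0-kernel-6.1-arm64"},
]

pinned = pin_ami("al2023-ami-2023.6.202*-kernel-6.1-x86_64", catalog)
```

The resolved ID is what gets deployed, so runners keep launching from it even if the name pattern later stops matching anything.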

So, to avoid outages similar to what we had, this action should be
taken.
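
The change itself amounts to dropping the `ami` key from every runner type. The files in this commit were edited directly; the following is only a sketch of the same transformation on an in-memory mapping, assuming the YAML has already been loaded into plain Python dicts. `strip_ami_pins` is a hypothetical helper name, and the two entries are abbreviated from the config files below.

```python
def strip_ami_pins(runner_types):
    """Return a copy of `runner_types` without the per-runner `ami` key."""
    return {
        name: {key: value for key, value in spec.items() if key != "ami"}
        for name, spec in runner_types.items()
    }


runner_types = {
    "lf.linux.2xlarge": {
        "disk_size": 150,
        "instance_type": "c5.2xlarge",
        "is_ephemeral": False,
        "os": "linux",
        "ami": "al2023-ami-2023.6.202*-kernel-6.1-x86_64",
    },
    # Entries without an `ami` key pass through unchanged.
    "lf.windows.g4dn.xlarge": {
        "disk_size": 256,
        "instance_type": "g4dn.xlarge",
    },
}

cleaned = strip_ami_pins(runner_types)
```

The input mapping is left untouched, which makes it easy to compare the configuration before and after.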

This is on top of the following changes, which correctly reflect the
pinning we're using in the release:
* pytorch-labs/pytorch-gha-infra@ea8466e
  - released on: https://github.com/pytorch-labs/pytorch-gha-infra/actions/runs/12752938298
* pytorch/ci-infra@f564bcf
  - released on: https://github.com/pytorch/ci-infra/actions/runs/12753299999

cc @zxiiro @malfet @atalman @seemethere @ZainRizvi
jeanschmidt authored Jan 13, 2025
1 parent 80fbd83 commit d156788
Showing 3 changed files with 0 additions and 90 deletions.
30 changes: 0 additions & 30 deletions .github/lf-canary-scale-config.yml
@@ -37,157 +37,131 @@ runner_types:
     instance_type: m7i-flex.8xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.12xlarge:
     disk_size: 200
     instance_type: c5.12xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.10xlarge.avx2:
     disk_size: 200
     instance_type: m4.10xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.24xl.spr-metal:
     disk_size: 200
     instance_type: c7i.metal-24xl
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.16xlarge.spr:
     disk_size: 200
     instance_type: c7i.16xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.9xlarge.ephemeral:
     disk_size: 200
     instance_type: c5.9xlarge
     is_ephemeral: true
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.12xlarge.ephemeral:
     disk_size: 200
     instance_type: c5.12xlarge
     is_ephemeral: true
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.16xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g3.16xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.24xlarge:
     disk_size: 150
     instance_type: c5.24xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.24xlarge.ephemeral:
     disk_size: 150
     instance_type: c5.24xlarge
     is_ephemeral: true
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.2xlarge:
     disk_size: 150
     instance_type: c5.2xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.4xlarge:
     disk_size: 150
     instance_type: c5.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.4xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g3.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.8xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g3.8xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.g4dn.12xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g4dn.12xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.g4dn.metal.nvidia.gpu:
     disk_size: 150
     instance_type: g4dn.metal
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.g5.48xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g5.48xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.g5.12xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g5.12xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.g5.4xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g5.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.g6.4xlarge.experimental.nvidia.gpu:
     disk_size: 150
     instance_type: g6.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.large:
     disk_size: 15
     instance_type: c5.large
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.arm64.2xlarge:
     disk_size: 256
     instance_type: t4g.2xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-arm64
   lf.c.linux.arm64.m7g.4xlarge:
     disk_size: 256
     instance_type: m7g.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-arm64
   lf.c.linux.arm64.2xlarge.ephemeral:
     disk_size: 256
     instance_type: t4g.2xlarge
     is_ephemeral: true
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-arm64
   lf.c.linux.arm64.m7g.4xlarge.ephemeral:
     disk_size: 256
     instance_type: m7g.4xlarge
     is_ephemeral: true
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-arm64
   lf.c.linux.arm64.m7g.metal:
     disk_size: 256
     instance_type: m7g.metal
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-arm64
   lf.c.windows.g4dn.xlarge:
     disk_size: 256
     instance_type: g4dn.xlarge
@@ -228,22 +202,18 @@ runner_types:
     instance_type: r5.2xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.4xlarge.memory:
     disk_size: 300
     instance_type: r5.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.8xlarge.memory:
     disk_size: 400
     instance_type: r5.8xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.c.linux.12xlarge.memory:
     disk_size: 600
     instance_type: r5.12xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
30 changes: 0 additions & 30 deletions .github/lf-scale-config.yml
@@ -37,157 +37,131 @@ runner_types:
     instance_type: m7i-flex.8xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.12xlarge:
     disk_size: 200
     instance_type: c5.12xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.10xlarge.avx2:
     disk_size: 200
     instance_type: m4.10xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.24xl.spr-metal:
     disk_size: 200
     instance_type: c7i.metal-24xl
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.16xlarge.spr:
     disk_size: 200
     instance_type: c7i.16xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.9xlarge.ephemeral:
     disk_size: 200
     instance_type: c5.9xlarge
     is_ephemeral: true
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.12xlarge.ephemeral:
     disk_size: 200
     instance_type: c5.12xlarge
     is_ephemeral: true
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.16xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g3.16xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.24xlarge:
     disk_size: 150
     instance_type: c5.24xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.24xlarge.ephemeral:
     disk_size: 150
     instance_type: c5.24xlarge
     is_ephemeral: true
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.2xlarge:
     disk_size: 150
     instance_type: c5.2xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.4xlarge:
     disk_size: 150
     instance_type: c5.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.4xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g3.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.8xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g3.8xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.g4dn.12xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g4dn.12xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.g4dn.metal.nvidia.gpu:
     disk_size: 150
     instance_type: g4dn.metal
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.g5.48xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g5.48xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.g5.12xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g5.12xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.g5.4xlarge.nvidia.gpu:
     disk_size: 150
     instance_type: g5.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.g6.4xlarge.experimental.nvidia.gpu:
     disk_size: 150
     instance_type: g6.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.large:
     disk_size: 15
     instance_type: c5.large
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.arm64.2xlarge:
     disk_size: 256
     instance_type: t4g.2xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-arm64
   lf.linux.arm64.m7g.4xlarge:
     disk_size: 256
     instance_type: m7g.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-arm64
   lf.linux.arm64.2xlarge.ephemeral:
     disk_size: 256
     instance_type: t4g.2xlarge
     is_ephemeral: true
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-arm64
   lf.linux.arm64.m7g.4xlarge.ephemeral:
     disk_size: 256
     instance_type: m7g.4xlarge
     is_ephemeral: true
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-arm64
   lf.linux.arm64.m7g.metal:
     disk_size: 256
     instance_type: m7g.metal
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-arm64
   lf.windows.g4dn.xlarge:
     disk_size: 256
     instance_type: g4dn.xlarge
@@ -228,22 +202,18 @@ runner_types:
     instance_type: r5.2xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.4xlarge.memory:
     disk_size: 300
     instance_type: r5.4xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.8xlarge.memory:
     disk_size: 400
     instance_type: r5.8xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64
   lf.linux.12xlarge.memory:
     disk_size: 600
     instance_type: r5.12xlarge
     is_ephemeral: false
     os: linux
-    ami: al2023-ami-2023.6.202*-kernel-6.1-x86_64