AKS doesn't use agentpool's managed identity against ACR when configuring registry mirror in containerd #7271

Open
jjournet opened this issue Oct 10, 2024 · 2 comments

@jjournet

What happened:

We deploy applications to AKS clusters from artifacts stored on ghcr.io.

In Azure, we run a private AKS cluster with internet access restricted through a firewall. To retrieve images, we deployed an ACR (Azure Container Registry) and configured an ACR cache rule to cache images from ghcr.io. We also use RBAC: the agent pool's managed identity has the AcrPull and Reader roles on the ACR.

For instance, our image is ghcr.io/company/images/controller:1, our ACR is acr1.azurecr.io, and we use image: acr1.azurecr.io/company/images/controller:1.
The ACR cache rule is company/images/* -> ghcr.io/company/images/*
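
For reference, the cache rule and role assignments were created roughly along these lines (an az CLI sketch; the registry, cluster, and resource group names are placeholders, and wildcard cache rules may require a recent az CLI version):

# Placeholder names: acr1 (registry), my-aks (cluster), my-rg (resource group)
ACR_ID=$(az acr show --name acr1 --resource-group my-rg --query id -o tsv)
KUBELET_ID=$(az aks show --name my-aks --resource-group my-rg \
  --query identityProfile.kubeletidentity.objectId -o tsv)

# Grant the agent pool (kubelet) identity pull and read access on the registry
az role assignment create --assignee "$KUBELET_ID" --role AcrPull --scope "$ACR_ID"
az role assignment create --assignee "$KUBELET_ID" --role Reader --scope "$ACR_ID"

# Cache rule: company/images/* in the ACR is backed by ghcr.io/company/images/*
az acr cache create --registry acr1 --name ghcr-cache \
  --source-repo "ghcr.io/company/images/*" --target-repo "company/images/*"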

This works as expected: the image is pulled from ghcr.io into the ACR on first request, and our pod starts.

To further simplify deployment, we want to keep the image reference in the deployments as ghcr.io/company/images/controller:1 instead of the local ACR name. To do that, we configured containerd with a registry mirror as described in this issue's comment.

The hosts.toml for the ghcr.io domain is configured as follows:

server = "https://ghcr.io"
[host."https://acr1.azurecr.io"]
    capabilities = ["pull", "resolve"]

However, with this in place the image pull fails. The node does try to pull the image from acr1, but anonymously, without using the agent pool's managed identity, and the pod events show the following error:

Warning  Failed     3s    kubelet            Failed to pull image "ghcr.io/hqy01/jeep/images/busybox:latest": failed to pull and unpack image "ghcr.io/hqy01/jeep/images/busybox:latest": failed to resolve reference "ghcr.io/hqy01/jeep/images/busybox:latest": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://pue1dev7831pplt08acr0001.azurecr.io/oauth2/token?scope=repository%3Ahqy01%2Fjeep%2Fimages%2Fbusybox%3Apull&service=pue1dev7831pplt08acr0001.azurecr.io: 401 Unauthorized

(I pasted the error exactly as I got it, without simplifying the names as I did in the explanation above.)

What you expected to happen:

Since pulling the image directly from acr1.azurecr.io works, and since containerd is configured to rewrite ghcr.io references to the ACR, we expect the same behavior: the node should be able to pull the image from acr1.

How to reproduce it (as minimally and precisely as possible):

  • create an AKS cluster with managed identity enabled and no network access to ghcr.io
  • create an ACR, with a cache rule to pull images from ghcr.io
  • grant the Reader and AcrPull roles to the AKS agent pool identity on the ACR
  • deploy the containerd registry mirror configuration
  • create a pod that uses the ghcr.io image reference (see the sketch after this list)
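
A rough az CLI sketch of these steps (resource names are placeholders; the cache rule, role assignments, and hosts.toml are the ones shown earlier):

# Private AKS cluster with a managed identity; egress to ghcr.io is blocked
# separately (e.g. by an Azure Firewall), so nodes cannot reach ghcr.io directly.
az aks create --name my-aks --resource-group my-rg \
  --enable-managed-identity --enable-private-cluster

# Registry that will front ghcr.io via the cache rule shown above.
az acr create --name acr1 --resource-group my-rg --sku Standard

# After applying the cache rule, the role assignments, and the hosts.toml mirror
# config from above, start a pod that references the upstream ghcr.io name:
kubectl run mirror-test --image=ghcr.io/company/images/controller:1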

Anything else we need to know?:

I am trying to reproduce the issue in an environment with fewer dependencies and less complexity. I haven't managed to so far; I will try again when time permits.

Environment:

  • Kubernetes version: v1.29.8
  • Cloud provider or hardware configuration: AKS cluster in Azure
  • OS (e.g: cat /etc/os-release): AKSUbuntu/images/2204gen2containerd/versions/202409.23.0
  • Network plugin and version (if this is a network-related bug): Azure CNI Pod Subnet
jjournet added the kind/bug label on Oct 10, 2024

phealy commented Oct 17, 2024

This is an AKS issue, not a cloud-provider-azure issue - please see Azure/AKS#1940 for more details. I'll be posting an update there.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jan 15, 2025