
bug(containerd): aws-node pods crashlooping using containerd version >=1.7.2x #2067

Closed
ronberna opened this issue Nov 20, 2024 · 10 comments
Labels: bug (Something isn't working)

Comments

@ronberna

What happened:
While upgrading our EKS cluster to 1.29, we are seeing our vpc-cni aws-node pods crashloop. In the containerd-log.txt file, we see the following error messages:

Nov 18 22:55:26 ip-10-99-173-244.lab.opssuite.lab.swacorp.com containerd[1896]: time="2024-11-18T22:55:26.464516662Z" level=error msg="ExecSync for \"1ef544b5734aeace6a5fc46d8ae062d8acbd6de3b5380836ddd63f855a72deac\" failed" error="failed to exec in container: failed to start exec \"8e9a34387d4dfb6a1f52814c6d471d730b3563cafc483f38ea432ff9b06ca22d\": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown"
Nov 18 22:55:26 ip-10-99-173-244.lab.opssuite.lab.swacorp.com containerd[1896]: time="2024-11-18T22:55:26.472394651Z" level=error msg="ExecSync for \"1ef544b5734aeace6a5fc46d8ae062d8acbd6de3b5380836ddd63f855a72deac\" failed" error="failed to exec in container: failed to create exec \"a5d487e13ae5dfb2defe52aa9f9e1781982319a34a750f21a17d0b4decb027d7\": cannot exec in a stopped state: unknown"
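
For reference, this is roughly how the errors were pulled off an affected node (a sketch assuming SSH or SSM access to the node; the `k8s-app=aws-node` label is the default one on the VPC CNI DaemonSet pods):

```bash
# On the affected node: look for the exec failures in the containerd journal
sudo journalctl -u containerd --since "1 hour ago" | grep -i "failed to exec"

# List all containers (including exited ones) to see the aws-node restarts
sudo crictl ps -a | grep aws-node

# From a machine with cluster access: confirm which aws-node pods are crashlooping
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
```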

This seems to be related to a previously reported issue that has since been closed, but it appears this is still happening. If we downgrade containerd to v1.7.11, everything works.
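
The downgrade was along these lines (a sketch only; the exact package version string available in the Amazon Linux 2 repositories is an assumption and may differ):

```bash
# Check the containerd version currently installed on the node
containerd --version

# Downgrade and restart containerd (version string may need the full version-release)
sudo yum downgrade -y containerd-1.7.11
sudo systemctl restart containerd
```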
What you expected to happen:
Expected nodes to come up in a Ready state and aws-node pods not to crashloop
How to reproduce it (as minimally and precisely as possible):
To reproduce, we configure our nodes to use an AMI that contains containerd v1.7.22 or v1.7.23.
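
For example, a sketch of how the recommended AMI is resolved (this is the same SSM parameter path listed under Environment below):

```bash
# Resolve the current recommended EKS-optimized AL2 arm64 AMI for Kubernetes 1.29
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2-arm64/recommended/image_id \
  --region us-east-1 \
  --query 'Parameter.Value' \
  --output text

# On a node launched from that AMI, confirm the containerd version
containerd --version
```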
Environment:

  • AWS Region: us-east-1
  • Instance Type(s): any instance type
  • Cluster Kubernetes version: v1.29
  • Node Kubernetes version: v1.29.8-eks-a737599
  • AMI Version: /aws/service/eks/optimized-ami/1.29/amazon-linux-2-arm64/recommended/image_id ami-0f6a2c7eede2de322
  • Kube Proxy add-on version: v1.29.7-eksbuild.5
  • Amazon VPC CNI add-on version: v1.19.0-eksbuild.1
@ronberna added the bug label on Nov 20, 2024
@cartermckinnon
Member

There was a regression test for this bug added in containerd: containerd/containerd#10649

And I don't see it failing in recent release/1.7 CI runs: https://github.com/containerd/containerd/actions/workflows/ci.yml?query=branch%3Arelease%2F1.7

@henry118 what's your take on this?

@cartermckinnon
Member

@cloudwitch @dkennedy09 @BJKupka can any of you confirm the AMI ID you're using if you're also seeing this error?

@cloudwitch

> @cloudwitch @dkennedy09 @BJKupka can any of you confirm the AMI ID you're using if you're also seeing this error?

We're on the same team. This is seen in multiple clusters. It's likely something with our configuration.

Happens with EKS 1.28 and 1.29 AMIs after 10/9/2024.

@cartermckinnon
Member

Ah gotcha! Can you open a case with AWS support so we can dig into the logs?

@cloudwitch

173196639600686 is our case.

@cloudwitch

We saw this with the official EKS AMI, and we're also seeing it with the customized AMIs our company builds (which is how we discovered this issue). Those AMIs are simply run through a security scan and have the CloudWatch Agent installed; that's all I've been able to see that we do to them (and all I've been told by the folks who build them).

We tested the official AMI to confirm we see the same issue there, ruling out our AMI customization as the cause.

Last known good 1.28 Arm64 AMI was built from ami-04b274e2e76eb396a.
Last known good 1.29 Arm64 AMI was built from ami-000d85b557036c5bb.
Last known good 1.28 AMD64 AMI was built from ami-0d3cb2ae67f05cf0b.
Last known good 1.29 AMD64 AMI was built from ami-02561a005c32adc67.

I believe our AMIs are customized in us-east-1 if you need to hunt down those AMIs.

We are also running Karpenter 0.37.0, which should be compatible with EKS 1.29 since the compatibility matrix shows >=0.34. We're running a beta version rather than 1.0.0+, which may be an interesting data point; that's the only thing I can think of that we're doing that might be considered "weird".

@cartermckinnon
Member

We'll follow up in the support case -- after looking at the logs, I don't think this is a recurrence of the bug described in #1933.

@cartermckinnon closed this as not planned on Nov 21, 2024
@cloudwitch

cloudwitch commented Nov 22, 2024

For anyone looking at this in the future: the issue is in our user data.

Strip your user data down to only the /etc/eks/bootstrap.sh call. If that works, build it back up bit by bit until you figure out what the issue is (a minimal sketch follows below).
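
As a sketch, "stripped down" user data is essentially just this (the cluster name is a placeholder for your own):

```bash
#!/bin/bash
set -o xtrace
# Minimal user data: nothing but the EKS bootstrap call.
# <CLUSTER_NAME> is a placeholder for your cluster's name.
/etc/eks/bootstrap.sh <CLUSTER_NAME>
```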

We're going to resume troubleshooting on Monday. My guess is that the portion where we create some Route 53 records is making the CNI unhappy with the newer containerd versions. Once we get that Route 53 command unraveled and the rest of the user data troubleshot, I'll report back.

It's pretty wild that we were able to run our user data for 3+ years practically unchanged and a routine AMI update blew us up. We should have gone to Bottlerocket a long time ago...

@ronberna
Author

ronberna commented Nov 27, 2024

Based on further troubleshooting, we determined that our issue was caused by the --resolv-conf flag being passed to kubelet via bootstrap.sh --kubelet-extra-args. Once we removed this flag (which appears to be deprecated), our nodes were able to go into a Ready state.
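
To illustrate (a sketch, not our exact user data; cluster name and file path are placeholders):

```bash
# Before: nodes never reached Ready with the newer containerd versions
/etc/eks/bootstrap.sh <CLUSTER_NAME> \
  --kubelet-extra-args "--resolv-conf=/path/to/custom/resolv.conf"

# After: with the deprecated --resolv-conf flag removed, nodes go Ready again
/etc/eks/bootstrap.sh <CLUSTER_NAME>
```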

@lefterisALEX

lefterisALEX commented Jan 9, 2025

Bumped into the same issue. We are using the --resolv-conf flag, passed with --kubelet-extra-args and pointing to a custom resolv.conf (an empty file).
Up until containerd 1.7.19, when the custom file was empty (like ours), containerd used the host's /etc/resolv.conf instead. Starting with containerd 1.7.20, this no longer happens.

Related PR: containerd/containerd#10462
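
A rough way to see the behavior change (pod name and file path are placeholders):

```bash
# On the node: the custom file kubelet points at is empty
cat /path/to/custom/resolv.conf   # empty (placeholder path)
cat /etc/resolv.conf              # the host's actual resolver config

# Inside a pod: with containerd <= 1.7.19 this still showed the host's resolvers,
# with >= 1.7.20 it comes up empty, breaking DNS in the pod (placeholder pod name)
kubectl exec <some-pod> -- cat /etc/resolv.conf
```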
