
bug(containerd): aws-node pods crashlooping using containerd version >=1.7.2x #2067

Closed
ronberna opened this issue Nov 20, 2024 · 10 comments
Labels: bug (Something isn't working)

Comments

@ronberna

What happened:
While upgrading our EKS cluster to 1.29, we are seeing our vpc-cni aws-node pods crashloop. In the containerd-log.txt file, we see the following error messages:

Nov 18 22:55:26 ip-10-99-173-244.lab.opssuite.lab.swacorp.com containerd[1896]: time="2024-11-18T22:55:26.464516662Z" level=error msg="ExecSync for \"1ef544b5734aeace6a5fc46d8ae062d8acbd6de3b5380836ddd63f855a72deac\" failed" error="failed to exec in container: failed to start exec \"8e9a34387d4dfb6a1f52814c6d471d730b3563cafc483f38ea432ff9b06ca22d\": OCI runtime exec failed: exec failed: cannot exec in a stopped container: unknown"
Nov 18 22:55:26 ip-10-99-173-244.lab.opssuite.lab.swacorp.com containerd[1896]: time="2024-11-18T22:55:26.472394651Z" level=error msg="ExecSync for \"1ef544b5734aeace6a5fc46d8ae062d8acbd6de3b5380836ddd63f855a72deac\" failed" error="failed to exec in container: failed to create exec \"a5d487e13ae5dfb2defe52aa9f9e1781982319a34a750f21a17d0b4decb027d7\": cannot exec in a stopped state: unknown"
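
For reference, this is roughly how the errors were pulled off an affected node (a sketch assuming SSH or SSM access to the node; the `k8s-app=aws-node` label is the default one on the VPC CNI DaemonSet pods):

```bash
# On the affected node: look for the exec failures in the containerd journal
sudo journalctl -u containerd --since "1 hour ago" | grep -i "failed to exec"

# List all containers (including exited ones) to see the aws-node restarts
sudo crictl ps -a | grep aws-node

# From a machine with cluster access: confirm which aws-node pods are crashlooping
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
```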

This seems to be related to a previously reported issue that has since been closed, but it appears this is still happening. If we downgrade containerd to v1.7.11, everything works.
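
The downgrade was along these lines (a sketch only; the exact package version string available in the Amazon Linux 2 repositories is an assumption and may differ):

```bash
# Check the containerd version currently installed on the node
containerd --version

# Downgrade and restart containerd (version string may need the full version-release)
sudo yum downgrade -y containerd-1.7.11
sudo systemctl restart containerd
```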
What you expected to happen:
Expected nodes to come up in a Ready state and aws-node pods not to crashloop
How to reproduce it (as minimally and precisely as possible):
To reproduce, we configure our nodes to use an AMI that contains containerd v1.7.22 or v1.7.23.
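
For example, a sketch of how the recommended AMI is resolved (this is the same SSM parameter path listed under Environment below):

```bash
# Resolve the current recommended EKS-optimized AL2 arm64 AMI for Kubernetes 1.29
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2-arm64/recommended/image_id \
  --region us-east-1 \
  --query 'Parameter.Value' \
  --output text

# On a node launched from that AMI, confirm the containerd version
containerd --version
```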
Environment:

  • AWS Region: us-east-1
  • Instance Type(s): any instance type
  • Cluster Kubernetes version: v1.29
  • Node Kubernetes version: v1.29.8-eks-a737599
  • AMI Version: /aws/service/eks/optimized-ami/1.29/amazon-linux-2-arm64/recommended/image_id ami-0f6a2c7eede2de322
  • Kube Proxy add-on version: v1.29.7-eksbuild.5
  • Amazon VPC CNI add-on version: v1.19.0-eksbuild.1
@ronberna added the bug label on Nov 20, 2024
@cartermckinnon
Member

There was a regression test for this bug added in containerd: containerd/containerd#10649

And I don't see it failing in recent release/1.7 CI runs: https://github.com/containerd/containerd/actions/workflows/ci.yml?query=branch%3Arelease%2F1.7

@henry118 what's your take on this?

@cartermckinnon
Member

@cloudwitch @dkennedy09 @BJKupka can any of you confirm the AMI ID you're using if you're also seeing this error?

@cloudwitch

> @cloudwitch @dkennedy09 @BJKupka can any of you confirm the AMI ID you're using if you're also seeing this error?

We're on the same team. This is seen in multiple clusters. It's likely something with our configuration.

Happens with EKS 1.28 and 1.29 AMIs after 10/9/2024.

@cartermckinnon
Member

Ah gotcha! Can you open a case with AWS support so we can dig into the logs?

@cloudwitch

173196639600686 is our case.

@cloudwitch

We saw this with the official EKS AMI, and we're also seeing it with the customized AMIs our company builds (which is how we discovered this issue). Those AMIs are simply run through a security scan and have the CloudWatch Agent installed; that's all I've been able to see that we do to them (and all I've been told by the folks who build them).

We tested the official AMI to confirm we see the same issue there, ruling out our AMI customization as the cause.

Last known good 1.28 Arm64 AMI was built from ami-04b274e2e76eb396a.
Last known good 1.29 Arm64 AMI was built from ami-000d85b557036c5bb.
Last known good 1.28 AMD64 AMI was built from ami-0d3cb2ae67f05cf0b.
Last known good 1.29 AMD64 AMI was built from ami-02561a005c32adc67.

I believe our AMIs are customized in us-east-1 if you need to hunt down those AMIs.

We are also running Karpenter 0.37.0, which should be compatible with EKS 1.29 since the compatibility matrix shows >=0.34. We're running a beta version rather than 1.0.0+, which may be an interesting data point; that's the only thing I can think of that we're doing that might be considered "weird".

@cartermckinnon
Member

We'll follow up in the support case -- after looking at the logs, I don't think this is a recurrence of the bug described in #1933.

@cartermckinnon closed this as not planned on Nov 21, 2024
@cloudwitch

cloudwitch commented Nov 22, 2024

For anyone looking at this in the future: the issue is in our user data.

Strip your user data down to only the /etc/eks/bootstrap.sh call. If that works, build it back up bit by bit until you figure out what the issue is (a minimal sketch follows below).
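
As a sketch, "stripped down" user data is essentially just this (the cluster name is a placeholder for your own):

```bash
#!/bin/bash
set -o xtrace
# Minimal user data: nothing but the EKS bootstrap call.
# <CLUSTER_NAME> is a placeholder for your cluster's name.
/etc/eks/bootstrap.sh <CLUSTER_NAME>
```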

We're going to resume troubleshooting on Monday. My guess is that the portion where we create some Route 53 records is making the CNI unhappy with the newer containerd versions. Once we get that Route 53 command unraveled and the rest of the user data troubleshot, I'll report back.

It's pretty wild that we were able to run our user data for 3+ years practically unchanged and a routine AMI update blew us up. We should have gone to Bottlerocket a long time ago...

@ronberna
Author

ronberna commented Nov 27, 2024

Based on further troubleshooting, we determined that our issue was caused by the --resolv-conf flag being passed to kubelet via bootstrap.sh --kubelet-extra-args. Once we removed this flag (which appears to be deprecated), our nodes were able to go into a Ready state.
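
To illustrate (a sketch, not our exact user data; cluster name and file path are placeholders):

```bash
# Before: nodes never reached Ready with the newer containerd versions
/etc/eks/bootstrap.sh <CLUSTER_NAME> \
  --kubelet-extra-args "--resolv-conf=/path/to/custom/resolv.conf"

# After: with the deprecated --resolv-conf flag removed, nodes go Ready again
/etc/eks/bootstrap.sh <CLUSTER_NAME>
```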

@lefterisALEX

lefterisALEX commented Jan 9, 2025

Bumped into the same issue. We are using the --resolv-conf flag, passed with --kubelet-extra-args and pointing to a custom resolv.conf (an empty file).
Up until containerd 1.7.19, when the custom file was empty (like ours), containerd used the host's /etc/resolv.conf instead. Starting with containerd 1.7.20, this no longer happens.

Related PR: containerd/containerd#10462
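
A rough way to see the behavior change (pod name and file path are placeholders):

```bash
# On the node: the custom file kubelet points at is empty
cat /path/to/custom/resolv.conf   # empty (placeholder path)
cat /etc/resolv.conf              # the host's actual resolver config

# Inside a pod: with containerd <= 1.7.19 this still showed the host's resolvers,
# with >= 1.7.20 it comes up empty, breaking DNS in the pod (placeholder pod name)
kubectl exec <some-pod> -- cat /etc/resolv.conf
```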
