bug(containerd): aws-node pods crashlooping using containerd version >=1.7.2x #2067
Comments
There was a regression test for this bug added in, and I don't see it failing in recent runs. @henry118, what's your take on this?
@cloudwitch @dkennedy09 @BJKupka can any of you confirm the AMI ID you're using if you're also seeing this error?
We're on the same team. This is seen in multiple clusters. It's likely something with our configuration. Happens with EKS 1.28 and 1.29 AMIs after 10/9/2024.
Ah gotcha! Can you open a case with AWS support so we can dig into the logs?
We saw this with the official EKS AMI, and we're also seeing it with the customized AMIs our company builds (which is how we discovered this issue). Those are simply run through a security scan and the CloudWatch Agent gets installed; that's all I've been able to see we do to them (and all I've been told by the folks who make them). We tested the official AMI to make sure we see the same issue there, to rule out our AMI customization as the cause. The last known good 1.28 Arm64 AMI was built from a release prior to 10/9/2024. I believe our AMIs are customized in us-east-1 if you need to hunt down those AMIs. We are also running Karpenter.
We'll follow up in the support case -- after looking at the logs, I don't think this is a recurrence of the bug described in #1933.
For anyone looking at this in the future, the issue is in our userdata. Strip your UserData down to only the bootstrap.sh call. We're going to resume troubleshooting Monday. My guess is the portion where we set some Route53 records is making the CNI unhappy with the newer containerd versions. Once we get this Route53 command unraveled and the rest of the userdata troubleshot, I'll report back. It's pretty wild that we were able to run our userdata for 3+ years practically unchanged and a routine AMI update blew us up. Should have gone to Bottlerocket a long time ago...
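For reference, a minimal Amazon Linux 2 userdata along those lines might look like the sketch below; the cluster name is a placeholder, and the /etc/eks/bootstrap.sh path assumes the EKS-optimized AL2 AMI.

```bash
#!/bin/bash
set -o errexit

# Bare-bones userdata for troubleshooting: call only the stock EKS
# bootstrap script, with every custom step (Route53 records, agents,
# security scans) removed. "my-cluster" is a placeholder cluster name.
/etc/eks/bootstrap.sh my-cluster
```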
Based on further troubleshooting, it has been determined that our issue was caused by the --resolv-conf flag being passed to bootstrap.sh via --kubelet-extra-args. Once we removed this flag (which appears to be deprecated), our nodes were able to go into a Ready state.
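A sketch of that change, assuming a standard bootstrap.sh invocation in userdata; the cluster name and the --node-labels argument are illustrative placeholders, and only the removal of --resolv-conf reflects the fix described above.

```bash
# Before: --resolv-conf passed through --kubelet-extra-args; on AMIs
# shipping containerd >= 1.7.22 the nodes never reached Ready for us.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--resolv-conf=/etc/resolv.conf --node-labels=team=platform'

# After: drop the deprecated --resolv-conf flag and keep the rest.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--node-labels=team=platform'
```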
Bumped into the same issue. We are using the related PR.
What happened:
While upgrading our EKS cluster to 1.29, we are seeing our vpc-cni aws-node pods crashloop. From the containerd-log.txt file we are seeing the following error messages:
This seems to be related to the following issue, which has been closed, but it appears it is still happening. If we downgrade containerd to v1.7.11, everything works.
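For anyone who needs the workaround in the meantime, a rough sketch of downgrading containerd on an Amazon Linux 2 node follows; the exact package version string depends on what your yum repositories offer, so check the available builds first.

```bash
# List the containerd builds available from the configured repos.
yum list --showduplicates containerd

# Downgrade to a 1.7.11 build (version string is an example; pick one
# from the list above), then restart the runtime and kubelet.
sudo yum downgrade -y containerd-1.7.11
sudo systemctl restart containerd kubelet
```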
What you expected to happen:
Expected nodes to come up in a Ready state and aws-node pods not to crashloop
How to reproduce it (as minimally and precisely as possible):
To reproduce, we configure our nodes to use the AMI that contains containerd v1.7.22 or v1.7.23.
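One way to check which AMI (and therefore which containerd release) a node group will pick up is to query the public SSM parameter for the EKS-optimized AMI; the Kubernetes version and region below are examples, and the containerd version inside a given AMI release still needs to be confirmed from its release notes.

```bash
# Resolve the currently recommended EKS-optimized AL2 AMI for 1.29 in
# us-east-1. Recent releases of this AMI ship containerd >= 1.7.22.
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2/recommended/image_id \
  --region us-east-1 \
  --query 'Parameter.Value' \
  --output text
```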
Environment: