I got pulled into an incident last week where half the pods in a new EKS cluster were sitting in Pending for 20 minutes. The node group had capacity, the image was in ECR, and nothing in the events log screamed an obvious problem. After digging in, it turned out the VPC had run out of secondary IP addresses — the cluster was using the default AWS VPC CNI configuration and the subnets were too small for the warm IP pool. The pods had nowhere to live.
If your pods are stuck, crash-looping, or failing to pull, here’s how to systematically find the cause.
## The Problem
Pods fail to start or become ready, showing one of these states:
| Status | Reason |
|---|---|
| `Pending` + `FailedScheduling` | No node has capacity for the pod's CPU/memory/GPU request |
| `Pending` + `FailedCreatePodSandBox` | CNI cannot allocate an IP address for the pod |
| `ImagePullBackOff` / `ErrImagePull` | Node cannot pull the container image from ECR |
| `CrashLoopBackOff` | Container starts and exits repeatedly |
| `CreateContainerConfigError` | Pod references a missing ConfigMap or Secret |
The cluster may appear healthy (`kubectl get nodes` shows all `Ready`) while pods cannot be scheduled. This disconnect between node health and pod health is what makes EKS troubleshooting frustrating.
## Why Does This Happen?
- VPC CNI IP exhaustion: The AWS VPC CNI assigns a real VPC IP to every pod. Each EC2 instance type has a maximum number of ENIs and IPs per ENI; a `t3.medium` only supports 17 usable pod IPs. Large deployments in small subnets run out of IPs long before they run out of CPU.
- Missing ECR pull permissions on the node role: EKS worker nodes pull images using the node IAM role, not the pod's service account. If `AmazonEC2ContainerRegistryReadOnly` isn't attached to the node role, every image pull fails with an auth error.
- Node group with no available capacity: If the Cluster Autoscaler or Karpenter isn't configured (or fails to scale), pods with resource requests larger than any node's free capacity stay Pending forever. This is especially common with GPU or ARM workloads that require specific node types.
- CoreDNS or kube-proxy not healthy: If CoreDNS pods are themselves Pending or crashing, every other pod that needs DNS resolution will fail readiness probes, creating cascading failures that look like application bugs.
- Liveness probe too aggressive: A liveness probe that fires before the application finishes initializing kills the pod during startup, triggering a CrashLoopBackOff that never resolves.
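The IP-exhaustion math is worth internalizing. With the default CNI configuration, max pods for an instance type is ENIs × (IPs per ENI − 1) + 2; with prefix delegation, each secondary IP slot becomes a /28 prefix of 16 addresses, capped at the recommended 110 pods for smaller instance types. A quick sketch (the ENI and IP counts below are the published values for `t3.medium` and `m5.large`):

```shell
# Max pods with the default VPC CNI: each ENI's primary IP is reserved,
# plus 2 for host-networking pods (aws-node, kube-proxy).
max_pods() {            # $1 = ENIs, $2 = IPs per ENI
  echo $(( $1 * ($2 - 1) + 2 ))
}

# With prefix delegation, every secondary IP slot holds a /28 (16 IPs),
# capped at 110 on smaller instance types.
max_pods_prefix() {     # $1 = ENIs, $2 = IPs per ENI
  n=$(( $1 * ($2 - 1) * 16 + 2 ))
  if [ "$n" -gt 110 ]; then echo 110; else echo "$n"; fi
}

max_pods 3 6          # t3.medium (3 ENIs, 6 IPs each)  -> 17
max_pods 3 10         # m5.large  (3 ENIs, 10 IPs each) -> 29
max_pods_prefix 3 10  # m5.large with prefix delegation -> 110
```

This is why the same workload that fits comfortably on one cluster refuses to schedule on another with identical instance types but smaller subnets.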
## The Fix
### Step 1: Get the Actual Failure Reason
`kubectl get pods` only shows the status. Use `describe` to see the underlying event:
```shell
kubectl describe pod my-app-6d4f8c9b7-xk2mn -n production
```
Scroll to the Events section at the bottom. That’s where the real error lives.
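With dozens of replicas, it helps to surface only the unhealthy pods before describing them one by one. A sketch of the filtering, run against a hypothetical sample of `kubectl get pods -n production` output (in practice you'd pipe the live command into the same awk filter):

```shell
# Hypothetical sample of `kubectl get pods -n production` output.
sample='NAME                      READY   STATUS             RESTARTS   AGE
my-app-6d4f8c9b7-xk2mn    0/1     ImagePullBackOff   0          9m
my-app-6d4f8c9b7-p9qrt    1/1     Running            0          9m
worker-7b5d9f6c4-zz81b    0/1     CrashLoopBackOff   6          12m'

# Keep the header plus any pod whose STATUS is not Running/Completed.
echo "$sample" | awk 'NR == 1 || ($3 != "Running" && $3 != "Completed")'
```

The two failing pods survive the filter; the healthy replica is dropped.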
### Step 2: Fix IP Exhaustion (FailedCreatePodSandBox)
Check how many IPs are available in your worker subnets:
```shell
aws ec2 describe-subnets \
  --filters "Name=tag:aws:cloudformation:stack-name,Values=eksctl-my-cluster-cluster" \
  --query "Subnets[*].{Id:SubnetId,AZ:AvailabilityZone,Available:AvailableIpAddressCount,Cidr:CidrBlock}" \
  --output table
```
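To make the low-IP condition jump out across many subnets, switch to `--output text` and filter. A sketch against a hypothetical sample of that text output (the subnet IDs are made up), flagging anything under 20 free addresses:

```shell
# Hypothetical sample of:
#   aws ec2 describe-subnets ... \
#     --query "Subnets[*].[SubnetId,AvailableIpAddressCount]" --output text
sample='subnet-0aa1bb2cc3dd4ee5f 3
subnet-0ff6ee5dd4cc3bb2a 241'

# Flag subnets with fewer than 20 free IPs.
echo "$sample" | awk '$2 < 20 { print $1, "is nearly exhausted:", $2, "IPs left" }'
```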
If AvailableIpAddressCount is close to zero, you have two options. Enable prefix delegation on the VPC CNI, which assigns /28 prefixes (16 IPs) instead of individual IPs and dramatically increases pod density:
```shell
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```
Prefix delegation only applies to nodes launched after it's enabled, because the kubelet's max-pods value is fixed at boot. For a `m5.large`, max pods rises from 29 to 110, but only on replacement nodes. One way to roll the existing nodes is to force a node group update:

```shell
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name standard-workers \
  --force
```
For a longer-term fix, add a secondary CIDR to the VPC and create a new node group using those larger subnets.
### Step 3: Fix ImagePullBackOff
If the image is in ECR, confirm the node role can pull it:
```shell
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name standard-workers \
  --query "nodegroup.nodeRole" \
  --output text
```
Check the attached policies:
```shell
aws iam list-attached-role-policies \
  --role-name eksctl-my-cluster-NodeInstanceRole-ABC123 \
  --output table
```
You need `AmazonEC2ContainerRegistryReadOnly`. If it's missing:
```shell
aws iam attach-role-policy \
  --role-name eksctl-my-cluster-NodeInstanceRole-ABC123 \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
```
If the image is in a different account, you also need the repository policy to allow that account’s node role. Check the ECR repository policy:
```shell
aws ecr get-repository-policy \
  --repository-name my-app \
  --region us-east-1
```
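If no policy grants the consuming account access, attach one. A minimal sketch of a repository policy allowing pulls from a hypothetical consumer account `111122223333` (the account ID and statement ID are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCrossAccountPull",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ]
    }
  ]
}
```

Apply it with `aws ecr set-repository-policy --repository-name my-app --policy-text file://policy.json`. The pulling account's node role still needs its own ECR read permissions; the repository policy and the IAM policy must both allow the pull.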
### Step 4: Fix FailedScheduling (Pending)
If the pod is pending due to insufficient resources, look at what it’s requesting versus what’s available:
```shell
kubectl describe pod my-app-6d4f8c9b7-xk2mn -n production | grep -A 2 "Requests:"
```
Compare against node capacity:
```shell
kubectl describe nodes | grep -A 5 "Allocated resources"
```
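Reading those blocks across many nodes is tedious; the percentages are what matter. A sketch that pulls the CPU request percentage out of a hypothetical sample of the `Allocated resources` section and flags anything over 80%:

```shell
# Hypothetical sample from `kubectl describe nodes`:
sample='  Resource           Requests      Limits
  --------           --------      ------
  cpu                1810m (91%)   2100m (105%)
  memory             3200Mi (45%)  4096Mi (58%)'

# Extract the CPU request percentage and flag >80% allocation.
echo "$sample" | awk '$1 == "cpu" {
  gsub(/[()%]/, "", $3)                 # "(91%)" -> "91"
  if ($3 + 0 > 80) print "cpu requests at " $3 "% - consider scaling"
}'
```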
If every node is over 80% allocated, scale the node group:
```shell
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name standard-workers \
  --scaling-config minSize=2,maxSize=15,desiredSize=6
```
If you’re using Karpenter, make sure the NodePool has CPU and memory limits that allow growth:
```shell
kubectl get nodepool default -o yaml | grep -A 10 "limits:"
```
### Step 5: Fix CrashLoopBackOff
Pull the previous container’s logs (the current one might already be restarted):
```shell
kubectl logs my-app-6d4f8c9b7-xk2mn -n production --previous
```
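The container's last exit code also narrows things down fast; it's visible in `kubectl describe pod` under `Last State`. A hypothetical helper mapping the common values (these follow the standard 128 + signal convention):

```shell
# Map common container exit codes to likely causes.
explain_exit() {
  case "$1" in
    0)   echo "clean exit - check why the process finished (wrong command?)" ;;
    1)   echo "application error - read the logs above" ;;
    137) echo "SIGKILL (128+9) - usually OOMKilled; check memory limits" ;;
    139) echo "SIGSEGV (128+11) - the process crashed" ;;
    143) echo "SIGTERM (128+15) - graceful kill, often a failing liveness probe" ;;
    *)   echo "exit code $1 - application specific" ;;
  esac
}

explain_exit 137   # -> SIGKILL (128+9) - usually OOMKilled; check memory limits
```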
If the application itself is fine and the issue is probe timing, relax the liveness probe's `initialDelaySeconds`:
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 3
```
Apply the change:
```shell
kubectl apply -f deployment.yaml
```
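For applications with genuinely slow or variable startup, a `startupProbe` is often cleaner than a long `initialDelaySeconds`: it gates the liveness probe until the app first responds, then gets out of the way. A sketch with hypothetical thresholds (here allowing up to 5 minutes of startup, 30 × 10s):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # up to 300s of startup before the pod is killed
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```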
### Step 6: Verify Cluster Add-ons
A broken CoreDNS or VPC CNI causes ripple failures. Check the add-on status:
```shell
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --query "addon.{Status:status,Version:addonVersion,Issues:health.issues}" \
  --output json
```
If there are issues, update to the latest supported version:
```shell
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --resolve-conflicts OVERWRITE
```
## Is This Safe?
Mostly yes. `kubectl describe` and `kubectl get` are read-only. Attaching IAM policies to the node role is additive. Enabling prefix delegation triggers a rolling restart of the `aws-node` DaemonSet, which briefly delays IP assignment for new pods but doesn't affect running ones; existing nodes keep their original max-pods value until they're replaced. Node group scaling is non-disruptive. The one change to make carefully is modifying liveness probes on production deployments: if the new values are wrong, you'll trigger a rolling restart that could cause downtime for single-replica workloads.
## Key Takeaway
EKS pod failures almost always trace back to one of four root causes: IP exhaustion, IAM, node capacity, or probe misconfiguration. The VPC CNI's default behavior of assigning one VPC IP per pod is the biggest silent trap: it works fine in demos and fails spectacularly in production at scale. If you're running anything beyond a toy cluster, enable prefix delegation on day one and size your worker subnets for at least 4x your peak pod count. Always check `kubectl describe pod` before assuming the issue is with the application.
Have questions or ran into a different EKS issue? Connect with me on LinkedIn or X.