I was debugging a production deployment where ECS tasks kept cycling — starting, running for a few seconds, then dying. CloudWatch showed nothing useful because the container never lived long enough to emit logs. The ECS console just said STOPPED with a cryptic exit code.
If you’ve been stuck in that loop of watching tasks fail with no obvious explanation, here’s how to systematically diagnose and fix it.
The Problem
ECS tasks fail to start or crash immediately after launch. The stopped tasks show one of these reasons:
| Error | What It Means |
|---|---|
| `CannotPullContainerError: ref pull has been retried` | ECS cannot pull the container image from ECR or Docker Hub |
| `Essential container in task exited (exit code: 137)` | Container was killed by the OOM killer because it exceeded its memory limit |
| `Essential container in task exited (exit code: 1)` | Application crashed on startup |
| `Placement constraints not satisfied` | No EC2 instance matches the task’s CPU/memory requirements |
| `ResourceNotFoundException: Unable to assume role` | Task execution role is missing or misconfigured |
Tasks may show as PENDING for minutes before timing out, or flip between RUNNING and STOPPED in rapid succession.
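Container exit codes above 128 follow the Unix convention of 128 plus the number of the signal that killed the process, which is worth decoding before guessing at causes. A small illustrative helper (not part of any AWS tooling):

```bash
#!/usr/bin/env bash
# Decode a container exit code: codes above 128 mean the process was
# killed by signal (code - 128), e.g. 137 = 128 + 9 (SIGKILL, usually OOM).
decode_exit() {
  local code=$1
  if (( code > 128 )); then
    echo "killed by signal $(( code - 128 )) ($(kill -l $(( code - 128 )) 2>/dev/null))"
  elif (( code == 0 )); then
    echo "clean exit"
  else
    echo "application error (exit $code)"
  fi
}

decode_exit 137   # killed by signal 9 (KILL)
decode_exit 1     # application error (exit 1)
```

Seeing 139 (SIGSEGV) or 143 (SIGTERM) instead of 137 points at a crash or a graceful stop rather than an OOM kill.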
Why Does This Happen?
- ECR permissions or VPC endpoint misconfiguration: Fargate tasks in private subnets need either a NAT gateway or VPC endpoints for ECR (`com.amazonaws.region.ecr.dkr` and `com.amazonaws.region.ecr.api`) and S3 (for layer downloads). Missing any one of these causes silent pull failures.
- Memory limits set too low: If the task definition’s memory hard limit is lower than the application’s actual peak usage, the Linux OOM killer terminates the container instantly. This is the number one cause of exit code 137.
- Missing or wrong task execution role: The execution role (not the task role) is what ECS uses to pull images and send logs. If it doesn’t have `ecr:GetAuthorizationToken` and `ecr:BatchGetImage` permissions, pulls fail.
- Image tag doesn’t exist: Deploying with a tag like `latest` that was overwritten, or a commit SHA that was never pushed, results in a pull error that looks like a networking issue.
- Container health check failing too fast: If the health check `startPeriod` is too short, ECS marks the container as unhealthy before the application has finished booting, then kills and replaces it.
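For the first cause, these are the endpoint service names a Fargate task in a private subnet needs for ECR pulls. The loop below only constructs the names for a given region (the region value is an example) so you can compare them against what actually exists in your VPC:

```bash
# Build the VPC endpoint service names Fargate needs for ECR image pulls.
# "us-east-1" is an example; substitute your own region.
region="us-east-1"
for ep in "com.amazonaws.${region}.ecr.api" \
          "com.amazonaws.${region}.ecr.dkr" \
          "com.amazonaws.${region}.s3"; do
  echo "$ep"
done

# Compare against the endpoints actually present in the VPC:
#   aws ec2 describe-vpc-endpoints --query "VpcEndpoints[*].ServiceName"
```

If any of the three are missing and there is no NAT gateway, pulls will fail with no useful log output.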
The Fix
Step 1: Get the Stopped Task Details
First, find the actual stop reason. The console truncates this, so use the CLI:
```bash
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks arn:aws:ecs:us-east-1:123456789012:task/my-cluster/abc123 \
  --query "tasks[0].{StopCode:stopCode,StopReason:stoppedReason,Containers:containers[*].{Name:name,ExitCode:exitCode,Reason:reason}}" \
  --output json
```
This gives you the actual exit code and stop reason per container.
Step 2: Fix CannotPullContainerError
If the error is an image pull failure, verify the image exists:
```bash
aws ecr describe-images \
  --repository-name my-app \
  --image-ids imageTag=v1.2.3 \
  --query "imageDetails[0].{Digest:imageDigest,PushedAt:imagePushedAt,Size:imageSizeInBytes}" \
  --output table
```
If the image exists, the problem is networking or IAM. Check the task execution role:
```bash
aws iam list-attached-role-policies \
  --role-name ecsTaskExecutionRole \
  --output table
```
You need AmazonECSTaskExecutionRolePolicy attached. If it’s missing:
```bash
aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
```
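If you prefer a narrowly scoped inline policy over the managed one, it needs at least the following actions (a sketch mirroring what AmazonECSTaskExecutionRolePolicy grants; tighten `Resource` to your repository and log group ARNs as appropriate):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
```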
Step 3: Fix Exit Code 137 (OOM Kill)
Exit code 137 means the container exceeded its memory allocation. Check the current limits in your task definition:
```bash
# Omit the revision number to get the latest ACTIVE revision
aws ecs describe-task-definition \
  --task-definition my-app \
  --query "taskDefinition.containerDefinitions[*].{Name:name,MemoryHard:memory,MemorySoft:memoryReservation,CPU:cpu}" \
  --output table
```
Compare this against the actual peak memory usage from CloudWatch:
```bash
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name MemoryUtilization \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --start-time 2026-04-01T00:00:00Z \
  --end-time 2026-04-02T00:00:00Z \
  --period 300 \
  --statistics Maximum \
  --output table
```
If the maximum utilization is consistently above 90%, increase the memory limit in your task definition. A good rule of thumb is to set the hard limit at 1.5x your application’s steady-state usage.
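As a sanity check on that rule of thumb, here is the arithmetic for a hypothetical service with a 600 MiB steady state:

```bash
# Apply the 1.5x rule of thumb: hard limit = 1.5 * steady-state usage.
# steady_mib is a hypothetical measurement taken from CloudWatch.
steady_mib=600
hard_limit=$(( steady_mib * 3 / 2 ))
echo "set the memory hard limit to at least ${hard_limit} MiB"   # 900 MiB
# On Fargate, round up to the nearest supported task memory size for
# your chosen CPU (e.g. 1024 MiB).
```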
Step 4: Fix Task Execution Role Issues
If the stopped reason mentions Unable to assume role, verify the trust policy on the execution role:
```bash
aws iam get-role \
  --role-name ecsTaskExecutionRole \
  --query "Role.AssumeRolePolicyDocument" \
  --output json
```
The trust policy must include:
```json
{
  "Effect": "Allow",
  "Principal": {
    "Service": "ecs-tasks.amazonaws.com"
  },
  "Action": "sts:AssumeRole"
}
```
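If the trust policy is wrong or missing, you can rewrite it in place. The snippet below writes the policy document to a file and shows, commented out, the call that applies it (assuming the role name used above):

```bash
# Write the trust policy that lets ECS assume the execution role.
cat > /tmp/ecs-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Apply it once the file looks right:
#   aws iam update-assume-role-policy \
#     --role-name ecsTaskExecutionRole \
#     --policy-document file:///tmp/ecs-trust-policy.json
```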
Step 5: Fix Health Check Failures
If tasks start but get killed after 30-60 seconds, the health check is likely failing. Review the health check configuration:
```bash
aws ecs describe-task-definition \
  --task-definition my-app \
  --query "taskDefinition.containerDefinitions[0].healthCheck" \
  --output json
```
Increase the startPeriod to give your application time to boot:
```json
{
  "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 120
}
```
Register a new task definition revision with the updated health check and update the service.
Step 6: Verify the Fix
After making changes, force a new deployment:
```bash
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --force-new-deployment
```
Then watch the task status:
```bash
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query "services[0].{RunningCount:runningCount,DesiredCount:desiredCount,Deployments:deployments[*].{Status:status,Running:runningCount,Desired:desiredCount}}" \
  --output json
```
Running count should match desired count within a few minutes. If you’d rather not poll, `aws ecs wait services-stable --cluster my-cluster --services my-service` blocks until the deployment settles.
Is This Safe?
Yes. The diagnostic commands are read-only. Updating the task definition creates a new revision without affecting running tasks. The `--force-new-deployment` flag triggers a rolling update using your service’s deployment configuration, so there’s no downtime if you have multiple tasks.
Key Takeaway
ECS task failures almost always fall into three buckets: networking (can’t pull the image), resources (not enough memory), or IAM (wrong role). The trick is to run `describe-tasks` on the stopped task ARN to get the real error message, since the ECS console truncates the most important details. Always check the execution role separately from the task role; they serve completely different purposes.
Have questions or ran into a different ECS issue? Connect with me on LinkedIn or X.