I was debugging a production deployment where ECS tasks kept cycling — starting, running for a few seconds, then dying. CloudWatch showed nothing useful because the container never lived long enough to emit logs. The ECS console just said STOPPED with a cryptic exit code.

If you’ve been stuck in that loop of watching tasks fail with no obvious explanation, here’s how to systematically diagnose and fix it.

The Problem

ECS tasks fail to start or crash immediately after launch. The stopped tasks show one of these reasons:

  • “CannotPullContainerError: ref pull has been retried”: ECS cannot pull the container image from ECR or Docker Hub.
  • “Essential container in task exited (exit code: 137)”: the container was killed by SIGKILL, most often the OOM killer after it exceeded its memory limit.
  • “Essential container in task exited (exit code: 1)”: the application crashed on startup.
  • “Placement constraints not satisfied”: no EC2 instance matches the task’s CPU/memory requirements.
  • “ResourceNotFoundException: Unable to assume role”: the task execution role is missing or misconfigured.

Tasks may sit in PENDING for minutes before timing out, or the service may cycle through replacement tasks that go from RUNNING to STOPPED within seconds of launching.

Why Does This Happen?

  • ECR permissions or VPC endpoint misconfiguration: Fargate tasks in private subnets need either a NAT gateway or VPC endpoints for ECR (com.amazonaws.region.ecr.dkr and com.amazonaws.region.ecr.api) and S3 (for layer downloads). Missing any one of these produces pull timeouts that surface as CannotPullContainerError.
  • Memory limits set too low: If the task definition’s memory hard limit is lower than the application’s actual peak usage, the Linux OOM killer terminates the container instantly. This is the number one cause of exit code 137.
  • Missing or wrong task execution role: The execution role (not the task role) is what ECS uses to pull images and send logs. If it doesn’t have ecr:GetAuthorizationToken and ecr:BatchGetImage permissions, pulls fail.
  • Image tag doesn’t exist: Deploying with a tag like latest that was overwritten or a commit SHA that was never pushed results in a pull error that looks like a networking issue.
  • Container health check failing too fast: If the health check startPeriod is too short, ECS marks the container as unhealthy before the application has finished booting, then kills and replaces it.
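For the private-subnet case, you can quickly check which endpoints the task’s VPC actually has. Here `vpc-0abc1234` is a placeholder for your VPC ID:

```shell
# List the VPC's endpoints; for Fargate image pulls you want to see
# ecr.api, ecr.dkr, and an S3 gateway endpoint (or a NAT gateway route instead)
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-0abc1234 \
  --query "VpcEndpoints[*].{Service:ServiceName,Type:VpcEndpointType,State:State}" \
  --output table
```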

The Fix

Step 1: Get the Stopped Task Details

First, find the actual stop reason. The console truncates this, so use the CLI:

aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks arn:aws:ecs:us-east-1:123456789012:task/my-cluster/abc123 \
  --query "tasks[0].{StopCode:stopCode,StopReason:stoppedReason,Containers:containers[*].{Name:name,ExitCode:exitCode,Reason:reason}}" \
  --output json

This gives you the actual exit code and stop reason per container.
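If you don’t have a task ARN handy, you can list recently stopped tasks first. Note that ECS only keeps stopped tasks visible for about an hour, so run this soon after the failure:

```shell
# Returns the ARNs of recently stopped tasks in the cluster
aws ecs list-tasks \
  --cluster my-cluster \
  --desired-status STOPPED \
  --output text
```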

Step 2: Fix CannotPullContainerError

If the error is an image pull failure, verify the image exists:

aws ecr describe-images \
  --repository-name my-app \
  --image-ids imageTag=v1.2.3 \
  --query "imageDetails[0].{Digest:imageDigest,PushedAt:imagePushedAt,Size:imageSizeInBytes}" \
  --output table
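If that returns an ImageNotFoundException, it helps to see which tags actually exist. A quick way to list the five most recently pushed images in the repository:

```shell
# sort_by orders by push time; the [-5:] slice keeps the newest five
aws ecr describe-images \
  --repository-name my-app \
  --query "sort_by(imageDetails,&imagePushedAt)[-5:].{Tags:imageTags,PushedAt:imagePushedAt}" \
  --output table
```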

If the image exists, the problem is networking or IAM. Check the task execution role:

aws iam list-attached-role-policies \
  --role-name ecsTaskExecutionRole \
  --output table

You need AmazonECSTaskExecutionRolePolicy attached. If it’s missing:

aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
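It’s also worth confirming the task definition actually references that role; a common slip is setting the task role but leaving executionRoleArn empty. Assuming the family is my-app:

```shell
# Empty output here means no execution role is set on the task definition
aws ecs describe-task-definition \
  --task-definition my-app \
  --query "taskDefinition.executionRoleArn" \
  --output text
```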

Step 3: Fix Exit Code 137 (OOM Kill)

Exit code 137 means the container exceeded its memory allocation. Check the current limits in your task definition:

aws ecs describe-task-definition \
  --task-definition my-app \
  --query "taskDefinition.containerDefinitions[*].{Name:name,MemoryHard:memory,MemorySoft:memoryReservation,CPU:cpu}" \
  --output table
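As an aside, the 137 itself is informative: exit codes above 128 encode a fatal signal as 128 plus the signal number, so 137 corresponds to SIGKILL (9), exactly what the OOM killer delivers:

```shell
# Exit codes above 128 mean the process died from a signal: code = 128 + signal
exit_code=137
signal=$((exit_code - 128))
echo "Killed by signal $signal"   # 9 is SIGKILL, which the OOM killer sends
```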

Compare this against the actual peak memory usage from CloudWatch:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name MemoryUtilization \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --start-time 2026-04-01T00:00:00Z \
  --end-time 2026-04-02T00:00:00Z \
  --period 300 \
  --statistics Maximum \
  --output table

If the maximum utilization is consistently above 90%, increase the memory limit in your task definition. A good rule of thumb is to set the hard limit at 1.5x your application’s steady-state usage.
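One way to bump the limit without hand-editing JSON in the console is a sketch like this, assuming jq is installed and my-app is your task family. The del() strips the read-only fields that register-task-definition rejects:

```shell
# Fetch the current definition, raise the hard memory limit, register a new revision
aws ecs describe-task-definition --task-definition my-app \
  --query taskDefinition --output json \
| jq '.containerDefinitions[0].memory = 1024
      | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
            .compatibilities, .registeredAt, .registeredBy)' \
> taskdef.json
aws ecs register-task-definition --cli-input-json file://taskdef.json
```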

Step 4: Fix Task Execution Role Issues

If the stopped reason mentions Unable to assume role, verify the trust policy on the execution role:

aws iam get-role \
  --role-name ecsTaskExecutionRole \
  --query "Role.AssumeRolePolicyDocument" \
  --output json

The trust policy must include:

{
  "Effect": "Allow",
  "Principal": {
    "Service": "ecs-tasks.amazonaws.com"
  },
  "Action": "sts:AssumeRole"
}
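If the ecs-tasks.amazonaws.com principal is missing, you can replace the trust policy in place. Note this overwrites the existing document, so fold in any other principals the role needs:

```shell
# Write a minimal trust policy and apply it to the execution role
cat > trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
aws iam update-assume-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-document file://trust.json
```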

Step 5: Fix Health Check Failures

If tasks start but get killed after 30-60 seconds, the health check is likely failing. Review the health check configuration:

aws ecs describe-task-definition \
  --task-definition my-app \
  --query "taskDefinition.containerDefinitions[0].healthCheck" \
  --output json

Increase the startPeriod to give your application time to boot (and make sure curl actually exists in the image, since the CMD-SHELL command runs inside the container):

{
  "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 120
}

Register a new task definition revision with the updated health check and update the service.
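One way to script that, again assuming jq and a my-app family (del() removes the read-only fields the register call rejects):

```shell
# Raise startPeriod on the first container, register a revision, roll the service
aws ecs describe-task-definition --task-definition my-app \
  --query taskDefinition --output json \
| jq '.containerDefinitions[0].healthCheck.startPeriod = 120
      | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
            .compatibilities, .registeredAt, .registeredBy)' \
> taskdef.json
aws ecs register-task-definition --cli-input-json file://taskdef.json
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-app
```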

Step 6: Verify the Fix

After making changes, force a new deployment:

aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --force-new-deployment

Then watch the task status:

aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query "services[0].{RunningCount:runningCount,DesiredCount:desiredCount,Deployments:deployments[*].{Status:status,Running:runningCount,Desired:desiredCount}}" \
  --output json

Running count should match desired count within a few minutes.
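Rather than polling describe-services by hand, you can block until the deployment settles. The waiter gives up after roughly ten minutes, so a non-zero exit means the service never stabilized:

```shell
# Returns once runningCount matches desiredCount and deployments are steady
aws ecs wait services-stable \
  --cluster my-cluster \
  --services my-service
```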

Is This Safe?

Yes. The diagnostic commands are read-only. Updating the task definition creates a new revision without affecting running tasks. The --force-new-deployment flag triggers a rolling update using your service’s deployment configuration, so there’s no downtime if you have multiple tasks.

Key Takeaway

ECS task failures almost always fall into three buckets: networking (can’t pull the image), resources (not enough memory), or IAM (wrong role). The trick is to use describe-tasks on the stopped task ARN to get the real error message — the ECS console truncates the most important details. Always check the execution role separately from the task role, since they serve completely different purposes.


Have questions or ran into a different ECS issue? Connect with me on LinkedIn or X.