Last month I pushed a new container image to ECR, updated the task definition, and deployed. The service sat at zero running tasks for fifteen minutes while the desired count screamed two. No logs in CloudWatch, no helpful error in the console — just tasks cycling from PENDING to STOPPED in a loop. I’ve since debugged this pattern dozens of times across different teams, and the root cause is never the same twice. Here’s the full breakdown.

The Problem

ECS Fargate tasks fail to reach RUNNING state. They either stay stuck in PENDING, or they briefly start and immediately stop. You see one or more of these errors in the stopped task details:

  • CannotPullContainerError: Fargate cannot pull the container image from ECR or another registry.
  • ResourceInitializationError: the task failed to set up networking or pull secrets before the container started.
  • Essential container in task exited: a container marked essential crashed or exited; when any essential container stops, ECS stops the whole task, regardless of exit code.
  • Task stuck in PENDING: the task never transitions to RUNNING and eventually times out.
  • Health check failures: the container starts, but the ALB or container health check marks it unhealthy, causing ECS to kill and replace it.

This creates a deploy that looks successful from the pipeline side but never actually serves traffic.

Why Does This Happen?

  • Image pull failures (CannotPullContainerError) — The task execution role does not have ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, or ecr:BatchGetImage permissions. Alternatively, the image tag doesn’t exist, the ECR repository policy blocks the account, or the task is in a private subnet with no NAT gateway and no VPC endpoint for ECR.

  • Missing VPC endpoints or NAT gateway (ResourceInitializationError) — Fargate tasks in private subnets need a route to ECR, CloudWatch Logs, Secrets Manager, and SSM Parameter Store. Without a NAT gateway or the corresponding VPC endpoints (com.amazonaws.region.ecr.dkr, com.amazonaws.region.ecr.api, com.amazonaws.region.logs, com.amazonaws.region.secretsmanager, com.amazonaws.region.ssm), the task cannot initialize and dies with a ResourceInitializationError.

  • Secrets or parameters cannot be resolved — If your task definition references a Secrets Manager secret or SSM parameter in the secrets block, and the task execution role lacks secretsmanager:GetSecretValue or ssm:GetParameters, the task fails before the container even starts. The error message often just says “ResourceInitializationError” with no further detail.

  • Security group blocking outbound traffic — The security group attached to the ECS service must allow outbound HTTPS (port 443) to reach ECR, CloudWatch, and any other AWS service endpoint. A restrictive security group with no egress rules will silently block image pulls.

  • ENI limit reached in the subnet — Every Fargate task with awsvpc networking requires its own ENI and IP address. If your subnet is small (e.g., a /26 has 64 addresses, of which AWS reserves 5) and heavily used, you can run out of available IPs. The task stays PENDING because AWS cannot attach a network interface.

  • Container crashes on startup (exit code 1, 137, 139) — The image pulled successfully but the container process fails. Exit code 1 means a generic application error (bad config, missing env var). Exit code 137 (128 + SIGKILL) usually means the kernel OOM killer terminated the container for exceeding the memory limit in the task definition. Exit code 139 (128 + SIGSEGV) means a segfault in the application binary.

  • Health check failures causing a restart loop — The ALB target group health check or the container-level healthCheck in the task definition is too aggressive. The container starts, is not ready in time, gets marked unhealthy, ECS kills it, and restarts it. This loop runs indefinitely.
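
Exit codes above 128 follow the Unix 128 + signal convention, so the crash cases can be triaged locally before touching AWS. A minimal bash helper sketching the mappings described above (the function name is made up for illustration):

```shell
#!/usr/bin/env bash
# Quick local triage for ECS container exit codes (no AWS access needed).
# Codes above 128 follow the Unix convention: 128 + signal number.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit - the process finished on its own" ;;
    1)   echo "application error - bad config or missing env var" ;;
    137) echo "SIGKILL (128+9) - usually the OOM killer; raise task memory" ;;
    139) echo "SIGSEGV (128+11) - segfault in the application binary" ;;
    *)   echo "unmapped code $1 - check CloudWatch Logs" ;;
  esac
}

explain_exit_code 137
```

Feed it the exitCode values that describe-tasks returns to get a first hypothesis before digging into logs.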

The Fix

Start by identifying which error you’re dealing with. The stopped task reason tells you almost everything.

Step 1: Get the stopped task reason

# List recently stopped tasks for the service
aws ecs list-tasks \
  --cluster my-cluster \
  --service-name my-service \
  --desired-status STOPPED \
  --region us-east-1 \
  --query 'taskArns[0:3]' \
  --output text
# Describe the stopped task to see the reason and exit codes
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks arn:aws:ecs:us-east-1:123456789012:task/my-cluster/abc123def456 \
  --region us-east-1 \
  --query 'tasks[0].{stopCode:stopCode,stoppedReason:stoppedReason,containers:containers[*].{name:name,exitCode:exitCode,reason:reason}}' \
  --output json

This gives you the stoppedReason field. Match it against the errors listed above and follow the corresponding fix below.

Step 2: Fix CannotPullContainerError

Verify the image URI in your task definition actually exists:

# Check the image exists in ECR
aws ecr describe-images \
  --repository-name my-app \
  --image-ids imageTag=latest \
  --region us-east-1

If the image exists, the problem is permissions or networking. Check the task execution role:

# Get the task execution role from the task definition
aws ecs describe-task-definition \
  --task-definition my-app:12 \
  --region us-east-1 \
  --query 'taskDefinition.executionRoleArn' \
  --output text
# List the policies attached to that role
aws iam list-attached-role-policies \
  --role-name ecsTaskExecutionRole \
  --output table

The role must have the AWS-managed AmazonECSTaskExecutionRolePolicy attached, or a custom policy granting ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, ecr:BatchGetImage, logs:CreateLogStream, and logs:PutLogEvents.
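
If you manage a custom policy instead of the AWS-managed one, a minimal version looks roughly like this (the account ID and log group are placeholders; ecr:GetAuthorizationToken genuinely requires Resource "*", but the pull actions can be scoped to your repository ARN):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PullFromEcr",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    },
    {
      "Sid": "WriteLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/ecs/my-app:*"
    }
  ]
}
```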

Step 3: Fix ResourceInitializationError (networking)

Check whether the task’s subnet has a route to the internet or VPC endpoints:

# Find the subnets used by the ECS service
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --region us-east-1 \
  --query 'services[0].networkConfiguration.awsvpcConfiguration.subnets' \
  --output text
# Check the route table for that subnet
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-0abc123def456" \
  --region us-east-1 \
  --query 'RouteTables[0].Routes' \
  --output table

You need either a 0.0.0.0/0 -> nat-* route (NAT gateway) or VPC endpoints. To check existing endpoints:

# List VPC endpoints in the VPC
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-0abc123def" \
  --region us-east-1 \
  --query 'VpcEndpoints[*].{Service:ServiceName,State:State}' \
  --output table

If endpoints are missing, create them for com.amazonaws.us-east-1.ecr.dkr, com.amazonaws.us-east-1.ecr.api, and com.amazonaws.us-east-1.logs. For ECR you also need an S3 gateway endpoint since ECR stores image layers in S3:

# Create the S3 gateway endpoint (required for ECR image pulls)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123def \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc123def \
  --region us-east-1

Step 4: Fix secrets resolution failures

If the stopped reason mentions “unable to pull secrets or registry auth,” verify the execution role has permission:

# Confirm the secret ARN is valid and the secret exists (this runs with your
# credentials; to test the execution role itself, assume it first via sts assume-role)
aws secretsmanager get-secret-value \
  --secret-id arn:aws:secretsmanager:us-east-1:123456789012:secret:my-app/db-password-AbCdEf \
  --region us-east-1 \
  --query 'Name' \
  --output text

If you get AccessDeniedException, add secretsmanager:GetSecretValue to the task execution role. For SSM parameters, add ssm:GetParameters. If the secret is encrypted with a customer-managed KMS key, the role also needs kms:Decrypt on that key.
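
A single execution-role statement covering all three cases might look like this (the ARNs are placeholders for your own secret, parameter path, and KMS key):

```json
{
  "Effect": "Allow",
  "Action": [
    "secretsmanager:GetSecretValue",
    "ssm:GetParameters",
    "kms:Decrypt"
  ],
  "Resource": [
    "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-app/*",
    "arn:aws:ssm:us-east-1:123456789012:parameter/my-app/*",
    "arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555"
  ]
}
```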

Step 5: Fix container exit codes

If the container pulled and started but exited immediately, check CloudWatch Logs:

# Get the log group from the task definition
aws ecs describe-task-definition \
  --task-definition my-app:12 \
  --region us-east-1 \
  --query 'taskDefinition.containerDefinitions[0].logConfiguration.options' \
  --output json
# Tail the most recent log stream
aws logs get-log-events \
  --log-group-name /ecs/my-app \
  --log-stream-name ecs/my-app/abc123def456 \
  --limit 50 \
  --region us-east-1 \
  --query 'events[*].message' \
  --output text

For exit code 137 (OOM killed), increase the memory in the task definition. Check the actual memory usage before the kill:

# Check memory utilization for the service
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name MemoryUtilization \
  --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service \
  --start-time 2026-03-11T00:00:00Z \
  --end-time 2026-03-11T12:00:00Z \
  --period 300 \
  --statistics Maximum \
  --region us-east-1

If utilization was hitting 100% before the kill, increase the task memory. For exit code 1, read the application logs — it’s almost always a missing environment variable, bad config file, or a dependency that isn’t reachable from the VPC.
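
Fargate only accepts specific CPU/memory pairings, so raise memory to the next supported tier rather than an arbitrary number. These are the task-level cpu and memory fields in the task definition (the values below are one valid combination, 0.5 vCPU with 2 GB; pick a pairing that fits your workload):

```json
{
  "cpu": "512",
  "memory": "2048"
}
```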

Step 6: Fix health check restart loops

If tasks start but keep getting replaced, check the target group health check settings:

# Find the target group ARN from the service
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --region us-east-1 \
  --query 'services[0].loadBalancers[0].targetGroupArn' \
  --output text
# Check the health check configuration
aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123 \
  --region us-east-1 \
  --query 'TargetGroups[0].{Path:HealthCheckPath,Interval:HealthCheckIntervalSeconds,Timeout:HealthCheckTimeoutSeconds,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount}' \
  --output table

If your app takes 30 seconds to start but the health check declares it unhealthy after 15, increase the health check interval or add a grace period to the ECS service:

# Update the service with a health check grace period (in seconds)
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --health-check-grace-period-seconds 120 \
  --region us-east-1

This tells ECS to wait 120 seconds before starting to evaluate health check results for a new task.
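
If you rely on a container-level health check instead of (or alongside) the ALB one, the healthCheck block in the task definition has its own startPeriod serving the same purpose. A sketch, assuming the app exposes a /healthz endpoint on port 8080 and the image ships curl:

```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/healthz || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
  }
}
```

Failures during startPeriod don't count toward the retry limit, so the container gets 60 seconds to warm up before ECS starts judging it.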

How to Run This

  1. Open the ECS console and go to your cluster. Click the service, then the Tasks tab. Filter by “Stopped” to see failed tasks and note the stoppedReason.
  2. Run the describe-tasks command to get the full error and any container exit codes.
  3. If the error is CannotPullContainerError, verify the image URI, the task execution role policies, and network routing from the task subnet.
  4. If the error is ResourceInitializationError, check VPC endpoints or NAT gateway configuration for the task’s subnet.
  5. If the error mentions secrets, verify the task execution role has secretsmanager:GetSecretValue or ssm:GetParameters and KMS decrypt permissions.
  6. If the container exited with code 137, increase the task memory. If code 1, read the CloudWatch Logs for the application error.
  7. If tasks keep cycling with health check failures, add a healthCheckGracePeriodSeconds to the service and verify the health check path returns 200.

Is This Safe?

All describe-* and list-* commands are read-only. Updating the health check grace period is non-disruptive and takes effect on the next task deployment. Creating VPC endpoints adds new resources but does not affect existing traffic. Changing the task definition memory or execution role requires a new deployment, which ECS handles as a rolling update — existing tasks keep running until new ones pass health checks.

Key Takeaway

ECS Fargate task failures almost always come down to one of three things: the task execution role is missing permissions, the network path from the task subnet to ECR and other services is broken, or the container itself is crashing. Start with describe-tasks to read the stoppedReason, then follow the trail. Once you’ve fixed it once, add the missing VPC endpoints or role permissions to your infrastructure code so the next deploy doesn’t hit the same wall.


Have questions or ran into a different issue? Connect with me on LinkedIn or X.