If you’ve set up an Auto Scaling Group with proper scaling policies, waited for load to trigger a scale-out event, and watched nothing happen, you’re in the frustrating position of not knowing why your ASG isn’t scaling. I’ve debugged dozens of silent scaling failures, and the causes are almost always the same. In this post, I’ll show you how to diagnose and fix them.
The Problem
Your Auto Scaling Group receives a scale-out request (from a policy or manual request), but no new instances launch. The ASG Activity History shows a failed launch event with a cryptic error. Existing instances get overloaded, latency spikes, and your infrastructure doesn’t scale. Here’s what you see:
| Error Type | Description |
|---|---|
| Launch Failed | ASG Activity: “Failed to launch instance. Error: …” |
| Instances Launched Then Terminated | ASG creates instance, then immediately terminates it |
| No Scale-Out | ASG desired > current, but no launch attempts |
| Capacity Not Increasing | ASG reports scaling activity, but running instance count unchanged |
The ASG appears to be “working” (it exists, policies are active), but it silently fails to launch.
Why Does This Happen?
- Launch Template references an AMI that no longer exists — You created the Launch Template with ami-old123def, then deleted that AMI. The ASG tries to launch from a non-existent image.
- Instance profile IAM role doesn’t have required permissions — New instances can’t access S3, CloudWatch, Secrets Manager, or other services the application needs. The instance fails its health check and is terminated.
- Security group ID references the wrong VPC — You specified a security group from vpc-prod, but the ASG subnets are in vpc-staging. A security group in a different VPC can’t be applied; the launch fails.
- Subnet has exhausted its available IPs — The subnet has a /28 CIDR (only 11 usable IPs, since AWS reserves 5 addresses per subnet) and all are in use. The ASG tries to launch, but no IP address is available.
- Health check grace period is too short — The instance launches but takes 90 seconds to run its startup scripts and report healthy. If the grace period is 30 seconds, the instance is marked unhealthy and terminated before it finishes booting.
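The subnet math is easy to verify: AWS reserves 5 addresses in every subnet (network address, VPC router, DNS, one reserved for future use, and broadcast), so a /28 yields 16 − 5 = 11 usable IPs. A quick sketch of the calculation:

```shell
#!/bin/sh
# Usable IPs in an AWS subnet: 2^(32 - prefix) minus the 5 addresses
# AWS reserves in every subnet (network, router, DNS, future use, broadcast).
usable_ips() {
  prefix=$1
  echo $(( (1 << (32 - prefix)) - 5 ))
}

usable_ips 28   # 11
usable_ips 24   # 251
```

A /28 supports at most 11 instances (fewer if ENIs, load balancer nodes, or endpoints also live in the subnet), which is why small subnets get exhausted quickly.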
The Fix
The diagnostic process is to read the scaling activity history and look at the exact error message.
Step 1: Check the Activity History
# Describe scaling activities for the ASG
aws autoscaling describe-scaling-activities \
--auto-scaling-group-name my-app-asg \
--region us-east-1 \
--max-records 10 \
--output table
Look at the StatusMessage column. It will say something like:
- “We currently do not have sufficient capacity.” → Capacity issue (try different AZ or instance type)
- “Launch template does not reference an active image.” → AMI deleted
- “Error attaching security group: …” → Security group in wrong VPC
- “Error: InsufficientAddressCapacity” → Subnet out of IPs
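This triage can be scripted. A minimal sketch that maps a StatusMessage to a likely root cause — the matched substrings mirror the examples above and are illustrative, so adjust them to whatever your Activity History actually shows:

```shell
#!/bin/sh
# Map an ASG activity StatusMessage to a likely root cause.
# Feed it the StatusMessage text from describe-scaling-activities.
diagnose() {
  case "$1" in
    *"sufficient capacity"*)        echo "capacity: try another AZ or instance type" ;;
    *"active image"*)               echo "AMI deleted: update the Launch Template" ;;
    *"security group"*)             echo "security group in wrong VPC" ;;
    *InsufficientAddressCapacity*)  echo "subnet out of IPs" ;;
    *)                              echo "unrecognized: read the Activity History manually" ;;
  esac
}

diagnose "We currently do not have sufficient capacity."
# capacity: try another AZ or instance type
```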
Step 2: Verify the Launch Template
Check if the Launch Template references a valid AMI:
# Get the Launch Template details
aws ec2 describe-launch-template-versions \
--launch-template-name my-app-template \
--region us-east-1 \
--query 'LaunchTemplateVersions[0].LaunchTemplateData.ImageId' \
--output text
This returns the AMI ID (e.g., ami-0abc123def). Verify the AMI exists:
# Check if AMI exists
aws ec2 describe-images \
--image-ids ami-0abc123def \
--region us-east-1 \
--query 'Images[0].State' \
--output text
If the command fails with an InvalidAMIID.NotFound error (or the reported state is not available), the AMI was deregistered. Update the Launch Template to use a valid AMI:
# Find the most recent Amazon Linux 2 AMI
# (describe-images output is not sorted, so sort by CreationDate
# before taking the last element)
aws ec2 describe-images \
--owners amazon \
--filters "Name=name,Values=amzn2-ami-hvm-*-x86_64-gp2" \
--query 'sort_by(Images, &CreationDate)[-1].ImageId' \
--output text
# Output: ami-new456xyz
# Create new Launch Template version with valid AMI
aws ec2 create-launch-template-version \
--launch-template-name my-app-template \
--source-version 1 \
--launch-template-data '{"ImageId":"ami-new456xyz"}' \
--region us-east-1
# Set new version as default
aws ec2 modify-launch-template \
--launch-template-name my-app-template \
--default-version 2 \
--region us-east-1
ASG will use the new version on next launch attempt.
Step 3: Verify IAM Instance Profile
Check that the instance profile has necessary permissions:
# Get the IAM instance profile from the Launch Template
aws ec2 describe-launch-template-versions \
--launch-template-name my-app-template \
--region us-east-1 \
--query 'LaunchTemplateVersions[0].LaunchTemplateData.IamInstanceProfile.Arn' \
--output text
# Output: arn:aws:iam::123456789012:instance-profile/my-app-role
# Get the role name
aws iam get-instance-profile \
--instance-profile-name my-app-role \
--query 'InstanceProfile.Roles[0].RoleName' \
--output text
# Output: my-app-role
# List the role's attached policies
aws iam list-attached-role-policies \
--role-name my-app-role
Ensure the role has policies for the services your application needs:
- S3: AmazonS3ReadOnlyAccess or a custom policy
- CloudWatch: CloudWatchAgentServerPolicy
- Secrets Manager: custom policy with secretsmanager:GetSecretValue
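To compare what’s attached against what’s required, a small helper can do the diff. This is a sketch; the policy names are the examples above, so substitute whatever your application actually needs:

```shell
#!/bin/sh
# Print required policies that are NOT in the attached list.
# $1: space-separated attached policy names (from list-attached-role-policies)
# remaining args: required policy names
missing_policies() {
  attached=$1; shift
  for p in "$@"; do
    case " $attached " in
      *" $p "*) ;;            # attached: nothing to report
      *) echo "$p" ;;         # missing
    esac
  done
}

missing_policies "CloudWatchAgentServerPolicy" \
  AmazonS3ReadOnlyAccess CloudWatchAgentServerPolicy
# prints: AmazonS3ReadOnlyAccess
```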
If policies are missing, attach them:
aws iam attach-role-policy \
--role-name my-app-role \
--policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
Step 4: Verify Security Group is in Correct VPC
Check the ASG’s VPC and the security group’s VPC:
# Get ASG VPC and subnet info
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names my-app-asg \
--region us-east-1 \
--query 'AutoScalingGroups[0].VPCZoneIdentifier' \
--output text
# Output: subnet-1a,subnet-1b,subnet-1c
# Get the VPC for subnet-1a
aws ec2 describe-subnets \
--subnet-ids subnet-1a \
--region us-east-1 \
--query 'Subnets[0].VpcId' \
--output text
# Output: vpc-prod123
# Check the security group's VPC
aws ec2 describe-security-groups \
--group-ids sg-0abc123 \
--region us-east-1 \
--query 'SecurityGroups[0].VpcId' \
--output text
# Output: vpc-prod123
Both should show the same VPC ID. If they differ, create a new security group in the correct VPC and update the Launch Template.
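Once you have the IDs from the two commands above, the comparison itself is easy to automate in a pre-deploy check. A sketch (the VPC IDs are the hypothetical values from the example output):

```shell
#!/bin/sh
# Flag any subnet whose VPC differs from the security group's VPC.
# $1: the security group's VPC ID; remaining args: each subnet's VPC ID
vpc_check() {
  sg_vpc=$1; shift
  for v in "$@"; do
    if [ "$v" != "$sg_vpc" ]; then
      echo "MISMATCH: subnet VPC $v != security group VPC $sg_vpc"
      return 1
    fi
  done
  echo "all subnets in $sg_vpc"
}

vpc_check vpc-prod123 vpc-prod123 vpc-prod123   # all subnets in vpc-prod123
```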
Step 5: Check Subnet IP Availability
If the subnet is out of IPs, ASG can’t launch:
# Check available IPs in each subnet
aws ec2 describe-subnets \
--subnet-ids subnet-1a subnet-1b subnet-1c \
--region us-east-1 \
--query 'Subnets[*].[SubnetId,AvailableIpAddressCount,CidrBlock]' \
--output table
If AvailableIpAddressCount is 0 for all subnets, launches are blocked by IP exhaustion. Solutions:
- Add a new subnet with a larger CIDR block to the ASG
- Terminate unused resources to free IPs
- Use IPv6 (separate address space)
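This check drops easily into a deploy script: feed it the AvailableIpAddressCount values from the query above and fail fast when every subnet is exhausted. A sketch:

```shell
#!/bin/sh
# Exit non-zero if every subnet has zero free IPs.
# Args: AvailableIpAddressCount per subnet (from describe-subnets).
any_capacity() {
  for count in "$@"; do
    if [ "$count" -gt 0 ]; then
      echo "free IPs available"
      return 0
    fi
  done
  echo "all subnets exhausted"
  return 1
}

any_capacity 0 0 0    # all subnets exhausted (non-zero exit)
any_capacity 0 0 3    # free IPs available
```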
Step 6: Check Health Check Grace Period
If instances launch then terminate immediately, the health check is likely the issue:
# Get the current health check settings
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names my-app-asg \
--region us-east-1 \
--query 'AutoScalingGroups[0].[HealthCheckType,HealthCheckGracePeriod]' \
--output text
# Output: ELB 300
The grace period is 300 seconds (5 minutes). If your instance takes longer to boot and become healthy, increase it:
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name my-app-asg \
--health-check-grace-period 600 \
--region us-east-1
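Rather than guessing at a value, measure how long your instance actually takes from launch to healthy, then derive the grace period with headroom. One possible rule of thumb (the 50% margin and minute rounding are my assumptions, not an AWS recommendation):

```shell
#!/bin/sh
# Suggest a HealthCheckGracePeriod: measured boot-to-healthy time in
# seconds, plus 50% headroom, rounded up to a whole minute.
grace_period() {
  boot=$1
  padded=$(( boot + boot / 2 ))
  echo $(( (padded + 59) / 60 * 60 ))
}

grace_period 90    # 90s boot  -> 180
grace_period 400   # 400s boot -> 600
```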
How to Run This
- Check Activity History: aws autoscaling describe-scaling-activities --auto-scaling-group-name my-asg
- Read the error message carefully — it points to the root cause
- If AMI error: find a valid AMI with aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-*"
- If IAM error: attach missing policies to the instance profile role
- If security group error: verify the SG is in the same VPC as the subnets
- If IP exhaustion: check AvailableIpAddressCount in the subnets; allocate a new subnet if needed
- If health check error: increase HealthCheckGracePeriod to allow more boot time
Is This Safe?
All diagnostic commands are read-only. Updates to the Launch Template, IAM policies, and health check grace period are safe and take effect on the next scaling event. Existing instances are unaffected until the next launch.
Key Takeaway
ASG scaling failures are almost always due to invalid AMI, missing IAM permissions, security group in wrong VPC, or insufficient IP addresses. Use the Activity History StatusMessage to pinpoint the exact cause, then fix accordingly. Always increase HealthCheckGracePeriod for applications that take >30 seconds to start.
Struggling with ASG scaling failures? Connect with me on LinkedIn or X.