If you’ve set up an Auto Scaling Group with proper scaling policies, waited for load to trigger a scale-out event, and watched nothing happen, you’re in the frustrating position of not knowing why your ASG isn’t scaling. I’ve debugged dozens of silent scaling failures, and the causes are almost always the same. In this post, I’ll show you how to diagnose and fix them.
The Problem
Your Auto Scaling Group receives a scale-out request (from a policy or manual request), but no new instances launch. The ASG Activity History shows a failed launch event with a cryptic error. Existing instances get overloaded, latency spikes, and your infrastructure doesn’t scale. Here’s what you see:
| Error Type | Description |
|---|---|
| Launch Failed | ASG Activity: “Failed to launch instance. Error: …” |
| Instances Launched Then Terminated | ASG creates instance, then immediately terminates it |
| No Scale-Out | ASG desired > current, but no launch attempts |
| Capacity Not Increasing | ASG reports scaling activity, but running instance count unchanged |
The ASG appears to be “working” (it exists, policies are active), but it silently fails to launch.
Why Does This Happen?
- Launch Template references an AMI that no longer exists — You created the Launch Template with ami-old123def, then deleted that AMI. The ASG tries to launch from a non-existent image.
- Instance profile IAM role doesn’t have required permissions — New instances can’t access S3, CloudWatch, Secrets Manager, or other services the application needs. The instance fails its health check and is terminated.
- Security group ID references the wrong VPC — You specified a security group from vpc-prod, but the ASG subnets are in vpc-staging. A security group in a different VPC can’t be applied; the launch fails.
- Subnet has exhausted its available IPs — The subnet has a /28 CIDR (only 11 usable IPs, since AWS reserves 5 addresses per subnet) and all are in use. The ASG tries to launch, but no IP address is available.
- Health check grace period is too short — The instance launches but takes 90 seconds to run its startup scripts and report healthy. If the grace period is 30 seconds, the instance is marked unhealthy and terminated before it finishes booting.
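The subnet math is easy to verify: AWS reserves 5 addresses in every subnet (network address, VPC router, DNS, one reserved for future use, and broadcast), so a /28 yields 16 − 5 = 11 usable IPs. A quick sketch of the calculation:

```shell
#!/bin/sh
# Usable IPs in an AWS subnet: 2^(32 - prefix) minus the 5 addresses
# AWS reserves in every subnet (network, router, DNS, future use, broadcast).
usable_ips() {
  prefix=$1
  echo $(( (1 << (32 - prefix)) - 5 ))
}

usable_ips 28   # 11
usable_ips 24   # 251
```

A /28 supports at most 11 instances (fewer if ENIs, load balancer nodes, or endpoints also live in the subnet), which is why small subnets get exhausted quickly.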
The Fix
The diagnostic process is to read the scaling activity history and look at the exact error message.
Step 1: Check the Activity History
# Describe scaling activities for the ASG
aws autoscaling describe-scaling-activities \
--auto-scaling-group-name my-app-asg \
--region us-east-1 \
--max-records 10 \
--output table
Look at the StatusMessage column. It will say something like:
- “We currently do not have sufficient capacity.” → Capacity issue (try different AZ or instance type)
- “Launch template does not reference an active image.” → AMI deleted
- “Error attaching security group: …” → Security group in wrong VPC
- “Error: InsufficientAddressCapacity” → Subnet out of IPs
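This triage can be scripted. A minimal sketch that maps a StatusMessage to a likely root cause — the matched substrings mirror the examples above and are illustrative, so adjust them to whatever your Activity History actually shows:

```shell
#!/bin/sh
# Map an ASG activity StatusMessage to a likely root cause.
# Feed it the StatusMessage text from describe-scaling-activities.
diagnose() {
  case "$1" in
    *"sufficient capacity"*)        echo "capacity: try another AZ or instance type" ;;
    *"active image"*)               echo "AMI deleted: update the Launch Template" ;;
    *"security group"*)             echo "security group in wrong VPC" ;;
    *InsufficientAddressCapacity*)  echo "subnet out of IPs" ;;
    *)                              echo "unrecognized: read the Activity History manually" ;;
  esac
}

diagnose "We currently do not have sufficient capacity."
# capacity: try another AZ or instance type
```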
Step 2: Verify the Launch Template
Check if the Launch Template references a valid AMI:
# Get the Launch Template details
aws ec2 describe-launch-template-versions \
--launch-template-name my-app-template \
--region us-east-1 \
--query 'LaunchTemplateVersions[0].LaunchTemplateData.ImageId' \
--output text
This returns the AMI ID (e.g., ami-0abc123def). Verify the AMI exists:
# Check if AMI exists
aws ec2 describe-images \
--image-ids ami-0abc123def \
--region us-east-1 \
--query 'Images[0].State' \
--output text
If the command fails with an InvalidAMIID.NotFound error (or the reported state is not available), the AMI was deregistered. Update the Launch Template to use a valid AMI:
# Find the most recent Amazon Linux 2 AMI
# (describe-images output is not sorted, so sort by CreationDate
# before taking the last element)
aws ec2 describe-images \
--owners amazon \
--filters "Name=name,Values=amzn2-ami-hvm-*-x86_64-gp2" \
--query 'sort_by(Images, &CreationDate)[-1].ImageId' \
--output text
# Output: ami-new456xyz
# Create new Launch Template version with valid AMI
aws ec2 create-launch-template-version \
--launch-template-name my-app-template \
--source-version 1 \
--launch-template-data '{"ImageId":"ami-new456xyz"}' \
--region us-east-1
# Set new version as default
aws ec2 modify-launch-template \
--launch-template-name my-app-template \
--default-version 2 \
--region us-east-1
ASG will use the new version on next launch attempt.
Step 3: Verify IAM Instance Profile
Check that the instance profile has necessary permissions:
# Get the IAM instance profile from the Launch Template
aws ec2 describe-launch-template-versions \
--launch-template-name my-app-template \
--region us-east-1 \
--query 'LaunchTemplateVersions[0].LaunchTemplateData.IamInstanceProfile.Arn' \
--output text
# Output: arn:aws:iam::123456789012:instance-profile/my-app-role
# Get the role name
aws iam get-instance-profile \
--instance-profile-name my-app-role \
--query 'InstanceProfile.Roles[0].RoleName' \
--output text
# Output: my-app-role
# List the role's attached policies
aws iam list-attached-role-policies \
--role-name my-app-role
Ensure the role has policies for the services your application needs:
- S3: AmazonS3ReadOnlyAccess or a custom policy
- CloudWatch: CloudWatchAgentServerPolicy
- Secrets Manager: custom policy with secretsmanager:GetSecretValue
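To compare what’s attached against what’s required, a small helper can do the diff. This is a sketch; the policy names are the examples above, so substitute whatever your application actually needs:

```shell
#!/bin/sh
# Print required policies that are NOT in the attached list.
# $1: space-separated attached policy names (from list-attached-role-policies)
# remaining args: required policy names
missing_policies() {
  attached=$1; shift
  for p in "$@"; do
    case " $attached " in
      *" $p "*) ;;            # attached: nothing to report
      *) echo "$p" ;;         # missing
    esac
  done
}

missing_policies "CloudWatchAgentServerPolicy" \
  AmazonS3ReadOnlyAccess CloudWatchAgentServerPolicy
# prints: AmazonS3ReadOnlyAccess
```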
If policies are missing, attach them:
aws iam attach-role-policy \
--role-name my-app-role \
--policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
Step 4: Verify Security Group is in Correct VPC
Check the ASG’s VPC and the security group’s VPC:
# Get ASG VPC and subnet info
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names my-app-asg \
--region us-east-1 \
--query 'AutoScalingGroups[0].VPCZoneIdentifier' \
--output text
# Output: subnet-1a,subnet-1b,subnet-1c
# Get the VPC for subnet-1a
aws ec2 describe-subnets \
--subnet-ids subnet-1a \
--region us-east-1 \
--query 'Subnets[0].VpcId' \
--output text
# Output: vpc-prod123
# Check the security group's VPC
aws ec2 describe-security-groups \
--group-ids sg-0abc123 \
--region us-east-1 \
--query 'SecurityGroups[0].VpcId' \
--output text
# Output: vpc-prod123
Both should show the same VPC ID. If they differ, create a new security group in the correct VPC and update the Launch Template.
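Once you have the IDs from the two commands above, the comparison itself is easy to automate in a pre-deploy check. A sketch (the VPC IDs are the hypothetical values from the example output):

```shell
#!/bin/sh
# Flag any subnet whose VPC differs from the security group's VPC.
# $1: the security group's VPC ID; remaining args: each subnet's VPC ID
vpc_check() {
  sg_vpc=$1; shift
  for v in "$@"; do
    if [ "$v" != "$sg_vpc" ]; then
      echo "MISMATCH: subnet VPC $v != security group VPC $sg_vpc"
      return 1
    fi
  done
  echo "all subnets in $sg_vpc"
}

vpc_check vpc-prod123 vpc-prod123 vpc-prod123   # all subnets in vpc-prod123
```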
Step 5: Check Subnet IP Availability
If the subnet is out of IPs, ASG can’t launch:
# Check available IPs in each subnet
aws ec2 describe-subnets \
--subnet-ids subnet-1a subnet-1b subnet-1c \
--region us-east-1 \
--query 'Subnets[*].[SubnetId,AvailableIpAddressCount,CidrBlock]' \
--output table
If AvailableIpAddressCount is 0 for all subnets, launches are blocked by IP exhaustion. Solutions:
- Add a new subnet with a larger CIDR block to the ASG
- Terminate unused resources to free IPs
- Use IPv6 (separate address space)
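This check drops easily into a deploy script: feed it the AvailableIpAddressCount values from the query above and fail fast when every subnet is exhausted. A sketch:

```shell
#!/bin/sh
# Exit non-zero if every subnet has zero free IPs.
# Args: AvailableIpAddressCount per subnet (from describe-subnets).
any_capacity() {
  for count in "$@"; do
    if [ "$count" -gt 0 ]; then
      echo "free IPs available"
      return 0
    fi
  done
  echo "all subnets exhausted"
  return 1
}

any_capacity 0 0 0    # all subnets exhausted (non-zero exit)
any_capacity 0 0 3    # free IPs available
```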
Step 6: Check Health Check Grace Period
If instances launch then terminate immediately, the health check is likely the issue:
# Get the current health check settings
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names my-app-asg \
--region us-east-1 \
--query 'AutoScalingGroups[0].[HealthCheckType,HealthCheckGracePeriod]' \
--output text
# Output: ELB 300
The grace period is 300 seconds (5 minutes). If your instance takes longer to boot and become healthy, increase it:
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name my-app-asg \
--health-check-grace-period 600 \
--region us-east-1
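Rather than guessing at a value, measure how long your instance actually takes from launch to healthy, then derive the grace period with headroom. One possible rule of thumb (the 50% margin and minute rounding are my assumptions, not an AWS recommendation):

```shell
#!/bin/sh
# Suggest a HealthCheckGracePeriod: measured boot-to-healthy time in
# seconds, plus 50% headroom, rounded up to a whole minute.
grace_period() {
  boot=$1
  padded=$(( boot + boot / 2 ))
  echo $(( (padded + 59) / 60 * 60 ))
}

grace_period 90    # 90s boot  -> 180
grace_period 400   # 400s boot -> 600
```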
How to Run This
- Check Activity History: aws autoscaling describe-scaling-activities --auto-scaling-group-name my-asg
- Read the error message carefully — it points to the root cause
- If AMI error: find a valid AMI with aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-*"
- If IAM error: attach missing policies to the instance profile role
- If security group error: verify the SG is in the same VPC as the subnets
- If IP exhaustion: check AvailableIpAddressCount in the subnets; allocate a new subnet if needed
- If health check error: increase HealthCheckGracePeriod to allow more boot time
Is This Safe?
All diagnostic commands are read-only. Updates to the Launch Template, IAM policies, and health check grace period are safe and take effect on the next scaling event. Existing instances are unaffected until the next launch.
Key Takeaway
ASG scaling failures are almost always due to invalid AMI, missing IAM permissions, security group in wrong VPC, or insufficient IP addresses. Use the Activity History StatusMessage to pinpoint the exact cause, then fix accordingly. Always increase HealthCheckGracePeriod for applications that take >30 seconds to start.
Struggling with ASG scaling failures? Connect with me on LinkedIn or X.