If you’ve woken up to a CloudWatch alert saying your EC2 instance is running at 95% CPU utilization and the application is sluggish or unresponsive, you know the panic of not knowing what’s consuming all those cycles. I’ve triaged hundreds of high CPU incidents, and the fix is rarely complicated — it just requires methodical diagnosis. In this post, I’ll walk you through finding the culprit and stopping it.

The Problem

Your CloudWatch alarm fires showing CPUUtilization > 90% sustained for 5+ minutes. You check the EC2 console and see a red line on the CPU graph. The application becomes slow, requests time out, and users are calling. Here’s what you’re seeing:

Symptom               Description
-------------------   --------------------------------------------------------------
High CPU Alert        CloudWatch alarm triggered: CPUUtilization > 90% for 5 minutes
Slow Application      Requests are taking 10x longer than normal
System Unresponsive   SSH commands lag, processes feel stuck

The instance is still running, but it’s clearly maxed out.

Why Does This Happen?

  • Runaway process (loop or leak) — A process is stuck in an infinite loop or consuming CPU due to a memory leak triggering excessive garbage collection.
  • Instance too small for workload — You launched a t3.micro to run a data processing job or ML model that needs c5.2xlarge. The workload simply exceeds the instance’s capacity.
  • Memory pressure causing excessive swap I/O — The instance ran out of RAM and started swapping to disk. Disk is orders of magnitude slower than RAM, and the kernel burns CPU servicing the resulting page faults.
  • Cron job or scheduled task consuming CPU — A backup script, database vacuum, or log rotation runs at a specific time and consumes all available CPU.

The Fix

The diagnostic process is systematic. Start by looking at CloudWatch metrics, then dive into the instance itself.

Step 1: Get Historical CPU Data

Pull CPU utilization data from CloudWatch for the past hour to understand the pattern:

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456ghij \
  --start-time 2026-02-03T00:00:00Z \
  --end-time 2026-02-03T02:00:00Z \
  --period 300 \
  --statistics Average,Maximum \
  --region us-east-1 \
  --output table

Look at the timestamp. If the spike happened at 2:00am, it’s likely a cron job. If it’s sustained, it’s likely a runaway process.

Step 2: SSH In and Check Running Processes

Find the process consuming the most CPU:

# SSH to the instance
ssh -i key.pem ec2-user@<public-ip>

# Show top 10 processes sorted by CPU usage
ps aux --sort=-%cpu | head -11
# Output: USER PID %CPU %MEM COMMAND
# root 1234 95.2 5.1 /usr/bin/python

The %CPU column tells you the percentage of total CPU. If a single process is at 95%, that’s your culprit. Note the PID and process name.
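To grab the top consumer's PID and usage programmatically (a minimal sketch; field positions assume the standard `ps aux` column layout):

```shell
# Print the PID and %CPU of the single highest-CPU process.
# NR==2 skips the header row of `ps aux` output.
ps aux --sort=-%cpu | awk 'NR==2 {printf "PID=%s CPU=%s%%\n", $2, $3}'
```

This is handy in a script: the extracted PID feeds directly into the `/proc` checks in the next step.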

Step 3: Investigate the Suspicious Process

Once you’ve identified the process (e.g., /usr/bin/python at PID 1234), investigate what it is:

# See the full command line
cat /proc/1234/cmdline | tr '\0' ' '
# Output: /usr/bin/python /opt/myapp/process.py

# Check which user owns it
ps -p 1234 -o user=

# See how long it's been running
ps -p 1234 -o etime=

Now you know what process is running, who owns it, and how long it’s been alive. Is it a legitimate process or something unexpected?
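The three checks above can be bundled into one helper (`inspect_pid` is a hypothetical name, not a standard tool):

```shell
# Summarize a PID: full command line, owning user, and elapsed run time
inspect_pid() {
  pid=$1
  echo "cmd:   $(tr '\0' ' ' < /proc/"$pid"/cmdline)"
  echo "user:  $(ps -p "$pid" -o user=)"
  echo "etime: $(ps -p "$pid" -o etime=)"
}

# Usage: inspect_pid 1234
```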

Step 4: Check for Memory Pressure (Swap Utilization)

High CPU sometimes masks a memory problem. Check if swapping is happening:

# Check memory and swap usage
free -h
# Output:
#               total   used   free
# Mem:          7.8Gi   7.6Gi  0.2Gi
# Swap:         2.0Gi   1.8Gi  0.2Gi

# If Swap is heavily used (>50%), that's your issue. Watch swap I/O with
# vmstat: non-zero si/so columns mean the kernel is actively swapping.
vmstat 1 5

If swap is maxed and CPU is high, the instance needs more RAM or you have a memory leak.
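To reduce this check to a single number, you can read /proc/meminfo directly (the same source free uses); a minimal sketch:

```shell
# Report swap usage as a percentage; handles the no-swap case
awk '/^SwapTotal/ {t=$2} /^SwapFree/ {f=$2} END {
  if (t > 0) printf "Swap used: %d%%\n", 100 * (t - f) / t
  else print "No swap configured"
}' /proc/meminfo
```

Anything sustained above roughly 50% while CPU is pegged points at memory pressure rather than a CPU-bound process.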

Step 5: Check Cron Jobs and Scheduled Tasks

If the spike happens at regular times, check cron:

# List per-user cron jobs (Amazon Linux/RHEL path; Debian/Ubuntu uses
# /var/spool/cron/crontabs/ instead)
sudo cat /var/spool/cron/*

# Check system-wide cron entries
cat /etc/crontab
ls -la /etc/cron.d/
ls -la /etc/cron.daily/

If you find a cron job running at 2:00am every night and your CPU spike is at 2:00am, you’ve found it. Consider rescheduling the job to off-peak hours.
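If you already know the spike hour, you can filter cron entries by the hour field (field 2 of a crontab line, as in the "2" in "0 2 * * *"). A sketch; file paths vary by distribution and may not all exist:

```shell
# Print non-comment cron entries scheduled at hour 2
HOUR=2
awk -v h="$HOUR" '!/^[[:space:]]*#/ && $2 == h {print FILENAME ": " $0}' \
  /etc/crontab /etc/cron.d/* 2>/dev/null || true
```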

Step 6: Stop the Offending Process (If Safe)

Once you’ve identified the process and verified it’s not critical, you can stop it:

# Graceful kill (signal 15)
kill 1234

# Wait 5 seconds, then check if it's gone
sleep 5
ps -p 1234

# If still running, force kill (signal 9)
kill -9 1234

Monitor CloudWatch CPU over the next few minutes; keep in mind basic monitoring reports at 5-minute granularity, so for immediate feedback run top or uptime on the instance itself. The utilization should drop. If it does, the process was the culprit. If it doesn't, another process (or several) is consuming CPU.
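The term-wait-kill sequence above can be wrapped in a small helper (`safe_kill` is a hypothetical name):

```shell
# Send SIGTERM, give the process up to 5 seconds to exit cleanly,
# and only escalate to SIGKILL if it is still alive.
safe_kill() {
  pid=$1
  kill "$pid" 2>/dev/null || return 0       # already gone
  for _ in 1 2 3 4 5; do
    kill -0 "$pid" 2>/dev/null || return 0  # exited after SIGTERM
    sleep 1
  done
  kill -9 "$pid"                            # last resort
}

# Usage: safe_kill 1234
```

SIGTERM first gives the process a chance to flush buffers and release locks; SIGKILL skips all cleanup, so it stays the fallback.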

Step 7: Right-Size Your Instance

If the CPU spike is legitimate workload (not a runaway process), your instance type is too small. Use the CLI to compare instance type CPU specifications:

# See all available instance types and their CPU count
aws ec2 describe-instance-types \
  --filters "Name=instance-type,Values=t3.*" \
  --query 'InstanceTypes[*].[InstanceType,VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB]' \
  --output table

If your t3.micro (1 vCPU) is consistently at 90% running your workload, upgrade to t3.small (2 vCPU) or c5.large (2 vCPU). Keep in mind that t3 instances are burstable: sustained load drains their CPU credits and the instance gets throttled to its baseline, so a compute-optimized type is often the better fit for steady workloads. Stop the instance, change the instance type, and restart:

# Stop the instance (the type can only be changed while stopped)
aws ec2 stop-instances --instance-ids i-0abc123def456ghij
aws ec2 wait instance-stopped --instance-ids i-0abc123def456ghij

# Change the instance type (EBS-backed instances only)
aws ec2 modify-instance-attribute \
  --instance-id i-0abc123def456ghij \
  --instance-type "{\"Value\": \"t3.small\"}"

# Start the instance
aws ec2 start-instances --instance-ids i-0abc123def456ghij

How to Run This

  1. Open CloudWatch Dashboards → find your instance’s CPU metric → note the timestamp of the spike.
  2. SSH to the instance: ssh -i key.pem ec2-user@<public-ip>
  3. Run ps aux --sort=-%cpu | head -11 to find the top CPU consumer.
  4. Run cat /proc/PID/cmdline to see the full command of the culprit.
  5. Check free -h for memory pressure. If Swap is high, the issue is memory, not CPU.
  6. Check cron jobs: sudo cat /var/spool/cron/* (or /var/spool/cron/crontabs/* on Debian/Ubuntu) to see if a scheduled job is running.
  7. If the process is safe to kill, run kill PID, escalating to kill -9 PID only if it doesn't exit.
  8. If the workload is legitimate but CPU is still high, resize the instance: stop it, run aws ec2 modify-instance-attribute --instance-id <id> --instance-type "{\"Value\": \"t3.small\"}", then start it again.
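The read-only checks in steps 3–5 can be collected into one quick triage snapshot (a sketch; assumes standard procps tools are installed):

```shell
#!/bin/sh
# One-shot snapshot of the usual high-CPU suspects
echo "== Load average =="
uptime
echo "== Top CPU consumers =="
ps aux --sort=-%cpu | head -6
echo "== Memory and swap =="
free -h
```

Running this first thing after SSH-ing in usually narrows the cause to one of the four categories above within seconds.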

Is This Safe?

Reading metrics and processes is completely safe. Killing a process is disruptive if it’s important — always identify the process before killing it. Right-sizing requires an instance reboot (brief downtime). Test instance type changes in non-production first.

Key Takeaway

High CPU is almost always one of four things: a runaway process, insufficient instance capacity, memory pressure with swapping, or a cron job. Use CloudWatch metrics to identify when, use Linux process tools to identify what, then fix accordingly.


Chasing CPU spikes and need help? Connect with me on LinkedIn or X.