If you’re using EC2 Spot instances to save 70-90% on compute costs, you know the tradeoff: AWS can reclaim the capacity with only a 2-minute warning. I’ve seen teams lose entire batch jobs, mid-processing pipelines, and millions of dollars’ worth of compute because they didn’t implement interruption handling. In this post, I’ll show you how to detect the warning and shut down gracefully before you lose everything.

The Problem

You’re running a machine learning training job on a Spot instance to save money. Two hours in, AWS reclaims the capacity, seemingly without warning. The training state is lost, the instance terminates abruptly, and you have nothing to show for 2 hours of compute time. You didn’t even know an interruption was coming.

Error Type              | Description
Unexpected Termination  | Spot instance terminated with no warning; data loss
Pipeline Broken         | Batch job mid-execution gets killed; results discarded
No Checkpoint           | No way to resume work where it left off

The frustration is that AWS does warn you — 2 minutes before termination — but you weren’t listening.

Why Does This Happen?

  • AWS reclaims Spot capacity with a 2-minute interruption notice — When demand spikes (other customers willing to pay on-demand prices), AWS can reclaim Spot instances to fulfill on-demand requests. You get a grace period to shut down gracefully.
  • Most teams don’t poll for the notice — The warning is broadcast via an instance metadata endpoint, but if your application doesn’t check it, you’ll never know it’s coming.
  • No checkpoint logic — Even if you detect the warning, you need logic to pause work, save state to S3, and clean up resources within the 2-minute window.

The Fix

The solution has two parts: (1) detect the interruption notice, and (2) implement graceful shutdown logic.

Part 1: Detect the Interruption Warning

AWS broadcasts interruption notices via the instance metadata service. Your application must poll this endpoint every few seconds:

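# Polling loop to detect interruption notice
IMDS=http://169.254.169.254/latest
while true; do
  # Get a short-lived IMDSv2 session token (needed when IMDSv1 is disabled)
  TOKEN=$(curl -s -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")

  # Query the metadata endpoint. -f makes curl fail on the HTTP 404 returned
  # while no interruption is pending, so INTERRUPTION stays empty instead of
  # holding the error page.
  INTERRUPTION=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" \
    "$IMDS/meta-data/spot/instance-action")

  if [ -n "$INTERRUPTION" ]; then
    # Interruption is imminent (at most 2 minutes remaining)
    echo "Spot interruption notice received: $INTERRUPTION"
    echo "2 minutes until termination. Starting graceful shutdown..."

    # Call your checkpoint and shutdown logic here
    /opt/myapp/graceful-shutdown.sh
    break
  fi

  # Poll every 5 seconds
  sleep 5
done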

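The metadata endpoint returns HTTP 404 while no interruption is pending, so a check must treat HTTP errors as "no notice" rather than only testing for an empty body. Once an interruption is scheduled, it returns a JSON payload like this (the action can also be stop or hibernate, depending on the configured interruption behavior):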

{
  "action": "terminate",
  "time": "2026-02-04T15:30:15Z"
}
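
To make the payload concrete, here is a minimal Python sketch that parses it and works out how many seconds remain before the action. The function name seconds_until_action and the optional now parameter are my own choices for illustration; only the "action" and "time" fields come from the payload shown above.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def seconds_until_action(payload: str, now: Optional[datetime] = None) -> float:
    """Return the seconds left before the interruption described by `payload`."""
    notice = json.loads(payload)
    # The "time" field is a UTC timestamp with a trailing "Z"
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    # Allow the caller to pass a fixed "now" so the function is easy to test
    current = now if now is not None else datetime.now(timezone.utc)
    return (deadline - current).total_seconds()
```

Using the remaining time from the payload, rather than assuming a full 120 seconds, matters if your poller only notices the warning a few seconds after it was issued.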

Part 2: Implement Graceful Shutdown

Once you detect the warning, you have 2 minutes to:

  1. Stop accepting new work
  2. Complete in-flight tasks or save their state
  3. Checkpoint to S3 or a database
  4. Clean up resources

Here’s a sample shutdown script:

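#!/bin/bash
# graceful-shutdown.sh

echo "Starting graceful shutdown..."

# Stop accepting new requests (your application must check for this flag file)
touch /var/tmp/shutdown-in-progress

# Assumption: the application writes its PID to this file at startup.
# `jobs` only lists children of the current shell, so it cannot see a
# worker that was started elsewhere.
APP_PID=$(cat /var/run/myapp.pid 2>/dev/null)

# Ask the application to finish, then wait for it to exit (max 60 seconds)
TIMEOUT=60
ELAPSED=0
if [ -n "$APP_PID" ]; then
  kill -TERM "$APP_PID" 2>/dev/null
  while kill -0 "$APP_PID" 2>/dev/null && [ "$ELAPSED" -lt "$TIMEOUT" ]; do
    sleep 1
    ELAPSED=$((ELAPSED + 1))
  done
  # Kill the application forcefully if it is still running
  kill -KILL "$APP_PID" 2>/dev/null
fi

# Checkpoint application state to S3
echo "Saving checkpoint to S3..."
tar -czf /tmp/checkpoint.tar.gz /var/myapp/state/
aws s3 cp /tmp/checkpoint.tar.gz \
  "s3://my-backup-bucket/checkpoints/$(date +%s).tar.gz" \
  --region us-east-1

# Clean up temporary files
rm -rf /tmp/temp-files-*

# Stop the CloudWatch agent so it gets a chance to send buffered logs.
# (The agent control script has no flush-logs action; stop is the closest
# supported one.)
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a stop \
  -m ec2

echo "Graceful shutdown complete. Instance will terminate shortly."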
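
The script above caps the wait for in-flight work at 60 seconds, but nothing enforces the overall 2-minute budget across all four steps. As a rough sketch of that idea in Python, the helper below runs named steps in order and stops starting new ones once a time budget is used up. The name run_shutdown_steps, the 110-second default, and the injectable clock are assumptions for illustration. It only checks the deadline between steps and does not interrupt a step that is already running.

```python
import time
from typing import Callable, List, Tuple


def run_shutdown_steps(
    steps: List[Tuple[str, Callable[[], None]]],
    budget_seconds: float = 110.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Run steps in order; skip whatever is left once the budget is spent.

    Returns the names of the steps that were actually executed.
    """
    deadline = clock() + budget_seconds
    completed = []
    for name, step in steps:
        # Checked between steps only: a slow step is not cut short
        if clock() >= deadline:
            break
        step()
        completed.append(name)
    return completed
```

Because later steps are the ones that get skipped, put whatever you cannot afford to lose (usually the checkpoint) ahead of optional cleanup.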

Part 3: Use EventBridge for Automated Response (Better)

Instead of polling from within your application, use EventBridge to trigger a Lambda function when the interruption notice is broadcast:

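# Create an EventBridge rule for Spot Instance Interruption Warning.
# The instance ID below is a placeholder; drop the "detail" filter to match
# every Spot instance in the account and Region.
aws events put-rule \
  --name spot-interruption-handler \
  --event-bus-name default \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
    "detail": {
      "instance-id": ["i-0abc123def456ghij"]
    }
  }' \
  --region us-east-1

# Allow EventBridge to invoke the Lambda function (rules created from the
# CLI do not add this permission for you)
aws lambda add-permission \
  --function-name HandleSpotInterruption \
  --statement-id spot-interruption-handler \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/spot-interruption-handler \
  --region us-east-1

# Add Lambda as the target
aws events put-targets \
  --rule spot-interruption-handler \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:HandleSpotInterruption" \
  --region us-east-1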

The Lambda function receives the event with the instance ID and can trigger shutdown on the instance via SSM Run Command or a pre-installed agent.
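
As a sketch of what that Lambda function might look like, the handler below reads the instance ID from the event and asks Systems Manager to run the shutdown script on the instance. Treat it as a starting point under several assumptions: the instance runs the SSM agent with an instance profile that allows it, the Lambda role is allowed to call ssm:SendCommand, and the script lives at /opt/myapp/graceful-shutdown.sh as in Part 1. The build_handler wrapper is my own addition so the logic can be exercised with a fake client.

```python
SHUTDOWN_COMMAND = "/opt/myapp/graceful-shutdown.sh"  # path used in Part 1


def build_handler(ssm_client):
    """Return a Lambda-style handler bound to the given SSM client."""

    def handler(event, context):
        # The interruption warning carries the instance ID in its detail
        instance_id = event["detail"]["instance-id"]
        response = ssm_client.send_command(
            InstanceIds=[instance_id],
            DocumentName="AWS-RunShellScript",
            Parameters={"commands": [SHUTDOWN_COMMAND]},
            TimeoutSeconds=60,  # must fit well inside the 2-minute window
        )
        return {
            "instance_id": instance_id,
            "command_id": response["Command"]["CommandId"],
        }

    return handler


def lambda_handler(event, context):
    # Imported lazily so build_handler can be tested without boto3 installed
    import boto3

    return build_handler(boto3.client("ssm"))(event, context)
```

Keep in mind that the event delivery, the Lambda start, and the SSM round trip all eat into the same 2 minutes, so the on-instance script should be as short as you can make it.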

Part 4: Robust Design for Fault Tolerance

Even with graceful shutdown, assume worst-case: the instance terminates before your shutdown completes. Design your application to be resumable:

# Example Python application with checkpointing
import json
import os

import boto3

s3 = boto3.client('s3')
CHECKPOINT_BUCKET = 'my-checkpoints'
CHECKPOINT_KEY = f'job-{os.environ["JOB_ID"]}/progress.json'

def process_item(item_id):
    """Placeholder: replace with your real per-item work"""
    pass

def load_checkpoint():
    """Resume from last checkpoint or start fresh"""
    try:
        response = s3.get_object(Bucket=CHECKPOINT_BUCKET, Key=CHECKPOINT_KEY)
        return json.loads(response['Body'].read())
    except s3.exceptions.NoSuchKey:
        # No checkpoint yet. Other errors (permissions, network) should
        # surface instead of silently restarting the job from zero.
        return {'processed_items': 0, 'total_items': 1000}

def save_checkpoint(progress):
    """Save progress every N items"""
    s3.put_object(
        Bucket=CHECKPOINT_BUCKET,
        Key=CHECKPOINT_KEY,
        Body=json.dumps(progress)
    )

# Main processing loop
progress = load_checkpoint()

for item_id in range(progress['processed_items'], progress['total_items']):
    process_item(item_id)
    progress['processed_items'] = item_id + 1

    # Save checkpoint every 10 completed items
    if progress['processed_items'] % 10 == 0:
        save_checkpoint(progress)

# Record the final state so a restart does not redo the tail of the job
save_checkpoint(progress)

print(f"Completed {progress['processed_items']} items")

How to Run This

  1. Add the polling loop to your application startup (or use the EventBridge rule described above).
  2. Create a graceful-shutdown.sh script that checkpoints your state to S3.
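  3. Test interruption handling: launch a Spot instance, let it run your app, then trigger a simulated interruption (for example with the AWS Fault Injection Service action aws:ec2:send-spot-instance-interruptions) to verify graceful shutdown works. Manually terminating the instance does not deliver the 2-minute notice, so it only tests the no-warning case.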
  4. For batch jobs: use Spot Fleet with capacityOptimized allocation strategy to minimize interruptions across multiple AZs and instance types.
  5. Monitor interruption events in CloudWatch to track how often your instances are interrupted.
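
Before testing on a real instance, it can help to exercise the detection logic locally. The sketch below is one way to do that using only the Python standard library: a tiny fake metadata server that answers 404 until a notice is set, plus a check_notice function that treats 404 as "no interruption pending". FakeIMDS, check_notice, and start_fake_imds are hypothetical names, and the fake does not model IMDSv2 session tokens.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ACTION_PATH = "/latest/meta-data/spot/instance-action"


class FakeIMDS(BaseHTTPRequestHandler):
    notice = None  # set to a dict to simulate a pending interruption

    def do_GET(self):
        if self.path == ACTION_PATH and FakeIMDS.notice is not None:
            body = json.dumps(FakeIMDS.notice).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)  # same status the real endpoint uses

    def log_message(self, *args):
        pass  # keep test output quiet


def check_notice(base_url):
    """Return the parsed notice, or None when the endpoint answers 404."""
    # Bypass any proxy settings so local requests go straight to the server
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base_url + ACTION_PATH, timeout=2) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def start_fake_imds():
    """Start the fake server on a free local port; return (server, base_url)."""
    server = HTTPServer(("127.0.0.1", 0), FakeIMDS)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}"
```

A local run starts the server, confirms check_notice returns None, sets FakeIMDS.notice to a payload like the one shown in Part 1, and confirms the next call returns it. This covers the polling logic only; the simulated interruption in step 3 is still needed to test the full shutdown path.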

Is This Safe?

Polling the metadata endpoint every 5 seconds has minimal overhead. Saving state to S3 is safe and non-destructive. The 2-minute window is fixed, so ensure your checkpoint logic completes within it. Test graceful shutdown against simulated interruptions before deploying to production; a plain manual termination does not deliver the notice.

Key Takeaway

Spot interruptions are not a failure — they’re an expected part of the Spot pricing model. Build your applications to detect the 2-minute warning, checkpoint state, and resume after interruption. This transforms a “catastrophic loss” into “brief pause,” saving you thousands of dollars per year.

