I deployed a Site-to-Site VPN to connect our on-premises data center to AWS, and the connection worked at first. But after a few hours, the tunnel status started oscillating between UP and DOWN, sometimes every few minutes. Traffic would work for a bit, then drop, then come back. CloudWatch showed the tunnel flapping constantly, and our applications were experiencing brief outages. After investigating, I found the issue was related to IKE dead peer detection settings and NAT timeout on our firewall. In this post, I’ll walk through exactly what causes this and how to fix it.

The Problem

Your Site-to-Site VPN connection keeps flapping between UP and DOWN states. CloudWatch metrics show the tunnel going up and down repeatedly. AWS reports “Tunnel Up” but traffic intermittently fails, then recovers moments later. The pattern repeats every few minutes, and your applications suffer from connection drops.

Here’s what CloudWatch and your router both show:

2026-02-16 14:00:01 Tunnel 1 UP
2026-02-16 14:00:45 Tunnel 1 DOWN - Phase 1 Failed
2026-02-16 14:01:02 Tunnel 1 UP
2026-02-16 14:01:48 Tunnel 1 DOWN - IKE Timeout

Status Flap: Every 45-60 seconds
Tunnel Availability: ~70%

Typical symptoms:

  • Tunnel Flapping: VPN tunnel status cycles up/down multiple times per minute
  • IKE Session Timeout: Phase 1 or Phase 2 negotiation fails due to timeout
  • Tunnel Only Partially Up: one tunnel up, one down (asymmetric)
  • DPD Causing Drops: Dead Peer Detection aggressively closing idle tunnels
  • High Latency Perceived as Down: network latency causes DPD probes to time out
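To quantify the flapping before and after a fix, you can count DOWN transitions straight from an exported log. A small sketch, assuming the log lines match the format shown above:

```shell
# Build a sample log in the format above, then count DOWN events with awk.
# Field 5 is the UP/DOWN state ("<date> <time> Tunnel <n> <state> ...").
cat > tunnel.log <<'EOF'
2026-02-16 14:00:01 Tunnel 1 UP
2026-02-16 14:00:45 Tunnel 1 DOWN - Phase 1 Failed
2026-02-16 14:01:02 Tunnel 1 UP
2026-02-16 14:01:48 Tunnel 1 DOWN - IKE Timeout
EOF
DOWNS=$(awk '$5 == "DOWN" { n++ } END { print n + 0 }' tunnel.log)
echo "DOWN events: $DOWNS"   # → DOWN events: 2
```

Run the same count on a 24-hour export after the changes; a healthy tunnel should show zero or near-zero DOWN events outside of scheduled rekeys.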

Why Does This Happen?

  • IKE Dead Peer Detection (DPD) aggressively dropping tunnels: DPD sends liveness probes (R-U-THERE messages) once the tunnel has been idle for a set interval (typically 10-30 seconds). If the peer fails to respond within the retry window, the default “clear” action tears the tunnel down without re-establishing it, so transient packet loss or a latency spike can kill a healthy tunnel.

  • On-premises firewall resetting the tunnel due to NAT timeout: If your corporate firewall uses connection tracking and expires idle UDP flows (ports 500/4500 for IKE and NAT-T), the tunnel gets reset. IKE expects a stable UDP port mapping; if the firewall drops the idle flow and assigns a new source port on the next packet, the peer no longer recognizes the session and the tunnel drops.

  • Phase 1/Phase 2 lifetime mismatch causing re-keying issues: If your router and AWS have different IKE Phase 1 lifetimes or Phase 2 lifetimes, re-keying happens at different times. Mismatches can cause race conditions where one side initiates re-key while the other is already negotiating, causing temporary session loss.

  • Only one tunnel being used, no failover configured: AWS VPN has two tunnels for redundancy. If you’re only using one tunnel in your BGP routing, that single tunnel is a single point of failure. If it drops, there’s no automatic failover to the second tunnel.

  • MTU issues with IPsec overhead not accounted for: IPsec adds overhead (typically 70-100 bytes). If your path MTU is 1500, a full-size packet grows past it after encapsulation (to roughly 1570-1600 bytes) and gets fragmented. Fragmentation can cause keepalive packets to be dropped, triggering DPD timeouts.

The Fix

First, check both tunnel statuses:

aws ec2 describe-vpn-connections \
  --vpn-connection-ids vpn-0a1b2c3d4e5f6g7h8 \
  --query 'VpnConnections[0].[VpnConnectionId,State,VgwTelemetry[*].[OutsideIpAddress,Status,LastStatusChange]]' \
  --output table
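If you want to script on the result rather than eyeball a table, the telemetry JSON has one entry per tunnel and you can flag any that are not UP. An offline sketch using sample data (the field names match the EC2 API response; the IPs are placeholders):

```shell
# Sample of the VgwTelemetry structure returned by describe-vpn-connections
cat > telemetry.json <<'EOF'
{"VgwTelemetry": [
  {"OutsideIpAddress": "203.0.113.10", "Status": "UP"},
  {"OutsideIpAddress": "203.0.113.11", "Status": "DOWN"}
]}
EOF
# List the outside IPs of any tunnel not reporting UP
python3 - <<'PY'
import json
telemetry = json.load(open("telemetry.json"))["VgwTelemetry"]
down = [t["OutsideIpAddress"] for t in telemetry if t["Status"] != "UP"]
print("Tunnels not UP:", down)   # → Tunnels not UP: ['203.0.113.11']
PY
```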

Fix DPD Configuration

If your customer gateway supports a DPD action setting (e.g., strongSwan’s dpdaction), change it from “clear” to “restart” so the session is re-initiated instead of torn down. On the AWS side, the equivalent tunnel option is DPDTimeoutAction, which also defaults to “clear”. On Cisco gear there is no action setting; instead, tune the DPD interval and retries so transient loss doesn’t trip liveness detection.
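The AWS-side action can be changed in place with the ModifyVpnTunnelOptions API. A sketch (the VPN ID matches the placeholder used in this post; the tunnel outside IP is a placeholder too):

```shell
aws ec2 modify-vpn-tunnel-options \
  --vpn-connection-id vpn-0a1b2c3d4e5f6g7h8 \
  --vpn-tunnel-outside-ip-address 203.0.113.10 \
  --tunnel-options 'DPDTimeoutSeconds=40,DPDTimeoutAction=restart'
```

Applying this renegotiates the tunnel once, so run it on one tunnel at a time.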

For Cisco IOS IPsec:

crypto ikev2 proposal PROPOSAL_1
  encryption aes-cbc-256
  integrity sha256
  dh-group 14
!
crypto ikev2 policy POLICY_1
  proposal PROPOSAL_1
!
crypto ikev2 keyring KEY_1
  peer <AWS_PUBLIC_IP>
    address <AWS_PUBLIC_IP>
    pre-shared-key <PSK>
!
crypto ikev2 profile PROFILE_1
  match identity remote address <AWS_PUBLIC_IP>
  authentication remote pre-share
  authentication local pre-share
  keyring local KEY_1
  lifetime 28800
  dpd 10 3 on-demand
!
crypto ipsec transform-set TRANSFORM_1 esp-aes 256 esp-sha256-hmac
  mode tunnel
!
crypto ipsec profile IPSEC_PROFILE_1
  set transform-set TRANSFORM_1
  set ikev2-profile PROFILE_1
  set pfs group14
  set security-association lifetime seconds 3600
!
interface Tunnel0
  ip address 169.254.10.2 255.255.255.252
  tunnel source <YOUR_PUBLIC_IP>
  tunnel destination <AWS_PUBLIC_IP>
  tunnel mode ipsec ipv4
  tunnel protection ipsec profile IPSEC_PROFILE_1
  no shut

Use Both Tunnels with Equal-Cost Multi-Path

Configure your BGP routing to use both tunnels with equal cost. This provides automatic failover if one tunnel drops.

On your router, create BGP routes with equal weight for both tunnel IPs:

router bgp 65000
  neighbor 169.254.10.1 remote-as 64512
  neighbor 169.254.11.1 remote-as 64512

  address-family ipv4
    neighbor 169.254.10.1 activate
    neighbor 169.254.11.1 activate
    maximum-paths 2
    network 10.0.0.0 mask 255.255.0.0
  exit-address-family

Set Matching IKE Timers

Ensure Phase 1 and Phase 2 lifetimes match between your router and AWS. AWS defaults:

  • IKE Phase 1 lifetime: 28800 seconds (8 hours)
  • IKE Phase 2 lifetime: 3600 seconds (1 hour)

On your router, set matching values (shown in Cisco IOS example above). AWS will auto-negotiate if your values are within acceptable ranges.
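The AWS-side lifetimes can also be pinned explicitly per tunnel via ModifyVpnTunnelOptions; a sketch with placeholder IDs:

```shell
aws ec2 modify-vpn-tunnel-options \
  --vpn-connection-id vpn-0a1b2c3d4e5f6g7h8 \
  --vpn-tunnel-outside-ip-address 203.0.113.10 \
  --tunnel-options 'Phase1LifetimeSeconds=28800,Phase2LifetimeSeconds=3600'
```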

Fix NAT Timeout Issues on Firewall

If your firewall is resetting IPsec connections due to NAT timeout, increase the connection tracking timeout for UDP port 500 and 4500:

! Example for Cisco ASA: raise idle timeouts so IKE/NAT-T flows
! are not expired between keepalives
timeout conn 1:00:00
timeout half-closed 0:10:00
timeout udp 0:30:00

! To raise the timeout only for IPsec control traffic (UDP 500/4500),
! use a per-class connection timeout instead of the global one:
access-list IPSEC_UDP extended permit udp any any eq 500
access-list IPSEC_UDP extended permit udp any any eq 4500
class-map IPSEC_CLASS
  match access-list IPSEC_UDP
policy-map global_policy
  class IPSEC_CLASS
    set connection timeout idle 1:00:00

Or configure IPsec pass-through on the firewall to bypass NAT for IPsec traffic.

Configure TCP MSS Clamping for Optimal MTU

Set TCP MSS so a full segment still fits after IPsec encapsulation (1379 bytes is the value commonly recommended for AWS Site-to-Site VPN):

router(config)# interface Tunnel0
router(config-if)# ip tcp adjust-mss 1379
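The 1379 figure falls out of simple header math. A sketch of that arithmetic; the overhead figures are approximations assuming tunnel-mode ESP with NAT-T:

```shell
PHYS_MTU=1500
IPSEC_OVERHEAD=73   # outer IP (20) + NAT-T UDP (8) + ESP header/IV/padding/ICV (~45, rounded)
INNER_HEADERS=48    # inner IP (20) + TCP (20) + 8 bytes headroom for options (assumption)
TUNNEL_MTU=$((PHYS_MTU - IPSEC_OVERHEAD))
TCP_MSS=$((TUNNEL_MTU - INNER_HEADERS))
echo "Tunnel MTU ~= $TUNNEL_MTU, clamp TCP MSS to $TCP_MSS"
# → Tunnel MTU ~= 1427, clamp TCP MSS to 1379
```

If your transform set differs (e.g., AES-GCM, no NAT-T), the overhead changes; the safe approach is to clamp a little low rather than a little high.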

Monitor Tunnel Health

Get detailed tunnel metrics:

aws ec2 describe-vpn-connections \
  --vpn-connection-ids vpn-0a1b2c3d4e5f6g7h8 \
  --query 'VpnConnections[0].VgwTelemetry[*].[OutsideIpAddress,Status,AcceptedRouteCount,LastStatusChange]' \
  --output table

Set up CloudWatch alarms for tunnel status:

aws cloudwatch put-metric-alarm \
  --alarm-name vpn-tunnel-down \
  --alarm-description "Alert when VPN tunnel is down" \
  --metric-name TunnelState \
  --namespace AWS/VPN \
  --dimensions Name=VpnId,Value=vpn-0a1b2c3d4e5f6g7h8 \
  --statistic Minimum \
  --period 60 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --evaluation-periods 1

How to Run This

  1. Change the DPD action from “clear” to “restart” (on your router if it supports it, and/or via the AWS tunnel options).
  2. Configure BGP to use both tunnels with equal cost.
  3. Ensure IKE Phase 1/2 lifetimes match (or are within AWS-acceptable ranges).
  4. Increase UDP timeout on your firewall for IPsec ports (500/4500).
  5. Enable TCP MSS clamping on the tunnel interface to 1379.
  6. Monitor tunnel status with CloudWatch metrics.
  7. Test by monitoring tunnel state over 24 hours for stability.
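For step 7, CloudWatch can summarize a full day of tunnel state in one call; a sketch (the VPN ID is a placeholder, and the date arithmetic assumes GNU date):

```shell
aws cloudwatch get-metric-statistics \
  --namespace AWS/VPN \
  --metric-name TunnelState \
  --dimensions Name=VpnId,Value=vpn-0a1b2c3d4e5f6g7h8 \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Minimum \
  --output table
```

Any 5-minute bucket whose Minimum is below 1 is a window where at least one tunnel dipped.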

Is This Safe?

Yes, these are standard, low-risk configurations for VPN deployments. DPD “restart” is the recommended mode, and routing over both tunnels improves reliability. One caveat: changing IKE profiles or AWS tunnel options forces a brief renegotiation, so apply changes to one tunnel at a time, ideally during a low-traffic window.

Key Takeaway

VPN flapping is usually caused by aggressive DPD, NAT timeout on firewalls, or timer mismatches. Switch DPD to restart mode, use both tunnels in your routing for automatic failover, and configure matching IKE timers. Monitor tunnel status over time—stability should improve immediately after these changes.


Have questions or ran into a different networking issue? Connect with me on LinkedIn or X.