I deployed a Site-to-Site VPN to connect our on-premises data center to AWS, and the connection worked at first. But after a few hours, the tunnel status started oscillating between UP and DOWN, sometimes every few minutes. Traffic would work for a bit, then drop, then come back. CloudWatch showed the tunnel flapping constantly, and our applications were experiencing brief outages. After investigating, I found the issue was related to IKE dead peer detection settings and NAT timeout on our firewall. In this post, I’ll walk through exactly what causes this and how to fix it.
The Problem
Your Site-to-Site VPN connection keeps flapping between UP and DOWN states. CloudWatch metrics show the tunnel going up and down repeatedly. AWS reports “Tunnel Up” but traffic intermittently fails, then recovers moments later. The pattern repeats every few minutes, and your applications suffer from connection drops.
Here’s what CloudWatch and your router both show:
2026-02-16 14:00:01 Tunnel 1 UP
2026-02-16 14:00:45 Tunnel 1 DOWN - Phase 1 Failed
2026-02-16 14:01:02 Tunnel 1 UP
2026-02-16 14:01:48 Tunnel 1 DOWN - IKE Timeout
Status Flap: Every 45-60 seconds
Tunnel Availability: ~70%
| Issue | Description |
|---|---|
| Tunnel Flapping | VPN tunnel status cycles up/down multiple times per minute |
| IKE Session Timeout | Phase 1 or Phase 2 negotiation fails due to timeout |
| Tunnel Only Partially Up | One tunnel up, one down (asymmetric) |
| DPD Causing Drops | Dead Peer Detection aggressively closing idle tunnels |
| High Latency Perceived as Down | Network latency causes DPD to timeout |
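To quantify flapping before and after a change, you can compute availability straight from a status log like the one shown above. Here is a minimal Python sketch; the log format and field positions are assumptions based on that excerpt:

```python
from datetime import datetime

# Hypothetical parser for a status log like the excerpt above.
# Field positions are assumptions based on that format.
LOG = """\
2026-02-16 14:00:01 Tunnel 1 UP
2026-02-16 14:00:45 Tunnel 1 DOWN - Phase 1 Failed
2026-02-16 14:01:02 Tunnel 1 UP
2026-02-16 14:01:48 Tunnel 1 DOWN - IKE Timeout
"""

def parse_transitions(log):
    """Return a list of (timestamp, state) tuples from the log text."""
    events = []
    for line in log.strip().splitlines():
        parts = line.split()
        ts = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S")
        events.append((ts, parts[4]))  # parts[4] is "UP" or "DOWN"
    return events

def uptime_seconds(events):
    """Seconds spent UP between consecutive transitions."""
    up = 0.0
    for (t0, s0), (t1, _) in zip(events, events[1:]):
        if s0 == "UP":
            up += (t1 - t0).total_seconds()
    return up

events = parse_transitions(LOG)
total = (events[-1][0] - events[0][0]).total_seconds()
print(f"UP for {uptime_seconds(events):.0f}s of {total:.0f}s "
      f"(~{100 * uptime_seconds(events) / total:.0f}% availability)")
```

Running this against a day's worth of log lines gives you a baseline number to compare against after applying the fixes below.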
Why Does This Happen?
- Aggressive IKE Dead Peer Detection (DPD) dropping tunnels: With the DPD action set to "clear", the gateway tears down a tunnel whose peer hasn't answered liveness checks within a short window (typically 10-30 seconds). If your application has idle periods, DPD may close the tunnel on the assumption that the peer is dead.
- On-premises firewall resetting the tunnel due to NAT timeout: If your corporate firewall uses connection tracking and times out idle UDP flows (especially ports 500/4500 for IKE and NAT-T), the tunnel gets reset. IKE and NAT-T expect a stable source/destination port mapping; if the firewall expires and recreates the session, the tuple changes and the tunnel drops.
- Phase 1/Phase 2 lifetime mismatch causing re-keying issues: If your router and AWS use different IKE Phase 1 or Phase 2 lifetimes, re-keying happens at different times. Mismatches can cause race conditions where one side initiates a re-key while the other is already negotiating, causing temporary session loss.
- Only one tunnel in use, no failover configured: AWS VPN provides two tunnels for redundancy. If your BGP routing only uses one, that tunnel is a single point of failure: if it drops, there is no automatic failover to the second tunnel.
- MTU issues from unaccounted IPsec overhead: IPsec adds overhead (typically 70-100 bytes). If your path MTU is 1500, encapsulation pushes a full-size packet to roughly 1570 bytes and it gets fragmented. Fragmentation can cause keepalive packets to be dropped, triggering DPD timeouts.
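The overhead arithmetic behind that last point can be sketched as follows. The per-field sizes are assumptions for tunnel-mode ESP with AES-CBC, SHA-256, and NAT-T; real overhead varies with cipher choice and padding:

```python
# Back-of-the-envelope IPsec ESP overhead (assumed values; actual
# overhead depends on cipher, NAT-T, and ESP padding).
OUTER_IP = 20     # new outer IPv4 header (tunnel mode)
UDP_NAT_T = 8     # UDP encapsulation when NAT traversal is in play
ESP_HEADER = 8    # SPI + sequence number
ESP_IV = 16       # AES-CBC initialisation vector
ESP_TRAILER = 2   # pad length + next header (plus 0-15 pad bytes)
ESP_ICV = 16      # truncated SHA-256 integrity check value

overhead = OUTER_IP + UDP_NAT_T + ESP_HEADER + ESP_IV + ESP_TRAILER + ESP_ICV

def encapsulated_size(packet_bytes):
    """Size on the wire after tunnel-mode ESP encapsulation."""
    return packet_bytes + overhead

print(f"overhead ~{overhead} bytes")
print(f"a 1500-byte packet becomes {encapsulated_size(1500)} bytes, "
      f"exceeding a 1500-byte path MTU")
```

Any packet near the path MTU spills over after encapsulation, which is why MSS clamping (covered below) matters.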
The Fix
First, check both tunnel statuses:
aws ec2 describe-vpn-connections \
--vpn-connection-ids vpn-0a1b2c3d4e5f6g7h8 \
--query 'VpnConnections[0].[VpnConnectionId,State,VgwTelemetry[*].[OutsideIpAddress,Status,LastStatusChange]]' \
--output table
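If you'd rather check this programmatically, you can run the same command with `--output json` and parse the telemetry. The JSON below is fabricated sample data that mirrors the shape of the `describe-vpn-connections` response:

```python
import json

# Fabricated sample response, shaped like describe-vpn-connections
# output (run the CLI with --output json to get the real thing).
sample = json.loads("""
{
  "VpnConnections": [{
    "VpnConnectionId": "vpn-0a1b2c3d4e5f6g7h8",
    "State": "available",
    "VgwTelemetry": [
      {"OutsideIpAddress": "203.0.113.10", "Status": "UP"},
      {"OutsideIpAddress": "203.0.113.11", "Status": "DOWN"}
    ]
  }]
}
""")

def down_tunnels(response):
    """Return the outside IPs of any tunnels not currently UP."""
    conn = response["VpnConnections"][0]
    return [t["OutsideIpAddress"] for t in conn["VgwTelemetry"]
            if t["Status"] != "UP"]

# One tunnel down while the other is up is the asymmetric state
# described in the table above.
print(down_tunnels(sample))
```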
Fix DPD Configuration
On your customer gateway (on-premises router), change the DPD action from "clear" to "restart". ("Clear" and "restart" are strongSwan terminology; the equivalent on other platforms is whatever setting makes the gateway re-initiate the session after a failed DPD exchange rather than simply tearing it down.)
For Cisco IOS IPsec:
crypto ikev2 proposal PROPOSAL_1
encryption aes-cbc-256
integrity sha256
dh-group 14
!
crypto ikev2 policy POLICY_1
proposal PROPOSAL_1
!
crypto ikev2 keyring KEY_1
peer <AWS_PUBLIC_IP>
address <AWS_PUBLIC_IP>
pre-shared-key <PSK>
!
crypto ikev2 profile PROFILE_1
match identity remote address <AWS_PUBLIC_IP>
authentication remote pre-share
authentication local pre-share
keyring local KEY_1
lifetime 28800
dpd 10 3 periodic
!
crypto ipsec transform-set TRANSFORM_1 esp-aes 256 esp-sha256-hmac
mode tunnel
!
crypto ipsec profile IPSEC_PROFILE_1
set transform-set TRANSFORM_1
set ikev2-profile PROFILE_1
set pfs group14
!
interface Tunnel0
ip address 169.254.10.2 255.255.255.252
tunnel source <YOUR_PUBLIC_IP>
tunnel destination <AWS_PUBLIC_IP>
tunnel mode ipsec ipv4
tunnel protection ipsec profile IPSEC_PROFILE_1
no shut
Use Both Tunnels with Equal-Cost Multi-Path
Configure your BGP routing to advertise over both tunnels. This provides automatic failover if one tunnel drops. Note that with a virtual private gateway, AWS prefers a single tunnel for traffic toward your network at any given time; true ECMP across tunnels requires AWS Transit Gateway, but advertising over both still gives you fast BGP failover.
On your router, create BGP routes with equal weight for both tunnel IPs:
router bgp 65000
neighbor 169.254.10.1 remote-as 64512
neighbor 169.254.11.1 remote-as 64512
address-family ipv4
neighbor 169.254.10.1 activate
neighbor 169.254.11.1 activate
network 10.0.0.0 mask 255.255.0.0
exit-address-family
Set Matching IKE Timers
Ensure Phase 1 and Phase 2 lifetimes match between your router and AWS. AWS defaults:
- IKE Phase 1 lifetime: 28800 seconds (8 hours)
- IKE Phase 2 lifetime: 3600 seconds (1 hour)
On your router, set matching values (shown in Cisco IOS example above). AWS will auto-negotiate if your values are within acceptable ranges.
Fix NAT Timeout Issues on Firewall
If your firewall is resetting IPsec connections due to NAT timeout, increase the connection tracking timeout for UDP port 500 and 4500:
! Example for Cisco ASA: raise the global UDP idle timeout (default 0:02:00)
timeout conn 1:00:00
timeout udp 0:05:00
! To raise the timeout for IKE/NAT-T traffic only, use the Modular Policy Framework
access-list IPSEC-UDP extended permit udp any any eq isakmp
access-list IPSEC-UDP extended permit udp any any eq 4500
class-map IPSEC-CLASS
 match access-list IPSEC-UDP
policy-map global_policy
 class IPSEC-CLASS
  set connection timeout idle 1:00:00
Or configure IPsec pass-through on the firewall to bypass NAT for IPsec traffic.
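The underlying constraint is simple: NAT-T keepalives must fire more often than the firewall's UDP idle timeout, or the NAT entry expires and the tunnel resets. A quick illustrative check (the 20-second interval is a common NAT-T keepalive default; the margin and timeouts here are assumptions):

```python
# NAT-T keepalives must arrive well inside the firewall's UDP idle
# timeout, or the NAT entry for port 4500 expires and the tunnel resets.
def keepalive_ok(nat_udp_timeout_s, keepalive_interval_s, margin=2):
    """True if at least `margin` keepalives fit inside the NAT timeout."""
    return keepalive_interval_s * margin <= nat_udp_timeout_s

print(keepalive_ok(120, 20))  # 2-minute timeout, 20s keepalives: enough headroom
print(keepalive_ok(30, 20))   # 30s timeout, 20s keepalives: too tight
```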
Configure TCP MSS Clamping for Optimal MTU
Set TCP MSS on the tunnel interface to account for IPsec overhead (1379 is the value AWS uses in its generated Cisco customer gateway configurations):
router(config)# interface Tunnel0
router(config-if)# ip tcp adjust-mss 1379
Monitor Tunnel Health
Get detailed tunnel metrics:
aws ec2 describe-vpn-connections \
--vpn-connection-ids vpn-0a1b2c3d4e5f6g7h8 \
--query 'VpnConnections[0].VgwTelemetry[*].[OutsideIpAddress,Status,AcceptedRouteCount,LastStatusChange]' \
--output table
Set up CloudWatch alarms for tunnel status:
aws cloudwatch put-metric-alarm \
--alarm-name vpn-tunnel-down \
--alarm-description "Alert when VPN tunnel is down" \
--metric-name TunnelState \
--namespace AWS/VPN \
--dimensions Name=VpnId,Value=vpn-0a1b2c3d4e5f6g7h8 \
--statistic Average \
--period 60 \
--threshold 1 \
--comparison-operator LessThanThreshold \
--evaluation-periods 1
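For intuition, here is a simplified model (not the actual CloudWatch engine) of how a LessThanThreshold alarm evaluates TunnelState datapoints, where 1 means a tunnel is up and 0 means it is down:

```python
# Simplified model of CloudWatch alarm evaluation: the alarm fires
# only when every datapoint in the evaluation window breaches the
# threshold (LessThanThreshold semantics).
def alarm_state(datapoints, threshold=1, periods=1):
    """Return ALARM if the most recent `periods` datapoints all breach."""
    recent = datapoints[-periods:]
    return "ALARM" if all(v < threshold for v in recent) else "OK"

print(alarm_state([1, 1, 1]))  # tunnel consistently up
print(alarm_state([1, 1, 0]))  # latest datapoint down
```

Note that a threshold of 0 with LessThanThreshold can never fire, since TunnelState never goes below 0; the threshold has to sit at 1 for a down tunnel (value 0) to breach it.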
How to Run This
- Change DPD action from “clear” to “restart” on your router.
- Configure BGP to use both tunnels with equal cost.
- Ensure IKE Phase 1/2 lifetimes match (or are within AWS-acceptable ranges).
- Increase UDP timeout on your firewall for IPsec ports (500/4500).
- Enable TCP MSS clamping on the tunnel interface to 1379.
- Monitor tunnel status with CloudWatch metrics.
- Test by monitoring tunnel state over 24 hours for stability.
Is This Safe?
Yes, these are standard, low-risk configurations for VPN deployments, though changing DPD or lifetime settings can briefly renegotiate the tunnel, so apply them in a maintenance window. DPD restart is the recommended mode, and using both tunnels improves reliability.
Key Takeaway
VPN flapping is usually caused by aggressive DPD, NAT timeout on firewalls, or timer mismatches. Switch DPD to restart mode, use both tunnels in your routing for automatic failover, and configure matching IKE timers. Monitor tunnel status over time—stability should improve immediately after these changes.
Have questions or ran into a different networking issue? Connect with me on LinkedIn or X.