I set up an AWS Direct Connect virtual interface with BGP to our data center, and everything seemed to work initially. But within hours, the BGP session started dropping and re-establishing every few minutes. The metrics showed the session was flapping—up for a moment, then down, then up again. This caused intermittent connectivity that broke our applications. After investigating, I found the issue was related to BGP timer configuration and a physical layer problem on the cross-connect. In this post, I’ll walk through exactly what causes this and how to fix it.
The Problem
Your Direct Connect BGP session keeps dropping and re-establishing, causing intermittent connectivity between AWS and your on-premises environment. The session shows as “Established” in the console, but metrics reveal it’s constantly flapping. Traffic works intermittently, and applications suffer from connection resets.
Here’s what you see in AWS CloudWatch and on your router:
2026-02-16 10:05:23 BGP Session Established
2026-02-16 10:05:45 BGP Session Dropped - Hold Timer Expired
2026-02-16 10:06:02 BGP Session Established
2026-02-16 10:06:28 BGP Session Dropped - Hold Timer Expired
Uptime: 22 seconds
Downtime: 17 seconds
Flap: Every 40 seconds
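From timestamps like these you can measure the flap cycle directly. A rough sketch (assumes GNU `date`; the log format below mirrors the excerpt above and is illustrative):

```shell
# Compute seconds between consecutive BGP state changes in a log excerpt.
prev=""
while read -r day time _; do
  cur=$(date -d "$day $time" +%s)          # parse timestamp to epoch seconds
  [ -n "$prev" ] && echo "interval: $((cur - prev))s"
  prev=$cur
done <<'EOF'
2026-02-16 10:05:23 Established
2026-02-16 10:05:45 Dropped
2026-02-16 10:06:02 Established
2026-02-16 10:06:28 Dropped
EOF
```

Intervals well under the 90-second hold time between state changes point at expiring timers rather than clean restarts.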
| Issue | Description |
|---|---|
| BGP Flapping | Session repeatedly goes up and down every few minutes |
| Hold Timer Expiration | “Neighbor unreachable” due to missed keepalives |
| Packet Loss on Cross-Connect | BGP keepalives not arriving on time |
| BFD Not Configured | No fast detection mechanism for link failures |
| Route Oscillation | Routes appear and disappear, disrupting traffic |
Why Does This Happen?
- BGP keepalive/hold timer mismatch: AWS Direct Connect defaults to a 30-second keepalive and a 90-second hold timer. BGP negotiates the lower of the two configured hold times, so if your customer router runs aggressive timers (e.g., a 30-second hold), a few lost or delayed keepalives are enough for one side to declare the peer dead and tear the session down. Keep the timers consistent on both sides.
- BFD (Bidirectional Forwarding Detection) not configured: BFD is a separate protocol that detects link failures much faster than BGP keepalives (sub-second detection). Without BFD, BGP relies solely on the hold timer: if packet loss delays keepalives, the hold timer expires and the session drops.
- Physical layer issues on the cross-connect: intermittent packet loss on the Direct Connect physical link delays or drops BGP keepalive packets, expiring the hold timer on one or both sides. The link may show marginal optical signal levels or port errors.
- Too many BGP routes being advertised: AWS accepts up to 100 prefixes per BGP session on a private virtual interface. If you exceed this limit, AWS moves the session to an idle state as a protection measure. Aggregate your routes or request a prefix limit increase.
- MTU mismatch or TCP MSS issues: jumbo frames (MTU 9001) may be configured on your router but not on the AWS side or the cross-connect. Packets that exceed the path MTU get fragmented or dropped, which can take BGP updates and keepalives with them and flap the session.
The Fix
First, check the virtual interface status and BGP state:
aws directconnect describe-virtual-interfaces \
--virtual-interface-id dxvif-0a1b2c3d4e5f6g7h8 \
--query 'virtualInterfaces[0].[virtualInterfaceState,bgpPeers]' \
--output text
Look at the BGP peers to see the current state:
aws directconnect describe-virtual-interfaces \
--virtual-interface-id dxvif-0a1b2c3d4e5f6g7h8 \
--query 'virtualInterfaces[0].bgpPeers[*].[bgpPeerId,bgpStatus,asn,authKey]' \
--output table
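If you're scripting this check, jq can pull out just the peer status. A sketch against a canned response (field names follow the Direct Connect API; in practice, pipe in the aws command above instead of the here-document):

```shell
# Extract "peer-id: status" pairs from describe-virtual-interfaces JSON.
cat <<'EOF' | jq -r '.virtualInterfaces[0].bgpPeers[] | "\(.bgpPeerId): \(.bgpStatus)"'
{"virtualInterfaces":[{"bgpPeers":[{"bgpPeerId":"dxpeer-0123","bgpStatus":"down"}]}]}
EOF
```

This prints `dxpeer-0123: down` for the sample above, which makes it easy to alert on any peer not reporting `up`.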
Fix BGP Timer Configuration
On your customer router, configure BGP timers to match AWS. AWS defaults:
- Keep-alive: 30 seconds
- Hold time: 90 seconds
On your router (example for Cisco IOS):
router bgp 65000
neighbor 169.254.1.1 timers 30 90
neighbor 169.254.1.1 timers connect 10
Verify from the AWS side:
aws directconnect describe-virtual-interfaces \
--virtual-interface-id dxvif-0a1b2c3d4e5f6g7h8 \
--query 'virtualInterfaces[0].bgpPeers[*].[customerAddress,asn]'
Enable BFD for Faster Detection
BFD must run on both ends of the session. On your router (Cisco IOS example; the /30 peering addresses are illustrative):
interface GigabitEthernet0/0
ip address 169.254.1.2 255.255.255.252
bfd interval 300 min_rx 300 multiplier 3
!
router bgp 65000
neighbor 169.254.1.1 fall-over bfd
On the AWS side, asynchronous BFD is enabled automatically on Direct Connect virtual interfaces; it takes effect as soon as BFD is configured on your router, so there's nothing to change in the console and no support ticket to raise. The AWS defaults (300 ms minimum interval, multiplier of 3) match the configuration above.
Check for Physical Layer Issues
Verify the physical connection has no errors:
aws directconnect describe-connections \
--connection-id dxcon-0a1b2c3d4e5f6g7h8 \
--query 'connections[0].[connectionState,portEncryptionStatus]'
Check CloudWatch metrics for link errors (the AWS/DX metric ConnectionErrorCount counts MAC-level errors on the port and is a good proxy for physical layer problems):
aws cloudwatch get-metric-statistics \
--namespace AWS/DX \
--metric-name ConnectionErrorCount \
--dimensions Name=ConnectionId,Value=dxcon-0a1b2c3d4e5f6g7h8 \
--start-time 2026-02-15T00:00:00Z \
--end-time 2026-02-16T23:59:59Z \
--period 300 \
--statistics Sum
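To reduce those datapoints to a single number, pipe the JSON through jq. A sketch with a canned sample (assuming Sum statistics were requested, so each datapoint carries a Sum field; in practice, pipe in the aws command's output):

```shell
# Total the error-count datapoints from a get-metric-statistics response.
cat <<'EOF' | jq '[.Datapoints[].Sum] | add'
{"Datapoints":[{"Sum":3},{"Sum":0},{"Sum":12}]}
EOF
```

A nonzero total during the flap windows strengthens the case for a physical layer fault.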
Contact your AWS account team or DX provider if physical layer issues are suspected. Request port diagnostics on the cross-connect.
Aggregate Routes to Stay Within Prefix Limit
For a public virtual interface, count the route filter prefixes directly:
aws directconnect describe-virtual-interfaces \
--virtual-interface-id dxvif-0a1b2c3d4e5f6g7h8 \
--query 'virtualInterfaces[0].routeFilterPrefixes[*].cidr' \
--output text | wc -w
For a private virtual interface, the prefixes you advertise aren't visible through this API; count them on your router instead (on Cisco IOS: show ip bgp neighbors 169.254.1.1 advertised-routes).
If you're over 100, aggregate routes on your router. For example, instead of advertising eight contiguous /24s, advertise a single /21. Repeat this for other contiguous blocks.
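The coverage math is worth keeping in mind: every bit removed from the prefix length doubles how many /24s one aggregate covers.

```shell
# Each shorter prefix length doubles the number of /24s covered: 2^(24 - len).
echo "/24s covered by a /21: $((1 << (24 - 21)))"   # 8
echo "/24s covered by a /20: $((1 << (24 - 20)))"   # 16
```

So collapsing blocks into /21s or /20s shrinks the advertised table quickly, as long as the underlying space is actually contiguous.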
Request a prefix limit increase if aggregation isn’t possible:
# Contact AWS support or account team to increase prefix limit
# API doesn't directly support this request
Configure TCP MSS Clamping
On your router, clamp TCP MSS so segments fit the path MTU. On a plain 1500-byte Direct Connect path that's 1460 (1500 minus 40 bytes of IP and TCP headers); if traffic also traverses an encrypted tunnel over the link, a lower value such as 1379 leaves room for IPsec-style encapsulation overhead:
interface GigabitEthernet0/0
ip tcp adjust-mss 1379
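The clamp value is simple header arithmetic. In the sketch below, the 81-byte figure is an illustrative allowance for tunnel encapsulation, not an AWS constant; size it to your actual overhead:

```shell
mtu=1500
ip_hdr=20; tcp_hdr=20
echo "plain-path MSS: $((mtu - ip_hdr - tcp_hdr))"             # 1460
overhead=81   # illustrative tunnel encapsulation allowance
echo "clamped MSS:    $((mtu - ip_hdr - tcp_hdr - overhead))"  # 1379
```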
How to Run This
- Verify BGP timers match on both sides (keep-alive 30s, hold 90s).
- Enable BFD on the customer router (the AWS side enables it automatically).
- Check for physical layer issues — request diagnostics on the cross-connect.
- Aggregate BGP routes if exceeding 100 prefix limit.
- Configure TCP MSS clamping to 1379 on the customer router.
- Monitor CloudWatch metrics for BGP state and packet loss.
- Test failover if you have redundant Direct Connect connections or a backup VPN path.
Is This Safe?
Yes, these are standard, low-risk Direct Connect configurations. One caveat: on most routers, new BGP timers only take effect when the session is re-established, so apply timer changes in a maintenance window. BFD and MSS clamping are safe best practices.
Key Takeaway
BGP flapping is usually caused by timer mismatches, missing keepalives due to packet loss, or physical layer issues. Configure matching timers on both sides, enable BFD for faster detection, and ensure the physical connection is healthy. If routes keep oscillating, aggregate them to stay within the prefix limit.
Have questions or ran into a different networking issue? Connect with me on LinkedIn or X.