This week I got paged at 2am because a service running in an EKS cluster suddenly couldn’t resolve any of our internal *.internal.company.com hostnames. Public DNS resolved fine. The Route 53 private hosted zone had all the records. The VPC was associated with the zone. I spent an hour in a panic before realizing the VPC had been updated to use a custom DHCP options set a few hours earlier — one that pointed DNS to a corporate forwarder that didn’t know about the private zone. Route 53 was blameless. The DHCP change had cut the VPC off from VPC DNS.

Route 53 issues are often not Route 53 issues — they’re DHCP, VPC, or Resolver rule issues that manifest as DNS failure. Here’s how to trace them quickly.

The Problem

DNS queries fail or return unexpected results:

Symptom → what it means:

  • NXDOMAIN for a private record: the private hosted zone isn’t associated with the VPC, or the VPC isn’t using VPC DNS
  • Intermittent SERVFAIL: a Resolver endpoint is unavailable, or there’s a forwarding loop
  • An answer comes back, but with the wrong IP: a conflicting record in another zone (e.g., a split-horizon failure)
  • Resolution works from EC2 but not from Lambda: the function is attached to a VPC (or subnets) whose VPC doesn’t have the DNS attributes enabled
  • Health check flapping: health checker source IPs are blocked, or a TCP check is hitting a slow endpoint

The DNS layer fails silently — applications usually log it as a connection error, a timeout, or a generic exception, which sends teams hunting the wrong problem.
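When triaging, the rcode from dig plus which resolver you queried already narrows the cause. A toy helper sketching that mapping (the function and its categories are mine, for illustration — not an AWS API):

```shell
# Illustrative triage helper: map a dig rcode and the resolver queried
# ("vpc" = the VPC's .2 resolver, "default" = whatever resolv.conf says)
# to the most likely cause from the table above.
triage() {
  local rcode="$1" resolver="$2"
  case "${rcode}:${resolver}" in
    NXDOMAIN:vpc)     echo "private zone not associated with this VPC" ;;
    NXDOMAIN:default) echo "custom DHCP options bypassing VPC DNS" ;;
    SERVFAIL:*)       echo "resolver endpoint unreachable or forwarding loop" ;;
    NOERROR:*)        echo "record resolves; inspect the answer and TTL" ;;
    *)                echo "unrecognized combination" ;;
  esac
}

triage NXDOMAIN default   # -> custom DHCP options bypassing VPC DNS
```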

Why Does This Happen?

  • VPC not associated with the private hosted zone: A private hosted zone is only resolvable from VPCs explicitly associated with it. New VPCs, new accounts in an organization, or cross-account VPCs all require explicit association — creating the zone is not enough.
  • VPC DNS hostnames or DNS resolution disabled: Every VPC has two flags, enableDnsHostnames and enableDnsSupport. If either is off, instances can’t use the VPC resolver at the CIDR base plus two (e.g., 10.0.0.2 in a 10.0.0.0/16 VPC), and private hosted zones won’t resolve.
  • Custom DHCP options set pointing to external DNS: A DHCP options set with custom DNS servers overrides the VPC resolver. Instances then use the listed servers, which don’t know about private zones unless you’ve built Resolver rules to forward them.
  • Resolver endpoint rules missing or misrouted: In a hybrid setup, outbound Resolver endpoints forward specific domains to on-prem DNS. If the rule targets aren’t reachable or the rule domain is too broad (e.g., forwarding amazonaws.com), it breaks AWS service resolution.
  • Health check source IPs blocked: Route 53 health checkers query from a specific set of IP ranges. If a security group or WAF blocks those ranges, every check fails and the health check alarm fires even though the endpoint is healthy.
  • DNS answer TTL too long during migration: A record’s TTL was 1 hour when you changed it — clients keep hitting the old value for up to an hour, making the “DNS change” appear to have failed.
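The TTL pitfall in the last bullet is easy to quantify: a client that cached the record one second before your change can keep serving the old answer for one full TTL afterwards. A minimal sketch of the worst case:

```shell
# Worst-case expiry of the old answer: a resolver that cached it just
# before the change holds it until change time + TTL.
last_stale_at() {
  local change_epoch="$1" ttl_seconds="$2"
  echo $(( change_epoch + ttl_seconds ))
}

# A record with a 3600s TTL changed at epoch 1700000000 can still be
# served stale until:
last_stale_at 1700000000 3600   # -> 1700003600
```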

The Fix

Step 1: Confirm Resolution Path from the Failing Host

First, find out which DNS server the host is actually using. From Linux:

cat /etc/resolv.conf

The nameserver line should be the VPC’s .2 address (e.g., 10.0.0.2 in a 10.0.0.0/16 VPC). If it’s something else, a DHCP options set is overriding it. One caveat: on distros running systemd-resolved, resolv.conf points at the local stub 127.0.0.53, so run resolvectl status to see the real upstream.

Test resolution directly against the VPC resolver:

dig @10.0.0.2 internal.company.com

If that works but dig internal.company.com (default resolver) doesn’t, the DHCP options set is the problem.
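The .2 address is not literally .2 in every VPC — it’s the VPC’s CIDR base plus two. A small sketch for deriving it, assuming a plain IPv4 CIDR (the function name is mine):

```shell
# Compute the AmazonProvidedDNS ("VPC+2") address from a VPC CIDR.
# Assumes an IPv4 CIDR whose base's last octet doesn't overflow when
# incremented by two (true for any real VPC network address).
vpc_resolver_ip() {
  local base="${1%/*}"               # 10.0.0.0/16 -> 10.0.0.0
  local o1 o2 o3 o4
  IFS=. read -r o1 o2 o3 o4 <<< "$base"
  echo "${o1}.${o2}.${o3}.$((o4 + 2))"
}

vpc_resolver_ip 10.0.0.0/16     # -> 10.0.0.2
vpc_resolver_ip 172.31.0.0/16   # -> 172.31.0.2
```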

Step 2: Verify VPC DNS Attributes

Both flags must be true:

aws ec2 describe-vpc-attribute \
  --vpc-id vpc-0abc123def \
  --attribute enableDnsSupport \
  --query "EnableDnsSupport.Value"
aws ec2 describe-vpc-attribute \
  --vpc-id vpc-0abc123def \
  --attribute enableDnsHostnames \
  --query "EnableDnsHostnames.Value"

If either is false, enable it:

aws ec2 modify-vpc-attribute \
  --vpc-id vpc-0abc123def \
  --enable-dns-hostnames '{"Value":true}'
aws ec2 modify-vpc-attribute \
  --vpc-id vpc-0abc123def \
  --enable-dns-support '{"Value":true}'

Step 3: Verify Private Hosted Zone Association

List all VPCs currently associated with the zone:

aws route53 get-hosted-zone \
  --id Z1ABCDEF12345 \
  --query "VPCs[*].{VPC:VPCId,Region:VPCRegion}" \
  --output table

If the failing VPC is missing, associate it:

aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z1ABCDEF12345 \
  --vpc VPCRegion=us-east-1,VPCId=vpc-0abc123def

For cross-account association, the zone owner must first authorize the association:

aws route53 create-vpc-association-authorization \
  --hosted-zone-id Z1ABCDEF12345 \
  --vpc VPCRegion=us-east-1,VPCId=vpc-0abc123def

Then the VPC owner runs associate-vpc-with-hosted-zone from their own account. Once the association succeeds, the zone owner can clean up the pending authorization with delete-vpc-association-authorization.

Step 4: Check DHCP Options Set

aws ec2 describe-vpcs \
  --vpc-ids vpc-0abc123def \
  --query "Vpcs[0].DhcpOptionsId" \
  --output text
aws ec2 describe-dhcp-options \
  --dhcp-options-ids dopt-0abc12345 \
  --query "DhcpOptions[0].DhcpConfigurations" \
  --output json

If domain-name-servers is anything other than AmazonProvidedDNS, your VPC is using custom DNS. For private hosted zones to work, either revert to AmazonProvidedDNS, or have the custom DNS servers conditionally forward your private domains to a Route 53 Resolver inbound endpoint in the VPC.
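The check is easy to script given the DhcpConfigurations JSON from the command above. A sketch with the JSON inlined as sample data (in practice it would come from describe-dhcp-options):

```shell
# Returns success when the DhcpConfigurations JSON shows the VPC still
# using AmazonProvidedDNS.
uses_vpc_dns() {
  printf '%s' "$1" | grep -q '"Value": *"AmazonProvidedDNS"'
}

# Sample payloads, one custom-DNS and one default.
custom='[{"Key":"domain-name-servers","Values":[{"Value":"172.16.0.53"}]}]'
amazon='[{"Key":"domain-name-servers","Values":[{"Value":"AmazonProvidedDNS"}]}]'

uses_vpc_dns "$amazon" && echo "using VPC DNS"
uses_vpc_dns "$custom" || echo "custom DNS in use; private zones will not resolve here"
```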

Step 5: Fix Hybrid DNS via Resolver Rules

If you need custom forwarding (e.g., forwarding corp.local to on-prem), use Resolver rules instead of DHCP options:

aws route53resolver create-resolver-rule \
  --creator-request-id $(date +%s) \
  --name forward-corp-local \
  --rule-type FORWARD \
  --domain-name corp.local \
  --resolver-endpoint-id rslvr-out-abc12345 \
  --target-ips "Ip=10.1.1.53,Port=53" "Ip=10.1.2.53,Port=53"

Then associate the rule with the VPC:

aws route53resolver associate-resolver-rule \
  --resolver-rule-id rslvr-rr-abc12345 \
  --vpc-id vpc-0abc123def

This lets the VPC resolver handle everything else while still forwarding specific domains on-prem.

Step 6: Fix Health Check Failures

Check the actual health check status and failure reason:

aws route53 get-health-check-status \
  --health-check-id abc12345-def6-7890-ghij-klmnopqrstuv \
  --query "HealthCheckObservations[*].{Region:Region,Status:StatusReport.Status}" \
  --output table

If some regions succeed and others fail, the target responds differently by region (geo-blocking, per-region firewall rules, or latency past the check timeout). If all regions fail uniformly, check whether you’re blocking the Route 53 health checker IP ranges:

aws route53 get-checker-ip-ranges \
  --query "CheckerIpRanges" \
  --output text

Add these CIDR ranges to your security group or ALB’s allow list.
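To confirm from your access logs whether a given source IP really is a health checker, a pure-bash CIDR membership test is enough (the range below is a documentation placeholder — substitute the CIDRs returned by get-checker-ip-ranges):

```shell
# Pure-bash IPv4 CIDR membership test for classifying logged source IPs.
ip_to_int() {
  local o1 o2 o3 o4
  IFS=. read -r o1 o2 o3 o4 <<< "$1"
  echo $(( (o1 << 24) + (o2 << 16) + (o3 << 8) + o4 ))
}

in_cidr() {
  local ip="$1" base="${2%/*}" bits="${2#*/}"
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$base") & mask )) ]
}

# Placeholder range; use the real checker CIDRs from the command above.
in_cidr 192.0.2.40 192.0.2.0/26 && echo "health checker traffic"
```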

Step 7: Verify Resolution and Propagation

After changes, flush any local DNS cache (e.g., resolvectl flush-caches on systemd-resolved hosts) and resolve again:

dig +short internal.company.com

For TTL-related issues, check current TTL on the record:

aws route53 list-resource-record-sets \
  --hosted-zone-id Z1ABCDEF12345 \
  --query "ResourceRecordSets[?Name=='internal.company.com.'].{Name:Name,TTL:TTL,Type:Type}" \
  --output table

Reduce TTL before making planned record changes in the future:

aws route53 change-resource-record-sets \
  --hosted-zone-id Z1ABCDEF12345 \
  --change-batch file://change.json
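The change-batch file referenced above might look like this. The record name and value are placeholders; the point is the lowered TTL. Note that UPSERT replaces the whole record set, so the value must match what is currently live if you only intend to change the TTL:

```shell
# Write a change batch that drops the record's TTL to 60s ahead of a
# planned cutover.
cat > change.json <<'EOF'
{
  "Comment": "Lower TTL before planned migration",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "internal.company.com.",
      "Type": "A",
      "TTL": 60,
      "ResourceRecords": [{"Value": "10.0.1.25"}]
    }
  }]
}
EOF
```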

Is This Safe?

Mostly yes. All the describe and get commands are read-only. Associating a VPC with a private hosted zone is additive — it doesn’t remove existing associations. Enabling VPC DNS attributes is non-disruptive. The riskier changes are DHCP options modifications and Resolver rule associations, which take effect on DHCP lease renewal or immediately for new queries — these can surprise long-running connections that cached the old DNS. Always coordinate DHCP options set changes during a maintenance window if possible.

Key Takeaway

When DNS is broken in AWS, the actual root cause is almost never in Route 53 itself. It’s usually the VPC’s DHCP options pointing to external DNS, the VPC not being associated with the private hosted zone, or VPC DNS attributes disabled. Start by checking what resolver the host is actually using (resolv.conf or equivalent) and work outward from there. Use Route 53 Resolver rules instead of DHCP options whenever you need custom forwarding — they’re specific to domains and don’t break everything else.


Have questions or ran into a different Route 53 issue? Connect with me on LinkedIn or X.