I was on a production deployment call when our monitoring dashboard lit up with 502 errors spiking to 30% of all requests behind our Application Load Balancer. Users were seeing “502 Bad Gateway” pages, and within minutes the 502s started mixing with 504 Gateway Timeouts. The backend instances looked healthy in the console, CPU was under 40%, and the application logs showed no crashes. It took me two hours of digging through ALB access logs, target group health checks, and keep-alive configurations before I found the root cause: the backend was closing connections before the ALB expected it to. In this post, I’ll walk through exactly what causes these errors and how to fix them.

The Problem

Your Application Load Balancer is returning HTTP 502 Bad Gateway or 504 Gateway Timeout errors to clients, even though your backend instances appear to be running normally. The errors may be intermittent or sustained, and they often spike during deployments, scaling events, or periods of high traffic.

Here’s what you might see in ALB access logs:

h2 2026-03-15T14:22:31.445Z app/my-alb/a1b2c3d4e5f6g7h8 10.0.1.55:443 10.0.2.100:8080 -1 -1 -1 502 - 36 "GET https://api.example.com:443/v1/orders HTTP/2.0" "Mozilla/5.0" ...
h2 2026-03-15T14:22:45.891Z app/my-alb/a1b2c3d4e5f6g7h8 10.0.1.55:443 10.0.2.101:8080 0.001 60.001 -1 504 - 220 "POST https://api.example.com:443/v1/reports HTTP/2.0" "Mozilla/5.0" ...
How to read the status codes and log fields:

  • 502 Bad Gateway: the ALB received an invalid response from the target, or the target closed the connection before sending a response
  • 504 Gateway Timeout: the target did not respond within the ALB’s idle timeout period
  • target_processing_time = -1 in access logs: the ALB never received a response; the target closed the connection or was unreachable
  • elb_status_code = 502 with target_status_code = -: the target reset the connection before the ALB got any HTTP response

Why Does This Happen?

  • Target keep-alive timeout is shorter than the ALB idle timeout: This is the number one cause of intermittent 502s. The ALB reuses connections to targets. If your backend closes an idle connection at, say, 60 seconds but the ALB idle timeout is also 60 seconds, a race condition occurs. The ALB sends a request on a connection the target just closed. The fix is to set the backend keep-alive timeout higher than the ALB idle timeout.

  • Target health check is failing or the target is unhealthy: If a target fails health checks, the ALB removes it from rotation. But during the drain period, or if the health check interval is too long, requests can still land on a failing instance and get 502s back.

  • Security group or NACL blocking traffic between ALB and targets: The ALB’s security group allows inbound traffic from clients, but the target’s security group must allow inbound traffic from the ALB on the application port. If this rule is missing or too restrictive, the ALB can’t reach the target.

  • Target response takes longer than the ALB idle timeout: If your backend takes 90 seconds to process a request but the ALB idle timeout is set to 60 seconds (the default), the ALB gives up waiting and returns a 504. This is common with report generation, file processing, or long-running API calls.

  • Targets are overloaded or crashing: If your application is running out of memory, hitting thread pool limits, or crashing under load, it may close connections abruptly or fail to respond. The ALB sees a broken connection and returns 502.

  • Misconfigured health check path or port: The health check might be hitting a path that returns 200, but the actual application port is unresponsive. Or the health check uses a different port than the application traffic, masking real issues.

  • SSL/TLS handshake failure between ALB and target: If your target group uses HTTPS and the backend certificate is expired, self-signed without proper trust, or using an incompatible TLS version, the ALB can’t establish a connection and returns 502.
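The first cause is worth checking mechanically whenever you touch either timeout. Here is a minimal sketch of the invariant (the 5-second margin is a common rule of thumb, not an AWS requirement):

```python
def keepalive_is_safe(alb_idle_timeout_s: int, backend_keepalive_s: int,
                      margin_s: int = 5) -> bool:
    """True if the backend holds idle connections open longer than the ALB
    will try to reuse them. Equal values leave a race window in which the
    ALB sends a request on a socket the backend just closed, producing
    intermittent 502s."""
    return backend_keepalive_s >= alb_idle_timeout_s + margin_s

print(keepalive_is_safe(60, 60))  # False: both sides time out together
print(keepalive_is_safe(60, 65))  # True: backend outlives the ALB by 5s
```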

The Fix

1. Check Target Group Health

Start by checking whether your targets are actually healthy:

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/a1b2c3d4e5f6g7h8 \
  --output table

If targets show unhealthy, check the reason:

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/a1b2c3d4e5f6g7h8 \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State,TargetHealth.Reason,TargetHealth.Description]' \
  --output table

Common reasons include Target.ResponseCodeMismatch, Target.Timeout, and Target.FailedHealthChecks.
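If you want to script this check across many target groups, a small Python sketch can flag the unhealthy targets. The JSON below is a made-up sample in the shape of the describe-target-health response; the instance IDs are illustrative:

```python
import json

# Sample payload in the shape returned by `aws elbv2 describe-target-health`
# (target IDs and reasons here are illustrative, not real output).
payload = json.loads("""
{
  "TargetHealthDescriptions": [
    {"Target": {"Id": "i-0abc", "Port": 8080},
     "TargetHealth": {"State": "healthy"}},
    {"Target": {"Id": "i-0def", "Port": 8080},
     "TargetHealth": {"State": "unhealthy",
                      "Reason": "Target.Timeout",
                      "Description": "Request timed out"}}
  ]
}
""")

def unhealthy_targets(doc):
    """Return (id, reason) pairs for every target not in the 'healthy' state."""
    return [
        (d["Target"]["Id"], d["TargetHealth"].get("Reason", "unknown"))
        for d in doc["TargetHealthDescriptions"]
        if d["TargetHealth"]["State"] != "healthy"
    ]

print(unhealthy_targets(payload))  # [('i-0def', 'Target.Timeout')]
```

Pipe the real CLI output (without --output table) into a script like this to get a quick summary during an incident.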

2. Review ALB Access Logs for Patterns

Enable access logs if they aren’t already:

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/a1b2c3d4e5f6g7h8 \
  --attributes Key=access_logs.s3.enabled,Value=true \
               Key=access_logs.s3.bucket,Value=my-alb-logs-bucket \
               Key=access_logs.s3.prefix,Value=alb-logs

Download and inspect the logs. Look at target_processing_time and elb_status_code:

# Pull down the logs, then filter for 502/504 responses (field 9 is elb_status_code)
aws s3 cp s3://my-alb-logs-bucket/alb-logs/ /tmp/alb-logs/ --recursive

zcat /tmp/alb-logs/*.gz | awk '$9 == 502 || $9 == 504 {print $0}' | head -20

If target_processing_time is -1, the target never responded. If it sits at or just above the idle timeout value (60.001 in the second sample above), you have a 504 timeout issue.
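When awk gets unwieldy, the same triage works in a few lines of Python. This is a sketch: the field positions follow the ALB access-log format (target_processing_time is field 7, elb_status_code field 9, 1-based), and the lines are the abbreviated samples from above:

```python
import shlex

# Abbreviated ALB access-log lines in the same shape as the samples above.
lines = [
    'h2 2026-03-15T14:22:31.445Z app/my-alb/a1b2c3d4e5f6g7h8 10.0.1.55:443 '
    '10.0.2.100:8080 -1 -1 -1 502 - 36 '
    '"GET https://api.example.com:443/v1/orders HTTP/2.0" "Mozilla/5.0"',
    'h2 2026-03-15T14:22:45.891Z app/my-alb/a1b2c3d4e5f6g7h8 10.0.1.55:443 '
    '10.0.2.101:8080 0.001 60.001 -1 504 - 220 '
    '"POST https://api.example.com:443/v1/reports HTTP/2.0" "Mozilla/5.0"',
]

def classify(line):
    """Split one log line (shlex handles the quoted request/user-agent fields)
    and guess the failure mode from target_processing_time and elb_status_code."""
    fields = shlex.split(line)
    target_time, status = fields[6], fields[8]
    if status == "502" and target_time == "-1":
        return "502: target closed the connection or never responded"
    if status == "504":
        return "504: target exceeded the ALB idle timeout"
    return f"{status}: inspect the full log line"

for line in lines:
    print(classify(line))
```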

3. Fix the Keep-Alive Race Condition (Most Common 502 Fix)

Check the ALB idle timeout:

aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/a1b2c3d4e5f6g7h8 \
  --query 'Attributes[?Key==`idle_timeout.timeout_seconds`].Value' \
  --output text

The default is 60 seconds. Your backend’s keep-alive timeout must be higher than this value. For common web servers:

Nginx — edit /etc/nginx/nginx.conf:

keepalive_timeout 65;  # Must be higher than ALB idle timeout (default 60)

Apache — edit /etc/httpd/conf/httpd.conf:

KeepAliveTimeout 65

Node.js (Express):

const server = app.listen(8080);
server.keepAliveTimeout = 65000;  // milliseconds, must exceed ALB idle timeout
server.headersTimeout = 66000;    // must be higher than keepAliveTimeout

Spring Boot — in application.properties:

server.tomcat.keep-alive-timeout=65000

If you’d rather lower the ALB idle timeout instead:

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/a1b2c3d4e5f6g7h8 \
  --attributes Key=idle_timeout.timeout_seconds,Value=30

4. Fix Security Groups

The ALB’s security group and the target’s security group must allow traffic between them. Check the target’s inbound rules:

aws ec2 describe-security-groups \
  --group-ids sg-0a1b2c3d4e5f6g7h8 \
  --query 'SecurityGroups[0].IpPermissions[*].[FromPort,ToPort,IpProtocol,UserIdGroupPairs[*].GroupId,IpRanges[*].CidrIp]' \
  --output table

The target security group must allow inbound traffic on the application port from the ALB’s security group:

aws ec2 authorize-security-group-ingress \
  --group-id sg-0a1b2c3d4e5f6g7h8 \
  --protocol tcp \
  --port 8080 \
  --source-group sg-0alb-security-group-id
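To confirm the target port is actually reachable (independent of health checks), a plain TCP probe run from an instance in the ALB's subnets settles it. This stdlib sketch uses a placeholder host and port; a security group or NACL that silently drops traffic shows up as a timeout rather than a refusal:

```python
import socket

def port_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Attempt a TCP connect. A blocked security group or NACL typically
    surfaces as a timeout; a closed port as an immediate refusal."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# port_reachable("10.0.2.100", 8080)  # placeholder target; run from the ALB's subnets
```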

5. Increase ALB Idle Timeout for Slow Endpoints (504 Fix)

If you have endpoints that legitimately take longer than 60 seconds, increase the ALB idle timeout:

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/a1b2c3d4e5f6g7h8 \
  --attributes Key=idle_timeout.timeout_seconds,Value=120

The maximum is 4000 seconds. Remember to also update your backend keep-alive timeout to be higher than this new value.

6. Verify Health Check Configuration

Check the health check settings:

aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/a1b2c3d4e5f6g7h8 \
  --query 'TargetGroups[0].[HealthCheckProtocol,HealthCheckPort,HealthCheckPath,HealthCheckIntervalSeconds,HealthCheckTimeoutSeconds,HealthyThresholdCount,UnhealthyThresholdCount,Matcher.HttpCode]' \
  --output table

Update the health check if it’s misconfigured:

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/a1b2c3d4e5f6g7h8 \
  --health-check-protocol HTTP \
  --health-check-port 8080 \
  --health-check-path /health \
  --health-check-interval-seconds 15 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200
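For the /health path to be meaningful, it should run on the same port and in the same process as real traffic, and it should exercise real dependencies. A minimal stdlib sketch (app_is_healthy is a placeholder for your own checks):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def app_is_healthy() -> bool:
    # Placeholder: check DB connectivity, thread pools, queue depth, etc.
    return True

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            ok = app_is_healthy()
            body = json.dumps({"status": "ok" if ok else "degraded"}).encode()
            self.send_response(200 if ok else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep health-check noise out of the access log

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 picks a free port for demo
# server.serve_forever()  # blocks; in production, serve on the application port
```

Returning 503 when a dependency is down lets the ALB drain the instance before users see 502s.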

7. Check CloudWatch Metrics for Confirmation

Pull the error metrics to confirm the issue and verify your fix:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_502_Count \
  --dimensions Name=LoadBalancer,Value=app/my-alb/a1b2c3d4e5f6g7h8 \
  --start-time 2026-03-15T12:00:00Z \
  --end-time 2026-03-15T16:00:00Z \
  --period 300 \
  --statistics Sum \
  --output table

aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_504_Count \
  --dimensions Name=LoadBalancer,Value=app/my-alb/a1b2c3d4e5f6g7h8 \
  --start-time 2026-03-15T12:00:00Z \
  --end-time 2026-03-15T16:00:00Z \
  --period 300 \
  --statistics Sum \
  --output table

Also check target response time to spot slow backends. Averages hide tail latency, so pull p99 as well; note that CloudWatch requires percentile statistics to be requested via --extended-statistics, and GetMetricStatistics accepts either --statistics or --extended-statistics in one call, not both:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name TargetResponseTime \
  --dimensions Name=LoadBalancer,Value=app/my-alb/a1b2c3d4e5f6g7h8 \
  --start-time 2026-03-15T12:00:00Z \
  --end-time 2026-03-15T16:00:00Z \
  --period 300 \
  --extended-statistics p99 \
  --output table

Is This Safe?

These changes are low risk. Adjusting the ALB idle timeout, fixing keep-alive values, and updating security group rules are standard operational changes. Lowering the idle timeout may cause 504s on legitimately slow requests, so check your application’s response time percentiles before reducing it. Increasing the keep-alive timeout on your web server is generally safe; it just holds idle connections open slightly longer, at the cost of a small amount of memory and file descriptors per connection.
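Before lowering the idle timeout, it helps to know where your slow requests actually sit. A quick nearest-rank percentile check over a sample of response times (the sample data below is illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value at or below which
    p% of the samples fall."""
    s = sorted(samples)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[max(0, k)]

# Illustrative response times in seconds
times = [0.2, 0.4, 0.5, 0.8, 1.1, 2.0, 4.5, 12.0, 28.0, 55.0]
print(percentile(times, 99))  # if this approaches the idle timeout, don't lower it
```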

Key Takeaway

The most common cause of ALB 502 errors is a keep-alive race condition: the backend closes the connection at the exact moment the ALB tries to reuse it. Set your backend’s keep-alive timeout to at least 5 seconds higher than the ALB idle timeout. For 504 errors, the fix is almost always increasing the ALB idle timeout to match your slowest endpoint’s response time. Start your investigation with describe-target-health and ALB access logs — they tell you exactly what’s happening on every request.


Have questions, or did you run into a different issue? Connect with me on LinkedIn or X.