A customer-facing API I was asked to review was returning intermittent 502 Bad Gateway errors on around 3% of traffic. The Lambda backend looked healthy: CloudWatch showed successful invocations matching the expected volume, and X-Ray traces showed the function completing in 200ms on average. The 502s were real, but they weren't coming from the backend at all: API Gateway was rejecting malformed responses from the Lambda integration. The culprit was a silent bug where the function returned a plain string instead of the Lambda proxy integration format API Gateway expects.

If your API Gateway is returning 5xx errors and the backend looks fine, here’s how to find what’s actually happening.

The Problem

API Gateway returns 5xx errors (or authorizer-driven 403s) even when the backend integration appears healthy:

  • 502 Bad Gateway: integration response is malformed (Lambda proxy format mismatch)
  • 503 Service Unavailable: integration endpoint is unreachable or the account's concurrency is exhausted
  • 504 Gateway Timeout: integration took longer than API Gateway's 29-second hard limit
  • 500 Internal Server Error: generic failure, usually a mapping template or authorizer issue
  • 403 Forbidden (from authorizer): the custom Lambda authorizer denied the request or failed

API Gateway's execution logs contain the authoritative error, but they're disabled by default in most deployments, so teams fly blind.

Why Does This Happen?

  • Lambda proxy integration format violation: When the integration type is AWS_PROXY, the Lambda must return an object shaped like {"statusCode": 200, "body": "..."}. Returning anything else, such as a plain string, null, or an object missing statusCode, causes API Gateway to reject the response with 502.
  • 29-second integration timeout: API Gateway’s maximum integration timeout is 29 seconds. Any backend that takes longer gets its response discarded and the client receives 504. Increasing the Lambda’s timeout beyond 29 seconds has no effect.
  • Lambda concurrency limit reached: If reserved concurrency is set on the function, API Gateway receives throttled responses once it's exhausted and returns 503 to clients. Throttled requests appear in the Throttles metric but never reach the function's logs, because the function was never invoked.
  • Custom authorizer returns unexpected shape: A custom authorizer must return a specific IAM policy document shape. Bugs that return {"allow": true} or similar guesses cause API Gateway to reject the request rather than allow it.
  • VPC Link target unhealthy: For HTTP_PROXY integrations pointing to a private ALB/NLB via VPC Link, the NLB’s target group health is critical. If targets are unhealthy, the integration silently fails with 503.
  • Resource policy or WAF blocking request: A resource policy restricting source IPs or a WAF rule can deny requests before they reach the integration, returning 403 rather than the integration’s actual response.
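Several of these failure modes come down to response shape. As a quick sanity check before digging into logs, here is a small validator you can run against whatever your handler returns. This is a sketch (the function name is mine, not an AWS tool) that mirrors the proxy integration contract described above:

```python
def validate_proxy_response(resp):
    """Return a list of problems that would make API Gateway answer 502.

    Mirrors the Lambda proxy integration contract: a dict with an integer
    statusCode, and (if present) a string body and a dict of headers.
    """
    if not isinstance(resp, dict):
        # a plain string or None is the classic silent 502
        return ["response is %s, expected a dict" % type(resp).__name__]
    problems = []
    if not isinstance(resp.get("statusCode"), int):
        problems.append("missing or non-integer statusCode")
    if "body" in resp and not isinstance(resp["body"], str):
        problems.append("body must be a string (use json.dumps)")
    if "headers" in resp and not isinstance(resp["headers"], dict):
        problems.append("headers must be an object")
    return problems
```

Run it in a unit test against your handler's return value; an empty list means the shape is acceptable to the gateway.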

The Fix

Step 1: Enable Execution Logging

This is the single highest-value debugging step. Turn on full request/response logging for the stage:

aws apigateway update-stage \
  --rest-api-id abc12345 \
  --stage-name prod \
  --patch-operations \
    op=replace,path=/*/*/logging/loglevel,value=INFO \
    op=replace,path=/*/*/logging/dataTrace,value=true \
    op=replace,path=/*/*/metrics/enabled,value=true

Make sure API Gateway has permission to write CloudWatch Logs (via the AmazonAPIGatewayPushToCloudWatchLogs managed policy on the API Gateway CloudWatch role).

Then tail the execution logs:

aws logs tail "API-Gateway-Execution-Logs_abc12345/prod" --follow

The real error will be in these logs — including the full integration response that triggered the 502.

Step 2: Fix 502 Bad Gateway (Lambda Format)

Test the Lambda directly to see what it returns:

aws lambda invoke \
  --function-name my-api-handler \
  --payload '{"httpMethod":"GET","path":"/items"}' \
  --cli-binary-format raw-in-base64-out \
  response.json && cat response.json

The response must have this shape for Lambda proxy integration:

{
  "statusCode": 200,
  "headers": {
    "Content-Type": "application/json"
  },
  "body": "{\"items\":[]}",
  "isBase64Encoded": false
}

Note that body must be a string, not an object. This is the number one cause of 502s.
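A minimal handler that always satisfies this contract looks like the following (an illustrative sketch; the items lookup is a placeholder for your real backend logic):

```python
import json

def handler(event, context):
    items = []  # stand-in for whatever your backend actually looks up
    # body must be a JSON *string*; returning the dict itself triggers a 502
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"items": items}),
        "isBase64Encoded": False,
    }
```

The json.dumps call is the whole fix for the bug described in the opening anecdote: serializing the payload yourself guarantees body is a string.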

Step 3: Fix 504 Gateway Timeout

Check the integration latency:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name IntegrationLatency \
  --dimensions Name=ApiName,Value=my-api Name=Stage,Value=prod \
  --start-time 2026-04-21T00:00:00Z \
  --end-time 2026-04-22T23:59:59Z \
  --period 300 \
  --statistics Maximum,Average \
  --output table

If Maximum approaches 29000 ms (29 seconds), you’re hitting the hard limit. Options:

  1. Convert to async: Return 202 Accepted immediately, process in the background, and have clients poll for results.
  2. Optimize the backend: Profile the Lambda to find and remove the slow path.
  3. Move to HTTP API: HTTP APIs allow up to 30 seconds, but the extra second rarely helps.
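Option 1 can be sketched as a handler that hands work to a queue and answers immediately. To keep the sketch self-contained the queue send is injected as a callable; in production it would be something like boto3's sqs.send_message, and the job-polling route named in the comment is an assumption about your API design:

```python
import json
import uuid

def make_async_handler(enqueue):
    """Build a proxy-integration handler that accepts work and returns 202.

    `enqueue` is any callable taking a message string; in production,
    e.g. lambda m: sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=m).
    """
    def handler(event, context):
        job_id = str(uuid.uuid4())
        # hand the slow work off; the 29-second clock never comes into play
        enqueue(json.dumps({"jobId": job_id, "request": event.get("body")}))
        return {
            "statusCode": 202,
            "headers": {"Content-Type": "application/json"},
            # client polls e.g. GET /jobs/{jobId} for the eventual result
            "body": json.dumps({"jobId": job_id, "status": "accepted"}),
        }
    return handler
```

Injecting the queue dependency also makes the handler trivially unit-testable with a plain list standing in for SQS.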

For async operations on a non-proxy (AWS) Lambda integration, one approach is to invoke the function asynchronously by setting the X-Amz-Invocation-Type header to Event, so API Gateway gets an immediate response you can map to 202:

aws apigateway update-integration \
  --rest-api-id abc12345 \
  --resource-id xyz9876 \
  --http-method POST \
  --patch-operations "op=add,path=/requestParameters/integration.request.header.X-Amz-Invocation-Type,value='Event'"

With a proxy integration this option isn't available; instead, return 202 from the function itself after handing the work to a queue.

Step 4: Fix 503 Service Unavailable

Check if it’s concurrency-related:

aws lambda get-function-concurrency \
  --function-name my-api-handler

If reserved concurrency is the bottleneck and the traffic is legitimate, remove the reservation or raise it:

aws lambda put-function-concurrency \
  --function-name my-api-handler \
  --reserved-concurrent-executions 500

For VPC Link integrations, verify NLB target health:

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-api/abc123 \
  --query "TargetHealthDescriptions[*].{Target:Target.Id,State:TargetHealth.State,Reason:TargetHealth.Reason}" \
  --output table

If all targets show unhealthy, the NLB health check configuration or backend needs fixing before API Gateway will route successfully.

Step 5: Fix Authorizer Failures

Check whether the authorizer itself is failing or just denying:

aws logs filter-log-events \
  --log-group-name "/aws/lambda/my-authorizer" \
  --start-time $(date -d "1 hour ago" +%s)000 \
  --filter-pattern "ERROR"

Authorizers must return a valid IAM policy document:

{
  "principalId": "user123",
  "policyDocument": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": "execute-api:Invoke",
        "Effect": "Allow",
        "Resource": "arn:aws:execute-api:us-east-1:123456789012:abc12345/prod/*/*"
      }
    ]
  }
}

If the authorizer returns anything else, the request fails: a malformed response typically surfaces to the client as a 500, while a well-formed Deny policy yields 403.
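Rather than hand-building that document in every code path, authorizers commonly use a small helper. This is a sketch (helper name and the token check are mine; real token validation is elided):

```python
def make_policy(principal_id, effect, method_arn):
    """Build the IAM policy document API Gateway expects from a Lambda authorizer."""
    assert effect in ("Allow", "Deny")
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": method_arn,
                }
            ],
        },
    }

def authorizer(event, context):
    # real token validation elided; note that an uncaught exception here
    # surfaces to the client as a 500, not a 403
    token = (event.get("authorizationToken") or "").removeprefix("Bearer ")
    effect = "Allow" if token else "Deny"
    return make_policy("user123", effect, event["methodArn"])
```

Funneling every return through make_policy means there is no code path that can emit an invalid shape.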

Step 6: Verify the Fix Under Load

Replay recent traffic or use a load generator. Monitor the 5xx error rate:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name 5XXError \
  --dimensions Name=ApiName,Value=my-api Name=Stage,Value=prod \
  --start-time 2026-04-22T00:00:00Z \
  --end-time 2026-04-22T23:59:59Z \
  --period 60 \
  --statistics Sum \
  --output table

The error rate should drop to zero within a few minutes of the deployment.
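If you pull the same datapoints with --output json instead of a table, a few lines of Python turn them into a single number you can assert on in a post-deploy check. A sketch, assuming the dict shape that get-metric-statistics returns:

```python
def summarize_5xx(metric_json):
    """Summarize 5XXError datapoints from a get-metric-statistics response dict."""
    points = metric_json.get("Datapoints", [])
    total = sum(dp.get("Sum", 0) for dp in points)
    worst = max(points, key=lambda dp: dp.get("Sum", 0), default=None)
    return {
        "total_5xx": total,
        "worst_minute": worst and worst.get("Timestamp"),
        "clean": total == 0,  # the goal after the fix: zero 5xx in the window
    }
```

Wiring this into a deployment pipeline makes "the 502s are gone" a checked condition rather than a glance at a dashboard.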

Is This Safe?

Mostly yes. Enabling execution logging is safe but generates CloudWatch Logs charges proportional to request volume — monitor costs and consider lowering the log level once debugging is done. Fixing the Lambda proxy format requires a deployment, which is safe for stateless APIs but should go through your normal release process. Modifying Lambda concurrency takes effect immediately and can impact other functions in the same account if you set it too high and exhaust the account’s unreserved pool.

Key Takeaway

API Gateway 5xx errors are almost never about the gateway itself — they’re about the backend’s response shape, latency, or availability. The fastest way to diagnose them is to enable execution logging and read what the gateway actually saw, rather than guessing from the client-facing error. Remember the 29-second ceiling: if your backend can take longer than that, API Gateway is the wrong fit, and you need an async pattern. And always test the Lambda’s response format directly — a 502 is usually a missing statusCode field, not a platform problem.


Have questions or ran into a different API Gateway issue? Connect with me on LinkedIn or X.