A customer-facing API I was asked to review was returning intermittent 502 Bad Gateway errors on around 3% of traffic. The Lambda backend looked healthy: CloudWatch showed successful invocations matching the expected volume, and X-Ray traces showed the function completing in 200ms on average. The 502s were real, but they weren't coming from the backend logic at all. API Gateway was rejecting the integration's responses as malformed, thanks to a silent bug where the function returned a plain string instead of the object shape that Lambda proxy integration requires.
If your API Gateway is returning 5xx errors and the backend looks fine, here’s how to find what’s actually happening.
## The Problem
API Gateway returns 5xx errors even when the backend integration appears healthy:
| Error | What It Means |
|---|---|
| 502 Bad Gateway | Integration response is malformed (Lambda proxy format mismatch) |
| 503 Service Unavailable | Integration endpoint is unreachable or the account's concurrency is exhausted |
| 504 Gateway Timeout | Integration took longer than API Gateway's 29-second hard limit |
| 500 Internal Server Error | Generic failure, usually a mapping template or authorizer issue |
| 403 Forbidden (from authorizer) | Custom Lambda authorizer returned a deny or failed |
API Gateway’s Execution Log has the authoritative error, but it’s disabled by default in most deployments — so teams fly blind.
## Why Does This Happen?
- Lambda proxy integration format violation: When the integration type is `AWS_PROXY`, the Lambda must return an object shaped like `{"statusCode": 200, "body": "..."}`. Returning anything else (a plain string, a null, or a missing `statusCode`) causes API Gateway to reject the response with a 502.
- 29-second integration timeout: API Gateway's maximum integration timeout is 29 seconds. Any backend that takes longer gets its response discarded, and the client receives a 504. Increasing the Lambda's timeout beyond 29 seconds has no effect.
- Lambda concurrency limit reached: If reserved concurrency is set on the function, once it's exhausted API Gateway gets throttled responses and returns 503 to clients. The invocations show as throttled in CloudWatch metrics but never appear in the function's invocation logs.
- Custom authorizer returns an unexpected shape: A custom authorizer must return a specific IAM policy document shape. Bugs that return `{"allow": true}` or similar guesses cause the authorizer result to be treated as a deny.
- VPC Link target unhealthy: For `HTTP_PROXY` integrations pointing to a private ALB/NLB via VPC Link, the NLB's target group health is critical. If targets are unhealthy, the integration silently fails with a 503.
- Resource policy or WAF blocking the request: A resource policy restricting source IPs or a WAF rule can deny requests before they reach the integration, returning a 403 rather than the integration's actual response.
## The Fix
### Step 1: Enable Execution Logging
This is the single highest-value debugging step. Turn on full request/response logging for the stage:

```shell
aws apigateway update-stage \
  --rest-api-id abc12345 \
  --stage-name prod \
  --patch-operations \
    op=replace,path=/*/*/logging/loglevel,value=INFO \
    op=replace,path=/*/*/logging/dataTrace,value=true \
    op=replace,path=/*/*/metrics/enabled,value=true
```
Make sure API Gateway can write to CloudWatch Logs: the account-level CloudWatch role (the one set via `aws apigateway update-account`) needs the `AmazonAPIGatewayPushToCloudWatchLogs` managed policy attached.
Then tail the execution logs:

```shell
aws logs tail "API-Gateway-Execution-Logs_abc12345/prod" --follow
```
The real error will be in these logs — including the full integration response that triggered the 502.
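If you'd rather pull those failure lines programmatically, a small filter helps. This is a sketch: the marker strings are messages API Gateway commonly logs on integration failures, and the commented boto3 fetch assumes the log-group name shown above.

```python
# Sketch: scan API Gateway execution-log messages for the lines that
# actually explain a 5xx. The marker strings below are ones API Gateway
# commonly emits; adjust to match what your execution logs show.
from typing import List

FAILURE_MARKERS = (
    "Execution failed due to configuration error",
    "Malformed Lambda proxy response",
    "Endpoint request timed out",
    "Lambda invocation failed",
)

def find_gateway_errors(messages: List[str]) -> List[str]:
    """Return only the log lines that explain an integration failure."""
    return [m for m in messages if any(marker in m for marker in FAILURE_MARKERS)]

# Feeding it real data (log-group name is an assumption):
# import boto3
# logs = boto3.client("logs")
# events = logs.filter_log_events(
#     logGroupName="API-Gateway-Execution-Logs_abc12345/prod")
# print(find_gateway_errors([e["message"] for e in events["events"]]))
```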
### Step 2: Fix 502 Bad Gateway (Lambda Format)
Test the Lambda directly to see what it returns:

```shell
aws lambda invoke \
  --function-name my-api-handler \
  --payload '{"httpMethod":"GET","path":"/items"}' \
  --cli-binary-format raw-in-base64-out \
  response.json && cat response.json
```
The response must have this shape for Lambda proxy integration:

```json
{
  "statusCode": 200,
  "headers": {
    "Content-Type": "application/json"
  },
  "body": "{\"items\":[]}",
  "isBase64Encoded": false
}
```

Note that `body` must be a string, not an object. This is the number one cause of 502s.
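As a sketch, a Python handler that always satisfies this contract (the handler name and empty item list are placeholders) looks like:

```python
# Minimal handler that always returns a valid Lambda proxy response:
# statusCode is always present and body is always a JSON *string*.
import json

def lambda_handler(event, context):
    try:
        items = []  # stand-in for your real lookup
        payload = {"items": items}
        status = 200
    except Exception as exc:  # never let an error escape as a bare string
        payload = {"error": str(exc)}
        status = 500
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),  # string, not a dict
        "isBase64Encoded": False,
    }
```

Wrapping the whole body in a try/except matters: an unhandled exception (or a code path that returns a raw value) is exactly what produces the malformed response API Gateway rejects.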
### Step 3: Fix 504 Gateway Timeout
Check the integration latency:

```shell
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name IntegrationLatency \
  --dimensions Name=ApiName,Value=my-api Name=Stage,Value=prod \
  --start-time 2026-04-21T00:00:00Z \
  --end-time 2026-04-22T23:59:59Z \
  --period 300 \
  --statistics Maximum,Average \
  --output table
```
If Maximum approaches 29000 ms (29 seconds), you’re hitting the hard limit. Options:
- Convert to async: Return 202 Accepted immediately, process in the background, and have clients poll for results.
- Optimize the backend: Profile the Lambda to find and remove the slow path.
- Move to HTTP API: HTTP APIs allow up to 30 seconds, but the extra second rarely helps.
For async operations with a non-proxy `AWS` integration, you can have API Gateway invoke the Lambda asynchronously by mapping the `X-Amz-Invocation-Type` header on the integration request; the event is queued immediately instead of waiting for the function to finish:

```shell
aws apigateway update-integration \
  --rest-api-id abc12345 \
  --resource-id xyz9876 \
  --http-method POST \
  --patch-operations "op=add,path=/requestParameters/integration.request.header.X-Amz-Invocation-Type,value='Event'"
```
### Step 4: Fix 503 Service Unavailable
Check if it's concurrency-related:

```shell
aws lambda get-function-concurrency \
  --function-name my-api-handler
```
If reserved concurrency is the bottleneck and the traffic is legitimate, raise the limit or remove the reservation entirely:
```shell
aws lambda put-function-concurrency \
  --function-name my-api-handler \
  --reserved-concurrent-executions 500
```
For VPC Link integrations, verify NLB target health:

```shell
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-api/abc123 \
  --query "TargetHealthDescriptions[*].{Target:Target.Id,State:TargetHealth.State,Reason:TargetHealth.Reason}" \
  --output table
```
If all targets show unhealthy, the NLB health check configuration or backend needs fixing before API Gateway will route successfully.
### Step 5: Fix Authorizer Failures
Check whether the authorizer itself is failing or just denying:

```shell
aws logs filter-log-events \
  --log-group-name "/aws/lambda/my-authorizer" \
  --start-time $(date -d "1 hour ago" +%s)000 \
  --filter-pattern "ERROR"
```
Authorizers must return a valid IAM policy document:

```json
{
  "principalId": "user123",
  "policyDocument": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": "execute-api:Invoke",
        "Effect": "Allow",
        "Resource": "arn:aws:execute-api:us-east-1:123456789012:abc12345/prod/*/*"
      }
    ]
  }
}
```
If the authorizer returns anything else, API Gateway treats it as deny.
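A minimal Python authorizer that always emits the shape above might look like this sketch; the token comparison and the `user123` principal are placeholders for your real validation logic.

```python
# Sketch: build the exact policy shape API Gateway expects from a
# Lambda authorizer. Anything looser (e.g. {"allow": true}) is treated
# as a deny.
def generate_policy(principal_id, effect, method_arn):
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,        # "Allow" or "Deny"
                "Resource": method_arn,  # arrives in event["methodArn"]
            }],
        },
    }

def lambda_handler(event, context):
    token = event.get("authorizationToken", "")
    effect = "Allow" if token == "valid-token" else "Deny"  # stand-in check
    return generate_policy("user123", effect, event["methodArn"])
```

Centralizing the shape in one helper means a bug in your validation logic can only flip Allow/Deny, never break the policy document itself.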
### Step 6: Verify the Fix Under Load
Replay recent traffic or use a load generator. Monitor the 5xx error rate:

```shell
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApiGateway \
  --metric-name 5XXError \
  --dimensions Name=ApiName,Value=my-api Name=Stage,Value=prod \
  --start-time 2026-04-22T00:00:00Z \
  --end-time 2026-04-22T23:59:59Z \
  --period 60 \
  --statistics Sum \
  --output table
```
The error rate should drop to zero within a few minutes of the deployment.
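If you don't have a load generator handy, a quick probe like this sketch gives an immediate read on the 5xx rate; the stage URL in the comment is a placeholder.

```python
# Sketch: fire N GET requests at the stage URL and compute the share of
# 5xx responses. The rate calculation is split out so it can be reused
# against any list of status codes.
from urllib import request, error

def five_xx_rate(status_codes):
    """Fraction of responses in the 500-599 range."""
    if not status_codes:
        return 0.0
    return sum(1 for c in status_codes if 500 <= c <= 599) / len(status_codes)

def probe(url, n=50):
    codes = []
    for _ in range(n):
        try:
            with request.urlopen(url) as resp:
                codes.append(resp.status)
        except error.HTTPError as exc:  # 4xx/5xx raise HTTPError
            codes.append(exc.code)
    return five_xx_rate(codes)

# Placeholder stage URL:
# print(probe("https://abc12345.execute-api.us-east-1.amazonaws.com/prod/items"))
```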
## Is This Safe?
Mostly yes. Enabling execution logging is safe but generates CloudWatch Logs charges proportional to request volume — monitor costs and consider lowering the log level once debugging is done. Fixing the Lambda proxy format requires a deployment, which is safe for stateless APIs but should go through your normal release process. Modifying Lambda concurrency takes effect immediately and can impact other functions in the same account if you set it too high and exhaust the account’s unreserved pool.
## Key Takeaway
API Gateway 5xx errors are almost never about the gateway itself — they’re about the backend’s response shape, latency, or availability. The fastest way to diagnose them is to enable execution logging and read what the gateway actually saw, rather than guessing from the client-facing error. Remember the 29-second ceiling: if your backend can take longer than that, API Gateway is the wrong fit, and you need an async pattern. And always test the Lambda’s response format directly — a 502 is usually a missing `statusCode` field, not a platform problem.
Have questions or ran into a different API Gateway issue? Connect with me on LinkedIn or X.