I was on a call with a team whose application had gone down during peak hours. Their RDS instance was returning Connection refused to every client. The dashboard showed the instance as Available, which made it even more confusing — the database was running but refusing all connections. Turns out it had quietly run out of storage and entered a read-only state, and the application’s error handling was masking the real error.
Here’s how to diagnose and fix the most common RDS failures.
The Problem
Applications lose connectivity to RDS or encounter degraded performance:
| Error | What It Means |
|---|---|
| Connection refused (port 3306/5432) | Security group, network ACL, or the instance is not accepting connections |
| Could not connect to server: Connection timed out | Routing or DNS issue between client and RDS |
| ERROR: cannot execute INSERT in a read-only transaction | Instance is in read-only mode due to storage full |
| Replication lag: XXX seconds | Read replica is falling behind the primary |
| too many connections | Connection count exceeds the max_connections parameter |
The RDS console may show Available even when the database is functionally broken, which makes these issues harder to catch without proper monitoring.
Why Does This Happen?
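- Security groups not allowing inbound traffic: RDS instances need their security group to explicitly allow inbound traffic on the database port (3306 for MySQL, 5432 for PostgreSQL) from the application’s security group or CIDR range. This is the most common cause of connection failures on new setups. Because security groups drop packets silently, a blocked port often surfaces as a timeout rather than an explicit Connection refused, so check the security group for either symptom.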
- Storage auto-scaling disabled or maxed out: When an RDS instance runs out of allocated storage, it transitions to a read-only state to prevent data corruption. If storage auto-scaling is disabled or has hit its maximum threshold, the database silently stops accepting writes.
- Parameter group with low max_connections: The default max_connections value scales with instance class. If you’re on a db.t3.micro with the default, you get roughly 66 connections. A connection pool misconfiguration or a burst of Lambda functions can exhaust this instantly.
- Read replica on a different instance class: Read replicas that are smaller than the primary cannot keep up with write-heavy workloads, causing replication lag to grow unbounded.
- Multi-AZ failover changed the endpoint IP: After a Multi-AZ failover, the DNS CNAME is updated, but clients with cached DNS entries keep connecting to the old IP. Applications that don’t respect DNS TTL will see connection failures for minutes after failover.
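To make the max_connections point concrete, here is a rough sketch of how the default scales with memory. The divisors are the default parameter group formulas as I understand them (DBInstanceClassMemory/12582880 for MySQL, and DBInstanceClassMemory/9531392 capped at 5000 for PostgreSQL), so treat them as assumptions and check your own parameter group. The function name is hypothetical. RDS reserves part of the instance memory for the OS and its own processes, so real defaults come out lower than these estimates, which would explain why a 1 GiB instance lands nearer 66 than the figure printed here.

```shell
# Rough upper-bound estimate of the default max_connections.
# Assumption: the divisors match the default RDS parameter group formulas.
estimate_max_connections() {
  engine="$1"        # "mysql" or "postgres"
  memory_bytes="$2"  # nominal instance memory in bytes
  case "$engine" in
    mysql)
      echo $(( memory_bytes / 12582880 ))
      ;;
    postgres)
      estimate=$(( memory_bytes / 9531392 ))
      # The PostgreSQL default is capped at 5000
      if [ "$estimate" -gt 5000 ]; then estimate=5000; fi
      echo "$estimate"
      ;;
    *)
      echo "unknown engine: $engine" >&2
      return 1
      ;;
  esac
}

# 1 GiB of nominal memory, the size of a db.t3.micro
estimate_max_connections mysql 1073741824
estimate_max_connections postgres 1073741824
```

The takeaway is the order of magnitude: on the smallest instance classes the ceiling is low enough that a single misconfigured pool can reach it.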
The Fix
Step 1: Verify Instance Status and Storage
Check the instance state and how much storage is available:
aws rds describe-db-instances \
--db-instance-identifier my-database \
--query "DBInstances[0].{Status:DBInstanceStatus,Engine:Engine,Storage:AllocatedStorage,MaxStorage:MaxAllocatedStorage,StorageType:StorageType,MultiAZ:MultiAZ,Class:DBInstanceClass}" \
--output table
If MaxAllocatedStorage (shown as MaxStorage in the table) comes back as None, storage auto-scaling is disabled. If it equals AllocatedStorage, auto-scaling has hit its ceiling.
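If you want to script that check, a minimal sketch follows. It assumes the CLI prints None for a missing MaxAllocatedStorage field in text output, which is my understanding of how the field behaves when auto-scaling is off. The function name is hypothetical.

```shell
# Classify storage auto-scaling from AllocatedStorage and MaxAllocatedStorage.
# Pass "None" or an empty string when MaxAllocatedStorage is absent.
autoscaling_state() {
  allocated_gib="$1"
  max_gib="$2"
  if [ -z "$max_gib" ] || [ "$max_gib" = "None" ]; then
    echo "disabled"
  elif [ "$max_gib" -le "$allocated_gib" ]; then
    echo "at-ceiling"
  else
    echo "headroom: $(( max_gib - allocated_gib )) GiB"
  fi
}

# Sample values; in practice feed in the output of:
#   aws rds describe-db-instances --db-instance-identifier my-database \
#     --query "DBInstances[0].[AllocatedStorage,MaxAllocatedStorage]" --output text
autoscaling_state 100 None
autoscaling_state 100 100
autoscaling_state 100 500
```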
Check actual storage usage via CloudWatch:
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name FreeStorageSpace \
--dimensions Name=DBInstanceIdentifier,Value=my-database \
--start-time 2026-04-08T00:00:00Z \
--end-time 2026-04-09T23:59:59Z \
--period 3600 \
--statistics Minimum \
--output table
FreeStorageSpace is reported in bytes. If the minimum is near zero, that’s your problem.
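Raw byte counts are hard to judge at a glance, so it can help to convert them into a percentage of the allocated size. This is a small sketch with a hypothetical function name. It assumes the datapoint arrives as a plain number such as 5368709120.0 rather than in scientific notation.

```shell
# FreeStorageSpace is in bytes; AllocatedStorage is in GiB.
free_storage_percent() {
  free_bytes="${1%%.*}"   # CloudWatch datapoints are floats; drop the fraction
  allocated_gib="$2"
  total_bytes=$(( allocated_gib * 1024 * 1024 * 1024 ))
  echo $(( free_bytes * 100 / total_bytes ))
}

# 5 GiB free on a 100 GiB volume
free_storage_percent 5368709120.0 100
```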
Step 2: Fix Storage-Full Read-Only Mode
Enable storage auto-scaling with a higher ceiling:
aws rds modify-db-instance \
--db-instance-identifier my-database \
--max-allocated-storage 500 \
--apply-immediately
If you need immediate relief, increase the allocated storage:
aws rds modify-db-instance \
--db-instance-identifier my-database \
--allocated-storage 200 \
--apply-immediately
Note that after a storage modification you can’t make another one for six hours, or until storage optimization completes, whichever is longer. The instance shows a storage-optimization status during the change but remains available for reads and writes once the initial allocation completes.
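When choosing a value for --allocated-storage, my reading of the modify-db-instance reference is that an increase has to be at least 10% above the current size. Treat that rule as an assumption and confirm it for your engine. The hypothetical helper below computes the smallest whole-GiB target under that assumption.

```shell
# Smallest --allocated-storage value at least 10% above the current size,
# rounded up to a whole GiB. The 10% rule is an assumption (see above).
min_storage_target() {
  current_gib="$1"
  echo $(( (current_gib * 110 + 99) / 100 ))
}

min_storage_target 100
min_storage_target 25
```

Since a second storage change is blocked for hours, it is usually worth going well beyond this minimum rather than increasing in small steps.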
Step 3: Fix Connection Refused Errors
Identify the security group attached to the RDS instance:
aws rds describe-db-instances \
--db-instance-identifier my-database \
--query "DBInstances[0].VpcSecurityGroups[*].{GroupId:VpcSecurityGroupId,Status:Status}" \
--output table
Check the inbound rules:
aws ec2 describe-security-groups \
--group-ids sg-0abc123def \
--query "SecurityGroups[0].IpPermissions[*].{Port:FromPort,Protocol:IpProtocol,Sources:IpRanges[*].CidrIp,SourceSGs:UserIdGroupPairs[*].GroupId}" \
--output json
If your application’s security group or IP range isn’t listed, add it:
aws ec2 authorize-security-group-ingress \
--group-id sg-0abc123def \
--protocol tcp \
--port 5432 \
--source-group sg-0app456ghi
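Opening the wrong port is an easy mistake when you manage both MySQL and PostgreSQL instances. Here is a small sketch that picks the default port from the engine name and prints the ingress command for review instead of running it. Both function names are hypothetical, and an instance on a non-default port needs the value from Endpoint.Port instead.

```shell
# Map an engine name to its default port (the two ports named above).
default_db_port() {
  case "$1" in
    mysql)    echo 3306 ;;
    postgres) echo 5432 ;;
    *) echo "unknown engine: $1" >&2; return 1 ;;
  esac
}

# Print (not run) the ingress command so it can be reviewed first.
print_ingress_command() {
  db_sg="$1"; app_sg="$2"; engine="$3"
  port=$(default_db_port "$engine") || return 1
  echo "aws ec2 authorize-security-group-ingress --group-id $db_sg --protocol tcp --port $port --source-group $app_sg"
}

print_ingress_command sg-0abc123def sg-0app456ghi postgres
```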
Step 4: Fix max_connections Exhaustion
Check current connection count:
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=my-database \
--start-time 2026-04-09T00:00:00Z \
--end-time 2026-04-09T23:59:59Z \
--period 300 \
--statistics Maximum \
--output table
If connections are hitting the ceiling, the real fix is connection pooling (RDS Proxy or PgBouncer). As a short-term workaround, you can raise max_connections in a custom parameter group. The instance must already be attached to that group, since default parameter groups can’t be modified:
aws rds modify-db-parameter-group \
--db-parameter-group-name my-custom-params \
--parameters "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot"
Then reboot the instance to apply:
aws rds reboot-db-instance \
--db-instance-identifier my-database
Raising the limit only buys time, because every connection consumes memory on the instance. For a longer-term solution, use RDS Proxy to manage connection pooling transparently.
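To judge how close you are to the ceiling, compare the DatabaseConnections maximum against max_connections. This is a minimal sketch with hypothetical function names; the 80% cut-off is the alarm level suggested in the takeaway.

```shell
# Percentage of max_connections currently in use.
connection_utilization() {
  current="${1%%.*}"   # CloudWatch datapoints are floats; drop the fraction
  max="$2"
  echo $(( current * 100 / max ))
}

# Succeeds when utilization is at or above 80%.
connections_near_limit() {
  [ "$(connection_utilization "$1" "$2")" -ge 80 ]
}

if connections_near_limit 170 200; then
  echo "warning: $(connection_utilization 170 200)% of max_connections in use"
fi
```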
Step 5: Fix Read Replica Lag
Check the current replication lag:
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReplicaLag \
--dimensions Name=DBInstanceIdentifier,Value=my-database-replica \
--start-time 2026-04-09T00:00:00Z \
--end-time 2026-04-09T23:59:59Z \
--period 300 \
--statistics Maximum \
--output table
If lag is growing, check the replica’s instance class compared to the primary. The replica should be the same class or larger:
aws rds describe-db-instances \
--db-instance-identifier my-database-replica \
--query "DBInstances[0].{Class:DBInstanceClass,IOPS:Iops,StorageType:StorageType}" \
--output table
Scale up the replica to match the primary:
aws rds modify-db-instance \
--db-instance-identifier my-database-replica \
--db-instance-class db.r6g.xlarge \
--apply-immediately
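Before and after resizing, it helps to know whether lag is actually trending upward rather than spiking once. Here is a rough sketch with a hypothetical function name that compares the oldest and newest samples. get-metric-statistics does not return datapoints in time order, so the sketch assumes you sort them first, for example with --query "sort_by(Datapoints,&Timestamp)[].Maximum" --output text.

```shell
# Succeeds when the newest lag sample exceeds the oldest by more than
# a tolerance in seconds. Samples are passed oldest first.
lag_is_growing() {
  tolerance="$1"; shift
  [ "$#" -ge 2 ] || return 1     # need at least two samples to see a trend
  first="${1%%.*}"
  for sample in "$@"; do last="${sample%%.*}"; done
  [ $(( last - first )) -gt "$tolerance" ]
}

if lag_is_growing 5 2.0 10.0 45.0; then
  echo "replica lag is growing"
fi
```

Comparing only the endpoints is deliberately crude. It is enough to tell a replica that is catching up from one that is falling further behind.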
Step 6: Verify Connectivity
After making changes, look up the instance endpoint and port so you can test the connection from your application’s network:
aws rds describe-db-instances \
--db-instance-identifier my-database \
--query "DBInstances[0].Endpoint.{Address:Address,Port:Port}" \
--output table
Use the endpoint to test connectivity (from an EC2 instance in the same VPC):
nc -zv my-database.abc123.us-east-1.rds.amazonaws.com 5432
A successful connection returns Connection to ... succeeded.
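When the test fails, the way it fails is a useful hint: a timeout usually means packets are being dropped somewhere on the path, while an immediate refusal means the host answered but nothing accepted the connection. The sketch below assumes bash and GNU timeout are available on the test host; both function names are hypothetical.

```shell
# Translate the exit status of a timed connection attempt into a hint.
# 124 is the status GNU timeout returns when the time limit is hit.
classify_connect_status() {
  case "$1" in
    0)   echo "open" ;;
    124) echo "timed-out" ;;         # packets dropped: routing, security group, NACL
    *)   echo "refused-or-error" ;;  # port closed, or the hostname did not resolve
  esac
}

# Try a TCP connection with a 5-second limit using bash's /dev/tcp.
check_db_port() {
  host="$1"; port="$2"
  status=0
  timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null || status=$?
  classify_connect_status "$status"
}

# Usage (placeholder endpoint from this article):
# check_db_port my-database.abc123.us-east-1.rds.amazonaws.com 5432
```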
Is This Safe?
Mostly. The diagnostic commands are read-only. Increasing storage is non-destructive and the database remains available. Modifying security groups is additive. Two actions cause a brief interruption: rebooting the instance to apply parameter changes, and changing the replica’s instance class with --apply-immediately, which restarts the replica. Schedule both during a maintenance window if the database is in production.
Key Takeaway
RDS looking Available in the console doesn’t mean it’s healthy. A storage-full database stays technically available but rejects all writes, and the error message your application sees depends entirely on your database driver. Set up CloudWatch alarms on FreeStorageSpace (alarm when below 10% of total), DatabaseConnections (alarm at 80% of max_connections), and ReplicaLag (alarm above 30 seconds). Between them, these three metrics catch most of the failures covered here before they become incidents.
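Here is one way those three alarms could be wired up. It is a sketch, not a finished setup: the alarm names, the SNS topic ARN, the Average statistic, the five-minute period and the three evaluation periods are all placeholder choices of mine. The thresholds are the ones listed above.

```shell
# Alarm thresholds from the takeaway: 10% free storage, 80% of connections.
storage_alarm_bytes() {
  allocated_gib="$1"
  echo $(( allocated_gib * 1024 * 1024 * 1024 / 10 ))
}

connection_alarm_count() {
  max_connections="$1"
  echo $(( max_connections * 80 / 100 ))
}

# Thin wrapper around put-metric-alarm. Defined here but not called,
# because it needs AWS credentials and a real SNS topic.
create_rds_alarm() {
  name="$1"; metric="$2"; operator="$3"; threshold="$4"; db_id="$5"; topic_arn="$6"
  aws cloudwatch put-metric-alarm \
    --alarm-name "$name" \
    --namespace AWS/RDS \
    --metric-name "$metric" \
    --dimensions "Name=DBInstanceIdentifier,Value=$db_id" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 3 \
    --threshold "$threshold" \
    --comparison-operator "$operator" \
    --alarm-actions "$topic_arn"
}

# Usage sketch (TOPIC_ARN is a placeholder you must set):
# create_rds_alarm rds-low-storage FreeStorageSpace LessThanThreshold \
#   "$(storage_alarm_bytes 100)" my-database "$TOPIC_ARN"
# create_rds_alarm rds-connections DatabaseConnections GreaterThanThreshold \
#   "$(connection_alarm_count 200)" my-database "$TOPIC_ARN"
# create_rds_alarm rds-replica-lag ReplicaLag GreaterThanThreshold \
#   30 my-database-replica "$TOPIC_ARN"
```

If storage auto-scaling later grows the volume, the byte threshold for FreeStorageSpace needs recomputing, since it is derived from the allocated size.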