I’ve debugged this scenario more times than I can count: Control Tower detects drift in your landing zone, you click “Repair,” and the repair either fails silently or completes but drift is still showing. It’s maddening because the repair should fix it, but it doesn’t. The issue is almost always deeper than just clicking “Repair” — you need to understand what caused the drift in the first place. In this post, I’ll show you how to diagnose and fix repair failures.
The Problem
When Control Tower detects drift and you attempt to repair it, the repair operation deploys CloudFormation StackSets to restore baseline resources. If the repair fails or completes but drift persists, it means underlying issues are blocking the repair:
| Error Type | Description |
|---|---|
| Repair timeout | Stack operation took too long, likely due to manual changes |
| Permission denied | Execution role lacks necessary permissions |
| Role missing | IAM roles used by Control Tower were deleted or modified |
| Stack failed | CloudFormation template failed due to resource state |
| Security Hub disabled | Security Hub or GuardDuty manually disabled in account |
Why Does This Happen?
- Manual changes to CloudTrail, Config, or SNS: If you manually deleted or modified these Control Tower-managed resources, the repair attempts to restore them but may encounter conflicts if resources are in an inconsistent state.
- Stack instance in OUTDATED or FAILED state: When a StackSet operation fails, stack instances can be left in OUTDATED state. Control Tower cannot repair until these are resolved.
- Missing execution role in the account: Control Tower uses the AWSControlTowerExecution role to deploy and update stacks. If this role was deleted, repair fails.
- Security Hub or GuardDuty manually disabled: If you disabled these services in a member account, Control Tower’s repair cannot re-enable them if the services were removed from the stack.
- IAM role used by Control Tower was modified: The AWSControlTowerAdmin role in the management account or AWSControlTowerExecution in member accounts must have specific permissions. Modifying these roles breaks repair.
The Fix
Diagnose the specific issue causing repair failure and fix it systematically.
Step 1: Identify What’s Drifted
In the Control Tower console:
Control Tower → Landing Zone → Check for drift
Wait for the scan to complete. Then go to:
Control Tower → Organization → [Select drifted account/OU] → View drift details
Note exactly which resources are drifted (e.g., CloudTrail trail deleted, Config recorder modified, etc.).
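If you prefer to confirm the landing zone's overall drift status from a terminal, here is a minimal sketch. It assumes a recent AWS CLI v2 that includes the controltower commands, and credentials for the management account in the Control Tower home region. The function name landing_zone_drift is my own. It only reports landing-zone-level status; the per-account drift details still come from the console.

```shell
# Sketch: print "<arn> <status> <driftStatus>" for the landing zone.
# Assumes AWS CLI v2 with `controltower` commands and management-account
# credentials in the Control Tower home region.
landing_zone_drift() {
  local arn
  # list-landing-zones returns the landing zone ARN(s) visible to the caller.
  for arn in $(aws controltower list-landing-zones \
      --query 'landingZones[].arn' --output text); do
    # driftStatus.status is reported as DRIFTED or IN_SYNC.
    aws controltower get-landing-zone \
      --landing-zone-identifier "$arn" \
      --query 'landingZone.[arn, status, driftStatus.status]' \
      --output text
  done
}
```

Call landing_zone_drift before and after your fix so you have a simple before/after record.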
Step 2: Manually Restore the Drifted Resource
For each drifted resource, manually restore it to the expected state. For example, if the drift details say a CloudTrail trail was deleted, first confirm which trails still exist in the account:
aws cloudtrail describe-trails \
--region us-east-1 \
--output text
If the expected trail is missing, you may need to contact AWS Support or consult the Control Tower baseline documentation for the exact trail name and configuration.
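Listing trails tells you whether a trail exists, but not whether it is still logging, and the same question applies to the Config recorder. The sketch below checks both. The function names are hypothetical, and the trail name is whatever your drift details show; I am deliberately not hard-coding a Control Tower resource name here. It defaults to us-east-1 only because the command above uses that region.

```shell
# Print LOGGING, STOPPED, or MISSING_OR_INACCESSIBLE for one trail.
# Usage: trail_logging_state <trail-name-or-arn> [region]
trail_logging_state() {
  local trail="$1" region="${2:-us-east-1}" logging
  if ! logging=$(aws cloudtrail get-trail-status --name "$trail" \
        --region "$region" --query 'IsLogging' --output text 2>/dev/null); then
    # Non-zero exit: the trail does not exist here, or the caller cannot read it.
    echo "MISSING_OR_INACCESSIBLE"
    return 1
  fi
  case "$logging" in
    [Tt]rue) echo "LOGGING" ;;
    *)       echo "STOPPED" ;;
  esac
}

# Print "<recorder-name> <recording> <lastStatus>" for each Config recorder.
# Usage: config_recorder_state [region]
config_recorder_state() {
  local region="${1:-us-east-1}"
  aws configservice describe-configuration-recorder-status \
    --region "$region" \
    --query 'ConfigurationRecordersStatus[].[name, recording, lastStatus]' \
    --output text
}
```

A STOPPED trail or a recorder with recording set to False is something you can turn back on yourself; a missing one is where the baseline documentation or AWS Support comes in.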
Step 3: Check StackSet Operations
View StackSet operations for the baseline stacks:
aws cloudformation list-stack-set-operations \
--stack-set-name AWSControlTowerBP-BASELINE-CLOUDTRAIL \
--output text
Look for operations in “FAILED” or “STOPPED” state, then list the per-account results for one of them:
aws cloudformation list-stack-set-operation-results \
--stack-set-name AWSControlTowerBP-BASELINE-CLOUDTRAIL \
--operation-id [operation-id-from-above] \
--output table
If an operation failed, note which account and which resource caused the failure.
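Copying operation IDs by hand gets tedious when several operations have failed, so here is a rough sketch that chains the two commands and also lists stack instances that are not CURRENT. The function names are mine, not part of any AWS tooling, and it assumes you run it with management-account credentials in the region where the StackSets live.

```shell
# List operations on a StackSet that ended FAILED or STOPPED.
failed_stackset_operations() {
  aws cloudformation list-stack-set-operations \
    --stack-set-name "$1" \
    --query "Summaries[?Status=='FAILED' || Status=='STOPPED'].[OperationId, Action, Status]" \
    --output text
}

# For one operation, list the per-account results that did not succeed.
failed_operation_results() {
  aws cloudformation list-stack-set-operation-results \
    --stack-set-name "$1" \
    --operation-id "$2" \
    --query "Summaries[?Status!='SUCCEEDED'].[Account, Region, Status, StatusReason]" \
    --output text
}

# List stack instances that are not CURRENT (for example OUTDATED).
non_current_instances() {
  aws cloudformation list-stack-instances \
    --stack-set-name "$1" \
    --query "Summaries[?Status!='CURRENT'].[Account, Region, Status, StatusReason]" \
    --output text
}

# Print each failed operation followed by the accounts it failed in.
report_stackset_failures() {
  local stack_set="$1" op_id action status
  failed_stackset_operations "$stack_set" |
    while read -r op_id action status; do
      [ -n "$op_id" ] || continue
      echo "== $op_id ($action, $status)"
      failed_operation_results "$stack_set" "$op_id"
    done
}
```

For example, report_stackset_failures AWSControlTowerBP-BASELINE-CLOUDTRAIL prints one header per failed operation with the affected account, region, and status reason underneath. The StatusReason column is usually the fastest route to the real cause.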
Step 4: Verify IAM Roles
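Check that the required roles exist and have the expected permissions. AWSControlTowerAdmin lives in the management account and AWSControlTowerExecution lives in each member account, so run each command with credentials for the matching account: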
aws iam get-role \
--role-name AWSControlTowerAdmin \
--output text
aws iam get-role \
--role-name AWSControlTowerExecution \
--output text
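If AWSControlTowerExecution is missing from a member account, you can usually recreate it yourself: AWS documents creating this role by hand (trusting the management account) when enrolling an existing account. If AWSControlTowerAdmin is missing from the management account, contact AWS Support to restore it.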
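Because the two roles live in different accounts, I find it easier to wrap the check in a small helper that passes extra CLI arguments through. This is only a sketch: check_role and describe_role are hypothetical names, and the profile names in the usage note are placeholders for however you authenticate to each account.

```shell
# Print "<role> OK" if the role is readable with the current credentials,
# otherwise "<role> MISSING_OR_INACCESSIBLE". Extra arguments (for example
# --profile) are passed straight through to the AWS CLI.
check_role() {
  local role="$1"
  shift
  if aws iam get-role --role-name "$role" "$@" >/dev/null 2>&1; then
    echo "$role OK"
  else
    echo "$role MISSING_OR_INACCESSIBLE"
    return 1
  fi
}

# Show who may assume the role and which managed policies are attached,
# so you can compare the output against a healthy account.
describe_role() {
  local role="$1"
  shift
  aws iam get-role --role-name "$role" "$@" \
    --query 'Role.AssumeRolePolicyDocument.Statement[].Principal' \
    --output json
  aws iam list-attached-role-policies --role-name "$role" "$@" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text
}
```

Typical use would be check_role AWSControlTowerAdmin --profile management followed by check_role AWSControlTowerExecution --profile member. A role that exists but has a changed trust policy will still report OK, which is why describe_role is there for a closer look.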
Step 5: Re-Register the OU
After fixing underlying issues (restored drifted resources, verified roles), re-register the OU to trigger a fresh baseline deployment:
Control Tower → Organization → [Select OU] → Re-register OU
This is more thorough than “Repair” and will re-deploy all baseline resources.
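If you want to script this step, the Control Tower baseline APIs appear to offer a close counterpart: resetting an enabled baseline re-applies it to its target OU. Treat the following as a sketch under assumptions rather than a drop-in replacement for the console. It assumes a recent AWS CLI v2 with these controltower commands and a landing zone version that supports the baseline APIs; the function names are my own, and you should check the AWS documentation to confirm the reset matches a console re-registration in your setup.

```shell
# List enabled baselines as "<enabled-baseline-arn> <target-arn> <status>".
list_enabled_baselines() {
  aws controltower list-enabled-baselines \
    --query 'enabledBaselines[].[arn, targetIdentifier, statusSummary.status]' \
    --output text
}

# Re-apply one enabled baseline to its target; prints the operation ID.
# Usage: reset_baseline <enabled-baseline-arn>
reset_baseline() {
  aws controltower reset-enabled-baseline \
    --enabled-baseline-identifier "$1" \
    --query 'operationIdentifier' \
    --output text
}

# Print the status of a baseline operation started by reset_baseline.
# Usage: baseline_operation_status <operation-id>
baseline_operation_status() {
  aws controltower get-baseline-operation \
    --operation-identifier "$1" \
    --query 'baselineOperation.status' \
    --output text
}
```

Run list_enabled_baselines first, pick the row whose target is the OU you fixed, pass its ARN to reset_baseline, and poll baseline_operation_status with the returned ID until it stops reporting IN_PROGRESS.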
How to Run This
- Open the Control Tower console.
- Go to Landing Zone → Check for drift and wait for the scan.
- Select a drifted account or OU and view drift details.
- Note which resources are drifted (CloudTrail, Config, SNS, etc.).
- Run the CloudTrail and StackSet commands above to identify what’s missing.
- Manually restore drifted resources (e.g., re-create a deleted CloudTrail).
- Run the IAM commands to verify roles exist and are intact.
- Go to Control Tower → Organization → [Select OU] → Re-register OU.
- Wait for the re-registration to complete (usually 10–20 minutes).
- Check for drift again — it should be clear now.
Is This Safe?
Yes. Re-registering an OU is safe and repeatable, and Control Tower will re-deploy baseline resources idempotently. Keep in mind that it re-applies the baseline to every enrolled account in the OU, not only the drifted one. Manually restoring drifted resources is safe as long as you’re restoring them to Control Tower’s expected state (which the drift details show you).
Key Takeaway
Landing Zone repair failures are usually caused by underlying manual changes to Control Tower-managed resources. Identify what drifted, manually restore it to the expected state, and then re-register the OU. With the root cause fixed, the next drift check should come back clean.
Have questions or run into a different Control Tower issue? Connect with me on LinkedIn or X.