Anatomy of the AWS US-EAST-1 Outage
New to AWS? Check out the glossary at the bottom for definitions.
Overview
AWS experienced a roughly 15-hour outage in its US-EAST-1 region that impacted services like Snapchat, Reddit, and Venmo. Estimated losses run north of half a billion dollars… ouch.
For context, US-EAST-1 is AWS’s largest and oldest region—it’s basically the backbone of the internet. When it goes down, you notice.
Here is my attempt to explain what happened to anybody who wants to know.
graph LR
subgraph "DynamoDB DNS Management System"
DP[DNS Planner
Monitors health & capacity
Creates DNS plans]
DE1[DNS Enactor AZ-1
Applies plans to Route53]
DE2[DNS Enactor AZ-2
Applies plans to Route53]
DE3[DNS Enactor AZ-3
Applies plans to Route53]
R53[Amazon Route53
DNS Records]
DP --> DE1
DP --> DE2
DP --> DE3
DE1 --> R53
DE2 --> R53
DE3 --> R53
end
subgraph "DynamoDB Service"
DDB[DynamoDB Regional Endpoint
dynamodb.us-east-1.amazonaws.com]
LB1[Load Balancer 1]
LB2[Load Balancer 2]
LBN[Load Balancer N]
R53 --> DDB
DDB --> LB1
DDB --> LB2
DDB --> LBN
end
subgraph "EC2 Management Systems"
DWFM[DropletWorkflow Manager
Manages physical servers]
NM[Network Manager
Network state propagation]
DROPLETS[Physical Servers
Droplets]
DWFM --> DROPLETS
NM --> DROPLETS
DWFM -.->|depends on| DDB
end
subgraph "Network Load Balancer"
NLB[Network Load Balancer]
HC[Health Check Subsystem]
NLBNODES[NLB Nodes]
HC --> NLBNODES
NLB --> NLBNODES
HC -.->|health checks| DROPLETS
end
subgraph "Affected AWS Services"
LAMBDA[AWS Lambda]
ECS[Amazon ECS/EKS/Fargate]
CONNECT[Amazon Connect]
STS[AWS STS]
CONSOLE[AWS Management Console]
REDSHIFT[Amazon Redshift]
SUPPORT[AWS Support]
OTHER[Other AWS Services]
end
%% Dependencies
DDB -.->|dependency| LAMBDA
DDB -.->|dependency| ECS
DDB -.->|dependency| CONNECT
DDB -.->|dependency| STS
DDB -.->|dependency| CONSOLE
DDB -.->|dependency| REDSHIFT
DDB -.->|dependency| SUPPORT
DDB -.->|dependency| OTHER
DROPLETS -.->|dependency| LAMBDA
DROPLETS -.->|dependency| ECS
NLB -.->|dependency| CONNECT
NLB -.->|dependency| LAMBDA
%% Failure cascade
R53 -.->|DNS failure| DDB
DDB -.->|cascade| DWFM
DWFM -.->|cascade| NM
NM -.->|cascade| NLB
classDef failure fill:#ffcccc,stroke:#ff0000,stroke-width:2px
classDef affected fill:#ffffcc,stroke:#ffaa00,stroke-width:2px
class R53,DDB failure
class DWFM,NM,NLB,LAMBDA,ECS,CONNECT,STS,CONSOLE,REDSHIFT,SUPPORT,OTHER affected
What exactly went wrong in DynamoDB?
DynamoDB's DNS management system is split into two components: the DNS Planner and the DNS Enactor.
The DNS Planner monitors the health and capacity of the load balancers behind each endpoint and periodically creates a new DNS plan, which defines the set of load balancers for the endpoint and the percentage of traffic each should receive.
The DNS Enactor, which is designed with minimal dependencies so it can keep running and help the system recover in any scenario, enacts those plans by applying the required changes to Amazon Route53. There are three Enactors, one per Availability Zone, each applying plans independently.
Normally, a DNS Enactor picks up the latest plan produced by the DNS Planner, checks that it is newer than the plan currently in Route53, and then starts applying it.
In this case, right before the update started, one of the Enactors began experiencing unusually high delays and had to retry multiple times. Meanwhile, the DNS Planner kept creating newer plans, which another Enactor picked up and applied to Route53. The timing of these events is what created the race condition.

When the second Enactor completed its work, it triggered a cleanup process that deletes older plans that have already been superseded. Right as that cleanup ran, the first (delayed) Enactor finally applied its old plan, overwriting the newer one. The cleanup process then deleted that old plan, because it was many generations older than the plan the second Enactor had just applied.

Deleting the active plan immediately removed all IP addresses for the regional endpoint, leaving the system in an inconsistent state that also prevented any subsequent plan updates from being applied by any of the Enactors.
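To see how a stale check-then-act plus an unguarded cleanup ends in an empty DNS record, here is a tiny simulation of that window. To be clear, this is a hypothetical reconstruction: the plan IDs, data structures, and function names are all invented, not AWS's actual code.

```python
# Hypothetical reconstruction of the race; everything here is invented
# for illustration and is not AWS's implementation.

plans = {1: ["10.0.0.1"], 3: ["10.0.0.2", "10.0.0.3"]}  # plan id -> IPs
route53 = {}  # endpoint -> id of the plan it currently serves

def is_newer(endpoint, plan_id):
    return route53.get(endpoint, -1) < plan_id

def apply(endpoint, plan_id):
    route53[endpoint] = plan_id

def cleanup(older_than):
    # Deletes superseded plans without checking whether an endpoint
    # still references them: the missing safety check.
    for pid in list(plans):
        if pid < older_than:
            del plans[pid]

endpoint = "dynamodb.us-east-1.amazonaws.com"

ok = is_newer(endpoint, 1)  # slow Enactor validates Plan 1... then stalls
apply(endpoint, 3)          # fast Enactor applies the newest plan, Plan 3
if ok:
    apply(endpoint, 1)      # stale check finally acted on: Plan 3 overwritten
cleanup(older_than=3)       # fast Enactor's cleanup deletes "old" Plan 1

print(route53[endpoint] in plans)  # False: the endpoint now points at nothing
```

The obvious hardening steps are to re-check plan freshness at the moment of writing and to never delete a plan that an endpoint still references.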
sequenceDiagram
participant DP as DNS Planner
participant DE1 as DNS Enactor A (Slow)
participant DE2 as DNS Enactor B (Fast)
participant R53 as Route53
participant CLEANUP as Cleanup Process
Note over DP, CLEANUP: Normal Operation
DP->>DE1: Plan X (older)
DP->>DE2: Plan X (older)
Note over DE1: Starts applying Plan X
DE1->>R53: Check: Plan X newer than current? ✓
DE1->>R53: Begin applying Plan X to endpoints
Note over DE1: ⚠️ Experiences unusual delays
DE1-->>R53: Retry endpoint 1... (blocked)
DE1-->>R53: Retry endpoint 2... (blocked)
Note over DP: Continues generating newer plans
DP->>DE2: Plan Y (newer)
DP->>DE2: Plan Z (newest)
Note over DE2: Picks up Plan Z (newest)
DE2->>R53: Check: Plan Z newer than current? ✓
DE2->>R53: Apply Plan Z to all endpoints (fast)
DE2->>R53: ✅ Plan Z applied successfully
Note over DE2: Triggers cleanup after success
DE2->>CLEANUP: Invoke cleanup process
Note over DE1, CLEANUP: 🚨 RACE CONDITION WINDOW
par
DE1->>R53: Apply old Plan X to regional endpoint
Note over DE1: Overwrites newer Plan Z!
and
CLEANUP->>R53: Delete old plans (including Plan X)
Note over CLEANUP: Deletes the plan that was just applied!
end
Note over R53: 💥 FAILURE STATE
R53-->>R53: Regional endpoint has empty DNS record
R53-->>R53: System in inconsistent state
Note over DP, CLEANUP: 11:48 PM PDT - DNS Resolution Fails
Note over DP, CLEANUP: All DynamoDB connections fail
rect rgb(255, 200, 200)
Note over DP, CLEANUP: Manual Intervention Required
Note over DP, CLEANUP: 12:38 AM - Engineers identify DNS issue
Note over DP, CLEANUP: 1:15 AM - Temporary mitigations applied
Note over DP, CLEANUP: 2:25 AM - DNS information restored
Note over DP, CLEANUP: 2:40 AM - Full recovery (DNS cache expiry)
end
And why did this affect the EC2 service?
To understand what happened with EC2, let's define a couple of subsystems that EC2's management plane uses:
- The DropletWorkflow Manager (DWFM) manages all of the underlying physical servers, called "droplets", that host EC2 instances.
- The Network Manager manages and propagates network state to all EC2 instances and network appliances.
Each DWFM manages physical servers in one Availability Zone and maintains a lease on each server that it is currently managing. The lease tracks server state and ensures that instance operations (like shutdown or reboot) are properly reflected across EC2 systems.
When DynamoDB went down, the state checks performed by DWFM started failing, and leases began timing out. A droplet without an active lease can't be used for new launches, so whenever a customer tried to launch a new instance, an "Insufficient Capacity" error was returned.
Once the DNS issue was resolved, DWFM had to re-establish leases for a huge number of droplets at once. The flood of lease work caused requests to time out, the timeouts produced retries and an ever-growing backlog, and the system slid into congestive collapse.
The issue was finally resolved when the engineers manually restarted DWFM hosts and cleared the queues.
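To make the lease mechanics a bit more concrete, here is a hedged sketch of the idea; the TTL, class, and method names are all invented, and this models the failure mode rather than AWS's actual implementation.

```python
import time

# Invented model of DWFM-style leases: leases stay valid only while periodic
# state checks against the state store (DynamoDB, in the real system) succeed.

LEASE_TTL = 60.0  # seconds a lease survives without a renewal (made-up value)

class LeaseManager:
    def __init__(self, droplets, state_store_ok):
        self.state_store_ok = state_store_ok  # callable: "is the state store reachable?"
        self.leases = {d: time.monotonic() for d in droplets}

    def renew_all(self):
        """Periodic state check; leases refresh only if the store responds."""
        if self.state_store_ok():
            now = time.monotonic()
            for d in self.leases:
                self.leases[d] = now

    def launchable(self):
        """Only droplets holding a live lease are eligible for new launches."""
        now = time.monotonic()
        return [d for d, t in self.leases.items() if now - t < LEASE_TTL]
```

If `state_store_ok()` keeps returning False for longer than the TTL, `launchable()` drops to an empty list and every launch request comes back as "insufficient capacity" even though the hardware is sitting idle. Recovery has the mirror-image problem: every droplet needs a fresh lease at the same moment, and that thundering herd is what tipped DWFM into congestive collapse until the hosts were restarted and the queues cleared.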
sequenceDiagram
participant DDB as DynamoDB
participant DWFM as DWFM
participant DROPLETS as Physical Servers
participant NM as Network Manager
participant API as EC2 API
participant CUSTOMERS as Customers
Note over DDB, CUSTOMERS: Normal Operations
loop Every few minutes
DWFM->>DDB: State check & lease validation
DDB-->>DWFM: ✅ Success
DWFM->>DROPLETS: Maintain lease
end
CUSTOMERS->>API: Launch new EC2 instance
API->>DWFM: Request capacity
DWFM->>DROPLETS: Allocate from leased servers
DROPLETS-->>DWFM: ✅ Server allocated
NM->>DROPLETS: Configure network for new instance
API-->>CUSTOMERS: ✅ Instance launched
Note over DDB, CUSTOMERS: 🚨 11:48 PM - DynamoDB DNS Failure
rect rgb(255, 200, 200)
DWFM->>DDB: State check
DDB-->>DWFM: ❌ DNS resolution failed
Note over DWFM: Cannot complete state checks
Note over DROPLETS: Leases begin timing out
end
Note over DDB, CUSTOMERS: 11:48 PM - 2:25 AM: Lease Timeout Period
CUSTOMERS->>API: Launch new EC2 instance
API->>DWFM: Request capacity
DWFM->>DROPLETS: Check available servers
DROPLETS-->>DWFM: ❌ No servers with active leases
API-->>CUSTOMERS: ❌ "Insufficient capacity"
Note over DDB, CUSTOMERS: 2:25 AM - DynamoDB Recovers
rect rgb(200, 255, 200)
DWFM->>DDB: State check
DDB-->>DWFM: ✅ Success
end
Note over DDB, CUSTOMERS: 2:25 AM - 5:28 AM: Congestive Collapse
rect rgb(255, 255, 200)
par Massive lease re-establishment
DWFM->>DROPLETS: Re-establish lease 1
DWFM->>DROPLETS: Re-establish lease 2
DWFM->>DROPLETS: Re-establish lease N
end
Note over DWFM: Too many simultaneous requests
Note over DWFM: Lease attempts timeout
Note over DWFM: Work queues up → System overload
CUSTOMERS->>API: Launch new EC2 instance
API-->>CUSTOMERS: ❌ "Insufficient capacity"
Note over DWFM: 4:14 AM - Engineers restart DWFM hosts
Note over DWFM: Queues cleared, processing normalized
end
Note over DDB, CUSTOMERS: 5:28 AM - DWFM Recovery Complete
DWFM->>DROPLETS: ✅ All leases re-established
Note over DDB, CUSTOMERS: 5:28 AM - 10:36 AM: Network Manager Backlog
rect rgb(200, 200, 255)
CUSTOMERS->>API: Launch new EC2 instance
API->>DWFM: Request capacity
DWFM->>DROPLETS: ✅ Server allocated
NM->>DROPLETS: Configure network (delayed)
Note over NM: Processing backlog of network configs
API-->>CUSTOMERS: ✅ Instance launched (no network)
Note over NM: 6:21 AM - Increased latencies
Note over NM: 10:36 AM - Backlog cleared
NM->>DROPLETS: ✅ Network configured
end
Note over DDB, CUSTOMERS: 10:36 AM - 1:50 PM: Throttle Removal
Note over API: 11:23 AM - Begin removing throttles
CUSTOMERS->>API: Launch new EC2 instance
API->>DWFM: Request capacity
DWFM->>DROPLETS: ✅ Server allocated
NM->>DROPLETS: ✅ Network configured
API-->>CUSTOMERS: ✅ Instance launched successfully
Note over DDB, CUSTOMERS: 1:50 PM - Full Recovery
And the NLB was affected because…?
The NLB outage was caused by a timing mismatch between EC2 instance launches and network configuration propagation. After DWFM recovered at 5:28 AM, new EC2 instances launched successfully but Network Manager was overwhelmed processing a backlog of network configurations. NLB’s health check system began checking these new instances before their network state fully propagated, causing health checks to alternate between failing and passing as network configs were applied. This flapping behavior overloaded the health check subsystem, triggered automatic AZ DNS failovers, and removed healthy capacity from service, resulting in connection errors for customers. Engineers resolved the issue by disabling automatic failovers at 9:36 AM, allowing all healthy nodes to return to service.
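Here is a toy model of that flapping. The numbers, the 50/50 coin flip, and the two-targets-per-tick backlog rate are all invented; it just shows why "check before the network config lands" looks like instability rather than a clean failure.

```python
import random

# Invented flapping model: until the (backlogged) Network Manager has applied a
# target's network config, that target answers health checks only intermittently.

random.seed(0)
targets = [f"node-{i}" for i in range(10)]
config_done = {t: False for t in targets}

def passes_health_check(target):
    return config_done[target] or random.random() < 0.5  # flaky until configured

for tick in range(5):
    # the backlogged Network Manager works through two targets per tick
    for t in targets[tick * 2:(tick + 1) * 2]:
        config_done[t] = True
    healthy = [t for t in targets if passes_health_check(t)]
    print(f"tick {tick}: {len(healthy)}/{len(targets)} targets look healthy")

# Failover automation keyed off "how many targets look healthy right now" will
# fire repeatedly on output like this, pulling capacity that is actually fine.
```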
sequenceDiagram
participant DWFM as DWFM
participant EC2 as New EC2 Instance
participant NM as Network Manager
participant HC as NLB Health Check
participant NLB as NLB Service
participant DNS as DNS/Route53
participant CUSTOMER as Customer
Note over DWFM, CUSTOMER: 5:28 AM - DWFM Recovery Complete
DWFM->>EC2: ✅ Launch new instance
Note over EC2: Instance running but network incomplete
par Network backlog processing
NM-->>EC2: Network config pending...
and Health check starts
HC->>EC2: Health check
EC2-->>HC: ❌ Network not ready
HC->>NLB: Mark as unhealthy
NLB->>DNS: Remove from service
end
NM->>EC2: ✅ Network config applied
HC->>EC2: Health check
EC2-->>HC: ✅ Now healthy
HC->>NLB: Mark as healthy
NLB->>DNS: Add back to service
Note over HC, DNS: Health checks flapping (fail/pass/fail/pass)
loop Flapping cycle
HC->>EC2: Health check
alt Network timing issue
EC2-->>HC: ❌ Temporary network issue
HC->>NLB: Remove from service
NLB->>DNS: Remove from DNS
else Network working
EC2-->>HC: ✅ Healthy
HC->>NLB: Add to service
NLB->>DNS: Add to DNS
end
end
Note over HC, CUSTOMER: 6:21 AM - Health check system overloaded
rect rgb(255, 200, 200)
Note over HC: System degraded from flapping
HC-->>NLB: Delayed health checks
NLB->>DNS: Automatic AZ failover triggered
Note over DNS: Capacity removed from service
CUSTOMER->>NLB: Connection request
NLB-->>CUSTOMER: ❌ Connection error (insufficient capacity)
end
Note over HC, CUSTOMER: 9:36 AM - Engineers disable automatic failover
rect rgb(200, 255, 200)
Note over NLB: Automatic failover disabled
NLB->>DNS: Restore all healthy nodes
CUSTOMER->>NLB: Connection request
NLB-->>CUSTOMER: ✅ Connection successful
end
Note over HC, CUSTOMER: 2:09 PM - Re-enable automatic failover after full recovery
What is AWS doing to resolve this issue?
AWS has disabled the DynamoDB DNS Planner and Enactor automation worldwide while they fix the race condition and add protections to prevent incorrect DNS plans from being applied. They’re also adding velocity controls to NLB so it can’t remove too much capacity at once during health check failures. For EC2, they’re building better test suites to catch DWFM recovery issues and improving throttling to handle high load without collapsing. Plus, they’re doing a full investigation across all services to find more ways to prevent this kind of thing and speed up recovery times.
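AWS hasn't published what those NLB velocity controls look like, but the general shape of such a guardrail is easy to sketch. Everything below (class name, thresholds, window) is my assumption, not AWS's design.

```python
import time

class RemovalGovernor:
    """Caps how much capacity health-check automation may remove per time window."""

    def __init__(self, total_capacity, max_fraction=0.2, window_s=300):
        self.max_removals = int(total_capacity * max_fraction)  # e.g. at most 20%
        self.window_s = window_s
        self.removals = []  # timestamps of recent removals

    def allow_removal(self):
        now = time.monotonic()
        # forget removals that have aged out of the window
        self.removals = [t for t in self.removals if now - t < self.window_s]
        if len(self.removals) >= self.max_removals:
            return False  # keep suspect capacity rather than empty the fleet
        self.removals.append(now)
        return True
```

With a cap like this sitting in front of the failover automation, a wave of flapping health checks can still degrade some capacity, but it can't pull an entire Availability Zone out of DNS on its own.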
What should services affected by this outage do about it?
- Multi-Region Failover: If your service runs in multiple regions, redirect traffic to unaffected regions during outages, even if it means degraded performance or higher latency. Some availability is always better than complete downtime. Set up automated failover mechanisms so you’re not scrambling to manually reroute traffic at 2 AM (there’s a small sketch of the client-side version after this list).
- Practice Disaster Recovery: Regularly simulate large-scale failures to test your disaster recovery procedures. Running these drills will expose weaknesses in your architecture, reveal hidden dependencies, and ensure your team knows exactly what to do when things go sideways. The middle of an actual outage is not the time to discover your failover doesn’t work.
- Evaluate Dependencies and SLAs: Calculate the actual cost of downtime versus the cost of building redundancy across regions or cloud providers. Assess whether your service SLAs realistically account for your dependencies’ SLAs (like AWS). For some services, accepting the risk might be more cost-effective than maintaining complex multi-region architectures. Make an informed decision based on your numbers, not assumptions.
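To make the first bullet concrete, here is a hedged sketch of client-side fallback across regions using boto3. The region list, table name, and key schema are placeholders, and it assumes the data is already replicated (for example, via a DynamoDB global table); server-side options like Route53 failover routing get you to the same place without touching every caller.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Placeholder topology: swap in your own regions and table.
REGIONS = ["us-east-1", "us-west-2"]
TABLE = "my-table"  # assumed to have a replica in every listed region

def get_item_with_failover(key):
    """Try each region in order and return the first successful response."""
    last_error = None
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region)
        try:
            return client.get_item(TableName=TABLE, Key=key)
        except (ClientError, EndpointConnectionError) as err:
            last_error = err  # remember the failure, try the next region
    raise last_error

# Hypothetical usage, assuming a string partition key named "pk":
# item = get_item_with_failover({"pk": {"S": "user#123"}})
```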
Glossary
AWS (Amazon Web Services) - Amazon’s cloud computing platform that provides servers, storage, databases, and other services that power a huge chunk of the internet.
US-EAST-1 - AWS’s largest and oldest data center region, located in Northern Virginia. Think of it as the “main office” of AWS—when it goes down, a lot of the internet goes with it.
DynamoDB - AWS’s NoSQL database service. It’s fast, scalable, and apparently, a single point of failure for a lot of other services.
EC2 (Elastic Compute Cloud) - AWS’s virtual server service. Basically, you rent computer power in the cloud instead of buying physical servers.
NLB (Network Load Balancer) - Distributes incoming network traffic across multiple servers to prevent any single server from getting overwhelmed. Like a traffic cop for data.
DNS (Domain Name System) - The internet’s phonebook. It translates human-readable domain names (like dynamodb.us-east-1.amazonaws.com) into IP addresses that computers can understand. When DNS breaks, nothing can find anything.
Route53 - AWS’s DNS service. It’s what tells your computer where to find AWS services.
Availability Zone (AZ) - Independent data centers within an AWS region. US-EAST-1 has multiple AZs for redundancy—if one fails, the others should keep working. (Spoiler: didn’t help this time.)
Droplet - AWS’s internal term for the physical servers that host EC2 instances. Not to be confused with DigitalOcean’s droplets.
Lease - A time-limited claim on a resource. In this case, DWFM “leases” physical servers, meaning it has permission to manage them for a set period. When leases expire without renewal, chaos ensues.
DWFM (DropletWorkflow Manager) - The system that manages all the physical servers (droplets) used by EC2. When it can’t talk to DynamoDB, it can’t verify server leases, and everything grinds to a halt.
Network Manager - Handles network configuration for EC2 instances. Makes sure your servers can actually talk to the internet.
Health Check - Automated tests that verify if a server is working properly. NLB uses these to decide which servers should receive traffic.
Cascading Failure - When one component fails and causes other dependent components to fail in a domino effect. In this case: DynamoDB → EC2 → NLB → everything else.
Race Condition - A bug that occurs when the timing of events matters. Two processes trying to do things simultaneously can produce unpredictable results—like two people trying to walk through a doorway at the same time.
Congestive Collapse - When a system gets so overwhelmed with retry attempts that it can’t process anything, making the problem worse. Like a traffic jam where honking doesn’t make it move faster.
DNS Planner - The part of DynamoDB’s DNS system that monitors health and creates plans for which IP addresses should be in Route53.
DNS Enactor - The part that actually applies the DNS Planner’s plans to Route53. Designed to have minimal dependencies so it can help recover from failures. (Ironic, given what happened.)
IP Address - The actual numeric address that computers use to find each other on the internet (e.g., 192.168.1.1).
Endpoint - The URL or address where you access a service (e.g., dynamodb.us-east-1.amazonaws.com).
References
- AWS Official Post-Mortem: Summary of the Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region - AWS’s detailed technical post-event summary explaining the race condition, cascading failures, and recovery efforts (October 19-20, 2025)
- A Single DNS Race Condition Brought AWS to Its Knees - The Register’s technical analysis of the outage and its impacts (October 23, 2025)
- AWS Health Dashboard - Real-time status updates for AWS services across all regions
- Race Condition - Wikipedia - Comprehensive overview of race conditions in computing systems
- Handling Race Condition in Distributed Systems - GeeksforGeeks guide on detecting and managing race conditions in distributed architectures