New to AWS? Check out the glossary at the bottom for definitions.
AWS experienced a roughly 15-hour outage in its US-EAST-1 region, which impacted services such as Snapchat, Reddit, and Venmo. Estimated losses run to more than half a billion dollars... ouch.
For context, US-EAST-1 is AWS's largest and oldest region—it's basically the backbone of the internet. When it goes down, you notice.
Here is my attempt to explain what happened to anybody who wants to know.
DynamoDB's DNS management system is split into two components: the DNS Planner and the DNS Enactor.
The DNS Planner monitors the health and capacity of the load balancers and periodically creates a new DNS plan, which defines the set of load balancers behind each service endpoint and the percentage of traffic each should receive.
The DNS Enactor, which is designed with minimal dependencies so it can help the system recover in any scenario, enacts those plans by applying the required changes in the Amazon Route 53 service.
The way things normally work is that a DNS Enactor picks up the latest plan produced by the DNS Planner, checks that it is newer than the plan currently in Route 53, and then starts applying it.
In this case, right before the update started, one of the Enactors began experiencing unusually high delays and had to retry multiple times. Meanwhile, the DNS Planner kept producing newer plans, which another Enactor picked up and applied to Route 53. The timing of these events is what created the race condition. When the second Enactor completed its work, it triggered a clean-up process that deletes older, previously applied plans. Just as that clean-up was invoked, the first (unusually delayed) Enactor applied its older plan, overwriting the newer one. The second Enactor's clean-up then deleted this older plan, because it was many generations older than the plan it had just applied. Deleting the now-active plan immediately removed all of the endpoint's IP addresses, leaving the system in an inconsistent state that prevented any DNS Enactor from applying subsequent plan updates.
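To make the race concrete, here's a tiny Python sketch. It is emphatically not AWS's actual code: the plan structure, generation numbers, and the one-plan clean-up rule are all invented for illustration. But the ordering of events mirrors the incident.

```python
# Toy model of the race. All names and numbers are invented; a "plan" is just
# a generation counter plus the IP addresses it wants in DNS.

route53 = {"active_plan": None, "ips": []}   # stand-in for the real DNS record
applied_plans = []                           # history of plans that were applied

def apply_plan(plan, check_freshness=True):
    """An Enactor applying a plan. The freshness check happens before the
    (possibly slow, retried) apply step -- that gap is the race window."""
    current = route53["active_plan"]
    if check_freshness and current is not None and plan["gen"] <= current["gen"]:
        return  # plan is stale, skip it
    route53["active_plan"] = plan
    route53["ips"] = plan["ips"]
    applied_plans.append(plan)

def clean_up(latest_gen, keep_last=1):
    """After a successful apply, an Enactor deletes plans that are many
    generations older than the one it just applied -- including, fatally,
    a stale plan that has just become the active one."""
    for plan in applied_plans:
        if plan["gen"] < latest_gen - keep_last and route53["active_plan"] is plan:
            route53["active_plan"] = None   # deleting the active plan empties DNS
            route53["ips"] = []

old_plan = {"gen": 1, "ips": ["10.0.0.1"]}              # picked up by the slow Enactor
new_plan = {"gen": 7, "ips": ["10.0.0.5", "10.0.0.6"]}  # produced later by the Planner

apply_plan(new_plan)                          # fast Enactor applies generation 7
apply_plan(old_plan, check_freshness=False)   # slow Enactor finally finishes; its
                                              # freshness check ran long ago, so the
                                              # stale generation 1 overwrites 7
clean_up(latest_gen=new_plan["gen"])          # fast Enactor's clean-up deletes gen 1,
                                              # which is now the active plan
print(route53["ips"])                         # [] -> the endpoint has no IPs left
```

The real system obviously has far more safeguards; the sketch just shows how a stale write plus an over-eager clean-up can leave you with an empty DNS record.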
To understand what happened with the EC2 instances, let's define a few sub-systems that EC2's management system uses:
The DropletWorkflow Manager (DWFM) is responsible for managing all of the underlying physical servers, internally called "droplets", that EC2 uses to host instances.
The Network Manager is responsible for managing and propagating network state to all EC2 instances and network appliances.
Each DWFM manages physical servers in one Availability Zone and maintains a lease on each server that it is currently managing. The lease tracks server state and ensures that instance operations (like shutdown or reboot) are properly reflected across EC2 systems.
When DynamoDB went down, the state checks performed by DWFM started failing, resulting in lease timeouts. With their leases expired, droplets were no longer considered available for new launches, so whenever a customer tried launching a new instance, an "Insufficient Capacity" error was returned.
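To see how a database outage turns into "Insufficient Capacity", here's a small, self-contained Python sketch. The class names, the TTL, and the tiny fleet are all invented; the point is just that a lease nobody can renew eventually expires, and an expired lease takes a droplet out of the usable pool.

```python
import time

LEASE_TTL = 0.5  # seconds; a made-up number purely for the demo

class Droplet:
    """A physical server whose lease must keep being renewed."""
    def __init__(self, name):
        self.name = name
        self.lease_expires_at = time.monotonic() + LEASE_TTL

    def renew_lease(self, dynamodb_up):
        # The state check behind a renewal depends on DynamoDB; if that call
        # fails, the lease is simply not extended and quietly runs out.
        if not dynamodb_up:
            raise TimeoutError("state check failed: DynamoDB unreachable")
        self.lease_expires_at = time.monotonic() + LEASE_TTL

    @property
    def leased(self):
        return time.monotonic() < self.lease_expires_at

def launch_instance(fleet):
    usable = [d for d in fleet if d.leased]
    if not usable:
        raise RuntimeError("Insufficient Capacity")  # the error customers saw
    return usable[0]

fleet = [Droplet(f"droplet-{i}") for i in range(3)]
for d in fleet:
    try:
        d.renew_lease(dynamodb_up=False)  # DynamoDB is down, renewals fail
    except TimeoutError:
        pass
time.sleep(LEASE_TTL + 0.1)               # leases expire with nobody renewing them
try:
    launch_instance(fleet)
except RuntimeError as err:
    print(err)                            # -> Insufficient Capacity
```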
Once the DNS issue was resolved, a huge number of droplets tried to renew their leases all at once. Renewal requests started to time out, which created even more backlog and ultimately resulted in congestive collapse.
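AWS's fix list further down mentions better throttling, and the classic client-side pattern for avoiding exactly this kind of retry pile-up is capped exponential backoff with jitter. The sketch below is a generic pattern, not anything from DWFM's codebase; renew_fn is a hypothetical stand-in for a single lease-renewal call.

```python
import random
import time

def renew_with_backoff(renew_fn, max_attempts=6, base=0.5, cap=30.0):
    """Retry a lease renewal with capped exponential backoff plus full jitter.
    If every droplet retries in lockstep the instant a request times out, the
    renewal queue only ever grows; spreading retries out in time gives the
    backlog a chance to drain."""
    for attempt in range(max_attempts):
        try:
            return renew_fn()
        except TimeoutError:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise TimeoutError("lease renewal failed after all retries")
```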
The issue was finally resolved when the engineers manually restarted DWFM hosts and cleared the queues.
The NLB outage was caused by a timing mismatch between EC2 instance launches and network configuration propagation. After DWFM recovered at 5:28 AM, new EC2 instances launched successfully but Network Manager was overwhelmed processing a backlog of network configurations. NLB's health check system began checking these new instances before their network state fully propagated, causing health checks to alternate between failing and passing as network configs were applied. This flapping behavior overloaded the health check subsystem, triggered automatic AZ DNS failovers, and removed healthy capacity from service, resulting in connection errors for customers. Engineers resolved the issue by disabling automatic failovers at 9:36 AM, allowing all healthy nodes to return to service.
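Here's a rough sketch of why flapping health checks are so destructive. The probe logic, the threshold, and the coin-flip stand-in for "network config still propagating" are all invented, but the feedback loop is the same: a naive controller keeps flipping an entire AZ in and out of DNS even though most of the instances are actually fine.

```python
import random

def probe(instance):
    """Toy health probe. An instance whose network config hasn't propagated yet
    answers only some of the time, so back-to-back probes flap pass/fail.
    (random() is a stand-in for configs being applied bit by bit.)"""
    return instance["net_config_ready"] or random.random() < 0.5

def evaluate_az(instances, failover_threshold=0.5):
    healthy = sum(probe(i) for i in instances)
    fraction = healthy / len(instances)
    # Naive rule: if too few probes pass, fail the whole AZ out of DNS.
    return fraction, fraction < failover_threshold

# Twenty freshly launched instances whose network state is still propagating.
az = [{"net_config_ready": False} for _ in range(20)]
for round_num in range(5):
    fraction, failed_over = evaluate_az(az)
    print(f"round {round_num}: healthy={fraction:.0%} failover={failed_over}")
```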
AWS has disabled the DynamoDB DNS Planner and Enactor automation worldwide while they fix the race condition and add protections to prevent incorrect DNS plans from being applied. They're also adding velocity controls to NLB so it can't remove too much capacity at once during health check failures. For EC2, they're building better test suites to catch DWFM recovery issues and improving throttling to handle high load without collapsing. Plus, they're doing a full investigation across all services to find more ways to prevent this kind of thing and speed up recovery times.
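In spirit, a "velocity control" of the kind AWS describes might look roughly like the sketch below. AWS hasn't published the actual mechanism, so the function name, the 20% cap, and everything else here are made up for illustration.

```python
def apply_health_results(in_service, flagged_unhealthy, max_removal_fraction=0.2):
    """Toy velocity control: no single round of health-check failures may pull
    more than a fixed fraction of targets out of service, however many look
    unhealthy at that moment."""
    allowed = int(len(in_service) * max_removal_fraction)
    candidates = [t for t in flagged_unhealthy if t in in_service]
    to_remove = set(candidates[:allowed])
    return [t for t in in_service if t not in to_remove]

fleet = [f"i-{n:03d}" for n in range(20)]
flagged = fleet[:15]                              # a flapping round flags 15 of 20
print(len(apply_health_results(fleet, flagged)))  # 16 -> only 4 were removed
```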
AWS (Amazon Web Services) - Amazon's cloud computing platform that provides servers, storage, databases, and other services that power a huge chunk of the internet.
US-EAST-1 - AWS's largest and oldest data center region, located in Northern Virginia. Think of it as the "main office" of AWS—when it goes down, a lot of the internet goes with it.
DynamoDB - AWS's NoSQL database service. It's fast, scalable, and apparently, a single point of failure for a lot of other services.
EC2 (Elastic Compute Cloud) - AWS's virtual server service. Basically, you rent computer power in the cloud instead of buying physical servers.
NLB (Network Load Balancer) - Distributes incoming network traffic across multiple servers to prevent any single server from getting overwhelmed. Like a traffic cop for data.
DNS (Domain Name System) - The internet's phonebook. It translates human-readable domain names (like dynamodb.us-east-1.amazonaws.com) into IP addresses that computers can understand. When DNS breaks, nothing can find anything.
Route 53 - AWS's DNS service. It's what tells your computer where to find AWS services.
Availability Zone (AZ) - Independent data centers within an AWS region. US-EAST-1 has multiple AZs for redundancy—if one fails, the others should keep working. (Spoiler: didn't help this time.)
Droplet - AWS's internal term for the physical servers that host EC2 instances. Not to be confused with DigitalOcean's droplets.
Lease - A time-limited claim on a resource. In this case, DWFM "leases" physical servers, meaning it has permission to manage them for a set period. When leases expire without renewal, chaos ensues.
DWFM (DropletWorkflow Manager) - The system that manages all the physical servers (droplets) used by EC2. When it can't talk to DynamoDB, it can't verify server leases, and everything grinds to a halt.
Network Manager - Handles network configuration for EC2 instances. Makes sure your servers can actually talk to the internet.
Health Check - Automated tests that verify if a server is working properly. NLB uses these to decide which servers should receive traffic.
Cascading Failure - When one component fails and causes other dependent components to fail in a domino effect. In this case: DynamoDB → EC2 → NLB → everything else.
Race Condition - A bug that occurs when the timing of events matters. Two processes trying to do things simultaneously can produce unpredictable results—like two people trying to walk through a doorway at the same time.
Congestive Collapse - When a system gets so overwhelmed with retry attempts that it can't process anything, making the problem worse. Like a traffic jam where honking doesn't make it move faster.
DNS Planner - The part of DynamoDB's DNS system that monitors health and creates plans for which IP addresses should be in Route53.
DNS Enactor - The part that actually applies the DNS Planner's plans to Route53. Designed to have minimal dependencies so it can help recover from failures. (Ironic, given what happened.)
IP Address - The actual numeric address that computers use to find each other on the internet (e.g., 192.168.1.1).
Endpoint - The URL or address where you access a service (e.g., dynamodb.us-east-1.amazonaws.com).
AWS Official Post-Mortem: Summary of the Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region - AWS's detailed technical post-event summary explaining the race condition, cascading failures, and recovery efforts (October 19-20, 2025)
A Single DNS Race Condition Brought AWS to Its Knees - The Register's technical analysis of the outage and its impacts (October 23, 2025)
AWS Health Dashboard - Real-time status updates for AWS services across all regions
Race Condition - Wikipedia - Comprehensive overview of race conditions in computing systems
Handling Race Condition in Distributed Systems - GeeksforGeeks guide on detecting and managing race conditions in distributed architectures