Nameet Rajore

AWS us-east-1 outage

Overview

AWS experienced a roughly 15-hour outage in its US-EAST-1 region, which took down or degraded services like Snapchat, Reddit, and Venmo. The estimated losses are more than half a billion dollars... ouch.

For context, US-EAST-1 is AWS's largest and oldest region—it's basically the backbone of the internet. When it goes down, you notice.

Here is my attempt to explain what happened to anybody who wants to know.


What exactly went wrong in DynamoDB?

DynamoDB's DNS management system is split into two components: the DNS Planner and the DNS Enactor.

The DNS Planner monitors the health and capacity of the load balancers and periodically creates a new DNS plan, which defines which load balancers should serve the endpoint and what percentage of traffic each should receive.

The DNS Enactor, which is designed with minimal dependencies so the system can recover in any scenario, enacts DNS plans by applying the required changes to the Amazon Route 53 service.

Normally, a DNS Enactor picks up the latest plan produced by the DNS Planner, checks that it is newer than the plan currently in Route 53, and then applies it.

In this case, right before an update cycle began, one of the Enactors started experiencing unusually high delays and had to retry multiple times. Meanwhile, the DNS Planner kept producing newer plans, which a second Enactor picked up and applied in Route 53. The timing of these events is what created the race condition.

When the second Enactor finished, it triggered a clean-up process that deletes older, previously applied plans. Just as this clean-up was invoked, the first (delayed) Enactor applied its stale plan, overwriting the newer one. The second Enactor's clean-up then deleted that stale plan, since it was many generations older than the plan it had just applied. Because the now-active plan was deleted, all IP addresses for the endpoint were immediately removed, leaving the system in an inconsistent state that prevented any DNS Enactor from applying subsequent plan updates.
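The interleaving above can be sketched in a few lines of Python. This is a deliberately simplified model (the names `route53`, `plan_store`, and `check_freshness` are illustrative, not AWS's actual code), but it shows how a freshness check that is separate from the apply step lets a stale write slip through, and how an unconditional clean-up then deletes the plan that is actually live:

```python
# Simplified model of the DNS plan race. All names are illustrative.

route53 = {"active_plan": None}   # the plan generation Route 53 currently serves
plan_store = {}                   # every plan the Planner has produced

def check_freshness(plan_id):
    """Step 1: is this plan newer than what is in Route 53?"""
    active = route53["active_plan"]
    return active is None or plan_id > active

def apply_plan(plan_id):
    """Step 2: write the plan into Route 53. The gap between
    step 1 and step 2 is where the race lives."""
    route53["active_plan"] = plan_id

def cleanup(just_applied):
    """Delete plans many generations older than the one just applied,
    without re-checking whether one of them became active in the meantime."""
    for pid in list(plan_store):
        if pid < just_applied:
            del plan_store[pid]

# The interleaving that caused the outage:
plan_store[1] = "plan-1"
slow_check_passed = check_freshness(1)  # slow Enactor: check passes...
                                        # ...then it stalls on retries
plan_store[5] = "plan-5"                # Planner keeps producing newer plans
if check_freshness(5):
    apply_plan(5)                       # fast Enactor applies plan 5
if slow_check_passed:
    apply_plan(1)                       # stale write overwrites plan 5
cleanup(just_applied=5)                 # fast Enactor's clean-up deletes plan 1

print(route53["active_plan"])                # 1: the "active" plan...
print(route53["active_plan"] in plan_store)  # False: ...no longer exists
```

In the real system the deleted plan's endpoint records were simply gone, leaving an empty DNS answer that no Enactor could repair automatically.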


And why did this affect the EC2 service?

To understand what happened with EC2, let's define two subsystems that the EC2 management system uses:

  1. DropletWorkflow Manager (DWFM)
  2. Network Manager

The DWFM is responsible for managing the underlying physical servers, called "droplets", on which EC2 instances are hosted.

The Network Manager is responsible for managing and propagating network state to all EC2 instances and network appliances.

Each DWFM manages physical servers in one Availability Zone and maintains a lease on each server that it is currently managing. The lease tracks server state and ensures that instance operations (like shutdown or reboot) are properly reflected across EC2 systems.

When DynamoDB went down, the state checks DWFM performs (which depended on DynamoDB) started failing, resulting in lease timeouts. A droplet whose lease has expired cannot host new instances, so whenever a customer tried to launch one, an "Insufficient Capacity" error was returned.
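A minimal sketch of the lease mechanics, assuming a simple TTL model. The class, the 5-second TTL, and the `launch_instance` capacity check are hypothetical, for illustration only:

```python
import time

LEASE_TTL = 5.0  # seconds; illustrative, not AWS's real timeout

class Droplet:
    """A physical server whose lease DWFM must keep renewing."""
    def __init__(self, server_id):
        self.server_id = server_id
        self.lease_renewed_at = time.monotonic()

    def renew_lease(self, state_check_ok):
        # DWFM only renews the lease when the state check succeeds;
        # in the real system that check depended on DynamoDB.
        if state_check_ok:
            self.lease_renewed_at = time.monotonic()

    def lease_valid(self):
        return time.monotonic() - self.lease_renewed_at <= LEASE_TTL

def launch_instance(fleet):
    """Only droplets holding valid leases count as usable capacity."""
    usable = [d for d in fleet if d.lease_valid()]
    if not usable:
        raise RuntimeError("InsufficientCapacity")
    return usable[0]  # pick a droplet to host the new instance

fleet = [Droplet(f"srv-{i}") for i in range(3)]
print(launch_instance(fleet).server_id)  # srv-0: leases are fresh
```

When every state check fails, no lease gets renewed, `lease_valid()` eventually returns False fleet-wide, and every launch hits the `InsufficientCapacity` path even though the hardware itself is fine.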

Once the DNS issue was resolved, a large number of droplets tried to renew their leases at once. Renewal requests began to time out, timeouts triggered retries, the backlog grew, and the system fell into congestive collapse.
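A toy queue simulation makes the collapse visible. Every number here is invented; the point is that once waiting time exceeds the client timeout, the server burns all its capacity on requests whose callers have already given up and retried:

```python
from collections import deque

def simulate(backlog, capacity, timeout, ticks):
    """Each tick the server pops up to `capacity` requests from a FIFO queue.
    A request that has waited `timeout` ticks or more has been abandoned by
    its caller, so serving it is wasted work: the retry re-enters the back
    of the queue instead of leaving the system."""
    queue = deque(0 for _ in range(backlog))  # enqueue tick of each request
    renewed = 0
    for now in range(1, ticks + 1):
        for _ in range(capacity):
            if not queue:
                break
            enqueued = queue.popleft()
            if now - enqueued >= timeout:
                queue.append(now)  # timed out: retry goes to the back
            else:
                renewed += 1       # served in time: lease renewed
    return renewed, len(queue)

# 10,000 droplets renew at once; 50 served per tick; clients give up after 5 ticks.
print(simulate(backlog=10_000, capacity=50, timeout=5, ticks=100))
# (200, 9800): after tick 4 every request times out and retries; throughput hits zero

# Same load, but with clients that never time out: the backlog drains steadily.
print(simulate(backlog=10_000, capacity=50, timeout=10_000, ticks=100))
# (5000, 5000)
```

The contrast between the two runs is why clearing the queues (rather than adding capacity) was what finally broke the cycle.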

The issue was finally resolved when the engineers manually restarted DWFM hosts and cleared the queues.


And the NLB was affected because...?

The NLB outage was caused by a timing mismatch between EC2 instance launches and network configuration propagation. After DWFM recovered at 5:28 AM, new EC2 instances launched successfully but Network Manager was overwhelmed processing a backlog of network configurations. NLB's health check system began checking these new instances before their network state fully propagated, causing health checks to alternate between failing and passing as network configs were applied. This flapping behavior overloaded the health check subsystem, triggered automatic AZ DNS failovers, and removed healthy capacity from service, resulting in connection errors for customers. Engineers resolved the issue by disabling automatic failovers at 9:36 AM, allowing all healthy nodes to return to service.
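The flapping can be sketched as a health checker probing an instance whose network configuration only finishes propagating at a later tick. This is a toy model with invented numbers:

```python
# Toy model of health-check "flapping" during config propagation. Illustrative only.

def probe(tick, config_applied_at):
    """Before the network config fully propagates, probes succeed only
    intermittently; after propagation they always succeed."""
    if tick >= config_applied_at:
        return True
    return tick % 2 == 0  # partial config: alternating pass/fail

in_service = False
transitions = 0
for tick in range(10):
    healthy = probe(tick, config_applied_at=7)
    if healthy != in_service:
        in_service = healthy
        transitions += 1  # each flip triggers a DNS register/deregister

print(transitions)  # 7: the node churns in and out of DNS before settling
```

Multiply those flips across thousands of freshly launched instances and the health-check subsystem itself becomes the bottleneck, which is what pushed NLB into the automatic AZ failovers described above.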


What is AWS doing to resolve this issue?

AWS has disabled the DynamoDB DNS Planner and Enactor automation worldwide while they fix the race condition and add protections to prevent incorrect DNS plans from being applied. They're also adding velocity controls to NLB so it can't remove too much capacity at once during health check failures. For EC2, they're building better test suites to catch DWFM recovery issues and improving throttling to handle high load without collapsing. Plus, they're doing a full investigation across all services to find more ways to prevent this kind of thing and speed up recovery times.
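A velocity control of the kind AWS describes can be sketched as a cap on how many targets one health-check pass may remove. The 20% limit and all names here are assumptions for illustration:

```python
# Hypothetical sketch of a velocity control: bound how much capacity
# a single health-check pass may remove. The 20% cap is illustrative.

def apply_health_results(in_service, failing, max_removal_fraction=0.2):
    """Remove failing nodes from service, but never more than a fixed
    fraction of current capacity at once, even if more are failing."""
    cap = int(len(in_service) * max_removal_fraction)
    to_remove = set(n for n in in_service if n in failing)
    limited = list(to_remove)[:cap] if len(to_remove) > cap else to_remove
    return [n for n in in_service if n not in limited]

nodes = [f"nlb-node-{i}" for i in range(10)]
failing = set(nodes[:7])  # 7 of 10 flagged unhealthy (possibly falsely, due to flapping)
remaining = apply_health_results(nodes, failing)
print(len(remaining))     # 8: at most 2 removed, so most capacity stays in service
```

The trade-off is deliberate: a few genuinely bad nodes may linger briefly, but a wave of false failures can no longer empty the fleet.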

What should services that faced this outage do about this?

  1. Multi-Region Failover: If your service runs in multiple regions, redirect traffic to unaffected regions during outages, even if it means degraded performance or higher latency. Some availability is always better than complete downtime. Set up automated failover mechanisms so you're not scrambling to manually reroute traffic at 2 AM.
  2. Practice Disaster Recovery: Regularly simulate large-scale failures to test your disaster recovery procedures. Running these drills will expose weaknesses in your architecture, reveal hidden dependencies, and ensure your team knows exactly what to do when things go sideways. The middle of an actual outage is not the time to discover your failover doesn't work.
  3. Evaluate Dependencies and SLAs: Calculate the actual cost of downtime versus the cost of building redundancy across regions or cloud providers. Assess whether your service SLAs realistically account for your dependencies' SLAs (like AWS). For some services, accepting the risk might be more cost-effective than maintaining complex multi-region architectures. Make an informed decision based on your numbers, not assumptions.
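As a sketch of point 1, here is a minimal client-side failover loop over an ordered list of regional endpoints. The URLs are placeholders and `fetch` is any callable you supply (for example, a wrapper around `urllib.request.urlopen`); a production version would also need health-aware ordering, timeouts, and backoff:

```python
# Minimal client-side regional failover sketch. Endpoints are hypothetical.

REGIONS = [
    "https://api.us-east-1.example.com",
    "https://api.us-west-2.example.com",
    "https://api.eu-west-1.example.com",
]

def call_with_failover(path, fetch):
    """Try each region in order; the first success wins. `fetch(url)` should
    raise on failure and return the response body on success."""
    last_error = None
    for base in REGIONS:
        try:
            return fetch(base + path)
        except Exception as err:  # in production, catch specific error types
            last_error = err
    raise RuntimeError("all regions failed") from last_error

# Example: a fetch stub where us-east-1 is down (hypothetical).
def demo_fetch(url):
    if "us-east-1" in url:
        raise ConnectionError("region down")
    return "ok from " + url

print(call_with_failover("/health", demo_fetch))  # served by us-west-2
```

Degraded latency from a farther region beats a hard failure, and because the fallback order is static, the behavior is easy to reason about during an incident.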

References

  1. AWS Official Post-Mortem: Summary of the Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region - AWS's detailed technical post-event summary explaining the race condition, cascading failures, and recovery efforts (October 19-20, 2025)

  2. A Single DNS Race Condition Brought AWS to Its Knees - The Register's technical analysis of the outage and its impacts (October 23, 2025)

  3. AWS Health Dashboard - Real-time status updates for AWS services across all regions

  4. Race Condition - Wikipedia - Comprehensive overview of race conditions in computing systems

  5. Handling Race Condition in Distributed Systems - GeeksforGeeks guide on detecting and managing race conditions in distributed architectures