Nameet Rajore

Anatomy of the AWS US-EAST-1 Outage

/ 14 min read


New to AWS? Check out the glossary at the bottom for definitions.

Overview

AWS experienced a 15-hour outage in the US-EAST-1 region that impacted services like Snapchat, Reddit, and Venmo. Estimated losses exceed half a billion dollars… ouch.

For context, US-EAST-1 is AWS’s largest and oldest region—it’s basically the backbone of the internet. When it goes down, you notice.

Here is my attempt to explain what happened to anybody who wants to know.

graph LR
    subgraph "DynamoDB DNS Management System"
        DP[DNS Planner<br/>Monitors health & capacity<br/>Creates DNS plans]
        DE1[DNS Enactor AZ-1<br/>Applies plans to Route53]
        DE2[DNS Enactor AZ-2<br/>Applies plans to Route53]
        DE3[DNS Enactor AZ-3<br/>Applies plans to Route53]
        R53[Amazon Route53<br/>DNS Records]
        DP --> DE1
        DP --> DE2
        DP --> DE3
        DE1 --> R53
        DE2 --> R53
        DE3 --> R53
    end
    subgraph "DynamoDB Service"
        DDB[DynamoDB Regional Endpoint<br/>dynamodb.us-east-1.amazonaws.com]
        LB1[Load Balancer 1]
        LB2[Load Balancer 2]
        LBN[Load Balancer N]
        R53 --> DDB
        DDB --> LB1
        DDB --> LB2
        DDB --> LBN
    end
    subgraph "EC2 Management Systems"
        DWFM[DropletWorkflow Manager<br/>Manages physical servers]
        NM[Network Manager<br/>Network state propagation]
        DROPLETS[Physical Servers<br/>Droplets]
        DWFM --> DROPLETS
        NM --> DROPLETS
        DWFM -.->|depends on| DDB
    end
    subgraph "Network Load Balancer"
        NLB[Network Load Balancer]
        HC[Health Check Subsystem]
        NLBNODES[NLB Nodes]
        HC --> NLBNODES
        NLB --> NLBNODES
        HC -.->|health checks| DROPLETS
    end
    subgraph "Affected AWS Services"
        LAMBDA[AWS Lambda]
        ECS[Amazon ECS/EKS/Fargate]
        CONNECT[Amazon Connect]
        STS[AWS STS]
        CONSOLE[AWS Management Console]
        REDSHIFT[Amazon Redshift]
        SUPPORT[AWS Support]
        OTHER[Other AWS Services]
    end
    %% Dependencies
    DDB -.->|dependency| LAMBDA
    DDB -.->|dependency| ECS
    DDB -.->|dependency| CONNECT
    DDB -.->|dependency| STS
    DDB -.->|dependency| CONSOLE
    DDB -.->|dependency| REDSHIFT
    DDB -.->|dependency| SUPPORT
    DDB -.->|dependency| OTHER
    DROPLETS -.->|dependency| LAMBDA
    DROPLETS -.->|dependency| ECS
    NLB -.->|dependency| CONNECT
    NLB -.->|dependency| LAMBDA
    %% Failure cascade
    R53 -.->|DNS failure| DDB
    DDB -.->|cascade| DWFM
    DWFM -.->|cascade| NM
    NM -.->|cascade| NLB
    classDef failure fill:#ffcccc,stroke:#ff0000,stroke-width:2px
    classDef affected fill:#ffffcc,stroke:#ffaa00,stroke-width:2px
    class R53,DDB failure
    class DWFM,NM,NLB,LAMBDA,ECS,CONNECT,STS,CONSOLE,REDSHIFT,SUPPORT,OTHER affected

What exactly went wrong in DynamoDB?

DynamoDB's DNS management system is split into two components: the DNS Planner and the DNS Enactor.

The DNS Planner monitors the health and capacity of DynamoDB's load balancers and periodically creates a new DNS plan, which defines what percentage of traffic each load balancer should receive.

The DNS Enactor, which is designed with minimal dependencies so that it can aid recovery in any scenario, enacts those plans by applying the required changes to the Amazon Route53 service. Each Availability Zone runs its own Enactor, as shown in the diagram above.

The way things normally work is: a DNS Enactor picks up the latest plan produced by the DNS Planner, checks that it is newer than the plan currently in Route53, and only then starts applying it.

In this case, right before the update started, one of the Enactors began experiencing unusually high delays and had to retry several times. Meanwhile, the DNS Planner kept producing newer plans, which a second Enactor picked up and applied to Route53. The timing of these two processes is what created the race condition. When the second Enactor finished, it triggered a clean-up process that deletes older, previously applied plans. Just as that clean-up kicked off, the delayed first Enactor finally applied its stale plan, overwriting the newer one. The clean-up then deleted that stale plan, because it was many generations older than the plan the second Enactor had just applied. Deleting the active plan immediately removed all IP addresses for the regional endpoint, leaving the system in an inconsistent state that also prevented any subsequent plan updates from being applied by any Enactor.
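To make the failure mode concrete, here is a minimal, illustrative sketch in Python of the check-then-act race described above. The plan IDs, data structures, and function names are hypothetical; the real Planner/Enactor internals are not public. The key point is that the "is this plan newer?" check and the clean-up step each act on information that can go stale in between.

# What Route53 is currently serving for the regional endpoint, plus the set
# of plans that have been applied and kept around (all names hypothetical).
dns_record = {"plan_id": 0, "ips": ["10.0.0.1"]}
stored_plans = {0: ["10.0.0.1"]}

def apply_plan(plan_id, ips):
    # Check-then-act: verify the plan is newer than what is live, then write.
    # If an Enactor stalls between the check and the write, the write can land
    # on top of a newer plan applied by another Enactor in the meantime.
    if plan_id > dns_record["plan_id"]:
        dns_record.update(plan_id=plan_id, ips=ips)
        stored_plans[plan_id] = ips

def cleanup(current_plan_id, keep=1):
    # Delete plans many generations older than the one just applied. It never
    # re-checks which plan Route53 is actually serving at this moment.
    for old_id in [p for p in stored_plans if p < current_plan_id - keep]:
        del stored_plans[old_id]
        if dns_record["plan_id"] == old_id:
            dns_record.update(plan_id=None, ips=[])   # endpoint left empty

# The interleaving from the incident, compressed into four steps:
a_check_ok = 1 > dns_record["plan_id"]   # slow Enactor A passes the check for plan 1, then stalls
apply_plan(5, ["10.0.0.9"])              # fast Enactor B applies the much newer plan 5
if a_check_ok:                           # A's delayed write finally lands, overwriting plan 5
    dns_record.update(plan_id=1, ips=["10.0.0.2"])
    stored_plans[1] = ["10.0.0.2"]
cleanup(current_plan_id=5)               # B's clean-up deletes plan 1, which is now live
print(dns_record)                        # {'plan_id': None, 'ips': []} -> empty DNS answer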

sequenceDiagram
    participant DP as DNS Planner
    participant DE1 as DNS Enactor A (Slow)
    participant DE2 as DNS Enactor B (Fast)
    participant R53 as Route53
    participant CLEANUP as Cleanup Process

    Note over DP, CLEANUP: Normal Operation
    DP->>DE1: Plan X (older)
    DP->>DE2: Plan X (older)

    Note over DE1: Starts applying Plan X
    DE1->>R53: Check: Plan X newer than current? ✓
    DE1->>R53: Begin applying Plan X to endpoints

    Note over DE1: ⚠️ Experiences unusual delays
    DE1-->>R53: Retry endpoint 1... (blocked)
    DE1-->>R53: Retry endpoint 2... (blocked)

    Note over DP: Continues generating newer plans
    DP->>DE2: Plan Y (newer)
    DP->>DE2: Plan Z (newest)

    Note over DE2: Picks up Plan Z (newest)
    DE2->>R53: Check: Plan Z newer than current? ✓
    DE2->>R53: Apply Plan Z to all endpoints (fast)
    DE2->>R53: ✅ Plan Z applied successfully

    Note over DE2: Triggers cleanup after success
    DE2->>CLEANUP: Invoke cleanup process

    Note over DE1, CLEANUP: 🚨 RACE CONDITION WINDOW

    par
        DE1->>R53: Apply old Plan X to regional endpoint
        Note over DE1: Overwrites newer Plan Z!
    and
        CLEANUP->>R53: Delete old plans (including Plan X)
        Note over CLEANUP: Deletes the plan that was just applied!
    end

    Note over R53: 💥 FAILURE STATE
    R53-->>R53: Regional endpoint has empty DNS record
    R53-->>R53: System in inconsistent state

    Note over DP, CLEANUP: 11:48 PM PDT - DNS Resolution Fails
    Note over DP, CLEANUP: All DynamoDB connections fail

    rect rgb(255, 200, 200)
        Note over DP, CLEANUP: Manual Intervention Required
        Note over DP, CLEANUP: 12:38 AM - Engineers identify DNS issue
        Note over DP, CLEANUP: 1:15 AM - Temporary mitigations applied
        Note over DP, CLEANUP: 2:25 AM - DNS information restored
        Note over DP, CLEANUP: 2:40 AM - Full recovery (DNS cache expiry)
    end

And why did this affect the EC2 service?

To understand what happened with EC2 instances, let's first define two subsystems that the EC2 management system uses:

  1. DropletWorkflow Manager (DWFM)
  2. Network Manager

The DWFM is responsible for managing all the underlying physical servers, also called “droplets”, that EC2 uses to host instances.

The Network Manager is responsible for managing and propagating network state to all EC2 instances and network appliances.

Each DWFM manages physical servers in one Availability Zone and maintains a lease on each server that it is currently managing. The lease tracks server state and ensures that instance operations (like shutdown or reboot) are properly reflected across EC2 systems.
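As a rough mental model (with assumed names and an assumed TTL, since the real DWFM implementation is not public), a lease is a time-limited claim that stays alive only as long as periodic state checks keep succeeding, and those state checks depend on DynamoDB:

import time

LEASE_TTL_SECONDS = 300        # illustrative value, not AWS's actual TTL

class DropletLease:
    def __init__(self, droplet_id):
        self.droplet_id = droplet_id
        self.expires_at = time.time() + LEASE_TTL_SECONDS

    def renew(self, state_check):
        # The lease is only extended if the DynamoDB-backed state check succeeds.
        if state_check(self.droplet_id):
            self.expires_at = time.time() + LEASE_TTL_SECONDS

    @property
    def active(self):
        return time.time() < self.expires_at

def placeable_droplets(leases):
    # New instances can only be placed on droplets with an active lease, which
    # is why expired leases surface to customers as an "Insufficient Capacity" error.
    return [lease.droplet_id for lease in leases if lease.active]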

When DynamoDB went down, the state checks performed by DWFM started failing, resulting in lease timeouts. Since droplets without an active lease cannot take new work, whenever a customer tried launching a new instance, an “Insufficient Capacity” error was returned.

Once the DNS issue was resolved, a huge number of droplets tried to re-establish their leases all at once. Those requests started to time out, the timed-out work was retried, the backlog kept growing, and DWFM slid into congestive collapse.
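A toy simulation (with invented numbers) shows why this becomes a collapse rather than a slow recovery: once queued renewals wait longer than the caller's timeout, DWFM spends its capacity on requests whose callers have already given up and retried, so the backlog never shrinks.

from collections import deque

DROPLETS = 10_000            # leases to re-establish (invented)
CAPACITY_PER_SEC = 200       # renewals DWFM can process per second (invented)
TIMEOUT_SECONDS = 5          # a renewal that waits longer than this is retried

queue = deque((d, 0) for d in range(DROPLETS))    # (droplet_id, enqueued_at_second)
second = 0
while queue and second < 120:
    second += 1
    processed = 0
    while queue and processed < CAPACITY_PER_SEC:
        droplet, enqueued_at = queue.popleft()
        processed += 1                            # server effort is spent either way
        if second - enqueued_at > TIMEOUT_SECONDS:
            queue.append((droplet, second))       # caller timed out: wasted work, retried

print(f"after {second}s the queue still holds {len(queue)} renewals")
# Throughput collapses to near zero; only draining the queues (or restarting
# the hosts, as the engineers did) breaks the cycle.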

The issue was finally resolved when the engineers manually restarted DWFM hosts and cleared the queues.

sequenceDiagram
    participant DDB as DynamoDB
    participant DWFM as DWFM
    participant DROPLETS as Physical Servers
    participant NM as Network Manager
    participant API as EC2 API
    participant CUSTOMERS as Customers

    Note over DDB, CUSTOMERS: Normal Operations
    loop Every few minutes
        DWFM->>DDB: State check & lease validation
        DDB-->>DWFM: ✅ Success
        DWFM->>DROPLETS: Maintain lease
    end

    CUSTOMERS->>API: Launch new EC2 instance
    API->>DWFM: Request capacity
    DWFM->>DROPLETS: Allocate from leased servers
    DROPLETS-->>DWFM: ✅ Server allocated
    NM->>DROPLETS: Configure network for new instance
    API-->>CUSTOMERS: ✅ Instance launched

    Note over DDB, CUSTOMERS: 🚨 11:48 PM - DynamoDB DNS Failure
    rect rgb(255, 200, 200)
        DWFM->>DDB: State check
        DDB-->>DWFM: ❌ DNS resolution failed
        Note over DWFM: Cannot complete state checks
        Note over DROPLETS: Leases begin timing out
    end

    Note over DDB, CUSTOMERS: 11:48 PM - 2:25 AM: Lease Timeout Period
    CUSTOMERS->>API: Launch new EC2 instance
    API->>DWFM: Request capacity
    DWFM->>DROPLETS: Check available servers
    DROPLETS-->>DWFM: ❌ No servers with active leases
    API-->>CUSTOMERS: ❌ "Insufficient capacity"

    Note over DDB, CUSTOMERS: 2:25 AM - DynamoDB Recovers
    rect rgb(200, 255, 200)
        DWFM->>DDB: State check
        DDB-->>DWFM: ✅ Success
    end

    Note over DDB, CUSTOMERS: 2:25 AM - 5:28 AM: Congestive Collapse
    rect rgb(255, 255, 200)
        par Massive lease re-establishment
            DWFM->>DROPLETS: Re-establish lease 1
            DWFM->>DROPLETS: Re-establish lease 2
            DWFM->>DROPLETS: Re-establish lease N
        end
        Note over DWFM: Too many simultaneous requests
        Note over DWFM: Lease attempts timeout
        Note over DWFM: Work queues up → System overload

        CUSTOMERS->>API: Launch new EC2 instance
        API-->>CUSTOMERS: ❌ "Insufficient capacity"

        Note over DWFM: 4:14 AM - Engineers restart DWFM hosts
        Note over DWFM: Queues cleared, processing normalized
    end

    Note over DDB, CUSTOMERS: 5:28 AM - DWFM Recovery Complete
    DWFM->>DROPLETS: ✅ All leases re-established

    Note over DDB, CUSTOMERS: 5:28 AM - 10:36 AM: Network Manager Backlog
    rect rgb(200, 200, 255)
        CUSTOMERS->>API: Launch new EC2 instance
        API->>DWFM: Request capacity
        DWFM->>DROPLETS: ✅ Server allocated
        NM->>DROPLETS: Configure network (delayed)
        Note over NM: Processing backlog of network configs
        API-->>CUSTOMERS: ✅ Instance launched (no network)

        Note over NM: 6:21 AM - Increased latencies
        Note over NM: 10:36 AM - Backlog cleared
        NM->>DROPLETS: ✅ Network configured
    end

    Note over DDB, CUSTOMERS: 10:36 AM - 1:50 PM: Throttle Removal
    Note over API: 11:23 AM - Begin removing throttles
    CUSTOMERS->>API: Launch new EC2 instance
    API->>DWFM: Request capacity
    DWFM->>DROPLETS: ✅ Server allocated
    NM->>DROPLETS: ✅ Network configured
    API-->>CUSTOMERS: ✅ Instance launched successfully

    Note over DDB, CUSTOMERS: 1:50 PM - Full Recovery

And the NLB was affected because…?

The NLB outage was caused by a timing mismatch between EC2 instance launches and network configuration propagation. After DWFM recovered at 5:28 AM, new EC2 instances launched successfully but Network Manager was overwhelmed processing a backlog of network configurations. NLB’s health check system began checking these new instances before their network state fully propagated, causing health checks to alternate between failing and passing as network configs were applied. This flapping behavior overloaded the health check subsystem, triggered automatic AZ DNS failovers, and removed healthy capacity from service, resulting in connection errors for customers. Engineers resolved the issue by disabling automatic failovers at 9:36 AM, allowing all healthy nodes to return to service.
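Here is a stripped-down sketch of that flapping behaviour, with hypothetical health-check results since the actual check cadence and thresholds are not public: when a target alternates between failing and passing while its network config is still propagating, a failover system that reacts to every transition keeps yanking capacity that is actually fine.

import itertools

# While Network Manager's backlog was draining, the same new instance could
# answer one health check and miss the next.
results = itertools.cycle([False, True])      # fail, pass, fail, pass, ...

in_service = True
removals = 0
for _ in range(10):                           # ten consecutive checks
    healthy = next(results)
    if not healthy and in_service:
        in_service = False                    # automatic failover removes capacity
        removals += 1
    elif healthy and not in_service:
        in_service = True                     # capacity added back, until the next failure
print(f"{removals} capacity removals in 10 checks for a node that was never actually broken")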

sequenceDiagram
    participant DWFM as DWFM
    participant EC2 as New EC2 Instance
    participant NM as Network Manager
    participant HC as NLB Health Check
    participant NLB as NLB Service
    participant DNS as DNS/Route53
    participant CUSTOMER as Customer

    Note over DWFM, CUSTOMER: 5:28 AM - DWFM Recovery Complete

    DWFM->>EC2: ✅ Launch new instance
    Note over EC2: Instance running but network incomplete

    par Network backlog processing
        NM-->>EC2: Network config pending...
    and Health check starts
        HC->>EC2: Health check
        EC2-->>HC: ❌ Network not ready
        HC->>NLB: Mark as unhealthy
        NLB->>DNS: Remove from service
    end

    NM->>EC2: ✅ Network config applied

    HC->>EC2: Health check
    EC2-->>HC: ✅ Now healthy
    HC->>NLB: Mark as healthy
    NLB->>DNS: Add back to service

    Note over HC, DNS: Health checks flapping (fail/pass/fail/pass)

    loop Flapping cycle
        HC->>EC2: Health check
        alt Network timing issue
            EC2-->>HC: ❌ Temporary network issue
            HC->>NLB: Remove from service
            NLB->>DNS: Remove from DNS
        else Network working
            EC2-->>HC: ✅ Healthy
            HC->>NLB: Add to service
            NLB->>DNS: Add to DNS
        end
    end

    Note over HC, CUSTOMER: 6:21 AM - Health check system overloaded

    rect rgb(255, 200, 200)
        Note over HC: System degraded from flapping
        HC-->>NLB: Delayed health checks
        NLB->>DNS: Automatic AZ failover triggered
        Note over DNS: Capacity removed from service
        CUSTOMER->>NLB: Connection request
        NLB-->>CUSTOMER: ❌ Connection error (insufficient capacity)
    end

    Note over HC, CUSTOMER: 9:36 AM - Engineers disable automatic failover

    rect rgb(200, 255, 200)
        Note over NLB: Automatic failover disabled
        NLB->>DNS: Restore all healthy nodes
        CUSTOMER->>NLB: Connection request
        NLB-->>CUSTOMER: ✅ Connection successful
    end

    Note over HC, CUSTOMER: 2:09 PM - Re-enable automatic failover after full recovery

What is AWS doing to resolve this issue?

AWS has disabled the DynamoDB DNS Planner and Enactor automation worldwide while they fix the race condition and add protections to prevent incorrect DNS plans from being applied. They’re also adding velocity controls to NLB so it can’t remove too much capacity at once during health check failures. For EC2, they’re building better test suites to catch DWFM recovery issues and improving throttling to handle high load without collapsing. Plus, they’re doing a full investigation across all services to find more ways to prevent this kind of thing and speed up recovery times.
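The post-mortem does not describe how NLB's new velocity controls will work, so the following is only a generic sketch of the idea: cap how much capacity health-check-driven failover may remove within a sliding window, and keep nodes in service once that budget is spent.

import time

class RemovalVelocityLimit:
    # Illustrative only: the threshold and window are made up.
    def __init__(self, fleet_size, max_fraction_per_window=0.2, window_seconds=300):
        self.max_removals = max(1, int(max_fraction_per_window * fleet_size))
        self.window_seconds = window_seconds
        self.removal_times = []

    def allow_removal(self):
        now = time.time()
        # Drop removals that have fallen out of the sliding window.
        self.removal_times = [t for t in self.removal_times if now - t < self.window_seconds]
        if len(self.removal_times) < self.max_removals:
            self.removal_times.append(now)
            return True
        return False    # budget exhausted: keep the node in service despite the failed check

With a limit like this in place, flapping health checks could still mark individual nodes unhealthy, but they could not strip an entire AZ's worth of capacity out of DNS in one burst.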

What should services that were hit by this outage do about it?

  1. Multi-Region Failover: If your service runs in multiple regions, redirect traffic to unaffected regions during outages, even if it means degraded performance or higher latency. Some availability is always better than complete downtime. Set up automated failover mechanisms so you’re not scrambling to manually reroute traffic at 2 AM (a minimal sketch follows this list).
  2. Practice Disaster Recovery: Regularly simulate large-scale failures to test your disaster recovery procedures. Running these drills will expose weaknesses in your architecture, reveal hidden dependencies, and ensure your team knows exactly what to do when things go sideways. The middle of an actual outage is not the time to discover your failover doesn’t work.
  3. Evaluate Dependencies and SLAs: Calculate the actual cost of downtime versus the cost of building redundancy across regions or cloud providers. Assess whether your service SLAs realistically account for your dependencies’ SLAs (like AWS). For some services, accepting the risk might be more cost-effective than maintaining complex multi-region architectures. Make an informed decision based on your numbers, not assumptions.
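For item 1, here is a minimal client-side sketch, assuming hypothetical health-check URLs in two regions. In practice this usually lives in DNS (for example, Route53 health checks) or a global load balancer rather than in application code, but the logic is the same: prefer the primary region and fall back when it stops answering.

import urllib.error
import urllib.request

REGION_ENDPOINTS = [
    "https://api.us-east-1.example.com/health",   # primary (placeholder URL)
    "https://api.us-west-2.example.com/health",   # secondary (placeholder URL)
]

def first_healthy_region(timeout_seconds=2):
    for endpoint in REGION_ENDPOINTS:
        try:
            with urllib.request.urlopen(endpoint, timeout=timeout_seconds) as resp:
                if resp.status == 200:
                    return endpoint               # route traffic to this region
        except (urllib.error.URLError, TimeoutError):
            continue                              # region unreachable or slow: try the next one
    raise RuntimeError("no healthy region found")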

Glossary

AWS (Amazon Web Services) - Amazon’s cloud computing platform that provides servers, storage, databases, and other services that power a huge chunk of the internet.

US-EAST-1 - AWS’s largest and oldest data center region, located in Northern Virginia. Think of it as the “main office” of AWS—when it goes down, a lot of the internet goes with it.

DynamoDB - AWS’s NoSQL database service. It’s fast, scalable, and apparently, a single point of failure for a lot of other services.

EC2 (Elastic Compute Cloud) - AWS’s virtual server service. Basically, you rent computer power in the cloud instead of buying physical servers.

NLB (Network Load Balancer) - Distributes incoming network traffic across multiple servers to prevent any single server from getting overwhelmed. Like a traffic cop for data.

DNS (Domain Name System) - The internet’s phonebook. It translates human-readable domain names (like dynamodb.us-east-1.amazonaws.com) into IP addresses that computers can understand. When DNS breaks, nothing can find anything.

Route53 - AWS’s DNS service. It’s what tells your computer where to find AWS services.

Availability Zone (AZ) - Independent data centers within an AWS region. US-EAST-1 has multiple AZs for redundancy—if one fails, the others should keep working. (Spoiler: didn’t help this time.)

Droplet - AWS’s internal term for the physical servers that host EC2 instances. Not to be confused with DigitalOcean’s droplets.

Lease - A time-limited claim on a resource. In this case, DWFM “leases” physical servers, meaning it has permission to manage them for a set period. When leases expire without renewal, chaos ensues.

DWFM (DropletWorkflow Manager) - The system that manages all the physical servers (droplets) used by EC2. When it can’t talk to DynamoDB, it can’t verify server leases, and everything grinds to a halt.

Network Manager - Handles network configuration for EC2 instances. Makes sure your servers can actually talk to the internet.

Health Check - Automated tests that verify if a server is working properly. NLB uses these to decide which servers should receive traffic.

Cascading Failure - When one component fails and causes other dependent components to fail in a domino effect. In this case: DynamoDB → EC2 → NLB → everything else.

Race Condition - A bug that occurs when the timing of events matters. Two processes trying to do things simultaneously can produce unpredictable results—like two people trying to walk through a doorway at the same time.

Congestive Collapse - When a system gets so overwhelmed with retry attempts that it can’t process anything, making the problem worse. Like a traffic jam where honking doesn’t make it move faster.

DNS Planner - The part of DynamoDB’s DNS system that monitors health and creates plans for which IP addresses should be in Route53.

DNS Enactor - The part that actually applies the DNS Planner’s plans to Route53. Designed to have minimal dependencies so it can help recover from failures. (Ironic, given what happened.)

IP Address - The actual numeric address that computers use to find each other on the internet (e.g., 192.168.1.1).

Endpoint - The URL or address where you access a service (e.g., dynamodb.us-east-1.amazonaws.com).

References

  1. AWS Official Post-Mortem: Summary of the Amazon DynamoDB Service Disruption in Northern Virginia (US-EAST-1) Region - AWS’s detailed technical post-event summary explaining the race condition, cascading failures, and recovery efforts (October 19-20, 2025)

  2. A Single DNS Race Condition Brought AWS to Its Knees - The Register’s technical analysis of the outage and its impacts (October 23, 2025)

  3. AWS Health Dashboard - Real-time status updates for AWS services across all regions

  4. Race Condition - Wikipedia - Comprehensive overview of race conditions in computing systems

  5. Handling Race Condition in Distributed Systems - GeeksforGeeks guide on detecting and managing race conditions in distributed architectures