October 29, 2025
Cloud crash, comments clash
More than DNS: Learnings from the 14-hour AWS outage
Brain drain, bad reboot, or cloud eating its own tail? The comments are brutal
TLDR: A DNS tool bug at AWS snowballed into 140 service failures, with slow EC2 recovery dragging out the outage. Commenters feud over brain drain, the wisdom of using AWS's own tools to fix AWS, and weak reboot drills (with hotel-lobby war-room jokes on the side), all underscoring how fragile critical cloud plumbing can be.
Amazon’s cloud had a regional meltdown, and the internet’s peanut gallery turned into a war room. AWS says a race in the tool that updates DNS—the internet’s address book—blanked a key entry for DynamoDB (a core database), cascading into 140 services and a long recovery. But commenters argue the real villain wasn’t “always DNS,” it was EC2—the virtual servers—getting stuck and needing an elaborate, hours-long reboot plan. One veteran put it this way: the DNS bump got fixed; EC2 was the slow-burn disaster.
Then came the drama. One camp yells brain drain: did losing veterans make the response clumsy? Another demands a cloud “black start”: don’t use your own tools to fix your own tools—keep a tiny, independent backup path, the “Honda generator.” Others dissected the writeup, spotting missing safeguards (“no lock on the DNS updater?”) and warning that shortcuts only show when you must start from zero.
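The "no lock on the DNS updater?" jab amounts to asking for serialized, versioned writes. Here is a minimal sketch of that kind of guard using optimistic concurrency (compare-and-set); the `VersionedRecord` class and its API are illustrative assumptions, not anything from AWS's actual system:

```python
import threading

class VersionedRecord:
    """A DNS-style record guarded by a version check: every writer must
    name the version it read, so a stale writer is rejected instead of
    silently clobbering newer data."""

    def __init__(self, value: str):
        self._lock = threading.Lock()  # serializes each check-and-write pair
        self.value = value
        self.version = 0

    def read(self) -> tuple[str, int]:
        with self._lock:
            return self.value, self.version

    def compare_and_set(self, expected_version: int, new_value: str) -> bool:
        """Apply the write only if nobody else has written since our read."""
        with self._lock:
            if self.version != expected_version:
                return False  # lost the race: refuse rather than clobber
            self.value = new_value
            self.version += 1
            return True

# Two updaters both read version 0; the slower, staler one is rejected.
record = VersionedRecord("plan-0")
_, seen = record.read()                            # both saw version 0
assert record.compare_and_set(seen, "plan-2")      # faster updater wins
assert not record.compare_and_set(seen, "plan-1")  # stale write bounced
print(record.value)  # plan-2 survives
```

With a guard like this, losing a race becomes a visible, retryable error instead of a silent overwrite.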
The humor didn’t stop: one commenter imagined engineers debugging in a hotel lobby under fluorescent lights. The verdict? Read the AWS summary, but the comments say the real lessons are resilience drills, faster clean restarts, and fewer single points of cloud ego.
Key Points
- AWS us-east-1 had a 16+ hour outage, impacting 140 services and blowing SLAs; the author expects significant revenue impact across affected users.
- AWS’s public postmortem structures the incident around DynamoDB, EC2, and NLB (Network Load Balancer), with 137 other services affected, including Lambda, IAM, STS, ElastiCache, ECR, and Secrets Manager.
- The initial trigger was a latent race condition in DynamoDB’s DNS management system, which produced an incorrect empty DNS record for dynamodb.us-east-1.amazonaws.com.
- DynamoDB and EC2 are described as bedrock “layer zero” services within AWS; their failure led to a cascade affecting roughly 70% of services in us-east-1.
- DNS Enactors ran independently across us-east-1a, 1b, and 1c for resiliency but performed uncoordinated mutations, contributing to the time-of-check-to-time-of-use (TOCTOU) failure mode sketched below.
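To make that TOCTOU failure mode concrete, here is a toy Python reconstruction of the sequence the postmortem describes: an older, stalled plan application races a newer one, then a cleanup pass deletes whatever it judges stale. The `enactor` and `cleanup` functions, plan IDs, and timings are illustrative assumptions, not AWS’s actual code:

```python
import threading
import time

RECORD = "dynamodb.us-east-1.amazonaws.com"
dns = {RECORD: 0}  # toy DNS table: endpoint -> id of the plan it reflects

def enactor(plan_id: int, stall: float) -> None:
    """One independent DNS Enactor: check that its plan is newer, then write.
    The gap between the check and the write is the TOCTOU window."""
    if plan_id > dns[RECORD]:  # time of check: plan looks newer
        time.sleep(stall)      # enactor stalls (slow run, retries, ...)
        dns[RECORD] = plan_id  # time of use: the check may now be stale

def cleanup(newest_plan: int) -> None:
    """Garbage-collect records belonging to plans older than the newest.
    If a stale write won the race above, the live record itself is judged
    old and removed, leaving the endpoint with an empty DNS record."""
    if dns[RECORD] < newest_plan:
        dns[RECORD] = None

# An old, stalled plan races a newer one; both pass their freshness check.
slow = threading.Thread(target=enactor, args=(1, 0.2))  # older plan, stalls
fast = threading.Thread(target=enactor, args=(2, 0.0))  # newer plan, quick
slow.start(); fast.start()
slow.join(); fast.join()

cleanup(newest_plan=2)
print(dns)  # {'dynamodb.us-east-1.amazonaws.com': None}
```

Run as written, the stale write lands last, cleanup judges the live record old and deletes it, and the endpoint resolves to nothing, the same blank-record end state the postmortem attributes to dynamodb.us-east-1.amazonaws.com.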