We Uncovered a Race Condition in Aurora RDS

Failover fiasco: are we paying for cloud magic or playing risk roulette?

TL;DR: Hightouch hit a confirmed bug in Amazon’s Aurora database during a failover-based upgrade, after an earlier AWS outage had stacked up a backlog. The crowd is split between blaming cloud design, questioning whether failovers should break under normal traffic, and wondering if pricey Aurora underperforms versus simpler setups.

Hightouch tried a routine upgrade on Amazon’s fancy cloud database, Aurora, to clear a backlog from an earlier AWS outage—and promptly tripped a race-condition bug. AWS later confirmed it. The community? Absolutely buzzing. Some readers went full “we told you so,” warning that adding extra readers (machines that only read data) is a slippery way to think you’re scaling, while the real bottlenecks stay put. Others were skeptical: if failovers (switching who’s in charge of writes) break when you keep normal traffic going, how is anyone using this safely?
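The "keep normal traffic going" criticism has a common mitigation that the discussion alludes to: pause writes before a manual failover, then resume once the new writer is healthy. A minimal sketch of that idea, assuming a single-process application (the `WriteGate` class and `do_write` helper are hypothetical, not Hightouch's code):

```python
import threading

class WriteGate:
    """Illustrative sketch: a gate the application checks before each
    write, so write traffic can be paused around a manual failover."""

    def __init__(self):
        self._open = threading.Event()
        self._open.set()  # writes allowed by default

    def pause(self):
        """Call before triggering the failover."""
        self._open.clear()

    def resume(self):
        """Call once the new writer is confirmed healthy."""
        self._open.set()

    def wait_until_writable(self, timeout=30.0):
        """Block a writer thread until writes are allowed again.
        Returns False if the gate stayed closed past the timeout."""
        return self._open.wait(timeout)

gate = WriteGate()

def do_write(record):
    # Hypothetical write path: wait out the failover window instead of
    # hammering a writer that is mid-promotion.
    if not gate.wait_until_writable():
        raise TimeoutError("failover took too long; queue the write for retry")
    # ... perform the actual INSERT/UPDATE against the database here ...
    return True
```

In a multi-service setup the same gate would have to live in shared coordination (a feature flag, Kafka pause, etc.), which is exactly why commenters argue manual failovers under live write traffic are hard to get right.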

Performance and price set off a second firestorm. One camp says Aurora’s insert speed and costs feel off compared to plain old RDS (regular AWS database) on beefy storage. Another camp loves Aurora’s design: separating compute from storage sounds like cloud magic, and it’s supposed to deliver fast recovery and easy scaling. Cue memes: “fast failover” became fast fall-over, and someone joked about paying a premium “to babysit failovers.”

The mood swings between trust issues (“I pay AWS to avoid this kind of chaos”) and design debates (“this model is powerful, but fragile under pressure”). Bottom line: Hightouch found a real bug, but the crowd’s bigger question is whether cloud convenience is worth the surprise plot twists when things get busy.

Key Points

  • An Oct 20 AWS us-east-1 outage caused a large processing backlog for Hightouch’s Events system.
  • Hightouch attempted an Aurora PostgreSQL upgrade on Oct 23 and encountered a race condition bug in Aurora RDS.
  • AWS later confirmed the Aurora race condition as an AWS-side bug.
  • Hightouch’s architecture scales via Kubernetes, Kafka, and Postgres; Kafka maintained durability but queues ballooned due to service failures.
  • The upgrade plan relied on Aurora fast failover: add a replica, promote an upgraded reader to writer with <15s expected downtime, then upgrade the old writer.
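A plan that budgets for under 15 seconds of writer downtime usually pairs with client-side retries, so the promotion looks like a brief stall rather than a flood of errors. A minimal sketch of that pattern, assuming the application can wrap each write in a function (the `retry_write` helper and its parameters are illustrative, not from the post):

```python
import random
import time

def retry_write(write_fn, max_wait=15.0, base_delay=0.25):
    """Illustrative sketch: retry a failed write with jittered exponential
    backoff so a sub-15s writer failover is absorbed instead of surfaced.
    `write_fn` is any zero-argument callable that raises ConnectionError
    while the writer is unavailable."""
    deadline = time.monotonic() + max_wait
    delay = base_delay
    while True:
        try:
            return write_fn()
        except ConnectionError:
            if time.monotonic() + delay > deadline:
                raise  # failover exceeded the expected window; surface it
            # Jitter avoids a thundering herd of retries hitting the
            # freshly promoted writer at the same instant.
            time.sleep(delay * random.uniform(0.5, 1.0))
            delay = min(delay * 2, 2.0)
```

Usage would look like `retry_write(lambda: cursor.execute(insert_sql))`; of course, per the race condition Hightouch hit, even a well-behaved retry loop cannot save a failover that breaks on the server side.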

Hottest takes

"adding read replicas as a way to scale is a slippery slope" — redwood
"manually triggered failovers will always fail if your application tries to maintain its normal write traffic" — gtowey
"Aurora seems to be especially slow for inserts and also quite expe..." — jansommer
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.