Cloudflare outage should not have happened

Internet faceplant sparks a brawl: bad code or reckless rollout?

TLDR: A Cloudflare outage was traced to a sloppy database query colliding with a permissions change, which triggered a crash loop. Commenters are split between blaming bad code and logic and blaming a global rollout done too fast, while others roast the blog itself as armchair critique. The stakes: how we prevent internet meltdowns matters.

Cloudflare tripped, the internet stumbled, and the comments lit up like a holiday tree. Eduardo Bellani’s post says the company’s own root cause analysis revealed a simple database query didn’t say which database to use, so a permissions change doubled the results, fed junk into a “Bot Management” file, and launched a crash loop. His big thesis: Cloudflare is treating a logic problem like a hardware problem—replicas won’t save you if your app and database rules clash. The crowd? Split and spicy.
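To make the "didn't say which database" point concrete, here is a minimal sketch in Rust (all names are illustrative, not Cloudflare's actual code): a feature list is built from column-metadata rows, and each row carries the database it came from. If the builder never filters on that field, then the moment a permissions change makes a second database with the same tables visible, every column appears twice and the output doubles in size.

```rust
#[derive(Clone)]
struct ColumnRow {
    database: String,
    column: String,
}

/// Buggy version: ignores which database a metadata row came from.
fn features_unfiltered(rows: &[ColumnRow]) -> Vec<String> {
    rows.iter().map(|r| r.column.clone()).collect()
}

/// Fixed version: only keep rows from the intended database.
fn features_filtered(rows: &[ColumnRow], db: &str) -> Vec<String> {
    rows.iter()
        .filter(|r| r.database == db)
        .map(|r| r.column.clone())
        .collect()
}

fn main() {
    // Before the permissions change: only "default" is visible.
    let before = vec![
        ColumnRow { database: "default".into(), column: "feat_a".into() },
        ColumnRow { database: "default".into(), column: "feat_b".into() },
    ];
    // After: an underlying database becomes visible too, and its
    // metadata duplicates every column.
    let mut after = before.clone();
    after.push(ColumnRow { database: "r0".into(), column: "feat_a".into() });
    after.push(ColumnRow { database: "r0".into(), column: "feat_b".into() });

    assert_eq!(features_unfiltered(&after).len(), 4); // doubled
    assert_eq!(features_filtered(&after, "default").len(), 2); // stable
    println!("unfiltered: {}, filtered: {}",
             features_unfiltered(&after).len(),
             features_filtered(&after, "default").len());
}
```

Same input rows, same query shape; the only difference is one filter clause. That is the post's "logic problem" in miniature.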

On one side, folks like mikece framed it as “MV Dali energy”—a tiny loose wire causing a massive chain reaction, with jokes about “please don’t .unwrap() in prod” making the rounds. Another camp, led by cmckn, says the true villain wasn’t schema but speed: global, rapid deployment. In plain English—don’t flip every switch at once; roll out slowly. The meta-drama hit hard too: locknitpicker called the blog “Monday morning quarterbacking,” while vessenes warned that the “just be perfect” plan (perfect databases, no NULLs, formally checked code) costs fortunes, slows releases, and shrinks hiring pools. etchalon side-eyed the idea that a world-class team “missed a basic thing.”
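The "please don't .unwrap() in prod" joke has a serious core, sketched below with hypothetical names (this is not Cloudflare's code, just the pattern the commenters are pointing at): a loader that panics on "impossible" input will crash-loop every time the bad file is re-fetched, while one that propagates an error lets the caller fall back to the last known-good configuration.

```rust
const MAX_FEATURES: usize = 200; // illustrative limit

fn check(names: &[&str]) -> Result<Vec<String>, String> {
    if names.len() > MAX_FEATURES {
        return Err(format!("too many features: {}", names.len()));
    }
    Ok(names.iter().map(|s| s.to_string()).collect())
}

/// Panics on oversized input, i.e. on data "assumed impossible to generate".
#[allow(dead_code)]
fn load_features_strict(names: &[&str]) -> Vec<String> {
    check(names).unwrap()
}

/// On bad input, log and keep serving with the previous good config.
fn load_features_graceful(names: &[&str], last_good: &[String]) -> Vec<String> {
    match check(names) {
        Ok(fs) => fs,
        Err(e) => {
            eprintln!("rejecting feature file: {e}");
            last_good.to_vec()
        }
    }
}

fn main() {
    let last_good = vec!["feat_a".to_string()];
    let oversized: Vec<&str> = (0..MAX_FEATURES + 1).map(|_| "dup").collect();
    // The graceful path keeps serving with the old config:
    assert_eq!(load_features_graceful(&oversized, &last_good), last_good);
    // `load_features_strict(&oversized)` would panic here, and a supervisor
    // restarting the process would just hit the same file again: a crash loop.
    println!("kept {} last-good features", last_good.len());
}
```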

Meanwhile, memes flew: “Press F to filter your DB,” “ClickHouse vs Postgres cage match,” and “YOLO deploys are cancelled.” The internet loves a postmortem; it loves a comment brawl even more.

Key Points

  • A ClickHouse query used by Cloudflare did not filter by database and began returning duplicate column metadata after a permissions change at 11:05 UTC.
  • The query’s expanded result set fed into Bot Management’s feature file generation, pushing the feature count past its expected limit.
  • The application code was not designed to handle the unexpected data, triggering a crash loop across core Cloudflare systems.
  • The issue evaded rollout checks because the faulty code path depended on data assumed to be impossible to generate.
  • The author argues the outage’s root cause lies in a logical single point of failure and database/application mismatch, not physical redundancy.
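The cmckn camp's counter-argument from the points above, "don't flip every switch at once", can be sketched as a staged rollout (again with made-up names and stages, purely to illustrate the shape): push a change to a growing fraction of the fleet and halt at the first stage that reports trouble, so a bad config can only ever reach a small slice of machines.

```rust
/// Walk through rollout stages (fractions of the fleet), stopping at the
/// first stage whose health check fails. Returns the failing stage index.
fn staged_rollout<F>(stages: &[f64], healthy: F) -> Result<(), usize>
where
    F: Fn(f64) -> bool,
{
    for (i, pct) in stages.iter().enumerate() {
        if !healthy(*pct) {
            // Halt: only `pct` of the fleet ever saw the change.
            return Err(i);
        }
    }
    Ok(())
}

fn main() {
    let stages = [0.01, 0.05, 0.25, 1.0];
    // A change that starts failing once 5% of machines run it:
    let result = staged_rollout(&stages, |pct| pct < 0.05);
    assert_eq!(result, Err(1)); // stopped at the 5% stage, not at 100%
    println!("halted at stage {:?}", result);
}
```

The catch, also noted in the points above, is that a health check only helps if the failure shows up during the rollout; a code path gated on "impossible" data can sail through every stage.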

Hottest takes

"a single loose wire which set off a cascading failure" — mikece
"If you don’t want to take down all of prod, you can’t update all of prod at the same time" — cmckn
"This sort of Monday morning quarterbacking is pointless" — locknitpicker
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.