Corrosion

Rust “footgun” stalls Fly.io worldwide, sparks praise, side‑eye, and spicy memes

TLDR: Fly.io suffered a platform‑wide outage from a small Rust locking mistake, then open‑sourced Corrosion, the service discovery system at the center of it. Commenters applauded the transparency but clashed over Rust’s limits, “roll your own” vs Kubernetes/Consul, and whether router‑style designs beat classic consensus. High stakes for anyone building global apps.

Fly.io just dropped a brutally honest confession: a tiny Rust lock mistake froze their entire platform, turning every proxy into a statue. Then they open‑sourced the system at the heart of it, Corrosion, like a reality show reveal. The crowd? Loud. Many cheered the candor—“this is how postmortems should read”—and loved the hard‑won lesson that distributed systems (computers spread across the world) don’t just share data, they share bugs. Others winced: “oof, worst outage yet.”

Cue the brawl. Rust loyalists say the language didn’t fail, humans did, while skeptics pounced: “so much for ‘memory‑safe saves the day.’” Another fight: build vs buy. Some argued Fly.io should “stop reinventing Kubernetes,” the popular container orchestrator, and HashiCorp’s Consul, a service discovery tool. Defenders countered that Fly.io’s global model doesn’t fit a central brain; instead, Corrosion borrows from OSPF (the link‑state routing protocol in which every router spreads its piece of the map), dodging Raft (a voting‑style consensus system), which struggles across oceans. The community ran wild with memes: the blog’s “deep‑fried turkey” analogy got turned into GIFs, and “never trust a distributed system without an interesting failure story” is already a T‑shirt slogan. Meanwhile, tptacek linked a previous thread, proving this saga has lore. Tech chaos, great writing, and spicy takes: irresistible.

Key Points

  • On September 1, 2024, a new “virtual service” configuration caused every Fly.io proxy to deadlock, resulting in the platform’s worst outage.
  • The outage was triggered by a Rust concurrency bug in proxy code: an if let expression over an RwLock wrongly assumed, in its else branch, that the lock had already been released.
  • Fly.io introduced and open-sourced Corrosion, its unconventional service discovery and state distribution system.
  • Fly.io’s orchestration model makes individual servers the source of truth and uses a central API to bid out work, avoiding a centralized scheduling database.
  • After finding Consul, SQLite caches, and Raft unsuitable for global routing, Fly.io adopted design cues from link-state protocols like OSPF to build a global routing database.
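
The locking footgun in the second point is easier to see in code. What follows is a minimal sketch, not Fly.io's proxy code: the Routes alias and both function names are hypothetical, and try_write() stands in for a blocking write() so the example reports the lock state instead of hanging. It assumes the standard library's RwLock. In Rust editions before 2024, a guard created as a temporary in an if let scrutinee stays alive through the else branch; the 2024 edition drops it before else, so the first function's result on a miss depends on the edition it is compiled under.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for a proxy's routing table; not Fly.io's actual type.
type Routes = RwLock<HashMap<String, u32>>;

/// Hazardous shape: the read guard is a temporary created in the `if let`
/// scrutinee. Before the 2024 edition it is still held inside `else`, so a
/// blocking `write()` there would wait on this thread's own read lock.
/// Returns None on a hit, or Some(whether a write lock was obtainable) on a miss.
fn miss_path_can_write(routes: &Routes, key: &str) -> Option<bool> {
    if let Some(_backend) = routes.read().unwrap().get(key).copied() {
        None
    } else {
        // Edition-dependent: false while the read guard is still alive.
        Some(routes.try_write().is_ok())
    }
}

/// Safer shape: finish the read in its own statement so the guard is dropped
/// at the semicolon, then branch on the plain value.
fn lookup_or_insert(routes: &Routes, key: &str, fallback: u32) -> u32 {
    let found = routes.read().unwrap().get(key).copied(); // read guard dropped here
    match found {
        Some(backend) => backend,
        None => {
            // No read guard is outstanding on this thread, so this cannot self-deadlock.
            let mut table = routes.write().unwrap();
            // Another thread may have inserted between the read and the write;
            // `entry` keeps whichever value got there first.
            *table.entry(key.to_string()).or_insert(fallback)
        }
    }
}
```

Calling lookup_or_insert twice with the same key returns the first inserted value both times. The design choice is simply to never take a second lock while a guard from the first is still in scope, which binding the lookup result with let makes explicit in every edition.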

Hottest takes

“Previously: https://news.ycombinator.com/item?id=45680583” — tptacek