November 25, 2025
Let it crash? Hold my pager
What Now? Handling Errors in Large Systems
Crash it, fix it, or fake it — the internet picks sides
TLDR: Cloudflare’s outage post mentioning “unwrap” reignited a war over whether apps should crash on errors or keep going. Marc Brooker argues it depends on system design, and commenters split between “never crash” and “let it crash,” trading memes and hard-earned on-call wisdom because reliability is everything.
A single word in Cloudflare’s outage write-up — the Rust command “unwrap,” which basically means “crash if there’s a problem” — lit the internet on fire. Enter AWS engineer Marc Brooker, who poured gasoline (and clarity) on the debate by saying the real question isn’t one program, it’s the whole system. His essay lays out when crashing is smart, when it’s chaos, and why your architecture decides the rules, not your vibe.
Commenters split into tribes fast. The “never crash in production” crowd waved their on-call pagers like pitchforks, posting “This is fine” dog memes cuddling a broken config file. The “let it crash” camp (inspired by Erlang’s “fail fast” philosophy) clapped back: crash early, recover higher up, or you just hide bugs. Rust fans insisted “unwrap is fine in safe spots,” while ops folks begged: if failures are correlated (many servers failing the same way), a mass crash is a meltdown, not a strategy. Product managers chimed in with the customer view: keep the lights on, use the last good settings, and don’t corrupt the database.
The hottest take? Don’t blame a line of code; blame the system that made it dangerous. As one commenter put it, error handling isn’t a moral stance — it’s blast radius management. Meanwhile, the memes wrote themselves: “YOLO.unwrap()”, “Don’t crash, cache,” and “Let it crash? Hold my pager.”
Key Points
- •Cloudflare’s Nov. 18 outage postmortem mentioning Rust’s unwrap sparked discussion on production error handling.
- •Brooker explains Rust’s Result and unwrap, noting unwrap crashes on error and acts like an assert.
- •He asserts that error-handling decisions are global system properties, not local component choices.
- •Three principles guide decisions: failure correlation, higher-layer handling capability, and whether meaningful continuation is possible.
- •Architectural context matters: traditional services tolerate low error rates via autoscaling; fine-grained systems (AWS Lambda, Erlang) handle higher error rates and often favor crash-and-restart.