May 22, 2026

When “fine” is a total lie

We should get rid of average CPU utilization

Your graphs said “all good” while the app quietly melted down

TLDR: The app kept timing out because short bursts of CPU use burned through its container's CPU budget, and the kernel paused it at the worst moments. The usual average-CPU chart smoothed those bursts away, so everything looked fine. Commenters loved the cautionary tale, but argued over the lesson: bad metric, bad interpretation, or the eternal fix of just throwing more power at it?

This story landed like catnip for anyone who has ever stared at a dashboard, whispered “but the numbers look normal,” and then watched their software fall over anyway. The writer’s big reveal? A production app kept timing out not because the computer looked “busy” on average, but because it was getting briefly slammed and then forced to sit in the corner by the system’s resource rules. Translation for non-experts: the usual health meter said everything was fine, while real users were getting the digital equivalent of a traffic jam.

And the comments? Pure engineer group chat energy. One camp was fully hooked, with one reader calling it basically a detective novel for people who’ve been burned by misleading charts before. Another came in swinging with the bluntest possible counterpoint: don’t kill the average, just measure speed if speed is what you care about. That sparked the classic internet mini-drama: is the article exposing a bad metric, or just bad metric-reading? Meanwhile, a third crowd piled on with the bigger existential dread, saying this problem isn’t just about processor use—memory graphs can fool you too, and once you start digging, the rabbit hole never ends.

The funniest jab was also the most brutally simple: if the app is slow… just give it more resources. It’s the kind of caveman-tech wisdom that makes experts laugh, then sigh, then reluctantly admit it sometimes works. In other words, the community reaction was a mix of applause, nitpicking, and meme-worthy fatalism—aka the internet at its absolute best.

Key Points

  • A Go application experienced intermittent production timeouts that did not reproduce in development, CI/CD, or integration tests.
  • The article argues that average CPU utilization is a poor indicator for latency-sensitive workloads because wait times rise nonlinearly as utilization increases.
  • The root cause of the production issue was Linux cgroup CPU throttling, not high average CPU usage.
  • A container CPU limit of `2000m` is enforced by the kernel as a time budget per scheduling period rather than simply as two continuously available CPUs.
  • On multi-core hosts, a container can spend its CPU budget quickly across several cores and then be throttled, causing delayed requests and timeout failures despite apparently normal utilization metrics.
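
To make the arithmetic behind the last few points concrete, here is a rough sketch in Python. It rests on assumptions that the summary does not state: the nonlinear wait curve uses the textbook M/M/1 queue formula (the article may use a different model), the scheduling period is the Linux default of 100 ms, and the thread count and burst size are made up for illustration. The function names are hypothetical, not from the article.

```python
def mm1_wait_multiplier(utilization: float) -> float:
    """Mean time spent waiting in queue, in units of mean service time.

    Assumes an M/M/1 queue (an illustrative model, not taken from the article).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)


def throttle_window(limit_millicores: int, busy_threads: int, period_ms: float = 100.0):
    """Return (quota_ms, running_ms, throttled_ms) for one fully loaded period.

    The quota is CPU time, so N busy threads spend it N times faster than
    wall-clock time passes. period_ms=100 is the Linux CFS default.
    """
    quota_ms = limit_millicores / 1000.0 * period_ms
    running_ms = min(period_ms, quota_ms / busy_threads)
    return quota_ms, running_ms, period_ms - running_ms


def average_utilization(cpu_ms_used: float, limit_millicores: int, window_ms: float) -> float:
    """CPU time used as a fraction of the limit over a longer reporting window."""
    return cpu_ms_used / (limit_millicores / 1000.0 * window_ms)


if __name__ == "__main__":
    for rho in (0.5, 0.8, 0.9, 0.95):
        print(f"utilization {rho:.0%}: queue wait is about "
              f"{mm1_wait_multiplier(rho):.1f}x the service time")

    # Hypothetical burst: 8 runnable threads inside a 2000m container.
    quota, running, throttled = throttle_window(2000, busy_threads=8)
    print(f"quota {quota:.0f} ms per period: runs {running:.0f} ms, "
          f"then throttled {throttled:.0f} ms")

    # One such burst per second, viewed on a one-second average.
    print(f"average utilization of the limit: "
          f"{average_utilization(quota, 2000, 1000.0):.0%}")
```

Under these assumptions, a `2000m` limit is 200 ms of CPU time per 100 ms period. Eight busy threads spend that in 25 ms of wall-clock time and then sit throttled for the remaining 75 ms, yet one such burst per second averages out to only 10% of the limit on a one-second chart.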

Hottest takes

"measure latency if we care about latency" — ahartmetz
"metrics lie to you" — zeafoamrun
"if app slow, give more resources" — ksk23