February 21, 2026

It’s not you, it’s your hard drive

When etcd crashes, check your disks first

TL;DR: A flashy edge demo kept crashing because etcd, the cluster’s “brain,” choked on slow shared storage. Commenters piled in with war stories and jokes, mostly agreeing: stop blaming magic and give etcd a fast local disk; tuning timeouts helps only a little. Moral: hardware speed still matters.

A slick edge-to-cloud demo hit a brick wall when its “brain” — etcd, the tiny database that keeps Kubernetes organized — kept face-planting. The team stacked multiple virtual machines on a single mini PC and, surprise: slow, shared storage made etcd panic. Cue the comments bringing the heat. Veterans rolled in with “told you so” energy, with user kg spelling it out: etcd needs fast, consistent writes or it throws a tantrum. Another reader, AntonFriberg, dropped a war story about network storage (think drives over Wi‑Fi for your cluster) leaving etcd helpless.
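How sensitive is “sensitive”? etcd’s hardware guidance suggests keeping WAL fsync latency in the single-digit milliseconds (its `wal_fsync_duration_seconds` metric is the usual tell). A minimal sketch for spot-checking a disk yourself, timing small write-plus-fsync cycles the way etcd’s WAL appends behave; this is an illustration, not how etcd itself benchmarks:

```python
import os
import tempfile
import time

def fsync_latency_ms(samples: int = 100, size: int = 2048) -> float:
    """Average latency of writing a small block and fsync-ing it to disk.

    Mimics the shape of an etcd WAL append: small sequential write,
    then a forced flush to stable storage.
    """
    total = 0.0
    payload = os.urandom(size)
    # Create the scratch file on the disk you want to test (here: cwd).
    with tempfile.NamedTemporaryFile(dir=".") as f:
        for _ in range(samples):
            f.write(payload)
            start = time.perf_counter()
            f.flush()               # push Python's buffer to the OS
            os.fsync(f.fileno())    # force the OS to persist it
            total += time.perf_counter() - start
    return total / samples * 1000

if __name__ == "__main__":
    print(f"avg write+fsync latency: {fsync_latency_ms():.2f} ms")
```

On a local NVMe drive this tends to come back well under a millisecond; on network-backed or heavily shared storage it can land in the tens of milliseconds, which is exactly the territory where etcd starts missing deadlines.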

Drama time: one camp roasted the setup — “don’t run your cluster’s brain on thrift-store disks.” Another camp argued the real lesson is respecting etcd’s design: it’s consistent for a reason, and consistency wants speed. A few insisted timeout tweaks are just makeup on a broken foundation, while pragmatists said “buy an SSD and move on.”

Jokes flew too. People cast etcd as a diva that demands NVMe and candles, not NFS. Memes landed: “It’s always the disk,” “fsync is the final boss,” and “never put the brain on a network share.” The verdict? Before blaming Kubernetes magic, check your storage — and maybe your life choices. Bonus reading: etcd

Key Points

  • A Cloud-Edge-IoT demo using MLSysOps, Karmada, and k3s experienced recurring pod crashes every 5–10 minutes.
  • Investigation showed etcd timeouts caused by slow, inconsistent storage I/O on VMs sharing a NUC’s host disk.
  • etcd’s strong consistency depends on timely write-ahead log persistence and fsync completion, making it sensitive to I/O latency.
  • Missed heartbeats and election deadlines led to leader election failures, loss of quorum, and API-dependent pods failing.
  • Increasing etcd timeouts provided limited relief but did not resolve the underlying storage performance issue.
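The “limited relief” in that last point looks like stretching etcd’s own knobs. `--heartbeat-interval` and `--election-timeout` are real etcd flags (defaults 100 ms and 1000 ms); the values below are illustrative only, a sketch of the workaround the thread dismissed as makeup:

```shell
# Give a laggy disk more slack before leader elections fire.
# Defaults are 100 ms / 1000 ms; etcd expects election-timeout to be
# several multiples of heartbeat-interval. This masks I/O latency --
# it does not make fsync any faster.
etcd --heartbeat-interval=500 \
     --election-timeout=5000
```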

Hottest takes

"etcd is a strongly consistent, distributed key-value store, and that consistency comes at a cost: it is extraordinarily sensitive to I/O latency" — kg
"Then all the sudden there was a longer lasting outage where the ETCD did not really recover on its own" — AntonFriberg
"Leader elections fail" — kg
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.