I/O is no longer the bottleneck? (2022)

Reads are blazing, word counts crawl: was the disk ever the problem?

TLDR: File reading is super fast, but counting words still drags, igniting a debate over what's really slowing things down. The crowd says it's not the disk (memory bandwidth and single-core limits matter), while optimizers argue smarter code and explicit SIMD can push speeds higher.

The tech crowd is in full popcorn mode after a programmer tested the claim that reading files isn't the slowdown anymore. They measured blazing 1.6 GB/s (cold cache) and 12.8 GB/s (warm cache) read speeds, then tried a classic task, counting words, and got a rude awakening: a "fast" C version peaked around 278–330 MB/s, while the trusty wc -w clocked in even slower at 245 MB/s. The culprit? Branchy code the compiler can't easily turn into the kind of super-charged instructions (like AVX2, a "do lots at once" trick) that modern chips love.
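
The kind of loop in question can be sketched as below. This is an illustrative scalar word counter, not the author's actual code (count_words is a made-up name): the branch inside the loop depends on each input byte, which is exactly the shape that tends to defeat autovectorization.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Count words by tracking an in-word/out-of-word state.
 * A word "starts" when a non-space byte follows whitespace.
 * The data-dependent branch below is what keeps the compiler
 * from emitting wide SIMD compares for this loop. */
size_t count_words(const char *buf, size_t len) {
    size_t words = 0;
    bool in_word = false;
    for (size_t i = 0; i < len; i++) {
        bool is_space = isspace((unsigned char)buf[i]);
        if (!is_space && !in_word)
            words++;            /* entering a new word */
        in_word = !is_space;
    }
    return words;
}
```

Because whether the branch is taken depends on the data, the compiler generally falls back to scalar compare-and-jump code rather than processing 32 bytes per instruction.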

Cue the comments throwing elbows. One camp says it was never the disk: as pvorb puts it, I/O bottlenecks were never about smooth, straight-line sequential reads. Meanwhile, eliasdejong drops a reality check: the real cap is how fast a single core can be fed (think memcpy-style copy speed), around 6 GB/s on many PCs and up to 20 GB/s on fancy Apple chips. Others chime in with humor: kevmo314 joked that NVMe made it feel like we've got "2 TB of RAM," then admitted some GPU boxes actually do. There's meta-drama too: wmf links to prior debates, and akoboldfrying points to a fresh addendum with a "clever trick" to squeeze out more speed. The vibe: is it your code, your compiler, or physics? The only unanimous take: this isn't over, and the optimizers are sharpening their knives.
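
That "memcpy speed" ceiling is easy to probe on your own machine. Here is a rough sketch (memcpy_gbps is a hypothetical helper, not from the article) that times repeated large copies; results vary widely with CPU, cache sizes, and compiler, so treat any single number with suspicion.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Rough memory-bandwidth probe: time `reps` memcpy passes over a
 * `bytes`-sized buffer and report GB/s. Both buffers are touched
 * first so page faults don't pollute the measurement. */
double memcpy_gbps(size_t bytes, int reps) {
    char *src = malloc(bytes), *dst = malloc(bytes);
    if (!src || !dst) { free(src); free(dst); return -1.0; }
    memset(src, 'x', bytes);
    memset(dst, 'y', bytes);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        memcpy(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gbps = (double)bytes * reps / secs / 1e9;
    free(src);
    free(dst);
    return gbps;
}
```

Note that a buffer small enough to fit in cache will report cache bandwidth, not DRAM bandwidth, which is one reason quoted "memcpy speed" numbers differ so much.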

Key Points

  • Measured sequential read speeds: 1.6 GB/s (cold cache) and 12.8 GB/s (warm cache).
  • An optimized C word-frequency counter compiled with GCC 12 achieved ~278 MB/s on a 425 MB input.
  • Branch-heavy loops prevented autovectorization; moving the lowercase logic out of the hot loop improved throughput to ~330 MB/s with clang.
  • The wc -w baseline delivered ~245.2 MB/s due to broader whitespace handling and locale-sensitive parsing.
  • Explicit SIMD (AVX2) vectorization is proposed as necessary for substantial parsing speedups; compilers struggle with branchy scalar code.
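
The explicit-SIMD direction might look roughly like the sketch below: classify 32 bytes at a time as whitespace or not with AVX2 compares, then count word starts using a mask shift and a popcount. This is an assumption-laden illustration, not the article's code (simd_count_words is a made-up name, and it treats only space, tab, and newline as separators, unlike wc -w); it falls back to a scalar loop when AVX2 isn't available.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Count words = transitions from whitespace to non-whitespace.
 * Separators are only ' ', '\t', '\n' for brevity. */
size_t simd_count_words(const char *buf, size_t len) {
    size_t words = 0;
    int prev_space = 1;  /* pretend there's whitespace before the buffer */
    size_t i = 0;
#ifdef __AVX2__
    const __m256i sp = _mm256_set1_epi8(' ');
    const __m256i tb = _mm256_set1_epi8('\t');
    const __m256i nl = _mm256_set1_epi8('\n');
    for (; i + 32 <= len; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(buf + i));
        /* 0xFF in lanes that hold whitespace */
        __m256i ws = _mm256_or_si256(
            _mm256_cmpeq_epi8(v, sp),
            _mm256_or_si256(_mm256_cmpeq_epi8(v, tb),
                            _mm256_cmpeq_epi8(v, nl)));
        uint32_t space_mask = (uint32_t)_mm256_movemask_epi8(ws);
        uint32_t word_mask  = ~space_mask;   /* non-whitespace bytes */
        /* A word starts where a non-space byte follows a space byte;
         * the carry bit injects the last byte of the previous chunk. */
        uint32_t prev_nonspace = (word_mask << 1) | (prev_space ? 0u : 1u);
        words += (size_t)__builtin_popcount(word_mask & ~prev_nonspace);
        prev_space = (int)((space_mask >> 31) & 1u);
    }
#endif
    /* Scalar tail (and full fallback without AVX2). */
    for (; i < len; i++) {
        int space = (buf[i] == ' ' || buf[i] == '\t' || buf[i] == '\n');
        if (!space && prev_space)
            words++;
        prev_space = space;
    }
    return words;
}
```

The payoff of this shape is that the per-byte branch disappears from the hot path: classification becomes three vector compares, and counting becomes bit arithmetic on a 32-bit mask.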

Hottest takes

"I/O being the bottleneck never was about sequential reads" — pvorb
"The performance limit now is basically memcpy speed" — eliasdejong
"I’d joke that now we have 2 TB of RAM" — kevmo314
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.