Read Locks Are Not Your Friends

Dev says 'read locks' slow you down; internet fights over chips, design, and receipts

TLDR: A Rust cache showed read locks can be ~5× slower than a plain mutex because many readers fight over a shared reader counter, flooding the CPU with cache-coherence traffic. Comments split between Apple-vs-Intel theories, per-core reader designs, and demands for clearer code. Lesson: profile on real hardware and pick locks thoughtfully.

A Rust dev tried the “obvious” trick—use a read lock to speed up a read-heavy cache—and got dunked by reality: read locks were about five times slower than a plain old mutex. Cue the crowd yelling “hardware hot potato!” as folks explain that every reader tweaks a hidden counter, making the CPU bounce a 64-byte cache line between cores like, well, a hot potato. The write lock? Ironically quieter: one thread, less bus drama.
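To see the shape of the problem, here's a minimal, self-contained sketch of the contention pattern described above. The thread count, iteration count, and workload are illustrative, not from the post, and which lock wins depends entirely on the hardware and the length of the critical section:

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;
use std::time::{Duration, Instant};

const THREADS: usize = 8;
const ITERS: usize = 50_000;

// Many readers hammering RwLock::read(): every acquire/release bumps a
// shared atomic reader count, so the cache line holding that counter
// ping-pongs between cores even though the data itself is never written.
fn bench_rwlock() -> Duration {
    let data = Arc::new(RwLock::new(vec![1u64; 64]));
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let data = Arc::clone(&data);
            thread::spawn(move || {
                let mut sum = 0u64;
                for i in 0..ITERS {
                    sum += data.read().unwrap()[i % 64];
                }
                sum
            })
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), ITERS as u64);
    }
    start.elapsed()
}

// Same workload behind a plain Mutex: readers serialize, but only one
// thread at a time touches the lock word, so there is less bus drama.
fn bench_mutex() -> Duration {
    let data = Arc::new(Mutex::new(vec![1u64; 64]));
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let data = Arc::clone(&data);
            thread::spawn(move || {
                let mut sum = 0u64;
                for i in 0..ITERS {
                    sum += data.lock().unwrap()[i % 64];
                }
                sum
            })
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), ITERS as u64);
    }
    start.elapsed()
}

fn main() {
    println!("RwLock: {:?}", bench_rwlock());
    println!("Mutex:  {:?}", bench_mutex());
}
```

Don't read the absolute numbers as gospel: the whole point of the thread is that results like this vary wildly by chip.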

The comments turned into a hardware vs design cage match. One camp wonders if this is Apple Silicon being quirky—“would this fly on Intel/AMD?” asks whizzter, igniting chip wars. Another camp dives into fixes: _dky sketches per-slot locks padded so each lives on its own cache line (translation: keep readers from elbowing each other). Then ot shows up with receipts: don’t generalize—some locks use per-core tricks so reads scale, like folly::SharedMutex. Meanwhile amluto calls out the code: “confusing,” needs more detail to truly optimize.
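The per-slot idea sketched by _dky might look something like this in Rust. All the names here are hypothetical, and the slot type is simplified to a single `Option<u64>`; the point is the `#[repr(align(64))]`, which forces each lock onto its own 64-byte cache line so readers of different slots stop elbowing each other:

```rust
use std::sync::Mutex;

// Hypothetical padded per-slot lock: aligning each one to 64 bytes means
// two slots never share a cache line, so contention on one slot's lock
// doesn't invalidate the line holding a neighbor's lock (no false sharing).
#[repr(align(64))]
struct PaddedLock<T> {
    inner: Mutex<T>,
}

struct PerSlotCache {
    slots: Vec<PaddedLock<Option<u64>>>,
}

impl PerSlotCache {
    fn new(n: usize) -> Self {
        Self {
            slots: (0..n)
                .map(|_| PaddedLock { inner: Mutex::new(None) })
                .collect(),
        }
    }

    fn put(&self, key: usize, val: u64) {
        *self.slots[key % self.slots.len()].inner.lock().unwrap() = Some(val);
    }

    fn get(&self, key: usize) -> Option<u64> {
        *self.slots[key % self.slots.len()].inner.lock().unwrap()
    }
}

fn main() {
    // Each PaddedLock occupies a multiple of its own 64-byte line.
    assert_eq!(std::mem::align_of::<PaddedLock<Option<u64>>>(), 64);
    let cache = PerSlotCache::new(16);
    cache.put(3, 42);
    assert_eq!(cache.get(3), Some(42));
    println!("ok");
}
```

The 64-byte figure matches common x86 cache lines; some Apple Silicon parts use 128-byte lines, so the padding constant is itself hardware-dependent.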

And of course, the memes: Retr0id drops, “claudes love to talk about The Hardware Reality,” roasting the essay vibes. The takeaway? Obvious optimizations can betray you: profile first, design second, and don’t assume a “read” is really a read when the hardware says otherwise.

Key Points

  • On Apple Silicon M4, a read-heavy LRU tensor cache saw ~5× lower throughput with RwLock than with a mutex.
  • The RwLock’s internal atomic reader count updates caused cache line ping-pong and heavy coherence traffic.
  • Short critical sections (e.g., fast map lookups) amplify RwLock overhead, negating benefits of shared reads.
  • The write lock was less noisy on the hardware bus, allowing a single thread to complete work efficiently.
  • Recommendations include profiling (perf, cargo-flamegraph) and sharding the cache to reduce contention.
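One way to read the sharding recommendation, as a rough sketch: split one big locked map into N independently locked shards, so concurrent readers usually land on different locks. The shard count, key type, and method names below are illustrative, not the author's actual cache:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Hypothetical sharded cache: keys hash to one of N shards, each behind
// its own Mutex, so threads touching different shards never contend.
struct ShardedCache {
    shards: Vec<Mutex<HashMap<String, u64>>>,
}

impl ShardedCache {
    fn new(n: usize) -> Self {
        Self {
            shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Pick a shard by hashing the key; only that shard's lock is taken.
    fn shard_index(&self, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    fn insert(&self, key: String, val: u64) {
        let i = self.shard_index(&key);
        self.shards[i].lock().unwrap().insert(key, val);
    }

    fn get(&self, key: &str) -> Option<u64> {
        let i = self.shard_index(key);
        self.shards[i].lock().unwrap().get(key).copied()
    }
}

fn main() {
    let cache = ShardedCache::new(8);
    cache.insert("weights".to_string(), 7);
    assert_eq!(cache.get("weights"), Some(7));
    assert_eq!(cache.get("missing"), None);
    println!("ok");
}
```

Sharding trades a single contended lock for hash-distributed contention; as always per the thread's advice, measure with perf or cargo-flamegraph before committing to it.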

Hottest takes

"is there an optimization in Apple silicon that makes this bad while it'd fly on Intel/AMD cpus?" — whizzter
"This is drawing broad conclusions from a specific RW mutex implementation" — ot
"claudes love to talk about The Hardware Reality" — Retr0id
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.