February 25, 2026
Ping-pong locks, hot takes
Read Locks Are Not Your Friends
Dev says 'read locks' slow you down; internet fights over chips, design, and receipts
TLDR: A Rust cache showed read locks can be ~5× slower than a plain mutex because many readers fight over a shared reader counter, flooding the CPU's cache-coherence traffic. Comments split between Apple-vs-Intel theories, per‑core reader designs, and demands for clearer code. Lesson: profile on real hardware and pick locks deliberately.
A Rust dev tried the “obvious” trick—use a read lock to speed up a read-heavy cache—and got dunked by reality: read locks were about five times slower than a plain old mutex. Cue the crowd yelling “hardware hot potato!” as folks explain that every reader tweaks a hidden counter, making the CPU bounce a 64-byte cache line between cores like a sizzling meme. The write lock? Ironically quieter—one thread, less bus drama.
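The hot potato is easy to reproduce in plain Rust. Here's a hedged sketch (not the original benchmark; thread counts, sizes, and the `read_heavy_sum` name are all illustrative): every `read()` atomically bumps the `RwLock`'s shared reader count, so the line holding that counter bounces between cores even though nobody writes the data.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Hypothetical read-heavy loop over a shared, never-mutated "cache".
fn read_heavy_sum(threads: usize, iters: usize) -> u64 {
    let cache = Arc::new(RwLock::new(vec![0u64; 1024]));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                let mut sum = 0u64;
                for i in 0..iters {
                    // Each read() does an atomic increment (and decrement on
                    // drop) of a shared reader count: that counter's cache
                    // line ping-pongs between all the reader cores.
                    let guard = cache.read().unwrap();
                    sum += guard[i % guard.len()];
                }
                sum
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // The cache holds only zeros, so the sum is 0; the coherence traffic,
    // not the result, is the point.
    assert_eq!(read_heavy_sum(8, 100_000), 0);
}
```

The shorter the critical section (a fast map lookup, say), the larger the share of each iteration spent on that counter traffic.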
The comments turned into a hardware vs design cage match. One camp wonders if this is Apple Silicon being quirky—“would this fly on Intel/AMD?” asks whizzter, igniting chip wars. Another camp dives into fixes: _dky sketches per-slot locks padded so each lives on its own cache line (translation: keep readers from elbowing each other). Then ot shows up with receipts: don’t generalize—some locks use per-core tricks so reads scale, like folly::SharedMutex. Meanwhile amluto calls out the code: “confusing,” needs more detail to truly optimize.
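The per-slot idea _dky sketches fits in a few lines. This is a minimal illustration, not _dky's actual code: the struct names are invented, and `#[repr(align(64))]` pads each slot's lock onto its own 64-byte cache line so readers of different slots never touch the same counter.

```rust
use std::sync::RwLock;

// Pad each slot's lock to a full 64-byte cache line so one slot's
// reader-count traffic never invalidates a neighbor's line.
#[repr(align(64))]
struct PaddedSlot<T> {
    lock: RwLock<T>,
}

struct ShardedCache<T> {
    slots: Vec<PaddedSlot<T>>,
}

impl<T: Default> ShardedCache<T> {
    fn new(n: usize) -> Self {
        let slots = (0..n)
            .map(|_| PaddedSlot { lock: RwLock::new(T::default()) })
            .collect();
        Self { slots }
    }

    // A trivial modulo "hash" picks the slot; readers of different keys
    // contend on different cache lines (keep readers from elbowing each other).
    fn slot_for(&self, key: u64) -> &RwLock<T> {
        &self.slots[(key as usize) % self.slots.len()].lock
    }
}

fn main() {
    // Sanity check: each slot really occupies at least one cache line.
    assert_eq!(std::mem::align_of::<PaddedSlot<u64>>(), 64);

    let cache: ShardedCache<u64> = ShardedCache::new(16);
    *cache.slot_for(3).write().unwrap() = 42;
    assert_eq!(*cache.slot_for(3).read().unwrap(), 42);
}
```

ot's point about folly::SharedMutex is the heavier-duty version of the same instinct: keep per-core reader state so uncontended reads stay core-local.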
And of course, the memes: Retr0id drops, “claudes love to talk about The Hardware Reality,” roasting the essay vibes. The vibe? Obvious optimizations can betray you, profile first, design second, and don’t assume a “read” is really a read when the hardware says otherwise.
Key Points
- On Apple Silicon M4, a read-heavy LRU tensor cache saw ~5× lower throughput with RwLock than with a mutex.
- The RwLock’s internal atomic reader count updates caused cache line ping-pong and heavy coherence traffic.
- Short critical sections (e.g., fast map lookups) amplify RwLock overhead, negating benefits of shared reads.
- The write lock was less noisy on the hardware bus, allowing a single thread to complete work efficiently.
- Recommendations include profiling (perf, cargo-flamegraph) and sharding the cache to reduce contention.
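If you want to try the "profile first" advice at home, here's a throwaway harness in the same spirit (again, not the author's code: thread and iteration counts are made up). It times an identical read loop under a Mutex and an RwLock; which one wins depends on your core count and coherence fabric, which is the whole point.

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;
use std::time::Instant;

fn sum_with_mutex(data: Arc<Mutex<Vec<u64>>>, threads: usize, iters: usize) -> u64 {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let data = Arc::clone(&data);
            thread::spawn(move || {
                let mut s = 0u64;
                for i in 0..iters {
                    let g = data.lock().unwrap(); // exclusive, one atomic handoff
                    s += g[i % g.len()];
                }
                s
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn sum_with_rwlock(data: Arc<RwLock<Vec<u64>>>, threads: usize, iters: usize) -> u64 {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let data = Arc::clone(&data);
            thread::spawn(move || {
                let mut s = 0u64;
                for i in 0..iters {
                    let g = data.read().unwrap(); // shared, but bumps a shared counter
                    s += g[i % g.len()];
                }
                s
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let v: Vec<u64> = (0..1024).collect();
    let (threads, iters) = (8, 100_000);

    let t = Instant::now();
    let a = sum_with_mutex(Arc::new(Mutex::new(v.clone())), threads, iters);
    println!("mutex:  {:?}", t.elapsed());

    let t = Instant::now();
    let b = sum_with_rwlock(Arc::new(RwLock::new(v)), threads, iters);
    println!("rwlock: {:?}", t.elapsed());

    // Same work, same answer; only the timings (and your hardware) differ.
    assert_eq!(a, b);
}
```

For anything beyond a sanity check, reach for perf or cargo-flamegraph as the key points suggest; wall-clock timing of a micro-loop is a hint, not a verdict.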