Read Locks Are Not Your Friends

Dev says 'read locks' slow you down; internet fights over chips, design, and receipts

TLDR: A Rust cache showed read locks can be ~5× slower than a plain mutex because many readers fight over a shared reader counter, flooding the CPU with cache-coherence traffic. Comments split between Apple-vs-Intel theories, per-core reader designs, and demands for clearer code. Lesson: profile on real hardware and pick locks thoughtfully.

A Rust dev tried the “obvious” trick—use a read lock to speed up a read-heavy cache—and got dunked by reality: read locks were about five times slower than a plain old mutex. Cue the crowd yelling “hardware hot potato!” as folks explain that every reader tweaks a hidden counter, making the CPU bounce a 64-byte cache line between cores like, well, a hot potato. The write lock? Ironically quieter: one thread, less bus drama.
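To see the shape of the problem, here's a minimal, self-contained sketch of the contention pattern described above. The thread count, iteration count, and workload are illustrative, not from the post, and which lock wins depends entirely on the hardware and the length of the critical section:

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;
use std::time::{Duration, Instant};

const THREADS: usize = 8;
const ITERS: usize = 50_000;

// Many readers hammering RwLock::read(): every acquire/release bumps a
// shared atomic reader count, so the cache line holding that counter
// ping-pongs between cores even though the data itself is never written.
fn bench_rwlock() -> Duration {
    let data = Arc::new(RwLock::new(vec![1u64; 64]));
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let data = Arc::clone(&data);
            thread::spawn(move || {
                let mut sum = 0u64;
                for i in 0..ITERS {
                    sum += data.read().unwrap()[i % 64];
                }
                sum
            })
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), ITERS as u64);
    }
    start.elapsed()
}

// Same workload behind a plain Mutex: readers serialize, but only one
// thread at a time touches the lock word, so there is less bus drama.
fn bench_mutex() -> Duration {
    let data = Arc::new(Mutex::new(vec![1u64; 64]));
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|_| {
            let data = Arc::clone(&data);
            thread::spawn(move || {
                let mut sum = 0u64;
                for i in 0..ITERS {
                    sum += data.lock().unwrap()[i % 64];
                }
                sum
            })
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), ITERS as u64);
    }
    start.elapsed()
}

fn main() {
    println!("RwLock: {:?}", bench_rwlock());
    println!("Mutex:  {:?}", bench_mutex());
}
```

Don't read the absolute numbers as gospel: the whole point of the thread is that results like this vary wildly by chip.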

The comments turned into a hardware vs design cage match. One camp wonders if this is Apple Silicon being quirky—“would this fly on Intel/AMD?” asks whizzter, igniting chip wars. Another camp dives into fixes: _dky sketches per-slot locks padded so each lives on its own cache line (translation: keep readers from elbowing each other). Then ot shows up with receipts: don’t generalize—some locks use per-core tricks so reads scale, like folly::SharedMutex. Meanwhile amluto calls out the code: “confusing,” needs more detail to truly optimize.
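The per-slot idea sketched by _dky might look something like this in Rust. All the names here are hypothetical, and the slot type is simplified to a single `Option<u64>`; the point is the `#[repr(align(64))]`, which forces each lock onto its own 64-byte cache line so readers of different slots stop elbowing each other:

```rust
use std::sync::Mutex;

// Hypothetical padded per-slot lock: aligning each one to 64 bytes means
// two slots never share a cache line, so contention on one slot's lock
// doesn't invalidate the line holding a neighbor's lock (no false sharing).
#[repr(align(64))]
struct PaddedLock<T> {
    inner: Mutex<T>,
}

struct PerSlotCache {
    slots: Vec<PaddedLock<Option<u64>>>,
}

impl PerSlotCache {
    fn new(n: usize) -> Self {
        Self {
            slots: (0..n)
                .map(|_| PaddedLock { inner: Mutex::new(None) })
                .collect(),
        }
    }

    fn put(&self, key: usize, val: u64) {
        *self.slots[key % self.slots.len()].inner.lock().unwrap() = Some(val);
    }

    fn get(&self, key: usize) -> Option<u64> {
        *self.slots[key % self.slots.len()].inner.lock().unwrap()
    }
}

fn main() {
    // Each PaddedLock occupies a multiple of its own 64-byte line.
    assert_eq!(std::mem::align_of::<PaddedLock<Option<u64>>>(), 64);
    let cache = PerSlotCache::new(16);
    cache.put(3, 42);
    assert_eq!(cache.get(3), Some(42));
    println!("ok");
}
```

The 64-byte figure matches common x86 cache lines; some Apple Silicon parts use 128-byte lines, so the padding constant is itself hardware-dependent.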

And of course, the memes: Retr0id drops, “claudes love to talk about The Hardware Reality,” roasting the essay vibes. The takeaway? Obvious optimizations can betray you: profile first, design second, and don’t assume a “read” is really a read when the hardware says otherwise.

Key Points

  • On Apple Silicon M4, a read-heavy LRU tensor cache saw ~5× lower throughput with RwLock than with a mutex.
  • The RwLock’s internal atomic reader count updates caused cache line ping-pong and heavy coherence traffic.
  • Short critical sections (e.g., fast map lookups) amplify RwLock overhead, negating benefits of shared reads.
  • The write lock was less noisy on the hardware bus, allowing a single thread to complete work efficiently.
  • Recommendations include profiling (perf, cargo-flamegraph) and sharding the cache to reduce contention.
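One way to read the sharding recommendation, as a rough sketch: split one big locked map into N independently locked shards, so concurrent readers usually land on different locks. The shard count, key type, and method names below are illustrative, not the author's actual cache:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Hypothetical sharded cache: keys hash to one of N shards, each behind
// its own Mutex, so threads touching different shards never contend.
struct ShardedCache {
    shards: Vec<Mutex<HashMap<String, u64>>>,
}

impl ShardedCache {
    fn new(n: usize) -> Self {
        Self {
            shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Pick a shard by hashing the key; only that shard's lock is taken.
    fn shard_index(&self, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    fn insert(&self, key: String, val: u64) {
        let i = self.shard_index(&key);
        self.shards[i].lock().unwrap().insert(key, val);
    }

    fn get(&self, key: &str) -> Option<u64> {
        let i = self.shard_index(key);
        self.shards[i].lock().unwrap().get(key).copied()
    }
}

fn main() {
    let cache = ShardedCache::new(8);
    cache.insert("weights".to_string(), 7);
    assert_eq!(cache.get("weights"), Some(7));
    assert_eq!(cache.get("missing"), None);
    println!("ok");
}
```

Sharding trades a single contended lock for hash-distributed contention; as always per the thread's advice, measure with perf or cargo-flamegraph before committing to it.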

Hottest takes

"is there an optimization in Apple silicon that makes this bad while it'd fly on Intel/AMD cpus?" — whizzter
"This is drawing broad conclusions from a specific RW mutex implementation" — ot
"claudes love to talk about The Hardware Reality" — Retr0id
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.