Principles of Mechanical Sympathy

Be nice to your CPU? Devs feud over “Mechanical Sympathy”

TLDR: A new piece argues that faster software starts with "mechanical sympathy": designing code to flow the way the hardware does. Commenters split three ways: hardware purists, AI skeptics who say slow model calls dwarf any local tweaks, and pragmatists preaching "measure first, then optimize." It matters because better mechanical sympathy means faster, cheaper systems.

A racing-legend mantra, "mechanical sympathy," just crashed into coding culture, and the pit lane is spicy. The article argues modern hardware is blazing fast, but software crawls because we ignore how computers actually move data. Think: keep related data close together, read memory in sequence, avoid tiny fights over shared data, and batch work smartly. Cue the grandstand: one camp cheered the back-to-the-metal message, with one pro asserting there is really a single rule: make the shape of your software match the shape of the machine. Another camp asked, "Cool story, but does this matter for AI?"

Enter the skeptic. One commenter torpedoed the hype, saying model calls are so slow and infrequent that local CPU tricks won’t save you—so don’t bring a Formula 1 pit crew to a traffic jam. Meanwhile, a crowd of pragmatists rallied around “measure first, then tune,” loving advice about greedy batching and observability—explained in human terms as “track what matters and set promises for reliability.”
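The "greedy batching" the pragmatists praised isn't shown in the article; one common shape of it, sketched in Go under our own assumptions (hypothetical `drainBatch` helper, channel-based queue), is: block for the first item, then scoop up everything already waiting instead of processing one item at a time.

```go
package main

import "fmt"

// drainBatch blocks for the first item, then greedily drains whatever
// else is already queued, up to max, without waiting for more work.
func drainBatch(ch <-chan int, max int) []int {
	batch := []int{<-ch} // block until at least one item arrives
	for len(batch) < max {
		select {
		case v := <-ch:
			batch = append(batch, v)
		default: // queue is empty: ship the batch now rather than wait
			return batch
		}
	}
	return batch
}

func main() {
	ch := make(chan int, 8)
	for i := 0; i < 5; i++ {
		ch <- i
	}
	fmt.Println(drainBatch(ch, 8)) // [0 1 2 3 4]
}
```

The payoff is amortization: one downstream call (a write, a flush, a model request) covers however much work piled up while the previous call was in flight.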

And the vibes? Surprisingly tender. One user confessed they’ve always felt “sympathy for machines” and once got roasted for it, prompting memes like “hug your cache lines” and “don’t crunch your gearbox—or your CPU.” The consensus split: purists want software shaped like hardware, realists say AI latency rules, and middle-grounders just want better dashboards and fewer slow starts.

Key Points

  • Mechanical sympathy is presented as designing software to align with hardware behavior to improve performance.
  • The CPU memory hierarchy (registers/buffers, L1/L2 per-core caches, shared L3, and RAM) rewards locality and predictable access patterns.
  • Sequential data access typically outperforms page-local and random access; ETL pipelines should prefer sequential scans over per-key queries.
  • Cache lines (often 64 bytes) can cause false sharing when multiple cores write to different variables in the same line; padding mitigates this.
  • Examples include LMAX Architecture's single-thread throughput, the author's systems at Wayfair, and custom encodings outperforming Protocol Buffers.
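The sequential-versus-random access point is easy to demonstrate. This minimal Go sketch (ours, not the article's) sums the same elements two ways: the in-order walk lets the hardware prefetcher stream cache lines ahead of the loop, while a shuffled visiting order models per-key lookups that defeat it.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// sumInOrder walks the slice front to back: predictable, prefetch-friendly.
func sumInOrder(data []int64) int64 {
	var total int64
	for _, v := range data {
		total += v
	}
	return total
}

// sumByIndex visits the same elements in an arbitrary order; with a
// shuffled index list this behaves like per-key random access.
func sumByIndex(data []int64, order []int) int64 {
	var total int64
	for _, i := range order {
		total += data[i]
	}
	return total
}

func main() {
	const n = 1 << 23 // ~8M elements, large enough to spill out of cache on most machines
	data := make([]int64, n)
	order := make([]int, n)
	for i := range data {
		data[i] = int64(i)
		order[i] = i
	}
	rand.Shuffle(n, func(i, j int) { order[i], order[j] = order[j], order[i] })

	t0 := time.Now()
	s1 := sumInOrder(data)
	seq := time.Since(t0)

	t0 = time.Now()
	s2 := sumByIndex(data, order)
	rnd := time.Since(t0)

	fmt.Printf("same sum: %v  sequential: %v  shuffled: %v\n", s1 == s2, seq, rnd)
}
```

Both loops do identical arithmetic on identical data; any gap in the timings comes purely from the memory access pattern, which is the article's point about preferring sequential scans in ETL work.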
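The false-sharing point can likewise be sketched. Assuming 64-byte cache lines (common, but hardware-dependent), two counters that sit in the same line force writer cores to invalidate each other's copy on every increment; padding one counter onto its own line removes the fight. This Go sketch is illustrative, not the article's code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sharedCounters puts both fields adjacent in memory, so they very
// likely occupy the same 64-byte cache line.
type sharedCounters struct {
	a uint64
	b uint64
}

// paddedCounters pads past the rest of a's cache line (assuming
// 64-byte lines) so b lands on a different line.
type paddedCounters struct {
	a uint64
	_ [56]byte // 8 bytes of a + 56 bytes of padding = 64
	b uint64
}

// bench runs two goroutines, each hammering its own counter via the
// supplied closures, and reports wall-clock time.
func bench(incA, incB func(), iters int) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			incA()
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			incB()
		}
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	const iters = 10_000_000
	var s sharedCounters
	var p paddedCounters
	shared := bench(func() { s.a++ }, func() { s.b++ }, iters)
	padded := bench(func() { p.a++ }, func() { p.b++ }, iters)
	fmt.Printf("shared line: %v  padded: %v\n", shared, padded)
}
```

Each goroutine touches only its own variable, so this is race-free; the slowdown in the unpadded case is purely cache-coherence traffic, which is what the padding mitigates.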

Hottest takes

"make the topology of the software match the topology of the hardware as closely as possible." — jandrewrogers
"parallel agent systems are of questionable utility" — bob1029
"I used to liken it to a formula 1 driver (scientist) and the car / pit crew (engineers)." — tommodev
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.