October 29, 2025

One extra byte, four times the pain

Quantifying pass-by-value overhead

Programmers find a weird speed cliff; AMD gets side‑eye and comments go feral

TLDR: A benchmark found small data copies are fast, but certain ~4KB sizes on some AMD chips hit a weird slowdown. Commenters split between “semantics, not overhead,” “benchmarks lie in the real world,” and “just use ChatGPT to crank code,” turning a copy test into a full‑blown performance drama.

A coder set out to measure whether passing data to functions “by value” (copying it) is slow, and stumbled into a plot twist: small copies are cheap, big ones trigger a copy instruction (think a built-in loop), and on some AMD chips there’s a bizarre performance cliff where adding one extra byte makes things 4x slower. You read that right: 4064 bytes flies, 4065 bytes crawls. The author even invites AMD engineers to explain the mystery and compiler folks to try a cheeky workaround.

The crowd? Absolutely buzzing. One camp cheers the detective work and gawks at the “don’t use 4KB‑ish sizes” warning like it’s a cursed treasure map. Another camp shrugs: “benchmarks aren’t the real world”; your computer’s other tasks and memory habits can flip the results anyway. Then the philosophy nerds arrive: some insist there’s “no pass‑by‑value overhead,” only compiler decisions, turning the thread into a semantics showdown. And because it’s 2025, someone drops the line that they use ChatGPT as a “dumb code generator” for microbenchmarks, igniting eye‑rolls and applause.

Between the laughs (“one extra byte ruins your day”) and side‑eye at AMD, the vibe is peak dev‑drama: clever graphs, spicy theory fights, and a mystery bug begging for an engineer’s confession. Check the benchmark code if you dare.

Key Points

  • Structs up to 256 bytes are cheaply passed by value using SIMD/vectorized moves; beyond that, rep movs is used.
  • Passing by reference has similar overhead to passing a pointer-sized struct by value.
  • A clear threshold at 257 bytes marks a transition from unrolled vectorized copies to rep movs, with the latter generally slower.
  • rep movs performance exhibits a 32-byte periodic pattern (8 fast widths followed by 24 slow ones) plus reproducible spikes at certain sizes.
  • Measured rates: ~730M 16-byte structs/sec vs ~26M 2048-byte structs/sec; on the AMD Ryzen 3900X, avoid by-value passing for sizes of 4065–4080 and 8161–8176 bytes.

Hottest takes

"There is no pass-by-value overhead." — jklowden
"I would ignore this benchmark because it’s not going to predict anything for real world code." — pizlonator
"I usually use ChatGPT for such microbenchmarks" — codedokode
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.