Why is calling my asm function from Rust slower than calling it from C?

Same code, slower speed: Rust vs C sparks a spicy comment war

TLDR: A shared assembly function ran slower in Rust because extra stack use and a tricky wrapper made one memory load stall; simpler code fixed it. Comments exploded into Rust vs C jokes, gripes about arcane syntax, and a 1Password sidebar, debating safety abstractions versus raw speed.

A nerdy mystery got spicy: the team found their hand‑written assembly (machine-level code) runs about 30% slower when called from Rust than when called from C, even though it’s the same function. The smoking gun? One single “load from memory” line lit up in the Rust profile because the Rust side stored extra data in temporary memory (the “stack”). They traced it to a fancy Rust wrapper that the compiler couldn’t simplify because the call goes through a function pointer. The fix was swapping in a simpler, more compiler-friendly version. MacBook-only tools and some guesswork added to the drama, but the investigation felt like CSI: CPU.

Then came the comments—pure cinema. One user groaned, “rust, you were meant to replace c++, not join it…”, turning the thread into a language culture war. Another deadpanned, “So rust isn’t just a clever name,” while outsiders confessed the Rust syntax looked “arcane.” Off-topic chaos? Oh yes: a PSA that the 1Password browser extension breaks code highlighting, with no clear fix yet, became its own subplot. The crowd split between performance diehards and practical folks who just want videos to play faster, debating whether safety features are worth tiny speed hits or if devs should “write simple code and let the compiler breathe.” The vibe: roasts, memes, and grudging respect for the detective work—now heading to HN and r/rust for round two.

Key Points

  • A shared NEON assembly function (cdef_filter4_pri_edged_8bpc_neon) is ~30% slower when called from rav1d (Rust) than from dav1d (C).
  • Profiling with samply on macOS shows a large sample delta concentrated in ld1 {v0.s}[2], [x13]: 10 samples (C) vs 441 samples (Rust).
  • The Rust caller stores more data on the stack than the C caller does; the extra stack traffic makes that one ld1 load stall, which is where the added samples land.
  • LLVM IR inspection indicates the compiler cannot optimize away a Rust abstraction across function pointers, blocking expected simplifications.
  • Switching to a more compiler-friendly abstraction (via a PR) resolves the optimization issue and aligns performance with the C baseline.
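
To make the function-pointer point above concrete, here is a minimal sketch in Rust. It is not rav1d's code: PixelRef, sum_wrapped, sum_plain, call_wrapped and call_plain are hypothetical names invented for illustration, and whether the real wrapper looked like this is an assumption. The idea is only that when a callee is reached through a function pointer, the compiler generally cannot inline it, so a wrapper passed by pointer has to be written to the caller's stack, while plain scalar arguments can stay in registers.

```rust
use std::hint::black_box;

/// Hypothetical wrapper type (not from rav1d): a pointer plus its length.
#[repr(C)]
pub struct PixelRef {
    pub ptr: *const u8,
    pub len: usize,
}

/// Callee signature that receives the wrapper by pointer.
pub type WrappedFn = unsafe extern "C" fn(*const PixelRef) -> u32;
/// Callee signature that receives plain scalars.
pub type PlainFn = unsafe extern "C" fn(*const u8, usize) -> u32;

/// Stand-in for an assembly routine: sums the bytes described by the wrapper.
pub unsafe extern "C" fn sum_wrapped(p: *const PixelRef) -> u32 {
    let r = unsafe { &*p };
    let bytes = unsafe { std::slice::from_raw_parts(r.ptr, r.len) };
    bytes.iter().map(|&b| u32::from(b)).sum()
}

/// Same work, but the pointer and length arrive as separate arguments.
pub unsafe extern "C" fn sum_plain(ptr: *const u8, len: usize) -> u32 {
    let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
    bytes.iter().map(|&b| u32::from(b)).sum()
}

/// `f` is opaque at this call site, so `r` must really exist in memory
/// (typically the caller's stack) before the call is made.
pub fn call_wrapped(f: WrappedFn, data: &[u8]) -> u32 {
    let r = PixelRef { ptr: data.as_ptr(), len: data.len() };
    unsafe { f(&r) }
}

/// Nothing needs to be built in memory here; both arguments are scalars.
pub fn call_plain(f: PlainFn, data: &[u8]) -> u32 {
    unsafe { f(data.as_ptr(), data.len()) }
}

fn main() {
    let data = [1u8, 2, 3, 4];
    // black_box hides which function each pointer targets, imitating
    // runtime dispatch to a CPU-specific routine.
    let wrapped = black_box(sum_wrapped as WrappedFn);
    let plain = black_box(sum_plain as PlainFn);
    assert_eq!(call_wrapped(wrapped, &data), 10);
    assert_eq!(call_plain(plain, &data), 10);
    println!("both calling styles agree");
}
```

Both versions compute the same result; the difference is only in what the caller has to store before the call. Comparing the optimized assembly of call_wrapped and call_plain (for example with rustc --emit asm -O) is one way to see the extra stack stores, though the exact output depends on the target and compiler version.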

Hottest takes

“rust, you were meant to replace c++, not join it…” — dmitrygr
“PSA: if you don’t see syntax highlighting, disable the 1Password extension” — saghm
“So rust isn’t just a clever name.” — neuroelectron