May 28, 2026

When “smart” got embarrassingly slow

Tuning LLVM's SLP Vectorizer Cost Model

A tiny compiler tweak made code way slower, and the comments were not calm

TLDR: A recent LLVM change tried to replace a simple step-by-step math sequence with a fancier shortcut, but it accidentally added extra work and made one benchmark much slower. Commenters loved the bug hunt, roasted the so-called optimization, and argued over whether this was a rare mistake or just normal compiler chaos.

What should have been a nerdy little performance fix turned into full comment-section theater after developer Kavin Gnanapandithan tracked down why a recent LLVM update made one benchmark dramatically slower. In plain English: a tool that helps turn code into fast machine instructions got a little too excited about a “smart” shortcut, and that shortcut ended up doing extra busywork on every single loop iteration. The result was ugly: far more instructions, many more cycles, and a benchmark that looked like it had slammed the brakes.

The community reaction was a mix of respect, relief, and classic compiler-dev doom-posting. A lot of readers praised the detective work, calling it a perfect example of why performance work is never “done.” Others piled on the broader complaint: optimizers love to call something “profitable” right up until real hardware says, “absolutely not.” The spiciest debate was over blame. Was this an embarrassing miss in LLVM’s cost model, or just the unavoidable price of building super-complex tools for wildly different chips? Predictably, both camps showed up.

And yes, the jokes landed too. Commenters mocked the idea of “optimizing” by storing values to memory just to reload them again, with vibes ranging from “this is just moving furniture in a fire” to “the compiler invented extra chores for itself.” Even non-experts could follow the drama: the machine was told to be clever, got carried away, and the internet instantly noticed.

Key Points

  • The article attributes an LLVM performance regression to a cost-model error in the SLP vectorizer when evaluating ordered vector reductions.
  • A benchmark on Igalia’s LNT instance for the BPI-F3 showed an 89% delta, with about 26% more issued instructions and 48% more cycles.
  • The newer LLVM-generated code stores scalar floating-point values to the stack using `fsd`, loads them with `vle64.v`, and then reduces them with `vfredosum.vs`.
  • The previous implementation used a direct ordered chain of scalar `fadd` instructions without the extra stack-store and vector-load sequence.
  • The post says the cost model failed to include the per-iteration cost of building the initial vector, leading LLVM to mark unprofitable code as profitable.
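To see how one missing term can flip the verdict, here is a toy sketch in Python. It is not LLVM's cost model: the function names and every cost number are invented for illustration, and the only idea taken from the post is that the vector path looks cheap if you forget to charge for building the vector on each iteration.

```python
# Toy model only. All costs are made-up units, not LLVM's real numbers.
SCALAR_FADD_COST = 1      # one scalar floating-point add
STORE_COST = 1            # one scalar store to the stack (the fsd step)
VECTOR_LOAD_COST = 1      # one vector load (the vle64.v step)
ORDERED_REDUCE_COST = 2   # hypothetical cost of one ordered reduction (the vfredosum.vs step)


def scalar_chain_cost(num_values: int) -> int:
    """Cost of adding num_values operands one after another, in order."""
    return num_values * SCALAR_FADD_COST


def vector_reduction_cost(num_values: int, include_build_vector: bool) -> int:
    """Cost of the vector path, optionally charging for building the vector."""
    cost = ORDERED_REDUCE_COST
    if include_build_vector:
        # Each scalar is stored to the stack, then everything is reloaded as a vector.
        cost += num_values * STORE_COST + VECTOR_LOAD_COST
    return cost


def is_profitable(num_values: int, include_build_vector: bool) -> bool:
    """The vector path counts as profitable only if it is strictly cheaper."""
    vector_cost = vector_reduction_cost(num_values, include_build_vector)
    return vector_cost < scalar_chain_cost(num_values)


if __name__ == "__main__":
    # Forgetting the build cost: 2 < 4, so the vector path looks like a win.
    print(is_profitable(4, include_build_vector=False))  # True
    # Charging for it: 2 + 4 + 1 = 7, which is not < 4, so it is a loss.
    print(is_profitable(4, include_build_vector=True))   # False
```

With these invented numbers, the same four-value reduction goes from "profitable" to "not profitable" the moment the stores and the reload are counted, which is the shape of the mistake the post describes.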

Hottest takes

"profitable according to who? certainly not the CPU" — @commenter
"store it, load it, regret it" — @commenter
"this is why benchmarking beats vibes every time" — @commenter