Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets

New compiler trick speeds up 32-bit math — cheers, nitpicks, and a 'missed the art' cry

TLDR: A new method makes 32‑bit unsigned division by constants much faster on 64‑bit chips. The patch is already merged into LLVM, and microbenchmarks show gains of up to 1.98×. Commenters are split between applause and “seen it before,” arguing over vector limits and whether the paper ignored well‑known “magic number” tricks.

A new paper promises to make old-school 32‑bit math scream on today’s 64‑bit chips, and the internet is having a moment. The team’s trick speeds up dividing a 32‑bit unsigned number by a constant (think “divide by 7”) and clocked in at up to 1.67× faster on an Intel Xeon w9‑3495X and 1.98× faster on Apple M4 in microbenchmarks. Even better, the patch is already merged into LLVM, the toolkit behind compilers such as Clang. Translation: this could hit real apps soon.

But the comments? Pure theater. One camp says, duh, compilers already swap slow division for a quick multiply‑and‑shift shortcut. User foltik drops the napkin‑math version and hints that some divisors are “problem children.” Cue the meme: dividing by 7 is a boss fight. Another camp waves the PR receipts, with gopalv pointing out that the new trick works great in a CPU register but not inside vectors (the multi‑number lanes used for ultra‑fast math). Bonus shade: “fastdiv hasn’t had an update in years.”
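For readers who want to see the napkin math run, here is a minimal sketch of the classic multiply‑and‑shift shortcut in Python. It is not the paper’s new method. The function names `div3` and `div7` are made up for illustration, and the constants are the textbook “magic numbers” for 32‑bit unsigned inputs. Dividing by 3 is the easy case because its multiplier fits in 32 bits. Dividing by 7 is a “problem child” because the exact multiplier needs 33 bits, so the traditional 32‑bit sequence drops the top bit and patches the result with an extra subtract, shift, and add.

```python
# Sketch only: Python integers never overflow, so the register widths
# live in the comments rather than in the types.

def div3(x):
    # ceil(2**33 / 3) == 0xAAAAAAAB fits in 32 bits, so one multiply
    # and one shift are enough: x / d = (x * c) >> a.
    return (x * 0xAAAAAAAB) >> 33

def div7(x):
    # ceil(2**35 / 7) == 0x124924925 needs 33 bits. The classic 32-bit
    # sequence keeps only the low 32 bits of that multiplier...
    q = (x * 0x24924925) >> 32      # high half of a 32x32 multiply
    # ...and makes up for the dropped top bit with a fixup step.
    t = ((x - q) >> 1) + q
    return t >> 2

if __name__ == "__main__":
    for x in (0, 6, 7, 100, 0xFFFFFFFF):
        assert div3(x) == x // 3
        assert div7(x) == x // 7
    print("multiply-and-shift matches integer division")
```

The extra fixup in `div7` is the kind of overhead commenters mean when they say certain divisors are harder than others.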

Then comes the spice. Commenter ridiculous_fish detonates: “This paper missed the state of the art!!!” and dives into “magic numbers,” those special multipliers compilers precompute to fake division. purplesyringa is surprised too, name‑dropping Daniel Lemire’s blog as prior reading. Meanwhile pkaye plays librarian with a throwback rec: “Hacker’s Delight.” Verdict: real speedups, real code, and a real brawl over who discovered what first. Nerd drama never divides evenly.

Key Points

  • The paper proposes a 64-bit-targeted optimization for 32-bit unsigned division by constants.
  • It addresses inefficiencies in existing code generation based on the Granlund-Montgomery (GM) method for divisors like 7 on 64-bit CPUs.
  • Patches implementing the optimization were created for LLVM and GCC.
  • Microbenchmarks show 1.67x speedup on Intel Xeon w9-3495X (Sapphire Rapids) and 1.98x on Apple M4.
  • The LLVM patch has been merged into the llvm:main branch, indicating practical adoption.

Hottest takes

"x / d = (x * c) >> a" — foltik
"you can do this in a register but not inside vectors." — gopalv
"This paper missed the state of the art!!!" — ridiculous_fish