Binary GCD

Old math hack gets a speed glow‑up — and the comments go feral

TLDR: A classic “binary GCD” swaps slow division for fast bit shifts to speed up finding the greatest common divisor of two numbers. Readers are split between “prove it with benchmarks” skeptics and “shifts go brrr” enthusiasts, with a top link to Lemire’s tests reminding everyone it’s old, proven, and still spicy.

A coder dusted off a vintage math trick called “binary GCD” (a way to find the greatest common divisor of two integers using fast bit shifts instead of slow division), and the crowd instantly turned into a high‑energy pit. The post admits the first version ran about 2x slower than the regular library tool (ouch), then pivots to a clever cleanup: ditch divisions, use a built‑in “count trailing zeros” move to jump multiple steps at once, and go branchless so the CPU doesn’t trip over if/else gymnastics.
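For a feel of what that first attempt looks like, here is a minimal sketch of the textbook recursive binary GCD. It is not the post's actual code: the name gcd_naive and the choice of 64-bit unsigned inputs are assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical illustration of the textbook recursive binary GCD.
// Uses only shifts, comparisons, and subtractions, but branches on
// every step, which is the likely reason this style runs slowly.
std::uint64_t gcd_naive(std::uint64_t a, std::uint64_t b) {
    if (a == 0) return b;
    if (b == 0) return a;
    if (a == b) return a;

    const bool a_even = (a & 1) == 0;
    const bool b_even = (b & 1) == 0;

    // Both even: 2 is a common factor, pull it out.
    if (a_even && b_even) return 2 * gcd_naive(a >> 1, b >> 1);
    // Exactly one even: that factor of 2 cannot be shared, drop it.
    if (a_even) return gcd_naive(a >> 1, b);
    if (b_even) return gcd_naive(a, b >> 1);
    // Both odd: the difference is even, so halve it right away.
    return a > b ? gcd_naive((a - b) >> 1, b)
                 : gcd_naive(a, (b - a) >> 1);
}
```

Every call checks up to six conditions before doing one shift or one subtraction, which is the "if/else gymnastics" the cleanup goes after.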

That’s where the community split. One camp screamed “old news, show receipts!”, with a top nudge to Daniel Lemire’s benchmarked take. Another camp cheered the makeover: “cut the divides, let the shifts go brrr.” The pragmatists rolled in asking for real‑world tests, 64‑bit edge cases, negatives, and what modern compilers already do under the hood. Meanwhile, performance purists fought the “don’t reinvent std::gcd” crowd in a classic internet showdown: micro‑optimize vs. trust the library.

Jokes flew fast: “The divide instruction is lava,” “Ancient China speedruns your math homework,” and the inevitable meme of a GPU‑buff shift flexing on a sad little divide. Bottom line: the algorithm’s ancient, the tricks are modern, and the comments are the real benchmark — loud, spicy, and demanding proof.

Key Points

  • General integer division (idiv) on x86 is slow; profiling shows ~90% of the runtime of a naive division-based GCD implementation is spent on division.
  • Binary GCD replaces division with shifts, comparisons, and subtractions, retaining logarithmic running time.
  • A naive recursive binary GCD in C++ performs worse than std::gcd due to heavy branching among cases.
  • Using __builtin_ctz to count trailing zero bits lets one shift divide out the largest power of two at once, which reduces iterations and improves performance.
  • Reorganizing cases removes repeated even-even checks and enables a branchless main loop: after each odd-odd subtraction the (even) difference is immediately shifted back to odd.

Hottest takes

"related: https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/" — tosh