March 6, 2026
Sign me up or signSGD off?
Polar Factor Beyond Newton-Schulz – Fast Matrix Inverse Square Root
AI devs split: new matrix move claims speed; skeptics say 'just use the old way'
TLDR: New method claims faster, safer matrix “direction” updates by working on the small side and certifying quality on the fly. The crowd is split between fans of speed and stability and skeptics yelling for the old, reliable approach, with memes about Polar Bears and training crashes.
A new post promises a faster way to get the “direction” of a matrix (think: pointing the arrow without caring about length) for training AI models. Instead of a bulky route, it crunches the smaller side first and then does one big multiply, plus an online check to prove the result is close to perfect. The goal: speed on GPUs, stability in low-precision math, and fewer training meltdowns.
Cue the comments going full popcorn. One camp cheers the online certificate—“finally a quick sanity check”—saying it’s better than the older “Polar Express” tricks from Amsel et al. and avoids the controversial AOL column scaling from Turbo-Muon (Boissin et al.) that some say adds bias. Another camp counters with “just do SVD,” aka the classic slow-but-accurate method: simple, predictable, done. The snark squad calls this “putting a Polar Bear costume on signSGD,” poking fun at Muon/Lion optimizer hype. Meanwhile, bf16 (a small-number format) survivors are begging for fewer NaNs and seem thrilled at anything that keeps training steady. Jokes fly—“Gramazon Prime shipping your inverse square roots,” “it’s all polars all the way down”—while practitioners debate whether forming the Gram matrix will secretly wreck conditioning. Bottom line: speed vs. simplicity, certificates vs. trust-me bro, and a whole lot of polar puns.
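For reference, the “just do SVD” baseline the skeptics favor is a few lines of NumPy. This is a generic sketch of the classic route (the function name `polar_via_svd` is our own, not from the article): take the full SVD and discard the singular values, keeping only the direction.

```python
import numpy as np

def polar_via_svd(G):
    """The slow-but-accurate baseline: the polar factor of G is W @ Vt,
    where G = W @ diag(s) @ Vt is the (thin) SVD. Dropping the singular
    values keeps the 'direction' of G and throws away its 'length'."""
    W, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W @ Vt
```

Simple and predictable, but the SVD itself is the bulky step the new method is trying to avoid on GPUs.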
Key Points
- The article proposes computing the polar factor of a tall matrix by operating on the Gram matrix B = GᵀG and approximating B^{-1/2} with minimax polynomials and Jacobi preconditioning.
- The method performs iterative work on small n×n matrices and requires only two rectangular GEMMs, making it GPU-efficient and suitable for bf16 with fp32 accumulation as needed.
- An online certificate is provided via the Gram residual E = Z̃ᵀ B Z̃ − I, which bounds the singular values of Ũ and enables runtime assessment of accuracy via ||E||_F.
- The approach emphasizes online coefficient selection, contrasting with Polar Express’s offline coefficients, and seeks stronger guarantees than rectangular-only iterations.
- The article advises against Turbo-Muon’s AOL column scaling (which biases the target) and instead uses unbiased Jacobi preconditioning on the SPD Gram matrix.
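The overall shape of the Gram-side recipe can be sketched in NumPy. Caveats: this is a hedged illustration, not the article’s algorithm — where the article uses minimax polynomials with Jacobi preconditioning for B^{-1/2}, this sketch substitutes a plain coupled Newton–Schulz iteration, and the function name `polar_via_gram` is ours. What it does show faithfully is the structure the key points describe: all iterative work happens on the small n×n Gram matrix, only two rectangular GEMMs touch the tall matrix, and the Gram residual E = Z̃ᵀ B Z̃ − I serves as an online certificate.

```python
import numpy as np

def polar_via_gram(G, iters=8):
    """Approximate the polar factor U of a tall matrix G (m >= n) by
    working on the small Gram matrix B = G^T G.  B^{-1/2} is computed
    here with a coupled Newton-Schulz iteration (a stand-in for the
    article's minimax polynomials + Jacobi preconditioning)."""
    n = G.shape[1]
    B = G.T @ G                          # first rectangular GEMM: n x n Gram matrix
    s = np.linalg.norm(B, 'fro')         # scale so eigenvalues of Y lie in (0, 1]
    Y = B / s
    Z = np.eye(n)
    for _ in range(iters):               # Y -> (B/s)^{1/2},  Z -> (B/s)^{-1/2}
        T = 0.5 * (3.0 * np.eye(n) - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    B_inv_sqrt = Z / np.sqrt(s)          # undo the scaling: B^{-1/2}
    U = G @ B_inv_sqrt                   # second rectangular GEMM
    # Online certificate: Gram residual E = Z^T B Z - I, reported via ||E||_F.
    E = B_inv_sqrt.T @ B @ B_inv_sqrt - np.eye(n)
    return U, np.linalg.norm(E, 'fro')
```

Note the cost profile: the two `G`-sized multiplies dominate, while the iteration loops over n×n matrices — which is exactly why the approach is pitched as GPU-friendly for tall-skinny gradients, and why commenters worry about the conditioning of explicitly formed GᵀG.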