Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon

AI training just got a surprise speed boost, and the comments are losing it

TLDR: Researchers say they sped up a costly part of training giant AI models by as much as 50%, which could shave meaningful time and money off building them. Commenters were wildly upbeat, calling it instantly useful and joking that old-school matrix math has unexpectedly become the hottest skill in tech.

The big headline from this paper is simple: the team says they found a way to make a pricey part of AI training up to 50% faster, which can trim real time off building giant language models. In plain English, this is about making the "engine tune-up" step in training less of a time-and-money hog. And the community reaction? Borderline victory lap. On the discussion thread, one commenter called it a "superior alternative" and showered the authors with thanks, while another did the back-of-the-napkin math and declared this could mean about a 7% overall speedup with "no downside". In AI-land, that’s the kind of claim that makes people sit up very fast.

The mood in the comments is less scandal and more geek astonishment, with a side of hero worship. Tri Dao’s lab got a mini fan-club moment, with one user basically saying the lab had already saved the world a mountain of electricity with FlashAttention and is now back with another efficiency sequel. The funniest reaction came from a commenter reminiscing about an old class on multiplying giant matrices, admitting they once thought it would never matter in the real world: "Boy was I wrong." That line pretty much became the thread’s accidental meme.

So yes, this is one of those deeply technical breakthroughs that somehow produced a very human response: relief, awe, and a chorus of people saying, in various ways, "Turns out boring math is running the future."

Key Points

• The article says Muon is being used to train models such as Kimi K2 Thinking and GLM-5, and that it needs fewer optimizer steps than AdamW but has a higher per-step cost.
• Muon’s extra cost is attributed to Newton-Schulz orthogonalization, which requires matrix multiplications with O(mn^2) cost instead of the O(mn) element-wise work used by optimizers like AdamW.
• Depending on training configuration, the article states that Newton-Schulz accounts for 2% to 17% of total end-to-end wall-clock training time.
• The article identifies three inefficiencies in standard Newton-Schulz implementations: many rectangular matrix multiplications, failure to exploit symmetry in intermediate matrices, and cuBLAS performance limitations on Hopper GPUs.
• The proposed Gram Newton-Schulz method is presented as reducing optimizer time by up to 50% in trillion-parameter mixture-of-experts models such as Kimi K2.
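
For readers who want to see where the rectangular matrix multiplications in the points above come from, here is a minimal NumPy sketch. It is not the paper's algorithm or kernel code. It assumes the classical cubic Newton-Schulz polynomial (coefficients 1.5 and -0.5), float64 arithmetic, and hypothetical function names chosen for illustration. The first function multiplies the full rectangular matrix twice per step; the second does the same iteration on the small Gram matrix and touches the rectangular matrix only twice in total.

```python
import numpy as np

# Classical cubic Newton-Schulz coefficients (an assumption for this sketch;
# the article does not list the coefficients used in practice).
CUBIC = (1.5, -0.5, 0.0)


def _poly(A, coeffs):
    """Evaluate p(A) = a*I + b*A + c*A^2 for a square matrix A."""
    a, b, c = coeffs
    eye = np.eye(A.shape[0], dtype=A.dtype)
    return a * eye + b * A + c * (A @ A)


def newton_schulz(G, steps=15, coeffs=CUBIC):
    """Baseline: two rectangular O(m n^2) products in every step."""
    transposed = G.shape[0] < G.shape[1]
    X = G.T if transposed else G          # work on the tall orientation
    X = X / (np.linalg.norm(X) + 1e-12)   # Frobenius norm: singular values <= 1
    for _ in range(steps):
        A = X.T @ X                       # rectangular product, n x n result
        X = X @ _poly(A, coeffs)          # rectangular product
    return X.T if transposed else X


def gram_newton_schulz_sketch(G, steps=15, coeffs=CUBIC):
    """Same iteration carried out on the n x n Gram matrix.

    Hypothetical illustration: only two rectangular products in total.
    All products inside the loop are small, square, and symmetric
    (NumPy does not exploit that symmetry; a tuned kernel could).
    """
    transposed = G.shape[0] < G.shape[1]
    X0 = G.T if transposed else G
    X0 = X0 / (np.linalg.norm(X0) + 1e-12)
    A = X0.T @ X0                         # rectangular product (once)
    Q = np.eye(A.shape[0], dtype=A.dtype) # accumulates the product of all p(A_k)
    for _ in range(steps):
        P = _poly(A, coeffs)
        Q = Q @ P                         # X_{k+1} = X_0 Q_{k+1}
        A = P @ A @ P                     # Gram matrix of X_{k+1}
    X = X0 @ Q                            # rectangular product (once)
    return X.T if transposed else X


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((64, 16))
    X_base = newton_schulz(G)
    X_gram = gram_newton_schulz_sketch(G)
    print("max difference:", np.abs(X_base - X_gram).max())
```

In exact arithmetic the two functions return the same matrix, because each step multiplies X on the right by p(X^T X) and the Gram matrix of the result is p(A) A p(A). In low precision the Gram form can behave differently, so a production version would need safeguards that this sketch leaves out.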

Hottest takes

"Fantastic work, instantly valuable, immediately usable" — cs702

"up to around 7% of basically pure improvement with no downside" — akoboldfrying

"Boy was I wrong" — jnwatson

June 11, 2026

Big Math Energy
