Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon

AI training just got a surprise speed boost, and the comments are losing it

TLDR: Researchers say they cut the time spent on a costly part of training giant AI models by as much as 50%, which could shave meaningful time and money off building them. Commenters were wildly upbeat, calling it instantly useful and joking that old-school matrix math has unexpectedly become the hottest skill in tech.

The big headline from this paper is simple: the team says they found a way to cut the time spent on a pricey part of AI training, the optimizer step, by up to 50%, which can trim real time off building giant language models. In plain English, this is about making the "engine tune-up" step in training less of a time-and-money hog. And the community reaction? Borderline victory lap. On the discussion thread, one commenter called it a "superior alternative" and showered the authors with thanks, while another did the back-of-the-napkin math and declared this could mean about a 7% overall speedup with "no downside". In AI-land, that’s the kind of claim that makes people sit up very fast.
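For readers who want to redo that napkin math, here is a rough sketch using the figures listed under Key Points. It assumes the "up to 50%" reduction applies to the whole Newton-Schulz share of wall-clock time (2% to 17%) and that nothing else in training changes. The function names are made up for this example, and the commenter's own calculation may have used different inputs.

```python
def wall_clock_saving(optimizer_share, optimizer_reduction):
    """Fraction of total training time saved when only the optimizer part shrinks.

    optimizer_share:     fraction of wall-clock time spent in that part (0..1)
    optimizer_reduction: fraction by which that part's time is cut (0..1)
    """
    return optimizer_share * optimizer_reduction


def overall_speedup(optimizer_share, optimizer_reduction):
    """End-to-end speedup factor (old time / new time)."""
    return 1.0 / (1.0 - wall_clock_saving(optimizer_share, optimizer_reduction))


# Low and high ends of the share range quoted in the article, at the best-case 50% cut.
for share in (0.02, 0.17):
    saved = wall_clock_saving(share, 0.50)
    speedup = overall_speedup(share, 0.50)
    print(f"share={share:.0%}  time saved={saved:.1%}  speedup={speedup:.3f}x")
```

Under these assumptions the saving is simply share times reduction, so a figure near 7% sits toward the upper end of the quoted range.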

The mood in the comments is less scandal and more geek astonishment, with a side of hero worship. Tri Dao’s lab got a mini fan-club moment, with one user basically saying they’ve already saved the world a mountain of electricity with FlashAttention, and now they’re back for another efficiency sequel. The funniest reaction came from a commenter reminiscing about an old class on multiplying giant matrices, admitting they once thought it would never matter in the real world: "Boy was I wrong." That line pretty much became the thread’s accidental meme.

So yes, this is one of those deeply technical breakthroughs that somehow produced a very human response: relief, awe, and a chorus of people saying, in various ways, "Turns out boring math is running the future."

Key Points

  • The article says Muon is being used to train models such as Kimi K2 Thinking and GLM-5, and that it needs fewer optimizer steps than AdamW but has a higher per-step cost.
  • Muon’s extra cost is attributed to Newton-Schulz orthogonalization, which requires matrix multiplications with O(mn^2) cost instead of the O(mn) element-wise work used by optimizers like AdamW.
  • Depending on training configuration, the article states that Newton-Schulz accounts for 2% to 17% of total end-to-end wall-clock training time.
  • The article identifies three inefficiencies in standard Newton-Schulz implementations: many rectangular matrix multiplications, failure to exploit symmetry in intermediate matrices, and cuBLAS performance limitations on Hopper GPUs.
  • The proposed Gram Newton-Schulz method is presented as reducing optimizer time by up to 50% in trillion-parameter mixture-of-experts models such as Kimi K2.
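
To make the cost argument in these points concrete, here is a minimal NumPy sketch of Newton-Schulz orthogonalization. It uses the textbook cubic iteration X <- 1.5 X - 0.5 (X X^T) X, not the tuned polynomial used in Muon. The second function is only an illustrative reformulation that moves the loop onto the small Gram matrix; it is an assumption about the general idea, not the paper's algorithm. The function names, the step count, and the convention that the matrix is n x m with n <= m are all choices made for this example.

```python
import numpy as np


def newton_schulz(G, steps=30):
    """Textbook cubic Newton-Schulz on a wide (n x m, n <= m) matrix.

    Each step performs two rectangular matmuls, each costing O(m * n^2).
    """
    X = G / np.linalg.norm(G)  # Frobenius norm, so every singular value is <= 1
    for _ in range(steps):
        A = X @ X.T                  # (n x m) @ (m x n): rectangular matmul
        X = 1.5 * X - 0.5 * (A @ X)  # (n x n) @ (n x m): rectangular matmul
    return X


def gram_newton_schulz(G, steps=30):
    """Hypothetical Gram-space variant of the same iteration (illustration only).

    Writing X_k = Q_k @ X_0, the loop only needs the n x n Gram matrix
    A_k = X_k @ X_k.T, which updates as A_{k+1} = P_k @ A_k @ P_k with
    P_k = 1.5*I - 0.5*A_k. Rectangular matmuls happen only twice in total.
    """
    X0 = G / np.linalg.norm(G)
    A = X0 @ X0.T                 # first rectangular matmul
    I = np.eye(A.shape[0])
    Q = I.copy()
    for _ in range(steps):
        P = 1.5 * I - 0.5 * A     # symmetric n x n update factor
        Q = P @ Q                 # accumulate the transform (square matmul)
        A = P @ A @ P             # Gram matrix of the next iterate (square matmuls)
    return Q @ X0                 # second and last rectangular matmul


rng = np.random.default_rng(0)
G = rng.standard_normal((8, 32))
X = newton_schulz(G)
Y = gram_newton_schulz(G)
print("max |X X^T - I|:", np.abs(X @ X.T - np.eye(8)).max())
print("max |X - Y|    :", np.abs(X - Y).max())
```

In double precision both functions should agree closely, but the first does two rectangular products per step while the second does two in total and keeps the rest of the work in square, symmetric n x n matrices. Low-precision GPU training may raise numerical issues that this sketch does not address.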

Hottest takes

"Fantastic work, instantly valuable, immediately usable" — cs702
"up to around 7% of basically pure improvement with no downside" — akoboldfrying
"Boy was I wrong" — jnwatson