Modern Optimizers – An Alchemist's Notes on Deep Learning

Adam vs the “Whitening” Wizards: faster training or just hype?

TLDR: New “whitening” optimizers claim to train AI models faster by reshaping each step to the terrain. The community is split: some see real gains and clever stats links (hello Jeffreys prior), others say it’s hype and a rename of old ideas—classic Adam loyalists versus optimizer fashionistas.

Deep learning’s favorite workout plan, Adam, just got called out by a new squad of “spectral-whitening” optimizers promising faster training with fewer hiccups. The post explains a simple idea in plain terms: instead of stepping blindly downhill, these methods reshape each step to match the terrain—like switching from flip-flops to hiking boots. Cue the comments: skeptics say it’s fancy math for “turn the learning rate down,” while fans swear these tools hit a better speed-vs-cost sweet spot.

The hottest thread sparked over the mysterious square root in the “whitening” metric. One commenter dropped a brainy mic: it “looks a lot like the Jeffreys prior,” sending stats nerds into a mini frenzy. Others snarked that every year someone rebrands Newton’s method and natural gradients with a cooler name. Meanwhile, a side drama kicked off over calling it “whitening,” with folks arguing the term feels outdated, even if the math isn’t. Meme mode activated: “Just add more GPUs,” “Square roots are OP,” and “Adam isn’t broken, you are.” The vibe: half curiosity, half eye-roll, lots of popcorn. If these “alchemist” tricks actually tame the wild parameters, Adam might finally have a rival. If not, it’s another optimizer fashion week.

Key Points

  • Gradient descent steps can be formulated as minimizing the loss plus a distance penalty defined by a metric, generalizing beyond Euclidean distance (see the first sketch after this list).
  • A non-Euclidean metric introduces a matrix preconditioner, enabling updates that account for parameter sensitivity and can stabilize learning.
  • Newton’s method is impractical for deep nets due to the cost of forming the Hessian and its possible negative curvature; the Gauss-Newton matrix offers a positive semi-definite (PSD), computable alternative.
  • The whitening metric is defined as the square root of the Gauss-Newton matrix, proposed as a conservative and effective preconditioner (see the second sketch after this list).
  • The article examines whether spectral-whitening optimizers outperform Adam and outlines theoretical motivations and trade-offs.
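
A minimal sketch of the metric view behind the first two points, in our own notation (θ for the parameters, L for the loss, η for the step size, M for the metric); the article may use different symbols. Each step minimizes the locally linearized loss plus a distance penalty measured by M:

    \theta_{t+1} = \arg\min_{\theta} \left[ \nabla L(\theta_t)^\top (\theta - \theta_t) + \tfrac{1}{2\eta} \|\theta - \theta_t\|_M^2 \right],
    \qquad \|v\|_M^2 = v^\top M v.

Setting the gradient of the bracketed term to zero gives the preconditioned update

    \theta_{t+1} = \theta_t - \eta \, M^{-1} \nabla L(\theta_t),

which is plain gradient descent when M = I (the Euclidean metric) and a sensitivity-aware step for any other positive-definite M.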
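
And for the Gauss-Newton and whitening points, a toy dense Python sketch (our illustration under the square-root-of-Gauss-Newton definition above, assuming a model small enough that the Jacobian fits in memory; practical optimizers never materialize these matrices and rely on diagonal or factored approximations instead):

    import numpy as np

    def whitening_step(theta, grad, jac, lr=1e-2, damping=1e-6):
        """One preconditioned update: theta <- theta - lr * G^(-1/2) @ grad.

        jac is the Jacobian of the model outputs w.r.t. the parameters, so
        G = jac.T @ jac is the Gauss-Newton matrix for a squared-error loss
        (positive semi-definite by construction); the "whitening" metric uses
        its matrix square root as the preconditioner. Damping keeps the
        inverse square root finite when G has near-zero eigenvalues.
        """
        G = jac.T @ jac
        eigval, eigvec = np.linalg.eigh(G)   # symmetric PSD eigendecomposition
        inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(eigval + damping)) @ eigvec.T
        return theta - lr * inv_sqrt @ grad

A quick shape check on random data (sizes are hypothetical, purely illustrative):

    rng = np.random.default_rng(0)
    theta, grad = rng.normal(size=5), rng.normal(size=5)
    jac = rng.normal(size=(20, 5))           # 20 outputs, 5 parameters
    print(whitening_step(theta, grad, jac))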

Hottest takes

“Looks a lot like the Jeffreys prior” — derbOac
“Stop inventing new optimizers—just tune Adam” — shipitpls
“Square roots? My loss function can’t even add” — gpuGoblin
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.