November 11, 2025
Modern Optimizers – An Alchemist's Notes on Deep Learning
Adam vs the “Whitening” Wizards: faster training or just hype?
TLDR: New “whitening” optimizers claim to train AI models faster by reshaping each step to the terrain. The community is split: some see real gains and clever stats links (hello Jeffreys prior), others say it’s hype and a rename of old ideas—classic Adam loyalists versus optimizer fashionistas.
Deep learning’s favorite workout plan, Adam, just got called out by a new squad of “spectral-whitening” optimizers promising faster training with fewer hiccups. The post explains a simple idea in plain terms: instead of stepping blindly downhill, these methods reshape each step to match the terrain—like switching from flip-flops to hiking boots. Cue the comments: skeptics say it’s fancy math for “turn the learning rate down,” while fans swear these tools hit a better speed-vs-cost sweet spot.
The hottest thread flares up around the mysterious square root in the “whitening” metric. One commenter dropped a brainy mic: it “looks a lot like the Jeffreys prior,” sending stats nerds into a mini frenzy. Others snarked that every year someone rebrands Newton’s method and natural gradients with a cooler name. Meanwhile, a side-drama kicked off over calling it “whitening,” with folks arguing the term feels outdated—even if the math isn’t. Meme mode activated: “Just add more GPUs,” “Square roots are OP,” and “Adam isn’t broken, you are.” The vibe: half curiosity, half eye-roll, lots of popcorn. If these “alchemist” tricks actually tame the wild parameters, Adam might finally have a rival. If not, it’s another optimizer fashion week.
Key Points
- Gradient descent steps can be formulated as minimizing loss with a distance penalty defined by a metric, generalizing beyond Euclidean distance.
- A non-Euclidean metric introduces a matrix preconditioner, enabling updates that account for parameter sensitivity and can stabilize learning.
- Newton’s method is impractical for deep nets due to Hessian cost and negative curvature; Gauss-Newton offers a PSD, computable alternative.
- The whitening metric is defined as the square root of the Gauss-Newton matrix, proposed as a conservative and effective preconditioner.
- The article examines whether spectral-whitening optimizers outperform Adam and outlines theoretical motivations and trade-offs.
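The first two bullets can be written out explicitly. A sketch in standard preconditioning notation (the symbols here follow convention and are not necessarily the article's):

```latex
% Gradient step as loss minimization with a distance penalty under metric M:
\theta_{t+1} = \arg\min_{\theta}\;
  \nabla L(\theta_t)^\top (\theta - \theta_t)
  + \frac{1}{2\eta}\,\|\theta - \theta_t\|_M^2,
\qquad \|v\|_M^2 = v^\top M v
% Setting the gradient in \theta to zero gives the closed-form update,
% a preconditioned gradient step:
\theta_{t+1} = \theta_t - \eta\, M^{-1} \nabla L(\theta_t)
% M = I recovers vanilla gradient descent; M = H (the Hessian) gives
% Newton's method; the whitening metric instead takes M = G^{1/2},
% the square root of the Gauss-Newton matrix G.
```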
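To make the preconditioning idea concrete, here is a minimal numerical sketch of a whitening-style update on a toy ill-conditioned quadratic. Everything here (`whitening_step`, the toy loss, the learning rate) is an illustrative assumption, not the article's implementation; real optimizers use structured approximations rather than a full-matrix inverse square root.

```python
import numpy as np

def whitening_step(theta, grad, G, lr=0.1, eps=1e-8):
    """One preconditioned step: theta - lr * G^{-1/2} @ grad,
    where G is a PSD Gauss-Newton-style curvature estimate.
    Toy full-matrix version for a handful of parameters only."""
    # Eigendecompose the symmetric PSD matrix G.
    vals, vecs = np.linalg.eigh(G)
    # Inverse square root, damped so near-zero eigenvalues don't blow up.
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return theta - lr * inv_sqrt @ grad

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta, condition number 100.
A = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
for _ in range(50):
    grad = A @ theta   # gradient of the quadratic
    G = A              # Gauss-Newton matrix equals A for this quadratic
    theta = whitening_step(theta, grad, G, lr=0.1)

print(theta)  # both coordinates driven near zero despite ill-conditioning
```

The effective step is `-lr * A**0.5 @ theta`, so the condition number the optimizer sees drops from 100 to 10: a geometric middle ground between plain gradient descent (sees 100) and Newton's method (sees 1), matching the "conservative preconditioner" framing in the bullets.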