May 6, 2026
404s, formulas, and forum drama
A Simpler Parametrization for Modern Optimizers
Big brain math post drops... and the crowd only sees a vanished page
TLDR: The article pitched a simpler way to train AI by locking in weight size and focusing on direction changes instead of juggling extra settings. But commenters barely got that far, because the real drama was the missing page, deleted-post sleuthing, and jokes about a broken link earning karma.
This was supposed to be a neat, stripped-down idea about how to train artificial intelligence models more simply: keep each chunk of a model’s weights at a fixed “size” set at the start, then steer only the direction of those weights, governed by a single global half-life that sets how fast directions change. In plain English, the author argues that what programmers often call “weight decay” shouldn’t be a separate knob at all; it should fall out naturally from the rule that keeps each weight’s size constant. It’s a very math-heavy attempt to make modern training feel cleaner, tidier, and less cluttered with arbitrary settings.
But the community’s reaction? Absolute 404-core chaos. Instead of debating the theory, commenters piled into the far juicier scandal: the article link appeared broken or deleted. One person bluntly reported seeing only “we’ve misplaced that URL,” while others asked the devastatingly internet question: why does a dead link already have karma? Then the detective work began. A commenter dug up a GitHub commit suggesting the post was deleted hours earlier, which only turned the thread into a mini mystery. And in a final comic twist, someone trying to investigate got smacked by GitHub’s “too many requests” limit — a perfect ending for a thread where the comments had more plot than the paper.
So yes, there’s a serious math idea here. But the real show was the crowd turning a niche optimizer note into a missing-page melodrama.
Key Points
- The article formulates normalized optimization as constrained stochastic optimization on a product of fixed-radius RMS spheres, with each block's radius set by initialization.
- It defines the tangent space and tangent projection on an RMS sphere, and expresses constrained gradient flow as the negative tangent projection of the gradient.
- The radial correction term in the update is presented as the exact Lagrange multiplier that preserves the radius, rather than as an independently tuned weight-decay penalty (see the first sketch after this list).
- The article introduces the halving exponent as an additive parametrization for retention values and uses a global direction half-life to compute the per-update retention (second sketch below).
- For language models, the training count can be measured in processed tokens, which lets the angular movement per update adjust automatically when batch size changes while the token half-life stays fixed (third sketch below).
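To make the first three points concrete, here is a minimal numpy sketch of one constrained update on a fixed-radius RMS sphere. It assumes the scheme boils down to "project the gradient onto the tangent space, step, then rescale back onto the sphere"; the function names are mine, not the article's, and this is an illustration of the idea rather than the article's exact algorithm.

```python
import numpy as np

def rms(w):
    # RMS norm: Euclidean norm divided by the square root of the block size.
    return np.linalg.norm(w) / np.sqrt(w.size)

def rms_sphere_step(w, grad, lr, radius):
    """One constrained step on a fixed-radius RMS sphere.

    The radial component subtracted here plays the role the article
    assigns to the exact Lagrange multiplier: it keeps the update in
    the tangent space instead of acting as a separately tuned
    weight-decay penalty. The final rescale restores the radius exactly.
    """
    radial = (np.dot(grad, w) / np.dot(w, w)) * w   # component of grad along w
    g_tan = grad - radial                           # tangent projection
    w_new = w - lr * g_tan                          # step in the tangent space
    return w_new * (radius / rms(w_new))            # retract onto the sphere

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
radius = rms(w)                 # each block's radius is fixed at initialization
w = rms_sphere_step(w, rng.normal(size=1024), lr=0.02, radius=radius)
assert abs(rms(w) - radius) < 1e-12
```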
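For the halving-exponent point, a small sketch under the assumption that "retention" means the multiplicative factor applied to the old direction at each update; both function names are illustrative, not from the article.

```python
import numpy as np

def retention_from_half_life(half_life_steps):
    # Per-update retention chosen so that after half_life_steps updates
    # exactly half of the original direction remains.
    return 0.5 ** (1.0 / half_life_steps)

def halving_exponent(beta):
    # Additive parametrization: beta = 2**(-h). Composing retentions
    # multiplies the betas, which simply adds their halving exponents h.
    return -np.log2(beta)

beta = retention_from_half_life(1000.0)
assert np.isclose(beta ** 1000, 0.5)                 # half-life is honored
assert np.isclose(halving_exponent(beta) * 1000, 1)  # exponents sum to one halving
```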
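Finally, a sketch of the token-based bookkeeping: if the half-life is stated in processed tokens, the retention applied per optimizer step follows from however many tokens that step consumed, so changing the batch size changes the per-step angular movement but not the token half-life. The numbers here are invented for illustration.

```python
def per_step_retention(tokens_per_batch, token_half_life):
    # Each update retains 2**(-tokens_per_batch / token_half_life) of the
    # old direction, so the half-life measured in tokens is batch-size
    # invariant: bigger batches mean fewer, larger angular moves.
    return 0.5 ** (tokens_per_batch / token_half_life)

half_life_tokens = 1_000_000
small = per_step_retention(250_000, half_life_tokens)  # many gentle steps
large = per_step_retention(500_000, half_life_tokens)  # fewer, bigger steps

# After the same million tokens, both schedules retain the same fraction.
assert abs(small ** 4 - large ** 2) < 1e-12
```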