March 20, 2026
A teen tweak steals the spotlight
Attention Residuals
AI gets a memory makeover that could cut costs — and a teen led the charge
TLDR: AttnRes lets AI pick what to remember from earlier steps, boosting scores and promising cheaper, lighter models. Commenters are hyped about lower costs and home-friendly inference, debating whether it’s fresh innovation or LSTM déjà vu, while cheering that one of the authors is still in high school.
The research drop called Attention Residuals (AttnRes) just lit up the comment sections. In plain English: it’s a new way for AI “layers” to pick what to remember from earlier steps instead of mashing everything together. Fans say this could make big models smarter while using less muscle. One excited voice claims it could even cut training needs by about 20% and lower bandwidth for running models at home. Translation: faster experiments, cheaper bills, more AI on consumer PCs.
Numbers flexed too: the team reports better scores across tough tests, especially reasoning and coding (big jumps on GPQA-Diamond and HumanEval). But the real drama? A commenter yelled "this feels like LSTMs," the old-school tech with memory gates, sparking the eternal internet debate: visionary upgrade or just a clever remix? Another commenter zeroed in on the practical magic: the Block AttnRes version groups layers so it scales without melting your GPU — the supposed sweet spot is about 8 blocks.
Meanwhile, the community is losing it over the human story: one author is reportedly a high school student (link). There’s the obligatory paper link, the “run it on my gaming rig” memes, and a chorus of “drop-in replacement? yes please.” It’s hype, history lessons, and hardware dreams all at once.
Key Points
- AttnRes replaces uniform residual accumulation with softmax attention over preceding layer outputs, using a learned, input-dependent pseudo-query per layer.
- Full AttnRes has O(Ld) memory; Block AttnRes reduces this to O(Nd) by attending over block-level representations and recovers most benefits with ~8 blocks.
- Scaling experiments show AttnRes consistently outperforms baselines across compute budgets; Block AttnRes matches a baseline trained with 1.25× more compute.
- Downstream evaluations on a Kimi Linear 48B / 3B activated model (1.4T tokens) show gains across MMLU, GPQA-Diamond, BBH, TriviaQA, Math, HumanEval, MBPP, CMMLU, and C-Eval.
- Training dynamics indicate AttnRes keeps hidden-state magnitudes bounded and distributes gradient norms more uniformly, mitigating PreNorm dilution.
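For the curious, the core idea above can be sketched in a few lines. This is a minimal, hypothetical NumPy illustration of the mechanism as the key points describe it — a learned projection `W_q` producing an input-dependent pseudo-query that softmax-attends over the stack of preceding layer outputs, instead of summing them uniformly. The function name, shapes, and scaling are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(layer_outputs, W_q, x):
    """Hypothetical sketch of an AttnRes-style residual, for one token.

    Standard residual stream: return layer_outputs.sum(axis=0)
    (every earlier layer weighted equally). Here, the current input
    forms a pseudo-query and softmax attention picks how much of each
    earlier layer's output to carry forward.

    layer_outputs: (L, d) outputs of the L preceding layers
    W_q:           (d, d) learned pseudo-query projection (assumed shape)
    x:             (d,)   current layer's input representation
    """
    q = x @ W_q                                   # input-dependent pseudo-query, (d,)
    scores = layer_outputs @ q / np.sqrt(len(q))  # similarity to each earlier layer, (L,)
    weights = softmax(scores)                     # convex combination over layers
    return weights @ layer_outputs                # attended residual, (d,)

# Toy usage with random data
rng = np.random.default_rng(0)
L, d = 6, 8
outs = rng.normal(size=(L, d))
Wq = rng.normal(size=(d, d))
x = rng.normal(size=(d,))
res = attn_residual(outs, Wq, x)
```

The Block AttnRes variant from the key points would, under the same reading, group the L layers into N blocks and attend over one representation per block, shrinking the stored stack from (L, d) to (N, d) — the O(Ld) → O(Nd) memory saving the commenters are excited about.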