March 4, 2026

Slow-cooking AI, fast-cooking comments

NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute

AI slow-cooks on tiny data, big GPUs; fans cheer, skeptics yell ‘memorization’

TLDR: NanoGPT Slowrun set a 5.5x “learn-more-from-less” record by squeezing extra performance from a small dataset using lots of compute. Comments split between praise, calls to credit BabyLM, and worries about memorizing the validation set, raising a big question: can tiny-data tricks really generalize? The answer matters well beyond chatbots.

Q Labs just flipped the script with NanoGPT Slowrun — an AI “slow cook” where the pot is near-infinite compute but the ingredients are tiny data. Train on 100 million tokens (think: a small reading list), and wring more learning from less. The baseline jumped from 2.4x to 5.5x data efficiency in days, thanks to tweaks like smarter shuffling each training pass, new ways to mix signals, a different activation “spark,” and teaming up models. Big promises follow: 10x soon, maybe 100x this year. The community rushed in with claps and caveats.

Fans cheered the vibe shift — “optimize depth, not speed” — while @lzaborowski celebrated flipping the usual constraint: make cheap compute do the hard work. But drama arrived fast. @suddenlybananas asked for credit to BabyLM, a prior small-data challenge, and the thread side-eyed whether this is new or just rebranded. The spiciest worry came from @archermarks: is this real learning or just memorization of the validation set? Cue memes of GPUs “slow-roasting” the same dataset, and jokes about dropout as the chef’s seasoning. The stakes feel real: if data is the bottleneck for robots and biolabs, data-efficient learning could be the next frontier — if it actually generalizes.

Key Points

  • Q Labs released NanoGPT Slowrun, a benchmark to optimize data-efficient language modeling on a fixed 100M-token FineWeb dataset with unlimited compute.
  • Initial experiments achieved ~2.4x data efficiency compared to modded-nanogpt, with Muon outperforming AdamW, SOAP, and MAGMA.
  • Multi-epoch training and aggressive regularization (weight decay up to 16x standard plus dropout) enabled scaling to large parameter counts.
  • Community contributions raised data efficiency to 5.5x via epoch-start shuffling, learned value embedding projections, switching to SwiGLU, and ensembling.
  • The project targets further gains (10x short-term, 100x longer-term) and invites exploration of second-order/natural gradient methods, diffusion models, curriculum learning, and evolutionary search.
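For readers curious about the activation swap mentioned above: SwiGLU replaces a plain feed-forward layer with a Swish-gated one. The sketch below is a minimal NumPy illustration of the general SwiGLU formulation — the function names and the use of plain matrices (no biases) are our simplifications, not the project's actual implementation.

```python
import numpy as np

def swish(x):
    # Swish/SiLU activation: x * sigmoid(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU feed-forward block: a Swish-gated branch multiplied
    # elementwise with a linear "up" branch, then down-projected.
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down
```

The gate lets the network learn, per dimension, how much of the linear branch to let through — empirically a stronger feed-forward unit than plain GELU/ReLU MLPs in many transformer variants.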
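Two of the community tweaks above — reshuffling sequence order at the start of each epoch, and ensembling — can be sketched in a few lines. This is an illustrative sketch only: the function names, the fixed-stride slicing, and probability-averaging as the ensembling rule are our assumptions, not the project's code.

```python
import random
import numpy as np

def epoch_sequences(token_ids, seq_len, epoch, seed=0):
    # Chop a fixed token stream into training sequences, then reshuffle
    # their order at the start of every epoch so repeated passes over
    # the same corpus present differently composed batches.
    # (Hypothetical loader; not the project's actual data pipeline.)
    seqs = [token_ids[i:i + seq_len]
            for i in range(0, len(token_ids) - seq_len + 1, seq_len)]
    random.Random(hash((seed, epoch))).shuffle(seqs)
    return seqs

def ensemble_next_token_probs(logit_rows):
    # Ensemble several models by averaging each model's softmax
    # distribution over the vocabulary for the next token.
    probs = []
    for z in logit_rows:
        z = np.asarray(z, dtype=float)
        e = np.exp(z - z.max())  # shift for numerical stability
        probs.append(e / e.sum())
    return np.mean(probs, axis=0)
```

Per-epoch reshuffling changes which examples share a batch (and thus the gradient noise) on every pass, which matters when the same data is seen many times; probability averaging is one common ensembling rule among several (logit averaging and weight averaging are alternatives).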

Hottest takes

"Reminds me a fair bit of the BabyLM challenge" — suddenlybananas
"how worried are you about over-training on this particular dataset?" — archermarks
"how much signal you can extract from the same dataset when compute is cheap" — lzaborowski
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.