March 4, 2026
Slow-cooking AI, fast-cooking comments
NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute
AI slow-cooks on tiny data, big GPUs; fans cheer, skeptics yell ‘memorization’
TLDR: NanoGPT Slowrun set a 5.5x “learn-more-from-less” record by squeezing extra performance from a small dataset using lots of compute. Comments split between praise, calls to credit BabyLM, and worries about memorizing the validation set—raising a big question that matters beyond chatbots: can tiny-data tricks really generalize?
Q Labs just flipped the script with NanoGPT Slowrun — an AI “slow cook” where the pot is near-infinite compute but the ingredients are tiny data. Train on 100 million tokens (think: a small reading list), and wring more learning from less. The baseline jumped from 2.4× to 5.5× data efficiency in days, thanks to tweaks like smarter shuffling each training pass, new ways to mix signals, a different activation “spark,” and teaming up models. Big promises follow: 10× soon, maybe 100× this year. The community rushed in with claps and caveats.
Fans cheered the vibe shift — “optimize depth, not speed” — while @lzaborowski celebrated flipping the usual constraint: make cheap compute do the hard work. But drama arrived fast. @suddenlybananas asked for credit to BabyLM, a prior small-data challenge, and the thread side-eyed whether this is new or just rebranded. The spiciest worry came from @archermarks: is this real learning or just memorization of the validation set? Cue memes of GPUs “slow-roasting” the same dataset, and jokes about dropout as the chef’s seasoning. The stakes feel real: if data is the bottleneck for robots and biolabs, data-efficient learning could be the next frontier — if it actually generalizes.
Key Points
- Q Labs released NanoGPT Slowrun, a benchmark for data-efficient language modeling on a fixed 100M-token FineWeb dataset with unlimited compute.
- Initial experiments achieved ~2.4x data efficiency compared to modded-nanogpt, with Muon outperforming AdamW, SOAP, and MAGMA.
- Multi-epoch training and aggressive regularization (weight decay up to 16x standard, plus dropout) enabled scaling to large parameter counts.
- Community contributions raised data efficiency to 5.5x via epoch-start shuffling, learned value-embedding projections, switching to SwiGLU, and ensembling.
- The project targets further gains (10x short-term, 100x longer-term) and invites exploration of second-order/natural-gradient methods, diffusion models, curriculum learning, and evolutionary search.
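The “smarter shuffling each training pass” from the key points amounts to re-randomizing the order of the fixed corpus at the start of every epoch, so repeated passes don’t replay the same sequence order. A minimal sketch (not the project’s actual code; the function name and pre-packed fixed-length sequences are assumptions):

```python
import random

def epoch_sequences(token_ids, seq_len, epoch, seed=0):
    # Hypothetical epoch-start shuffling: re-permute sequence order at the
    # start of each epoch so multi-epoch training over the same fixed
    # corpus sees a fresh ordering every pass.
    starts = list(range(0, len(token_ids) - seq_len + 1, seq_len))
    rng = random.Random(seed + epoch)  # distinct permutation per epoch
    rng.shuffle(starts)
    return [token_ids[s : s + seq_len] for s in starts]

toks = list(range(100))              # toy "corpus" of 100 token ids
epoch0 = epoch_sequences(toks, 10, epoch=0)
```

Seeding with `seed + epoch` keeps each epoch’s order reproducible while still differing across epochs.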
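The activation “spark” mentioned above is the switch to SwiGLU, a SiLU-gated MLP variant common in modern transformers. As a rough NumPy illustration (weight names and toy shapes are placeholders, not the benchmark’s code):

```python
import numpy as np

def swiglu_mlp(x, W_gate, W_up, W_down):
    # SwiGLU MLP: out = (SiLU(x @ W_gate) * (x @ W_up)) @ W_down
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ W_up)) @ W_down

# Toy shapes: 2 token vectors, model dim 4, hidden dim 8.
x = np.ones((2, 4))
out = swiglu_mlp(x, np.ones((4, 8)), np.ones((4, 8)), np.ones((8, 4)))
```

The gating term is what distinguishes SwiGLU from a plain GELU/ReLU MLP: half the hidden projection modulates the other half elementwise.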
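“Ensembling,” the last lever in the 5.5x recipe, typically means averaging the predictive distributions of several independently trained models. A hedged sketch, assuming each model emits next-token logits (averaging in probability space, then returning log probabilities):

```python
import numpy as np

def ensemble_log_probs(logits_list):
    # Average the predictive distributions of independently trained models.
    # Softmax each model's logits (with a max-shift for stability), mean
    # the probabilities, and return log probabilities.
    probs = []
    for logits in logits_list:
        z = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs.append(z / z.sum(axis=-1, keepdims=True))
    return np.log(np.mean(probs, axis=0))

# Two toy "models" with opposite preferences over a 2-token vocab.
lp = ensemble_log_probs([np.array([[2.0, 0.0]]), np.array([[0.0, 2.0]])])
```

Averaging probabilities (rather than logits) is one common choice; either way, the ensemble spends extra compute, which is exactly the resource this benchmark treats as free.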