March 19, 2026
Data Diet or Hype Riot?
NanoGPT Slowrun: 10x Data Efficiency with Infinite Compute
Fans call it a genius shortcut, skeptics say 'just generate more data'
TLDR: NanoGPT says it matched big-model results using one‑tenth the data by teaming up smaller models. Commenters split between cheering a compute‑first shortcut and insisting labs can just auto‑generate data; others mocked the math flair or dreamed of self‑training AI. Either way, the crowd agrees a faster, cheaper path would matter.
NanoGPT just claimed a radical “data diet”: a team of smaller models matches big‑model results using roughly one‑tenth the reading material. They trained eight 1.8B‑parameter models on 100M tokens (tiny chunks of text) and say it rivals a setup that usually needs 1B tokens. The trick? Ensembles (many models vote), chain distillation (each new model learns from the last), plus extra regularization so the group keeps improving even after individual models start overfitting.
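The “many models vote” part is usually done by averaging each member’s predicted probabilities and scoring the average. A minimal sketch in PyTorch, assuming each ensemble member is any callable that maps inputs to next-token logits (the post’s members are 1.8B-parameter GPTs; the tiny models here are just stand-ins):

```python
import torch
import torch.nn.functional as F

def ensemble_log_probs(models, inputs):
    """Average each member's next-token probability distribution,
    then take the log; the ensemble loss is the NLL of this average."""
    probs = torch.stack([F.softmax(m(inputs), dim=-1) for m in models])
    return torch.log(probs.mean(dim=0))

# Stand-in members: tiny linear "models" over an 11-word vocabulary.
vocab = 11
members = [torch.nn.Linear(4, vocab) for _ in range(3)]
x = torch.randn(2, 4)
log_p = ensemble_log_probs(members, x)  # shape (2, vocab), rows sum to 1 in prob space
```

Averaging in probability space rather than logit space is what lets the ensemble’s loss keep improving even while individual members overfit, which matches the 12-to-18-epoch observation in the key points below.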
The comments lit up. Skeptics like littlestymaar say the “data shortage” isn’t real because labs can just make synthetic data: generate more and better. Others wonder if the old Chinchilla playbook (how to balance model size and data) just got tossed. Fans counter that this is a compute‑first shortcut that could lower the gatekeeping power of massive datasets and let you keep pushing quality without hoarding more internet.
Then came the memes. abeppu roasted the math box for setting T=1.0 yet dressing it up like rocket science. andai asked the vibe‑check question: how many cats do humans need to see to learn “cat”? Futurists chimed in that an AI might soon train a better AI in a loop. For receipts, yorwba linked the prior HN thread. Hype vs side‑eye, served with popcorn.
Key Points
- NanoGPT Slowrun reports up to 10x data efficiency: an 18B-parameter ensemble trained on 100M tokens matches a baseline needing 1B tokens.
- The approach departs from Chinchilla scaling, which would suggest ~5M parameters for 100M tokens; the authors instead scale via ensembles.
- Longer training can worsen individual model loss but improve ensemble loss; extending from 12 to 18 epochs improved ensemble loss (3.185 → 3.166).
- Chain knowledge distillation, training each model from the immediately previous one (α=0.5, T=1.0), improved ensemble loss to 3.126 with 8 models (from 7x to 8x efficiency).
- Strong regularization (L2 weight decay up to 1.6 and dropout 0.1) is used to enhance generalization under limited data.
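The chain-distillation step above could look roughly like this: model k trains against the ground-truth labels and against model k−1’s softened predictions, blended by α. Only α=0.5 and T=1.0 come from the post; the rest is an illustrative sketch of standard knowledge distillation, not the authors’ exact code:

```python
import torch
import torch.nn.functional as F

def chain_distill_loss(student_logits, teacher_logits, targets, alpha=0.5, T=1.0):
    """Blend hard-label cross-entropy with KL divergence to the
    previous model's temperature-softened predictions."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard T^2 scaling; a no-op at T=1.0, hence the jokes
    return alpha * hard + (1 - alpha) * soft

torch.manual_seed(0)
s = torch.randn(4, 10)   # student logits for a batch of 4
t = torch.randn(4, 10)   # previous model's logits
y = torch.randint(0, 10, (4,))
loss = chain_distill_loss(s, t, y)
```

Note that dividing logits by T=1.0 changes nothing, which is exactly the “dressed up like rocket science” detail abeppu was mocking; the knob only matters if it is moved away from 1.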