Pretraining Language Models via Neural Cellular Automata

AI ‘homeschooled’ by pixel worlds, not words — and the comments are losing it

TLDR: A new study says warming up AI on synthetic “cellular automata” worlds makes it learn faster and finish better than warming up on real text, even with 10× more words. Commenters are split between “bias-lite reasoning” excitement and “just pattern compression” skepticism—with memes about AI homeschooled by Game of Life.

The internet is buzzing after researchers claim you can “pre‑pre‑train” chatbots on wild pixel worlds—Neural Cellular Automata (think Conway’s Game of Life with a brain)—and beat warming up on real text. With just 164M synthetic tokens in this first‑stage warm‑up, models later learned faster and finished stronger than ones primed on the C4 web‑text dataset—even when C4 got 10× more data. Gains carried into math and coding tests. Paper · Code.

But the comments are the main event. voxleone cheered the shift from word guessing to modeling how states evolve—structure over semantics. dzink went full manifesto: models could learn reasoning from fully synthetic universes, then only pick up meaning from a tiny, curated set—aka bias on a diet. The theory crowd showed up too: benob linked “universal pre‑training” ideas and begged for clean math; stanfordkid flexed homebrew experiments growing 3D fractal worlds into vision models (ViTs = vision transformers). Not everyone’s convinced; skeptics asked if this just teaches fancy compression tricks dressed up as “reasoning.”

Then came the memes: “AI raised by Tamagotchi,” “Game of Life > Book of Life,” and “synthetic homeschool” for robots. Drama level: high. If structure really beats words, we could decouple reasoning from the messy internet—and the culture wars that ride with it. For now, the flex is clear: at equal budget, synthetic chaos wins; even with 10× more words, it’s still quicker and 5% better, stirring equal parts hype and side‑eye.

Key Points

  • Pre-pre-training on neural cellular automata (NCA) sequences improves transformer language modeling versus from-scratch, C4, and Dyck baselines under equal 164M-token budgets.
  • NCA trajectories are tokenized into 2×2 patches and used for next-token prediction, encouraging in-context inference of latent transition rules.
  • The training pipeline has three stages: Stage 1, NCA pre-pre-training (164M tokens); Stage 2, natural-language pre-training (4–13B tokens across web, math, and code); Stage 3, task-specific instruction tuning (<1B tokens).
  • Perplexity improves across OpenWebText (−5.7%), OpenWebMath (−5.2%), and CodeParrot (−4.2%); reasoning benchmarks also improve (GSM8K +4.36%, HumanEval +7.49%, BigBench-Lite +26.51%).
  • Even with ~10× more C4 tokens (1.6B vs 164M), NCA converges 1.4× faster and achieves ~5% better final perplexity on OpenWebText.
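The patch-tokenization idea in the second bullet can be sketched in a few lines. The paper uses a trained neural cellular automaton; as a runnable stand-in, the sketch below rolls out Conway's Game of Life and maps each non-overlapping 2×2 patch of the binary grid to a 4-bit token id (0–15), concatenating frames into one sequence for next-token prediction. The function names, the Life rules, and the 4-bit token mapping are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def life_step(grid):
    """One Conway's Game of Life step (toy stand-in for a neural CA)."""
    # Count live neighbors with toroidal wrap-around.
    n = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(np.uint8)

def patch_tokens(grid):
    """Map each non-overlapping 2x2 patch of a binary grid to a token id in 0..15."""
    h, w = grid.shape
    patches = (grid.reshape(h // 2, 2, w // 2, 2)
                   .transpose(0, 2, 1, 3)   # -> (block_row, block_col, 2, 2)
                   .reshape(-1, 4))         # one row per flattened 2x2 patch
    # Read the 4 bits of each patch as an integer token.
    return patches @ np.array([8, 4, 2, 1])

def trajectory_to_sequence(grid, steps):
    """Roll the CA forward and concatenate per-frame patch tokens into one sequence."""
    tokens = []
    for _ in range(steps):
        tokens.extend(patch_tokens(grid).tolist())
        grid = life_step(grid)
    return tokens  # a model would predict tokens[1:] from tokens[:-1]

rng = np.random.default_rng(0)
start = (rng.random((8, 8)) < 0.5).astype(np.uint8)
seq = trajectory_to_sequence(start, steps=4)  # 16 patches/frame x 4 frames = 64 tokens
```

Predicting the next patch token in such a sequence forces the model to infer the (latent) transition rule in context, which is the mechanism the paper credits for the downstream gains.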

Hottest takes

shift learning from “predict tokens” to “model state evolution.” — voxleone
foundation models that acquire reasoning from fully synthetic data — dzink
I did a similar project but using 3D fractals — stanfordkid
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.