Different Language Models Learn Similar Number Representations

Number sense: born in bots or taught by data? Commenters clash

TLDR: Researchers found that many language AIs encode numbers with patterns that repeat every 2, 5, and 10, but only some learn clean, usable number buckets. Comments split: is it data or design, is the headline overhyped, and does this hint at universal "instincts" or just Benford-flavored training leftovers?

The paper says many AI language systems (Transformers, RNNs, even old-school word embeddings) end up representing numbers in similar patterns that repeat every 2, 5, and 10. But here’s the gossip: only some models carve those patterns cleanly enough to sort numbers into tidy buckets, like “even vs odd.” The authors say tidy geometry needs more than repeating patterns; it depends on the training recipe (data, architecture, optimizer, tokenizer) and can arise via two routes: real-world co-occurrence signals or practicing multi-token addition.

The comments went full reality TV. One camp cheers, with ACCount37 doing victory laps for the “platonic representation” crowd: finally, a universal number vibe! Another camp slams the headline, with causal calling it “editorialized” and waving the accuracy flag. Curious minds (gn_central) push the core drama: is this nature (architecture) or nurture (data)? Big-picture philosophers (dboreham) say this is how “instincts” emerge everywhere. And matja drops a curveball: maybe it just looks like Benford’s Law, the rule that the leading digit 1 shows up far more often than 9 in real-world data. The memes? “LLMs counting on their toes,” “mod math cult,” and “press F for single-token addition,” because only multi-token drills seem to teach clean number lines. Hot, nerdy, chaotic: just how we like it. And yes, it really matters.
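For the curious, matja’s Benford aside boils down to a one-line formula: the leading digit d of many real-world quantities shows up with probability log10(1 + 1/d). A quick check of our own (not from the thread, just the textbook formula):

```python
# Benford's Law: leading digit d appears with probability log10(1 + 1/d),
# so 1 leads roughly 6.5x more often than 9 in many real-world datasets.
import math

for d in range(1, 10):
    print(d, round(math.log10(1 + 1 / d), 3))
# prints 0.301 for d=1 down to 0.046 for d=9
```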

Key Points

  • Language models trained on natural text learn periodic number features with dominant periods at T=2, 5, and 10.
  • Many architectures (Transformers, Linear RNNs, LSTMs, classical word embeddings) show period-T spikes in the Fourier domain.
  • Only some models develop geometrically separable features enabling linear classification of numbers modulo T (see the probe sketch after this list).
  • The authors prove Fourier-domain sparsity is necessary but not sufficient for mod-T geometric separability.
  • Feature separability depends on data, architecture, optimizer, and tokenizer, and can arise from co-occurrence signals or multi-token addition tasks (but not single-token addition).
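To ground the Fourier-spike and mod-T probe claims, here’s a minimal, hypothetical sketch (ours, not the paper’s code): it plants periodic signals in synthetic stand-in embeddings, checks for spectral spikes at periods 2, 5, and 10, and fits a linear parity probe. All names, shapes, and the fake data are assumptions for illustration; a real analysis would use the model’s own number embeddings instead.

```python
# Hypothetical sketch: detect periodic number features via FFT and test
# mod-2 (parity) separability with a linear probe. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, D = 200, 64  # numbers 0..199, embedding dim 64 (assumed)

# Stand-in embeddings with planted periodic signals at T = 10, 2, and 5.
# A real run would replace `emb` with the model's embeddings for number tokens.
emb = rng.normal(size=(N, D))
emb[:, 0] += 3 * np.cos(2 * np.pi * np.arange(N) / 10)  # period 10
emb[:, 1] += 3 * np.cos(np.pi * np.arange(N))            # period 2
emb[:, 2] += 3 * np.cos(2 * np.pi * np.arange(N) / 5)    # period 5

# 1) Periodicity: FFT each dimension along the number axis and look for
#    spectral spikes at frequencies 1/2, 1/5, 1/10 (periods T = 2, 5, 10).
spectrum = np.abs(np.fft.rfft(emb - emb.mean(axis=0), axis=0)).mean(axis=1)
freqs = np.fft.rfftfreq(N)
for T in (2, 5, 10):
    idx = np.argmin(np.abs(freqs - 1.0 / T))
    print(f"period {T}: spectral power {spectrum[idx]:.2f}")

# 2) Separability: can a linear probe read off n mod 2 from the embedding?
labels = np.arange(N) % 2
probe = LogisticRegression(max_iter=1000).fit(emb[:150], labels[:150])
print("parity probe accuracy:", probe.score(emb[150:], labels[150:]))
```

Strong spikes at 1/2, 1/5, and 1/10 plus high probe accuracy would match the “periodic and separable” case; spikes without probe accuracy would illustrate why the authors argue Fourier sparsity alone isn’t sufficient.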

Hottest takes

"The "platonic representation hypothesis" crowd can't stop winning." — ACCount37
"It's going to turn out that emergent states that are the same or similar in different learning systems fed roughly the same training data will be very common." — dboreham
"Title is editorialized and needs to be fixed" — causal