Different Language Models Learn Similar Number Representations

Number sense: born in bots or taught by data? Commenters clash

TLDR: Researchers found that many language AIs encode numbers with patterns that repeat every 2, 5, and 10, but only some learn clean, usable number buckets. Comments split: is it data or design, is the headline overhyped, and does this hint at universal "instincts" or just Benford-flavored training leftovers?

The paper says many AI language systems (Transformers, RNNs, even old-school word embeddings) end up representing numbers in similar patterns that repeat every 2, 5, and 10. But here’s the gossip: only some models carve those patterns cleanly enough to sort numbers into tidy buckets, like “even vs odd.” The authors say tidy geometry needs more than repeating patterns; it depends on the training recipe (data, architecture, optimizer, tokenizer) and can arise via two routes: real-world co-occurrence signals or practicing multi-token addition.

The comments went full reality TV. One camp cheers, with ACCount37 doing victory laps for the “platonic representation” crowd: finally, a universal number vibe! Another camp slams the headline, with causal calling it “editorialized” and waving the accuracy flag. Curious minds (gn_central) push the core drama: is this nature (architecture) or nurture (data)? Big-picture philosophers (dboreham) say this is how “instincts” emerge everywhere. And matja drops a curveball: maybe it just looks like Benford’s Law, the rule that the leading digit 1 shows up far more often than 9 in real-world data. The memes? “LLMs counting on their toes,” “mod math cult,” and “press F for single-token addition,” because only multi-token drills seem to teach clean number lines. Hot, nerdy, chaotic: just how we like it. And yes, it really matters.
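For the curious, matja’s Benford aside boils down to a one-line formula: the leading digit d of many real-world quantities shows up with probability log10(1 + 1/d). A quick check of our own (not from the thread, just the textbook formula):

```python
# Benford's Law: leading digit d appears with probability log10(1 + 1/d),
# so 1 leads roughly 6.5x more often than 9 in many real-world datasets.
import math

for d in range(1, 10):
    print(d, round(math.log10(1 + 1 / d), 3))
# prints 0.301 for d=1 down to 0.046 for d=9
```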

Key Points

  • Language models trained on natural text learn periodic number features with dominant periods at T=2, 5, and 10.
  • Many architectures (Transformers, Linear RNNs, LSTMs, classical word embeddings) show period-T spikes in the Fourier domain.
  • Only some models develop geometrically separable features enabling linear classification of numbers modulo T (see the probe sketch after this list).
  • The authors prove Fourier-domain sparsity is necessary but not sufficient for mod-T geometric separability.
  • Feature separability depends on data, architecture, optimizer, and tokenizer, and can arise from co-occurrence signals or multi-token addition tasks (but not single-token addition).
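To ground the Fourier-spike and mod-T probe claims, here’s a minimal, hypothetical sketch (ours, not the paper’s code): it plants periodic signals in synthetic stand-in embeddings, checks for spectral spikes at periods 2, 5, and 10, and fits a linear parity probe. All names, shapes, and the fake data are assumptions for illustration; a real analysis would use the model’s own number embeddings instead.

```python
# Hypothetical sketch: detect periodic number features via FFT and test
# mod-2 (parity) separability with a linear probe. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, D = 200, 64  # numbers 0..199, embedding dim 64 (assumed)

# Stand-in embeddings with planted periodic signals at T = 10, 2, and 5.
# A real run would replace `emb` with the model's embeddings for number tokens.
emb = rng.normal(size=(N, D))
emb[:, 0] += 3 * np.cos(2 * np.pi * np.arange(N) / 10)  # period 10
emb[:, 1] += 3 * np.cos(np.pi * np.arange(N))            # period 2
emb[:, 2] += 3 * np.cos(2 * np.pi * np.arange(N) / 5)    # period 5

# 1) Periodicity: FFT each dimension along the number axis and look for
#    spectral spikes at frequencies 1/2, 1/5, 1/10 (periods T = 2, 5, 10).
spectrum = np.abs(np.fft.rfft(emb - emb.mean(axis=0), axis=0)).mean(axis=1)
freqs = np.fft.rfftfreq(N)
for T in (2, 5, 10):
    idx = np.argmin(np.abs(freqs - 1.0 / T))
    print(f"period {T}: spectral power {spectrum[idx]:.2f}")

# 2) Separability: can a linear probe read off n mod 2 from the embedding?
labels = np.arange(N) % 2
probe = LogisticRegression(max_iter=1000).fit(emb[:150], labels[:150])
print("parity probe accuracy:", probe.score(emb[150:], labels[150:]))
```

Strong spikes at 1/2, 1/5, and 1/10 plus high probe accuracy would match the “periodic and separable” case; spikes without probe accuracy would illustrate why the authors argue Fourier sparsity alone isn’t sufficient.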

Hottest takes

"The "platonic representation hypothesis" crowd can't stop winning." — ACCount37
"It's going to turn out that emergent states that are the same or similar in different learning systems fed roughly the same training data will be very common." — dboreham
"Title is editorialized and needs to be fixed" — causal