December 4, 2025
Tokens, tantrums, and tea
Contextualization Machines
Author says AI models are context machines; comments slam safety refusals and dig up a past 'no scaling walls' take
TLDR: Author says transformers are “context machines” that pre-load meaning and scale with bigger vocabularies. Comments clap back: one blasts surface-level parroting and safety refusals, another cites the author’s past “no scaling walls” take. Why it matters: it reframes AI strengths and blind spots for real-world use.
Forget the geek-speak: the author argues transformers — the tech behind chatbots — aren’t just “next word guessers,” they’re context machines that enrich each token (a chunk of text) with more meaning every step. Bigger vocabularies mean bigger pre-loaded chunks, and the claim riffs on the “Over-Tokenized Transformer” idea. It’s a fresh mental model meant to explain why these systems behave the way they do.
But the comments lit up. PaulHoule torched the feel‑good framing, saying current models only react to surface words and don’t do real moral reasoning, pointing out how they easily refuse requests like “build an atomic bomb” or “help me cheat in League of Legends,” while still missing deeper context. Another user, behnamoh, rolled up with receipts, noting the same author once argued there’d be no scaling walls and dropping a link as a mic‑drop: stochasm.blog/posts/scaling_post. Some readers called this “rebranding” of old ideas; others cheered the clearer explanation. Jokes flew about “more context, less conscience” and whether bigger token vocab just means more clever ways to say “no.” The vibe: fascinating idea, spicy skepticism, and a reminder that theory is cute until the model meets messy human questions.
Key Points
- The article frames transformers as contextualization machines, not just next-token predictors.
- In decoder-only transformers, the chain of residual connections is the backbone, and each layer adds contextualization through additive updates.
- Hidden states stay highly similar across layers, with the small differences representing added contextualization (shown via cosine similarity in Llama-3.2-1B; a minimal sketch of this check follows the list).
- The tokenizer and embedding matrix provide the initial contextualization; larger vocabularies yield more specific starting embeddings (see the second sketch below).
- The Over-Tokenized Transformers paper is cited to connect tokenization choices to model scaling and performance.
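For the curious, here's a minimal sketch (not code from the article) of the cross-layer similarity check mentioned above: load a small decoder-only model, grab its hidden states, and see how little each layer actually moves the residual stream. The checkpoint and prompt are illustrative assumptions; any causal LM that exposes hidden states works.

```python
# Minimal sketch: measure how similar hidden states are between adjacent layers,
# the signal the article reads as "layers add small contextualization deltas".
# Model name and prompt are illustrative assumptions, not taken from the article.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # assumed checkpoint; any decoder-only LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("Transformers are contextualization machines.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: (embedding output, layer 1, ..., layer N)
hidden = out.hidden_states
for i in range(len(hidden) - 1):
    a = hidden[i][0]        # (seq_len, d_model), batch dim dropped
    b = hidden[i + 1][0]
    sim = torch.nn.functional.cosine_similarity(a, b, dim=-1).mean().item()
    print(f"layer {i:2d} -> {i + 1:2d}: mean cosine similarity {sim:.3f}")
```

If the article's framing holds, most of those numbers land close to 1.0, which is the "residual backbone plus small additive updates" picture in code form.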
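And a second sketch of the "initial contextualization" step: the tokenizer carves text into vocabulary ids and the embedding matrix hands each id its starting vector, which is where a larger vocabulary can pre-load more specific chunks. Again, the checkpoint is an assumption, not something pulled from the article.

```python
# Minimal sketch: the tokenizer + embedding matrix give each token its starting
# ("pre-loaded") representation before any attention runs. Checkpoint is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ids = tokenizer("contextualization machines", return_tensors="pt").input_ids
start_vectors = model.get_input_embeddings()(ids)  # shape: (1, num_tokens, d_model)

# A bigger vocabulary tends to split text into fewer, more specific tokens,
# so each starting vector already carries more of the word's meaning.
print(tokenizer.convert_ids_to_tokens(ids[0].tolist()))
print("vocab size:", model.config.vocab_size, "| embedding shape:", tuple(start_vectors.shape))
```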