December 19, 2025
Ghosts, vibes, and benchmark beef
LLM Year in Review
Ghost brains, broken tests, and a brawl over ‘vibe coding’
TLDR: Big shift: labs embraced RLVR, training AIs to earn verifiable rewards and think longer, making models feel smarter while benchmarks looked suspect. Comments erupted over haunted training data, power concentration, and ‘vibe coding,’ with calls for UI‑building AIs and fairer, open tooling. Why it matters: this fight is over everyone’s digital future.
Andrej Karpathy says 2025’s LLM glow‑up came from RLVR (reinforcement learning with verifiable rewards), a training stage where AIs get scored on math and code puzzles with right‑or‑wrong answers. Labs diverted compute into longer thinking time, so similarly sized models suddenly show their work and feel smarter (think OpenAI’s o3, following o1; see DeepSeek R1). The catch? Benchmarks started looking cooked: if you train on test‑like tasks, you crush the test. His “we’re summoning ghosts, not raising animals” metaphor stuck: genius one minute, gullible the next.
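For the curious, “verifiable reward” is less exotic than it sounds. Here’s a minimal sketch in Python (purely illustrative, not any lab’s actual pipeline): the reward is a deterministic pass/fail check, an exact match for a math answer or a unit‑test run for code, rather than a human preference score.

```python
import subprocess
import sys
import tempfile

def math_reward(model_answer: str, gold_answer: str) -> float:
    # Binary and automatically checkable: the model either matched
    # the known solution or it didn't.
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

def code_reward(generated_code: str, unit_tests: str) -> float:
    # Run the model's code against the task's tests in a subprocess;
    # exit code 0 means every assertion passed.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

That binary signal is what the RL loop optimizes, and it cuts both ways: anything you can score this mechanically you can train on, which is exactly why benchmarks built from the same kinds of tasks stop being trustworthy.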
Commenters came in swinging. swyx blasted it onto X, while TheAceOfHearts demanded priorities: real UX, with AI that builds its own interfaces instead of walls of text. victorbuilds dropped the line of the week (“the call is coming from inside the dataset”) as bots reply to bots, haunting the training pool. thoughtpeddler pressed the uncomfortable stuff: power concentration, open‑source health, local hardware limits. And bgwalter lit the match on developer culture, roasting “vibe coding” as clout over craft. The result: a spicy split, with builders cheering RLVR’s reasoning streak and skeptics yelling that tests are broken, data is haunted, and the future needs better UIs and a fairer power map.
Key Points
- RLVR became a major new training stage in 2025, joining pretraining, SFT, and RLHF.
- RLVR trains LLMs against automatically verifiable rewards, yielding reasoning-like strategies.
- Labs shifted compute from pretraining into longer RLVR runs, squeezing more capability per dollar without bigger models.
- RLVR added a capability knob via test-time compute: longer reasoning traces and more “thinking time” (see the sketch after this list).
- Benchmark reliability dropped because many benchmarks are themselves verifiable, leaving them susceptible to RLVR-style training and synthetic data.
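The “capability knob” in the fourth point is, at heart, a budget parameter. A hypothetical sketch (`generate` here is a stand‑in for any LLM call that returns a reasoning trace plus a final answer, not a real API): spend more tokens per trace, sample more traces, and majority‑vote the answers.

```python
from collections import Counter
from typing import Callable, Tuple

def answer_with_budget(
    prompt: str,
    generate: Callable[..., Tuple[str, str]],  # stand-in: returns (trace, final_answer)
    think_tokens: int,
    samples: int = 1,
) -> str:
    # More test-time compute = longer reasoning traces and/or more
    # sampled attempts, then a majority vote over the final answers.
    finals = []
    for _ in range(samples):
        _trace, final = generate(prompt, max_tokens=think_tokens)
        finals.append(final)
    return Counter(finals).most_common(1)[0][0]

# Same model, two positions on the knob:
# fast = answer_with_budget(q, generate, think_tokens=512)
# slow = answer_with_budget(q, generate, think_tokens=8192, samples=16)
```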