December 19, 2025
Ghosts, vibes, and benchmark beef
LLM Year in Review
Ghost brains, broken tests, and a brawl over ‘vibe coding’
TLDR: Big shift: labs embraced RLVR, training AIs to earn verifiable rewards and think longer, making models feel smarter while benchmarks looked suspect. Comments erupted over haunted training data, power concentration, and ‘vibe coding,’ with calls for UI‑building AIs and fairer, open tooling. Why it matters: this fight is over everyone’s digital future.
Andrej Karpathy says 2025’s LLM glow‑up came from RLVR (reinforcement learning with verifiable rewards), a training stage where AIs get scored on math and code puzzles with right‑or‑wrong answers. Labs diverted compute into longer thinking time, so similarly sized models suddenly show their work and feel smarter (think OpenAI’s o3, following o1; see DeepSeek R1). The catch? Benchmarks started looking cooked: if you train on test‑like tasks, you crush the test. His “we’re summoning ghosts, not raising animals” metaphor stuck: genius one minute, gullible the next.
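For the curious, “verifiable reward” is less exotic than it sounds. Here’s a minimal sketch in Python (purely illustrative, not any lab’s actual pipeline): the reward is a deterministic pass/fail check, an exact match for a math answer or a unit‑test run for code, rather than a human preference score.

```python
import subprocess
import sys
import tempfile

def math_reward(model_answer: str, gold_answer: str) -> float:
    # Binary and automatically checkable: the model either matched
    # the known solution or it didn't.
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

def code_reward(generated_code: str, unit_tests: str) -> float:
    # Run the model's code against the task's tests in a subprocess;
    # exit code 0 means every assertion passed.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

That binary signal is what the RL loop optimizes, and it cuts both ways: anything you can score this mechanically you can train on, which is exactly why benchmarks built from the same kinds of tasks stop being trustworthy.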
Commenters came in swinging. swyx blasted it onto X, while TheAceOfHearts demanded priorities: real UX, with AI that builds its own interfaces instead of walls of text. victorbuilds dropped the line of the week (“the call is coming from inside the dataset”) as bots reply to bots, haunting the training pool. thoughtpeddler pressed the uncomfortable stuff: power concentration, open‑source health, local hardware limits. And bgwalter lit the match on developer culture, roasting “vibe coding” as clout over craft. The result: a spicy split, with builders cheering RLVR’s reasoning streak and skeptics yelling that tests are broken, data is haunted, and the future needs better UIs and a fairer power map.
Key Points
- RLVR became a major new training stage in 2025, joining pretraining, SFT, and RLHF.
- RLVR trains LLMs against automatically verifiable rewards, yielding reasoning-like strategies.
- Labs shifted compute from pretraining into longer RLVR runs, squeezing more capability per dollar without bigger models.
- RLVR added a capability knob via test-time compute: longer reasoning traces and more “thinking time” (see the sketch after this list).
- Benchmark reliability dropped because many benchmarks are themselves verifiable, leaving them susceptible to RLVR-style training and synthetic data.
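The “capability knob” in the fourth point is, at heart, a budget parameter. A hypothetical sketch (`generate` here is a stand‑in for any LLM call that returns a reasoning trace plus a final answer, not a real API): spend more tokens per trace, sample more traces, and majority‑vote the answers.

```python
from collections import Counter
from typing import Callable, Tuple

def answer_with_budget(
    prompt: str,
    generate: Callable[..., Tuple[str, str]],  # stand-in: returns (trace, final_answer)
    think_tokens: int,
    samples: int = 1,
) -> str:
    # More test-time compute = longer reasoning traces and/or more
    # sampled attempts, then a majority vote over the final answers.
    finals = []
    for _ in range(samples):
        _trace, final = generate(prompt, max_tokens=think_tokens)
        finals.append(final)
    return Counter(finals).most_common(1)[0][0]

# Same model, two positions on the knob:
# fast = answer_with_budget(q, generate, think_tokens=512)
# slow = answer_with_budget(q, generate, think_tokens=8192, samples=16)
```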