March 18, 2026
Leaderboards or lie-derboards?
Book: The Emerging Science of Machine Learning Benchmarks
AI’s report cards made it famous—and the comments are on fire
TLDR: A new book says AI’s “report cards” drove huge progress—even if they’re messy—and the internet erupts over whether scoreboards help science or just invite gaming. Fans cheer the clarity; critics decry bias, test‑prep vibes, and stock‑pumping numbers, making this a must‑watch fight over how we measure AI’s power.
Moritz Hardt just dropped a book arguing that AI’s obsession with scoreboards—called benchmarks—was the “mistake that made machine learning,” and tech forums are buzzing. Fans are hyped to hear a grown‑up finally say it out loud: leaderboards like ImageNet and college‑quiz sets like MMLU turned AI into a sport, complete with victory laps in earnings calls and stock jitters when one lab beats another. One top‑liked reaction simply stans the author: “If Moritz Hardt writes it, I read it.”
But the crowd splits fast. The scoreboard lovers say, “No points, no progress”: benchmarks coordinated effort and lit a fire under the field. The skeptics clap back: benchmarks invite Goodhart’s law (when a measure becomes the target, it stops measuring anything real), plus bias baked into datasets and underpaid annotators doing the dirty work. Meme‑lords pile on with “AI is standardized testing with GPUs,” “leaderboards or lie‑derboards,” and throwback jokes about the dog‑breed Olympics that crowned the 2010s. Ethicists chime in: benchmarks don’t just steer research, they steer society. Pragmatists ask whether we should lock test sets in a vault or keep them open to everyone. Meanwhile, the hype police beg CEOs to stop flexing multiple‑choice scores like an IQ test for robots. Verdict? The internet can’t decide if benchmarks are the hero, the villain, or the messy reality keeping the AI show on the road.
Key Points
- The article introduces a book that sets out to explain why machine learning benchmarks work despite their known shortcomings.
- It surveys the standard critiques of benchmarks: metric gaming, overfitting to dataset artifacts, narrow research incentives, poor transfer to real tasks, and ethical issues including bias and annotation labor.
- Benchmarks have driven ML progress, exemplified by ImageNet in the 2010s and language-model benchmarks like MMLU taking on geopolitical significance.
- Recent benchmark results, such as DeepSeek’s R1 outperforming OpenAI’s o1 on reasoning tasks, have triggered market reactions, underscoring benchmarks’ influence.
- The book’s first part covers foundational material: the classical statistical guarantees of the holdout and cross-validation methods, and why those guarantees often fail in practice once the test set is reused adaptively (see the sketch below).
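For readers who want the mechanics behind that last point: for a single, fixed model, a holdout set of n points estimates its true accuracy to within roughly 1/√n, but that classical guarantee assumes the model was chosen without peeking at the holdout. Reuse the holdout adaptively and the guarantee collapses. Here is a minimal toy demo (my own Python sketch, not from the book; the sizes n and T are made-up parameters) in which a pure-noise “competitor” climbs a leaderboard simply by resubmitting and keeping its lucky guesses:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000   # holdout-set size (hypothetical)
T = 1000   # number of adaptive leaderboard submissions (hypothetical)

# Ground-truth labels are fair coin flips, so no model can truly
# do better than 50% accuracy on fresh data.
y = rng.integers(0, 2, size=n)

# Adaptive attack: submit random predictions, peek at the holdout
# score, and keep only the "lucky" submissions that beat 50%.
kept = []
for _ in range(T):
    guess = rng.integers(0, 2, size=n)   # a model that is pure noise
    if (guess == y).mean() > 0.5:        # adaptive reuse of the holdout
        kept.append(guess)

# Majority vote over the lucky submissions.
ensemble = (np.mean(kept, axis=0) > 0.5).astype(int)

print(f"holdout accuracy:    {(ensemble == y).mean():.3f}")  # ~0.65
y_fresh = rng.integers(0, 2, size=n)  # labels the attacker never saw
print(f"fresh-data accuracy: {(ensemble == y_fresh).mean():.3f}")  # ~0.50
```

The inflated holdout number is pure selection bias, and a fresh test set exposes it. Vaulting the test set, as the pragmatists above suggest, is one remedy; the adaptivity analysis in the book’s first part is the theory of why you might need it.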