June 24, 2026
Benchmarks, but make it messy
Why eval startups fail (2025)
The market is tiny, the hype is huge, and the comments are already roasting it
TLDR: The article says startups that sell AI model scorecards usually fail because the experts leave, the customer base is tiny, and the tests become easy to game. Commenters turned that into a roast, ranging from "What’s an eval?" confusion to debates over whether tool companies should actually be the winners.
A spicy new essay arguing that "eval startups"—companies trying to sell scorecards for which artificial intelligence model is best—almost always crash has unleashed exactly the kind of internet reaction you’d expect: confusion, cynicism, and a few gloriously petty dunks. The author’s case is blunt: the smartest people who can build these testing systems usually leave for flashier, richer jobs; the customer pool is weirdly narrow; and the minute a benchmark matters, big model makers start optimizing for it until the score stops meaning much. In plain English: the business dies because the talent leaves, the buyers are scarce, and the tests get gamed.
But the real show is in the comments. One user, theteapot, delivered the accidental mic-drop of the thread: “What’s an eval?” That one line practically became the unofficial mascot of the article’s argument—if people have to ask what the product is, maybe that’s the problem. Others went full philosopher, with bitlad shrugging, “Everything eventually fails,” as if the startup graveyard needed its own fortune cookie. Then came the sharper business takes: GL26 argued these companies are selling information that goes stale too slowly to be truly valuable, unlike fast-moving financial data. Still, not everyone was ready to bury the category. jdw64 pushed back, pointing out that in software history, the real money often went to the toolmakers, not the app builders. And then there was the driest laugh in the room: wseqyrku simply replying “Aha” to the claim that there just aren’t enough customers. Brutal, minimalist, iconic.
Key Points
- The article says independent AI eval startups have rarely succeeded, with safety evals described as the main exception.
- It argues that people capable of building strong evals often move into post-training or application development because those areas offer higher payoffs and more influence.
- The article states that creating good evals requires capabilities such as high-quality data collection through human feedback pipelines or synthetic data generation.
- It cites an example of three researchers leaving Epoch AI evaluation roles to start a company focused on post-training tools for agents.
- The article argues that the addressable customer base for eval startups is narrow because advanced developers can run their own evals, while less technical buyers usually want complete solutions instead of benchmark analysis.