March 12, 2026
Plateau or placebo?
Are LLMs not getting better?
Study says AI coding stalled; commenters yell “you skipped the new models”
TLDR: A deep-dive claims the rate at which maintainers would approve AI-written code hasn’t climbed since early 2025—a flat line fits the data better than an upward trend. Comments erupt: critics say the dataset skips the newest models, others report real-world gains, and many settle on a middle ground—a big leap last year, then a recent plateau that matters for developers.
A stats-heavy post claims the AIs that write code haven’t gotten better at shipping code humans would actually approve. The author looked at “merge rates” (how often maintainers accept AI-written changes) from the METR study and says a flat line fits the data better than a slow climb—meaning: after a jump in late 2024, everything from 2025 onward looks like a plateau. Translation: passing tests isn’t the same as code a real maintainer would merge, and by that tougher bar, the line stays flat.
Cue comment-section fireworks. mike_hearn says real life tells a different story: he’s “editing their work much less” using GPT 5.4. boonzeet and others argue the sample is tiny and conveniently skips the newest big names—Opus 4.5/4.6, Sonnet 4.6, and Google’s Gemini—so how can it prove anything? raincole goes full meme: “No Gemini. No Opus 4.5. No GPT codex,” calling the post “ragebait.” reedf1 dubs the omissions “insane.” On the flip side, Flavius threads the needle: last year brought big gains, but the last three months feel like a wall.
The vibe? Plateau vs. placebo. Some say the graph is gospel; others say it’s cherry-picking an old playlist while the party’s moved on. Jokes about “step functions” and “graph wars” fly, but the core split remains: trust the chart, or trust your keyboard and ship logs.
Key Points
- The article analyzes METR’s LLM coding benchmarks under two criteria: test-passing versus maintainer-approved (mergeable) code.
- It notes a reported shift in the 50% success horizon from 50 minutes down to 8 minutes, while emphasizing poorer performance under the stricter merge criterion.
- The author disputes METR’s slight upward merge-rate trend, proposing a step change late in 2024 followed by a plateau from early 2025 onward.
- Under leave-one-out cross-validation scored with Brier scores, the step function outperforms a linear trend; a constant model performs best overall.
- The author concludes that LLM merge rates—and thus practical programming ability—have shown no improvement for over a year.
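The model comparison in the last two bullets can be sketched in a few lines: fit a constant, a linear, and a step-function trend to binary merge outcomes, then score each with leave-one-out cross-validation and the Brier score. The data and helper names below are invented for illustration—this is not METR’s dataset, and with a clean step built into the toy data the step model wins here by construction, whereas the article reports the constant model scoring best on the real data.

```python
# Toy sketch of the article's model comparison: three candidate trends
# fit to (month, merged?) pairs, scored by LOO-CV + Brier score.

def fit_constant(train):
    """Predict the overall merge rate, ignoring time."""
    rate = sum(y for _, y in train) / len(train)
    return lambda t: rate

def fit_linear(train):
    """Least-squares line through (month, outcome), clipped to [0, 1]."""
    n = len(train)
    mt = sum(t for t, _ in train) / n
    my = sum(y for _, y in train) / n
    num = sum((t - mt) * (y - my) for t, y in train)
    den = sum((t - mt) ** 2 for t, _ in train)
    slope = num / den if den else 0.0
    return lambda t: min(1.0, max(0.0, my + slope * (t - mt)))

def fit_step(cutoff):
    """One merge rate before the cutoff month, another from it onward."""
    def fit(train):
        lo = [y for t, y in train if t < cutoff]
        hi = [y for t, y in train if t >= cutoff]
        p_lo = sum(lo) / len(lo) if lo else 0.5
        p_hi = sum(hi) / len(hi) if hi else 0.5
        return lambda t: p_lo if t < cutoff else p_hi
    return fit

def loo_brier(data, fit):
    """Mean Brier score (squared error of the predicted probability)
    under leave-one-out cross-validation; lower is better."""
    total = 0.0
    for i, (t, y) in enumerate(data):
        train = data[:i] + data[i + 1:]  # refit without point i
        total += (fit(train)(t) - y) ** 2
    return total / len(data)

# Invented data: mostly rejections before month 12, mostly merges after,
# with one contrarian point on each side.
data = ([(m, 0) for m in range(12)] + [(5, 1)]
        + [(m, 1) for m in range(12, 24)] + [(18, 0)])

scores = {
    "constant": loo_brier(data, fit_constant),
    "linear": loo_brier(data, fit_linear),
    "step@12": loo_brier(data, fit_step(12)),
}
for name, s in scores.items():
    print(f"{name:9s} {s:.3f}")
```

The point of the exercise is that LOO-CV penalizes a model for flexibility it can’t justify: on data with a genuine jump, the step function earns its extra parameter, while a line is forced to smear the jump across the whole range.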