March 11, 2026
Pass the test, fail the merge
Many SWE-bench-Passing PRs would not be merged
Bots ace the tests, flunk the vibe check — maintainers say “not in my repo”
TLDR: A new METR study says roughly half of AI code fixes that pass a benchmark wouldn’t be accepted by real maintainers, meaning scores may overstate usefulness. Commenters split between “benchmarks are grade inflation” and “progress is real,” with jokes about “AI archaeology” and gripes about “completely unusable” models.
The new study from METR says the quiet part out loud: about half of AI-written code fixes that "pass the test" on SWE-bench wouldn't actually be merged by real project maintainers. The automated grader looks generous; human reviewers, not so much. Maintainers approved AI patches at a rate roughly 24 percentage points below the automated grader's, and progress on "what humans would accept" looks slower than progress on "what tests will pass." METR stresses this isn't a death sentence for AI devs; the agents didn't get a chance to revise like a human would. Still, the community lit up.
Skeptics pounced. One dev said flashy benchmark darlings turned out “completely unusable” on their machine, and another joked, “Is this AI archaeology?” as the report covers mid-2024 to 2025 agents. A sober critique emerged too: as one commenter put it, tests “miss the things that are hard to encode”—like matching the team’s style, not breaking other parts, and respecting the project’s unwritten rules. Translation: passing the test isn’t passing the vibe check.
But the hype isn’t dead. Optimists pointed out the curve still points up—AI keeps getting better, caveats and all. Meanwhile, the most savage roast was also the shortest: a lone “Nevermind,” capturing the whiplash between “benchmarks look great!” and “maintainers say nope.” The verdict? Benchmarks are a useful thermometer—but they’re not the taste test.
Key Points
- Four maintainers from three SWE-bench Verified repositories reviewed 296 AI-generated PRs from mid-2024 to mid/late-2025.
- Maintainer decisions were calibrated using 47 human "golden" PRs that had been merged; the golden baseline acceptance rate was 68%.
- On average, maintainer merge approvals were about 24 percentage points lower than automated grader pass rates.
- The measured improvement rate for maintainer merges was 9.6 percentage points per year slower than for automated grader outcomes, though this evidence is less robust.
- Roughly half of test-passing SWE-bench Verified PRs would not be merged; METR cautions against interpreting benchmark scores as direct measures of real-world usefulness.