April 26, 2026
Benchmarks on blast
Why SWE-bench Verified no longer measures frontier coding capabilities
OpenAI bins a flawed coding test; commenters yell 'goalposts' and 'duh'
TLDR: OpenAI says its go-to coding test is tainted by bad questions and training leaks, so it’s dropping the scores and building cleaner exams while pointing to SWE-bench Pro. Commenters are split between “goalpost move,” “obviously,” and “try a rival test,” highlighting how shaky AI report cards can be.
OpenAI just told the class the exam was broken. The company says its widely cited coding benchmark, SWE-bench Verified (a test set for code-fixing tasks), no longer shows real skill because of two messes: bad tests that reject correct answers, and models being trained on the very problems they're graded on. After big early gains, scores have only crept up lately (from 74.9% to 80.9%), so OpenAI is dropping the metric, promising fresh, clean evaluations, and steering folks to SWE-bench Pro. The move is framed as part of its Preparedness Framework, and yes, this all lives on the OpenAI blog.
Cue the comment section meltdown. The top vibe: "moving the goalposts." One user rolled their eyes at the training-data reveal with a crisp "No shit, Sherlock!", while another stared at the audit stats and basically asked, "Wait, a quarter of the Q&A was wrong this whole time?!" The plot-twist cameo: "Terminal Bench is the future," a not-so-subtle pitch for a rival test. Meanwhile, a side-quest drama erupted over the site's forced auto-translation: "codage de pointe" (the French rendering of "frontier coding") got roasted as sounding like a perfume ad. Memes flew about "Reddit math" on the 59.4% of a 27.6% slice, and the room split between "good on them for fessing up" and "convenient timing, huh?" Either way, the internet is keeping score, even if the test isn't.
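For anyone doing that "Reddit math" at home: the audit covered 27.6% of the benchmark's tasks, and at least 59.4% of that slice had broken tests, so the confirmed-flawed share of the full set works out to at least roughly 16%. Here's a minimal back-of-the-envelope sketch; the two percentages come from OpenAI's post, while the variable names are ours:

```python
# Back-of-the-envelope check of OpenAI's audit numbers.
audited_share = 0.276      # fraction of SWE-bench Verified tasks in the audited "difficult" subset
flawed_in_audit = 0.594    # fraction of that subset whose tests rejected correct solutions

# Lower bound on flawed tasks across the whole benchmark: only the audited
# slice is counted, so the true figure could well be higher.
lower_bound = audited_share * flawed_in_audit
print(f"At least {lower_bound:.1%} of all tasks have tests that reject correct fixes")
# -> At least 16.4% of all tasks have tests that reject correct fixes
```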
Key Points
- OpenAI found SWE-bench Verified is compromised by flawed tests and training contamination, making it unsuitable for measuring frontier coding capabilities.
- In an audit of a 27.6% subset of difficult tasks, at least 59.4% had tests that rejected functionally correct solutions.
- All tested frontier models showed evidence of prior exposure to some benchmark problems or solutions, including reproducing gold patches or verbatim details.
- Performance gains on Verified may reflect training exposure rather than real-world coding ability; OpenAI has stopped reporting these scores and advises others to do the same.
- OpenAI is building new, uncontaminated evaluations and recommends reporting results on SWE-bench Pro; the post also covers the original SWE-bench (2023) and its limitations.