April 26, 2026
Benchmarks on blast
Why SWE-bench Verified no longer measures frontier coding capabilities
OpenAI bins a flawed coding test; commenters yell 'goalposts' and 'duh'
TLDR: OpenAI says its go-to coding test is tainted by bad questions and training leaks, so it’s dropping the scores and building cleaner exams while pointing to SWE-bench Pro. Commenters are split between “goalpost move,” “obviously,” and “try a rival test,” highlighting how shaky AI report cards can be.
OpenAI just told the class the exam was broken. The company says its widely cited coding benchmark, SWE-bench Verified (a test set for code-fixing tasks), no longer shows real skill because of two messes: bad tests that reject correct answers, and models being trained on the very problems they're graded on. After big early gains, scores have only crept up lately (from 74.9% to 80.9%), so OpenAI is dropping the metric, promising fresh, clean evaluations, and steering folks to SWE-bench Pro. The move is framed as part of its Preparedness Framework, and yes, this all lives on the OpenAI blog.
Cue the comment section meltdown. The top vibe: "moving the goalposts." One user rolled their eyes at the training-data reveal with a crisp "No shit, Sherlock!", while another stared at the audit stats and basically asked, "Wait, a quarter of the Q&A was wrong this whole time?!" The plot-twist cameo: "Terminal Bench is the future," a not-so-subtle pitch for a rival test. Meanwhile, a side-quest drama erupted over the site's forced auto-translation: "codage de pointe" (the French rendering of "frontier coding") got roasted as sounding like a perfume ad. Memes flew about "Reddit math" on the 59.4% of a 27.6% slice, and the room split between "good on them for fessing up" and "convenient timing, huh?" Either way, the internet is keeping score, even if the test isn't.
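For anyone doing that "Reddit math" at home: the audit covered 27.6% of the benchmark's tasks, and at least 59.4% of that slice had broken tests, so the confirmed-flawed share of the full set works out to at least roughly 16%. Here's a minimal back-of-the-envelope sketch; the two percentages come from OpenAI's post, while the variable names are ours:

```python
# Back-of-the-envelope check of OpenAI's audit numbers.
audited_share = 0.276      # fraction of SWE-bench Verified tasks in the audited "difficult" subset
flawed_in_audit = 0.594    # fraction of that subset whose tests rejected correct solutions

# Lower bound on flawed tasks across the whole benchmark: only the audited
# slice is counted, so the true figure could well be higher.
lower_bound = audited_share * flawed_in_audit
print(f"At least {lower_bound:.1%} of all tasks have tests that reject correct fixes")
# -> At least 16.4% of all tasks have tests that reject correct fixes
```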
Key Points
- OpenAI found SWE-bench Verified is compromised by flawed tests and training contamination, making it unsuitable for measuring frontier coding capabilities.
- In an audit of a 27.6% subset of difficult tasks, at least 59.4% had tests that rejected functionally correct solutions.
- All tested frontier models showed evidence of prior exposure to some benchmark problems or solutions, including reproducing gold patches or verbatim details.
- Performance gains on Verified may reflect training exposure rather than real-world coding ability; OpenAI has stopped reporting these scores and advises others to do the same.
- OpenAI is building new, uncontaminated evaluations and recommends reporting results on SWE-bench Pro; the post also covers the original SWE-bench (2023) and its limitations.