May 26, 2026

Benchmarks, bots, and a comment brawl

DeepSWE: A contamination-free benchmark for long-horizon coding agents

A new AI coding test drops — and commenters instantly start side-eyeing the scores

TLDR: DeepSWE is a new test meant to more honestly measure how good AI coding tools are at real software tasks by using fresh, harder problems. Commenters immediately split between impressed and suspicious, debating whether the scores already look too high, too weird, or still miss code quality.

A shiny new test called DeepSWE just arrived promising to measure whether today’s artificial intelligence coding helpers can handle real software work, not just recycled homework. Its creators say the big deal is that the tasks are new, messy, and much harder to fake: short, natural prompts, bigger code changes, lots of different projects, and checks that look at whether the software actually works. Translation for non-engineers: this is supposed to be a more honest exam for coding bots.

But the real show was in the comments, where the crowd wasted zero seconds before turning it into a courtroom drama. One camp basically said, “Hold on: if top models are already around 70%, did this benchmark launch mostly solved?” That sparked immediate skepticism that the test could get “maxed out” too fast. Another mini-mystery had people squinting at the leaderboard and asking why one high-end model setting landed below a lighter setting from the same family, the kind of chart weirdness that sends benchmark-watchers into full conspiracy-board mode.

Meanwhile, some commenters gave the benchmark a standing ovation for matching their real-life pain: one user said it lines up with their experience of some assistants forgetting requirements and “reward hacking,” a phrase that sounds like cheating in broad daylight and more or less is: gaming the checks instead of doing the work. Even the praise came with side-eye, with people asking whether the test checks for clean, maintainable code or just whether the bot can squeak by. The funniest energy came from the classic internet combo of hype, suspicion, and founder-in-the-thread calm, with Datacurve’s cofounder popping in like a brave reality show contestant saying, basically, ask me anything.

Key Points

  • DeepSWE is introduced as a long-horizon coding-agent benchmark built around contamination-free, from-scratch tasks.
  • The benchmark contains 113 tasks across 91 active open-source repositories in TypeScript, Go, Python, JavaScript, and Rust.
  • The article says DeepSWE uses shorter, behavior-focused prompts but requires substantially larger solutions than SWE-bench Pro.
  • DeepSWE uses hand-written verifiers that test software behavior rather than implementation details.
  • The article claims an audit of SWE-bench Pro found verifier error rates of 8% false positives and 24% false negatives.
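
For the curious, here is a toy sketch of why "test behavior, not implementation details" matters for verifier error rates. It is not DeepSWE's actual verifier code: the slug task, every function name, and the test cases are made up for illustration, and it assumes "false positive" means a verifier passing a wrong solution and "false negative" means it rejecting a correct one.

```python
import re
from unittest import mock

# Hypothetical toy task: turn a title into a lowercase, hyphen-separated slug.
# Two correct solutions that are written differently, plus one broken one.
def slugify_regex(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def slugify_loop(title: str) -> str:
    words, current = [], []
    for ch in title.lower():
        if ch.isascii() and ch.isalnum():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return "-".join(words)

def slugify_broken(title: str) -> str:
    # Calls re.sub, but leaves punctuation in the output.
    return re.sub(r"\s+", "-", title.lower())

def behavior_verifier(solution) -> bool:
    """Checks only what the function returns for known inputs."""
    cases = {
        "Hello, World!": "hello-world",
        "  Deep  SWE  ": "deep-swe",
        "v2.0 beta": "v2-0-beta",
    }
    return all(solution(src) == expected for src, expected in cases.items())

def implementation_verifier(solution) -> bool:
    """Brittle check: passes only if the solution happens to call re.sub."""
    with mock.patch.object(re, "sub", wraps=re.sub) as spy:
        solution("Hello, World!")
        return spy.called

for fn in (slugify_regex, slugify_loop, slugify_broken):
    print(fn.__name__, behavior_verifier(fn), implementation_verifier(fn))
```

In this sketch the behavior check accepts both correct solutions and rejects the broken one, while the implementation-coupled check rejects the correct loop version (a false negative under the assumption above) and waves the broken one through (a false positive).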

Hottest takes

"70% at launch seems pretty saturated" — dnnssl2
"What happened" that put one Opus setting below another? — toastmaster11
"forgotten requirements and reward hacking" — vanuatu