Show HN: I benchmarked LLM agents on fixing real-world security vulnerabilities

AI bug-fixing showdown lands with a correction — and commenters instantly start throwing tomatoes

TLDR: A new benchmark found that even the best AI model fixed only half of the real software security holes it was given, which suggests these tools are still far from reliable. But the loudest reaction wasn’t awe. It was irritation, with commenters blasting the post as overhyped benchmark fluff instead of useful insight.

A flashy new Show HN post tried to answer a very scary question in plain English: can today’s smartest AI tools actually fix real security holes in software before hackers exploit them? The creator built a test called CVE-Bench using 20 real-world flaws from popular Python projects, then sent five major models in to patch them under three levels of guidance. The headline result is not exactly a victory lap: even the top model fixed only 50% of the problems overall, and it peaked at 60% when it was given the full security advisory. In other words, these bots still need a lot of hand-holding.

But the real popcorn moment came from the mix of benchmarks, bragging rights, and a correction notice. The author updated the post after discovering that five security tests had unfairly rejected valid fixes; correcting them raised solve rates by 3 to 7 points. The rankings stayed the same, but that small statistical glow-up only made the comments feel spicier: was this a careful scientific effort, or yet another AI scorecard dressed up as insight?

That’s where commenter KyleTheDev kicked the door open. Instead of debating the numbers, he went straight for the style, accusing the post of being bloated, “slop,” and disrespectful to readers who just wanted the useful lessons. That hot take captures the mood perfectly: some readers love the ambition of testing AI on real security problems, while others are rolling their eyes at the benchmark theater and begging for less hype, fewer charts, and more substance. In classic HN fashion, the benchmark wasn’t the only thing being stress-tested — so was everyone’s patience.

Key Points

  • The article introduces CVE-Bench, a benchmark that evaluates five frontier LLMs on fixing 20 real-world security vulnerabilities across three prompt conditions.
  • A correction updated five flawed security tests; solve rates rose by 3 to 7 points, rankings stayed unchanged, and some cross-family comparisons became statistically significant at α = 0.05.
  • The best reported model result was gpt-5.5 with a 50% overall solve rate and 60% under the full-advisory prompt condition.
  • The benchmark found repeatable failure modes including wrong-search drift, budget exhaustion, and partial fixes, and observed up to 4× token-cost variation for similar outcomes.
  • Task curation covered 15 CWE categories across 18 Python projects and excluded large monorepos and fixes requiring compiled languages such as Rust, C, or C++.
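
For readers who want to see how numbers like these fit together, here is a minimal sketch of the arithmetic behind a solve rate and a significance check at α = 0.05. It rests on several assumptions that are not in the post: one run per task per prompt condition (so 20 tasks × 3 conditions = 60 attempts per model), a made-up comparison model that solves 15 of 60, and Fisher's exact test as the significance test. The summary does not say which test the author used, and the helper names are hypothetical.

```python
from math import comb


def solve_rate(solved: int, attempts: int) -> float:
    """Fraction of attempts that produced an accepted fix."""
    return solved / attempts


def fisher_exact_two_sided(a_solved: int, a_total: int,
                           b_solved: int, b_total: int) -> float:
    """Two-sided Fisher exact p-value for a 2x2 solved/unsolved table.

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one.
    """
    solved = a_solved + b_solved
    denom = comb(a_total + b_total, solved)

    def prob(k: int) -> float:
        # Probability that model A accounts for k of the solved attempts.
        return comb(a_total, k) * comb(b_total, solved - k) / denom

    observed = prob(a_solved)
    lo = max(0, solved - b_total)
    hi = min(a_total, solved)
    # Small relative tolerance guards against float ties.
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= observed * (1 + 1e-9))


# ASSUMPTION: 60 attempts per model (20 tasks x 3 prompt conditions).
top_overall = solve_rate(30, 60)        # 50% overall, as reported
top_full_advisory = solve_rate(12, 20)  # 60% under the full advisory

# HYPOTHETICAL comparison model solving 15 of 60; not a reported result.
p_value = fisher_exact_two_sided(30, 60, 15, 60)
significant = p_value < 0.05
```

Because every model attempts the same 20 tasks, a paired test such as McNemar's could be a better fit than Fisher's exact test; the sketch only illustrates how a gap in solve rates is checked against α = 0.05.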

Hottest takes

"generate slop" — KyleTheDev
"I really wish people would stop doing this" — KyleTheDev
"it just feels insulting" — KyleTheDev