May 29, 2026
Patch me if you can
CVE-Bench: testing LLM agents on real-world vulnerability patches
AI promised bug-fixing magic, but commenters say the real mess is getting fixes shipped
TLDR: A new benchmark found today’s top AI models still can’t reliably fix real software security holes, with the best solving only about half. Commenters pounced on the hype, arguing that spotting bugs was never the hardest part. Shipping real fixes is, and that’s why this matters.
The big reveal from CVE-Bench is a real buzzkill for anyone hoping artificial intelligence was about to become the internet’s emergency repair crew. A new test put five top AI models through 20 real software security flaws, and the best performer fixed only about half of them overall. Even after a correction bumped scores up a little, the pecking order stayed the same: better than feared, maybe, but nowhere near “fire your security team” territory. And when the AI was given less hand-holding, it got shakier fast.
But the comments? That’s where the popcorn starts. The loudest reaction was basically: “You’re all arguing about the wrong thing.” The standout hot take came from david_shaw, who said finding flaws was never the main pain for most companies — actually getting fixes out the door is. That landed like a bucket of cold water on all the shiny AI hype, especially after earlier claims that some models could spot security problems better than human experts. Community mood swung hard toward “cool demo, now show me it works in real life.”
There’s also some delicious benchmark drama: five tests originally rejected valid fixes, the numbers got recalculated, and suddenly some statistical face-offs became officially significant. Translation for normal people: the scoreboard changed a bit, but not enough to crown a new king. The joke practically wrote itself: AI can maybe find the problem, commenters say, but the real vulnerability is tech marketing.
Key Points
- A 2026-05-28 correction updated five benchmark tests that had rejected valid alternative fixes; solve rates increased by 3 to 7 points per model, while model ranking stayed unchanged.
- CVE-Bench evaluates five frontier models on fixing 20 real-world CVEs across three prompt conditions: full advisory, behavioral description only, and file-plus-function location only.
- The best reported result was a 50% overall solve rate for gpt-5.5, rising to 60% under the full-advisory condition; no model reliably fixed real vulnerabilities.
- All models weakened in the location-only condition, and the article identifies repeatable failure modes including wrong-search drift, budget exhaustion, and partial fixes.
- The benchmark covers 15 CWE categories across 18 Python projects and excludes monorepos and fixes that require compiled-language changes alongside Python.
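For readers who want to see what those percentages mean in raw counts, here is a rough back-of-the-envelope sketch. It assumes (this is not stated in the summary) that the "overall" solve rate pools all 20 CVEs across the three prompt conditions with equal weight, one attempt each. The function name and parameters are made up for illustration and are not part of CVE-Bench.

```python
# Assumption: overall solve rate = solved attempts / (CVEs x conditions),
# with one attempt per CVE per condition. The benchmark may count differently.
N_CVES = 20
CONDITIONS = ("full_advisory", "behavior_only", "location_only")


def implied_remaining_rate(overall_rate, known_rate,
                           n_tasks=N_CVES, n_conditions=len(CONDITIONS)):
    """Given the overall rate and the rate for one condition, return
    (solves, attempts, rate) implied for the other conditions combined."""
    total_attempts = n_tasks * n_conditions            # 20 * 3 = 60
    total_solved = round(overall_rate * total_attempts)
    known_solved = round(known_rate * n_tasks)
    remaining_solved = total_solved - known_solved
    remaining_attempts = total_attempts - n_tasks
    return remaining_solved, remaining_attempts, remaining_solved / remaining_attempts


# Best reported model: 50% overall, 60% with the full advisory.
solved, attempts, rate = implied_remaining_rate(0.50, 0.60)
print(solved, attempts, rate)  # 18 40 0.45
```

Under that pooling assumption, a 50% overall rate with 60% on the full advisory would leave roughly 45% across the two leaner prompts combined, which fits the point that every model weakened with less context. The summary does not give the per-condition split, so treat this as arithmetic on the headline numbers, not a reported result.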