April 13, 2026
Bots, bugs, and beef
N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?
GPT tops bug-hunt board as fans demand open source and Mythos, plus one wild hack story
TLDR: New N-Day-Bench results put GPT‑5.4 on top in a monthly bug-hunt of real software. Commenters clash over adding no‑bug cases to measure false alarms, bringing in open‑source models, and a wild SQL‑injection anecdote. The upshot: these AI bug‑finders look promising, but trust and transparency are the real fight.
Can chatbots actually sniff out real software bugs? The new N-Day-Bench says yes‑ish, with OpenAI’s GPT‑5.4 leading the latest run (83.93) over GLM‑5.1 and Claude Opus, while Google’s Gemini trails. The test uses real, already‑disclosed flaws added after each model’s training cutoff, updates monthly, and posts receipts via public traces and a leaderboard.
And oh boy, the comments: hype, side‑eye, and popcorn. One camp is demanding trust checks — mbbutler wants cases with no bugs to catch false alarms. Another is banging on the door for transparency, with spicyusername pushing to include open‑source models. A third asks where Anthropic’s hyped ‘Claude Mythos’ even is, as Rohinator waits on that showdown.
Others zeroed in on the setup’s vibe: a Curator, a Finder, and a Judge — users joked it reads like a reality‑show panel, with the Finder limited to 24 command‑line steps like a TV season budget. Still, every step is logged, which fans say keeps it honest. The top five? GPT‑5.4, GLM‑5.1, Claude Opus 4.6, Kimi K2.5, and Gemini 3.1 Pro Preview — cue the bragging rights.
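The Curator/Finder/Judge setup can be pictured as a simple loop: the Finder gets the vulnerable code (never the patch) and a hard cap of 24 shell commands, with every step logged for the public traces. Here is a minimal sketch of that step-budgeted loop; all names and structures are assumptions for illustration, not N-Day-Bench's actual harness.

```python
# Hypothetical sketch of the step-limited Finder loop described above.
# The real harness runs commands in a sandbox; this toy version only logs them.
from dataclasses import dataclass, field

MAX_SHELL_STEPS = 24  # the per-case command budget the thread jokes about


@dataclass
class Case:
    repo: str
    advisory_id: str  # the Finder sees the vulnerable tree, never the patch


@dataclass
class Trace:
    steps: list = field(default_factory=list)  # every command logged publicly


def run_finder(case: Case, propose_command, trace: Trace) -> str:
    """Run the Finder until it reports a bug or exhausts its 24-step budget."""
    for _ in range(MAX_SHELL_STEPS):
        command = propose_command(case, trace.steps)
        trace.steps.append(command)  # keep the public trace honest
        if command.startswith("REPORT:"):
            return command[len("REPORT:"):].strip()
    return ""  # budget exhausted with no finding


# Toy Finder: pokes around twice, then files a report.
def toy_finder(case, history):
    scripted = [
        "grep -rn 'query(' src/",
        "cat src/db.py",
        "REPORT: SQL injection in src/db.py",
    ]
    return scripted[min(len(history), len(scripted) - 1)]


trace = Trace()
finding = run_finder(Case("example/repo", "CVE-2026-0001"), toy_finder, trace)
```

A Judge (not sketched here) would then compare the Finder's report against the held-back advisory, which is what makes the blinding and the step cap meaningful.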
The spookiest moment? linzhangrun’s office tale: Gemini reportedly dug up a hidden database bug and pulled password hashes. That story lit up the thread — half impressed, half alarmed. Verdict from the crowd: scores are cool, but the real test is fewer false positives, more open models, and more sunlight from Winfunc Research.
Key Points
- N-Day-Bench evaluates LLMs on discovering real-world “N-day” vulnerabilities disclosed after each model’s knowledge cutoff.
- The benchmark uses a standardized harness with three agents (Curator, Finder, Judge) and prevents reward hacking; the Finder has 24 shell steps and is blinded to patches.
- The latest run (Apr 13, 2026) scanned 1,000 advisories, with 47 accepted cases and 953 skipped; all traces are publicly viewable.
- Leaderboard top 5 average scores: openai/gpt-5.4 (83.93), z-ai/glm-5.1 (80.13), anthropic/claude-opus-4.6 (79.95), moonshotai/kimi-k2.5 (77.18), google/gemini-3.1-pro-preview (68.50).
- The benchmark is adaptive with monthly test case updates and model checkpoint refreshes; project by Winfunc Research.
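For reference, ranking models by the published average scores is a one-liner; this sketch only uses the top-5 numbers from the leaderboard above (the per-case scoring rubric itself is not public here).

```python
# Sketch: ranking models by their published N-Day-Bench average scores.
# Scores are the leaderboard averages quoted above; nothing else is assumed.
leaderboard = {
    "openai/gpt-5.4": 83.93,
    "z-ai/glm-5.1": 80.13,
    "anthropic/claude-opus-4.6": 79.95,
    "moonshotai/kimi-k2.5": 77.18,
    "google/gemini-3.1-pro-preview": 68.50,
}

# Sort descending by average score to reproduce the published ordering.
ranked = sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True)
top_model, top_score = ranked[0]
```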