February 4, 2026
Benchmarks, bugs, and blowback
We built a real-world benchmark for AI code review
Qodo builds its own code test, crowns itself champ — and the comments go feral
TLDR: Qodo launched a public “real-world” test for AI code reviewers and says its tool came out on top. Commenters cried home‑field advantage, questioned overfitting, argued cost matters more than points, and even found a pricing‑page bug—turning a glossy launch into a debate about trust and value.
Qodo just dropped a “real-world” test for AI code reviewers by planting sneaky mistakes into actual merged pull requests (the code changes devs propose) — then said its own tool scored highest with a 60.1% F1 score. The benchmark is public on GitHub and aims to check both bug detection and best-practice rules, rather than the single-issue fixes measured by older tests such as SWE‑Bench.
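The post doesn’t spell out the math behind that score, but F1 is the standard harmonic mean of precision and recall. Here’s a minimal sketch of how such a score over injected defects could be computed; the issue IDs, the `injected`/`reported` sets, and the match-by-name scheme are hypothetical, not Qodo’s actual harness.

```python
# Minimal sketch of F1 scoring over injected defects. Each reviewer flag
# either matches a planted issue (true positive) or doesn't (false
# positive); the issue IDs and matching scheme here are hypothetical.

def f1_score(injected: set[str], reported: set[str]) -> float:
    """Harmonic mean of precision and recall over injected defects."""
    true_positives = len(injected & reported)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(reported)  # how many flags were real bugs
    recall = true_positives / len(injected)     # how many plants were caught
    return 2 * precision * recall / (precision + recall)

# Toy run: 5 planted issues, the reviewer raises 4 flags, 3 of them real.
injected = {"null-deref", "off-by-one", "race", "sql-injection", "leak"}
reported = {"null-deref", "off-by-one", "race", "style-nit"}
print(f"F1 = {f1_score(injected, reported):.3f}")  # precision 0.75, recall 0.60 -> 0.667
```

The harmonic mean punishes lopsided reviewers: flagging everything inflates recall but craters precision, which is why a single F1 number is a reasonable, if coarse, leaderboard metric.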
But the internet’s chorus? Pure side-eye. The top-voted vibe is “home-field advantage,” with users rolling their eyes at a company building a test and then winning it. Another flashpoint: overfitting — a user literally searched the blog for the word and found nada, sparking “are you training to ace your own quiz?” jokes. Then came the budget brigade: readers argued they don’t care about a 10-point bump if the price is higher, name-dropping cheaper rivals with bigger quotas. And in a painfully on-brand twist, someone spotted a bug on Qodo’s pricing page — the annual plan reportedly cost more than the monthly one, turning the thread into a live roast.
Not all snark, though: a few devs said LLMs (large language models) might not be great rule enforcers and suggested custom linters or multi-agent debates for code-quality feedback instead. Still, the crowd’s mood was clear: cool idea, show us the receipts — and fix your pricing page.
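To make the linter suggestion concrete (this is an illustration, not anything Qodo or the commenters ship), a deterministic best-practice rule can be a few lines on top of Python’s ast module:

```python
import ast

# Hypothetical custom lint rule: flag bare `except:` clauses, the kind of
# deterministic best-practice check commenters argued belongs in a linter
# rather than an LLM.

def find_bare_excepts(source: str, filename: str = "<string>") -> list[str]:
    tree = ast.parse(source, filename=filename)
    findings = []
    for node in ast.walk(tree):
        # A bare `except:` parses as an ExceptHandler with no exception type.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(f"{filename}:{node.lineno}: bare 'except:' hides errors")
    return findings

sample = """
try:
    risky()
except:
    pass
"""
for msg in find_bare_excepts(sample, "example.py"):
    print(msg)
```

The commenters’ point: checks like this are cheap, exact, and never hallucinate, so an LLM reviewer is arguably better judged on what linters can’t catch.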
Key Points
• Qodo released the Code Review Benchmark 1.0 to evaluate AI code review systems using injected defects in real, merged PRs.
• The benchmark measures both bug detection (correctness) and best-practice enforcement (code quality).
• The initial dataset includes 100 PRs with 580 injected issues from active open-source repositories.
• Qodo reports achieving a 60.1% F1 score, outperforming seven other AI code review platforms.
• Benchmark assets and evaluated reviews are publicly available via Qodo’s GitHub organization.