We built a real-world benchmark for AI code review

Qodo builds its own code test, crowns itself champ — and the comments go feral

TLDR: Qodo launched a public “real-world” test for AI code reviewers and says its tool came out on top. Commenters cried home‑field advantage, questioned overfitting, argued cost matters more than points, and even found a pricing‑page bug—turning a glossy launch into a debate about trust and value.

Qodo just dropped a “real-world” test for AI code reviewers, built by planting sneaky mistakes into actual merged pull requests (the code changes devs propose), then said its own tool scored highest with a 60.1% F1 score. The benchmark is public on GitHub and checks both bug detection and best-practice rules, rather than the single-task bug fixing measured by older benchmarks such as SWE-Bench.
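For the curious, here’s what “planting sneaky mistakes” might look like mechanically. This is a minimal sketch, assuming nothing about Qodo’s actual harness; every name in it (`InjectedIssue`, `inject_off_by_one`) is hypothetical, and the mutation shown (weakening a `<=` to `<`) is just one classic defect class.

```python
# Hypothetical sketch of defect injection for a code-review benchmark.
# These names do not come from Qodo's repo; they only illustrate the idea.
from dataclasses import dataclass

@dataclass
class InjectedIssue:
    file: str         # file the defect was planted in
    line: int         # 1-based line of the planted defect
    description: str  # what a good reviewer should flag

def inject_off_by_one(source: str, file: str) -> tuple[str, list[InjectedIssue]]:
    """Weaken the first `<=` comparison to `<` and record the ground truth."""
    issues: list[InjectedIssue] = []
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if "<=" in line:
            lines[i] = line.replace("<=", "<", 1)
            issues.append(InjectedIssue(file, i + 1, "off-by-one: `<=` weakened to `<`"))
            break
    return "\n".join(lines), issues

mutated, truth = inject_off_by_one(
    "for i in range(n):\n    if i <= limit:\n        work(i)\n", "loop.py"
)
print(truth)  # [InjectedIssue(file='loop.py', line=2, description=...)]
```

The key design point is that injection produces its own ground truth, so a reviewer’s findings can be scored automatically against the recorded issues.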

But the internet’s chorus? Pure side-eye. The top-voted vibe is “home-field advantage,” with users rolling their eyes at a company building a test and then winning it. Another flashpoint: overfitting — a user literally searched the blog for the word and found nada, sparking “are you training to ace your own quiz?” jokes. Then came the budget brigade: readers argued they don’t care about a 10-point bump if the price is higher, name-dropping cheaper rivals with bigger quotas. And in a painfully on-brand twist, someone spotted a bug on Qodo’s pricing page — the annual plan reportedly cost more than monthly, turning the thread into a live roast.

Not all snark, though: a few devs argued that LLMs (large language models) might not be great rule enforcers and suggested deterministic custom linters or multi-agent debate setups for enforcing code-quality rules instead; a sketch of the linter idea follows below. Still, the crowd’s mood was clear: cool idea, show us the receipts, and fix your pricing page.
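On the “custom linters” suggestion: the appeal is that a deterministic rule either fires or it doesn’t, no prompt required. Below is a minimal sketch of one such rule; the rule itself (flagging bare `except:` clauses) is an illustrative choice, not anything from the thread or from Qodo.

```python
import ast

def check_bare_except(source: str, filename: str = "<snippet>") -> list[str]:
    """Deterministic best-practice check: flag bare `except:` clauses,
    which swallow every exception including KeyboardInterrupt."""
    findings = []
    for node in ast.walk(ast.parse(source, filename)):
        # ExceptHandler.type is None exactly when the clause is a bare `except:`
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(f"{filename}:{node.lineno}: bare except hides real errors")
    return findings

print(check_bare_except("try:\n    risky()\nexcept:\n    pass\n"))
# ['<snippet>:3: bare except hides real errors']
```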

Key Points

  • Qodo released the Code Review Benchmark 1.0 to evaluate AI code review systems using injected defects in real, merged PRs.
  • The benchmark measures both bug detection (correctness) and best-practice enforcement (code quality).
  • The initial dataset includes 100 PRs with 580 injected issues from active open-source repositories.
  • Qodo reports achieving a 60.1% F1 score (F1 balances precision and recall; see the sketch after this list), outperforming seven other AI code review platforms.
  • Benchmark assets and evaluated reviews are publicly available via Qodo’s GitHub organization.
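Since the headline number is an F1 score, a quick unpacking may help: F1 is the harmonic mean of precision (the share of reported issues that were real) and recall (the share of the 580 injected issues that were caught). The tp/fp/fn split below is invented purely to reproduce the reported figure; only the 60.1% and the 580 total come from the post.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # how many reported issues were real
    recall = tp / (tp + fn)      # how many injected issues were caught
    return 2 * precision * recall / (precision + recall)

# Invented split: 349 caught + 231 missed = the 580 injected issues;
# the 232 false positives are chosen only so the math lands on ~0.601.
print(round(f1(tp=349, fp=232, fn=231), 3))  # 0.601
```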

Hottest takes

"Company creates a benchmark. Same company is best" — falloutx
"Cmd+F 'Overfitting'…nothing" — mbesto
"Your pricing page has a bug" — aetherspawn

Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.