We built a real-world benchmark for AI code review

Qodo builds its own code test, crowns itself champ — and the comments go feral

TLDR: Qodo launched a public “real-world” test for AI code reviewers and says its tool came out on top. Commenters cried home‑field advantage, questioned overfitting, argued cost matters more than points, and even found a pricing‑page bug—turning a glossy launch into a debate about trust and value.

Qodo just dropped a “real-world” test for AI code reviewers, built by planting sneaky mistakes into actual merged pull requests (the code changes devs propose), then said its own tool scored highest with a 60.1% F1 score. The benchmark is public on GitHub and checks both bug detection and best-practice rules, rather than the single-task bug fixing measured by older benchmarks such as SWE-Bench.
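For the curious, here’s what “planting sneaky mistakes” might look like mechanically. This is a minimal sketch, assuming nothing about Qodo’s actual harness; every name in it (`InjectedIssue`, `inject_off_by_one`) is hypothetical, and the mutation shown (weakening a `<=` to `<`) is just one classic defect class.

```python
# Hypothetical sketch of defect injection for a code-review benchmark.
# These names do not come from Qodo's repo; they only illustrate the idea.
from dataclasses import dataclass

@dataclass
class InjectedIssue:
    file: str         # file the defect was planted in
    line: int         # 1-based line of the planted defect
    description: str  # what a good reviewer should flag

def inject_off_by_one(source: str, file: str) -> tuple[str, list[InjectedIssue]]:
    """Weaken the first `<=` comparison to `<` and record the ground truth."""
    issues: list[InjectedIssue] = []
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if "<=" in line:
            lines[i] = line.replace("<=", "<", 1)
            issues.append(InjectedIssue(file, i + 1, "off-by-one: `<=` weakened to `<`"))
            break
    return "\n".join(lines), issues

mutated, truth = inject_off_by_one(
    "for i in range(n):\n    if i <= limit:\n        work(i)\n", "loop.py"
)
print(truth)  # [InjectedIssue(file='loop.py', line=2, description=...)]
```

The key design point is that injection produces its own ground truth, so a reviewer’s findings can be scored automatically against the recorded issues.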

But the internet’s chorus? Pure side-eye. The top-voted vibe is “home-field advantage,” with users rolling their eyes at a company building a test and then winning it. Another flashpoint: overfitting — a user literally searched the blog for the word and found nada, sparking “are you training to ace your own quiz?” jokes. Then came the budget brigade: readers argued they don’t care about a 10-point bump if the price is higher, name-dropping cheaper rivals with bigger quotas. And in a painfully on-brand twist, someone spotted a bug on Qodo’s pricing page — the annual plan reportedly cost more than monthly, turning the thread into a live roast.

Not all snark, though: a few devs argued that LLMs (large language models) might not be great rule enforcers and suggested deterministic custom linters or multi-agent debate setups for enforcing code-quality rules instead; a sketch of the linter idea follows below. Still, the crowd’s mood was clear: cool idea, show us the receipts, and fix your pricing page.
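On the “custom linters” suggestion: the appeal is that a deterministic rule either fires or it doesn’t, no prompt required. Below is a minimal sketch of one such rule; the rule itself (flagging bare `except:` clauses) is an illustrative choice, not anything from the thread or from Qodo.

```python
import ast

def check_bare_except(source: str, filename: str = "<snippet>") -> list[str]:
    """Deterministic best-practice check: flag bare `except:` clauses,
    which swallow every exception including KeyboardInterrupt."""
    findings = []
    for node in ast.walk(ast.parse(source, filename)):
        # ExceptHandler.type is None exactly when the clause is a bare `except:`
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(f"{filename}:{node.lineno}: bare except hides real errors")
    return findings

print(check_bare_except("try:\n    risky()\nexcept:\n    pass\n"))
# ['<snippet>:3: bare except hides real errors']
```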

Key Points

  • Qodo released the Code Review Benchmark 1.0 to evaluate AI code review systems using injected defects in real, merged PRs.
  • The benchmark measures both bug detection (correctness) and best-practice enforcement (code quality).
  • The initial dataset includes 100 PRs with 580 injected issues from active open-source repositories.
  • Qodo reports achieving a 60.1% F1 score (F1 balances precision and recall; see the sketch after this list), outperforming seven other AI code review platforms.
  • Benchmark assets and evaluated reviews are publicly available via Qodo’s GitHub organization.
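Since the headline number is an F1 score, a quick unpacking may help: F1 is the harmonic mean of precision (the share of reported issues that were real) and recall (the share of the 580 injected issues that were caught). The tp/fp/fn split below is invented purely to reproduce the reported figure; only the 60.1% and the 580 total come from the post.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # how many reported issues were real
    recall = tp / (tp + fn)      # how many injected issues were caught
    return 2 * precision * recall / (precision + recall)

# Invented split: 349 caught + 231 missed = the 580 injected issues;
# the 232 false positives are chosen only so the math lands on ~0.601.
print(round(f1(tp=349, fp=232, fn=231), 3))  # 0.601
```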

Hottest takes

"Company creates a benchmark. Same company is best" — falloutx
"Cmd+F 'Overfitting'…nothing" — mbesto
"Your pricing page has a bug" — aetherspawn

Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.