Show HN: Mdarena – Benchmark your Claude.md against your own PRs

Dev world’s new obsession: pitting your CLAUDE.md against your own past code

TL;DR: mdarena lets teams test whether their Claude instructions help or hurt by replaying real past pull requests; early results suggest short, targeted guides beat giant manifestos, with a ~27% bump. Commenters cheered “data over vibes” while skeptics warned about overfitting, token costs, and security. Drama included.

Meet mdarena, the DIY arena where developers throw their instruction files into a fight with their own past pull requests—and let automated tests decide the winner. The creator says he built it because research on these “CLAUDE.md” instruction files is all over the place. The comments? They turned it into a spectacle, chanting “data over vibes” while dunking on bloated docs.

The headline result lighting up the thread: a tidy, per-folder set of tips beat a long, consolidated mega-file, passing about 27% more tests than the no-guide baseline. Cue memes about a “Markdown Hunger Games” and warnings that “length isn’t wisdom.” One camp is thrilled: finally, a way to prove whether your guidance actually helps. Another camp is side-eyeing the whole premise: “Are we just overfitting to yesterday’s bugs?” Others nitpick the method: should the bot have to “discover” the file, or should it be force-fed into the prompt? Meanwhile, the token-cost drama (those files can add 20%+ to usage) and the security caveat (“only run repos you trust,” shell commands and all) sparked nervous laughter.

Bonus chatter: folks loved that it mirrors the SWE-bench style—real tests, not “vibes” judging—while joking that this might finally make teams write tests. One zinger summed it up: “Stop writing novels for robots; give them cheat sheets.”

Key Points

  • mdarena benchmarks CLAUDE.md files by mining merged PRs from a repository and grading agent patches using real tests where available.
  • The pipeline includes mining PRs, running baseline vs. context conditions, and reporting test pass/fail, overlap, token/cost, and statistical significance.
  • It auto-detects test commands from CI and package files (e.g., .github/workflows, package.json, pyproject.toml, Cargo.toml, go.mod) with manual overrides available.
  • For isolation, it uses history-free checkouts (git archive) to prevent future-commit leakage; since the tool executes repository code, the author advises running it only on repos you trust.
  • In a production monorepo evaluation (20 PRs, Claude Opus 4.6), an existing CLAUDE.md improved test resolution by ~27% over baseline, while a consolidated guidance file performed no better than none.

Hottest takes

“Docs that read like a fantasy novel? Turns out they make the bot dumber” — anonymous
“Overfitting your AI to your greatest hits repo is not ‘science’” — skeptical_dev
“Two markdowns enter, one leaves—welcome to MD Thunderdome” — memelord42
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.