Agent Reading Test

AI bots take a reading test—commenters want receipts

TLDR: A new benchmark scores how well AI coders read real websites, and early shared results range from 11/20 to 14/20. Commenters are split between calling for harsher, weighted scoring and warning the test may punish common agent designs, while others just want a public leaderboard.

The “Agent Reading Test” just dropped, and the internet instantly turned it into a sport. This new benchmark claims to check if coding AIs can actually read real docs—through messy web tricks like hidden text, tabbed pages, and single‑page apps that load blank shells. It plants canary tokens (think: Easter eggs) and asks agents to do real tasks before revealing what they saw. Max score? 20. The vibe? Prove your bot can read.
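To see why hidden text trips agents up, here's a minimal sketch (not from the actual benchmark; the page markup and token names are invented for illustration): a naive HTML-to-text scraper collects every text node, including tokens hidden with CSS or the `hidden` attribute, while a rendering/screenshot-based agent would never see them.

```python
from html.parser import HTMLParser

# Hypothetical docs page: one canary in plain text, one hidden via CSS,
# one inside a non-default tab panel. Tokens are illustrative only.
PAGE = """
<html><body>
  <p>To install, run the setup script. CANARY-PLAIN-001</p>
  <span style="display:none">CANARY-HIDDEN-002</span>
  <div class="tab" hidden>CANARY-TAB-003</div>
</body></html>
"""

class TextExtractor(HTMLParser):
    """Naive extractor: collects ALL text nodes, ignoring CSS and the
    `hidden` attribute. A renderer-based agent would miss the hidden ones."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed(PAGE)
text = " ".join(extractor.chunks)
found = [t for t in ("CANARY-PLAIN-001", "CANARY-HIDDEN-002", "CANARY-TAB-003")
         if t in text]
print(found)  # all three tokens appear in the raw DOM text
```

The gap between what's in the DOM and what's painted on screen is exactly the failure mode the benchmark probes: two agents can "read" the same URL and see different documents.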

Commenters are already keeping score. One user posted a 14/20 for “Claude Web Opus 4.6 Extended,” canaries and all, then begged for a “TL;DR mode” so the lazy can spectate without running the test. Another shared an 11/20 for a Qwen setup and admitted they ran it on “low effort,” triggering a chorus of “raise the difficulty!” But the hottest flame comes from folks who argue the test may unfairly flunk the agents people actually use: many browse via sub-agents and cheap summarizer models, so they could bomb this test without ever seeing the real page. On the flip side, one critic wants negative weights and even an inverse score to punish the most common failures harder. Meanwhile, others just want a public leaderboard, a design explainer, and the GitHub receipts: github.com/agent-ecosystem/agent-reading-test.

Key Points

  • Agent Reading Test benchmarks how well AI coding agents read and interpret documentation websites.
  • It targets failure modes like truncated content, CSS-obscured text, client-side rendering gaps, and tabbed content serialization.
  • Tests embed canary tokens and assign realistic tasks; tokens are revealed to the agent only after task completion.
  • Scoring is out of 20: 1 point per canary token found and 1 point per correct qualitative answer; perfect scores are unlikely.
  • The project complements the Agent-Friendly Documentation Spec (22 checks across 8 categories) and is open-sourced on GitHub.
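The scoring rule above is simple enough to sketch. Assuming an even 10/10 split between canaries and qualitative answers (the article gives only the 20-point total, so the split is a guess):

```python
def score_run(canaries_found: int, correct_answers: int,
              max_canaries: int = 10, max_answers: int = 10) -> int:
    """1 point per canary token found + 1 point per correct qualitative
    answer, each component clamped to its maximum. Top score: 20."""
    return min(canaries_found, max_canaries) + min(correct_answers, max_answers)

print(score_run(7, 7))  # → 14, e.g. the Claude run shared by commenters
print(score_run(6, 5))  # → 11, e.g. the Qwen run
```

The negative-weight proposal in the comments would replace the flat 1-point-per-item rule with per-failure penalties, so common failure modes cost more than rare ones.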

Hottest takes

"The tests should have negative weights... And whole test inverse score" — dostick
"90% of agents that people actually use today would fail this test inadvertently" — theyCallMeSwift
"11/20 for qwen/qwen3.5-flash-02-23 in Claude Code" — numeri
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.