Browser Agent Benchmark: Comparing LLMs for web automation

Open browser-agent benchmark lands; fans ask "Where's Opus 4.5?"

TLDR: Browser Use released an open benchmark of 100 tough web tasks judged by GPT‑4o. Comments split between calling out a missing “Opus 4.5” and asking for practical fuzzy-testing tools, highlighting demands for fairness, completeness, and real-world usefulness when choosing AI that actually works.

Browser Use dropped an open benchmark to compare AI “browser agents”—bots that click, search, and complete tasks online—and the crowd immediately reached for popcorn. It’s a 100-task gauntlet across real sites and tricky interactions (think iframe puzzles and drag‑and‑drop), graded by GPT‑4o acting as the judge.

Strongest take? A commenter complained that the "best model," Opus 4.5, was missing from the lineup. Cue confusion and side-eyes over model naming and whether the list is complete. Others asked for something more practical: tools for exploratory, fuzzy testing they can actually use today, not just leaderboard flexes.

Drama flared over the idea of LLMs grading other LLMs: "AI marking its own homework" became the meme of the day, with "Who judges the judge?" jokes flying. Supporters countered that at 600,000 test runs a consistent automated judge is the only way to scale, and that synthetic test sites are too artificial to be representative.

The vibe: impressive work, spicy skepticism. People love the real-world focus but want broader coverage of popular models and a toolbox they can plug into their QA workflows. Verdict from the commentariat: great benchmark, now show the missing contenders and ship hands-on testing tools. Meanwhile, the thread kept riffing on "iframe inception" as if it were a boss-fight level, because of course the internet turned a benchmark into a meme.

Key Points

  • Browser Use released an open-source benchmark for LLM-driven browser agents based on 600,000+ internal test runs.
  • The benchmark includes 100 tasks: 20 each from WebBench, Mind2Web 2, GAIA, and BrowseComp, plus 20 custom interaction challenges.
  • Tasks that require authentication or make real changes to live websites are excluded so the benchmark can run at scale.
  • Task difficulty was curated through repeated runs across models and settings: too-easy and unreachable tasks were pruned, and the remaining tasks were verified as hard but possible.
  • An LLM judge was standardized and validated against 200 human-labeled traces; GPT-4o was selected as the most human-aligned judge (a minimal sketch of this validation step is shown below).
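
The judge-selection step in the last key point is essentially an agreement measurement: each candidate judge model scores the same 200 human-labeled traces, and the model whose pass/fail verdicts most often match the human labels wins. Below is a minimal sketch of that comparison, not the actual Browser Use harness; the `judge_verdict`-style callables and the trace format are hypothetical stand-ins for whatever the real prompts and data look like.

```python
from typing import Callable, Dict, List

# A "judge" here is any callable that reads a task trace and returns
# True (task solved) or False (task failed). In practice this would
# wrap an LLM call with the benchmark's judging prompt.
Judge = Callable[[dict], bool]

def agreement_rate(judge: Judge, traces: List[dict]) -> float:
    """Fraction of traces where the judge's verdict matches the human label."""
    matches = sum(1 for t in traces if judge(t) == t["human_label"])
    return matches / len(traces)

def pick_most_human_aligned(
    candidate_judges: Dict[str, Judge],
    labeled_traces: List[dict],
) -> str:
    """Return the name of the candidate judge that agrees most often
    with the human-labeled traces (200 of them in the benchmark's case)."""
    scores = {
        name: agreement_rate(judge, labeled_traces)
        for name, judge in candidate_judges.items()
    }
    return max(scores, key=scores.get)

# Hypothetical usage:
# best = pick_most_human_aligned({"gpt-4o": gpt4o_judge, "other": other_judge}, traces)
```

Per the release notes, GPT-4o came out on top of this kind of comparison, which is why it grades all 100 tasks in the published results.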

Hottest takes

"It’s lacking the best model (Opus 4.5)" — pixel_popping
"good AI-based tool for exploratory (fuzzy?) web testing?" — wiradikusuma
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.