Show HN: Cua-Bench – a benchmark for AI agents in GUI environments

HN loves the idea of AI that clicks around—then asks, “Where are the results?”

TLDR: Cua-Bench is an open-source benchmark for AI agents that drive real GUIs (screens, clicks, keystrokes), paired with fast macOS/Linux virtual machines on Apple Silicon. The thread loved the idea but demanded published results, asked how it handles flaky popups and timing, and one commenter even offered a rival benchmark for a potential team-up. Proof over promises matters here.

Hacker News lit up over Cua-Bench, a new open-source test suite for AI agents that can literally use a computer—see the screen, click buttons, and get stuff done. The pitch: build and benchmark “computer-use” bots with Cua, then spin up fast macOS/Linux virtual machines via Lume. The crowd loved the ambition, but the comments quickly turned into a reality check.

The strongest vibe? Show us receipts. One top reply praised the "cool project" energy but dinged the launch for publishing no model scores, leaving the thread hungry for a scoreboard. Meanwhile, integration drama bubbled up: another dev swooped in with a competing benchmark of "200 web tasks", pointing to UiPath's benchmark and an arXiv writeup and offering a mashup. Cue the crossover episode.

Then came the messy reality brigade: testers asked how Cua-Bench handles the chaos of real apps—loading spinners, surprise popups, animations that sometimes appear, sometimes don’t. One user praised the “trajectory export” feature (collecting training data while you evaluate), but wanted clarity on variance and what the mysteriously named “Windows Arena” actually measures. Jokes flew that it sounds like Mortal Kombat for File Explorer—“Finish him, Popup!”
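
On the non-determinism question, here is a minimal sketch of one way a harness can report flakiness instead of a single pass/fail: run each task several times and publish a pass rate with a spread. This is plain Python with no Cua-Bench APIs; run_task below is a hypothetical stand-in for whatever launches an agent on a task.

```python
import statistics
from typing import Callable

def score_with_variance(run_task: Callable[[str], bool], task_id: str, trials: int = 5) -> dict:
    """Run one GUI task several times and report pass rate plus spread.

    run_task is assumed to return True on success; popups, spinners,
    and animation timing make a single run an unreliable signal.
    """
    outcomes = [1.0 if run_task(task_id) else 0.0 for _ in range(trials)]
    return {
        "task": task_id,
        "trials": trials,
        "pass_rate": statistics.mean(outcomes),
        # Sample standard deviation as a rough flakiness signal (0.0 when fully deterministic).
        "stdev": statistics.stdev(outcomes) if trials > 1 else 0.0,
    }

# Demo with a fake flaky task: "succeeds" only when a popup happens not to appear.
if __name__ == "__main__":
    import random

    def flaky(task_id: str) -> bool:  # hypothetical agent run
        return random.random() > 0.3

    print(score_with_variance(flaky, "rename-file-in-explorer"))
```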

Bottom line: the community’s into the vision and loves the tooling, but they want numbers, clearer definitions, and proof these bots can survive the wild world of flaky GUIs. Until then, it’s hype vs. homework.

Key Points

  • Cua is an open-source platform to build, benchmark, and deploy GUI-using AI agents with isolated, self-hostable sandboxes (Docker, QEMU, Apple Virtualization).
  • Cua-Bench provides standardized evaluation on datasets like OSWorld, ScreenSpot, and Windows Arena, supports custom tasks, and can export trajectories for training/RL (a rough sketch of that export idea follows this list).
  • Lume manages macOS/Linux VMs on Apple Silicon using Apple’s Virtualization.Framework, delivering near-native performance for CI/CD, testing, and agent workloads.
  • The platform includes modular packages: cua-agent, cua-computer, cua-computer-server, cua-bench, lume, and lumier (Docker-compatible interface for Lume VMs).
  • The project is MIT-licensed and references third-party components (Kasm, OmniParser; optional ultralytics under AGPL-3.0) with documentation, blog, Discord, and GitHub resources.
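
The trajectory-export bullet is the piece most worth picturing. Below is a rough sketch of the idea only, not Cua-Bench's actual schema (the field names and JSONL layout are assumptions): each step an agent takes is logged as an observation/action pair, and the finished episode is appended to a file so evaluation runs double as training data.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class Step:
    screenshot: str  # path to the screen capture seen before acting
    action: str      # e.g. "click", "type", "scroll" (illustrative vocabulary)
    args: dict       # coordinates, text to type, etc.

@dataclass
class Trajectory:
    task_id: str
    steps: list[Step]
    success: bool    # final verdict from the task's checker

def export_trajectory(traj: Trajectory, out_dir: Path) -> Path:
    """Append one episode as a JSON line so eval runs accumulate a dataset."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "trajectories.jsonl"
    with out_file.open("a") as f:
        f.write(json.dumps(asdict(traj)) + "\n")
    return out_file

# Usage: log a two-step episode from a (hypothetical) agent run.
episode = Trajectory(
    task_id="osworld-open-display-settings",
    steps=[
        Step("shots/000.png", "click", {"x": 512, "y": 384}),
        Step("shots/001.png", "type", {"text": "display"}),
    ],
    success=True,
)
export_trajectory(episode, Path("runs/demo"))
```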

Hottest takes

"I made a CUA benchmark too, 200 web tasks" — visarga
"how the benchmarks handle non-determinism" — augusteo
"lack of any actual benchmark results on existing models/agents is disappointing" — rfw300
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.