February 25, 2026
Inbox vs Calendar: Cage Match
PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks
AI PAs get a multi‑tab reality check — fans cheer, skeptics want messier drama
TLDR: PA Bench tests AI assistants on realistic multi-app chores inside simulated email and calendar apps. Builders cheer the structured, verifiable tasks, while skeptics demand messy real-world chaos and routing across providers when one AI fails. It's a big step beyond toy demos, but the crowd wants proof it works outside the lab.
Meet PA Bench, the new trial-by-fire for AI personal assistants that have to juggle email and calendar like a frazzled human. Instead of toy tasks, these agents must read airline confirmation emails, then block the right slots in a calendar—inside realistic simulations with verifiable results. The crowd loved the promise: finally, a test that isn’t just “add to cart.” Cue the Inbox vs Calendar memes and a flood of “my AI intern would still double-book me” jokes. You can peek at the project here.
Then the drama kicked in. Founder vibes rolled in hot as shahules announced he’s building tools to “automate high-quality evals” for these agents, drawing both collab energy and eye-rolls: is this the start of an eval arms race? Meanwhile, abhijithneil dropped a spicy idea—why not route across different AI providers so one agent picks up when another faceplants (think OpenAI permissions punted to Gemini)? Fans called it the Avengers of agents, skeptics asked if simulations can ever match the real-world chaos of permissions, pop-ups, and flaky web apps. The split is sharp: builders want structured, measurable progress; skeptics want mess, bugs, and “please confirm your login” screens. Either way, PA Bench just made multi-tab competence the new headline test—and the comments are demanding receipts.
Key Points
- PA Bench is a benchmark for evaluating computer-use agents on multi-application, long-horizon personal assistant workflows.
- Tasks require coordinated interaction with simulated email and calendar web applications under controlled, deterministic conditions.
- Simulations enable reproducible and verifiable evaluations, with backend state accessible to a verifier via structured JSON.
- The benchmark uses a task-centric design, implementing features based on dataset tasks and focusing on write operations.
- Data coherence is ensured by generating base world states (personas, contacts, timelines) from which emails and calendar events are derived.
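To make the verification idea concrete, here is a minimal sketch of what checking backend state via structured JSON could look like. The field names (`calendar`, `events`, `title`, `start`) and the `verify_event_created` helper are illustrative assumptions, not PA Bench's actual schema or API:

```python
import json

# Hypothetical snapshot of the simulated backend state, in the spirit of
# the structured JSON a verifier would inspect (field names are invented).
snapshot = json.loads("""
{
  "calendar": {
    "events": [
      {"title": "Flight AA123", "start": "2026-03-02T09:15", "end": "2026-03-02T12:40"}
    ]
  },
  "email": {"sent": []}
}
""")

def verify_event_created(state, title, start):
    """Check that the agent wrote the expected event into the calendar state."""
    return any(
        e["title"] == title and e["start"] == start
        for e in state["calendar"]["events"]
    )

# Passes only if the agent actually blocked the right slot.
print(verify_event_created(snapshot, "Flight AA123", "2026-03-02T09:15"))  # True
```

Because the check reads the backend's ground-truth state rather than scraping the UI, the same task can be re-run deterministically and graded the same way every time, which is the reproducibility claim above.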