December 20, 2025
Timer vs Terminator
Measuring AI Ability to Complete Long Tasks: Opus 4.5 has a 50% horizon of 4h 49m
AI can now finish ~5‑hour jobs half the time — hype, gripes, and naming fights
TLDR: Opus 4.5 can now finish tasks that take humans nearly 5 hours with a coin‑flip success rate, and the task length models can handle is doubling roughly every seven months. Commenters split between awe over real wins, gripes about the “Opus” name, and calls to benchmark rivals, turning long‑task stamina into the new AI battleground.
The headline stat: Opus 4.5 can complete tasks that take humans about 4 hours 49 minutes with a 50% success rate, and the researchers say this “task length” power is doubling every ~7 months. But the real fireworks erupted in the comments. grim_io cheered, calling this a “good way to measure improvement,” finally explaining why AIs ace tests but stumble on projects longer than a coffee break. Dwedit pulled the fire alarm on branding: “Opus is already an audio codec,” sparking a mini meme war of codec vs Codex puns. Meanwhile, subdavis dropped a mic-level anecdote: they asked Opus to “add vector search” and it spun up tools, migrations, and a front end—“15 minutes for what would’ve taken me 4+ hours.” Cue gasps, side-eye, and a lot of “ok but will it do my boring day job?” chatter.
People loved the simple rule-of-thumb: near-100% success on jobs that take humans under 4 minutes, under 10% on jobs that take more than 4 hours. It is an easy lens for non-nerds. yismail demanded “Benchmark Gemini 3.0 Pro too,” turning it into a scoreboard showdown. And simonw confessed they didn’t get “long tasks” until a real porting marathon; their story became the thread’s cautionary tale. The vibe: excitement about AI stamina gains, snark about names, and debate over which model can actually go the distance.
Key Points
- The study proposes measuring AI capability by the human-time length of tasks models can complete, defining a “task-completion time horizon.”
- Model success probability is predicted via logistic regression using human expert task times, enabling horizon estimates at fixed success levels.
- Current models are near 100% on tasks under ~4 minutes for humans but succeed less than 10% on tasks over ~4 hours.
- Opus 4.5 is reported with a 50% success horizon of 4 hours 49 minutes; Claude 3.7 Sonnet is cited as a top model with limited reliability on longer tasks.
- Across six years, the 50% horizon has grown exponentially with a ~7-month doubling time; results will be updated and depend on methodological choices.
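To make the "horizon" idea concrete, here is a minimal Python sketch of how a logistic curve over task length turns into a time horizon, and what a ~7-month doubling implies if it simply continued. It is an illustration, not the study's code: the curve's functional form (logistic in log2 of human task time), the `slope` value, and all function names are assumptions made for this example. Only the 4h 49m horizon and the ~7-month doubling time come from the summary above.

```python
import math

# Figures taken from the summary above.
HORIZON_50_MINUTES = 4 * 60 + 49   # 50% horizon: 4 hours 49 minutes
DOUBLING_MONTHS = 7.0              # approximate doubling time

# Hypothetical steepness of the curve; NOT a value reported in the source.
ASSUMED_SLOPE = 0.6


def success_probability(task_minutes, horizon_minutes, slope):
    """Modelled success chance for a task of the given human length.

    Assumes a logistic curve in log2(task length) that crosses 0.5
    exactly at horizon_minutes.
    """
    x = math.log2(horizon_minutes) - math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-slope * x))


def horizon_at(p, horizon_minutes, slope):
    """Task length (minutes) at which the modelled success chance equals p."""
    logit = math.log(p / (1.0 - p))
    return horizon_minutes * 2 ** (-logit / slope)


def projected_horizon(horizon_minutes, months_ahead, doubling_months=DOUBLING_MONTHS):
    """Naive extrapolation: the horizon doubles every doubling_months."""
    return horizon_minutes * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    for minutes in (4, 60, HORIZON_50_MINUTES, 16 * 60):
        p = success_probability(minutes, HORIZON_50_MINUTES, ASSUMED_SLOPE)
        print(f"{minutes:5d} min task -> modelled success {p:.2f}")

    h80 = horizon_at(0.8, HORIZON_50_MINUTES, ASSUMED_SLOPE)
    print(f"80% horizon under the assumed slope: {h80:.0f} min")

    for months in (7, 14, 21):
        h = projected_horizon(HORIZON_50_MINUTES, months)
        print(f"+{months} months -> 50% horizon of about {h / 60:.1f} hours")
```

The same curve yields a horizon at any success level, which is why a stricter bar (say 80%) gives a shorter horizon than the 50% headline. The extrapolation assumes the trend holds unchanged, which the summary itself flags as dependent on methodological choices.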