July 2, 2026
Bench press? More like stress press
CursorBench 3.1
AI coding scorecard drops, and the comments instantly turn into a trust fight
TLDR: CursorBench 3.1 says some AI coding tools can score high while costing less, with Cursor’s Composer 2.5 getting surprising placement. But the comment section immediately split between bargain-hunters, skeptics calling the results self-serving, and users roasting pricey models for going on pointless detours.
Cursor has released CursorBench 3.1, a new test meant to judge how well AI coding helpers handle messy, real-world jobs across multiple files. On paper, it’s a classic benchmark story: new tasks, updated grading, and a shiny chart comparing quality against average cost per task. But the real action is in the crowd reaction, where readers wasted zero seconds turning this into a full-on comment-section brawl over trust, value, and whether some models are basically just expensive chaos machines.
The loudest hot take? A brutal pile-on against Anthropic’s Claude family, with one commenter saying the models are great at “burning tokens” and sending themselves on pointless side quests like reading irrelevant documents or doing web searches nobody asked for. Another mini-drama exploded around Cursor’s own model, Composer 2.5, which appears surprisingly competitive for the price. Some readers were intrigued and asked if it was pulling a bargain-bin masterstroke, but skeptics were not buying it. One commenter flatly called the comparison to GPT-5.5 “absolutely farcical,” while another found it suspicious that Cursor’s benchmark seems especially flattering to Cursor’s own model and pointed to outside testing that tells a less glamorous story.
And because no internet debate is complete without chart discourse, one baffled reader got hung up on the graph itself, complaining that the cost axis feels backwards. So yes, the benchmark launched — but the community verdict is basically: cool chart, now prove it’s not marketing.
Key Points
- CursorBench 3.1 evaluates coding agents on ambiguous, multi-file tasks taken from real Cursor sessions.
- The benchmark compares model score against average cost per task and notes that higher scores are better.
- CursorBench 3.1 added tasks focused on codebase understanding, bugfinding, planning, and code review.
- The article says CursorBench 3.0 originally focused on edit, refactor, and bugfix tasks.
- Average cost per task is computed using each model's published token pricing across task token usage, and small score differences may not be statistically meaningful.
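For readers wondering what "average cost per task" could look like in practice, here is a minimal Python sketch of the arithmetic described in the last point. It is not Cursor's actual method: the split into input and output tokens, the per-million-token price units, and the names `TaskUsage`, `task_cost`, and `average_cost_per_task` are all assumptions made for illustration, and the prices in the usage note are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskUsage:
    """Tokens one model consumed on one benchmark task (hypothetical shape)."""
    input_tokens: int
    output_tokens: int


def task_cost(usage: TaskUsage, input_price: float, output_price: float) -> float:
    """Dollar cost of a single task.

    Prices are assumed to be quoted in dollars per million tokens,
    which is a common way vendors publish them.
    """
    total = usage.input_tokens * input_price + usage.output_tokens * output_price
    return total / 1_000_000


def average_cost_per_task(usages: list[TaskUsage],
                          input_price: float,
                          output_price: float) -> float:
    """Mean dollar cost across all tasks run by one model."""
    return mean(task_cost(u, input_price, output_price) for u in usages)
```

As a usage example with invented numbers: a model priced at $2 per million input tokens and $10 per million output tokens that uses 1,000,000 input tokens on one task and 1,000,000 output tokens on another would cost $2 and $10 respectively, for an average of $6 per task. This also shows why a model that wanders off on token-heavy detours can land far to the expensive side of the chart even with a good score.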