CursorBench 3.1

Cursor has released CursorBench 3.1, a new test meant to judge how well AI coding helpers handle messy, real-world jobs across multiple files. On paper, it’s a classic benchmark story: new tasks, updated grading, and a shiny chart comparing quality against average cost per task. But the real action is in the crowd reaction, where readers wasted zero seconds turning this into a full-on comment-section brawl over trust, value, and whether some models are basically just expensive chaos machines.

The loudest hot take? A brutal pile-on against Anthropic’s Claude family, with one commenter saying the models are great at “burning tokens” and sending themselves on pointless side quests like reading irrelevant documents or doing web searches nobody asked for. Another mini-drama exploded around Cursor’s own model, Composer 2.5, which appears surprisingly competitive for the price. Some readers were intrigued and asked if it was pulling off a bargain-bin masterstroke, but skeptics were not buying it. One commenter flatly called the comparison to GPT-5.5 “absolutely farcical,” while another found it suspicious that Cursor’s benchmark seems especially flattering to Cursor’s own model and pointed to outside testing that tells a less glamorous story.

And because no internet debate is complete without chart discourse, one baffled reader got hung up on the graph itself, complaining that the cost axis feels backwards. So yes, the benchmark launched, but the community verdict is basically: cool chart, now prove it’s not marketing.

July 2, 2026

Bench press? More like stress press

AI coding scorecard drops, and the comments instantly turn into a trust fight

