April 1, 2026
AI’s Dollar-Store King Wins
StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)
Cheap AI beats the fancy giants and everyone’s losing their minds
TLDR: A little-known AI model called StepFun 3.5 Flash just beat big-name systems on value, topping the cost-effective leaderboard while pricey Claude Opus sinks near the bottom. Commenters are gleefully asking why they’ve been overpaying for “premium” AI when the bargain option is winning real-world tests.
StepFun 3.5 Flash just walked into the OpenClaw Arena, paid almost nothing at the door, and started absolutely bodying the expensive models — and the community is both impressed and slightly offended. One benchmarker, skysniper, dropped receipts from 300+ battles: the “best” pure-performance model, Claude Opus, is #1 in raw power but crashes all the way down to #14 when you factor in price. StepFun, meanwhile, takes the #1 cost-effective crown and instantly becomes the people’s champ.
In the comments, the vibe is: “Why are we paying designer prices for budget-brain results?” hadlock points out that StepFun is now the most-used model on another platform, with way more usage than some big-name rivals, at “about 5%” of the cost of one of the fancy models. Translation: folks are realizing they’ve been buying champagne when store-brand soda gets the job done.
There’s also side drama: smallerize calls out a tool maker for botching a StepFun release and then seemingly ghosting the fix, while WhitneyLand casually drops a link to an earlier thread like, “You’re late, the real nerd fight started weeks ago.” And to top it off, skysniper says Google’s shiny Gemini 3.1 Pro sometimes just… reads the instructions and does nothing, turning into the AI equivalent of a teenager ignoring chores. The crowd loves StepFun — and they’re roasting everyone else.
Key Points
- The OpenClaw Arena leaderboard ranks AI models by performance and cost-effectiveness on real agent tasks, as of April 1, 2026.
- Step 3.5 Flash (stepfun/step-3.5-flash) ranks #1 in cost-effectiveness with a score of 1327±88 over 98 battles.
- Grok 4.1 Fast and Minimax M2.7 rank #2 and #3 respectively, with scores slightly below Step 3.5 Flash’s.
- The table reports rank spreads and score confidence intervals derived via bootstrap resampling to indicate ranking uncertainty.
- The ranking methodology relies on relative outcomes of head-to-head battles and a grouped Plackett–Luce model, similar in principle to Chatbot Arena; a rough sketch of this kind of approach follows below.
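For the curious, here is a minimal illustrative sketch of that style of pipeline in Python. For two-model head-to-head battles, Plackett–Luce reduces to the Bradley–Terry model, so the sketch fits Bradley–Terry strengths by gradient ascent and then bootstraps the battle log to get confidence intervals. Everything in it is an assumption for illustration: the function names, the Elo-like 1000 + 400 * log10 display scaling, and the toy data are made up here, not OpenClaw Arena's actual code or numbers.

```python
import numpy as np

def fit_bradley_terry(winners, losers, n_models, steps=500, lr=0.1):
    """Maximum-likelihood log-strengths s, where P(i beats j) = sigmoid(s_i - s_j)."""
    s = np.zeros(n_models)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(s[winners] - s[losers])))  # predicted win prob
        grad = np.zeros(n_models)
        np.add.at(grad, winners, 1.0 - p)    # push winners' strengths up
        np.add.at(grad, losers, -(1.0 - p))  # push losers' strengths down
        s += lr * grad
    return s - s.mean()  # center: only strength differences are identified

def bootstrap_scores(winners, losers, n_models, n_boot=200, seed=0):
    """Percentile confidence intervals from refits on resampled battles."""
    rng = np.random.default_rng(seed)
    n = len(winners)
    fits = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample battles with replacement
        fits.append(fit_bradley_terry(winners[idx], losers[idx], n_models))
    fits = np.asarray(fits)
    elo = 1000 + 400 * fits / np.log(10)  # Elo-like display scale (assumed)
    return elo.mean(axis=0), np.percentile(elo, [2.5, 97.5], axis=0)

# Toy usage with made-up battles: model 0 beats models 1 and 2 most of the time.
winners = np.array([0, 0, 0, 1, 0, 2, 0, 0])
losers  = np.array([1, 2, 1, 0, 2, 1, 2, 1])
means, (lo, hi) = bootstrap_scores(winners, losers, n_models=3)
for m in range(3):
    print(f"model {m}: {means[m]:.0f}  (95% CI {lo[m]:.0f}-{hi[m]:.0f})")
```

The bootstrap is what produces the rank spreads mentioned above: refitting on resampled battle logs shows how much each score wobbles, and when a model's interval overlaps its neighbor's, their order can plausibly flip.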