15× vs. ~1.37×: Recalculating GPT-5.3-Codex-Spark on SWE-Bench Pro

Commenters call “15× faster” a stretch — say it’s ~1.37× and a settings trick

TLDR: OpenAI’s “15× faster” headline for GPT-5.3-Codex-Spark is being challenged by users who say a fair, same-accuracy comparison shows closer to 1.37×. The thread erupts into a hype vs. reality brawl, with calls for independent tests, counter-arguments about what “faster” means, and memes about a “reasoning knob.”

OpenAI dropped a flashy “15× faster” claim for its new GPT-5.3-Codex-Spark on the SWE-Bench Pro coding benchmark — and the comments section immediately turned into a speed-trap court. The top vibe: that number only works if you change the settings, not the model’s brains. Commenter nvanlandschoot recalculated the chart and says that at similar accuracy, Spark is about 1.37× faster, not 15×, because OpenAI compared different “reasoning” modes (basically a knob that makes the model think longer).

Cue the drama. Skeptics like solarkraft say the hype only fools people who don’t actually use these tools, adding that real users are “pretty disillusioned.” Others fire back: charcircuit argues you don’t need equal intelligence to call something faster — just like a smaller model can be snappier than a big one. Meanwhile, pennaMan blames early hardware (Cerebras) for today’s rough edges and promises the speed curve will look better soon.

Journalists caught strays too. jiggawatts roasted the tech press for parroting vendor slides: “Why not just benchmark the models yourself?” And yes, there were memes: “15× if you squint at the settings,” and “x‑high is just ‘please think longer.’” To be fair, folks did cheer a legit win: a smaller model jumped to ~3.46× faster with slightly better accuracy in three months. But the community’s bottom line? Don’t sell a configuration tweak as a moon landing.

Key Points

  • OpenAI claimed GPT-5.3-Codex-Spark is “15× faster” than GPT-5.3-Codex on SWE-Bench Pro.
  • When models are matched at comparable accuracy, the effective speedup is about 1.37× (i.e., ≈26.8% less wall-clock time), not 15×.
  • The 15× figure arises from comparing different reasoning settings (Spark low vs. Codex x-high), not inherent model speed.
  • On this benchmark, x-high yields ~1.43% more accuracy than high but requires ~84% more time, showing steep latency trade-offs.
  • Across generations, performance improved: GPT-5.1-Codex-mini (high) to Spark shows ~3.46× faster and +1.5 pp accuracy (7.09 min at 49.0% to 2.05 min at 50.5%).
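The arithmetic behind these bullets is easy to reproduce. Here is a minimal sketch using only the numbers quoted above (the helper names `speedup` and `time_saved_frac` are ours, not from the thread):

```python
def speedup(t_baseline_min: float, t_new_min: float) -> float:
    """How many times faster the new run is (ratio of wall-clock times)."""
    return t_baseline_min / t_new_min

def time_saved_frac(spd: float) -> float:
    """Fraction of wall-clock time saved by a given speedup factor."""
    return 1.0 - 1.0 / spd

# Cross-generation comparison cited in the thread:
# GPT-5.1-Codex-mini (high): 7.09 min at 49.0% -> Spark: 2.05 min at 50.5%
gen_speedup = speedup(7.09, 2.05)
print(f"{gen_speedup:.2f}x")  # ~3.46x

# A matched-accuracy speedup of ~1.37x trims wall-clock time by only ~27%,
# nowhere near the 15x headline (which compares different reasoning modes).
print(f"{time_saved_frac(1.37):.1%}")  # ~27.0%
```

Note the two ways of quoting the same number: a 1.37× speedup and a ~27% reduction in wall-clock time are equivalent, which is why the thread's "1.37×" and "≈26.8% less time" figures describe one comparison, not two.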

Hottest takes

“the effective speedup is closer to ~1.37× rather than 15×” — nvanlandschoot
“people who do [use it] are pretty disillusioned” — solarkraft
“Why not just benchmark the models yourself?” — jiggawatts
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.