Two different tricks for fast LLM inference

OpenAI’s speed demon vs Anthropic’s real-deal pass — fans pick sides

TLDR: OpenAI’s fast mode is insanely quick but uses a lighter model; Anthropic’s is slower but keeps its top model intact. Commenters clash over speed versus reliability, arguing over chips, batching, and hype, while predicting both companies will eventually meet in the middle.

The internet is split over the new “fast mode” race, and the comments are pure chaos. OpenAI claims lightning speed, more than 1,000 tokens per second, but it’s powered by a lighter model, GPT-5.3-Codex-Spark, not their top coding model. Anthropic’s fast mode is slower (about 170 tokens per second) but serves the real Claude Opus 4.6, and that’s sparking a quality-versus-speed brawl. Some swear Anthropic’s approach is just paying extra to skip the line, a “bus pass” for instant replies. Others worship OpenAI’s monster Cerebras chips like a pizza-sized turbo button.

Hot take central: one crowd accuses Anthropic of overhyping with “neo-hipster” vibes while OpenAI is allegedly laser-focused on cost-cutting and efficiency. Another camp argues the article’s “batch size” theory oversimplifies why fast is fast, while nerds toss in spicy guesses like quantization (shrinking numbers to go faster) and speculative decoding (a helper model that drafts ahead). Meanwhile, pragmatic voices say OpenAI can polish Spark’s accuracy, and Anthropic will just scale up — expect both to converge.
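
For the curious, here’s roughly what those two guesses mean. Quantization just stores the model’s numbers in fewer bits (say, int8 instead of fp16), so weights stream through memory faster. Speculative decoding is sketched below as a minimal toy: the `draft_model` and `target_model` stand-ins are made up for illustration, and real systems verify the whole draft in one batched forward pass (often with probabilistic accept/reject rather than exact greedy matching). Nothing here is either lab’s published implementation.

```python
# Toy vocabulary; both "models" below are deterministic stand-ins,
# not real LLMs. A real setup pairs a small LLM with the full-size one.
VOCAB = list("abcdefgh")

def draft_model(prefix: str) -> str:
    """Cheap, fast model: proposes the next token."""
    return VOCAB[hash(("draft", prefix)) % len(VOCAB)]

def target_model(prefix: str) -> str:
    """Expensive model whose output we must exactly preserve."""
    return VOCAB[hash(("target", prefix)) % len(VOCAB)]

def speculative_decode(prompt: str, n_tokens: int, k: int = 4) -> str:
    """Greedy speculative decoding sketch: draft k tokens cheaply, then
    let the target model check them. Matching tokens are accepted "for
    free"; the first mismatch is replaced by the target's own token and
    drafting restarts. The output is identical to running the target
    model alone, just (hopefully) in fewer expensive steps.
    """
    out = prompt
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], out
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx += t
        # 2. Verify against the target (one batched pass in reality).
        ctx = out
        for t in draft:
            expected = target_model(ctx)
            if t != expected:
                out += expected   # mismatch: keep the target's token, stop
                break
            out += t              # match: accepted token, effectively free
            ctx += t
    return out[: len(prompt) + n_tokens]

print(speculative_decode("seed:", 16))
```

With random stand-ins the draft rarely agrees with the target, so this toy shows the mechanics rather than the speedup; the win only appears when the small model predicts the big one correctly most of the time.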

Memes? Plenty. “Fast mode is fast-food: tasty, not nutritious,” and “bus leaves immediately” vs “monster chip pizza party.” The comment section is a Fast & Furious sequel: speed junkies vs quality purists, with OpenAI and Anthropic as rival pit crews.

Key Points

  • Anthropic and OpenAI both launched “fast mode” for coding models, but with different trade-offs and implementations.
  • Anthropic’s fast mode reportedly delivers ~170 tokens/s (~2.5x its standard speed), serving the actual Opus 4.6 model.
  • OpenAI’s fast mode exceeds 1,000 tokens/s (~15x) using GPT-5.3-Codex-Spark, a faster but less capable variant.
  • The article hypothesizes Anthropic uses low-batch-size inference, trading higher cost per token for lower latency and moderate speed gains (see the toy model after this list).
  • OpenAI’s fast mode is backed by Cerebras hardware, with very large chips enabling ultra-low-latency, high-throughput inference.
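
One sanity check the numbers allow: the two multipliers imply similar baselines, since 170 / 2.5 ≈ 68 tok/s and 1,000 / 15 ≈ 67 tok/s. As for the batch-size hypothesis, here’s a back-of-envelope toy model of why shrinking the batch speeds up each request but costs more per token: every decode step has to stream the model’s weights no matter how many requests share it, so big batches amortize that fixed cost. The constants below are invented for illustration; nothing is measured from either lab.

```python
# Back-of-envelope model of the batch-size trade-off. All constants
# are invented for illustration.
T_WEIGHTS = 0.010   # s per decode step spent streaming weights (fixed)
T_PER_SEQ = 0.0006  # s of extra per-sequence compute per step

def step_time(batch: int) -> float:
    """Time for one decode step: fixed weight-streaming cost plus
    compute that grows with how many sequences share the step."""
    return T_WEIGHTS + T_PER_SEQ * batch

for b in (1, 8, 64, 256):
    t = step_time(b)
    per_request = 1.0 / t      # tokens/s each user sees (1 token/step)
    total = b / t              # tokens/s the chip produces overall
    chip_s_per_token = t / b   # cost proxy: chip-seconds per token
    print(f"batch={b:>3}  user={per_request:6.1f} tok/s  "
          f"chip={total:7.0f} tok/s  cost={chip_s_per_token:.5f} chip-s/tok")
```

Under these made-up constants, batch 1 serves each user about 15x faster than batch 256 but burns roughly 17x more chip-time per token, which is exactly the shape of trade-off the article attributes to Anthropic’s pricier fast mode.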

Hottest takes

"focused on cost cutting/efficiency while anthropic is mostly going the opposite direction" — retinaros
"The batch size explanation is wrong" — dist-epoch
"OpenAI have room for improvement ... while Anthropic are left with scaling vertically" — gostsamo

Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.