March 10, 2026

Speed run or smoke and mirrors?

Surpassing vLLM with a Generated Inference Stack

Startup says it’s 34% faster than vLLM — commenters yell “show receipts”

TLDR: Infinity claims its generated engine serves text up to 34% faster than vLLM, but commenters demand proof of correctness, open benchmarks, and support for features like paged attention. The big question: is this a real breakthrough or just flashy numbers? Faster AI only matters if it’s right and repeatable.

Infinity just claimed its auto-built AI engine “infy” beats the popular speed champ vLLM on Qwen3-8B, serving up to 34.3% more tokens per second (basically: it writes words faster). They say it skips heavy frameworks, fuses operations, and even does without a common speed trick called “speculative decoding.” Sounds impressive, and the comment section immediately turned into a courtroom.
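For readers unfamiliar with the “speed trick” infy reportedly skips: in speculative decoding, a small draft model proposes several tokens cheaply, and the big target model verifies them in one batched pass, keeping the longest agreeing prefix. The toy sketch below illustrates the idea only; the stand-in “models” are made-up rules, not anything from infy or vLLM.

```python
import random

random.seed(0)
VOCAB = list("abcde")

def target_next(ctx):
    """Stand-in for the big target model's greedy next token (toy rule)."""
    return VOCAB[(len(ctx) * 2) % len(VOCAB)]

def draft_next(ctx):
    """Stand-in for a small, cheaper draft model; sometimes wrong."""
    return target_next(ctx) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(ctx, k=4):
    """Draft k tokens, then keep the longest prefix the target agrees with.

    In a real engine the target verifies all k drafts in one batched
    forward pass, so each step can emit several tokens instead of one.
    """
    drafts = []
    c = list(ctx)
    for _ in range(k):
        t = draft_next(c)
        drafts.append(t)
        c.append(t)
    accepted = []
    c = list(ctx)
    for t in drafts:
        if t == target_next(c):
            accepted.append(t)
            c.append(t)
        else:
            # First disagreement: substitute the target's own token and stop.
            accepted.append(target_next(c))
            break
    return accepted

print(speculative_step(["a"]))
```

Each step yields between 1 and k tokens, and every accepted token matches what the target model would have produced greedily, which is why the trick speeds decoding up without changing the output.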

Skeptics swarmed. One user demanded proof the model outputs are identical, not just fast, asking why Infinity can’t match token-by-token probabilities. Another fired off the meme of the day: “I can run it at a billion tokens per second if you don’t check quality.” Translation: speed without correctness is just a joyride. Others wanted the basics: where’s the source code, how’s batching done, what about quantization (compressing the model), and can it handle paged attention (a memory trick vLLM uses) without falling apart? Even a straightforward ask popped up: any numbers in BF16 (a different precision setting) or just FP8?
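The correctness check skeptics are asking for is concrete: with greedy decoding, two engines serving the same model should produce the same tokens with near-identical per-token log-probabilities. Here is a minimal sketch of such a comparison; the engine names and numbers are illustrative, not real infy or vLLM output.

```python
import math

# Hypothetical per-token log-probabilities from two engines for the same
# prompt with greedy (temperature=0) decoding. Values are made up.
engine_a_logprobs = [-0.01, -1.20, -0.35, -2.070]
engine_b_logprobs = [-0.01, -1.20, -0.35, -2.075]

def outputs_match(a, b, atol=1e-2):
    """Check token-by-token log-probabilities agree within a tolerance.

    Exact bitwise equality is unrealistic across different kernels
    (fused ops reorder floating-point math), so agreement within a
    small tolerance is the practical bar skeptics want cleared.
    """
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=atol) for x, y in zip(a, b)
    )

print(outputs_match(engine_a_logprobs, engine_b_logprobs))  # prints True
```

A real audit would dump logprobs from both engines over many prompts and flag any token where they diverge, which is exactly the receipt commenters say Infinity hasn’t produced.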

Fans of Infinity’s approach say a clean-slate engine could cherry-pick the best ideas and squeeze out real gains. But the vibe today is “prove it or it didn’t happen.” The case study dropped a bomb; the comments brought the lie detector. Your move, Infinity.

Key Points

  • Infinity’s ‘infy’ system generated a custom inference engine for Qwen3-8B and claims higher throughput than vLLM under identical parameters and hardware.
  • On decode-heavy workloads (ISL=1k, OSL=8k), infy reports 6,712 tok/s, +34.3% over vLLM; on prefill-heavy workloads (ISL=8k, OSL=1k), 22,470 tok/s, +15.9% over vLLM.
  • Benchmarks were run on an H100 80GB SXM5 GPU using FP8, batch size 88, and no speculative decoding, with parameters matching vLLM 0.13.0.
  • The generated stack is model- and hardware-specific, incorporates techniques from vLLM and SGLang, and uses libraries like FlashInfer and DeepGEMM while bypassing framework overhead.
  • Reported gains derive from cross-layer kernel fusion, scheduling micro-optimizations, algorithmic reorganization, and compute graph refinement; the optimization trajectory included 111 iterations.
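The article reports only infy’s absolute throughput and the percentage gains, so the vLLM baselines have to be backed out arithmetically. The quick check below does that; the derived baselines are implied by the published figures, not independently measured numbers.

```python
# Back out the implied vLLM baselines from infy's reported throughput
# and percentage gains (figures from the article's Key Points).
reported = {
    "decode-heavy (ISL=1k, OSL=8k)": (6_712, 0.343),
    "prefill-heavy (ISL=8k, OSL=1k)": (22_470, 0.159),
}

for workload, (infy_tps, gain) in reported.items():
    implied_vllm_tps = infy_tps / (1 + gain)
    print(f"{workload}: implied vLLM baseline ~ {implied_vllm_tps:,.0f} tok/s")
```

This puts the implied vLLM baselines around 4,998 tok/s (decode-heavy) and 19,387 tok/s (prefill-heavy), numbers commenters could compare against their own H100 runs of vLLM 0.13.0.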

Hottest takes

"I can run Qwen-8B at 1 billion tokens per second if you don't check the model's output quality." — rfw300
"The fact that they are not doing this makes me suspicious that they are in fact not doing the exact same thing as vLLM." — ntonozzi
"Does it support paged attention like vLLM though?" — storus
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.