March 10, 2026
Speed run or smoke and mirrors?
Surpassing vLLM with a Generated Inference Stack
Startup says it’s 34% faster than vLLM — commenters yell “show receipts”
TLDR: Infinity claims its new engine generates text up to 34% faster than vLLM, but commenters demand proof of correctness, open benchmarks, and support for features like paged attention. The big question: real breakthrough or just flashy numbers? Faster AI only matters if it's right and repeatable.
Infinity just claimed its auto-built AI engine “infy” beats popular speed champ vLLM by up to 34.3% more tokens per second on Qwen3-8B (basically: it writes words faster). They say it skips heavy frameworks, fuses operations, and even avoids a common speed trick called “speculative decoding.” Sounds impressive — and the comment section immediately turned into a courtroom.
Skeptics swarmed. One user demanded proof the model outputs are identical, not just fast, asking why Infinity can’t match token-by-token probabilities. Another fired off the meme of the day: “I can run it at a billion tokens per second if you don’t check quality.” Translation: speed without correctness is just a joyride. Others wanted the basics: where’s the source code, how’s batching done, what about quantization (compressing the model), and can it handle paged attention (a memory trick vLLM uses) without falling apart? Even a straightforward ask popped up: any numbers in BF16 (a different precision setting) or just FP8?
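The "identical outputs" demand has a concrete test behind it: with greedy decoding, two correct engines should pick the same tokens and report nearly the same per-token log-probabilities. A minimal sketch of that check, using made-up arrays in place of real engine output (in practice you'd pull logprobs from each engine's API; a small tolerance is needed because FP8 kernels won't match bit-for-bit):

```python
import numpy as np

# Hypothetical per-token log-probabilities for the same prompt under
# greedy (temperature=0) decoding. Real values would come from each
# engine's logprobs output; these arrays are illustrative only.
ref_logprobs = np.array([-0.12, -1.05, -0.33, -2.41])  # baseline, e.g. vLLM
new_logprobs = np.array([-0.12, -1.04, -0.33, -2.42])  # challenger, e.g. infy

def outputs_match(a: np.ndarray, b: np.ndarray, atol: float = 1e-2) -> bool:
    # Same length and numerically close logprobs at every position.
    return a.shape == b.shape and bool(np.allclose(a, b, atol=atol))

print(outputs_match(ref_logprobs, new_logprobs))  # True within tolerance
```

Passing this kind of check across many prompts is what commenters mean by "match token-by-token probabilities"; raw tok/s says nothing about it.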
Fans of Infinity’s approach say a clean-slate engine could cherry-pick the best ideas and squeeze out real gains. But the vibe today is prove it or it didn’t happen. The case study dropped a bomb; the comments brought the lie detector. Your move, Infinity.
Key Points
- Infinity’s ‘infy’ system generated a custom inference engine for Qwen3-8B and claims higher throughput than vLLM under identical parameters and hardware.
- On decode-heavy workloads (ISL=1k, OSL=8k), infy reports 6,712 tok/s, +34.3% over vLLM; on prefill-heavy workloads (ISL=8k, OSL=1k), 22,470 tok/s, +15.9% over vLLM.
- Benchmarks were run on an H100 80GB SXM5 GPU using FP8, batch size 88, and no speculative decoding, with parameters matching vLLM 0.13.0.
- The generated stack is model- and hardware-specific, incorporates techniques from vLLM and SGLang, and uses libraries like FlashInfer and DeepGEMM while bypassing framework overhead.
- Reported gains derive from cross-layer kernel fusion, scheduling micro-optimizations, algorithmic reorganization, and compute graph refinement; the optimization trajectory included 111 iterations.