January 13, 2026
Benchmarks & Backseat Drivers
vLLM large scale serving: DeepSeek 2.2k tok/s/h200 with wide-ep
2.2K tokens/sec per GPU sparks cheers, AMD FOMO, latency wars
TLDR: vLLM’s V1 engine hits 2,200 tokens per second per GPU, cutting costs and boosting scale. Commenters cheered the savings, argued about speed versus latency for coders, and reignited the AMD-versus-NVIDIA fight—proof that faster AI is great, but users still want snappy responses and broader hardware support.
vLLM just flipped the nitro: the new V1 engine and a pile of optimizations pushed real-world throughput to 2,200 tokens per second per H200 GPU on CoreWeave’s cluster. Translation for non-tech folks: the AI spits out words really fast, and it can do more with fewer machines. The crowd went wild. One commenter marveled at a “40+%” jump and argued this means the cost of AI smarts keeps dropping; cue the money memes and “AI inflation is over” hot takes.

But not everyone’s just popping confetti. A coding-heavy user begged for lower latency (that’s the wait time for answers) and pointed out that these wins likely come from heavy batching, which is great for scale but not always great for interactive coding. Meanwhile, the GPU wars lit up: an AMD fan demanded better support for Team Red’s chips, wondering why the party still feels NVIDIA-only. Devs chimed in too, with an Elixir wrapper for vLLM already live (link) and promises to update it for v0.11.0. For newbies: vLLM’s speed comes from clever tricks like expert parallelism (spreading a model’s “specialist” parts across machines), MoE (mixture-of-experts), and dual-batch overlap (overlapping mini-jobs so compute and communication happen at the same time). The mood: hype, hustle, and a dash of “please fix latency” drama; classic internet tech theater.
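For readers who want to see the idea rather than just the jargon, here is a toy sketch in plain Python (not vLLM code; the expert count, rank layout, and random gate are all made up for illustration) of how MoE routing plus expert parallelism turns inference into a dispatch problem:

```python
# Toy illustration only: shows the shape of mixture-of-experts (MoE) routing
# with expert parallelism. Each token is sent to its top-k "specialist"
# experts, and the experts are partitioned across GPUs ("ranks").
import random

NUM_EXPERTS = 8   # hypothetical expert count
NUM_RANKS = 4     # hypothetical GPUs: experts 0-1 on rank 0, 2-3 on rank 1, ...
TOP_K = 2         # each token consults its 2 best experts

def expert_to_rank(expert_id: int) -> int:
    """Experts are partitioned evenly across ranks (expert parallelism)."""
    return expert_id // (NUM_EXPERTS // NUM_RANKS)

def route(token: str) -> list[int]:
    """Stand-in for a learned gating network: pick the token's top-k experts."""
    scores = [(random.random(), e) for e in range(NUM_EXPERTS)]
    return [e for _, e in sorted(scores, reverse=True)[:TOP_K]]

for tok in ["the", "quick", "brown", "fox"]:
    experts = route(tok)
    placement = {e: expert_to_rank(e) for e in experts}
    # In a real deployment this dispatch is an all-to-all exchange across GPUs;
    # Wide-EP additionally runs attention data-parallel so each rank feeds the
    # shared expert pool with its own slice of the batch.
    print(f"token {tok!r} -> experts {experts} on ranks {placement}")
```

The real engine swaps the random gate for a learned router and the print for high-bandwidth all-to-all kernels such as DeepEP.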
Key Points
- vLLM v0.11.0 completes migration to the V1 engine and adds multiple serving optimizations for high-throughput LLM inference.
- Community benchmarks on a CoreWeave H200 cluster show sustained 2.2k tokens/s per H200 GPU, up from ~1.5k tokens/s, enabled by kernel improvements and dual-batch overlap (DBO); see the DBO sketch after this list.
- Wide-EP combines expert parallelism with data parallelism to address DeepSeek models’ multi-head latent attention (MLA) and KV-cache constraints, improving effective batch size.
- vLLM integrates DeepEP all-to-all kernels and supports Perplexity MoE and NCCL AllGather-ReduceScatter backends to reduce synchronization overhead.
- The performance gains enable workload consolidation: fewer replicas for a target QPS and a lower cost per token for operators (see the cost sketch after this list).
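On the dual-batch overlap point above: the trick is to split each batch into two micro-batches so that while one waits on the MoE all-to-all exchange, the other keeps the GPU busy. The asyncio snippet below is only a scheduling analogy with invented function names and timings, not vLLM’s implementation:

```python
# Illustrative sketch of the dual-batch overlap (DBO) idea: overlap one
# micro-batch's communication with the other micro-batch's compute.
import asyncio

async def communicate(mb: str) -> None:
    print(f"{mb}: all-to-all dispatch starts")
    await asyncio.sleep(0.2)   # stand-in for network time
    print(f"{mb}: all-to-all dispatch done")

async def compute(mb: str) -> None:
    print(f"{mb}: expert compute starts")
    await asyncio.sleep(0.2)   # stand-in for GPU time
    print(f"{mb}: expert compute done")

async def step() -> None:
    # Phase 1: micro-batch A communicates while micro-batch B computes.
    await asyncio.gather(communicate("micro-batch A"), compute("micro-batch B"))
    # Phase 2: roles swap, so neither half of the step sits idle on the network.
    await asyncio.gather(compute("micro-batch A"), communicate("micro-batch B"))

asyncio.run(step())
```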
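And on the consolidation point: a back-of-the-envelope sketch of why per-GPU throughput shows up directly in replica counts and token prices. The target load and the dollars-per-GPU-hour figure are assumptions for illustration; only the ~1.5k and 2.2k tok/s numbers come from the benchmark above:

```python
# Back-of-the-envelope math, not a benchmark: fewer GPUs for the same load.
TARGET_TOKENS_PER_SEC = 100_000   # hypothetical fleet-wide demand
OLD_PER_GPU = 1_500               # ~1.5k tok/s per H200 (before, from the post)
NEW_PER_GPU = 2_200               # 2.2k tok/s per H200 (after, from the post)
GPU_HOUR_COST = 4.00              # assumed $/H200-hour; varies by provider

def gpus_needed(per_gpu: int) -> int:
    """Ceiling division: enough GPUs to cover the target throughput."""
    return -(-TARGET_TOKENS_PER_SEC // per_gpu)

for label, per_gpu in [("before", OLD_PER_GPU), ("after", NEW_PER_GPU)]:
    cost_per_mtok = GPU_HOUR_COST / (per_gpu * 3600) * 1_000_000
    print(f"{label}: {gpus_needed(per_gpu)} GPUs for {TARGET_TOKENS_PER_SEC:,} tok/s, "
          f"~${cost_per_mtok:.2f} per million tokens")
```

With these assumed numbers the jump from ~1.5k to 2.2k tok/s drops the fleet from 67 to 46 GPUs and shaves roughly a third off the per-token cost, which is the “fewer replicas, cheaper tokens” story in the bullet above.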