April 25, 2026
Benchmarks & broken hearts
DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles
DeepSeek-V4 launches and the comment wars over “who’s fastest” explode
TLDR: DeepSeek-V4 launched with a new stack claiming super-fast, cheap AI responses, but the community is already calling out the benchmarks as unfair and impossible to compare. Readers are torn between excitement over the tech and frustration that every team declares victory without racing on the same track.
DeepSeek-V4 just dropped with a wall of nerdy features, but the real fireworks are in the comments, where people are already fighting over whose favorite engine is actually fastest. The blog boasts “day zero” support for the new mega AI model via SGLang and Miles, promising big speed and low costs thanks to a pile of clever tricks for storing and reusing past conversation data (KV-cache reuse, in nerd terms).
Meanwhile, in the threads, users are basically saying, “Cool story, but where’s the fair fight?” One commenter instantly pulled in a rival blog post from the vLLM camp and a benchmark site, hinting that the numbers being shown aren’t exactly apples-to-apples. Translation for non-nerds: everyone’s claiming their car is the fastest, but nobody’s racing on the same track. That’s all it took for the usual fanboy energy to kick in, with people side-eyeing the charts and suggesting this is more marketing than science.
There’s a strong “benchmark fatigue” vibe: some commenters joke that every week a new chart proves someone else is the king, until the next one drops. Others mock the alphabet soup of features, saying it sounds like a Marvel villain origin story more than an AI stack. Underneath the jokes, though, the message is clear: the tech is impressive, but the community wants honest, comparable tests before crowning any winner.
Key Points
- SGLang and Miles ship day-0 open-source support for both inference and RL training on DeepSeek-V4, with systems built around its hybrid sparse attention, mHC, and FP4 MoE expert weights.
- DeepSeek-V4 extends DeepSeek-V3.2 with hybrid attention (sliding-window attention plus C4/C128 KV compression, roughly sketched after this list) enabling a 1M-token context, mHC for better gradients and representations, and FP4 experts optimized for Blackwell.
- The stack introduces ShadowRadix, a native prefix-caching mechanism for hybrid attention that pairs a radix tree with per-pool shadows to keep the KV pools coherent and enable compressed-KV reuse (toy lookup sketch below).
- Inference optimizations include HiSparse CPU-extended KV, MTP speculative decoding with in-graph metadata (generic draft-and-verify loop sketched below), Flash Compressor, Lightning TopK, hierarchical multi-stream overlap, and kernel integrations (FlashMLA, FlashInfer + TensorRT-LLM MoE, DeepGEMM, TileLang).
- RL training supports full parallelism (DP/TP/SP/EP/PP/CP; a toy rank layout follows below), TileLang attention, enhanced stability, and FP8 training, with hardware support spanning NVIDIA Hopper, Blackwell, and Grace Blackwell, plus AMD.
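The hybrid-attention bullet is easier to picture in code. Below is a minimal NumPy sketch of the general SWA-plus-compressed-history idea: recent tokens keep full-resolution KV, while older tokens survive only as mean-pooled summaries at two strides. The window size, strides, pooling choice, and function names here are illustrative assumptions, not DeepSeek-V4's actual C4/C128 scheme (which lives in fused GPU kernels).

```python
import numpy as np

def compress_kv(kv: np.ndarray, stride: int) -> np.ndarray:
    """Mean-pool the KV cache over non-overlapping blocks of `stride` tokens."""
    t, d = kv.shape
    t_trim = (t // stride) * stride            # drop the ragged tail for clarity
    return kv[:t_trim].reshape(-1, stride, d).mean(axis=1)

def hybrid_kv_for_query(kv: np.ndarray, pos: int, window: int = 8):
    """What a query at `pos` sees: exact recent window + compressed history."""
    recent = kv[max(0, pos - window): pos + 1]   # full-resolution local tokens
    history = kv[: max(0, pos - window)]         # older tokens are kept only...
    coarse = compress_kv(history, stride=4)      # ...as 4:1 summaries and
    coarser = compress_kv(history, stride=128)   # 128:1 long-range summaries
    return recent, coarse, coarser

kv = np.random.randn(1024, 64).astype(np.float32)  # 1024 cached tokens, dim 64
recent, c4, c128 = hybrid_kv_for_query(kv, pos=1000)
print(recent.shape, c4.shape, c128.shape)          # (9, 64) (248, 64) (7, 64)
```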
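For the prefix-caching bullet, here is a toy Python version of the radix-tree matching that any prefix cache (ShadowRadix included) is built on. For brevity it is a plain trie rather than a path-compressed radix tree, and it skips the per-pool "shadow" bookkeeping that, per the post, keeps the compressed and full-resolution KV pools coherent; the class and handle names are made up for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict[int, Node] = field(default_factory=dict)  # next-token edges
    kv_handle: object = None     # stand-in for a reference to cached KV blocks

class PrefixCache:
    """Toy trie mapping token-ID prefixes to cached-KV handles."""

    def __init__(self) -> None:
        self.root = Node()

    def insert(self, tokens: list[int], kv_handle: object) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
        node.kv_handle = kv_handle

    def longest_prefix(self, tokens: list[int]) -> tuple[int, object]:
        """Return (matched_length, kv_handle) of the longest cached prefix."""
        node, best_len, best_handle = self.root, 0, None
        for i, t in enumerate(tokens):
            node = node.children.get(t)
            if node is None:
                break
            if node.kv_handle is not None:
                best_len, best_handle = i + 1, node.kv_handle
        return best_len, best_handle

cache = PrefixCache()
cache.insert([1, 2, 3, 4], kv_handle="kv:system-prompt")
print(cache.longest_prefix([1, 2, 3, 4, 9, 9]))  # (4, 'kv:system-prompt')
```

The radix structure is why shared system prompts are cheap: every request walks the same path and reuses the same cached blocks instead of recomputing them.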
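The MTP bullet boils down to classic speculative decoding: a cheap drafter proposes several tokens and the big model verifies them in one pass. A generic sketch follows; `draft` and `verify` are hypothetical stand-ins, real schemes also resample a token at the first rejection, and none of DeepSeek's in-graph metadata machinery is modeled here.

```python
def speculative_step(prefix, draft, verify, k=4):
    """One round: draft k cheap tokens, check them all in a single 'expensive'
    pass, commit the longest accepted run (resampling on rejection omitted)."""
    proposed = draft(prefix, k)
    flags = verify(prefix, proposed)       # one target-model pass checks all k
    accepted = []
    for tok, ok in zip(proposed, flags):
        if not ok:                         # first mismatch ends the round
            break
        accepted.append(tok)
    return accepted

# Toy stand-ins: drafter guesses consecutive ints, target accepts even tokens.
draft = lambda p, k: [p[-1] + i + 1 for i in range(k)]
verify = lambda p, prop: [t % 2 == 0 for t in prop]
print(speculative_step([0], draft, verify))  # [] -> first guess (1) is odd
print(speculative_step([1], draft, verify))  # [2] -> 2 accepted, 3 rejected
```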
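To make the DP/TP/SP/EP/PP/CP alphabet soup in the last bullet concrete, here is a hypothetical sketch of how a flat GPU rank decomposes into parallel-group coordinates. The axis sizes and the `rank_to_coords` helper are invented for illustration and say nothing about how Miles actually configures its process groups.

```python
def rank_to_coords(rank: int, layout: dict) -> dict:
    """Decompose a flat GPU rank into (dp, pp, tp) group coordinates,
    with tp fastest-varying (a common but not universal convention).
    EP/SP/CP are omitted; in practice they are often overlaid on these axes."""
    tp = rank % layout["tp"]
    pp = (rank // layout["tp"]) % layout["pp"]
    dp = rank // (layout["tp"] * layout["pp"])
    return {"dp": dp, "pp": pp, "tp": tp}

layout = {"dp": 4, "pp": 4, "tp": 4}   # hypothetical 64-GPU job (8 nodes x 8)
world_size = layout["dp"] * layout["pp"] * layout["tp"]
assert world_size == 64                # the axes must tile the GPU count

print(rank_to_coords(37, layout))      # {'dp': 2, 'pp': 1, 'tp': 1}
```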