KVarN: Native vLLM KV-cache quantization backend by Huawei

Huawei says its memory trick is faster, bigger, and somehow better—and commenters are side-eyeing hard

TLDR: Huawei says its new add-on helps AI systems remember much more while staying fast and accurate, which could matter for longer chats and heavier workloads. Commenters immediately turned it into a trust-and-hype showdown: some are impressed, others are asking why something this good isn’t already in the main project.

Huawei’s KVarN showed up promising the kind of glow-up that makes tech people immediately suspicious: 3–5x more KV-cache capacity, up to about 1.3x the throughput of the usual FP16 setup, and no drop in quality. In plain English, it claims to let chatbots remember way more of a conversation without slowing down or getting dumber. That’s the dream. And the community reaction? A mix of “wait, what?”, “prove it”, and “why isn’t this in the main project already?”

The biggest drama came from the instant trust issues. One commenter basically kicked open the door with: why is this a fork instead of a proper contribution to vLLM, the popular inference engine it builds on? That’s classic open-source side-eye: if it’s so good, why is it living in its own house? Another commenter had the exact opposite meltdown, reading the claims and asking whether Huawei was really saying it beat a rival method (“TQ”) on performance and beat FP16, the current standard, on quality. That comment has pure “this can’t be right, can it?” energy.

There’s also some accidental comedy in the naming. KVarN is explained as a play on “kvarn,” Swedish for “grinder,” and yes, the announcement leans into grinding up KV-caches like coffee beans. The whole vibe is equal parts research paper and espresso-fueled flex. For now, the crowd seems split between hype and cross-examination: if the numbers hold up, this could be a big deal for long chats and AI agents. But first, the comments want receipts.

Key Points

  • KVarN is introduced as a native vLLM KV-cache quantization backend from Huawei for long-context and agentic workloads.
  • The article claims KVarN delivers 3–5x more KV-cache capacity and up to about 1.3x the throughput of FP16 while maintaining FP16-level accuracy.
  • In the cited Qwen3-32B benchmark on AIME25 with a 16K-context burst and TP=2, KVarN reportedly matched FP16 accuracy, exceeded FP16 throughput, and achieved about 4x KV-cache capacity.
  • KVarN is distributed as a vLLM fork and is enabled by setting `kv_cache_dtype="kvarn_k4v2_g128"`; it runs in float16 compute with a current fixed tile/page size of 128.
  • The implementation quantizes KV cache tiles through Hadamard rotation, iterative variance normalization, and low-bit asymmetric quantization, with the released preset using 4-bit keys and 2-bit values.
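
To make the last bullet concrete, here is a minimal pure-Python sketch of the three stages it names, applied to a toy tile (8 tokens by 16 channels rather than the 128-sized tiles mentioned above). It is an illustration under assumptions, not Huawei's implementation: the exact form of "iterative variance normalization" is not described in this summary, so the sketch assumes an alternating row/column rescaling toward unit standard deviation, and every function name (`fwht`, `variance_normalize`, `quantize_asym`, `compress_tile`, `decompress_tile`, `ideal_capacity_gain`) is hypothetical.

```python
import math
import random
from statistics import pstdev


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def variance_normalize(tile, iters=4):
    """ASSUMED scheme: alternately rescale rows and columns toward unit std.

    Returns the normalized tile plus accumulated scales, such that
    original[r][c] == normalized[r][c] * row_scale[r] * col_scale[c].
    """
    t = [row[:] for row in tile]
    row_scale = [1.0] * len(t)
    col_scale = [1.0] * len(t[0])
    for _ in range(iters):
        for r, row in enumerate(t):
            s = pstdev(row) or 1.0  # leave constant rows untouched
            row_scale[r] *= s
            t[r] = [x / s for x in row]
        for c in range(len(col_scale)):
            s = pstdev([row[c] for row in t]) or 1.0
            col_scale[c] *= s
            for row in t:
                row[c] /= s
    return t, row_scale, col_scale


def quantize_asym(values, bits):
    """Asymmetric (min/max) uniform quantization to `bits` bits per value."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def compress_tile(tile, bits):
    rotated = [fwht(row) for row in tile]  # rotate along the channel axis
    normed, row_scale, col_scale = variance_normalize(rotated)
    packed = [quantize_asym(row, bits) for row in normed]
    return packed, row_scale, col_scale


def decompress_tile(packed, row_scale, col_scale):
    out = []
    for r, (codes, scale, lo) in enumerate(packed):
        row = [(code * scale + lo) * row_scale[r] * col_scale[c]
               for c, code in enumerate(codes)]
        out.append(fwht(row))  # undo the rotation
    return out


def relative_error(tile, bits):
    """Root-sum-square reconstruction error relative to the tile's norm."""
    restored = decompress_tile(*compress_tile(tile, bits))
    num = sum((a - b) ** 2 for x, y in zip(tile, restored) for a, b in zip(x, y))
    den = sum(a * a for x in tile for a in x)
    return math.sqrt(num / den)


def ideal_capacity_gain(key_bits, value_bits, baseline_bits=16):
    """Upper bound on capacity gain, ignoring scales, offsets, and padding."""
    return 2 * baseline_bits / (key_bits + value_bits)


if __name__ == "__main__":
    rng = random.Random(0)
    tile = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
    print("4-bit error:", relative_error(tile, 4))
    print("2-bit error:", relative_error(tile, 2))
    print("ideal gain for 4-bit keys, 2-bit values:", ideal_capacity_gain(4, 2))
```

The last helper is just arithmetic: 4-bit keys plus 2-bit values average 3 bits per element against 16 for FP16, an ideal ratio of 16/3 (about 5.3x). Stored scales and offsets eat into that, which is at least consistent with the reported "about 4x" and "3–5x" figures, though the summary does not break down the overhead.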

Hottest Takes

"Why this is not a PR for vLLM ?" — v3ss0n
"Better performance than TQ" — throwa356262
"better quality than FP16?" — throwa356262