Speculative KV coding: losslessly compressing KV cache by up to ~4×

AI fans are cheering a wild memory-saving trick while skeptics ask if it’s genius or just extra work

TLDR: Researchers say they can shrink an AI model’s KV cache (its stored working memory) by up to about 4× with no information lost, by having a smaller model predict most of it and saving only the mismatch. Commenters are impressed by the elegance, but some are already joking that it amounts to saving memory by spending extra compute time.

A new AI research post claims it can shrink one of chatbots’ biggest hidden costs, the stored “working memory” (the KV cache) they use to remember long conversations, by as much as about 4× without losing any information. The basic idea: run a smaller, cheaper model alongside the main one, let it guess what the memory should look like, then save only the difference. In plain English, it’s like packing a suitcase by predicting what clothes you’ll need and only stuffing in the surprises.
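The guess-then-correct idea can be sketched in a few lines of Python. This is a toy under loud assumptions, not the post's implementation: `toy_predictor` and `make_cache` are made-up stand-ins, the cache is a plain list of byte values, and the "difference" is a literal modular subtraction. The post itself is summarized as pairing the predictor with arithmetic coding instead.

```python
import random


def toy_predictor(position: int) -> int:
    """Hypothetical stand-in for the small model: a cheap, deterministic guess."""
    return (position * 7) % 256


def make_cache(n: int, seed: int = 0) -> list[int]:
    """Hypothetical 'true' cache: the guess plus an occasional small surprise."""
    rng = random.Random(seed)
    surprises = [0, 0, 0, 0, 0, 0, 1, 255]  # mostly no surprise at all
    return [(toy_predictor(i) + rng.choice(surprises)) % 256 for i in range(n)]


def encode(cache: list[int]) -> list[int]:
    """Keep only the mismatch between the real value and the guess."""
    return [(value - toy_predictor(i)) % 256 for i, value in enumerate(cache)]


def decode(residuals: list[int]) -> list[int]:
    """Re-run the same predictor and add the stored mismatch back."""
    return [(r + toy_predictor(i)) % 256 for i, r in enumerate(residuals)]


cache = make_cache(1000)
residuals = encode(cache)
restored = decode(residuals)
zero_share = residuals.count(0) / len(residuals)

print("lossless round trip:", restored == cache)
print("share of residuals that are zero:", zero_share)
```

The round trip is exact because the decoder re-runs the same deterministic predictor. The savings would come from the residuals being mostly zeros, which an entropy coder can store in far fewer bits than the raw values.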

But the real show is in the comments, where readers instantly split into two camps: “clever and elegant” versus “cool, but is this actually worth the hassle?” One commenter delivered the thread’s unofficial translator post, turning the whole paper into a simple “tiny model guesses, reality corrects” summary that others clearly appreciated. Another went full fan-club mode, saying the article was so crisp that “an LLM could never write so crisply,” which is basically a standing ovation in AI circles.

Then came the spicy skepticism. One joker said you could get “∞x compression” by just using the original model to predict itself perfectly, a cheeky way of pointing out that the tradeoff is really compute time versus memory. Another doubter argued the method may still do too much extra work as conversations get longer. So yes, the idea is turning heads, but the comment section is asking the eternal internet question: smart breakthrough, or brilliant overengineering?

Key Points

  • The article introduces speculative KV coding, a lossless method for compressing LLM KV caches using a smaller predictor model plus arithmetic coding.
  • It reports up to about 4× KV-cache compression on top of fp8 cache storage, which the article describes as roughly 8× gross reduction overall (fp8 already halves a 16-bit cache, and 2 × 4 = 8).
  • The post argues that long-context and agentic workflows make KV cache storage and movement a major inference bottleneck.
  • It contrasts lossless compression with lossy approaches such as TurboQuant, which reduce bit-width but may introduce quality loss that must be evaluated empirically.
  • The article explains the method using entropy coding, arguing that because a KV cache is deterministic for fixed weights and prompt, compression cost reflects predictor mismatch rather than intrinsic randomness.
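The entropy-coding point can be made concrete with the standard ideal code length, the sum of -log2 q(x) over the stored symbols, where q is the probability the predictor assigned to the value that actually occurred. The sketch below uses invented 2-bit symbols and invented confidence levels purely for illustration. None of these numbers come from the article.

```python
import math


def ideal_bits(symbols, predicted_probs):
    """Sum of -log2 q(x): the size an ideal entropy coder approaches."""
    return sum(-math.log2(q[s]) for s, q in zip(symbols, predicted_probs))


def peaked(symbol, confidence, alphabet=4):
    """Hypothetical predictor output: `confidence` on one symbol, rest shared."""
    rest = (1.0 - confidence) / (alphabet - 1)
    return [confidence if k == symbol else rest for k in range(alphabet)]


symbols = [3, 1, 2, 3]  # hypothetical 2-bit cache symbols (values 0..3)

uniform = [[0.25] * 4 for _ in symbols]       # no predictor: 2 bits per symbol
sharp = [peaked(s, 0.97) for s in symbols]    # a good small model (assumed 97%)
perfect = [peaked(s, 1.0) for s in symbols]   # the original model predicting itself

raw_bits = ideal_bits(symbols, uniform)
sharp_bits = ideal_bits(symbols, sharp)
perfect_bits = ideal_bits(symbols, perfect)

print("no predictor:     ", raw_bits, "bits")
print("good predictor:   ", round(sharp_bits, 3), "bits")
print("perfect predictor:", perfect_bits, "bits")
```

The cost tracks how often the predictor is wrong, not any randomness in the cache. The perfect-predictor row is the “∞x compression” joke from the comments: zero bits stored, at the price of recomputing the whole cache with the full model.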

Hottest takes

"An LLM could _never_ write so crisply" — porridgeraisin
"You can use the original model to compress the kv cache and get ∞x compression" — 0-_-0
"why not make it first class and use everywhere, possibly recursively?" — mirekrusin