KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

900,000x AI memory shrink? Commenters cry “math magic” vs “mind-blown”

TL;DR: Researchers claim a way to shrink AI’s memory by using the model’s own predictions, boasting up to 900,000x compression. The community split fast: fans call it clever dictionary magic, skeptics say it’s dreamy math with no proof, and joke that until a demo lands, it’s hype versus reality.

A new paper claims it can squash AI “memory” (the KV cache—think short-term notes for the model) by up to 900,000x using the model’s own expectations to store just the tiny differences. The authors say it beats the old per-number compression by treating the whole text as a sequence, not as random data. Bold? Very. The thread… explosive. One early reaction summed up the vibe: “Extraordinary claims!” while others tried to translate: use the model as a dictionary and reuse shared beginnings across sessions—like auto-deduping all those identical “Once upon a time…” openers. The paper even flexes that compression should get better the longer the conversation goes.
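The “store just the tiny differences” idea can be sketched in a few lines. This is a toy stand-in, not the paper’s codec: a simple previous-value predictor and a quantized sine wave stand in for the model’s KV predictions and the cached vectors, and `zlib` stands in for a real entropy coder.

```python
import math, struct, zlib

def compressed_size(ints):
    """Bytes after entropy-coding a list of int16 values (zlib as a stand-in)."""
    return len(zlib.compress(struct.pack(f"{len(ints)}h", *ints)))

# Slowly varying int16 signal stands in for quantized KV-cache entries.
values = [round(100 * math.sin(i / 50)) for i in range(2000)]

# Toy predictor: guess the previous value. (Per the paper's framing, a real
# system would use the LM's own prediction of the next KV vector instead.)
predicted = [0] + values[:-1]
residuals = [v - p for v, p in zip(values, predicted)]

raw = compressed_size(values)       # encode the values directly
delta = compressed_size(residuals)  # encode only actual - predicted
print(raw, delta)                   # residual stream compresses far smaller
```

The point the commenters circle around: the better the predictor, the closer the residuals sit to zero, and near-zero residuals entropy-code in almost no space—which is why the claimed ratio hinges entirely on how good the model’s predictions are.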

Cue the skeptics. One commenter called that “improves with length” line a big red flag and reminded everyone the result is purely theoretical. Another compared it to “speculative decoding” for memory, while a confused (and hilariously sharp) take asked: if the model makes the vectors, aren’t they already what it predicted—so… infinite compression? The debate split cleanly: the wow-this-is-clever camp versus the show-me-real-world-numbers camp. Jokes flew—“900,000x? Great, compress my rent next”—and the crowd wrestled with the gap between math-world entropy bounds and messy reality. Verdict from the peanut gallery: fascinating idea, but until there’s a working demo, it’s Schrödinger’s compression—both revolutionary and imaginary.

Key Points

  • The paper proposes sequential KV compression for transformer caches, shifting from per-vector to sequence-based compression.
  • Layer 1 uses probabilistic prefix deduplication with a PLT-derived trie metric to reuse shared prefixes across sessions.
  • Layer 2 applies predictive delta coding, encoding residuals relative to the model’s own KV predictions, achieving a per-token entropy bound tied to token conditional entropy.
  • Using typical LM perplexity (10–20), the per-token bound works out to log₂(perplexity) ≈ 3.3–4.3 bits per token position, contrasting with TurboQuant’s ~3 bits per vector component across 64–128 components.
  • The paper claims a theoretical ~914,000× compression over TurboQuant at the entropy limit, and still ~914× even with a 1000× overhead; the two layers compose with existing quantizers.

Hottest takes

“Extraordinary claims! I don't follow the argument though.” — tomrod
“A compression strategy that uses the model itself as the dictionary.” — ddtaylor
“By definition, they are exactly equal to the model's prediction of them!” — aesthesia
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.