From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem

AI’s memory goes on a crash diet—and the comments go feral

TLDR: Chatbots just slashed their short-term memory cost from ~300 KB to ~69 KB per token, making conversations cheaper and cooler to run. Commenters memed it with Voyager 1, pushed cache quantization for even bigger savings, and argued over whether all this slimming risks models forgetting important details.

The nerds brought the receipts, but the comments brought the chaos. Today’s tale: the “KV cache,” the chatbot’s short-term memory. Back in 2019, GPT-2 paid a chunky 300 KB per token (a token is roughly a word). Now, thanks to smarter designs—sharing memory across “heads,” compressing it, and using a “sliding window” that focuses on recent messages—it’s down to about 69 KB per token. Translation: cheaper chats, cooler GPUs, and fewer data centers crying into their power bills.
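Mechanically, that short-term memory is just a growing list of per-token key/value vectors that the decode loop reuses instead of re-encoding the whole conversation every step. A toy sketch of the idea (names, shapes, and the identity "projections" are illustrative, not any real model's API):

```python
import numpy as np

HEAD_DIM = 64

def attend(q, keys, values):
    # Standard scaled dot-product attention over all cached positions.
    scores = np.array([q @ k / np.sqrt(HEAD_DIM) for k in keys])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return sum(w * v for w, v in zip(weights, values))

def decode_step(x, cache):
    # Toy projections: identity stands in for the learned Wq/Wk/Wv matrices.
    q, k, v = x, x, x
    cache["keys"].append(k)     # each token's K/V is computed once...
    cache["values"].append(v)
    return attend(q, cache["keys"], cache["values"])  # ...and reused forever after

cache = {"keys": [], "values": []}
for _ in range(5):  # each step costs O(context), not O(context^2)
    out = decode_step(np.random.randn(HEAD_DIM), cache)
```

The cache trades memory for compute: generation drops from quadratic to linear work, but those stored keys and values are exactly the bytes the article is counting.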

But the real show is the crowd. One user cracked the meme jackpot: “69KB is how much RAM Voyager 1 has,” instantly turning AI’s new memory diet into a space-age roast. The pragmatists flexed tools: you can quantize this memory—shrink it further by storing it in lower precision. As one tinkerer bragged, running a 70B-parameter model with 4-bit tricks on a Mac showed that KV quantization is what actually makes the magic practical. Meanwhile, the academic set rolled in with Cartridges: compress the document itself into a tiny, reliable summary so the model doesn’t drown in context.
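The quantization savings the commenters are pointing at come down to back-of-envelope arithmetic. A hedged sketch (the 300 KiB/token fp16 baseline is the article's GPT-2 figure; the q8/q4 mix is the commenter's suggestion; real quantizers also store small scale factors, ignored here):

```python
FP16_BITS = 16

def quantized_cache_kib(fp16_kib_per_token, key_bits, value_bits):
    # Keys and values each make up half the cache, so average the bit-widths.
    # Ignores per-block scale/zero-point overhead that real schemes add.
    avg_bits = (key_bits + value_bits) / 2
    return fp16_kib_per_token * avg_bits / FP16_BITS

full = 300.0                            # KiB/token at fp16
q8q4 = quantized_cache_kib(full, 8, 4)  # 112.5 KiB/token with q8 keys, q4 values
q8q8 = quantized_cache_kib(full, 8, 8)  # 150.0 KiB/token, i.e. "roughly half"
```

Which mix is safe is exactly the purists' worry below: fewer bits per key/value means less headroom before the model starts "forgetting."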

Drama checkpoint: Purists fear all this compression could make models “forget” details. Budget hackers say: show me the watts saved. And somewhere between, everyone’s counting tokens like calories and dunking on GPUs like they’re overpriced gym memberships.

Key Points

  • KV cache stores key–value pairs for past tokens in GPU memory to avoid recomputation, reducing generation complexity from quadratic to linear.
  • GPT-2’s multi-head attention stored independent keys/values per head, costing ~300 KiB per token; a 4,000-token session can use ~1.2 GB just for the cache.
  • Llama 3 adopts grouped-query attention, sharing keys/values across heads to cut cache to ~128 KiB per token with minimal benchmark degradation.
  • DeepSeek V3 introduces multi-head latent attention, compressing KV into a latent space for ~68.6 KiB per token, with DeepSeek V2 ablations showing comparable or better performance than standard MHA.
  • Gemma 3 combines GQA with a sliding window (5:1 local-to-global layers; local layers attend to 1,024 tokens) to manage memory over long contexts.
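The per-token figures above reduce to simple multiplication. A sketch, assuming fp16 storage (2 bytes per value) and model configs drawn from public model cards rather than from the article (GPT-2 XL, Llama 3 8B, DeepSeek V3):

```python
def mha_bytes(layers, heads, head_dim, dtype_bytes=2):
    # Multi-head attention: every head caches its own K and V per layer.
    return 2 * layers * heads * head_dim * dtype_bytes

def gqa_bytes(layers, kv_heads, head_dim, dtype_bytes=2):
    # Grouped-query attention: query heads share a smaller set of KV heads.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def mla_bytes(layers, latent_dim, rope_dim, dtype_bytes=2):
    # Multi-head latent attention: cache one compressed latent per layer
    # plus a small decoupled RoPE key, instead of full per-head K/V.
    return layers * (latent_dim + rope_dim) * dtype_bytes

KIB = 1024
print(mha_bytes(48, 25, 64) / KIB)   # GPT-2 XL (48 layers, 25 heads): 300.0 KiB/token
print(gqa_bytes(32, 8, 128) / KIB)   # Llama 3 8B (32 layers, 8 KV heads): 128.0 KiB/token
print(mla_bytes(61, 512, 64) / KIB)  # DeepSeek V3 (61 layers, 512-d latent): ~68.6 KiB/token

# The article's 4,000-token session on the GPT-2-style cache:
print(4_000 * mha_bytes(48, 25, 64) / 1e9)  # ~1.23 GB
```

The three functions make the design trade-off explicit: GQA shrinks the head count, MLA shrinks the vector itself, and Gemma 3's sliding window (not sketched here) instead caps how many tokens the local layers keep at all.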

Hottest takes

“69KB is how much RAM Voyager 1 has” — az09mugen
“you can quantize the kv cache… q8 for keys and q4 for values… cuts memory roughly in half” — LuxBennu
“compress a large document… into a smaller set of tokens… Cartridges” — coppsilgold
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.