Can I Buy Your KV Cache?

AI wants to stop rereading the same page, and the comments are already fighting about it

TLDR: The paper says AI systems could save huge amounts of money by reusing the same prebuilt “memory” of a document instead of rereading it from scratch every time. Commenters were split between calling it brilliantly simple, technically shaky, and hilariously ironic because the paper itself was accused of sounding AI-written.

A fresh AI paper just lobbed a deceptively simple idea into the internet: what if chatbots stopped paying full price to reread the same document over and over? Instead of every bot rebuilding its own memory of a page from scratch, the authors say one copy could be prepared once and everyone else could pay to reuse it. In plain English, it’s a proposal to turn AI reading into something closer to streaming than constant re-downloading, and the authors claim it could cut compute costs by a factor of 9 to 50 on repeated reads of popular documents.

But the real show is in the replies. One commenter instantly translated the whole concept into startup-speak with “Lambda computing for prompts?”, while another went full sci-fi with “A truly global singleton.” That’s the thread in miniature: half the crowd is impressed by the audacity, half is side-eyeing whether this would actually work cleanly in the messy real world.

The skeptics came armed. One commenter warned that this kind of AI memory is order-dependent: a cache built for a document in one spot of a prompt is not automatically valid once different text comes before it, so the trick may be far less plug-and-play than the paper’s bold tone suggests. Another user didn’t even get to the science before throwing shade at the writing itself, accusing the abstract of sounding LLM-generated and saying that alone made them less likely to read it. Ouch. Meanwhile, at least one poor soul just wanted a beginner-friendly explainer on what this “KV cache” thing even is. It's a reminder that while the paper dreams of a new AI economy, plenty of readers are still asking where the instruction manual is.
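For anyone in the "what even is a KV cache" camp, here is a minimal toy sketch of both points in the thread: reusing a precomputed cache gives the same answer as rereading, and the cache depends on where the document sits. Everything in it is an assumption made for illustration (one attention layer, one head, random weights, made-up token ids, learned-style position embeddings); it is not the paper's code or any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB, MAX_LEN = 8, 50, 32  # toy sizes, chosen arbitrarily

# Random toy parameters standing in for a trained model.
emb = rng.normal(size=(VOCAB, D))    # token embeddings
pos = rng.normal(size=(MAX_LEN, D))  # position embeddings
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))


def hidden(tokens, start=0):
    """Token embedding plus position embedding, starting at position `start`."""
    tokens = np.asarray(tokens)
    return emb[tokens] + pos[start:start + len(tokens)]


def build_kv(tokens, start=0):
    """The 'prefill' step: compute keys and values for every document token."""
    h = hidden(tokens, start)
    return h @ Wk, h @ Wv


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def answer_from_cache(K, V, new_token):
    """Process one new token, reusing cached keys/values for the document."""
    h = hidden([new_token], start=len(K))
    K_all = np.vstack([K, h @ Wk])
    V_all = np.vstack([V, h @ Wv])
    weights = softmax((h @ Wq) @ K_all.T / np.sqrt(D))
    return (weights @ V_all)[0]


def answer_from_scratch(tokens):
    """Reread everything: full causal attention, return the last position."""
    h = hidden(tokens)
    scores = (h @ Wq) @ (h @ Wk).T / np.sqrt(D)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # tokens cannot see the future
    return (softmax(scores) @ (h @ Wv))[-1]


doc = [3, 14, 15, 9, 26]  # hypothetical token ids for "the document"
question = 7              # hypothetical token id for the follow-up

K, V = build_kv(doc)                      # prepared once
cached = answer_from_cache(K, V, question)
scratch = answer_from_scratch(doc + [question])
reuse_matches = bool(np.allclose(cached, scratch))

# Same document, but placed two positions later in the prompt.
K_shifted, _ = build_kv(doc, start=2)
cache_is_position_free = bool(np.allclose(K, K_shifted))

print("reuse matches recompute:", reuse_matches)
print("cache unchanged when document moves:", cache_is_position_free)
```

In this toy the two answers agree to floating-point tolerance, while the keys change as soon as the document is shifted. The order-dependence here comes only from the position embeddings; in a real multi-layer model the cached values also depend on whatever tokens precede the document, which is the commenter's worry.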

Key Points

  • The paper proposes that document publishers precompute LLM KV caches so other agents can load them and skip the prefill stage.
  • It reports that KV-cache reuse is token-exact relative to recomputing from scratch, matching outputs and logits in the stated test.
  • On Qwen3-4B, the paper claims reuse is 9x to 50x cheaper in compute than prefill, with larger gains on longer inputs.
  • The paper argues that shipping KV caches to users is uneconomical because egress costs exceed the saved prefill compute, while provider-side hosting avoids that issue.
  • It estimates that serving a 3,774-token document to 80 million agents via reuse would require about 0.03 million in compute, or 49.7x less than repeated prefilling.
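
The last bullet's numbers can be unpacked with a little arithmetic. The sketch below uses only the figures quoted above; the variable names are made up for illustration, and the "0.03 million" is kept in whatever unit the paper uses, since this summary does not state one. Because 0.03 is a rounded figure, the implied baseline is approximate.

```python
DOC_TOKENS = 3_774            # document length quoted in the summary
AGENTS = 80_000_000           # number of agents reading the document
REUSE_COST_MILLIONS = 0.03    # quoted reuse cost; unit not stated in this summary
SAVINGS_FACTOR = 49.7         # quoted advantage of reuse over repeated prefill

# Tokens that would be prefilled if every agent reread the document itself.
tokens_without_reuse = DOC_TOKENS * AGENTS

# Baseline cost implied by the two quoted figures (same unspecified unit).
implied_prefill_cost_millions = REUSE_COST_MILLIONS * SAVINGS_FACTOR

print(f"tokens prefilled without reuse: {tokens_without_reuse:,}")  # 301,920,000,000
print(f"implied repeated-prefill cost: about {implied_prefill_cost_millions:.2f} million")
```

In other words, the scenario replaces roughly 302 billion tokens of repeated prefill, and the quoted figures imply a baseline of about 1.5 million in the same unit.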

Hottest takes

"Lambda computing for prompts?" — root-parent
"A truly global singleton" — sghiassy
"abstract was clearly generated by an LLM" — mistercow