Prompt caching: 10x cheaper LLM tokens, but how?

10x cheaper prompts? Devs cheer, roast bugs, and crown a new AI reader

TLDR: Prompt caching makes AI prompts much cheaper and faster by reusing part of your request, and Sam Rose’s guide breaks it down simply. Comments praised the clarity, roasted a broken page, bemoaned Apple Core ML headaches, and staged a mini ChatGPT-vs-Gemini showdown—real savings meet real-world snags.

The internet is buzzing over a rare feel-good plot twist: prompt caching makes AI prompts cheaper and faster. In plain English, the AI remembers parts of what you send, so the next request is quicker and costs less. Sam Rose’s explainer shows cached tokens running up to 10x cheaper with big speed gains, and the crowd largely said: “finally!” But the cheers quickly turned into popcorn-worthy drama. One reader ran smack into a broken page (“Something Went Wrong… D is not a function”) and the thread instantly turned into a meme roast of front-end gremlins. Another dev confessed an Apple-flavored nightmare: trying to get the KV cache working on Core ML (Apple’s on-device AI framework) and watching everything crawl after 50 tokens. Ouch.

There’s also a spicy myth-bust: changing temperature (the “creativity” knob) doesn’t affect caching. Commenters loved the clarity, with some admitting they had it wrong. And then came the model cage match: one user says ChatGPT 5.2 fumbled the screenshot task while Gemini nailed it, sparking the usual brand loyalties and gentle chaos. Overall vibe: great write-up, real savings, real speed… but also real-world roadblocks and a sizzling side of “which bot is smarter?” Plus a new meme: 10x cheaper? My credit card just exhaled. Read the post and pick a side.
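Why doesn’t temperature break the cache? Because providers cache work done on the prompt prefix, while sampling settings like temperature are applied afterward, when output tokens are picked. Here is a toy sketch of that idea (not any provider’s real implementation; the cache structure and function names are invented for illustration): the cache key is built only from the prefix tokens, so changing temperature still hits.

```python
import hashlib

# Toy prompt cache keyed ONLY on the prompt-prefix tokens. Sampling
# settings such as temperature never enter the key, because they affect
# output-token selection, not the cached prefix computation.
cache = {}

def cache_key(prefix_tokens):
    """Hash the prompt prefix; temperature is deliberately absent."""
    return hashlib.sha256(" ".join(prefix_tokens).encode()).hexdigest()

def process_prefix(prefix_tokens, temperature):
    """Return (result, hit): hit says whether the prefix work was reused."""
    key = cache_key(prefix_tokens)
    if key in cache:
        return cache[key], True                      # cache hit: prefix reused
    result = f"processed {len(prefix_tokens)} tokens"  # stand-in for real prefix work
    cache[key] = result
    return result, False

prompt = ["You", "are", "a", "helpful", "assistant"]
_, hit1 = process_prefix(prompt, temperature=0.0)  # cold cache: miss
_, hit2 = process_prefix(prompt, temperature=1.0)  # same prefix, new temperature: hit
print(hit1, hit2)  # False True
```

Same prefix, different temperature, same cache entry — which is exactly the myth-bust the thread cheered.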

Key Points

  • Cached input tokens are around 10x cheaper than regular input tokens on OpenAI and Anthropic APIs.
  • Anthropic states prompt caching can reduce latency by up to 85% for long prompts; testing confirmed substantial time-to-first-token improvements when all input tokens were cached.
  • Providers are not caching full responses; identical prompts can yield different outputs even when input tokens are reported as cached.
  • Vendor documentation explains how to use prompt caching but leaves unclear what specific data is cached, motivating a deeper technical dive.
  • Prompt caching occurs in the transformer’s attention mechanism; the article outlines LLM architecture with tokenization, embeddings, attention and feedforward stages, and output token generation.
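The last point above is where the savings come from: in attention, each new token needs only its own query plus the keys and values of everything before it, and those keys and values never change, so they can be cached. A minimal single-head sketch with toy dimensions and random weights (NumPy, names invented for illustration) shows that incremental generation with a KV cache matches a full recompute:

```python
import numpy as np

# Toy single-head attention with a KV cache. Keys/values for earlier
# tokens are computed once and reused; only the new token's query, key,
# and value are computed per step.
rng = np.random.default_rng(0)
d = 8                                       # embedding / head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Softmax-weighted sum of values for one query over cached keys/values."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def generate_step(x, kv_cache):
    """Process one new token embedding, extending and reusing the cache."""
    kv_cache["K"].append(x @ Wk)            # cached: never recomputed later
    kv_cache["V"].append(x @ Wv)
    q = x @ Wq                              # only the new token's query is needed
    return attend(q, np.array(kv_cache["K"]), np.array(kv_cache["V"]))

tokens = rng.standard_normal((5, d))        # five token embeddings
kv_cache = {"K": [], "V": []}
for t in tokens:
    out_cached = generate_step(t, kv_cache)

# Full recompute from scratch gives the same result for the last token.
out_full = attend(tokens[-1] @ Wq, tokens @ Wk, tokens @ Wv)
assert np.allclose(out_cached, out_full)
```

The cached path does one matrix-vector multiply per weight matrix per step instead of reprocessing the whole prefix, which is the mechanism behind both the latency drop and the cheaper cached-token pricing.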

Hottest takes

"I just couldn’t get the KV cache to work, which made it unusably slow after 50 tokens…" — simedw
"Something Went Wrong… D is not a function" — Youden
"ChatGPT 5.2 … failed on it … though Gemini got it right away" — coderintherye
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.