January 7, 2026
Spread the cache thick
Show HN: An LLM response cache that's aware of dynamic data
Butter promises faster AI replies; curiosity rises as skeptics sharpen knives
TLDR: Butter launched a smarter cache for AI chat replies that matches new prompts to old answers without extra model calls. Early reaction is curious but wary, with debate brewing over speed vs. fresh data and the timeless "cache invalidation" headache. It matters because it could cut latency and costs for AI apps.
Butter just told Hacker News it can make AI answers feel instant by caching them smartly—using templates (think: fill-in-the-blank prompts) and a chat "tree" to match your new question to an old answer without calling another model. The pitch: speed, lower costs, and deterministic results (no mystery AI on the hot path). Early vibes? A soft "I'll try it" from the first commenter, while the crowd quietly assembles their pitchforks and pom-poms.

The brewing drama centers on the eternal internet fight: speed vs. freshness. Fans say template-aware caching could finally make chatbots not feel like dial-up. Skeptics are already side-eyeing phrases like "regex matching" and whispering the classic meme: there are only two hard things in computing—cache invalidation and naming things. Expect hand-wringing over "Will this serve stale answers when data changes?" and "How do we safely cache without leaking user info?" Meanwhile, pragmatists love the no-extra-AI promise at request time, because deterministic > vibes.

No matter where you land, everyone agrees on one thing: if Butter actually nails this, devs will be spreading it on everything. If not—well, get ready for toast burned in prod.
Key Points
- Butter’s proxy now supports automatic template induction for its LLM response cache.
- The system uses template-aware caching, storing templates with variable placeholders instead of verbatim messages.
- Incoming queries are matched deterministically and syntactically (e.g., via regex), avoiding extra LLM calls at request time.
- Butter organizes its cache as a tree aligned with the append-only chat context, traversing nodes to find matching templates.
- Models like GPT-4o and Sonnet 4.5 are stateless, so clients must resend the prior context with each request; APIs like OpenAI Chat Completions and Responses follow this model.
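The mechanism the key points describe can be sketched roughly like this. It is a toy illustration, not Butter's implementation: the `{placeholder}` template syntax, the class names, and the regex translation are all assumptions, but it shows the core idea of a tree keyed on templated chat turns, matched syntactically with no model call on the hot path.

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    """Compile a fill-in-the-blank template, e.g.
    'What is the status of order {order_id}?', into a regex
    with one named capture group per placeholder."""
    parts = re.split(r"\{(\w+)\}", template)
    pieces = []
    for i, part in enumerate(parts):
        # Even indices are literal text; odd indices are placeholder names.
        pieces.append(re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+?)")
    return re.compile("^" + "".join(pieces) + "$")

class Node:
    """One node per position in the append-only chat context."""
    def __init__(self):
        self.edges = []       # (compiled request regex, child Node)
        self.response = None  # response template cached at this depth

class TemplateCache:
    def __init__(self):
        self.root = Node()

    def put(self, request_templates, response_template):
        """Store a response template under a path of request templates,
        one template per user turn in the conversation."""
        node = self.root
        for tpl in request_templates:
            regex = template_to_regex(tpl)
            for pat, child in node.edges:
                if pat.pattern == regex.pattern:
                    node = child
                    break
            else:
                child = Node()
                node.edges.append((regex, child))
                node = child
        node.response = response_template

    def get(self, turns):
        """Walk the tree turn by turn, matching purely syntactically.
        Returns the cached response with captured variables filled in,
        or None on a miss -- no LLM call either way."""
        node, bindings = self.root, {}
        for turn in turns:
            for pat, child in node.edges:
                m = pat.match(turn)
                if m:
                    bindings.update(m.groupdict())
                    node = child
                    break
            else:
                return None
        if node.response is None:
            return None
        return node.response.format(**bindings)

cache = TemplateCache()
cache.put(["What is the status of order {order_id}?"],
          "Order {order_id} is out for delivery.")
print(cache.get(["What is the status of order 4521?"]))
# -> Order 4521 is out for delivery.
```

Note what this sketch does not solve: substituting captured variables into a cached reply only works when the answer's shape is static, which is exactly where the "stale answers when data changes" worry bites.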