Expensively Quadratic: The LLM Agent Cost Curve

Your AI chat is bleeding cash — users cry “cache tax” as devs push back

TLDR: Long AI chats get pricey fast because "cache reads" (reloading past messages) end up eating most of the bill by around 50k tokens. Commenters split over safety limits versus reading big files in one go, float suspicions about provider margins, and call for real testing over hype, which makes this a must-know for anyone who uses AI agents often.

The post lays out a wallet-wilting truth: as your AI coding chat gets long, the bill explodes because the model keeps rereading the whole conversation. In one example, the tab hit $12.93, and by 27,500 tokens, cache reads were half the cost; by the end, they were 87%. Think of it like your chatbot rewatching the entire season before every new episode. Cool? Maybe. Cheap? Absolutely not. The analysis, pulled from exe.dev conversations and nodding to Anthropic, shows the “scary quadratic” is really “tokens times calls,” and people are feeling it.
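
To make the season-rewatch intuition concrete, here's a minimal sketch of the cost curve. The per-million-token rates and the simulate helper are illustrative assumptions (only the ~285-in / ~100-out medians come from the post), and fresh input is folded into the cache-write bucket to keep the toy simple.

    # Toy model of an agent loop that re-reads its whole cached history on every call.
    # Rates are illustrative (USD per million tokens), not any provider's actual prices.
    PRICE_CACHE_WRITE = 3.75   # writing new tokens into the prompt cache
    PRICE_CACHE_READ = 0.30    # re-reading previously cached tokens
    PRICE_OUTPUT = 15.00       # generated tokens

    def simulate(turns: int, input_per_turn: int = 285, output_per_turn: int = 100) -> dict:
        """Accumulate cost over `turns` calls; turn sizes default to the post's medians."""
        cached = 0  # tokens already sitting in the cache
        cost = {"cache_read": 0.0, "cache_write": 0.0, "output": 0.0}
        for _ in range(turns):
            # re-read everything cached from earlier turns
            cost["cache_read"] += cached * PRICE_CACHE_READ / 1e6
            # the new message plus this turn's reply get appended to the cache
            cost["cache_write"] += (input_per_turn + output_per_turn) * PRICE_CACHE_WRITE / 1e6
            cost["output"] += output_per_turn * PRICE_OUTPUT / 1e6
            cached += input_per_turn + output_per_turn
        return cost

    for turns in (10, 50, 200):
        buckets = simulate(turns)
        total = sum(buckets.values())
        print(f"{turns:>3} turns: ${total:.2f} total, "
              f"{buckets['cache_read'] / total:.0%} of it from cache reads")

Run it and the cache-read share climbs from a rounding error to the bulk of the bill as the history grows, which is the whole problem in three buckets.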

Cue the comments cage match. One camp says "just read the big file once and move on"; stuxf slams the brakes, arguing limits exist for a reason: when the agent messes up, you need safety rails, not a blank check. TZubiri lobs a Molotov of a hot take: cached reads might be a gold mine for providers, hinting at fat margins. Meanwhile, alexhans begs the crowd to stop worshipping "magic" tools and start testing with evals and real observability. There's also sport on the sidelines: Areena_28 calls it a "Classic trap!" and reignites the Rust-vs-JS banter. The vibe? Equal parts math panic, pricing conspiracy, and DevRel therapy, plus memes about the "triangle of doom" and how "pretty squarey" the curve is. Drama, charts, and your credit card sweating: chef's kiss.

Key Points

  • Coding agent loops incur charges for input tokens, cache writes, output tokens, and cache reads, with prior outputs written to cache for subsequent turns.
  • In a worked example (~$12.93 total), cache reads became 50% of cost at ~27,500 tokens and 87% by the end, dominating expenses as context grows.
  • Analysis of 250 “Shelley” conversations via exe.dev’s LLM gateway shows wide variability; median input ~285 tokens and median output ~100 tokens.
  • Cost trajectories vary by conversation patterns (e.g., code-heavy outputs, tool call outputs, cache expiration requiring rewrites).
  • For long runs (≥100k tokens), cache read costs scale with tokens times the number of LLM calls, making call count a key cost driver.
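
For the back-of-the-envelope version of that last point: if a run has piled up roughly T tokens of context and makes N LLM calls, each call re-reads on the order of T cached tokens, so cache-read spend lands near T × N × the read rate. A hypothetical helper (the $0.30/MTok read rate is an assumed figure, not one from the post):

    def estimate_cache_read_cost(total_tokens: int, num_calls: int,
                                 read_rate_per_mtok: float = 0.30) -> float:
        """Rough ceiling: assume every call re-reads ~total_tokens of cached context."""
        return total_tokens * num_calls * read_rate_per_mtok / 1e6

    # e.g. a 100k-token run that took 150 calls, at the assumed $0.30/MTok read rate
    print(f"${estimate_cache_read_cost(100_000, 150):.2f}")  # $4.50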

Hottest takes

"the primary reason ... is for when the agent messes up" — stuxf
"cached tokens probably have a way higher margin" — TZubiri
"start thinking in terms of evals [1] and observability" — alexhans
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.