Inference cost at scale with napkin math

AI price math looked simple — until the comments asked who’s paying the power bill

TLDR: The post says you can estimate the cost of serving AI users with simple back-of-the-envelope math, even for pricey hardware. Commenters weren’t satisfied: they argued the real story is all the extra costs — electricity, cooling, upkeep, and even questionable arithmetic — that can wreck the neat estimate.

A post about “napkin math” for AI costs makes one big promise: even if the chips are absurdly expensive and the models keep changing, you can still estimate what each user really costs. The author walks through the back-of-the-envelope version of running a modern chatbot-style system on a high-end graphics processor, arguing that the basic dollars-per-user picture is still surprisingly easy to sketch out.

But the real show started in the comments, where readers immediately turned this neat little math lesson into a full-on “you forgot the electric bill” showdown. One camp said the article’s logic mostly works, especially if you’re just trying to get a rough pricing model. Another camp was absolutely not letting that slide, demanding the missing real-world costs: power, cooling, maintenance, rent, and all the other boring-but-deadly bills that can turn a tidy spreadsheet into a horror movie. One commenter even came armed with wattage numbers for Nvidia’s latest hardware, basically saying: nice napkin, now show us the utility statement.

Then came the classic internet twist: math policing. A baffled reader called out an equation with the energy of a teacher catching a typo on the board, while another said a crucial assumption about model size was buried way too deep in the post. The vibe was half finance debate, half group chat roast — with readers split between “useful shortcut” and “this is missing the parts that actually bankrupt you.”

Key Points

  • The article presents a simple framework for estimating AI inference cost per user using GPU specs and workload assumptions.
  • It identifies memory bandwidth and peak throughput as the two key GPU metrics needed for back-of-the-envelope cost calculations.
  • Its worked assumptions include a 200,000-token context length and a 32B active-parameter model sized for a single GPU.
  • The article puts the cost of a basic (untiled) matrix multiplication, an N×d matrix times a d×M matrix, at 2NMd floating-point operations and 2NMd memory accesses.
  • It notes that tiling can reduce matrix-multiplication memory access to roughly d(N+M) and introduces attention-based language model inference as the next step in the analysis.
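
For readers who want to poke at the numbers, here is a minimal Python sketch of the matrix-multiplication counts from the key points above. The formulas (2NMd operations, 2NMd naive memory accesses, roughly d(N+M) with tiling) and the 32B active-parameter figure come from the summary. Everything else is an assumption made for illustration, not something the article states: the function names, the 1-byte-per-parameter weight size, the rule of thumb of 2 operations per parameter per token, the GPU figures (hypothetical placeholders, not any real chip's specs), and the idea of taking the slower of the compute limit and the memory limit as the step time. It also ignores attention and the 200,000-token context entirely.

```python
# Napkin-math sketch. Hardware numbers are hypothetical placeholders.

def naive_matmul_cost(n, m, d):
    """Cost of (N x d) @ (d x M) with no data reuse.

    Each of the N*M outputs needs d multiply-adds (2 operations each),
    and each multiply-add reads two operands from memory.
    """
    flops = 2 * n * m * d
    accesses = 2 * n * m * d
    return flops, accesses


def tiled_matmul_accesses(n, m, d):
    """Idealized tiling: each input element is read about once.

    The inputs hold N*d and d*M elements, so roughly d * (N + M) reads.
    """
    return d * (n + m)


def step_time_seconds(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Assumed model: the step takes as long as the slower of the two limits."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    return max(compute_time, memory_time)


if __name__ == "__main__":
    # Matrix sizes chosen arbitrarily for illustration.
    n, m, d = 4096, 4096, 4096
    flops, naive = naive_matmul_cost(n, m, d)
    tiled = tiled_matmul_accesses(n, m, d)
    print(f"FLOPs:               {flops:.3e}")
    print(f"Naive memory access: {naive:.3e}")
    print(f"Tiled memory access: {tiled:.3e}")
    print(f"Reduction factor:    {naive / tiled:.0f}x")

    # One generated token on a 32B active-parameter model (figure from the summary).
    active_params = 32e9
    bytes_per_param = 1    # assumption: 8-bit weights
    peak_flops = 1e15      # hypothetical GPU, operations per second
    mem_bandwidth = 3e12   # hypothetical GPU, bytes per second
    t = step_time_seconds(
        flops=2 * active_params,                    # assumed 2 ops per parameter
        bytes_moved=active_params * bytes_per_param,
        peak_flops=peak_flops,
        mem_bandwidth=mem_bandwidth,
    )
    print(f"Estimated time per token (weights only): {t * 1e3:.2f} ms")
```

Swapping in real bandwidth and throughput figures from a datasheet, plus an hourly price that covers the power, cooling, and rent the commenters brought up, would turn the per-token time into a rough dollars-per-user number.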

Hottest takes

"plus the datacenter/upkeep bill" — smalltorch
"Power, cooling, maintenance, rent" — BadBadJellyBean
"what kind of math is this?" — stevenaenns