Inference cost at scale with napkin math

AI price math looked simple — until the comments asked who’s paying the power bill

TLDR: The post says you can estimate the cost of serving AI users with simple back-of-the-envelope math, even for pricey hardware. Commenters weren’t satisfied: they argued the real story is what the estimate leaves out (electricity, cooling, upkeep) plus some questionable arithmetic, any of which can wreck the neat number.

A post about “napkin math” for AI costs tried to make one big promise: even if the chips are absurdly expensive and the models keep changing, you can still estimate what each user really costs. The author walks through the back-of-the-envelope version of running a modern chatbot-style system on a high-end graphics processor, arguing that the basic dollars-per-user picture is still surprisingly easy to sketch out.
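The dollars-per-user framing comes down to one division: what the GPU costs per hour, spread across however many users its token throughput can keep busy in that hour. Here is a minimal sketch, assuming full utilization. The function name and every input are hypothetical placeholders, not numbers from the post: a $4/hour GPU, 2,000 tokens per second of aggregate throughput, and 36,000 tokens per user per hour.

```python
def cost_per_user_hour(gpu_dollars_per_hour: float,
                       gpu_tokens_per_second: float,
                       tokens_per_user_per_hour: float) -> float:
    """Hypothetical napkin estimate: dollars per user per hour at full utilization."""
    tokens_per_hour = gpu_tokens_per_second * 3600.0           # GPU's hourly token budget
    users_served = tokens_per_hour / tokens_per_user_per_hour  # users one GPU can keep up with
    return gpu_dollars_per_hour / users_served


if __name__ == "__main__":
    # Placeholder inputs only: $4/h GPU, 2,000 tokens/s, 36,000 tokens per user per hour.
    print(f"${cost_per_user_hour(4.0, 2000.0, 36000.0):.4f} per user-hour")
```

With these inputs, one GPU keeps 200 users busy, so each user costs two cents per hour. The hard part, as the thread points out, is deciding what belongs in the hourly price.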

But the real show started in the comments, where readers immediately turned this neat little math lesson into a full-on “you forgot the electric bill” showdown. One camp said the article’s logic mostly works, especially if you’re just trying to get a rough pricing model. Another camp was absolutely not letting that slide, demanding the missing real-world costs: power, cooling, maintenance, rent, and all the other boring-but-deadly bills that can turn a tidy spreadsheet into a horror movie. One commenter even came armed with wattage numbers for Nvidia’s latest hardware, basically saying: nice napkin, now show us the utility statement.
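The commenters' objection amounts to adding more terms to the hourly cost. This sketch lists them separately:

* Amortized hardware.
* Electricity scaled by PUE (power usage effectiveness, the ratio of total facility power to IT power, which folds in cooling).
* A lumped overhead for maintenance and rent.

The class name and all the numbers are hypothetical placeholders. They are not Nvidia specs, and they are not the wattage figures the commenter posted.

```python
from dataclasses import dataclass


@dataclass
class GpuHourlyCost:
    """Hypothetical all-in cost model for one GPU; every field is an assumption."""
    purchase_price: float     # dollars paid for the GPU (and its share of the server)
    lifetime_hours: float     # hours over which the purchase is amortized
    power_watts: float        # average power draw while serving
    pue: float                # facility power / IT power, so cooling is included
    dollars_per_kwh: float    # electricity price
    overhead_per_hour: float  # maintenance, rent, and staff lumped into one number

    def breakdown(self) -> dict[str, float]:
        """Dollars per hour, split into the pieces the commenters listed."""
        return {
            "hardware": self.purchase_price / self.lifetime_hours,
            "power_and_cooling": self.power_watts / 1000.0 * self.pue * self.dollars_per_kwh,
            "upkeep_and_rent": self.overhead_per_hour,
        }

    def total(self) -> float:
        return sum(self.breakdown().values())


if __name__ == "__main__":
    # Placeholder numbers only, not vendor specs or figures from the thread.
    gpu = GpuHourlyCost(
        purchase_price=30_000,
        lifetime_hours=3 * 365 * 24,  # three years of continuous use
        power_watts=1_000,
        pue=1.3,
        dollars_per_kwh=0.10,
        overhead_per_hour=0.50,
    )
    for item, dollars in gpu.breakdown().items():
        print(f"{item:>18}: ${dollars:.3f}/h")
    print(f"{'total':>18}: ${gpu.total():.3f}/h")
```

The total can stand in for a GPU's hourly price in a per-user calculation. Splitting it out means each line can be checked against real wattage figures and utility rates.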

Then came the classic internet twist: math policing. A baffled reader called out an equation with the energy of a teacher catching a typo on the board, while another said a crucial assumption about model size was buried way too deep in the post. The vibe was half finance debate, half group chat roast — with readers split between “useful shortcut” and “this is missing the parts that actually bankrupt you.”

Key Points

•The article presents a simple framework for estimating AI inference cost per user using GPU specs and workload assumptions.
•It identifies memory bandwidth and peak compute throughput as the two GPU metrics needed for on-paper cost calculations.
•Its worked assumptions include a 200,000-token context length and a 32B active-parameter model sized for a single GPU.
•The article puts the basic (untiled) cost of multiplying an N×d matrix by a d×M matrix at 2NMd memory accesses and 2NMd floating-point operations.
•It notes that tiling can reduce matrix-multiplication memory access to roughly d(N+M) and introduces attention-based language model inference as the next step in the analysis.
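The matrix-multiplication counts above can be turned into a time estimate using the two GPU metrics the article names. Runtime is at least the larger of two quantities: memory traffic divided by bandwidth, and FLOPs divided by peak throughput (a roofline-style bound). This is a minimal sketch with several assumptions:

* Values are 16-bit, so each element is 2 bytes.
* The N·M output writes are ignored, as the "roughly d(N+M)" figure does.
* The hardware numbers (4 TB/s, 1 PFLOP/s) and the 8192×8192 layer shape are hypothetical, not any real GPU's or model's specs.
* The function names are illustrative.

```python
def matmul_costs(n: int, m: int, d: int, tiled: bool) -> tuple[int, int]:
    """Return (element memory accesses, FLOPs) for an (n x d) @ (d x m) product."""
    flops = 2 * n * m * d           # one multiply and one add per inner-product term
    if tiled:
        accesses = d * (n + m)      # each input element loaded roughly once
    else:
        accesses = 2 * n * m * d    # each output re-reads a full row and a full column
    return accesses, flops


def roofline_seconds(accesses: int, flops: int, bytes_per_elem: float,
                     bandwidth_bytes_per_s: float, peak_flops_per_s: float) -> float:
    """Lower-bound runtime: whichever of memory traffic or compute takes longer."""
    memory_time = accesses * bytes_per_elem / bandwidth_bytes_per_s
    compute_time = flops / peak_flops_per_s
    return max(memory_time, compute_time)


if __name__ == "__main__":
    BW, PEAK = 4e12, 1e15  # hypothetical: 4 TB/s memory, 1 PFLOP/s at 16-bit
    for batch in (1, 512):
        # batch rows of activations times a hypothetical 8192 x 8192 weight matrix
        acc, fl = matmul_costs(batch, 8192, 8192, tiled=True)
        mem_t, cmp_t = acc * 2 / BW, fl / PEAK
        bound = "memory" if mem_t > cmp_t else "compute"
        t = roofline_seconds(acc, fl, 2, BW, PEAK)
        print(f"batch={batch}: {t:.2e} s ({bound}-bound)")
```

Under these placeholder numbers, a single token's row against the weight matrix is memory-bound, while a batch of 512 rows is compute-bound. That is why serving many users at once spreads out the cost of streaming the weights.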

Hottest takes

"plus the datacenter/upkeep bill" — smalltorch

"Power, cooling, maintenance, rent" — BadBadJellyBean

"what kind of math is this?" — stevenaenns

June 20, 2026

Napkin math, monster bill
