March 22, 2026

Spill the tea: AI vs hot coffee

LLMs Predict My Coffee

Internet fights over hot coffee math: simple trick or AI magic?

TLDR: A blogger tested how well AI models predict how fast hot coffee cools; they were decent but not spot-on, with one pricey model coming closest. Commenters split between ‘this is basic physics and training data,’ ‘just model the big factors,’ and snark that a new coffee-cooling benchmark is born.

A blogger asked a bunch of AI chatbots to predict how a mug of boiling water cools, then actually measured it with a thermometer. The verdict: the bots were fine, not flawless, with the fancy one (Claude) doing best but costing a comically specific $0.61. The real plot twist? The water cooled way faster at the start than humans (and AIs) guessed, then slowed way down: exactly the kind of “duh, physics” result that still manages to roast everyone’s intuition.

Cue the comments, where the Internet Barista Council clocked in. One camp waved the “it’s just Newton’s law of cooling, calm down” flag, with users like amha wanting to see the simple model. Pragmatists like andy99 said real engineers pick the two things that matter and ignore the rest. Cynics like kaelandt shrugged that this problem is all over the web—so of course AIs look smart. Then came the snark: leecommamichael declared, “another benchmark is born,” because if it exists, tech will leaderboard it. Meanwhile, IncreasePosts dropped the thermos truth bomb: the mug itself heats up fast, making the early plunge dramatic—put it in a vacuum flask and the curve changes. The jokes poured in: “LLMs can’t even babysit a latte,” “Barista vs Physicist,” and endless memes about a $0.61 Pentagon coffee budget.
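For anyone who wants to see the "simple model" camp's argument in numbers, here is a rough sketch, not the blogger's or any commenter's actual code. It puts Newton's law of cooling (a single exponential) next to the mug-heating effect IncreasePosts described. The 8 oz of water, 0.57 kg mug, and 20°C room come from the setup summarized in the Key Points; the 30-minute time constant, the ceramic specific heat of 900 J/(kg·K), and the function names are all assumptions made for illustration.

```python
import math

def newton_cooling(t_min, t0_c=100.0, ambient_c=20.0, tau_min=30.0):
    """Newton's law of cooling: one exponential decaying toward ambient.

    tau_min is an assumed time constant, not a measured one.
    """
    return ambient_c + (t0_c - ambient_c) * math.exp(-t_min / tau_min)

def mixed_temperature(m_water_kg, m_mug_kg, t_water_c=100.0, t_mug_c=20.0,
                      c_water=4186.0, c_mug=900.0):
    """Temperature water and mug would share if they fully equilibrated
    with zero heat lost to the air (an idealized energy balance).

    c_mug is an assumed specific heat for ceramic, in J/(kg*K).
    """
    heat_cap_water = m_water_kg * c_water
    heat_cap_mug = m_mug_kg * c_mug
    return ((heat_cap_water * t_water_c + heat_cap_mug * t_mug_c)
            / (heat_cap_water + heat_cap_mug))

water_kg = 0.2366  # roughly 8 US fl oz of water

print("Newton's law after 5 min:", round(newton_cooling(5.0), 1), "C")
print("Water + mug, fully mixed:", round(mixed_temperature(water_kg, 0.57), 1), "C")
```

Under these assumptions the cold mug alone could pull the water into the low 70s °C, far more than the single exponential loses in five minutes. A real mug will not equilibrate instantly or evenly, so treat the mixed temperature as an idealized bound on the effect rather than a prediction.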

Key Points

  • The author prompted multiple LLMs to provide an equation for the temperature of 8 oz boiling water cooling in a 0.57 kg ceramic mug at 20°C ambient, focusing on the first five minutes.
  • LLMs returned bi-exponential forms approaching ambient temperature, with specific coefficients and time constants listed for each model.
  • A kitchen experiment was conducted under the specified conditions, with frequent temperature measurements using a digital thermometer.
  • Predicted curves showed fast initial cooling and slower long-term cooling; experimental data showed an even faster early drop and slower late cooling than predicted.
  • Overall fits were described as OK but not great; Claude 4.6 Opus performed best among the listed models, though at the highest token cost.
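The bi-exponential shape mentioned above can be sketched as follows. This is a minimal illustration with made-up coefficients, not the equation any of the listed models returned: the amplitudes (`a_fast`, `a_slow`), the time constants (`tau_fast`, `tau_slow`), and the helper names are hypothetical, chosen only so the curve starts at 100°C and decays toward the 20°C ambient. The `rmse` helper shows one common way a predicted curve could be scored against thermometer readings; no measured data is included here.

```python
import math

def bi_exponential(t_min, ambient_c=20.0,
                   a_fast=15.0, tau_fast=1.0,
                   a_slow=65.0, tau_slow=40.0):
    """Two-exponential cooling curve approaching ambient.

    All coefficients are hypothetical. The fast term stands in for the
    mug soaking up heat, the slow term for loss to the room.
    """
    return (ambient_c
            + a_fast * math.exp(-t_min / tau_fast)
            + a_slow * math.exp(-t_min / tau_slow))

def rmse(predict, samples):
    """Root-mean-square error of predict(t) against (minutes, measured_c) pairs."""
    squared = [(predict(t) - measured) ** 2 for t, measured in samples]
    return math.sqrt(sum(squared) / len(squared))

# Print the first five minutes, the window the prompt focused on.
for minute in range(6):
    print(minute, "min:", round(bi_exponential(minute), 1), "C")
```

With these placeholder numbers the curve drops about 11°C in the first minute and under 2°C between minutes four and five, which is the "fast, then slow" shape the predictions shared. To score a model, pass its curve and your own list of `(minutes, measured_c)` readings to `rmse`.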

Hottest takes

one or two key things will dominate and the rest won’t matter. — andy99
there is a lot of training data online. — kaelandt
and so another benchmark is born. — leecommamichael