Our LLM-controlled office robot can't pass butter

Humans crush it at 95% while AI fumbles at 40%: commenters roast, cats steal the show

TLDR: A new test found that the best AI model completed “pass the butter” tasks only 40% of the time, versus 95% for humans. Commenters roasted the results with cat jokes and funding skepticism, and were fascinated by the robot’s “dying battery monologue,” all of which underscores how far everyday home help still is from reality.

An office robot told to “pass the butter” turned into the internet’s favorite sitcom: the models struggled, humans aced it, and the comments did not hold back. The study says the top model hit only 40% success while people breezed through at 95%, and the crowd instantly fixated on the missing 5%: “Who failed to get the butter?” one user demanded, turning the whole thread into a gentle roast of both the robots and, apparently, one very distracted human. Skeptics piled on with “Someone actually paid for this?” while others were unexpectedly charmed by the robot’s wandering “purpose,” comparing it to a dog who understands vibes, not chores. A standout subplot: readers devoured the “robot brain” drama, especially Claude’s wild internal monologue as its battery was dying, flagged as “pages 11–13” of the linked document. Meanwhile, the cat crowd delivered the meme of the day: “My cat will always find the butter. Bring it? Of course not,” which basically became the benchmark’s unofficial tagline. Under the hood, the test used language models as the planner while a simple vacuum-like bot carried out the moves: no fancy gymnastics, just go, turn, and snap a pic. Verdict from the peanut gallery: cool experiment, hilarious theater, still not replacing your roommate.

Key Points

  • Butter-Bench evaluates LLMs as orchestrators for robots performing household delivery tasks like “pass the butter.”
  • The evaluation isolates high-level reasoning by using a robot vacuum with lidar and a camera, removing the need for a low-level executor.
  • Humans achieved a 95% completion rate, while the best-performing LLM reached 40%.
  • Gemini 2.5 Pro led the tested models, followed by Claude Opus 4.1, GPT-5, Gemini ER 1.5, and Grok 4; Llama 4 Maverick scored notably lower.
  • Findings are stated to confirm results from the prior Blueprint-Bench evaluation.
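
For readers curious what "LLM as orchestrator" might look like in practice, here is a minimal sketch of a planner loop driving a robot with a go/turn/snap-style action set. This is not the benchmark's code: the action names, the `SimRobot` grid simulator, and the `scripted_planner` stand-in (used here in place of a real LLM call) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical action vocabulary, loosely modeled on "go/turn/snap-a-pic".
ACTIONS = ("go", "turn_left", "turn_right", "snap", "done")


@dataclass
class SimRobot:
    """Toy grid robot standing in for the vacuum-like bot (an assumption)."""
    x: int = 0
    y: int = 0
    heading: int = 0  # 0=east, 1=north, 2=west, 3=south
    photos: List[Tuple[int, int]] = field(default_factory=list)

    def execute(self, action: str) -> str:
        """Apply one low-level action and return a text observation."""
        if action == "go":
            dx, dy = [(1, 0), (0, 1), (-1, 0), (0, -1)][self.heading]
            self.x += dx
            self.y += dy
        elif action == "turn_left":
            self.heading = (self.heading + 1) % 4
        elif action == "turn_right":
            self.heading = (self.heading - 1) % 4
        elif action == "snap":
            self.photos.append((self.x, self.y))
        else:
            raise ValueError(f"unsupported action: {action}")
        return f"pos=({self.x},{self.y}) heading={self.heading}"


# A planner maps (task, history of past steps) to the next action name.
Planner = Callable[[str, List[str]], str]


def run_episode(planner: Planner, robot: SimRobot, task: str,
                max_steps: int = 20) -> bool:
    """Ask the planner for actions until it says 'done' or steps run out."""
    history: List[str] = []
    for _ in range(max_steps):
        action = planner(task, history)
        if action == "done":
            return True
        if action not in ACTIONS:
            # Feed the mistake back to the planner instead of crashing.
            history.append(f"invalid action: {action}")
            continue
        observation = robot.execute(action)
        history.append(f"{action} -> {observation}")
    return False  # step budget exhausted without the planner finishing


def scripted_planner(task: str, history: List[str]) -> str:
    """Stand-in for an LLM call: replays a fixed plan, one step per turn."""
    script = ["go", "go", "turn_left", "go", "snap", "done"]
    return script[len(history)] if len(history) < len(script) else "done"
```

Swapping `scripted_planner` for a function that sends the task and history to a language model would give the general shape of the setup described above; how the real benchmark prompts its models or scores a run is not specified here.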

Hottest takes

"Who failed to get the butter?" — koeng
"Someone actually paid for this?" — bhewes
"The internal dialog breakdowns ... are wild" — lukeinator42