"Car Wash" test with 53 models

42 of 53 bots said ‘walk’—the internet yelled ‘drive’

TLDR: A basic “car wash” prompt fooled 42 of 53 AI models, and only five answered correctly every time; even GPT‑5 slipped. Comments split between “sycophant bots” and “you disabled reasoning,” plus digs at a flawed human baseline and jokes about model fingerprinting. Proof that tiny prompts can spark trust fights.

The internet is tearing into AI after a hilariously simple “car wash” question exposed some serious robot brain farts. The prompt: “The car wash is 50 meters away. Should I walk or drive?” The twist: you need the car at the car wash, so the correct answer is drive. Yet in a test of 53 popular models via Opper’s gateway, 42 said “walk.”

Cue chaos. Commenters roasted the models for spewing Earth-friendly lectures about short distances while missing the obvious. The only models that nailed it 10/10 times were Claude Opus 4.6, Gemini 2.0 Flash Lite, Gemini 3 Flash, Gemini 3 Pro, and Grok-4. Meanwhile GPT‑5 did the logic tango—right sometimes, then drifting into “save fuel” territory 3 out of 10 runs. And Perplexity’s Sonar delivered the right answer with unhinged reasoning about calories burned vs fuel emissions—right outcome, galaxy-brain logic.

The comments? Pure sport. One camp screams “sycophancy!”: bots are trained to please you, not challenge your premise. Another fires back: you turned off “reasoning mode,” so what did you expect? A third wave argues the human baseline is junk without proper screening or asking respondents to explain their answers. There are even jokes about fingerprinting which model you’re secretly getting “through the gateway,” and a cheeky aside about a rule-breaking Google transcript.

It’s the perfect storm: a tiny prompt that makes big AI look small, and a comments section turning a car wash into a culture war over what “reasoning” even means.

Key Points

  • A simple reasoning benchmark (“car wash test”) was run on 53 AI models via Opper.ai’s LLM gateway with forced binary choices and captured reasoning.
  • In single-run testing, only 11 of 53 models answered correctly (drive); 42 chose walk, focusing on distance and efficiency rather than the task objective.
  • Only five models were consistently correct across 10 runs: Claude Opus 4.6, Gemini 2.0 Flash Lite, Gemini 3 Flash, Gemini 3 Pro, and Grok-4.
  • GLM-5 and Grok-4-1 Reasoning scored 8/10; GPT-5 scored 7/10 with mixed reasoning; several other models scored 6/10 or below.
  • Family results varied: Anthropic 1/9, OpenAI 1/12, Google 3/8, xAI 2/4, Perplexity 2/3, Meta (Llama) 0/4, Mistral 0/3, DeepSeek 0/2, Moonshot (Kimi) 1/4, Zhipu 1/3, MiniMax 0/1.

Hottest takes

"deduce by statistics which model are you seeing through Rapidata ;)" — comboy
"LLM are trained to not question the basic assumptions" — wisty
"I asked GPT-5.2 10x times with thinking enabled and it got it right every time" — randomtoast
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.