Task-free intelligence testing of LLMs

We tapped the bots and chaos erupted: genius test or goofy game?

TLDR: Researchers “tapped” AI models with number patterns and watched how they reacted—some joked, some guessed, and OpenAI’s model stayed stern. Commenters split between “this is just pattern matching” and “this probes curiosity,” with a side chorus pushing game-style evaluations as the next big thing.

An experiment tried “task-free” AI testing: just send models waves of the word “tap” in patterns (Fibonacci, primes, pi) and watch what happens. The community lit up. Skeptics like vitaelabitur scoffed: “These are just pattern-matchers,” arguing the test measures reflexes, not mind. Fans like sdenton4 cheered the vibe, calling it a peek at how models behave in a weird, context-free playground.

Meanwhile, the models brought the drama: Claude and Gemini went full playful, riffing water jokes and knock-knocks; DeepSeek speculated for pages, switched to Chinese, then dropped an “SOS”; Kimi tried to find the Fibonacci pattern but basically had a mini meltdown; Llama 3 stayed stiff; and OpenAI’s GPT 5.2 refused to play and remained all business.

The comments kept it spicy: nestorD plugged an alternative test built around choosing better answers, while CuriouslyC declared game-play the next frontier, with benchmarks as competitive arenas. The memes took over when forty simply posted “tap tap tap,” because of course. The big fight? Curiosity vs. pattern matching, and whether “Easter egg” humor signals any real spark. Also lurking: the claim that commercial system prompts push models to “find purpose,” making the seriousness-vs-playfulness split its own subplot.

Key Points

  • The author evaluates LLMs without explicit tasks by sending patterned sequences of the word “tap” and observing spontaneous responses.
  • Six numeric tap patterns were used over ten turns each: Fibonacci, Count, Even, Squares, Pi digits, and Primes.
  • Models exhibited three main behaviors: playful interaction, persistent assistant-style seriousness, and guessing the underlying pattern (with mixed accuracy).
  • Examples: Claude and Gemini played along and later identified patterns; DeepSeek speculated, switched languages, and sometimes produced minimal outputs after extensive reasoning; Llama 3 stayed mechanical; Kimi struggled with counting but hunted for patterns.
  • OpenAI’s GPT 5.2 (and an unnamed open-source model) did not engage in playful guessing, remaining serious and mechanical; no leaderboard was claimed, and ordering by correct guesses was observational.
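The protocol above is easy to reconstruct in code. Here is a minimal sketch, assuming the six sequences listed in the Key Points and ten turns per pattern; the function names (`pattern_values`, `tap_messages`) and the newline-separated message format are our own illustration, not the author’s actual harness.

```python
from itertools import count, islice


def fibonacci():
    """Yield 1, 1, 2, 3, 5, 8, ..."""
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b


def primes():
    """Yield 2, 3, 5, 7, 11, ... by trial division against found primes."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n


def pattern_values(name, turns=10):
    """First `turns` values of one of the six tap patterns."""
    gens = {
        "Fibonacci": fibonacci,
        "Count": lambda: count(1),
        "Even": lambda: count(2, 2),
        "Squares": lambda: (n * n for n in count(1)),
        "Pi digits": lambda: iter([3, 1, 4, 1, 5, 9, 2, 6, 5, 3]),
        "Primes": primes,
    }
    return list(islice(gens[name](), turns))


def tap_messages(name, turns=10):
    """One message per turn: N newline-separated 'tap's, N from the pattern."""
    return ["\n".join(["tap"] * n) for n in pattern_values(name, turns)]
```

A model under test would receive `tap_messages("Fibonacci")` one message per turn, with its free-form replies logged between turns; whatever it says back (jokes, guesses, silence) is the observation.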

Hottest takes

"Aren't LLMs just super-powerful pattern matchers?" — vitaelabitur
"I like the high level idea! (how do we test intelligence in a non functional way?)" — sdenton4
"tap\ntap\ntap" — forty
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.