EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

Bots ace Python, bomb in bizarre languages: are they smart or just cramming?

TLDR: New tests show AI models crush Python but score just 3.8% on ultra-weird languages, with everything above the easy tier going completely unsolved. Comments split between "it's memorization, not real thinking," "these tests are unfair," and "let the bots read docs," with bonus jokes about Malbolge dreams dying.

AI code bots just got humbled. In a new test called EsoLang‑Bench, fancy models that score about 90% in Python face‑plant at 3.8% on five "esoteric" joke‑like languages (think Brainfuck, Befunge‑98, Whitespace, Unlambda, and Shakespeare). The kicker: every model scored 0% on anything above "Easy," Whitespace was a complete shutout, and even "self‑reflection" (the bot talking to itself) didn't help. Cue the comment chaos.
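
For readers who haven't met these languages: Brainfuck, the most famous of the bunch, has just eight single-character commands that move a pointer over a byte tape. Here's a minimal sketch of an interpreter (the `run_bf` helper name is ours, not from the benchmark) plus a trivially generated program, to show how alien the surface syntax is compared to Python:

```python
def run_bf(code, tape_len=30000):
    """Minimal Brainfuck interpreter: 8 commands, a byte tape, a pointer."""
    tape = [0] * tape_len
    out = []
    # Precompute matching bracket positions for [ and ] loops.
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    ptr = ip = 0
    while ip < len(code):
        c = code[ip]
        if c == '>':   ptr += 1                      # move pointer right
        elif c == '<': ptr -= 1                      # move pointer left
        elif c == '+': tape[ptr] = (tape[ptr] + 1) % 256  # increment cell
        elif c == '-': tape[ptr] = (tape[ptr] - 1) % 256  # decrement cell
        elif c == '.': out.append(chr(tape[ptr]))    # output cell as char
        elif c == '[' and tape[ptr] == 0: ip = jumps[ip]  # skip loop
        elif c == ']' and tape[ptr] != 0: ip = jumps[ip]  # repeat loop
        ip += 1
    return ''.join(out)

# Naive program generation: one '+' per character code, then print and move on.
program = ''.join('+' * ord(ch) + '.>' for ch in "Hi")
print(run_bf(program))  # prints Hi
```

Even this "easy mode" program is a wall of plus signs; real Brainfuck solutions use nested loops, which is exactly where the benchmark says models fall apart.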

One camp is yelling, "See? It's memorization, not intelligence!" Another fires back: esolangs are purposefully weird, so is flunking them a crime or just proof these tests are bonkers? orthoxerox brings the roast: if humans bomb these too, "does this prove I also rely on memorization?" Meanwhile deklesen blames tokenization, the way bots chop code into chunks before reading it, arguing that the odd pieces these languages get split into could be sabotaging the bots. On the "give them a fair shot" side, simianwords says let the bots pull docs from the web and their scores will jump.

And because the internet never misses a meme, __alexs laments their dream of a “bold new era of Malbolge” coding was… optimistic. Verdict from the thread: it’s less “AI genius coder” and more “great on familiar homework, clueless in chaos.”

Key Points

  • EsoLang-Bench evaluates 80 problems across five esoteric languages with vastly scarcer training data than Python.
  • Frontier LLMs achieve ~90% on Python equivalents but only 3.8% overall on esoteric tasks.
  • All models scored 0% on tasks above the Easy tier; Whitespace saw 0% across all configurations.
  • Five prompting strategies and two agentic coding systems were tested; self-reflection gave no measurable benefit.
  • Results suggest mainstream-language benchmarks overstate genuine programming reasoning in current LLMs.

Hottest takes

"does this prove I also rely on training data memorization?" — orthoxerox
"a bold new era of programming in Malbolge" — __alexs
"tokenizing might make it harder to reason over those tokens" — deklesen

Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.