April 2, 2026
Count to five? Cue the riot
Even GPT-5.2 Can't Count to Five: Zero-Error Horizons in Trustworthy LLMs
Paper says it flubs simple puzzles; commenters roast, meme, and ask why tools weren’t used
TLDR: A new metric says GPT‑5.2 bungles simple counting and bracket checks, prompting commenters to shrug, roast, and meme that chatbots aren’t calculators—especially with tools off. The debate: measure reliability with ZEH or just let models use tools? It matters because people keep pitching these systems for high‑stakes jobs.
A new paper drops a buzzy metric called Zero-Error Horizon (ZEH)—think: “how far can this bot go without making a single mistake?”—and then claims GPT‑5.2 can’t handle baby puzzles like checking if “11000” has an even number of 1s or if “(((())))))” is properly balanced. The authors call it “surprising.” The internet calls it Tuesday.
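For scale: both toy tasks have one-line ground truths that any scripting language nails instantly. A minimal sketch (function names are mine, not the paper's):

```python
def even_ones(bits: str) -> bool:
    # Parity check: does the bit string contain an even number of '1's?
    return bits.count("1") % 2 == 0

def balanced(parens: str) -> bool:
    # Balance check for a single bracket type: the running depth must
    # never dip below zero and must end at exactly zero.
    depth = 0
    for ch in parens:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0
```

Here `even_ones("11000")` is `True` (two 1s) and `balanced("(((())))))")` is `False` (more closers than openers) — exactly the kind of answers the paper says the model fumbles.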
Top comments came in hot. One camp basically yelled, “Of course it fails—LLMs are autocomplete with vibes.” As one put it, it’s a text generator, not a calculator, especially with tools disabled. Another camp went practical: why not just let it use Python for math-y stuff? The philosopher crowd chimed in with a spicy brain analogy: LLMs are all fast intuition (System 1), so grading them on slow, step-by-step logic (System 2) with ZEH might be missing the point. Meanwhile, the meme lords crowned the thread with the Monty Python classic: “One! Two! Five!”
Buried under the roasting: the paper also probes Qwen2.5, saying ZEH loosely tracks accuracy and might reveal when real algorithmic skills start to emerge. They admit ZEH is expensive to compute but brag about up to 10× speedups using tree tricks and “online softmax.” The crowd? More interested in the fails than the formulas. The vibe: don’t trust chatbots for safety-critical tasks—and don’t act shocked when they goof.
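The ZEH idea itself is easy to state: the largest problem size at which the model still gets *everything* right. A hedged sketch of that definition — the paper's exact protocol may differ, and `solve_fn`/`instances_by_size` are hypothetical names:

```python
def zero_error_horizon(solve_fn, instances_by_size):
    """Largest size n such that solve_fn answers every instance of
    every size up to n correctly. Naive single-pass version.

    instances_by_size: {size: [(input, expected_answer), ...]}
    """
    horizon = 0
    for size in sorted(instances_by_size):
        if all(solve_fn(x) == y for x, y in instances_by_size[size]):
            horizon = size  # still error-free through this size
        else:
            break  # first mistake ends the horizon
    return horizon
```

One mistake anywhere caps the score, which is why ZEH is a much harsher bar than average accuracy — and why computing it naively over many sizes and instances gets expensive.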
Key Points
- Zero-Error Horizon (ZEH) is introduced as the largest problem size a model can solve with zero errors.
- ZEH evaluations show GPT-5.2 failing simple tasks such as the parity of "11000" and whether "((((())))))" is balanced.
- Applied to Qwen2.5, ZEH loosely correlates with accuracy but shows different fine-grained behavioral patterns.
- ZEH offers clues about when algorithmic capabilities emerge in large language models.
- Computing ZEH is computationally expensive, but tree structures and online softmax can deliver up to 10× speedups.
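On that last point: "online softmax" is a known streaming trick (compute the softmax normalizer in one pass by keeping a running max and a rescaled running sum). How the paper combines it with tree structures isn't spelled out in the thread, so this is just the core idea in miniature:

```python
import math

def online_softmax(xs):
    """One-pass softmax over finite scores: maintain a running max m and
    a running sum s of exp(x - m), rescaling s whenever m increases,
    so no separate first pass over the data is needed to find the max."""
    m = float("-inf")  # running max (numerical stability)
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        if x > m:
            s = s * math.exp(m - x) + 1.0  # rescale old sum to new max
            m = x
        else:
            s += math.exp(x - m)
    return [math.exp(x - m) / s for x in xs]
```

The payoff is the single pass: in attention-style workloads the scores can be consumed as they stream by, which is where speedups like the paper's claimed 10× typically come from.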