April 2, 2026
Count to five? Cue the riot
Even GPT-5.2 Can't Count to Five: Zero-Error Horizons in Trustworthy LLMs
Paper says it flubs simple puzzles; commenters roast, meme, and ask why tools weren’t used
TLDR: A new metric says GPT‑5.2 bungles simple counting and bracket checks, prompting commenters to shrug, roast, and meme that chatbots aren’t calculators—especially with tools off. The debate: measure reliability with ZEH or just let models use tools? It matters because people keep pitching these systems for high‑stakes jobs.
A new paper drops a buzzy metric called Zero-Error Horizon (ZEH)—think: “how far can this bot go without making a single mistake?”—and then claims GPT‑5.2 can’t handle baby puzzles like checking if “11000” has an even number of 1s or if “(((())))))” is properly balanced. The authors call it “surprising.” The internet calls it Tuesday.
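For scale: both toy tasks have one-line ground truths that any scripting language nails instantly. A minimal sketch (function names are mine, not the paper's):

```python
def even_ones(bits: str) -> bool:
    # Parity check: does the bit string contain an even number of '1's?
    return bits.count("1") % 2 == 0

def balanced(parens: str) -> bool:
    # Balance check for a single bracket type: the running depth must
    # never dip below zero and must end at exactly zero.
    depth = 0
    for ch in parens:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0
```

Here `even_ones("11000")` is `True` (two 1s) and `balanced("(((())))))")` is `False` (more closers than openers) — exactly the kind of answers the paper says the model fumbles.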
Top comments came in hot. One camp basically yelled, “Of course it fails—LLMs are autocomplete with vibes.” As one put it, it’s a text generator, not a calculator, especially with tools disabled. Another camp went practical: why not just let it use Python for math-y stuff? The philosopher crowd chimed in with a spicy brain analogy: LLMs are all fast intuition (System 1), so grading them on slow, step-by-step logic (System 2) with ZEH might be missing the point. Meanwhile, the meme lords crowned the thread with the Monty Python classic: “One! Two! Five!”
Buried under the roasting: the paper also probes Qwen2.5, saying ZEH loosely tracks accuracy and might reveal when real algorithmic skills start to emerge. They admit ZEH is expensive to compute but brag about up to 10× speedups using tree tricks and “online softmax.” The crowd? More interested in the fails than the formulas. The vibe: don’t trust chatbots for safety-critical tasks—and don’t act shocked when they goof.
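The ZEH idea itself is easy to state: the largest problem size at which the model still gets *everything* right. A hedged sketch of that definition — the paper's exact protocol may differ, and `solve_fn`/`instances_by_size` are hypothetical names:

```python
def zero_error_horizon(solve_fn, instances_by_size):
    """Largest size n such that solve_fn answers every instance of
    every size up to n correctly. Naive single-pass version.

    instances_by_size: {size: [(input, expected_answer), ...]}
    """
    horizon = 0
    for size in sorted(instances_by_size):
        if all(solve_fn(x) == y for x, y in instances_by_size[size]):
            horizon = size  # still error-free through this size
        else:
            break  # first mistake ends the horizon
    return horizon
```

One mistake anywhere caps the score, which is why ZEH is a much harsher bar than average accuracy — and why computing it naively over many sizes and instances gets expensive.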
Key Points
- Zero-Error Horizon (ZEH) is introduced as the largest problem size a model can solve with zero errors.
- ZEH evaluations show GPT-5.2 failing simple tasks such as the parity of "11000" and whether "((((())))))" is balanced.
- Applied to Qwen2.5, ZEH loosely correlates with accuracy but shows different fine-grained behavioral patterns.
- ZEH offers clues about when algorithmic capabilities emerge in large language models.
- Computing ZEH is computationally expensive, but tree structures and online softmax can deliver up to 10× speedups.
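On that last point: "online softmax" is a known streaming trick (compute the softmax normalizer in one pass by keeping a running max and a rescaled running sum). How the paper combines it with tree structures isn't spelled out in the thread, so this is just the core idea in miniature:

```python
import math

def online_softmax(xs):
    """One-pass softmax over finite scores: maintain a running max m and
    a running sum s of exp(x - m), rescaling s whenever m increases,
    so no separate first pass over the data is needed to find the max."""
    m = float("-inf")  # running max (numerical stability)
    s = 0.0            # running sum of exp(x - m)
    for x in xs:
        if x > m:
            s = s * math.exp(m - x) + 1.0  # rescale old sum to new max
            m = x
        else:
            s += math.exp(x - m)
    return [math.exp(x - m) / s for x in xs]
```

The payoff is the single pass: in attention-style workloads the scores can be consumed as they stream by, which is where speedups like the paper's claimed 10× typically come from.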