GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

AI giant gets roasted as smaller free rival looks more honest

TLDR: A new benchmark says GPT-5.5 makes up answers about three times as often as the free GLM-5.2 when it gets stuck, fueling doubts that ever-bigger AI is really better. Commenters immediately split between dunking on the hype and warning that the stats are being oversimplified.

The numbers lit the fuse, but the comment section brought the fireworks. The article’s big claim is that GPT-5.5 makes things up far more often than the openly available, MIT-licensed GLM-5.2 when it hits questions it can’t answer. In plain English: the flashier, more expensive AI may be more likely to bluff, while the smaller rival is getting praise for knowing when to say, “I don’t know.” That instantly turned the thread into a full-on brawl over whether bigger AI models are actually smarter, or just louder.

One camp was ready to drag the hype machine through the streets. The mood was basically: how is a massive premium model still this bad at admitting uncertainty? People mocked the idea that sheer size equals quality, and the article’s example gave critics plenty of ammo: in a Python asyncio test, the much larger DeepSeek V4 Pro burned more time and reasoning tokens only to return a wrong answer, while GLM-5.2 quickly flagged the request itself as technically impossible. The vibe was very “the expensive overachiever bombed the open-book test.”

But the skeptics showed up fast. One commenter warned that these hallucination scores are easy to misread, because they only measure how often a model bluffs on questions it can’t answer correctly, not how often it is wrong overall. Another got annoyed at the framing itself, basically yelling, “stop editorializing the title.” And then came the armchair training experts, arguing this isn’t about size at all: it’s about AI being raised on too many neat, tidy textbook answers, so it panics when real life gets messy. In other words: same old internet. Numbers dropped, nuance evaporated, chaos thrived.
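
For anyone squinting at that first caveat, here is a minimal Python sketch of why a conditional score can mislead. It assumes, based only on the commenter's description, that the hallucination rate is the share of wrong answers among the questions a model did not get right. The function names and the answer counts for "Model A" and "Model B" are invented for illustration and are not benchmark data.

```python
def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of non-correct responses that were wrong answers rather than abstentions.

    Assumed definition: incorrect / (incorrect + abstained).
    """
    not_correct = incorrect + abstained
    if not_correct == 0:
        return 0.0  # nothing to hallucinate on if every answer was correct
    return incorrect / not_correct


def overall_error_rate(correct: int, incorrect: int, abstained: int) -> float:
    """Share of ALL questions that received a wrong answer."""
    total = correct + incorrect + abstained
    return incorrect / total


# Hypothetical answer counts out of 100 questions: (correct, incorrect, abstained).
model_a = (90, 8, 2)    # knows a lot, rarely abstains
model_b = (50, 15, 35)  # knows less, abstains often

for name, (correct, incorrect, abstained) in [("Model A", model_a), ("Model B", model_b)]:
    h = hallucination_rate(incorrect, abstained)
    e = overall_error_rate(correct, incorrect, abstained)
    print(f"{name}: hallucination rate {h:.0%}, overall wrong answers {e:.0%}")
```

In this made-up case Model A has the higher hallucination rate (80% versus 30%) yet gives fewer wrong answers overall (8 versus 15 out of 100), which is the kind of gap the commenter was warning about.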

Key Points

  • The article says GLM-5.2, an open-weight MIT-licensed model from Z.ai, scores within 4 points of GPT-5.5 and 9 points of Claude Fable 5 on the Artificial Analysis Intelligence Index.
  • It reports hallucination rates from the AA-Omniscience benchmark of 28% for GLM-5.2, 36% for Opus 4.8, 48% for Fable 5, 86% for GPT-5.5, and 94% for DeepSeek V4 Pro.
  • The article cites the US government’s restriction of Claude Fable 5 three days after release as a prominent example of frontier-model risk.
  • DeepSeek V4 Pro is described as having 1.6T parameters with 49B active parameters, while GLM-5.2 is described as having 753B parameters with roughly 40B active parameters.
  • In a Python asyncio prompt test described by the article, GLM-5.2 reportedly identified a technical impossibility quickly, while DeepSeek V4 Pro used more time and reasoning tokens but returned an incorrect answer.
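
The headline arithmetic behind these points can be checked with a short Python sketch. It uses only the figures reported above; the helper names are made up for illustration, and GLM-5.2's active-parameter share is approximate because the article gives "roughly 40B".

```python
# Hallucination rates (percent) as reported in the key points above.
reported_rates = {
    "GLM-5.2": 28,
    "Opus 4.8": 36,
    "Fable 5": 48,
    "GPT-5.5": 86,
    "DeepSeek V4 Pro": 94,
}


def ratio_to_baseline(rates: dict, baseline: str) -> dict:
    """Express every model's rate as a multiple of the baseline model's rate."""
    base = rates[baseline]
    return {name: rate / base for name, rate in rates.items()}


def active_share(total_billion: float, active_billion: float) -> float:
    """Fraction of a model's parameters that are active."""
    return active_billion / total_billion


# (total, active) parameter counts in billions, as described in the article.
params = {
    "DeepSeek V4 Pro": (1600, 49),
    "GLM-5.2": (753, 40),  # active count is "roughly 40B"
}

ratios = ratio_to_baseline(reported_rates, "GLM-5.2")
print(f"GPT-5.5 vs GLM-5.2: {ratios['GPT-5.5']:.2f}x")  # 86 / 28, about 3.07x

for name, (total, active) in params.items():
    print(f"{name}: {active_share(total, active):.1%} of parameters active")
```

The 86% versus 28% comparison is where the "3x" in the headline comes from.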

Hottest takes

Hallucination rate scores are a little tricky to interpret — aesthesia
Please don't editorialize titles — cwillu
In a book you never see a question which admit no answer — frankohn
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.