May 28, 2026

Fact-checkers? More like fight-checkers

Five frontier LLMs disagree on 67% of 1k real-world fact-check claims

AI fact-checkers are fighting, and the comments are loving the chaos

TLDR: A new study found top AI bots disagreed on most real fact-check questions, raising doubts about treating them like reliable judges. The comments turned it into a circus: some joked the bots are becoming "more human," while others questioned whether the research itself leaned on AI too much.

The big number sending people into a spiral: five top AI chatbots disagreed on 67% of 1,000 real fact-check questions. In plain English, when asked whether real-world claims were true, mostly true, misleading, or false, the bots often could not get on the same page. Even juicier, about a third of the claims had major disagreements, where the verdicts landed at least two rubric steps apart, so one bot might say mostly true while another said false. That instantly turned the comment section into a mix of stand-up comedy, shrugging realism, and suspicion.

The funniest mood-setter came fast: "They get more human by the day." That pretty much became the unofficial tagline of the thread, with readers joking that if the bots are bickering, dodging certainty, and splitting into camps, maybe they really are becoming just like us. But not everyone was impressed. One blunt commenter basically waved off the whole experiment, saying people only care how Claude or Codex perform anyway, a classic fan-war move that turned a research post into a popularity contest.

Then came the drama. One commenter suggested the models could inspect their own reasoning if they wanted, dropping a link like a grenade. Another went after the researchers instead, questioning how much of the report itself was written with AI and calling out the irony of an ethics section without clear disclosure. So yes, the study says AI fact-checkers don’t agree, but the real spectacle is that the humans instantly couldn’t agree on what that means either.

Key Points

  • Lenz Research tested five frontier LLMs on 1,000 recent real-world fact-check claims using a forced four-label rubric: True, Mostly True, Misleading, or False.
  • The models showed disagreement on 67% of claims, either through dissent from the majority verdict or because no strict majority formed.
  • In 34% of claims, the largest disagreement between model verdicts spanned at least two rubric buckets, indicating substantial differences in judgments.
  • Overall panel agreement was measured at Krippendorff’s alpha (ordinal) = 0.639, which the article describes as nontrivial but limited agreement.
  • The models agreed most on definitive True/False outcomes, while the middle categories—Mostly True and Misleading—showed much less convergence.
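
As a rough illustration of how the two headline numbers could be tallied, here is a minimal Python sketch. The rubric order, the function names, and the toy verdicts are all assumptions made up for illustration; none of them come from the study, and this is not the researchers' code.

```python
from collections import Counter

# Assumed ordinal order of the four-label rubric, from most to least true.
RUBRIC = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: i for i, label in enumerate(RUBRIC)}


def has_disagreement(verdicts):
    """Return True if no strict majority forms or any model dissents from it."""
    _, top_count = Counter(verdicts).most_common(1)[0]
    strict_majority = top_count > len(verdicts) / 2
    return (not strict_majority) or top_count < len(verdicts)


def spread(verdicts):
    """Return the distance in rubric buckets between the two most distant verdicts."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def summarize(panel):
    """Compute the share of claims with any disagreement and with a 2+ bucket spread."""
    n = len(panel)
    return {
        "disagreement_rate": sum(has_disagreement(v) for v in panel) / n,
        "wide_spread_rate": sum(spread(v) >= 2 for v in panel) / n,
    }


# Hypothetical toy data: one row per claim, one verdict per model.
toy_panel = [
    ["True", "True", "True", "True", "True"],                   # unanimous
    ["True", "True", "True", "Mostly True", "True"],            # one dissenter
    ["Mostly True", "Misleading", "Misleading", "False", "True"],  # no strict majority
    ["False", "False", "False", "False", "Misleading"],         # one dissenter
]

print(summarize(toy_panel))
```

On the toy data, three of the four claims count as disagreements and one has a spread of two or more buckets. Krippendorff's alpha is left out of this sketch because it needs a fuller coincidence-matrix calculation.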

Hottest takes

"They get more human by the day." — christophilus
"I think ppl only care about how Claude or codex does." — ipunchghosts
"I wonder if anything of this matters when the authors don't disclose exactly how much of their report was written and made with LLMs" — embedding-shape