Experience using AI software to prove Euler sum results [pdf]

Math fans roast AI after it fumbles fancy proofs but still shows scary progress

TLDR: A mathematician tested several AI systems on hard proof problems and found they still make serious errors, so they’re not ready to be trusted as research partners. Commenters turned it into a brawl between “AI is a confident bullshitter” and “sure, but it’s improving fast enough to make that everyone’s problem.”

A veteran mathematician put today’s hottest AI tools through a brutal real-world test: can they help prove results about Euler sums, a family of infinite series from a very old corner of math now getting very modern hype? The verdict: not yet. The paper says the bots are better than they were a year or two ago, but still make basic mistakes, skip crucial steps, lean on shaky claims, and sometimes act weirdly confident while being flat-out wrong. For the comment crowd, that was catnip.

The strongest reaction was a split between “LOL, autocomplete is not a mathematician” and “you’re missing the point, this is moving insanely fast.” Skeptics piled on with examples of AI proving things by assuming the answer, using broken logic, or citing results like a student bluffing through homework. Optimists fired back that even human research assistants make mistakes, and that the real story is how much better these systems already are compared with 2023. The spiciest mini-drama? Whether this is an embarrassing failure for AI hype, or actually a warning shot that human-plus-AI math is closer than critics want to admit.

And yes, the jokes were everywhere. Commenters compared AI math proofs to “a guy confidently explaining a magic trick he hasn’t learned yet,” while others said the models are amazing at producing beautifully typeset nonsense. The mood was half dunk-fest, half nervous respect: everyone laughed at the blunders, but plenty of readers also sounded a little spooked that the gap is closing this fast.

Key Points

  • David H. Bailey’s paper assesses several LLMs on research-level problems from the theory of Euler sums.
  • The paper concludes that current LLMs have improved significantly but are still not reliable as mathematical research assistants for this problem class.
  • Reported weaknesses include algebra errors, use of divergent sums, unsupported claims, false results, missing key details, and insufficient validity checking.
  • The article cites Terence Tao’s 2024 and 2026 comments as context for growing expectations around AI-assisted mathematics.
  • Bailey contrasts the new study with earlier tests: ChatGPT performed poorly in 2023 on four classical theorem proofs, while DeepSeek showed stronger but still limited performance in 2025.
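
For a feel of what "validity checking" can mean here, below is a minimal Python sketch that numerically sanity-checks one classical Euler sum identity, sum over n >= 1 of H_n / n^2 = 2 * zeta(3), where H_n is the n-th harmonic number. The identity is Euler's well-known result and is used purely as an illustration; it is not taken from Bailey's paper, and the function names (euler_sum_partial, tail_estimate) are hypothetical. A numerical match like this only catches wrong claims; it proves nothing.

```python
import math

# Known constants, hardcoded to double precision.
APERY = 1.2020569031595942        # zeta(3)
EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def euler_sum_partial(n_terms: int) -> float:
    """Partial sum of H_n / n^2 for n = 1..n_terms."""
    harmonic = 0.0
    terms = []
    for n in range(1, n_terms + 1):
        harmonic += 1.0 / n              # H_n = 1 + 1/2 + ... + 1/n
        terms.append(harmonic / (n * n))
    return math.fsum(terms)              # accurately rounded summation


def tail_estimate(n_terms: int) -> float:
    """Rough size of the omitted tail, using H_n ~ ln(n) + gamma.

    Integrating (ln x + gamma) / x^2 from n_terms to infinity gives
    (ln(n_terms) + 1 + gamma) / n_terms. This is an approximation,
    not a rigorous error bound.
    """
    return (math.log(n_terms) + 1.0 + EULER_GAMMA) / n_terms


if __name__ == "__main__":
    n = 200_000
    partial = euler_sum_partial(n)
    claimed = 2 * APERY
    print("partial sum        :", partial)
    print("partial + tail est.:", partial + tail_estimate(n))
    print("claimed 2*zeta(3)  :", claimed)
```

The raw partial sum converges slowly (the tail shrinks roughly like ln(n)/n), so the tail estimate is what makes the comparison meaningful at modest n. A claimed closed form that disagrees with the corrected value in the first few digits is almost certainly wrong.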

Hottest takes

beautifully typeset nonsense — throwawaymath
autocomplete with a PhD-level attitude problem — spectralToast
people are clowning on this, but read the timeline and tell me that curve isn’t terrifying — zeta_hedge