June 6, 2026

Proof, panic, and robot homework

Benchmarks in Leipzig

Math boffins say AI nearly cleared their toughest puzzle set — commenters are not calm

TLDR: Researchers tested AI on 100 very hard math questions and got the unsolved count down to just 2, a striking sign of how far these systems have come. But commenters instantly argued over whether this is true reasoning or just clever use of already-published answers, with a side of robot-doom jokes.

A bunch of mathematicians gathered in Leipzig, cooked up 100 brutally hard research questions, and then let today’s best artificial intelligence chatbots take a swing. After the final and most intensive testing stage, only 2 questions were left unsolved. And yes, the internet immediately turned this into a full-on comment-section cage match.

The biggest split? One camp is yelling, basically, “This is huge!” The study leader jumped in personally to stress these weren’t cute school quiz questions but the kind of problems a focused PhD student might need days or even weeks to work through. That gave the results a real wow factor. But the skeptics were equally loud, warning people not to oversell it: these were questions with known answers already out in the literature, so critics think the models may be better at remixing existing human knowledge than at breaking brand-new ground.

That tension is the real drama. Some readers zoomed in on the paper’s “only 2 unsolved” claim and side-eyed the wording, arguing the headline stat sounds flashier than the model-by-model results shown on the benchmark site at math.sciencebench.ai. Others went straight for meme mode, with one deadpan fan favorite: “As long as it’s not conscious, we’re safe.” In other words: part awe, part nitpicking, part sci-fi panic — classic internet.

Key Points

  • From April 1 to May 15, 2026, 49 mathematicians compiled a dataset of 100 research-level mathematics questions with known answers.
  • Most of the benchmark creation occurred during a three-day workshop, *Benchmarks in Leipzig*, with 35 participants.
  • The benchmark was evaluated in three stages: one attempt by five LLMs, then 20 runs per model for three models, then three runs for two heavy-thinking models.
  • The number of completely unsolved questions dropped from 41 after Stage 1 to 16 after Stage 2.
  • Only 2 questions remained unsolved after Stage 3, and the article states this shows impressive mathematical reasoning progress in LLMs.
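For readers who want the headline numbers in one place, here is a minimal sketch of the arithmetic behind the Key Points. It assumes "completely unsolved" means no model got the question right in any run so far, so the remainder counts questions solved at least once; the variable and function names are ours, not the paper's.

```python
# Counts of completely unsolved questions after each stage, as reported above.
TOTAL_QUESTIONS = 100
unsolved_after = {"Stage 1": 41, "Stage 2": 16, "Stage 3": 2}


def solved_summary(total, unsolved):
    """Return {stage: (solved_count, solved_share)}.

    Assumption: a question that is not "completely unsolved" was solved
    at least once by at least one model up to that stage.
    """
    return {stage: (total - n, (total - n) / total) for stage, n in unsolved.items()}


summary = solved_summary(TOTAL_QUESTIONS, unsolved_after)
for stage, (solved, share) in summary.items():
    print(f"{stage}: {solved} of {TOTAL_QUESTIONS} solved at least once ({share:.0%})")
```

Under that reading, the solved-at-least-once count goes from 59 to 84 to 98 out of 100. Note that this is a tally across all models and runs, which is exactly why some commenters argue it sounds flashier than any single model's score.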

Hottest takes

"not about solving frontier challenges" — zerobees
"As long as it’s not conscious, we’re safe." — Towaway69
"much harder than any exam question in any exam" — christianstump