February 12, 2026

Benchmarks, brawls & brainy bots

Gemini 3 Deep Think

Google’s new “thinking” AI sparks score wars and science hype

TLDR: Google unveiled Gemini 3 Deep Think for science and engineering; a Duke lab uses it to design new semiconductor materials. The crowd cheers its 84.6% score on the ARC‑AGI‑2 reasoning benchmark, while skeptics point out the number comes from a semi‑private test set and mock the Twitter-first PR. It matters because a model that reasons better could help in real research.

Google dropped Gemini 3 Deep Think, pitching it as a brainy assistant for tough science and engineering, and even showed Duke’s Wang Lab using it to design new semiconductor materials. But the real action? The comments turned into a scoreboard brawl. One fan flexed that Gemini is “healthily ahead of Claude 4.6”, while another shouted “Wow” at the ARC‑AGI‑2 result (84.6% versus 68.8% for Opus 4.6), painting it as Google’s comeback moment. ARC‑AGI‑2, by the way, is a puzzle‑style reasoning test used as a proxy for general smarts.

Then the referees arrived. A cooler head linked the methodology and noted the 84.6% came from a semi‑private test set, sparking a “show us the private‑set receipts” debate. Meanwhile, a philosopher‑comedian broke the thread with a meme: a spectrum of non‑thinking, thinking, and best‑of‑N models mapped to “linear, quadratic, and n^3” complexity—cue jokes about “n^3 brain” mode. And because it’s the internet, there was petty PR drama: why announce it on Twitter before the official Google blog?

Bottom line: hype squad vs. audit squad, with a side of math memes and platform beef. If Gemini really thinks better, the lab demos matter, but the crowd wants more than vibes; it wants proof.

Key Points

  • Gemini 3 Deep Think is introduced as an AI tool for modern science, research, and engineering challenges.
  • The tool aims to push the frontier of intelligence in these domains.
  • The Wang Lab at Duke University is showcased using Gemini 3 Deep Think.
  • Its application is focused on designing new semiconductor materials.
  • A demonstration is available to watch showing how the lab uses the tool.

Hottest takes

“healthily ahead of Claude 4.6.” — Metacelsus
“Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6) Wow.” — lukebechtel
“The arc-agi-2 score (84.6%) is from the semi-private eval set.” — sigmar