Show HN: TetrisBench – Gemini Flash reaches 66% win rate on Tetris against Opus

Gemini Flash tops Tetris, fans argue chess, luck, and math

TLDR: Gemini 3 Flash wins 66% of its games overall in TetrisBench and 80% against Claude Opus. Comments spiral into a chess challenge, debates over whether AIs should build dedicated bots, and complaints about unfair piece randomness, plus a sharp correction on the headline numbers. It all shows that how we judge “smart” in games really matters.

TetrisBench just dropped, and the block‑stacking drama is glorious. Across 449 head‑to‑head games, Gemini 3 Flash is dunking pieces with a 66% overall win rate and a spicy 80% against Claude Opus. Cue the comments: one eagle‑eyed stat cop, burkaman, jumped in to say, “It’s actually 80% against Opus, 66% average against the 5 models it’s tested with,” while others asked the ultimate gamer question: does this prove anything about real intelligence?

The loudest take: “Make it do chess.” Commenter akomtu wants the AI to build a tiny C/C++ chess engine and duke it out with Stockfish, the grandmaster of computer chess. Meanwhile, arendtio argued the whole match‑play setup is unfair: if a large language model (LLM, an AI that predicts text) just built a real Tetris bot, it would be “1000x better,” so the results say more about limitations than skill.

Actual Tetris players showed up too. Bubblesorting, proudly top‑10%, complained about the piece randomness: true random “starves” you of long bars; the fairer “7‑bag” method (one of each piece in a bag, drawn before repeats) would stop the rage. On the chill side, OGEnthusiast praised Gemini Flash as a workhorse with great price‑performance. Want receipts? Check the leaderboard or try a battle. This is AI esports, but with way more comment salt and memes.
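
For the curious, the 7‑bag idea from that complaint can be sketched in a few lines of Python. This is only an illustration of the general technique, not TetrisBench's actual piece generator (the source does not show it); the names `seven_bag` and `longest_gap` are made up for this example.

```python
import random

PIECES = "IJLOSTZ"  # the seven standard tetrominoes; "I" is the long bar


def seven_bag(rng):
    """Yield pieces forever, dealing one shuffled bag of all seven at a time."""
    while True:
        bag = list(PIECES)
        rng.shuffle(bag)
        yield from bag


def longest_gap(sequence, piece="I"):
    """Longest run of consecutive draws that does not contain `piece`."""
    longest = current = 0
    for p in sequence:
        if p == piece:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


rng = random.Random(0)  # fixed seed so reruns are repeatable
gen = seven_bag(rng)
bagged = [next(gen) for _ in range(700)]
uniform = [rng.choice(PIECES) for _ in range(700)]  # "true random" draws

print("longest wait for a long bar, 7-bag:  ", longest_gap(bagged))
print("longest wait for a long bar, uniform:", longest_gap(uniform))
```

With a bag, the worst case is a long bar first in one bag and last in the next, so the wait never exceeds 12 pieces. Independent uniform draws have no such ceiling, which is the "starving" the comment describes.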

Key Points

  • TetrisBench compares six AI models in Tetris across 449 total games.
  • Gemini 3 Flash has the highest overall win rate at 66% (99-48-2).
  • Gemini 3 Pro records a 64% win rate (95-50-3); GPT-5.2 has 54% (82-68-1).
  • Claude Opus 4.5 posts 52% (78-73-0); Claude Sonnet 4 records 41% (62-87-1).
  • The interface provides a W–L–D legend and links to a leaderboard and playable Tetris battle.
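
The percentages above can be checked against the W–L–D records with a quick Python sketch. It assumes win rate means wins divided by all games played, draws included; the source does not state the formula, but this reading reproduces every listed figure. The `records` dict and `win_rate` helper are illustrative names, and only the five models listed above are included.

```python
# W-L-D records copied from the key points above.
records = {
    "Gemini 3 Flash": (99, 48, 2),
    "Gemini 3 Pro": (95, 50, 3),
    "GPT-5.2": (82, 68, 1),
    "Claude Opus 4.5": (78, 73, 0),
    "Claude Sonnet 4": (62, 87, 1),
}


def win_rate(wins, losses, draws):
    """Wins as a fraction of all games played (assumption: draws count as games)."""
    return wins / (wins + losses + draws)


for model, (w, l, d) in records.items():
    print(f"{model}: {round(100 * win_rate(w, l, d))}% ({w}-{l}-{d})")
```

Dropping draws from the denominator would put Gemini 3 Flash at 67% instead of the reported 66%, which is why the sketch counts them.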

Hottest takes

make it build a chess engine and compare it against Stockfish. — akomtu
I mean, if you let the LLM build a Tetris bot, it would be 1000x better — arendtio
It's actually 80% against Opus, 66% average against the 5 models it's tested with. — burkaman
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.