Advancing AI Benchmarking with Game Arena

DeepMind adds poker and Werewolf; fans cheer, nitpick, and wonder if we’re training liars

TLDR: DeepMind expanded its Game Arena beyond chess to poker and Werewolf to test bluffing and social skills, with Gemini 3 models topping the charts. Commenters debated the use of curated poker hands, flexed Gemini vs. ChatGPT, and set an AGI bar at beating modern video games, raising big questions about what we're really measuring.

Google DeepMind’s game lab just leveled up: after chess, the Kaggle Game Arena now includes poker and the party game Werewolf to test how AIs handle bluffing, social cues, and messy reality. The scoreboard flex is clear (Gemini 3 Pro and Flash sit on top), and DeepMind claims these models play more like people, using “intuition” instead of brute-force number crunching. Cue the internet: some are hyped for real-world-ish chaos, others are side-eyeing the “teach it to deceive” vibe, even as DeepMind frames Werewolf as a safe sandbox for catching lies.

The comment section came in swinging. One camp wants purer tests: eamag questioned why it’s curated poker hands instead of normal play, while tiahura clamored for the classic roguelike NetHack. Platform drama? Absolutely. chaostheory dropped an anecdote that Gemini is now beating ChatGPT at understanding what people mean, a spicy flex that sparked brand-loyalty skirmishes. Meanwhile, researcher ofirpress upped the stakes with a meta twist: agents building agents to battle in “CodeClash,” turning the benchmark into a bot-on-bot arms race. And the AGI gatekeepers arrived: cv5005 declared the real bar is an AI that can “sit down” and beat a modern video game using only sight and sound. Meme lords chimed in with “poker face” jokes and werewolf GIFs, but the core debate is dead simple: are we measuring progress, or just training better bluffers?

Key Points

  • Google DeepMind and Kaggle expanded Game Arena with Werewolf and poker to benchmark AI reasoning under uncertainty.
  • The chess benchmark measures strategic reasoning, adaptation, and long-term planning, with updated leaderboards.
  • Gemini 3 Pro and Gemini 3 Flash lead the chess leaderboard with top Elo ratings, outperforming Gemini 2.5 (a quick sketch of how Elo updates work follows this list).
  • Werewolf is a natural-language, team-based benchmark assessing communication, negotiation, and ambiguity handling.
  • Game Arena’s controlled game environments support agentic safety research, including detection and safe handling of deception.
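For readers curious what those leaderboard numbers mean: arena-style rankings are typically computed with Elo-style pairwise updates, where each head-to-head game nudges both players’ ratings toward their observed strength. Game Arena’s exact rating formula and K-factor aren’t stated here, so the Python below is only a minimal illustrative sketch of the standard Elo update, not DeepMind’s implementation.

```python
# Minimal Elo update sketch (illustrative only; Game Arena's actual
# rating system and K-factor may differ).

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float,
               score_a: float, k: float = 32) -> tuple[float, float]:
    """Return updated ratings; score_a is 1 for a win, 0.5 draw, 0 loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# Example: a 1600-rated model beats a 1500-rated one.
# The winner gains ~11.5 points; the loser drops by the same amount.
print(update_elo(1600, 1500, 1.0))
```

The key property for benchmarking: an upset (a low-rated model beating a high-rated one) moves ratings more than an expected result, so the ladder self-corrects as more games are played.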

Hottest takes

curate poker hands instead of a normal poker — eamag
found Gemini to perform better than ChatGPT when it came to intent analysis — chaostheory
My personal threshold for AGI is when an AI can 'sit down' - it doesn’t need to have robotic hands, but it needs to only use visual and audio inputs to make its moves - and complete a modern RPG or FPS single player game — cv5005