MTG Bench: Testing how well LLMs can play Magic

AI Tries to Play Magic, and the Comment Section Instantly Starts Backseat Dueling

TLDR: MTG Bench tests whether AI can play *Magic: The Gathering* without a rules babysitter, and it also exposed big differences in how expensive these systems are to run. Commenters were split between “this is awesome” and “this isn’t real Magic without opponents,” turning a niche benchmark into a full-on backseat gaming debate.

A new benchmark called MTG Bench is trying to answer a delightfully nerdy question: can a large language model — basically an AI chatbot — play Magic: The Gathering well enough without a built-in referee enforcing the rules? The creator says that if an AI is truly good at the game, it shouldn’t need a digital hall monitor. That idea alone lit up the comments, because apparently nothing brings out the internet’s inner coach like watching a bot fumble a card game.

The biggest split was between people calling this very cool and people saying it misses the whole soul of the game. One commenter loved the idea of using AI to reason through gaming choices, while another praised “obscure benchmarks” because they feel less likely to be trained to death already. But the real spicy take came from players who argued the test strips out the best part of Magic: other people. One blunt response basically said that without opponents, this isn’t a game — it’s just paperwork with cards. Ouch.

There was also some low-key platform drama. The article points out one company handled tool use cheaply and cleanly, while another seemingly burned through way more tokens, which is the AI-world version of a grocery receipt jump scare. Meanwhile, the comments got wonderfully niche: people linked RuneBench for RuneScape, asked whether Card Forge should be adapted, and casually flexed local model tournaments on an RTX 5090. Even in a story about AI playing fantasy cards, the community somehow turned it into a mashup of theorycrafting, brand side-eye, and armchair coaching.

Key Points

• MTG Bench tests whether LLMs can simulate playing Magic: The Gathering without a formal rules engine.
• The benchmark gives models primitive library tools over MCP (Model Context Protocol), while the LLM handles the rest of gameplay, including action sequencing.
• Legality checks and scoring were performed with gpt-5.5 (medium), and the author found models were better at judging legality than executing legal turns.
• The article argues remote MCP servers reduce agent-loop costs, especially with OpenAI, because cached input is charged once rather than after each tool call.
• The benchmark penalizes over-eager tool use because irreversible information exposure, such as mistakenly drawing a card, makes the simulation illegal even if state is later reverted.
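
For readers wondering why "charged once rather than after each tool call" matters so much, here is a rough back-of-the-envelope sketch. It is a simplified model, not any provider's actual billing formula: the function names, the token counts, and the assumption that a client-side loop re-sends the full context on every tool call are all illustrative.

```python
def resend_each_call(prefix_tokens: int, tool_calls: int, tokens_per_result: int) -> int:
    """Input tokens billed if every tool call triggers a fresh request
    that re-sends the whole (growing) context. Hypothetical model."""
    total = 0
    context = prefix_tokens
    # One request per tool call, plus one final request for the answer.
    for _ in range(tool_calls + 1):
        total += context
        context += tokens_per_result  # each tool result is appended to the context
    return total


def charged_once(prefix_tokens: int, tool_calls: int, tokens_per_result: int) -> int:
    """Input tokens billed if the prefix is charged a single time and
    only the new tool results add to the bill. Hypothetical model."""
    return prefix_tokens + tool_calls * tokens_per_result


if __name__ == "__main__":
    # Made-up numbers: a 1,000-token game state, 3 tool calls, 100 tokens per result.
    print(resend_each_call(1000, 3, 100))
    print(charged_once(1000, 3, 100))
```

The gap widens with every extra tool call, which is one plausible reading of why a chatty agent loop can turn into a grocery receipt jump scare.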

Hottest takes

"Without opponents you simply don't have a game" — OwenCR

"I love obscure benchmarks" — OsrsNeedsf2P

"Qwen 3.6 27B won narrowly over Gemma 4" — jmccaf

June 11, 2026

Draw phase? More like drama phase
