Square Minus Square – A coding agent benchmark

AI stumbles on square puzzle; commenters yell “give it eyes”

TLDR: Benchmarking coding agents on a “two squares, minimal triangles” task found none of the AI models solved it, even with screenshot-based debugging. Commenters push grounded vision—Molmo2 gets a shout—while others doubt AI’s spatial reasoning, making this a buzzy test of what agents still can’t do.

Two squares, one deceptively simple mission: carve out the first square minus the overlap, using the fewest triangles possible. The creator threw the challenge at multiple coding agents—and even tried it by hand—and the mic-drop moment was this: no LLM solved it.

Cue the crowd gasping, laughing, and arguing. The loudest chorus? Vision, vision, vision. User wariatus demanded a grounded vision model so agents can actually analyze screenshots, insisting that most models misread what's on the screen and name-dropping Molmo2 as promising. Meanwhile, people side-eyed the rollercoaster: top-tier bots like Opus, Gemini 3 Pro, and GPT 5.2 sometimes nailed pieces of the task, then crashed spectacularly. There was grudging respect for the feedback loop: agents that took screenshots and critiqued their own work proved surprisingly sharp at spotting bugs, and readers want even stricter self-checking. Gemini 3 Flash drew snark for "solving" with far too many vertices—like showing up to a potluck with seventeen forks for a two-fork meal.

The vibe: half "lol triangles," half "why can't machines do basic geometry?" Under the jokes sits a real question: will giving agents vision fix this, or are they just bad at spatial reasoning and at staying minimal under constraints? For now, this tiny Rust function exposes some very big AI blind spots.

Key Points

  • A geometry benchmark asks agents to triangulate the area of one rotated square minus its intersection with another using minimal triangles in a standalone Rust function (a hedged sketch of one possible approach follows this list).
  • A custom visualization framework was built to display outputs and capture screenshots and video.
  • Each coding agent was tested twice, with the better run selected for comparison.
  • No LLM successfully solved the task, though many used screenshots effectively to identify bugs, highlighting the value of feedback loops.
  • Top models (Opus, Gemini 3 Pro, GPT 5.2) showed mixed reliability, and Gemini 3 Flash added unnecessary vertices/triangles.
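
The benchmark's actual function signature isn't reproduced in the discussion, so the following is only a minimal sketch of one plausible attack, under stated assumptions: both squares arrive as four counter-clockwise vertices, and the names `Point`, `Tri`, `side`, `clip`, `fan`, and `square_minus_square` are hypothetical helpers, not the benchmark's API. The idea is that since both squares are convex, A minus B can be peeled into convex pieces by walking B's edges, and each convex piece fan-triangulates cheaply; this keeps the triangle count small but does not guarantee the true minimum the task demands.

```rust
// Hypothetical helper types; the benchmark's real signature may differ.
type Point = [f64; 2];
type Tri = [Point; 3];

/// Twice the signed area of triangle (a, b, p); positive when p is left of a->b.
fn side(a: Point, b: Point, p: Point) -> f64 {
    (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
}

/// Sutherland-Hodgman clip of a convex polygon against the line through a->b.
/// keep_left = true keeps the half-plane to the left of the directed edge.
fn clip(poly: &[Point], a: Point, b: Point, keep_left: bool) -> Vec<Point> {
    let inside = |p: Point| {
        let s = side(a, b, p);
        if keep_left { s >= 0.0 } else { s <= 0.0 }
    };
    let mut out = Vec::new();
    for i in 0..poly.len() {
        let cur = poly[i];
        let nxt = poly[(i + 1) % poly.len()];
        if inside(cur) {
            out.push(cur);
        }
        if inside(cur) != inside(nxt) {
            // The edge cur->nxt crosses the clip line: emit the intersection point.
            let (sc, sn) = (side(a, b, cur), side(a, b, nxt));
            let t = sc / (sc - sn);
            out.push([cur[0] + t * (nxt[0] - cur[0]), cur[1] + t * (nxt[1] - cur[1])]);
        }
    }
    out
}

/// Fan-triangulate a convex polygon, skipping degenerate (zero-area) triangles.
fn fan(poly: &[Point], out: &mut Vec<Tri>) {
    for i in 1..poly.len().saturating_sub(1) {
        if side(poly[0], poly[i], poly[i + 1]).abs() > 1e-9 {
            out.push([poly[0], poly[i], poly[i + 1]]);
        }
    }
}

/// Triangulate square_a minus its intersection with square_b.
/// Both squares are given as four counter-clockwise vertices.
fn square_minus_square(square_a: [Point; 4], square_b: [Point; 4]) -> Vec<Tri> {
    let mut tris = Vec::new();
    // `remaining` is the part of A inside every edge of B handled so far.
    let mut remaining: Vec<Point> = square_a.to_vec();
    for i in 0..4 {
        let (p, q) = (square_b[i], square_b[(i + 1) % 4]);
        // The slice of `remaining` outside this edge of B is convex and belongs
        // to A minus B, so a simple triangle fan covers it.
        let outside = clip(&remaining, p, q, false);
        fan(&outside, &mut tris);
        // Carry only the part still inside this edge to the next iteration.
        remaining = clip(&remaining, p, q, true);
        if remaining.len() < 3 {
            break; // Nothing of A is left on B's side of the edges seen so far.
        }
    }
    tris
}

fn main() {
    // Unit square A and a copy of it shifted half a unit to the right.
    let a = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]];
    let b = [[0.5, 0.0], [1.5, 0.0], [1.5, 1.0], [0.5, 1.0]];
    let tris = square_minus_square(a, b);
    println!("{} triangles cover A minus the overlap", tris.len()); // expect 2
}
```

A genuinely minimal answer would also have to merge pieces that straddle B's edge lines and avoid redundant vertices, which is presumably where entries like Gemini 3 Flash's pile on the extra geometry.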

Hottest takes

"Have you tried to equip those agents with an access to grounded vision model to analyse that image?" — wariatus
"most models can’t understand such imput properly" — wariatus
"I am now experimenting with Molmo2 and it looks promising" — wariatus
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.