November 11, 2025
Beaks, bikes, and beef
Agentic Pelican on a Bicycle
AI draws a biking bird—commenters say stop self‑grading
TLDR: An experiment let AI draw and repeatedly fix a pelican-on-a-bike using its own feedback. Commenters loved the chaos but argued results are random without many runs, warned models equate complexity with quality, and pushed for outside critics to review the art—why it matters: better testing means better AI.
The internet just watched AI models draw a pelican riding a bicycle, then repeatedly judge and fix their own art—and the comments were spicier than the pelican’s handlebars. Some outputs improved (think chains, spokes, and actual pelican arms), but the crowd’s verdict was loud. davesque argued you can’t learn much from one-off runs: do many per model or it’s just luck and style. williamstein slammed the “more complex = better” trap, saying it mirrors how these models write code. lubujackson called it out: miraculous first drafts, terrible revisions. Standardizing image conversion via the Chrome DevTools MCP server kept things fair, yet readers demanded more runs and clearer stop rules.
The thread turned strategic and snarky. sorenjan pitched a fix: let a different model grade the pictures—like a GAN, where one draws and one critiques—aka “stop grading your own homework.” ripped_britches reported the same pain in design tools: attempt one freezes, no real progress. Memes popped off about pelicans finally finding the handlebars and going “arm day,” plus puns crowning the bicycle as the new “spokesperson.” The vibe? Curious, skeptical, craving repeatable tests and smarter feedback loops before anyone crowns these agentic birds the next Picasso. For context on the original benchmark, check Simon Willison’s pelican prompt.
Key Points
- The experiment extends Simon Willison’s pelican-on-a-bicycle benchmark into an agentic iterative loop.
- Models generated an SVG, converted it to JPG via Chrome DevTools MCP, visually assessed the output, and self-corrected.
- Rasterization was standardized using the Chrome DevTools MCP server to avoid tool variability.
- Six multimodal models were tested; iteration counts varied, with models deciding when to stop.
- Claude Opus 4.1 showed specific mechanical and visual improvements over four iterations.
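The loop in the key points above can be sketched in a few lines of Python. This is a minimal illustration, not the experiment's actual code: `draw_svg` and `assess` are hypothetical stubs standing in for the model calls, and the rasterization step (Chrome DevTools MCP in the real run) is reduced to a comment.

```python
# Sketch of the agentic draw-and-critique loop: generate an SVG,
# rasterize it, assess the result, and either revise or stop.
# All "model" behavior below is stubbed for illustration.

def draw_svg(critique):
    """Stub drawing model: returns an SVG, nudged by the last critique."""
    parts = ["<circle r='20'/>"]  # pelican body placeholder
    if critique and "wheels" in critique:
        parts.append("<circle cx='40' r='10'/><circle cx='80' r='10'/>")
    return f"<svg>{''.join(parts)}</svg>"

def assess(svg):
    """Stub critic: returns a critique string, or None when satisfied."""
    if "cx='40'" not in svg:
        return "add wheels"
    return None  # the model decides it is done

def agentic_loop(max_iters=4):
    svg, critique = "", None
    for i in range(max_iters):
        svg = draw_svg(critique)   # 1. generate SVG
        # 2. (real run: rasterize SVG to JPG via Chrome DevTools MCP)
        critique = assess(svg)     # 3. visually assess the output
        if critique is None:       # 4. stop when no critique remains
            return svg, i + 1
    return svg, max_iters

final_svg, iterations = agentic_loop()
```

Swapping `assess` for a call to a *different* model is exactly sorenjan's suggested fix for the self-grading problem: the loop structure stays the same, only the critic changes.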