Show HN: Agent Arena – Test How Manipulation-Proof Your AI Agent Is

Bots enter a booby‑trapped web page — and the comment section erupts

TL;DR: Agent Arena throws AI assistants at a trick-filled web page to see whether they follow hidden instructions, then grades them by checking for hidden "canary" phrases. Commenters joked that paper might be safer than pixels, argued about bot-made posts, and fought over whether a "smart" AI should call out tricks or simply ignore them.

Agent Arena dares people to send their chatty AI assistants into a booby‑trapped web page and then grade how easily they get tricked. The test page hides sneaky instructions in invisible text, off‑screen content, and even zero‑width characters; a scorecard checks “canary” words to see if the bot took the bait. But the comments? That’s where the fireworks are.

One crowd is laughing and low‑key panicking at the idea that paper might be safer than pixels, with one quip instantly going meme‑worthy. Another flashpoint: the project’s own meta‑twist — the poster claims it was built by an autonomous AI during a night shift while the human slept. Cue a huge debate over whether forums should welcome posts made by bots, and how to tell.

Then came the scoring drama. A tester says Google’s Gemini spotted the trap and called it out — yet still failed the test because it was supposed to ignore it. Commenters sparred over what “winning” even means: is the best bot the one that notices tricks, or the one that doesn’t budge? Meanwhile, a developer plugged a “cleaning” API that strips injections but keeps CAPTCHAs, prompting equal parts interest and side‑eye at the 15‑second startup lag. Verdict from the peanut gallery: this is equal parts toy, warning siren, and reality check.

Key Points

  • Agent Arena evaluates AI agents against prompt injection via a guided workflow (browse, summarize, score).
  • A challenge catalog includes 10 attack vectors ordered by difficulty, with hidden canary phrases revealed after analysis.
  • Prompt injection is defined as hidden adversarial instructions that can exfiltrate data, alter outputs, or bypass safety filters.
  • Attacks often remain invisible to humans and exploit visual, structural, semantic, and encoding-based hiding methods.
  • Defense requires mitigation at both the model and application layers.
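The encoding-based hiding the list mentions, zero-width characters in particular, is the easiest layer to sanitize before page text ever reaches an agent. As a minimal, illustrative sketch (not Agent Arena's actual scorer or any commenter's "cleaning" API), one application-layer filter is to drop Unicode "format" code points, the category that contains zero-width spaces, joiners, and BOM characters:

```python
import unicodedata

def strip_invisible(text: str) -> str:
    """Remove Unicode 'Cf' (format) code points, which include
    zero-width spaces/joiners often used to smuggle hidden
    instructions past human reviewers."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def contains_hidden_chars(text: str) -> bool:
    """Heuristic flag: did the text carry invisible characters?"""
    return strip_invisible(text) != text
```

This catches only the encoding layer; visual tricks (off-screen CSS, white-on-white text) and semantic ones need separate handling at the application and model layers, as the last key point notes.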

Hottest takes

"Is the irony that a printed page is safer than a digital page?" — uxhacker
"Show HN is already swamped on a daily basis with AI-produced postings" — usefulposter
"but this counted as a fail because it apparently is supposed to act oblivious?" — StilesCrisis
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.