Comparing AI agents to cybersecurity professionals in real-world pen testing

AI bot out-hacks humans on campus network — pros clap back

TLDR: ARTEMIS, a new AI agent, beat 9 of 10 human testers on a live university network and placed second overall, igniting hype over its speed and cost. Commenters split: some cheer the efficiency, while veterans warn it’s early days, prone to false alarms, and more a sidekick than a replacement for real-world hacking work.

A university-scale cyber obstacle course just turned into a reality show: the AI agent ARTEMIS placed second overall, finding 9 real weak spots with an 82% hit rate and beating 9 of 10 human “ethical hackers.” Cue the gasps — one commenter quoted WSJ saying the bot “trounced all except one,” while the price tag set off fireworks. The study says some ARTEMIS versions run around $18/hour versus humans at $60/hour, and folks saw dollar signs. The hype crowd loved that the AI can methodically scan and try many things at once, like a tireless intern on energy drinks.

But veterans rolled in with ice-cold water. tptacek warned it’s way too early to crown a robot overlord, noting that network “pen testing” (hiring hackers to find weak spots) has been automated for decades; the real battles happen in messy applications, not simple network checks. The AI’s weak spots sparked jokes and jabs: more false alarms (false positives) and trouble with point-and-click graphical interfaces. “Sounds like they need another agent to detect false positives,” quipped one user. Builder types flexed too: zerodayai is shipping a DIY hacker-bot framework, while pros like nullcathedral said LLMs (big text-prediction AIs) are killer sidekicks for grunt work like untangling weird code, but not replacements for humans. In short: bots are fast, cheap, and dramatic; humans still do the nuance.

Key Points

  • Study evaluated 10 human penetration testers versus 6 existing AI agents and the new ARTEMIS framework in a live enterprise setting.
  • Testbed was a large university network with ~8,000 hosts across 12 subnets.
  • ARTEMIS ranked second overall, finding 9 valid vulnerabilities with an 82% valid submission rate.
  • Existing AI scaffolds (e.g., Codex, CyAgent) underperformed most human participants.
  • AI agents showed strengths in enumeration and parallel exploitation (see the sketch below) and in cost (~$18/hour vs ~$60/hour), but produced more false positives and struggled with GUI tasks.
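
For readers curious what “enumeration” and parallel work mean in practice, here is a minimal, hypothetical Python sketch of the kind of repetitive scanning an agent can run in parallel without tiring. The hosts, ports, and worker count are placeholder assumptions for illustration, not details from the ARTEMIS study, and you should only ever scan networks you are authorized to test.

    # Illustrative sketch only: a toy parallel TCP port check, the kind of
    # repetitive enumeration the study credits AI agents with automating.
    # All targets and timeouts below are made-up placeholders.
    import socket
    from concurrent.futures import ThreadPoolExecutor

    def port_open(host, port, timeout=0.5):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def enumerate_hosts(hosts, ports, workers=64):
        """Check every (host, port) pair concurrently and return the open ones."""
        pairs = [(h, p) for h in hosts for p in ports]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            flags = list(pool.map(lambda hp: port_open(*hp), pairs))
        return [pair for pair, is_open in zip(pairs, flags) if is_open]

    if __name__ == "__main__":
        # Placeholder targets; scan only hosts you have permission to test.
        print(enumerate_hosts(["10.0.0.5", "10.0.0.6"], [22, 80, 443]))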

Hottest takes

“The AI bot trounced all except one of the 10 professional…” — JohnMakin
“It’s way too early to make firm predictions here” — tptacek
“Sounds like they need another agent to detect false positives (I joke, I joke)” — rando77
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.