January 19, 2026
Bots with lockpicks
The coming industrialisation of exploit generation with LLMs
AI is mass-producing hacks; reactions split between panic, "use it for defense," and memes
TLDR: A researcher showed AI can spin up dozens of software hacks quickly, even past heavy protections. Comments split between fear of token-powered cyber armies and confidence that defenders can run AI red teams in build pipelines, with jokes about spyware firms firing up thousands of bot coders.
An experiment on Sean Heelan’s blog lit up infosec chatter: he used two AI agents to turn a hidden software flaw (a “zero-day,” meaning unknown to the public) into more than 40 working hacks, fast and cheap. His warning? The bottleneck for break-ins may soon be token throughput (how much text an AI can process), not headcount. He tested on QuickJS (a tiny JavaScript engine) and claims one model aced every task while the other missed only two, at roughly tens of dollars per run.
That sent the crowd into a split-screen. protocolture admitted, “I genuinely don’t know who to believe,” wobbling between stories of miracle exploits and junk bug reports. The calm crowd, led by er4hn, argued this cuts both ways: defenders can run LLM red teams in CI (automated pre-release checks) to find the holes first. To the optimists, this is automation, not apocalypse. Just better tools.
Others just screamed. baxtr’s one-word verdict: “Scary.” simonw marveled at the laundry list of protections (randomized memory layouts, non-executable regions, sandboxing) and noted that the bot still wrote its exploit. Meanwhile, the meme lords went wild: GaggiX joked spyware shops would spin up “10k Claude Code” bots. Cue jokes about GPU farms, hoodie-wearing chatbots, and token-fueled cyber armies.
Key Points
- Agents built on Opus 4.5 and GPT-5.2 generated exploits for a zero-day in QuickJS under strict constraints.
- Over 40 distinct exploits were produced across six scenarios; GPT-5.2 solved all, Opus 4.5 solved all but two.
- Agents converted the vulnerability into an API to read/modify process memory via source reading, debugging, and trial-and-error.
- Most challenges were solved in under an hour with 30M tokens per run; Opus 4.5’s 30M tokens cost about $30.
- GPT-5.2 solved the hardest task by chaining seven function calls using glibc’s exit handlers despite multiple mitigations.