A Theory of Why Prompt Injection Works

AI guardrails may be getting catfished by the way words are dressed up

TLDR: Researchers say prompt injection works because chatbots can mix up instructions, data, and their own “inner voice” when everything arrives as plain text. Commenters were split between “great explanation” and “hold on, this means AI security is basically vibes,” with jokes about robots falling for cheap impersonation.

The big claim in this paper is deliciously unsettling: chatbots may not be "tricked" by magic jailbreak phrases so much as confused about who’s talking. The authors argue that an artificial intelligence model sees everything as one giant stream of text, then relies on labels like “system,” “user,” or “tool” to guess what should be trusted. And the comments? Oh, they immediately turned this into a mix of science seminar, security panic, and roast session.

One camp was genuinely impressed. People loved that the researchers also posted a readable blog version of the paper, with one commenter basically cheering, finally, academic writing for humans. Another dove straight into build-mode, pitching a cleaner way to tag roles so they can’t be faked as easily. But the real fireworks came from the skeptics, who said the whole thing proves today’s AI has no real security walls at all. One critic practically grabbed the thread by the collar and yelled: stop calling this “authorization,” because these systems can be fooled by what amounts to a slick costume change.

And yes, the jokes landed. One commenter compared prompt injection to social engineering for robots: if you talk like the boss, the bot may just hand over the keys. Another hit with the sharpest burn of the thread, warning we’re building systems that could be fooled by the future equivalent of a cereal box whistle. In other words: the paper explains the problem, but the crowd is still screaming, why were we pretending this was secure in the first place?

Key Points

•The article argues that prompt injection is caused by a flaw in how large language models interpret role information.
•According to the writeup, an LLM processes chat context as a single continuous text stream containing prompts, messages, tool outputs, and prior model text.
•The article says role tags such as system, user, think, assistant, and tool are used to partition that text stream into labeled segments.
•It describes roles as one of the few discrete controls over model behavior and as an attempted type system for language.
•The writeup states that roles now carry multiple signals, including trust, threat, identity, and generation mode, which can produce emergent behaviors.

Hottest takes

"LLMs in their current form provide no security boundaries or guarantees full stop" — ipython

"It’s like a social-engineering attack on an LLM" — shermantanktop

"Academic writing is designed to be frustrating to read" — simonw

June 22, 2026

Bot Catfished by Fancy Words

AI guardrails may be getting catfished by the way words are dressed up

Key Points

Hottest takes

June 22, 2026

Bot Catfished by Fancy Words

A Theory of Why Prompt Injection Works

AI guardrails may be getting catfished by the way words are dressed up

Key Points

Hottest takes

Save News