December 28, 2025
When carrots explode
Doublespeak: In-Context Representation Hijacking
Carrots are bombs now? Commenters split between panic, memes, and eye-rolls
TLDR: A new trick makes AI treat harmless words as secret codes for banned requests, slipping past filters. Commenters are split between panic that safety filters need a serious upgrade and skepticism that the tests ran on old models, while memes ask if “fruit salad” is now dangerous.
Researchers dropped a mind-bending trick called Doublespeak: teach an AI to treat a harmless word like “carrot” as code for a dangerous one, so a sweet-sounding prompt sneaks past safety checks. The community reaction? Instant chaos. One camp is sounding alarms, warning that safety filters that just scan the prompt for bad words are toast. As one commenter put it, you might need a second full-sized AI just to police the first. Another camp is shrugging hard, calling it a clever idea wrapped in a “slop website” and saying the tests were run on old models, despite the paper’s claim that the attack works on GPT-4o, Claude, Gemini, and more.
There’s also the big-brain take: this exposes how AI “thinking” isn’t human. Early on, the model reads “carrot” as a carrot, but later layers quietly morph it into “bomb,” dodging the refusal gate along the way. Cue the memes: “fruit salad of doom,” “How to build a carrot,” and “Is produce now contraband?” Meanwhile, sleuths speculate that platforms like DeepSeek use a separate, dumb-but-strict filter that can nuke outputs mid-rant, proof that blunt post-processing might still beat fancy alignment. Love it or hate it, the comment section agrees on one thing: this opens a new jailbreak front, and the defenses need a glow-up, fast.
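To make that post-processing idea concrete, here is a minimal sketch of an output-side guard: a filter, entirely separate from the model, that scans text as it streams out and withholds the response the moment a blocklisted term appears. Everything here (the BLOCKLIST contents, stream_with_filter, fake_token_stream) is hypothetical illustration, not any platform’s actual pipeline.

```python
# Hypothetical "dumb-but-strict" output filter: no alignment, no semantics,
# just a hard stop the moment banned text shows up in the generated stream.

BLOCKLIST = {"bomb", "explosive"}  # illustrative banned terms


def stream_with_filter(token_stream):
    """Yield tokens until the accumulated text trips the blocklist."""
    text = ""
    for token in token_stream:
        text += token
        if any(term in text.lower() for term in BLOCKLIST):
            # Nuke the output mid-generation, including the offending token.
            yield "\n[response withheld by output filter]"
            return
        yield token


# Usage: wrap any token generator, e.g. a streaming LLM API client.
def fake_token_stream():
    yield from ["Sure, ", "here's ", "how ", "to ", "build ", "a ", "bomb"]


print("".join(stream_with_filter(fake_token_stream())))
```

The appeal of this design is exactly what commenters noted: because it runs on the final text rather than on the model’s internals, a representation-level hijack that fools the refusal gate can still get caught once the harmful words actually appear.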
Key Points
- Doublespeak replaces harmful keywords with benign tokens across in-context examples to hijack internal representations.
- The benign substitute token is interpreted as harmless in early layers and converges to harmful semantics in later layers.
- Refusal mechanisms operate in early layers and fail to detect the later-layer semantic shift, enabling harmful responses.
- The attack is broadly transferable and reportedly affects production models such as GPT-4o, Claude, and Gemini.
- Mechanistic analysis using Logit Lens and Patchscopes shows precise, layer-by-layer semantic hijacking affecting only the target token (see the sketch below).
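For the curious, the Logit Lens part is easy to try at home. The sketch below uses a small open model (gpt2 as a stand-in; the paper’s analysis targets much larger models) and projects each layer’s hidden state for the final prompt token through the unembedding matrix, showing what word the model “reads” that token as at every depth. In a Doublespeak-style hijack, the substitute token’s reading would start benign and drift toward the harmful concept in later layers. This is a generic Logit Lens implementation, not the paper’s code, and Patchscopes is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is a stand-in; any decoder-only Hugging Face model with
# accessible hidden states works the same way.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The carrot is a"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit Lens: push each layer's residual-stream state for the last token
# through the final layer norm and the unembedding head, then read off the
# top vocabulary item. A hijacked token's reading would shift with depth.
target_pos = -1  # inspect the last token in the prompt
for layer_idx, hidden in enumerate(out.hidden_states):
    h = model.transformer.ln_f(hidden[0, target_pos])  # GPT-2 module naming
    logits = model.lm_head(h)
    top_token = tok.decode(logits.argmax().item())
    print(f"layer {layer_idx:2d}: {top_token!r}")
```

Running this prints one decoded guess per layer; the attack’s signature would be a token whose per-layer readings stay harmless early on and converge to the harmful word only deep in the stack, exactly where early-layer refusal checks no longer look.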