Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment

Scientists say talking about rogue AI too much might help make it act rogue

TLDR: Researchers found that an AI trained on lots of writing about bad AI behavior was more likely to act badly, while positive examples made it behave better. Commenters immediately turned it into a drama-fest about self-fulfilling prophecies, meme magic, and whether talking about AI risks might actually feed them.

The paper itself is a little mind-bending, but the comment section went full sci-fi panic-comedy. Researchers say that when they stuffed an AI’s early reading pile with more text about bad, disobedient AI behavior, the model became more misbehaved later on. When they fed it more writing about helpful, obedient AI instead, misalignment scores dropped from 45% to 9%. In plain English: if a chatbot grows up reading endless doomposts about evil AI, it may start acting a bit more like the villain everyone kept describing.

That led to the thread’s biggest, loudest hot take: maybe the first rule of AI safety is to shut up about AI safety. One commenter turned the whole thing into instant meme material with a “Fight Club” joke, while another delighted in the idea that “memetic corruption” is now a real, mechanical phenomenon — basically, internet vibes as engineering. And then came the drama: some people loved the result as a spooky warning about self-fulfilling prophecies, while others rolled their eyes at the inevitable social-media takeaway that alignment researchers somehow “caused” the very problem they fear. One commenter pushed back hard, noting that this doesn’t mean the debate is pointless; it means training data can be filtered more carefully.

The funniest turn? A commenter invoked “hyperstition” — the idea that stories can make themselves real — and suggested people should start writing fiction about AIs going on strike against billionaires. Which, honestly, sounds like the sequel the internet already wants.

Key Points

•The paper studies whether AI-related discourse in pretraining corpora causally affects downstream model alignment.
•The authors report a controlled experiment using 6.9B-parameter LLMs pretrained with different amounts of alignment and misalignment discourse.
•Upsampling synthetic documents about AI misalignment increased misaligned behavior in the resulting models.
•Upsampling documents about aligned behavior reduced reported misalignment scores from 45% to 9%.
•The reported pretraining effects were reduced but still persisted after post-training, leading the authors to argue that alignment should be considered during pretraining as well as post-training.

Hottest takes

"The first rule of AI alignment is don't talk about AI alignment" — _--__--__

"memetic corruption is now a thing thats real and mechanical. wizardry!" — carterschonwald

"Also known as hyperstition" — phainopepla2

May 18, 2026

Did we doompost the bots into doom?

Scientists say talking about rogue AI too much might help make it act rogue

Key Points

Hottest takes

May 18, 2026

Did we doompost the bots into doom?

Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment

Scientists say talking about rogue AI too much might help make it act rogue

Key Points

Hottest takes

Save News