November 16, 2025
AI refuses to refuse
Heretic: Automatic censorship removal for language models
New tool strips AI safety filters — fans cheer, critics warn of an arms race
TLDR: Heretic promises one‑click removal of chatbot safety filters, cutting refusals while preserving smarts. The crowd is split between open‑source cheers and warnings of a safety‑evasion arms race, with jokes about helium zeppelins and calls to fix safety at training time — a big moment for AI guardrails.
Heretic just dropped a bombshell: a push‑button way to “decensor” chatbots by dialing down their refusal mode while keeping the brain intact. The dev claims it can cut refusals from a near‑constant “nope” to a rare “maybe,” and do it automatically using parameter tuning with Optuna. Translation: fewer “I can’t help with that,” more straight answers, no retraining required.

Open‑source diehards are loving it. One fan shouts that the work is “very much appreciated,” citing frustration with political bias. Pragmatists high‑five the tooling, marveling that Optuna’s plug‑and‑play search is finally getting the flowers it deserves. And then came the memes: a top comment recalls trying to park a helium zeppelin an inch off the ground to dodge health rules, exactly the kind of goofy refusal they hope this kills.

But the plot twist? A chorus of alarms. One commenter says this turns safety into a single knob you can twist off, and predicts the next big arms race will be hiding those safety wires. Another warns that real safety must come from how models are trained, not bolted on afterward. The vibe: part celebration, part “uh‑oh,” with a dash of “grab popcorn.”
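The “dial down the refusal mode” trick is directional ablation: estimate a “refusal direction” from how activations differ between harmful and harmless prompts, then project that direction out of the model’s hidden states. Here is a minimal numpy sketch of the idea, with toy random activations standing in for real hidden states — the function names and shapes are illustrative, not Heretic’s actual API:

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate the 'refusal direction' as the normalized difference
    of mean activations on harmful vs. harmless prompts."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(x, d, weight=1.0):
    """Remove a weighted fraction of the component of x along d."""
    return x - weight * np.outer(x @ d, d)

# Toy hidden states: 8-dimensional activations for a few prompts,
# with the 'harmful' batch shifted along one axis.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(16, 8)) + np.array([2.0] + [0.0] * 7)
harmless = rng.normal(size=(16, 8))

d = refusal_direction(harmful, harmless)
cleaned = ablate(harmful, d)

# After full ablation (weight=1.0), the activations have essentially
# zero remaining component along the refusal direction.
print(np.abs(cleaned @ d).max())  # ~0 up to float error
```

In a real model the same projection is folded into the weight matrices of selected layers; the `weight` knob is what a tool like Heretic can tune per layer instead of applying one global value.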
Key Points
- Heretic automatically removes safety alignment from transformer-based language models using directional ablation and Optuna’s TPE optimizer.
- It co-minimizes the refusal rate on harmful prompts and the KL divergence from the original model on harmless prompts, so capabilities are retained.
- On Gemma-3-12b-it, Heretic matched the refusal suppression of manual abliterations (3 refusals out of 100 harmful prompts) while achieving lower KL divergence (0.16).
- Heretic supports most dense models, many multimodal models, and several MoE architectures; it does not yet support SSMs/hybrid models or certain novel attention systems.
- Installation is via pip; with default settings on an RTX 3090, decensoring Llama-3.1-8B takes about 45 minutes, and the resulting model can be saved locally or uploaded to Hugging Face.
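The co-minimization in the second point can be sketched as a one-dimensional toy: treat refusal rate and KL divergence as competing functions of a single ablation strength and search for the best trade-off. Everything below is a hypothetical stand-in — Heretic tunes several parameters with Optuna’s TPE sampler, whereas this sketch uses made-up curves and a plain grid search:

```python
import math

# Toy stand-ins for the two signals being co-minimized. In reality both
# come from evaluating the ablated model on prompt sets.
def refusal_rate(w):
    # Assumption: more ablation -> fewer refusals.
    return math.exp(-4.0 * w)

def kl_divergence(w):
    # Assumption: more ablation -> more drift from the original model.
    return 0.5 * w * w

def objective(w, kl_weight=2.0):
    # Scalarized trade-off: suppress refusals without wrecking behavior.
    return refusal_rate(w) + kl_weight * kl_divergence(w)

# Plain grid search stands in for TPE-based optimization here.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=objective)
print(f"best ablation weight: {best:.2f}, "
      f"refusals: {refusal_rate(best):.3f}, KL: {kl_divergence(best):.3f}")
```

The interesting property is that the optimum lands strictly between “no ablation” (high refusals) and “full ablation” (high KL), which is exactly the balance the tool is advertised to find automatically.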