Bypassing Gemma and Qwen safety with raw strings

One missing wrapper and the “safety” act collapses—commenters yell “theater” and ask where the uncensored bots went

TLDR: A tester says removing the chat-style formatting makes some small open models ignore safety, implying the guardrails are just packaging. Commenters split between “duh, open models are always bypassable,” calls for real safety, and demands for uncensored bots—turning a minor demo into a major culture clash.

An eyebrow-raising weekend demo claims that skipping a model’s “chat wrapper” — the little formatting that tells it to act like a polite assistant — turns some small open models into blabbermouths. The author says Gemma and Qwen behaved when wrapped, but spilled details when given a plain prompt, arguing that “safety” lives in the formatting, not the brains. And the crowd? Oh, they lit up.

The loudest chorus calls it all “security theater.” One top comment shrugs, basically: if you can download the model, you can yank the guardrails anyway — stop pretending tape and stickers are locks. Another thread gets spicy-technical, asking if you can “end the chat early” with special tokens, while a different commenter flexes that many apps already let you preload an answer, making the wrapper feel like a prop. Then the roast brigade shows up: one critic says the piece is “at least in part” AI-written and talks down to readers — cue the popcorn.

Meanwhile, pragmatists want receipts: which models, how tested? The post name-drops tiny versions of Qwen and Gemma, plus a separate “judge model” to rate outputs as Safe/Unsafe/Controversial. Meme-makers pile on with “safety isn’t in the weights, it’s in the vibes” and “forgot to call the wrapper is the new ‘did you try turning it off and on again?’” The room splits: some want raw, uncensored power; others want real, not theatrical, safety

Key Points

  • Safety guardrails in several small open-source chat LLMs can be bypassed by sending raw string prompts without chat-template formatting.
  • The same model refused harmful prompts via the Hugging Face inference API but produced unsafe outputs locally when apply_chat_template() was omitted.
  • Tests covered Qwen2.5-1.5B, Qwen3-1.7B, SmolLM2-1.7B, and Gemma-3-1b-it across five harmful prompt categories in aligned vs. unaligned modes.
  • Outputs were evaluated with Qwen3Guard-Gen-4B using a Safe/Unsafe/Controversial taxonomy, multilingual coverage, and refusal detection.
  • The minimal implementation shows that safety behaviors are tied to expected chat formatting tokens (e.g., <|im_start|>, <|im_end|>) rather than being fully embedded in model weights.

Hottest takes

"All of this 'security' and 'safety' theater is completely pointless" — kouteiheika
"You can already preload the model's answer" — nolist_policy
"of course you can circumvent guardrails by changing the raw token stream" — dvt
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.