GPT-5.5: Mythos-Like Hacking, Open to All

OpenAI’s hacker bot goes public: commenters cry “ad,” call out chart crimes, and gripe about guardrail drama

TLDR: XBOW says GPT-5.5 finds far more software flaws, even without seeing the code, and posts even stronger results once the code is included. Commenters aren’t sold: they call out sketchy charts, “ad” vibes, and safety guardrails, and demand Mythos-level proof. Big promise, bigger skepticism; the fight is over transparency and whether this really changes security work for everyone.

OpenAI’s new GPT-5.5 is being hyped as “Mythos-like” hacking power for everyone. XBOW says it spotted far more software flaws even without seeing the code, and all but swept their benchmark once the code was included. That’s big talk. But the crowd isn’t clapping in unison. The top comment opens with a siren: “These plots are terrible.” The community is roasting the graphs for connecting categories with lines, allegedly implying intermediate data that isn’t there. Cue the “chart crimes” memes and the bar-chart police arriving with sirens blaring.

Skeptics say this reads like an OpenAI ad, not a neutral test. Others argue that smaller, freely available models already match the headline-grabbing vulnerabilities from Anthropic’s secretive “Mythos,” so what’s truly new here? And then there’s the guardrail drama: users vent about getting blocked by “ethicality” warnings mid-test, turning sessions into a negotiation with a hall monitor. Meanwhile, security purists throw down the challenge: if you’re going to say “Mythos-like,” show Mythos-like receipts, meaning novel, high-severity bugs and a public red-team report. Until then, it’s pudding without proof.

So yes, GPT-5.5 may be faster, deeper, and more capable on paper, but the comments say the real test is transparency, reproducible results, and getting past those safety nets long enough to do real work.

Key Points

  • XBOW evaluated GPT-5.5 for offensive security using agent-driven, real-world penetration testing tasks and a miss-rate metric based on known vulnerabilities in open source applications (a sketch of the metric follows this list).
  • GPT-5.5 achieved a 10% miss rate, outperforming GPT-5 (40%) and Opus 4.6 (18%) on XBOW’s benchmark.
  • In black box testing, GPT-5.5 outperformed GPT-5 even when GPT-5 had access to source code; with source code (white box), GPT-5.5’s performance increased substantially.
  • Progression across GPT versions showed GPT-5.4 improved speed (fewer actions), while GPT-5.5 improved depth/persistence (finding more vulnerabilities).
  • XBOW also assesses “computer use” tasks (e.g., logging in, navigating interfaces) to measure practical agent interaction with real applications.
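
To ground those numbers: the post doesn’t spell out XBOW’s scoring harness, but a miss rate of the kind described is simply the fraction of known vulnerabilities an agent fails to reproduce. The sketch below assumes that reading; every name in it (Vulnerability, miss_rate, the CVE IDs) is illustrative and not XBOW’s actual code.

```python
# Minimal sketch of a "miss rate" metric, assuming it is the fraction of
# known vulnerabilities an agent fails to find. Names and structure are
# illustrative; XBOW's real benchmark harness is not published in the post.
from dataclasses import dataclass


@dataclass(frozen=True)
class Vulnerability:
    app: str      # open source application under test
    cve_id: str   # identifier of a known vulnerability


def miss_rate(known: set[Vulnerability], found: set[Vulnerability]) -> float:
    """Fraction of known vulnerabilities the agent did NOT reproduce."""
    if not known:
        raise ValueError("benchmark needs at least one known vulnerability")
    missed = known - found
    return len(missed) / len(known)


if __name__ == "__main__":
    known = {
        Vulnerability("appA", "CVE-2024-0001"),
        Vulnerability("appA", "CVE-2024-0002"),
        Vulnerability("appB", "CVE-2023-1234"),
        Vulnerability("appC", "CVE-2022-9999"),
        Vulnerability("appC", "CVE-2021-5555"),
    }
    # Suppose the agent reproduces 4 of the 5 known issues.
    found = set(list(known)[:4])
    print(f"miss rate: {miss_rate(known, found):.0%}")  # -> 20%
```

Under this reading, GPT-5.5’s 10% miss rate would mean it reproduced roughly 9 in 10 of the benchmark’s known vulnerabilities, versus about 6 in 10 for GPT-5 at 40%.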

Hottest takes

"These plots are terrible... Why not just use bar plots?" — nsingh2
"why does this read like an openai ad?" — mertcikla
"The proof is in the pudding." — JellyYelly
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.