June 28, 2026

Benchmark battle, comment-section bloodbath

Semgrep: GLM 5.2 beats Claude in our Cyber Benchmarks

A little-known AI just embarrassed a big name, and the comments got messy

TLDR: Semgrep says the open-weight GLM-5.2 beat Claude on one app-security test, a surprise result that makes cheaper, self-hosted AI look more serious. But commenters instantly split between hype, accusations of ad-like benchmarking, and jokes that the real missing feature was clearer receipts.

Semgrep dropped a spicy claim: GLM-5.2, a little-known AI model from Zhipu AI, did better than Claude on one security test for finding a common app bug called an IDOR (insecure direct object reference), which is basically when one user can peek at another user’s stuff. On paper, that’s a big upset. GLM-5.2 scored 39% F1 to Claude Code’s 32% in Semgrep’s prompt-only setup, while Semgrep’s own more guided system still came out on top at 53–61%. But the real show started in the comments, where readers immediately turned this from benchmark news into a full-on trust issue.
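For readers who have never met an IDOR in the wild, here is a minimal sketch of the bug class. Everything in it is hypothetical and made up for illustration (the `Invoice` type, the `INVOICES` store, both function names); it is not taken from Semgrep's benchmark.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    id: int
    owner_id: int
    total: float


# Hypothetical in-memory "database" keyed by invoice id.
INVOICES = {
    1: Invoice(1, owner_id=100, total=42.0),
    2: Invoice(2, owner_id=200, total=99.0),
}


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    # IDOR: the id from the request is trusted as-is, so any logged-in
    # user can read any invoice just by guessing its id.
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    # Fix: confirm the object belongs to the caller before returning it.
    invoice = INVOICES[invoice_id]
    if invoice.owner_id != current_user_id:
        raise PermissionError("invoice belongs to another user")
    return invoice
```

In this sketch, user 100 asking for invoice 2 gets it from the vulnerable version and a `PermissionError` from the checked one. Spotting the missing ownership check is roughly what the models were being asked to do.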

One camp was impressed and jumped straight into tinkering mode, with one commenter posting a how-to for launching GLM in a container like they were handing out backstage passes. Another camp went instantly apocalyptic: “export controls incoming?” became the paranoid-hot-take of the thread, with predictions that regulators could start leaning on sites to pull open models offline. And then came the skepticism parade. Several readers basically accused the post of smelling like marketing, not science, saying the headline was too vague about which Claude got beaten and arguing that comparing a simple prompt to a heavily guided security system is apples-to-oranges.

The funniest jab? One commenter dismissed the whole thing with “It reads like an ad” and another sneered that these bugs are the easy mode of security flaws. Translation: yes, the benchmark result is eye-catching, but the crowd is split between “open-source underdog rises!” and “nice ad, now show the real receipts.”

Key Points

  • Semgrep says GLM-5.2 scored 39% F1 on its IDOR benchmark, outperforming Claude Code’s 32% in a prompt-only evaluation.
  • Semgrep’s own multimodal pipeline scored 53–61% F1, which the article attributes in part to a purpose-built harness for static analysis.
  • The article’s main question is how much vulnerability-detection performance comes from the model itself versus the surrounding harness.
  • In Semgrep’s simplified test setup, models used a Pydantic AI harness with the same IDOR prompt and no endpoint discovery or guided navigation.
  • Semgrep describes GLM-5.2 as an open-weight MIT-licensed Mixture-of-Experts model from Zhipu AI with 750B total parameters, about 40B active per token, and up to 1M-token context.
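For context on the F1 numbers above: F1 is the harmonic mean of precision (how many flagged bugs were real) and recall (how many real bugs were flagged). The sketch below uses the standard definition; the counts in the usage note are invented for illustration and are not Semgrep's data.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct findings at all: precision and recall are both 0 (or undefined).
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

With hypothetical counts of 10 correct findings, 10 false alarms, and 10 missed bugs, `f1_score(10, 10, 10)` comes out to 0.5, since precision and recall are both 0.5. A score in the 30s means a model is either missing a lot of real bugs, raising a lot of false alarms, or both.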

Hottest takes

"GLM export controls incoming?" — solenoid0937
"Which I guess makes what semgrep sells obsolete" — veselin
"It reads like an ad" — danslo