April 7, 2026

Marathon coder or sleep-talker?

GLM-5.1: Towards Long-Horizon Tasks

GLM-5.1 runs a coding marathon — fans cheer, users rant about gibberish and 361GB downloads

TLDR: GLM-5.1 claims it keeps improving the longer it works, topping benchmarks and grinding a vector database from 3.5k to 21.5k queries per second. Commenters are split between wowed and wary: some praise quick, cheap fixes; others slam long-context nonsense, a “gimped” lite plan, and a 361 GB home-setup hurdle.

GLM-5.1 just showed up claiming it’s a marathon coder: the company says the longer it runs, the better it gets. They boast top scores on a tough software test (SWE-Bench Pro) and a wild demo where the model kept tuning a vector database for 600+ tries, jumping from 3.5k to 21.5k queries per second. Translation: it doesn’t just sprint; it keeps iterating, tweaking, and fixing for hours. They even flexed on GPU code and open-ended web builds—no score, just vibes and improvement.

But the comments? Oh, they’re spicy. One developer says GLM-5.1 is great for quick fixes and cheap, Sonnet-level work, but warns it can “start spouting gibberish” past super-long chats (128k tokens). Another is furious that the “Coding Lite” plan feels gimped, complaining about loops, contradictions, and even random Chinese characters popping up—“useless now” for serious code, they say. Local-model enthusiasts are howling at the hardware tax: quantized releases exist, but one flavor weighs in at 361 GB, prompting “good luck running this at home” jokes and fridge-sized download memes.

Meanwhile, a TypeScript fan swears it writes better code than big-name rivals—until long sessions where it goes off the rails. The mood swings from “marathon genius” to “sleep-talking after mile 20.” Even a moderator jumped in to warn the launch post could get flamed. Verdict: record-breaking claims meet real-world chaos, and the thread is loving the drama.

Key Points

  • GLM-5.1 is introduced as a flagship model for agentic engineering, achieving state-of-the-art on SWE-Bench Pro and outperforming GLM-5 on NL2Repo and Terminal-Bench 2.0.
  • The model is designed to remain effective over long horizons, sustaining optimization through iterative reasoning, experiments, and strategy revision across hundreds of rounds and thousands of tool calls.
  • Three demonstrations span tasks with varying feedback: vector search with a numeric metric, a GPU kernel benchmark with per-problem speedups, and an open-ended web app build without external metrics.
  • In VectorDBBench, restructuring into an outer optimization loop with the Claude Code framework enabled autonomous, multi-iteration submissions; GLM-5.1 reached 21.5k QPS after 600+ iterations and 6,000+ tool calls, about 6× the best 50-turn result (3,547 QPS).
  • Performance gains followed staircase-like transitions, including shifts to IVF cluster probing with f16 compression (~6.4k QPS) and a two-stage u8 prescoring + f16 reranking pipeline (~13.4k QPS), while maintaining Recall ≥95% after adjustments.
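The “outer optimization loop” in the VectorDBBench point boils down to a simple pattern: propose a configuration, benchmark it, keep the best, and revise. This is a minimal sketch of that loop, not GLM’s actual harness; the `propose` and `evaluate` callables are illustrative placeholders (here, a toy hill-climb on one number standing in for hundreds of tuning rounds):

```python
def optimize(initial_config, propose, evaluate, rounds=600):
    """Generic outer optimization loop: evaluate a candidate each round,
    keep the best-scoring configuration, and let `propose` revise it."""
    best_cfg, best_score = initial_config, evaluate(initial_config)
    for _ in range(rounds):
        cfg = propose(best_cfg, best_score)  # agent revises its strategy
        score = evaluate(cfg)                # run the benchmark
        if score > best_score:               # keep only improvements
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


# Toy usage: hill-climb toward the peak of -(x - 3)^2.
import random
random.seed(0)
cfg, score = optimize(
    0.0,
    propose=lambda best, s: best + random.uniform(-0.5, 0.5),
    evaluate=lambda c: -(c - 3.0) ** 2,
    rounds=200,
)
```

In the benchmark described above, `evaluate` would be a full VectorDBBench run returning QPS, and `propose` would be the model reasoning over prior results.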
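The two-stage pipeline in the last point (cheap u8 prescoring over everything, then f16 reranking of a shortlist) is a standard scalar-quantization trick. A minimal NumPy sketch, assuming uniform scalar quantization and squared-L2 scoring — the function names and parameters are illustrative, not GLM’s implementation:

```python
import numpy as np

def quantize_u8(x, lo, hi):
    """Scalar-quantize float vectors into uint8 for a cheap first pass."""
    scale = 255.0 / (hi - lo)
    return np.clip((x - lo) * scale, 0, 255).astype(np.uint8)

def two_stage_search(query, db_f16, db_u8, lo, hi, k=10, shortlist=100):
    # Stage 1: approximate squared-L2 distances on u8 codes (full scan).
    q_u8 = quantize_u8(query, lo, hi).astype(np.int32)
    diffs = db_u8.astype(np.int32) - q_u8
    approx = np.einsum("ij,ij->i", diffs, diffs)
    cand = np.argpartition(approx, shortlist)[:shortlist]
    # Stage 2: rerank only the shortlist at f16 precision.
    exact = np.linalg.norm(db_f16[cand].astype(np.float32) - query, axis=1)
    return cand[np.argsort(exact)[:k]]


# Usage: index 1,000 random 64-d vectors, query with one of them.
rng = np.random.default_rng(0)
db = rng.uniform(-1.0, 1.0, (1000, 64)).astype(np.float32)
query = db[123].copy()
res = two_stage_search(query, db.astype(np.float16),
                       quantize_u8(db, -1.0, 1.0), -1.0, 1.0, k=5)
```

The design trade-off is the one the recall numbers hint at: the u8 pass is fast but lossy, so the shortlist must be wide enough that the true neighbors survive into the f16 rerank.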

Hottest takes

"start spouting gibberish" — bigyabai
"It is useless now for any serious coding work" — RickHull
"not going to be able to run even with high end hardware" — Yukonv
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.