GLM-OCR: Accurate × Fast × Comprehensive

Fax-nightmare slayer or benchmark mirage—users clash over footnotes and Korean

TLDR: GLM-OCR claims top benchmark scores and fast, easy OCR for messy documents, stirring hope among users drowning in bad scans. But the comments split: skeptics question the benchmarks, Korean language support disappoints, and academics demand footnotes, turning the hype into a “prove it” moment for real-world use.

GLM-OCR just strutted in claiming top scores on a big document-understanding benchmark and promising fast, cheap, open-source magic for reading messy scans. Translation for non-nerds: it’s a tool that turns crumpled, blurry PDFs into actual text you can use. The crowd went wild, then split into camps. One fan begged for salvation, saying they deal with “faxed, wet-signed, low-res” contract chaos that big-name AI tools choke on. Others threw shade at the scoreboard: are benchmarks real life or just vibes? The skepticism peaked with a spicy “can a tiny model really beat Google’s Gemini?” cage match.

Language fans added fuel: a Korean user said most new models whiff on everyday screenshots, joking that whatever’s on their iPhone often does better. Meanwhile, researchers piled on with “footnote-gate,” complaining these models skip the tiny-but-crucial notes at the bottom of academic papers—dealbreaker territory. The alternatives brigade marched in with links to rivals like LightOnOCR-2-1B and PaddleOCR-VL-1.5, turning the thread into a shopping list.

So yes, GLM-OCR is fast, open, and easy (cloud option, no expensive hardware). But the vibe is hope vs. reality: business folks want a fax-fixer now, skeptics want proof beyond charts, and academics want their footnotes back. The memes? “Benchmarks are lying,” “Korean kerfuffle,” and “Footnote or bust.”

Key Points

  • GLM-OCR is a multimodal OCR model built on the GLM-V encoder–decoder, trained with an MTP (multi-token prediction) loss and reinforcement learning.
  • It integrates the CogViT vision encoder, a cross-modal connector with token downsampling, and a GLM-0.5B decoder in a two-stage layout-plus-recognition pipeline that uses PP-DocLayout-V3 for layout analysis (see the pipeline sketch after this list).
  • The model scores 94.62 on OmniDocBench V1.5, ranking first and achieving state-of-the-art results across document understanding tasks.
  • At 0.9B parameters, GLM-OCR runs on vLLM, SGLang, and Ollama, with inference efficient enough for both high-concurrency serving and edge deployments.
  • It is fully open-source with an SDK, offering cloud deployment via the Zhipu MaaS API or self-hosting with a complete pipeline; image input is passed as data URIs (see the API sketch after this list).
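
Since the bullets describe the architecture but not how the two stages fit together, here is a minimal, runnable sketch of that layout-then-recognize flow. It is not the real SDK: every name in it is hypothetical, with stubs standing in for PP-DocLayout-V3 (stage one, layout) and the 0.9B model (stage two, recognition).

    # Two-stage OCR sketch: layout detection, then per-region recognition.
    # All names here are hypothetical stand-ins, not the real GLM-OCR SDK.
    from dataclasses import dataclass

    @dataclass
    class Region:
        kind: str                          # e.g. "text", "table", "formula"
        bbox: tuple[int, int, int, int]    # (x0, y0, x1, y1) in pixels

    def detect_layout(image_path: str) -> list[Region]:
        # Stage 1 stand-in for PP-DocLayout-V3: typed regions in reading order.
        return [Region("text", (0, 0, 800, 200)),
                Region("table", (0, 220, 800, 600))]

    def recognize(image_path: str, region: Region) -> str:
        # Stage 2 stand-in for the CogViT encoder -> token-downsampling
        # connector -> GLM-0.5B decoder that transcribes one cropped region.
        return f"<transcription of {region.kind} at {region.bbox}>"

    def ocr_page(image_path: str) -> str:
        # Run layout first, then recognize each region, keeping reading order.
        return "\n\n".join(recognize(image_path, r)
                           for r in detect_layout(image_path))

    print(ocr_page("scan.png"))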
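
As for the “fast, easy” part: vLLM exposes an OpenAI-compatible server, so once the model is served, an OCR call with a data-URI image takes a few lines. Another hedged sketch: the localhost endpoint is vLLM’s default, the model ID is a guess (check the model card), and the cloud route via Zhipu’s MaaS API is presumably the same shape of call with a different base_url and a real API key.

    # Minimal sketch: call a self-hosted GLM-OCR behind vLLM's
    # OpenAI-compatible server, started with something like:
    #   vllm serve zai-org/GLM-OCR    (model ID is a guess; check the card)
    import base64
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # vLLM's default endpoint
        api_key="unused",                     # local vLLM ignores the key by default
    )

    # Encode the scan as a data URI, the input format the key points mention.
    with open("crumpled_fax.png", "rb") as f:
        data_uri = "data:image/png;base64," + base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="zai-org/GLM-OCR",  # hypothetical ID; substitute the real one
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": "Extract all text from this page."},
            ],
        }],
    )
    print(resp.choices[0].message.content)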

Hottest takes

"monstrously poor resolution, wet signed, all kinds of shit" — aliljet
"Is it possible for such a small model to outperform gemini 3 or is this a case of benchmarks not showing the reality?" — rdos
"has a common failure mode that prevents me from using: extracting footnotes and similar from the full text of academic works" — bugglebeetle
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.