Zpdf: PDF text extraction in Zig – 5x faster than MuPDF

New Zig PDF text ripper is blazing fast — cheers, side‑eye, and AI drama

TLDR: A new Zig-based PDF text extractor touts up to 5x faster single-core performance and eye-popping multi-core throughput. The crowd split between speed hype and demands for layout accuracy, feature parity, and trust amid “AI slop” accusations, with license wins and Python-binding jokes keeping it spicy.

A new tool called zpdf claims it can rip text out of PDFs way faster than the old favorite MuPDF: up to 5x faster on one core, with a peak of roughly 41,000 pages per second across multiple cores. The author credits memory-mapped reading, SIMD-accelerated string searching, and parallel page processing. Cue the crowd split: speed lovers are chanting “ship it,” while the careful crowd is clutching their clipboards.
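For readers wondering what memory-mapped reading plus string searching looks like in practice, here is a minimal Python sketch of the general idea. It is not zpdf's code: the function name `count_marker` is hypothetical, and Python's `mmap.find` is a plain byte search standing in for whatever optimized search the project uses.

```python
import mmap

def count_marker(path: str, marker: bytes) -> int:
    """Count non-overlapping occurrences of `marker` in a file.

    The file is memory-mapped read-only, so the OS pages data in on
    demand instead of the program copying the whole file into a buffer.
    """
    with open(path, "rb") as f:
        # Length 0 maps the entire file; this raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(marker)
            while pos != -1:
                count += 1
                # Resume the search just past the match we found.
                pos = mm.find(marker, pos + len(marker))
            return count
```

Calling `count_marker("some.pdf", b"endobj")` would count object terminators in a PDF, assuming they are not hidden inside compressed streams.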

On the hype side, the dev flexed that peak speed number, and curious fans asked why Zig seems so zippy. Pragmatists fired back: it’s not just speed that matters — they want real‑world accuracy, layout smarts (think two‑column pages and paragraph detection), and a feature checklist against MuPDF. A licensing twist added spice: MuPDF’s terms have tripped up closed‑source users, so zpdf’s permissive MIT license drew cheers.

Then came the plot twist: commenters spotted LLM‑generated commit messages and a fresh repo, sparking an authenticity brawl. One critic called it “AI slop” and grumbled about vibe‑coded weekend projects making the front page. Meanwhile, jokers begged for Python bindings for their “trash language,” and someone reminded everyone that MuPDF’s text tool is single‑threaded by design, so those parallel speed wins are a bit apples‑to‑oranges. Verdict? Speed thrills, but receipts and trust still rule.

Key Points

  • zpdf is a Zig-based PDF text extraction library optimized for speed and efficiency.
  • Benchmarks show sequential speedups of 2.7x–4.4x and parallel speedups of 5.2x–18x versus MuPDF’s mutool text extraction.
  • Peak parallel throughput reaches roughly 41,000 pages per second on the Intel SDM (Software Developer's Manual) dataset.
  • Features include memory-mapped I/O, SIMD string search, streaming output, multi-threaded page extraction, and broad font/compression support.
  • The project provides a CLI and Zig API, requires Zig 0.15.2+, and is MIT-licensed.
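The multi-threaded page extraction in the feature list can be pictured with a small Python sketch. It assumes pages can be processed independently and results must come back in page order. `extract_page` and `extract_all` are hypothetical names, and the per-page work is a trivial stand-in, not a real PDF decoder.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_page(page: bytes) -> str:
    # Stand-in for real per-page work: decode the bytes and trim whitespace.
    return page.decode("latin-1").strip()

def extract_all(pages: list[bytes], workers: int = 4) -> list[str]:
    # pool.map returns results in input order, even if pages finish out of order,
    # so the output can be written in page order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_page, pages))
```

In standard CPython, threads will not speed up CPU-bound pure-Python work because of the global interpreter lock, so this shows only the shape of the approach. A language like Zig can run such workers truly in parallel.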

Hottest takes

"~41K pages/sec peak throughput." — lulzx
"not really just about speed" — mpeg
"I'm not convinced that projects vibe coded over the evening deserve the HN front page…" — littlestymaar