Text classification with Python 3.14's ZSTD module

Python gets a word‑shrinking trick; crowd fights over AI-as-zip and gzip hacks

TLDR: Python 3.14 adds built‑in Zstd compression, enabling a neat trick: classify text by which compressor makes it smallest. The comments clash—some say AI is just compression, others argue zlib/gzip already do this, and skeptics caution compressors aren’t real classifiers—yet the method is fast and intriguing.

Python 3.14 just dropped a built‑in Zstandard “shrink ray” for words, and the community instantly turned it into a food fight. The blog shows a cheeky taco vs padel demo—compress a taco‑themed sentence with a “tacos” memory and the output comes out smaller than with a “padel” memory, so boom, it’s a classifier. Cue the info‑theory fan club cheering that compression can reveal patterns, backed by a [2023 paper] and nods to Kolmogorov complexity. But then the comments spice it up: Scene_Cast2 claims “LLMs”—large language models—are basically lossless compression machines and wonders if they out‑shrink Zstd. That sparked a mini flame war over whether AI is just fancy zip.
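The mechanics fit in a screenful. Here is a minimal sketch, assuming Python 3.14’s compression.zstd API from PEP 784 (ZstdCompressor and its FLUSH_BLOCK mode); the “memory” strings and the test sentence are invented for illustration, not taken from the blog:

    from compression.zstd import ZstdCompressor

    def extra_bytes(memory: str, sentence: str) -> int:
        """Bytes zstd needs for `sentence` after it has already seen `memory`."""
        comp = ZstdCompressor()
        # Prime the compressor's internal state with the class "memory".
        comp.compress(memory.encode(), ZstdCompressor.FLUSH_BLOCK)
        # Measure only the extra bytes the new sentence costs on top of that.
        return len(comp.compress(sentence.encode(), ZstdCompressor.FLUSH_BLOCK))

    memories = {
        "tacos": "tortilla salsa carnitas al pastor guacamole taqueria",
        "padel": "racket court smash volley glass wall bandeja lob",
    }
    sentence = "carnitas with salsa verde on a warm tortilla"
    # The label whose memory lets zstd shrink the sentence the most wins.
    print(min(memories, key=lambda label: extra_bytes(memories[label], sentence)))

Whichever memory makes the new sentence cheapest to encode is the predicted label; that really is the whole classifier.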

Meanwhile throwaway81523 strolls in like “actually…” and points out Python’s zlib can already do incremental compression, and even gzip can if you use the forbidden back door—calling the post’s “this changes everything” vibe a little overhyped. ks2048 throws ice water on the party, warning that real compressors and real classifiers want different things, so don’t expect magic ML for free. The practical drama? Compressors “remember” what you feed them, so the fix is to rebuild them often—fortunately, that’s fast. Drive‑by humor lands too: OutOfHere wonders if copy.deepcopy saves the day, and chaps adds a wild tale of using image file sizes to spot sketchy behavior. Verdict: half dazzled, half skeptical, 100% spicy.
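For the “zlib could already do this” camp, here is the same game sketched with the decades‑old stdlib tools throwaway81523 is pointing at: zlib.compressobj keeps state across calls and even accepts a preset dictionary via its zdict argument (the class texts below are again illustrative):

    import zlib

    def zlib_cost(memory: bytes, sentence: bytes) -> int:
        comp = zlib.compressobj(level=9, zdict=memory)  # seed the stream with a class "memory"
        out = comp.compress(sentence)
        out += comp.flush(zlib.Z_SYNC_FLUSH)            # flush buffered bytes, keep the stream open
        return len(out)

    tacos = b"tortilla salsa carnitas al pastor guacamole taqueria"
    padel = b"racket court smash volley glass wall bandeja lob"
    sentence = b"carnitas with salsa verde on a warm tortilla"
    print("tacos" if zlib_cost(tacos, sentence) < zlib_cost(padel, sentence) else "padel")

Same trick, older module; zlib’s compression objects even have a copy() method, which answers the deepcopy quip for that library at least.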

Key Points

  • Python 3.14 adds a standard compression.zstd module implementing Facebook’s Zstandard algorithm.
  • Zstd supports incremental compression with internal state, enabling per-class compressor models for text classification by comparing compressed lengths.
  • A code example builds per-class ZstdDict dictionaries and uses ZstdCompressor to classify a new text by whichever class yields the smallest compressed output.
  • ZstdCompressor’s compress method updates internal state, so the article recommends rebuilding the compressor to avoid contaminating class models.
  • A simple learning workflow is outlined with per-class buffers and frequent compressor rebuilds; parameters like window size can be tuned (a rough sketch follows this list).
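Putting those points together, here is a rough sketch of such a workflow, assuming the PEP 784 names (ZstdDict raw-content dictionaries and ZstdCompressor’s zstd_dict argument); the class design, the rebuild-on-every-update cadence, and the sample texts are illustrative assumptions rather than the article’s exact code:

    from compression.zstd import ZstdCompressor, ZstdDict

    class CompressionClassifier:
        def __init__(self) -> None:
            self.buffers: dict[str, bytes] = {}               # per-class training text
            self.compressors: dict[str, ZstdCompressor] = {}

        def learn(self, label: str, text: str) -> None:
            self.buffers[label] = self.buffers.get(label, b"") + text.encode() + b"\n"
            self._rebuild(label)                               # rebuilds are cheap, so do them often

        def _rebuild(self, label: str) -> None:
            # Treat the raw class buffer as that class's dictionary.
            class_dict = ZstdDict(self.buffers[label], is_raw=True)
            self.compressors[label] = ZstdCompressor(zstd_dict=class_dict)

        def classify(self, text: str) -> str:
            data = text.encode()
            sizes = {}
            for label, comp in self.compressors.items():
                # compress() mutates the compressor's internal state, so measure
                # the cost, then rebuild rather than reuse a "contaminated" model.
                sizes[label] = len(comp.compress(data) + comp.flush())
                self._rebuild(label)
            return min(sizes, key=sizes.get)

    clf = CompressionClassifier()
    clf.learn("tacos", "tortilla salsa carnitas al pastor guacamole taqueria")
    clf.learn("padel", "racket court smash volley glass wall bandeja lob")
    print(clf.classify("carnitas with salsa verde on a warm tortilla"))

Tuning knobs like window size would go through the module’s compression options (the CompressionParameter settings) if a class model needs more or less memory.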

Hottest takes

"LLMs are just lossless compression machines" — Scene_Cast2
"Python's zlib does support incremental compression" — throwaway81523
"I'm skeptical of using compressors directly for ML/AI/etc." — ks2048

Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.