How to squeeze a lexicon (2001) [pdf]

2001 hack to cram huge word lists has devs at war over "timeless" vs "just use a DB"

TLDR: A 2001 paper explains a compact way to check whether a word is in a list using a minimal automaton, beating bulkier methods. Comments are split between praising timeless efficiency and saying modern databases or Bloom filters are enough, with extra drama over DAWG vs ADFA definitions. Useful, funny, and very nerdy.

An old-school 2001 paper just resurfaced, showing how to pack massive word lists into a tiny, fast structure called a minimal ADFA: basically a super-compact map that answers “is this word in the list?” by sharing common prefixes and suffixes instead of storing every word separately. The paper compares this to tries (tree-like word maps), hashing, and Bloom filters (space savers that sometimes guess wrong).

The crowd went full popcorn. One camp swooned over “elegant, memory-cheap engineering,” claiming this beats bloated apps and modern overkill. Another camp shrugged: “Just use a database and move on,” saying today’s hardware makes all this compression theater unnecessary. Things got spicy when someone insisted a DAWG (a compressed word graph) isn’t the same thing as an ADFA; cue a pedant-off with diagrams and barking jokes. Bloom filter fans chimed in with “false positives are fine for most use-cases,” only to get roasted by spell-checker purists who need guaranteed answers.

Language nerds loved the bit about English vs highly inflected languages: affix stripping is cute until Polish or Finnish shows up. The memes? “2001 PDF still beats your microservice,” “Bloom filters are vibes with lies,” and a lot of “my grandma’s crossword app did this.” Classic internet energy: nostalgia, nitpicking, and surprisingly wholesome CS lore.
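
For the curious, here is what “a compact map that answers membership” boils down to. This is a hand-built toy automaton for the four-word lexicon {cat, cats, cut, cuts}, a minimal sketch for illustration rather than anything from the paper; note how “ca” and “cu” funnel into the same state, which is exactly where the space savings come from.

```python
# Toy minimal ADFA for the lexicon {"cat", "cats", "cut", "cuts"}.
# States are plain ints; each maps a letter to the next state.
TRANSITIONS = {
    0: {"c": 1},            # start state
    1: {"a": 2, "u": 2},    # "ca" and "cu" share everything that follows
    2: {"t": 3},
    3: {"s": 4},            # after "cat"/"cut" an optional "s" may follow
    4: {},
}
FINAL = {3, 4}              # states where a word is allowed to end


def contains(word: str) -> bool:
    """A word is in the lexicon iff walking its letters ends in a final state."""
    state = 0
    for ch in word:
        state = TRANSITIONS[state].get(ch)
        if state is None:
            return False
    return state in FINAL


if __name__ == "__main__":
    for w in ["cat", "cuts", "ca", "dog"]:
        print(w, contains(w))  # cat True, cuts True, ca False, dog False
```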

Key Points

  • Minimal ADFAs provide compact, fast-access representations for finite string sets (lexicons).
  • Traditional DFA minimization is resource-intensive for large collections; the paper promotes a more efficient construction algorithm for minimal ADFAs (a sketch in that spirit follows this list).
  • The algorithm is presented for three ADFA variants, with minor improvements and comparisons to other data structures.
  • Hashing approaches, including perfect hashing and Bloom filters, trade space against certainty of the answer and may still require the strings themselves to be stored or compressed (a Bloom filter sketch follows this list).
  • Tries (abbreviated and full) offer different space-time trade-offs, with memory reduction possible via sorted lists or binary search trees for child pointers (a trie sketch follows this list).
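
On the construction point: the efficient approach the paper builds on is usually presented as adding words in sorted order, freezing and minimizing the part of the previous word that can no longer change, and reusing equivalent states via a register. The sketch below follows that general idea under the assumption of non-empty, strictly sorted input; the class and method names are illustrative simplifications, not the paper’s own pseudocode.

```python
class State:
    __slots__ = ("transitions", "final")

    def __init__(self):
        self.transitions = {}  # letter -> State
        self.final = False

    def signature(self):
        # Two states are interchangeable if they agree on finality and on their
        # outgoing edges (which, by construction, point at registered states).
        return (self.final,
                tuple(sorted((c, id(s)) for c, s in self.transitions.items())))


class IncrementalADFA:
    def __init__(self):
        self.root = State()
        self.register = {}   # signature -> canonical State
        self.previous = ""   # last word added

    def add(self, word):
        assert word > self.previous, "words must be non-empty and sorted"
        # 1. Longest common prefix with the previous word.
        common = 0
        while (common < len(self.previous) and common < len(word)
               and self.previous[common] == word[common]):
            common += 1
        # 2. Walk to the state at the end of that prefix.
        state = self.root
        for ch in word[:common]:
            state = state.transitions[ch]
        # 3. The previous word's tail can never grow new branches: minimize it.
        self._replace_or_register(state)
        # 4. Append the unshared tail of the new word as fresh states.
        for ch in word[common:]:
            nxt = State()
            state.transitions[ch] = nxt
            state = nxt
        state.final = True
        self.previous = word

    def finish(self):
        # Minimize whatever is left of the last word's path.
        self._replace_or_register(self.root)

    def _replace_or_register(self, state):
        if not state.transitions:
            return
        # With sorted input, the most recently added child has the largest label.
        last = max(state.transitions)
        child = state.transitions[last]
        self._replace_or_register(child)
        key = child.signature()
        if key in self.register:
            state.transitions[last] = self.register[key]  # reuse an equivalent state
        else:
            self.register[key] = child

    def contains(self, word):
        state = self.root
        for ch in word:
            state = state.transitions.get(ch)
            if state is None:
                return False
        return state.final


if __name__ == "__main__":
    adfa = IncrementalADFA()
    for w in ["cat", "cats", "cut", "cuts", "dog"]:
        adfa.add(w)
    adfa.finish()
    print(adfa.contains("cuts"), adfa.contains("cu"))  # True False
    print(1 + len(adfa.register), "states instead of five separate letter chains")
```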
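
The hashing trade-off is easy to feel in code: a Bloom filter is tiny and never misses a word you inserted, but it can say “yes” to words you never added, which is exactly what the spell-checker purists object to. A minimal sketch using only the standard library; the bit count and digest-slicing scheme are arbitrary choices, not from the paper.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, but false positives are possible."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, word):
        # Slice one SHA-256 digest into k roughly independent bit positions.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word):
        # False means "definitely not in the set"; True only means "probably".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


if __name__ == "__main__":
    bf = BloomFilter()
    for w in ["cat", "cut", "dog"]:
        bf.add(w)
    print(bf.might_contain("cat"))    # True
    print(bf.might_contain("zebra"))  # almost certainly False, but not guaranteed
```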
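
And the trie point about child pointers: instead of a fixed 26- or 256-slot array per node, keep the children in a sorted list and binary-search it on lookup, trading a little time per step for much less memory per node. A minimal sketch with an illustrative node layout, not the paper’s.

```python
import bisect


class TrieNode:
    __slots__ = ("labels", "children", "final")

    def __init__(self):
        self.labels = []     # sorted edge labels
        self.children = []   # child nodes, parallel to `labels`
        self.final = False

    def child(self, ch):
        # Binary search over a sorted list replaces a fixed per-node array.
        i = bisect.bisect_left(self.labels, ch)
        if i < len(self.labels) and self.labels[i] == ch:
            return self.children[i]
        return None


def insert(root, word):
    node = root
    for ch in word:
        nxt = node.child(ch)
        if nxt is None:
            nxt = TrieNode()
            i = bisect.bisect_left(node.labels, ch)
            node.labels.insert(i, ch)
            node.children.insert(i, nxt)
        node = nxt
    node.final = True


def contains(root, word):
    node = root
    for ch in word:
        node = node.child(ch)
        if node is None:
            return False
    return node.final


if __name__ == "__main__":
    root = TrieNode()
    for w in ["cat", "cats", "cut"]:
        insert(root, w)
    print(contains(root, "cats"), contains(root, "ca"))  # True False
```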

Hottest takes

“2001 PDF still beats half the SaaS I pay for” — retroByte
“Bloom filters are just vibes with lies” — hash_skeptic
“It’s called a DAWG, and yes I’m here to bark” — automataFan