How to squeeze a lexicon (2001) [pdf]

2001 hack to cram huge word lists has devs at war over "timeless" vs "just use a DB"

TLDR: A 2001 paper explains a compact way to check whether a word is in a list using a minimal automaton, beating bulkier methods. Comments are split between praising timeless efficiency and saying modern databases or Bloom filters are enough, with extra drama over DAWG vs ADFA definitions. Useful, funny, and very nerdy.

An old-school 2001 paper just resurfaced, showing how to pack massive word lists into a tiny, fast structure called a minimal ADFA: basically a super-compact map that answers “is this word in the list?” by sharing common prefixes and suffixes instead of storing every word separately. The paper compares this to tries (tree-like word maps), hashing, and Bloom filters (space savers that sometimes guess wrong).

The crowd went full popcorn. One camp swooned over “elegant, memory-cheap engineering,” claiming this beats bloated apps and modern overkill. Another camp shrugged: “Just use a database and move on,” saying today’s hardware makes all this compression theater unnecessary. Things got spicy when someone insisted a DAWG (a compressed word graph) isn’t the same thing as an ADFA; cue a pedant-off with diagrams and barking jokes. Bloom filter fans chimed in with “false positives are fine for most use-cases,” only to get roasted by spell-checker purists who need guaranteed answers.

Language nerds loved the bit about English vs highly inflected languages: affix stripping is cute until Polish or Finnish shows up. The memes? “2001 PDF still beats your microservice,” “Bloom filters are vibes with lies,” and a lot of “my grandma’s crossword app did this.” Classic internet energy: nostalgia, nitpicking, and surprisingly wholesome CS lore.
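
For the curious, here is what “a compact map that answers membership” boils down to. This is a hand-built toy automaton for the four-word lexicon {cat, cats, cut, cuts}, a minimal sketch for illustration rather than anything from the paper; note how “ca” and “cu” funnel into the same state, which is exactly where the space savings come from.

```python
# Toy minimal ADFA for the lexicon {"cat", "cats", "cut", "cuts"}.
# States are plain ints; each maps a letter to the next state.
TRANSITIONS = {
    0: {"c": 1},            # start state
    1: {"a": 2, "u": 2},    # "ca" and "cu" share everything that follows
    2: {"t": 3},
    3: {"s": 4},            # after "cat"/"cut" an optional "s" may follow
    4: {},
}
FINAL = {3, 4}              # states where a word is allowed to end


def contains(word: str) -> bool:
    """A word is in the lexicon iff walking its letters ends in a final state."""
    state = 0
    for ch in word:
        state = TRANSITIONS[state].get(ch)
        if state is None:
            return False
    return state in FINAL


if __name__ == "__main__":
    for w in ["cat", "cuts", "ca", "dog"]:
        print(w, contains(w))  # cat True, cuts True, ca False, dog False
```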

Key Points

  • Minimal ADFAs provide compact, fast-access representations for finite string sets (lexicons).
  • Traditional DFA minimization is resource-intensive for large collections; the paper promotes a more efficient construction algorithm for minimal ADFAs (a sketch in that spirit follows this list).
  • The algorithm is presented for three ADFA variants, with minor improvements and comparisons to other data structures.
  • Hashing approaches, including perfect hashing and Bloom filters, trade space against certainty of the answer and may still require the strings themselves to be stored or compressed (a Bloom filter sketch follows this list).
  • Tries (abbreviated and full) offer different space-time trade-offs, with memory reduction possible via sorted lists or binary search trees for child pointers (a trie sketch follows this list).
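
On the construction point: the efficient approach the paper builds on is usually presented as adding words in sorted order, freezing and minimizing the part of the previous word that can no longer change, and reusing equivalent states via a register. The sketch below follows that general idea under the assumption of non-empty, strictly sorted input; the class and method names are illustrative simplifications, not the paper’s own pseudocode.

```python
class State:
    __slots__ = ("transitions", "final")

    def __init__(self):
        self.transitions = {}  # letter -> State
        self.final = False

    def signature(self):
        # Two states are interchangeable if they agree on finality and on their
        # outgoing edges (which, by construction, point at registered states).
        return (self.final,
                tuple(sorted((c, id(s)) for c, s in self.transitions.items())))


class IncrementalADFA:
    def __init__(self):
        self.root = State()
        self.register = {}   # signature -> canonical State
        self.previous = ""   # last word added

    def add(self, word):
        assert word > self.previous, "words must be non-empty and sorted"
        # 1. Longest common prefix with the previous word.
        common = 0
        while (common < len(self.previous) and common < len(word)
               and self.previous[common] == word[common]):
            common += 1
        # 2. Walk to the state at the end of that prefix.
        state = self.root
        for ch in word[:common]:
            state = state.transitions[ch]
        # 3. The previous word's tail can never grow new branches: minimize it.
        self._replace_or_register(state)
        # 4. Append the unshared tail of the new word as fresh states.
        for ch in word[common:]:
            nxt = State()
            state.transitions[ch] = nxt
            state = nxt
        state.final = True
        self.previous = word

    def finish(self):
        # Minimize whatever is left of the last word's path.
        self._replace_or_register(self.root)

    def _replace_or_register(self, state):
        if not state.transitions:
            return
        # With sorted input, the most recently added child has the largest label.
        last = max(state.transitions)
        child = state.transitions[last]
        self._replace_or_register(child)
        key = child.signature()
        if key in self.register:
            state.transitions[last] = self.register[key]  # reuse an equivalent state
        else:
            self.register[key] = child

    def contains(self, word):
        state = self.root
        for ch in word:
            state = state.transitions.get(ch)
            if state is None:
                return False
        return state.final


if __name__ == "__main__":
    adfa = IncrementalADFA()
    for w in ["cat", "cats", "cut", "cuts", "dog"]:
        adfa.add(w)
    adfa.finish()
    print(adfa.contains("cuts"), adfa.contains("cu"))  # True False
    print(1 + len(adfa.register), "states instead of five separate letter chains")
```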
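
The hashing trade-off is easy to feel in code: a Bloom filter is tiny and never misses a word you inserted, but it can say “yes” to words you never added, which is exactly what the spell-checker purists object to. A minimal sketch using only the standard library; the bit count and digest-slicing scheme are arbitrary choices, not from the paper.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, but false positives are possible."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, word):
        # Slice one SHA-256 digest into k roughly independent bit positions.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word):
        # False means "definitely not in the set"; True only means "probably".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


if __name__ == "__main__":
    bf = BloomFilter()
    for w in ["cat", "cut", "dog"]:
        bf.add(w)
    print(bf.might_contain("cat"))    # True
    print(bf.might_contain("zebra"))  # almost certainly False, but not guaranteed
```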
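
And the trie point about child pointers: instead of a fixed 26- or 256-slot array per node, keep the children in a sorted list and binary-search it on lookup, trading a little time per step for much less memory per node. A minimal sketch with an illustrative node layout, not the paper’s.

```python
import bisect


class TrieNode:
    __slots__ = ("labels", "children", "final")

    def __init__(self):
        self.labels = []     # sorted edge labels
        self.children = []   # child nodes, parallel to `labels`
        self.final = False

    def child(self, ch):
        # Binary search over a sorted list replaces a fixed per-node array.
        i = bisect.bisect_left(self.labels, ch)
        if i < len(self.labels) and self.labels[i] == ch:
            return self.children[i]
        return None


def insert(root, word):
    node = root
    for ch in word:
        nxt = node.child(ch)
        if nxt is None:
            nxt = TrieNode()
            i = bisect.bisect_left(node.labels, ch)
            node.labels.insert(i, ch)
            node.children.insert(i, nxt)
        node = nxt
    node.final = True


def contains(root, word):
    node = root
    for ch in word:
        node = node.child(ch)
        if node is None:
            return False
    return node.final


if __name__ == "__main__":
    root = TrieNode()
    for w in ["cat", "cats", "cut"]:
        insert(root, w)
    print(contains(root, "cats"), contains(root, "ca"))  # True False
```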

Hottest takes

“2001 PDF still beats half the SaaS I pay for” — retroByte
“Bloom filters are just vibes with lies” — hash_skeptic
“It’s called a DAWG, and yes I’m here to bark” — automataFan