Show HN: TurboQuant for vector search – 2-4 bit compression

Dev drops no‑training search trick; crowd split between “revolution” and “duh”

TLDR: A dev released TurboQuant, a training-free method that compresses search vectors to 2–4 bits per coordinate with strong benchmarks and fast online updates. Commenters are split between calling it a legit breakthrough and saying it's just a well-packaged obvious trick, while others hype real wins like model sizes dropping 28–42%.

Hold onto your hyperspheres: a lone dev just dropped TurboQuant in Python, a mathy compression trick that squishes giant AI vectors down to 2–4 bits per coordinate without any training, and Hacker News lit up. The pitch is simple enough for normies: spin the data with a random rotation, snap each coordinate into tiny buckets, get up to 16x smaller storage and zippy searches, no retraining, add stuff anytime. Fans are calling it a "diet for your embeddings" and posting speed/size wins, with recall (accuracy) numbers that look almost too clean.
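The "spin, then snap into buckets" recipe can be sketched in a few lines of NumPy. This is a toy illustration, not py-turboquant's API: the dimensions are made up, the QR-based rotation stands in for whatever rotation the library uses, and uniform bins stand in for its Lloyd-Max quantizers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy database: 1000 vectors of dimension 64, stored as float32 (32 bits/coord).
d = 64
db = rng.standard_normal((1000, d)).astype(np.float32)

# "Spin the data": a shared random orthogonal rotation (via QR decomposition)
# spreads each vector's energy evenly across coordinates.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
rotated = db @ Q.astype(np.float32)

# "Snap into tiny buckets": uniform 2-bit scalar quantization (4 levels per
# coordinate). TurboQuant uses precomputed Lloyd-Max quantizers instead;
# uniform bins are just a stand-in here.
levels = 4
lo, hi = rotated.min(), rotated.max()
codes = np.clip(((rotated - lo) / (hi - lo) * levels).astype(np.uint8),
                0, levels - 1)

# Packing four 2-bit codes per byte takes each coordinate from 32 bits to 2,
# i.e. the 16x storage shrink the post is talking about.
```

The "no retraining, add stuff anytime" part falls out of the same sketch: the rotation and bins don't depend on the data, so a new vector can be rotated and quantized on arrival.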

Then the drama started. Veteran voice antirez swooped in like "this is what I've been saying"—basically rotate, quantize, chill—sparking the eternal fight: is this clever math or an obvious trick with great branding? Meanwhile, another dev flexed a power move: plugging TurboQuant into llama.cpp and claiming it shrinks AI model weights by 28–42% (link), sending the "ship it" crowd into overdrive. One commenter even suggested mixing it with high‑performance search stacks for "near zero indexing time" and better recall.

The vibe? Equal parts “this changes everything” and “we’ve seen this movie.” Jokes flew about tossing vectors into a “spin cycle,” and someone invoked Shannon like a meme—“the math dad said you can’t do better.” Whether it’s a breakthrough or a rebrand, the crowd agrees on one thing: those benchmarks slap (antirez tweet).

Key Points

  • py-turboquant implements TurboQuant in Python to compress vectors to 2–4 bits per coordinate for vector search.
  • The method is data‑oblivious: a shared random rotation and precomputed Lloyd‑Max quantizers remove the need for training, enabling online additions.
  • Queries are rotated once and scored against centroids without decompressing database vectors, reducing computation.
  • Benchmarks on GloVe (d=200) and OpenAI DBpedia (d=1536, d=3072) show 8–16× compression with millisecond‑level search latencies and high recall.
  • The approach claims distortion within 2.7× of Shannon’s distortion‑rate lower bound, suggesting near‑optimal compression for given bit widths.
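The third bullet, scoring queries against centroids without decompressing the database, can be sketched with the classic lookup-table trick. Everything here is illustrative: the centroid values and codes are invented, and TurboQuant's actual per-coordinate quantizers differ.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 500

# Shared random rotation (illustrative stand-in for TurboQuant's).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Hypothetical 2-bit codebook: 4 reconstruction values per coordinate
# (TurboQuant would precompute these via Lloyd-Max; fixed values here).
centroids = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)

# Compressed database: each coordinate is a 2-bit index into `centroids`.
codes = rng.integers(0, 4, size=(n, d), dtype=np.uint8)

# A query is rotated once...
query = rng.standard_normal(d).astype(np.float32)
q_rot = query @ Q

# ...then a small table is built: table[k, j] = centroids[k] * q_rot[j].
table = np.outer(centroids, q_rot)          # shape (4, d)

# Each compressed vector is scored by summing the table entries its codes
# select -- no database vector is ever expanded back to floats.
scores = table[codes, np.arange(d)].sum(axis=1)
best = int(np.argmax(scores))
```

The point of the table is that per-vector work is reduced to index lookups and adds, which is why the compressed representation is also the search representation.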

Hottest takes

“you just rotate and use the 4 bit centroids” — antirez
“shrinks models 28–42%” — pidtom
“near zero indexing time and better recall” — richardjennings
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.