November 28, 2025
Vectors, vibes, and 55GB of drama
28M Hacker News comments as vector embedding search dataset
28M HN comments go searchable; privacy jitters, 55GB shock, and 'worth it?' vibes
TLDR: ClickHouse released a 28.74M-comment Hacker News dataset with “smart” search, sparking excitement and privacy worries. The community debated file size shock (55GB) and whether vector search beats regular search, while one user flexed they’d already built a similar tool—making this both impressive and contentious.
ClickHouse just dropped a monster: a searchable dataset of 28.74 million Hacker News comments with “smart” labels called embeddings (think: search by meaning, not just words). The crowd went full drama mode. Some cheered the live demo of semantic search and AI summaries using OpenAI, while others clutched pearls over permanence and privacy. One commenter sighed, “Oh to have had a delete option”, setting off a mini panic about whether our ancient hot takes are now forever findable.
Then came the flex: “I already did this” energy from a user who has been embedding HN comments since 2023 and even hosts the result at hn.fiodorov.es, with code on GitHub. Cue the classic internet combo of applause and side-eye. Meanwhile, the 55GB number triggered a collective spit-take (yes, all of HN plus metadata in one thicc file), and the comment section lit up with “is vector search better than normal search?” debates. One skeptic asked whether the juice is worth the squeeze, while a pragmatist happily scratched a todo off their list. The vibe: half impressed by the speed and scale, half worried their decade-old snark just got a second life. Welcome to the era of searchable vibes.
Key Points
- Dataset includes 28.74 million Hacker News posts with 384-dimensional embeddings generated by all-MiniLM-L6-v2.
- ClickHouse provides the complete dataset as a single Parquet file hosted in an S3 bucket.
- Instructions show how to create a ClickHouse table, load the Parquet data, and build an HNSW vector index.
- Example HNSW parameters M=64 and ef_construction=512 are given, with guidance to tune for performance and quality.
- A demo app retrieves posts via vector search and summarizes them using LangChain and OpenAI’s gpt-3.5-turbo; it requires OPENAI_API_KEY to be set.
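
For anyone wondering what "search by meaning" boils down to, here is a minimal Python sketch of the exact, brute-force version of the lookup that an HNSW index approximates. It is an illustration only, not ClickHouse's implementation: the three comments and their 3-number vectors are made up (the real dataset uses 384 dimensions), the helper names `cosine_distance` and `top_k` are hypothetical, and cosine distance is an assumed metric, since the summary above does not say which one is used.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, rows, k=2):
    """Exact nearest-neighbour search: score every row, keep the k closest."""
    scored = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [text for text, _ in scored[:k]]

# Toy 3-d "embeddings" (invented for illustration, not real model output).
COMMENTS = [
    ("ClickHouse is fast for analytics", [0.9, 0.1, 0.0]),
    ("I wish I could delete old comments", [0.0, 0.2, 0.9]),
    ("Columnar databases scan quickly", [0.8, 0.3, 0.1]),
]

# A query vector pointing in roughly the "databases" direction.
print(top_k([1.0, 0.2, 0.0], COMMENTS))
```

Scanning every row like this is exact but scales linearly with the table, which is the cost an HNSW index is built to avoid at 28.74 million rows. The M and ef_construction parameters mentioned above control how densely the index graph is connected and how thoroughly it is built, trading build time and memory for result quality.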