March 18, 2026
Data drop or drama bomb?
Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m
47M HN posts in one mega-archive — nerds rejoice, lawyers fume
TLDR: A new, constantly updating archive makes 47M+ Hacker News posts easy to download and query, thrilling data fans and educators. The comments blew up over licensing, privacy laws, and whether daily merges might erase deleted or moderated content, turning a tech flex into a debate about legality and historical accuracy.
Hacker News just got bottled and shelved: a public dataset claims to pack 47M+ posts and comments into a neat bundle that updates every five minutes. Fans call it a time machine for one of tech’s most influential forums; skeptics call it a lawsuit waiting to happen. The thread was a popcorn buffet.
On the “wow” side, teachers begged for slice-and-serve subsets — like “Show HN” launches or the famous monthly “Who’s Hiring” threads — to train students on real-world data. On the “uh-oh” side, one commenter demanded to know the license, dropping a spicy “do-whatever-you-want” joke that had the crowd howling. Another waved the YC legal page like a warning flag, accusing the project of playing fast and loose with privacy rules (think Europe’s GDPR and California’s CPRA). Meanwhile, data purists worried the nightly “clean” merge might erase deleted or moderated comments, turning live snapshots into censored history.
There was even a mini-food fight over why “deleted” and “dead” are stored as 0/1 instead of true/false, because of course Hacker News can’t resist a modeling debate. Jokes flew about “archiving every ‘Actually…’ since 2006” and “DuckDB-ing history,” but under the memes, the vibe split: open access vs. legal risk, completeness vs. cleanliness. Pass the popcorn.
Key Points
- A complete Hacker News archive since 2006 is published as Parquet on Hugging Face and updated every 5 minutes.
- The dataset includes 47,134,791 items, with completed monthly files through 2026-02 and live 5-minute blocks for today.
- Data is organized as one Parquet file per month, plus 5-minute blocks under today/ that are merged into the monthly file at midnight UTC.
- Accompanying stats.csv and stats_today.csv track counts, ID ranges, file sizes, fetch durations, and commit timestamps for verification.
- The dataset is directly queryable with DuckDB from Hugging Face and is compatible with datasets, pandas, and huggingface_hub.
- The type field is encoded as small integers (1–5), and deleted/dead as 0/1 flags rather than booleans.