Show HN: Open database of link metadata for large-scale analysis

Open link database drops and everyone asks if it can survive messy, changing feeds

TLDR: A new open database archives link titles, descriptions, and dates so people can study link rot and trends. Commenters zeroed in on one big worry: as RSS feeds change over time, should the project keep the raw data, normalize to a fixed format, or do both so long-term analysis stays reliable?

An open trove of link metadata just landed, and the crowd pounced on the messiest question: what happens when feeds change? The project RSS-Link-Database-2025 tracks titles, descriptions, and dates from RSS (Really Simple Syndication) so researchers can study “link rot” and save the web’s breadcrumbs. Aherontas set the tone by asking how it handles “feed evolution over time” — force one tidy format, or keep the raw data too?

That question dominated. Long-term datasets, the kind that span years and software changes, can break hearts (and scripts). Translation: the web is a moving target. Many want both: clean, comparable fields for easy analysis, and the original payload so nothing gets lost. The project’s wider suite (a capture app and yearly repos) drew cheers for steady archiving, while skeptics warned that summaries get truncated and fields get yanked, so any plan must survive real-world mess.
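For the curious, here is roughly what "keep both" could look like in Python. This is a minimal sketch, not the project's actual schema: the `LinkRecord` class, the `normalize` helper, and the candidate key names (`summary`, `published`, and so on) are all assumptions made for illustration. The idea is to map whatever a feed sends onto a small fixed set of fields, while storing the untouched entry and a hash of it alongside.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class LinkRecord:
    # Normalized fields: a fixed shape that stays comparable across years.
    link: str
    title: Optional[str]
    description: Optional[str]
    date_published: Optional[str]
    # Raw payload: the entry as captured, so nothing is lost if the
    # normalization rules turn out to be wrong later.
    raw_json: str
    raw_sha256: str


def first_present(entry: dict[str, Any], keys: tuple[str, ...]) -> Optional[str]:
    """Return the first non-empty value found under any of the given keys."""
    for key in keys:
        value = entry.get(key)
        if value:
            return str(value)
    return None


def normalize(entry: dict[str, Any]) -> LinkRecord:
    """Build a fixed-schema record from a feed entry (hypothetical key names)."""
    raw_json = json.dumps(entry, sort_keys=True, ensure_ascii=False)
    link = first_present(entry, ("link", "url", "id"))
    if link is None:
        raise ValueError("entry has no link")
    return LinkRecord(
        link=link,
        title=first_present(entry, ("title",)),
        description=first_present(entry, ("description", "summary")),
        date_published=first_present(entry, ("pubDate", "published", "updated")),
        raw_json=raw_json,
        raw_sha256=hashlib.sha256(raw_json.encode("utf-8")).hexdigest(),
    )


# The same link as one feed might publish it before and after a format change.
old_style = {
    "title": "Hello",
    "link": "https://example.com/a",
    "description": "Full text",
    "pubDate": "Mon, 06 Jan 2025 10:00:00 GMT",
}
new_style = {
    "title": "Hello",
    "link": "https://example.com/a",
    "summary": "Full te...",
    "published": "2025-01-06T10:00:00Z",
}

old_record = normalize(old_style)
new_record = normalize(new_style)
```

Both records end up with the same field names, so a query over `description` works across the format change, and the differing `raw_sha256` values flag that the underlying payload shifted. Dates are deliberately left as strings here; parsing them into one format is a separate decision with its own pitfalls.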

It’s the kind of topic that sparks jokes about feeds “going through puberty” and data “needing therapy,” but the core is serious: if this vault gets the schema choices right, it could become a go-to time machine for media analysis. If not, expect painful spreadsheets and sad, soggy charts. Either way, the stakes feel real.

Key Points

  • RSS-Link-Database-2025 provides link metadata (title, description, publish date) for the year 2025.
  • Data capture is performed using a Django application (Django-link-archive).
  • The project is part of a suite that includes one Git repository of daily RSS captures per year, covering 2020 to 2026.
  • Goals include archiving and enabling data analysis, such as verifying link rot.
  • The repository is public and licensed under GPL-3.0, with files like README.md and sources.json describing structure and sources.

Hottest takes

“how you handle feed evolution over time” — Aherontas
“normalize to a fixed schema or store the raw payload” — Aherontas
“Longitudinal datasets tend to get tricky” — Aherontas