February 14, 2026

Waybacklash: keepers vs. gatekeepers

Internet Increasingly Becoming Unarchivable

News giants lock the library to stop AI, commenters cry history heist

TLDR: Major publishers are shutting out the Internet Archive to keep AI firms from harvesting their articles, raising fears the web’s history will vanish. Commenters clash: some cheer IP protection, others mourn a “history heist,” while many propose DIY archiving or research-only access to keep memories alive.

The internet’s memory lane just hit a toll booth, and the comments are on fire. The Guardian is limiting the Internet Archive’s API access and hiding article pages from the Wayback Machine, while The New York Times is straight-up hard-blocking the Archive’s crawlers. Publishers say they’re stopping AI firms from using libraries as a “backdoor” to suck up journalism, but Archive founder Brewster Kahle warns the public record will shrink. One researcher even sighed that the “good guys” are becoming collateral damage. Cue community meltdown.

Commenters split fast. Data hawks like ninjagoo claim about 20% of news sites now block both the Internet Archive and Common Crawl, painting a picture of a web slowly going dark. Doom-posters like OGEnthusiast say today’s web is “AI slop” anyway, joking that pre-2022 snapshots will become vintage collectibles. The DIY crowd rallies behind macinjosh’s call for a SETI@home-style, people-powered archiving movement: “install an extension, save the web.” Pragmatists like derefr pitch a compromise: a private archive only for vetted researchers, no AI access. Meanwhile, builders like Havoc say even harmless scraping gets nuked by bot walls, making the modern web feel like a nightclub with a bouncer who hates everyone.
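
One narrow reading of macinjosh’s distributed-archiving idea: even without new infrastructure, volunteers could fan out the crawling work by submitting pages they visit to the Wayback Machine’s public Save Page Now endpoint. The Python sketch below assumes that endpoint’s simple GET form and uses a hypothetical script name; a real effort would need authentication (the SPN2 API), rate limits, deduplication, and coordination the comment only gestures at.

    # diy_archiver.py: minimal sketch of one-volunteer crowd archiving.
    # Assumes the Wayback Machine's public Save Page Now GET endpoint;
    # a production effort would want the authenticated SPN2 API instead.
    import time
    import urllib.request

    SAVE_ENDPOINT = "https://web.archive.org/save/"

    def archive(url: str) -> int:
        """Ask the Wayback Machine to capture one URL; return HTTP status."""
        req = urllib.request.Request(
            SAVE_ENDPOINT + url,
            headers={"User-Agent": "diy-archiver-sketch/0.1"},
        )
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.status

    if __name__ == "__main__":
        # Stand-in for URLs a volunteer actually visited.
        for page in ["https://example.com/some-article"]:
            print(page, archive(page))
            time.sleep(5)  # space out requests to be polite to the Archive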

The vibe? High drama, low trust. Fans of open history cry “let the library stay open,” while publishers insist “not for bots.” The internet is being preserved or padlocked, depending on which comments you upvote.

Key Points

  • The Guardian restricted the Internet Archive’s access by excluding its content from Archive APIs and filtering article pages from the Wayback Machine while leaving non-article pages accessible.
  • The Financial Times blocks bots that scrape paywalled content, including those from OpenAI, Anthropic, Perplexity, and the Internet Archive; generally only unpaywalled FT stories appear in the Wayback Machine.
  • Michael Nelson warned that archives like Common Crawl and the Internet Archive may become collateral damage as publishers seek to avoid LLM data scraping.
  • The Guardian has not documented specific scraping via the Wayback Machine but is taking proactive measures and coordinating with the Internet Archive, which has been receptive.
  • The New York Times is hard-blocking the Internet Archive’s crawlers: at the end of 2025 it added archive.org_bot to its robots.txt file, disallowing access to its content (see the sketch after this list).
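
For readers who want the mechanics: robots.txt is a plain-text file served from a site’s root that names crawlers by user-agent string and tells them what they may fetch. Below is a minimal sketch of the kind of rule the article describes; the Times’s actual file may carry different or additional directives.

    # Hypothetical robots.txt excerpt; the Times's real file may differ.
    User-agent: archive.org_bot
    Disallow: /

Worth noting: robots.txt is only a request that well-behaved crawlers honor. A “hard block” of the kind the article describes implies the publisher is also refusing the crawler’s connections server-side, not merely asking it to stay away.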

Hottest takes

"the value of a pre-2022 (ChatGPT launch) Internet snapshot on physical media will probably increase astronomically" — OGEnthusiast
"We need something like SETI@home/Folding@home but for crawling and archiving the web" — macinjosh
"a private archiver that only serves registered academic / journalistic research projects" — derefr
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.