Crawling a billion web pages in just over 24 hours, in 2025

He crawled a billion web pages in a day – the internet is arguing about everything except that

TLDR: A developer crawled a billion web pages in about a day for under $500, showing that DIY web-scale crawling is still possible on a budget. Commenters immediately pounced, arguing the true challenge is getting past bot blockers, doubting some of the tech claims, and bragging they could do it cheaper and better.

A lone developer just pulled off a wild stunt: crawling a billion web pages in just over 24 hours for a few hundred bucks, like a one‑man budget Google. But the real show is in the comments, where the crowd immediately pivots from “wow” to “actually…” and starts swinging.

One camp is impressed but insists the real villain of modern web crawling isn’t speed or storage, it’s the anti‑bot walls. An SEO startup founder crashes the party to say bandwidth is cheap, the painful part is dodging services like Cloudflare that block automated visitors. Others gripe that only grabbing old‑school HTML pages (no JavaScript) is like “crawling the web with blinkers on,” cool for nostalgia, not reality.

Then the hardware nerds arrive. Someone calls out the author’s claim that modern solid‑state drives are “near RAM speed,” basically yelling “that’s not how numbers work” and launching a mini‑lecture on how drives are still much slower than memory. Another commenter says using Amazon’s cloud is for rich kids and flexes that the same crawl could be done for under $100 on sketchy‑cheap servers. And just when things calm down, a skeptical reader side‑eyes a line about a database “struggling” at 120 operations per second, hinting the author either misread the docs or seriously misconfigured things. The crawl was fast — but the comments tore it apart even faster.

Key Points

  • The author replicated and updated a large-scale web crawl in 2025, targeting roughly a billion pages within about 24 hours to compare with a well-known 2012 crawl by Michael Nielsen.
  • The project operated under a budget of a few hundred dollars, with the final 25.5-hour active crawl costing around $462, close to the historical cost benchmark.
  • The crawler fetched only HTML and did not execute JavaScript, instead parsing `<a>` tags for links, demonstrating that a substantial portion of the web remains accessible without JS rendering.
  • Strict politeness measures were implemented, including honoring robots.txt, using an informative user agent, maintaining an exclusion list, limiting seeds to the top 1 million domains, and enforcing a 70-second per-domain delay.
  • Instead of a disaggregated microservice-style architecture, the system used about a dozen independent, fully self-contained nodes, each handling a shard of domains, to maximize single-node performance and stay within budget while still providing basic fault tolerance.
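The post's actual crawler code isn't reproduced here, but the HTML-only approach from the key points is easy to illustrate: fetch raw HTML, parse `<a>` tags, and never execute JavaScript. A minimal Python sketch using only the standard library (class name and example URLs are mine, not the author's):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags in raw HTML."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/docs/")
extractor.feed('<p>See <a href="../about">about</a> and '
               '<a href="https://other.example/">other</a>.</p>')
print(extractor.links)  # ['https://example.com/about', 'https://other.example/']
```

Anything rendered into the DOM by client-side JavaScript is invisible to a parser like this, which is exactly the "blinkers on" trade-off the commenters debate.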
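The politeness measures above include a 70-second per-domain delay. The scheduling side of that rule can be sketched as a small gate that tracks when each domain was last hit; the `PolitenessGate` name and injectable clock are my own choices (robots.txt checking, e.g. via `urllib.robotparser`, would sit alongside this and is omitted):

```python
import time

DOMAIN_DELAY = 70.0  # seconds between requests to any single domain

class PolitenessGate:
    """Tracks per-domain last-fetch times and reports when a domain may be fetched again."""
    def __init__(self, delay=DOMAIN_DELAY, clock=time.monotonic):
        self.delay = delay
        self.clock = clock          # injectable for testing
        self.last_fetch = {}        # domain -> timestamp of last request

    def ready(self, domain):
        last = self.last_fetch.get(domain)
        return last is None or self.clock() - last >= self.delay

    def record(self, domain):
        self.last_fetch[domain] = self.clock()

# Simulated clock so the example runs instantly.
now = [0.0]
gate = PolitenessGate(clock=lambda: now[0])
gate.record("example.com")
print(gate.ready("example.com"))   # False: no time has elapsed
now[0] += 70.0
print(gate.ready("example.com"))   # True: the 70s delay is satisfied
```

A delay this strict is also why the seed list matters: with one request per domain every 70 seconds, throughput comes from breadth across the top 1 million domains, not from hammering any single site.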
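The dozen-node design implies each self-contained node owns a fixed shard of domains. One common way to get such an assignment, sketched here as an assumption rather than the author's documented method, is a stable hash of the domain name (a cryptographic hash is used because Python's built-in `hash()` is salted per process):

```python
import hashlib

NUM_NODES = 12  # "about a dozen" self-contained crawler nodes

def shard_for(domain: str, num_nodes: int = NUM_NODES) -> int:
    """Deterministically assign a domain to one crawler node."""
    digest = hashlib.sha256(domain.lower().encode("utf-8")).digest()
    # Take the first 8 bytes as an integer and reduce modulo the node count.
    return int.from_bytes(digest[:8], "big") % num_nodes

node = shard_for("example.com")
assert 0 <= node < NUM_NODES  # same domain always maps to the same node
```

Sharding by domain (rather than by URL) keeps all of a site's politeness state on one node, so the per-domain delay can be enforced without any cross-node coordination.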

Hottest takes

"the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare" — bndr
"Am I missing something here? Even Optane is an order of magnitude slower than RAM" — throwaway77385
"you could probably do this under 100$ with some optimization" — finnlab
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.