It seems that OpenAI is scraping [certificate transparency] logs

OpenAI’s bot spotted sniffing new sites; devs split between “duh” and “not cool”

TLDR: A dev saw an OpenAI-labeled bot hit their site seconds after a new TLS certificate was issued for it, suggesting the bot scans public Certificate Transparency logs. Comments split between “this is normal” and “OpenAI scrapes everything,” with a reminder that headers can be faked; either way, public logs accelerate discovery.

A developer minted a fresh website “lock” (a TLS certificate) and, almost instantly, their server logs showed a visit to /robots.txt from “OAI-SearchBot/1.3” — pointing to OpenAI’s bot page. Cue the crowd: is OpenAI trawling public certificate transparency (CT) logs — the public database that lists newly issued website certificates — to find new stuff to crawl?
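The suspected pipeline is easy to picture: watch the public CT feed for newly logged certificates, pull out the hostnames, and queue anything unseen for a robots.txt fetch. Here's a minimal sketch with a mocked CT feed standing in for a real subscription (a production crawler would consume actual CT logs; the feed data and function names below are illustrative, not from the post):

```python
# Sketch of how a crawler might discover new hostnames from a
# Certificate Transparency feed. The feed is mocked here; a real
# implementation would subscribe to CT logs.

def extract_hostnames(ct_entry):
    """Pull DNS names from a (simplified) CT log entry."""
    names = set()
    for name in ct_entry.get("all_domains", []):
        names.add(name.lower().lstrip("*."))  # normalize wildcards
    return names

def discover(feed, seen):
    """Yield hostnames never seen before, in feed order."""
    for entry in feed:
        for host in sorted(extract_hostnames(entry)):
            if host not in seen:
                seen.add(host)
                yield host  # next step: fetch https://<host>/robots.txt

# Mocked feed of two freshly logged certificates
feed = [
    {"all_domains": ["example.com", "www.example.com"]},
    {"all_domains": ["*.example.org"]},
]
seen = set()
new_hosts = list(discover(feed, seen))
print(new_hosts)  # ['example.com', 'www.example.com', 'example.org']
```

Nothing exotic: the entire trick is that CT entries are public the moment a certificate is logged, so "new cert" and "new crawl target" arrive at effectively the same time.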

The hot takes landed fast. One camp rolled eyes: “this has been happening forever,” basically calling it standard search-engine behavior. Another went full popcorn: “OpenAI’s whole business model is scraping lol,” accusing the AI giant of vacuuming anything public. The skeptic squad chimed in with drama: what if it’s not OpenAI at all, but a rando copying their user-agent to look like the big dog? Meanwhile, security nerds reminded everyone CT logs exist precisely for public oversight, not secrecy — think of them as a public phone book for website trust — and if you want fewer breadcrumbs, use wildcard certs.

And the memes? Someone suggested: “Let’s prompt-inject it,” aka prank the crawler with spicy instructions. Pragmatists shrugged: “It’s public info; if I had to scrape the web, I’d start there too.” The real tension: transparency versus privacy vibes, with a side of “is it OpenAI or cosplay?”

Key Points

  • A newly issued TLS certificate was followed almost immediately by a request to /robots.txt from a user agent identifying as OAI-SearchBot/1.3.
  • The timing led the author to infer that the crawler discovered the hostname via Certificate Transparency (CT) logs.
  • Log details show the request used HTTP/2 over TLS 1.3 and returned a 404 for robots.txt.
  • A suggestion to hash domains in CT logs was countered with the argument that it would undermine CT’s verifiability and oversight of CAs.
  • The author notes that wildcard certificates can mitigate exposure of specific hostnames in CT, and briefly questions the utility of DNSSEC/NSEC3 in this context.
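The wildcard mitigation in the last point works because CT records only the names listed in the certificate: with `*.example.com` as the SAN, individual subdomains never appear in the log. A toy matcher shows what a wildcard entry does and doesn't cover (simplified single-label matching in the spirit of RFC 6125; the hostnames are made up):

```python
# Why wildcard certs leak less via CT: the log records only the SAN
# entries, so "*.example.com" covers subdomains without naming them.
# Simplified single-label wildcard matching (per RFC 6125 semantics).

def covered_by(hostname: str, san: str) -> bool:
    if san.startswith("*."):
        # wildcard matches exactly one leftmost label
        suffix = san[2:]
        head, sep, rest = hostname.partition(".")
        return bool(sep) and rest == suffix
    return hostname == san

logged_san = "*.example.com"  # the only name visible in the CT log

print(covered_by("internal-tool.example.com", logged_san))   # True
print(covered_by("deep.internal.example.com", logged_san))   # False: one label only
print(covered_by("example.com", logged_san))                 # False: base name not covered
```

Note the trade-offs the matcher makes visible: the bare apex and deeper subdomains need their own (logged) names, so a wildcard hides breadcrumbs but doesn't erase them entirely.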

Hottest takes

"I think there whole business model is based off scraping lol" — drwhyandhow
"This could be OpenAI, or it could be another company using their header pattern" — Aurornis
"Let's prompt inject it" — gmerc
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.