March 25, 2026
Bots, brackets, and a bread aisle brawl
Show HN: Robust LLM Extractor for Websites in TypeScript
Dev tool or data heist? New AI scraper sparks robots.txt war
TLDR: Lightfeed Extractor uses AI and a stealthy browser to scrape websites and export clean data. The crowd is split: some cheer the token savings and “fix my broken JSON” features, while others slam the anti‑bot vibe and robots.txt concerns, sparking an ethics vs. efficiency showdown.
A new open‑source tool, Lightfeed Extractor, promises to let an AI read websites like a human and pull out clean data for your spreadsheets. It even runs a stealthy browser to dodge blocks and converts messy webpages into tidy markdown to save AI costs. Cool tech… until the comments exploded.
The mood? Split right down the middle. Privacy hawks are fuming over the “stealth mode” brag and the vibe that robots.txt—the polite “no trespassing” sign of the web—might be ignored. One user deadpanned, “Robots.txt anyone?” while another accused the project of not caring at all. Meanwhile, data‑hungry devs are drooling over token savings from converting HTML to markdown, but worry the cleanup could break things—what happens to tables, ratings, or tiny details? “Do you lose info?” becomes the nervous refrain.
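For readers wondering what "respecting robots.txt" would even look like in code, here is a minimal sketch. It is not taken from Lightfeed Extractor, and the function name `isPathAllowed` is hypothetical. It assumes simple prefix rules only (no `*` or `$` wildcards), uses the common longest-match convention, and lets a group that names the bot take precedence over the `*` group.

```typescript
// Hypothetical helper, not part of Lightfeed Extractor.
// Simplified robots.txt check: prefix matching only, no "*" or "$" wildcards.
function isPathAllowed(robotsTxt: string, userAgent: string, path: string): boolean {
  type Rule = { allow: boolean; prefix: string };
  const ua = userAgent.toLowerCase();
  const specific: Rule[] = []; // rules from groups that name this agent
  const wildcard: Rule[] = []; // rules from the "*" group
  let agents: string[] = [];   // agents of the group currently being read
  let readingAgents = false;
  let sawSpecificGroup = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      if (!readingAgents) agents = []; // a new group starts here
      const agent = value.toLowerCase();
      agents.push(agent);
      readingAgents = true;
      if (agent !== "" && agent !== "*" && ua.includes(agent)) sawSpecificGroup = true;
    } else if (field === "allow" || field === "disallow") {
      readingAgents = false;
      if (value === "") continue; // an empty Disallow restricts nothing
      const rule: Rule = { allow: field === "allow", prefix: value };
      for (const agent of agents) {
        if (agent === "*") wildcard.push(rule);
        else if (agent !== "" && ua.includes(agent)) specific.push(rule);
      }
    }
  }

  // A group naming this agent replaces the "*" group entirely.
  const rules = sawSpecificGroup ? specific : wildcard;
  let best: Rule | undefined;
  for (const rule of rules) {
    if (!path.startsWith(rule.prefix)) continue;
    const longer = !best || rule.prefix.length > best.prefix.length;
    const tieAllow = !!best && rule.prefix.length === best.prefix.length && rule.allow;
    if (longer || tieAllow) best = rule; // longest match wins; Allow wins ties
  }
  return best ? best.allow : true; // no matching rule means allowed
}
```

A scraper that wanted to be polite would fetch `/robots.txt` once per host, run a check like this before each request, and skip anything that comes back `false`.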
Then there’s the JSON drama. The tool claims it can recover broken AI‑generated JSON (those curly‑bracket meltdowns that crash pipelines). Commenters chimed in with a twist: some say that’s why other AI tools use XML—because closing tags keep the robot on track. Cue meme: “One bad bracket and your day is ruined.”
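To make the "one bad bracket" problem concrete, here is a minimal sketch of one way to rescue JSON that an LLM cut off mid-output. It is an illustration under stated assumptions, not Lightfeed Extractor's actual recovery logic: the helper name `repairTruncatedJson` is hypothetical, and it only handles truncation (unclosed strings, arrays, and objects, plus a dangling comma), not every kind of malformed output.

```typescript
// Hypothetical helper, not Lightfeed Extractor's implementation.
// Tries to repair JSON that was truncated before its closing brackets.
function repairTruncatedJson(text: string): unknown {
  // Fast path: the text is already valid JSON.
  try {
    return JSON.parse(text);
  } catch {
    // Fall through to the repair attempt.
  }

  const closers: string[] = []; // closing brackets still owed, innermost last
  let inString = false;
  let escaped = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inString) {
      // Inside a string, brackets are just characters; only track escapes and the end quote.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }

  let repaired = text;
  if (inString) repaired += '"';            // close an unterminated string
  repaired = repaired.replace(/,\s*$/, ""); // drop a dangling trailing comma
  repaired += closers.reverse().join("");   // close innermost brackets first

  // This can still throw, e.g. when the text ends right after a key's colon.
  return JSON.parse(repaired);
}
```

Feeding it `{"name": "Widget", "tags": ["a", "b"` yields an object with both fields intact. Input that stops at an awkward spot, such as right after a colon, still throws, which is roughly the point the XML fans in the thread were making.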
Final act: anti‑bot patches. Curious devs ask how often sites actually block this thing. Critics smell a cat‑and‑mouse game; fans call it a superpower for tracking prices and products. Either way, the web’s new data vacuum just rolled into aisle 5.
Key Points
- Lightfeed Extractor is a TypeScript library for LLM-powered web data extraction integrated with Playwright.
- Features include stealth browser automation, AI-driven navigation, HTML-to-markdown preprocessing, and URL cleaning.
- Schema-based JSON extraction uses LLMs with token limits/tracking, plus JSON recovery for malformed outputs.
- Installation supports multiple LLM providers via LangChain (OpenAI, Google Gemini, Anthropic, Ollama).
- An example demonstrates e-commerce product extraction using Playwright, Zod schemas, and a Gemini model.
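
The key points mention URL cleaning without saying what it involves. One plausible reading is stripping tracking parameters so the same product page is not stored under a dozen different links. The sketch below assumes exactly that; the `cleanUrl` name and the list of parameters are illustrative guesses, not the library's documented behavior. It relies only on the standard `URL` API available in Node.js and browsers.

```typescript
// Hypothetical helper; the parameter list is an assumption for illustration.
const TRACKING_PARAMS = new Set(["fbclid", "gclid", "mc_eid"]);

function cleanUrl(rawUrl: string): string {
  const url = new URL(rawUrl); // throws on an invalid URL
  // Snapshot the keys first so we do not mutate the params while iterating them.
  const keys = Array.from(url.searchParams.keys());
  for (const key of keys) {
    const lower = key.toLowerCase();
    if (lower.startsWith("utm_") || TRACKING_PARAMS.has(lower)) {
      url.searchParams.delete(key); // removes every occurrence of this key
    }
  }
  return url.toString();
}
```

Passing `https://shop.example.com/item?id=42&utm_source=news&fbclid=abc` keeps the `id` parameter and drops the two tracking ones.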