How I wrote JustHTML, a Python-based HTML5 parser, using coding agents

AI-made HTML parser drops; praise, 'port' accusations, and a 'not 100%' test fight

TLDR: An AI‑assisted Python HTML parser claims full test compliance and a sleek, no‑dependency design. Commenters quickly split the room: supporters cheer the ambition, while critics call it a Rust‑to‑Python port and dispute the “100%” claim with failing tests—turning the launch into a live fact‑check of AI‑built code.

JustHTML landed with a bang: a tiny Python library that reads messy webpages, built with coding agents (AI helpers) and boasting a “100%” pass rate on the html5lib test suite, zero dependencies, and even CSS selector queries. The dev’s saga (17 steps, a pit stop rewriting the tokenizer in Rust, then a crisis over whether the world needs another parser) had readers hooked. Early hype from simonw called it “neat” and fully tested, but the celebration didn’t last.

Commenters lit the fuse. minusf questioned originality, calling it a likely Python port of Rust’s html5ever and demanding clearer credit. Aloisius ran the tests and reported only 88.6% passing, poking holes in the “100%” story with example errors. furyofantares dunked on the blog’s AI-flavored formatting (“17 tiny sections?”). Amid the fire, one brave soul asked for a Postgres extension, because why not parse the web inside your database. The author’s cheeky nod to the “adoption agency algorithm” and the “Noah’s Ark” clause (keep at most three identical formatting elements) spawned two‑by‑two memes, but the thread’s vibe was clear: cool demo, but receipts, please. Fans love the hustle and the no-dependency design; skeptics want real numbers, clearer credit, and fewer AI vibes in the write‑up.

Key Points

  • JustHTML is a pure-Python HTML5 parser with zero dependencies and a CSS selector query API.
  • The author claims JustHTML passes 100% of the html5lib test suite using extensive test-driven development.
  • Complex HTML5 parsing challenges are addressed, notably the adoption agency algorithm and the Noah’s Ark clause.
  • Development used coding agents (GitHub Copilot Agent mode, Claude Sonnet 3.7) with automated iteration against html5lib-tests.
  • Performance work included a Rust-based tokenizer that ran slightly faster than html5lib; the author also considered html5ever before pivoting back to a pure-Python approach.
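
For readers wondering what the Noah's Ark clause actually does, here is a minimal sketch based on the HTML parsing spec, not on JustHTML's source. When a formatting element such as <b> is pushed onto the list of active formatting elements, and three identical entries already sit after the last marker, the earliest one is dropped first. The names FormattingElement, MARKER, and push_active_formatting are hypothetical, and namespace comparison is left out for brevity.

```python
from dataclasses import dataclass

# Stand-in for the spec's "marker" entries (an assumption of this sketch).
MARKER = None


@dataclass(frozen=True)
class FormattingElement:
    """Hypothetical, simplified element: tag name plus sorted attribute pairs."""
    tag: str
    attrs: tuple = ()


def push_active_formatting(active, element):
    """Append element to the list, applying the Noah's Ark clause first."""
    matches = []
    # Walk backwards from the end, stopping at the last marker if there is one.
    for index in range(len(active) - 1, -1, -1):
        entry = active[index]
        if entry is MARKER:
            break
        if entry == element:  # same tag name and same attributes
            matches.append(index)
    if len(matches) >= 3:
        # Indices were collected in descending order, so the last is the earliest.
        del active[matches[-1]]
    active.append(element)


active = []
bold = FormattingElement("b")
for _ in range(4):
    push_active_formatting(active, bold)
print(len(active))  # the fourth push evicts the earliest identical entry
```

Entries with different attributes are counted separately, and a marker resets the count, which is why the loop stops at the first marker it meets.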

Hottest takes

"isn't this more like a port of `html5ever` from rust to python using LLM" — minusf
"I'm not seeing 100% pass rates." — Aloisius
"Is it really too much to do a little more editing of the LLM output" — furyofantares