January 28, 2026
Test-driven clicking, anyone?
A verification layer for browser agents: Amazon case study
Small bots with a strict babysitter: fans cheer, cynics rage
TLDR: A tiny local bot completed an Amazon shopping flow by verifying every step like a unit test, not by seeing more pixels or using bigger models. Commenters like the accessibility-style approach but push back on an API key requirement, and one blunt "slop" takedown adds heat to a lively debate on safer, cheaper automation.
Robots don’t need better eyesight; they need a hall monitor. That’s the vibe as a new Amazon shopping demo claims small, local bots can finish a full cart-and-checkout flow if every click is verified. Instead of guessing from pixels, the bot reads a structured page snapshot and passes “Jest‑style” checks after each step: think “if it didn’t pass, it didn’t happen.” The planner (big brain) writes a checklist; a tiny local helper does the clicking. Result: 7/7 steps passed with success true, plus a ~43% token trim on a baseline demo, with no fancy vision model in sight. Author tonyww jumps in to stress this isn’t replacing Selenium or Playwright; it’s about reliability when pages change.
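To make “if it didn’t pass, it didn’t happen” concrete, here is a minimal sketch of a per-step verification gate in Python with Playwright. The helper `verified_step`, the checks, and the Amazon selectors are illustrative assumptions, not the demo’s actual code or the Sentience SDK API.

```python
# Minimal sketch of a per-step verification gate (illustrative, not the demo's code).
# Assumes Playwright is installed and `playwright install chromium` has been run.
from playwright.sync_api import sync_playwright, Page


def verified_step(page: Page, name: str, action, check) -> None:
    """Run one browser action, then assert a post-condition on the resulting page.

    A failed check raises instead of letting the flow continue silently:
    if it didn't pass, it didn't happen.
    """
    action(page)
    page.wait_for_load_state("domcontentloaded")
    assert check(page), f"step failed verification: {name}"
    print(f"PASS  {name}")


with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Two steps from a hypothetical planner-written checklist for a shopping flow.
    verified_step(
        page, "open search results",
        action=lambda pg: pg.goto("https://www.amazon.com/s?k=usb-c+cable"),
        check=lambda pg: pg.locator("[data-component-type='s-search-result']").count() > 0,
    )
    verified_step(
        page, "open first result",
        action=lambda pg: pg.locator("[data-component-type='s-search-result'] h2 a").first.click(),
        check=lambda pg: pg.locator("#add-to-cart-button").count() > 0,
    )

    browser.close()
```

The gate doesn’t care which model proposed the action; it only checks whether the post-condition holds before the run is allowed to continue.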
Comments lit up. joeframbach asks why not use the browser’s accessibility tree—“technically the AI agent is a vision‑impaired customer”—which supporters echo: if it works for screen readers, it should work for bots. asyncadventure applauds the “no more silent failures” vibe. Then the gloves come off: ewuhic drops “Slop shit discussing slop shit,” earning eye‑rolls and upvotes in equal measure. Meanwhile, practical folks like vilecoyote poke holes: the quickstart needs an API key for “importance ranking”—so how “local” is local, and does verification survive without it? Even the demo’s slower runtime becomes meme fuel: “fast is cute, right is king.” Verdict from the crowd: love the test‑first idea, but don’t bury the gotchas.
Key Points
- Reliability is achieved through a verification layer that gates each browser action with assertions over structured snapshots.
- A local autonomous run succeeded with small models when verification was enforced (Demo 3 re-run: 7/7 steps, success true).
- Token efficiency improved via interface design (structure + filtering), reducing Demo 0 from ~35,000 to 19,956 tokens (~43%); see the snapshot-filtering sketch after this list.
- The target architecture pairs a strong planner (DeepSeek-R1) with a ~3B-class local executor, using Sentience for verification gates.
- Setup uses the Sentience Python SDK and Playwright; the runtime performs inline snapshot + verification, with no vision models required in the core loop.
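For the snapshot-filtering idea behind the token numbers above, a rough sketch, with assumptions throughout: collect only interactive elements into a compact JSON payload, which is what a small local model would see instead of raw HTML. The CSS filter, JSON shape, and token estimate are illustrative and are not the Sentience snapshot format.

```python
# Sketch: build a compact, structured snapshot of interactive elements only.
# The filter and JSON shape are illustrative; Sentience's real snapshot format may differ.
import json
from playwright.sync_api import sync_playwright


def compact_snapshot(page, max_elements: int = 50) -> str:
    """Collect only actionable elements (links, buttons, inputs) with short labels.

    Sending this JSON to the model instead of raw HTML is the kind of
    interface-level filtering that trims prompt tokens.
    """
    elements = []
    locator = page.locator("a, button, input, select, [role='button']")
    count = min(locator.count(), max_elements)
    for i in range(count):
        el = locator.nth(i)
        label = (el.get_attribute("aria-label") or el.inner_text() or "").strip()[:60]
        if not label:
            continue
        elements.append({
            "index": i,
            "tag": el.evaluate("e => e.tagName.toLowerCase()"),
            "label": label,
        })
    return json.dumps(elements)


with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    snapshot = compact_snapshot(page)
    print(f"{len(json.loads(snapshot))} elements, ~{len(snapshot) // 4} tokens (rough estimate)")
    browser.close()
```

A planner could write assertions against this same compact structure (element labels, counts, tags), so the verification gate and the executor share one small, stable view of the page.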