March 10, 2026
Who watches the Watchbot?
Ask HN: How are people doing AI evals these days?
Chaos, duct tape, and AI grading its own homework — welcome to eval land
TL;DR: HN reveals AI testing is a chaos buffet—from no tests to duct-taped scripts to “AI judging AI”—with a few teams running automated benchmarks. Commenters split between quick‑and‑dirty pragmatists and score‑seeking standardizers, warning that model choice now affects money, trust, and user safety.
Hacker News asked a simple question—how do you test AI models?—and the replies read like a confessional. One user summed up the vibe as “very, very heterogeneous and fast moving,” with teams split between “no evals at all” and a frantic mix of tools like LangFuse and promptfoo glued together with ad‑hoc scripts. Another chimed in with, “definitely not happy with it,” capturing the shared fatigue as new models drop every other week.
Then came the spicy twist: are people even testing coding agents? One commenter flatly said, “I don’t think people are,” before pitching the wild card—“AI to evaluate itself.” Cue the memes: “AI grading its own homework,” “who watches the watchbot,” and jokes about a robot red‑penning another robot. It’s part gallows humor, part real talk about speed vs. rigor.
Meanwhile, a more buttoned‑up camp is building automated benchmarking: PMs write questions, humans mark pass/fail once, then “AI‑as‑a‑judge” approximates that verdict as teams swap models and prompts. Others want one score to rule them all—“I would like a single number”—especially for messy “agent” pipelines and memory features. The original poster pushes for shared, simple standards via ai‑evals.io and practical examples. The mood: chaotic but curious; everyone’s hacking, few are satisfied, and nobody wants to be left behind.
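The judge workflow described above can be sketched in a few lines. Everything here is hypothetical (the `Case` record, the `judge` stub, the `agreement` metric are illustrative names, not from any tool mentioned in the thread); a real judge would prompt an LLM to grade each answer, where this sketch stubs in a trivial heuristic.

```python
# Minimal sketch of the AI-as-a-judge benchmarking loop:
# PM-written questions, one-time human pass/fail labels, and an
# automated judge whose agreement with those labels you track as
# you swap models and prompts. All names are illustrative.
from dataclasses import dataclass


@dataclass
class Case:
    question: str       # written once by a PM
    model_answer: str   # output of the system under test
    human_pass: bool    # one-time human pass/fail verdict


def judge(case: Case) -> bool:
    """Stub judge: a real one would call a model API to grade the answer.
    Here we only check that the answer is non-empty."""
    return bool(case.model_answer.strip())


def agreement(cases: list[Case]) -> float:
    """Fraction of cases where the automated judge matches the human label,
    the single number you watch across model and prompt changes."""
    matches = sum(judge(c) == c.human_pass for c in cases)
    return matches / len(cases)


cases = [
    Case("What is 2+2?", "4", True),
    Case("Summarize the doc.", "", False),
]
print(agreement(cases))  # 1.0 on this toy set
```

The design point is that the human labels are collected once; afterwards the judge is re-run cheaply on every new model or prompt, and its agreement score tells you how far to trust it.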
Key Points
- AI evaluation practices are highly heterogeneous across teams, ranging from none to basic integration tests.
- Some teams integrate observability and evaluation tools (e.g., LangFuse, Arize Phoenix, deepeval, Braintrust, promptfoo, pydanticai) into their workflows.
- Evaluations are often an afterthought, but interest in structured evals is increasing.
- The author advocates for simple, common eval practices across roles and shares resources (ai-evals.io, eval-ception repo).
- For coding agents, approaches include AI self-evaluation and custom-built platforms; a custom eval runner for codebases is being explored.