March 3, 2026

When AIs start grading each other

Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents

Hackers ask: Can an AI babysit another AI without burning the house down?

TL;DR: Cekura launched a tool that uses fake users and robot judges to test chatbots and voice bots before they go live. Hacker News loves the idea but worries that AI judging AI just doubles the cluelessness, with some users saying they can’t even plug it in and others flaunting DIY alternatives.

A new startup, Cekura, just dropped into Hacker News promising a robot army that pretends to be customers and stress-tests your chatbots before they embarrass you in front of real people. But the community instantly turned the launch post into group therapy for everyone who’s been burned by dumb-but-confident AI agents.

One founder in the thread proudly asked for feedback but was immediately hit with a practical reality check: moinism basically said, “Cool story, but I can’t even use this because my bot doesn’t have an API.” Translation: nice tool, shame about the real-world constraints. Then came the existential horror: FailMore described the classic nightmare where you use one AI to judge another AI, only to discover neither has common sense. The crowd clearly feels this—everyone knows that eerie moment when a bot sails past something a human would instantly call “uh, that’s wrong, right?”

Meanwhile, another commenter casually dropped a link to their own open-source Franken-lab setup that gives testing tools a digital “mouth and ears,” basically saying, “I built my own, want to talk?” It’s classic HN energy: part applause, part side-eye, part “I already hacked this together last weekend.” The vibe: people love the idea of automated testing for bots, but no one’s convinced the bots—or their judges—are actually smart enough yet.

Key Points

  • Cekura provides simulation-based testing and monitoring for voice and chat AI agents using LLM-based judges.
  • Test coverage is built from scenario generation and by importing real user conversations to extract evolving test cases.
  • A mock tool platform simulates external tool calls, improving reliability and speed without touching production APIs.
  • Deterministic, structured evaluators (conditional action trees) ensure repeatable CI tests and reduce stochastic noise.
  • Cekura monitors live sessions to detect multi-turn failures; offers a 7-day free trial and paid plans from $30/month.
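To make the “deterministic, structured evaluators” point concrete: the appeal of a conditional action tree is that every node is a plain predicate over the transcript, so the same conversation always produces the same verdict — no LLM judge, no stochastic noise. Here is a minimal Python sketch of the general idea; none of these names or structures come from Cekura’s actual product, they are hypothetical illustrations.

```python
# Hypothetical sketch of a deterministic, tree-structured evaluator for a
# multi-turn agent transcript. Each node holds a plain boolean check, and a
# node's children are evaluated only if its own check passes (the
# "conditional" part). Pure functions over the transcript mean repeatable
# CI runs: same input, same verdict.
from dataclasses import dataclass, field
from typing import Callable, List

Transcript = List[dict]  # e.g. [{"role": "user", "text": "..."}, ...]

@dataclass
class Node:
    name: str
    check: Callable[[Transcript], bool]
    children: List["Node"] = field(default_factory=list)

def evaluate(node: Node, transcript: Transcript, path=()) -> List[str]:
    """Return the paths of failing nodes; an empty list means all checks passed."""
    here = path + (node.name,)
    if not node.check(transcript):
        return ["/".join(here)]
    failures: List[str] = []
    for child in node.children:
        failures += evaluate(child, transcript, here)
    return failures

# Example tree: the bot must greet, and only if it greeted do we also
# require that it asked for the order number.
def greeted(t):
    return any("hello" in m["text"].lower() for m in t if m["role"] == "bot")

def asked_order(t):
    return any("order number" in m["text"].lower() for m in t if m["role"] == "bot")

tree = Node("greeting", greeted, [Node("asks_order_number", asked_order)])

transcript = [
    {"role": "bot", "text": "Hello! How can I help?"},
    {"role": "user", "text": "My package is late."},
    {"role": "bot", "text": "Sorry to hear that. What's your order number?"},
]
print(evaluate(tree, transcript))  # -> [] (all checks pass)
```

A failing transcript yields a stable, human-readable path such as `greeting/asks_order_number`, which is the kind of output a CI job can assert on without flakiness.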

Hottest takes

"Any ideas how to solve the agent's don't have total common sense problem?" — FailMore
"I have an agent, but no exposed api so can't really use your product even though I have a genuine need" — moinism
"launch a chromium instance... so that playwright can finally have mouth and ears" — michaellee8
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.