What CI looks like at a 100-person team (PostHog)

33M tests a week and an AI fixer: speed flex or flaky mess?

TLDR: PostHog runs millions of tests weekly and built an AI tool to auto-diagnose failures and patch flaky tests. Commenters split between speed-loving pragmatists and purists slamming “AI-slop,” warning that quarantining flakes and bot-written fixes could weaken quality even as the team races to ship faster.

PostHog bragged big numbers—575k jobs in a week, 1.18B log lines, 33M tests—and introduced Mendral, an AI agent that spots flaky tests, quarantines them, and opens fixes. It’s the “we wish we had this at Docker” story for a fast-moving, 100‑person team shipping constantly to a single mega codebase. But the real show is the comments: the purists came swinging. One critic snapped that 99.98% isn’t good—only 100% is, and “never quarantine a flaky test.” Others rolled their eyes at the scale (“221 jobs per commit?”) with a deadpan zinger: “This is why Bazel was invented.”

Then the vibe turned from QA debate to literary roast. One reader said the post gave them “slop nausea,” dragging the writing style (“It’s not X, it’s Y”) while another yelled “AI‑slop,” warning that bots that “fix” tests often loosen rules and let bugs in. The big fear: an LLM‑ification of engineering where AI writes code, AI writes tests, and humans just rubber‑stamp. Meanwhile, some pragmatists admired the sheer velocity—burning “300 days of compute in 24 hours” sounds wild—and argued tools like Mendral are survival gear, not shortcuts. In short: heroic automation or flaky cover‑up? The thread reads like a speed‑run vs. soul‑of‑software cage match, starring PostHog and an angry peanut gallery.

Key Points

  • PostHog’s CI ran 575,894 jobs and 33.4 million tests in one week, processing 1.18 billion log lines and consuming 3.6 years of compute time.
  • Every commit to main at PostHog triggers ~221 parallel jobs; the team merges ~65 commits/day and tests ~105 PRs/day.
  • Despite a 99.98% test pass rate, the sheer scale produces significant noise: ~14% of compute is spent on failed or cancelled jobs, and ~3.5% of jobs are re-runs.
  • Flaky tests impose substantial investigation overhead at 100-person scale, as frequent failures require re-runs and log analysis.
  • Mendral, a GitHub App AI agent, ingests logs at scale to diagnose CI failures, quarantine flaky tests, and open PRs with fixes.
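To see why a 99.98% pass rate still hurts at this volume, here is a rough back-of-the-envelope sketch in Python using the figures quoted above. It assumes the pass rate applies per test execution and that the ~3.5% re-run figure is a share of all jobs; neither is confirmed by the source, and the function name and output keys are made up for illustration.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the Key Points.
# Assumptions (not confirmed by the source): the 99.98% pass rate is per test
# execution, and the ~3.5% re-run rate is a fraction of all jobs.

TESTS_PER_WEEK = 33_400_000
JOBS_PER_WEEK = 575_894
PASS_RATE = 0.9998
RERUN_RATE = 0.035


def weekly_noise(tests: int, jobs: int, pass_rate: float, rerun_rate: float) -> dict:
    """Estimate absolute failure and re-run counts from the quoted rates."""
    failed_tests = round(tests * (1 - pass_rate))
    return {
        "failed_tests_per_week": failed_tests,
        "failed_tests_per_day": round(failed_tests / 7),
        "rerun_jobs_per_week": round(jobs * rerun_rate),
    }


if __name__ == "__main__":
    for name, value in weekly_noise(
        TESTS_PER_WEEK, JOBS_PER_WEEK, PASS_RATE, RERUN_RATE
    ).items():
        print(f"{name}: {value:,}")
```

Under those assumptions, a 0.02% failure rate works out to roughly 6,700 failing test executions a week, or close to a thousand a day, which is the investigation overhead the bullets describe.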

Hottest takes

“Only acceptable rate is 100%” — zX41ZdbW
“Jesus, this is why Bazel was invented” — IshKebab
“This is obviously AI‑slop” — SirensOfTitan