Show HN: Why it's hard to know which deployment caused a production incident

New tool points at the guilty deploy — crowd splits between 'AI slop' and 'finally'

TLDR: Valiant claims it can rank which recent change likely caused an outage, using simple, explainable scoring. Comments split between a blunt “AI slop” takedown and practical ops folks cheering the idea and asking for PagerDuty/OpsGenie hooks to surface suspects the second alarms fire.

Show HN delivered pure on-call drama: Valiant promises to answer the midnight question, “Which deploy broke it?” by tracking Kubernetes change events, comparing Prometheus metrics before and after each one, and ranking the likely culprit. The pitch is simple: no machine learning, just explainable scores across error rate, latency, request rate, CPU, and memory.

The crowd? Split. One hit-and-run comment sneered “AI slop,” a perfect drive-by dunk that sparked eye-rolls, because the pitch explicitly rules out machine learning. On the other side, ops folks practically begged for it to hook into alerting tools like PagerDuty and OpsGenie (the apps that wake you when your site is down), so that when an alarm goes off, a suspect list pops up with names and timestamps. The line “5 deploys in one hour, which one broke it?” earned knowing nods from anyone who’s ever babysat a flaky release.

Beyond the snark, readers riffed on the 3am panic meme and the office blame-game, imagining a dashboard that finally settles “who broke prod” without playing detective in ten tabs. Love it or loathe it, the vibe was clear: speed matters during outages. If Valiant can actually finger the right change—and do it without mystery math—some sleep-deprived engineers might finally get their nights back.

Key Points

• Valiant correlates Kubernetes change events with Prometheus metrics to identify which change likely caused an incident.
• It uses deterministic, explainable scoring across error rate, latency, RPS, CPU, and memory, without machine learning.
• The tool ranks concurrent changes and provides impact classifications (NONE/LOW/MEDIUM/HIGH) with confidence scoring.
• Features include CI/CD webhook ingestion, intent-execution linking via Git SHA, custom PromQL metrics, automatic analysis, retention, and immutable snapshots.
• Architecture: Go backend, Next.js frontend, PostgreSQL, Prometheus (HTTP API); quick start via Docker Compose and a REST API for programmatic access.
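
To make "deterministic, explainable scoring" concrete, here is a minimal sketch of how before/after metrics could be turned into a ranked suspect list. It is not Valiant's actual algorithm: the weights, thresholds, and the names `Change`, `regression`, `score`, `classify`, and `rank` are all assumptions made up for illustration, and confidence scoring is left out.

```python
from dataclasses import dataclass

# All weights, thresholds, and names below are illustrative assumptions,
# not values taken from Valiant.
WEIGHTS = {"error_rate": 0.4, "latency": 0.3, "rps": 0.1, "cpu": 0.1, "memory": 0.1}

# +1: an increase is a regression; -1: a drop is a regression (traffic loss).
BAD_DIRECTION = {"error_rate": 1, "latency": 1, "rps": -1, "cpu": 1, "memory": 1}

# Score thresholds, checked from highest to lowest.
IMPACT_LEVELS = [(0.50, "HIGH"), (0.20, "MEDIUM"), (0.05, "LOW")]


@dataclass
class Change:
    name: str
    before: dict  # metric name -> average in the window before the change
    after: dict   # metric name -> average in the window after the change


def regression(metric, before, after):
    """Relative shift in the harmful direction, clamped to [0, 1]."""
    delta = BAD_DIRECTION[metric] * (after - before)
    if delta <= 0:
        return 0.0
    if before == 0:
        return 1.0  # any harmful move from a zero baseline counts as maximal
    return min(delta / abs(before), 1.0)


def score(change):
    """Weighted sum of per-metric regressions; the same inputs always give the same score."""
    return sum(w * regression(m, change.before[m], change.after[m])
               for m, w in WEIGHTS.items())


def classify(value):
    """Map a score to an impact label using the assumed thresholds."""
    for threshold, label in IMPACT_LEVELS:
        if value >= threshold:
            return label
    return "NONE"


def rank(changes):
    """Return (name, score, impact) tuples, most suspicious change first."""
    scored = [(c.name, score(c)) for c in changes]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(name, value, classify(value)) for name, value in scored]
```

Because every score is a plain weighted sum, each ranking can be explained by listing the per-metric contributions, which is the property the pitch leans on when it says "no machine learning".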

Hottest takes

"AI slop" — direwolf20

"5 deploys in one hour, which one broke it?" — af_arc

"integration with PagerDuty/OpsGenie" — af_arc

February 8, 2026

Who broke prod this time?
