January 29, 2026
Trace fail, takes prevail
OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)
AI flubs observability basics, commenters blame OTel, hype, and hope
TLDR: An open benchmark found the best AI model solved just 29% of basic tasks for adding the tracing that follows a user’s request across many services. Comments split three ways: blaming the tracing standard’s complexity, arguing that AI armed with real tools can shine, and warning that humans struggle too, making this a wake‑up call for anyone trusting bots with uptime.
OTelBench just dropped a reality check: the best model tested (Opus 4.5) scored only 29% at adding “tracing”, the breadcrumbs that let engineers follow a user’s click through dozens of microservices. The open-source test put 14 models through 23 tasks across 11 languages, at a cost of $522 for 966 runs. The job sounds simple (start a trace, pass an ID along, export the data) but in practice it’s fussy and easy to break. OpenTelemetry (OTel), the common standard, promises order, yet even fans admit it’s not plug‑and‑play.
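What does one of those “simple” tasks actually involve? Here is a minimal sketch of the client side using OpenTelemetry’s Python SDK; the tracer name, span name, and plain-dict header carrier are illustrative, not taken from the benchmark:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject

# Tracer name is illustrative; real code typically passes the module's name.
tracer = trace.get_tracer("checkout-service")

def call_inventory_service():
    # Start a span; inside the block it becomes the active span.
    with tracer.start_as_current_span("place-order"):
        headers = {}
        # Write the W3C `traceparent` header (which carries the TraceID) into
        # the outgoing request so the next service can join the same trace.
        inject(headers)
        # ...attach `headers` to the HTTP call to the downstream service...
```

Each of the three steps is a place to slip up: forget the `inject` call, and every service starts its own disconnected trace.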
And the comments? On fire. One camp shrugs, “maybe the problem is OTel,” with whalesalad roasting the spec more than the bots. Tooling maximalists like jcims say AI shines when given code access, deploy pipelines, and context. Skeptics pile on: dgxyz says humans struggle with this too and wouldn’t bet uptime on it; asyncadventure calls 29% “optimistic” because tracing needs business knowledge, not just syntax. Comic relief via winton: try the model three or four times and you’ll ship in 10 minutes. Cue the “just rerun it” meme. Verdict: split between “AI isn’t ready,” “OTel is too hard,” and “give the bots real tools.” The OTelBench team dares readers to reproduce the results and clap back.
Key Points
- OTelBench is an open-source benchmark evaluating LLMs on OpenTelemetry instrumentation in microservices.
- The benchmark spans 23 tasks across 11 programming languages to reflect real-world polyglot systems.
- Evaluations used the Harbor framework, enabling reproducibility and community contributions.
- The full run included 966 evaluations at a token cost of $522.
- Instrumentation criteria include starting traces, propagating context (TraceID), adhering to conventions, and exporting via standard environment variables using recent OTel SDKs (see the sketch below).
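For concreteness, a sketch of the receiving side with the OTel Python SDK; the handler shape and attribute are illustrative assumptions. The OTLP exporter reads the standard `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable, and the provider’s default resource picks up `OTEL_SERVICE_NAME`, so neither is hard-coded:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Requires the opentelemetry-sdk and opentelemetry-exporter-otlp packages.
# Endpoint and service name come from OTEL_EXPORTER_OTLP_ENDPOINT and
# OTEL_SERVICE_NAME, the spec's standard environment variables.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("inventory-service")  # illustrative name

def handle_request(headers: dict):
    # Extract the inbound traceparent so this span joins the caller's trace
    # (the "propagate the TraceID" criterion) instead of starting a new one.
    ctx = extract(headers)
    with tracer.start_as_current_span("reserve-stock", context=ctx) as span:
        span.set_attribute("app.items.count", 3)  # illustrative attribute
```

Getting any one of these pieces wrong, such as exporting to a hard-coded endpoint instead of honoring the environment variables, is exactly the kind of miss the benchmark scores against.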