April 15, 2026

Telemetry tea, served scalding

Airbnb discloses a billion-series Prometheus metrics pipeline

Airbnb’s billion-metrics upgrade sparks vendor wars and reliability roast

TLDR: Airbnb is migrating its metrics from StatsD to OpenTelemetry and Prometheus in a massive overhaul, and reporting big performance gains. The comments explode over push vs scrape, Mimir vs VictoriaMetrics, and whether Prometheus is reliable, making this a must-watch signal of where monitoring is headed.

Airbnb just dropped a “billion-series” flex, moving its monitoring from old-school StatsD to OpenTelemetry (OTLP) and Prometheus, with a vmagent assist. The company dual-wrote during the migration, then bragged about big wins: CPU time spent on metrics processing collapsed from roughly 10% to under 1%, plus fewer UDP packet-loss tears. But the real fireworks? The comments. One camp cheered the move as proof that OpenTelemetry is the new industry default, while another dragged Prometheus as “not reliable,” turning the thread into an on-call trauma circle.

Vendor drama went spicy: why Grafana Mimir over VictoriaMetrics? With Airbnb piping through vmagent, fans accused the stack of being a cross-vendor mashup, and the “who’s better” debate hit full volume. Meanwhile, the push-vs-scrape holy war resurfaced: one commenter couldn’t fathom pushing OTLP directly when the classic Prometheus “just scrape a URL” approach is so simple, while others argued that at Airbnb scale, pushing is cleaner and more reliable.
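For anyone who has not lived through this argument, here is a toy sketch of the two models in plain Python. Nothing here is Airbnb's code or a real client library: `Registry`, `render_exposition`, and `push_batch` are hypothetical names, the scrape output only loosely follows the Prometheus text exposition format, and `send` is a stand-in for whatever OTLP exporter a real service would use.

```python
from collections import Counter


class Registry:
    """In-process counters keyed by (metric name, sorted label pairs)."""

    def __init__(self):
        self.counters = Counter()

    def inc(self, name, amount=1, **labels):
        self.counters[(name, tuple(sorted(labels.items())))] += amount


def render_exposition(registry):
    """Scrape model: the service just answers when the collector asks.

    Returns one line per series, loosely in Prometheus text format.
    """
    lines = []
    for (name, labels), value in sorted(registry.counters.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)


def push_batch(registry, send):
    """Push model: the service decides when to ship its data points.

    `send` is a hypothetical callable standing in for an OTLP exporter.
    Returns the number of data points handed to it.
    """
    batch = [
        {"name": name, "labels": dict(labels), "value": value}
        for (name, labels), value in sorted(registry.counters.items())
    ]
    send(batch)
    return len(batch)
```

Same counters, two delivery styles: in the scrape version the collector owns the schedule, in the push version the service does. That ownership question is most of what the thread is actually fighting about.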

There were memes galore (“numbers go brrr,” “histogram hipsters,” and “UDP is a lifestyle”), but the nerdiest plot twist was Airbnb admitting that high-volume, high-cardinality services (tons of unique label combinations) hit memory and garbage-collection regressions once OTLP was switched on, mitigated by configuring “delta temporality” (send changes, not totals). Translation: they got the speed, survived the chaos, and kept the dashboards pretty. The crowd’s split between “OTLP is the future” and “scrape forever,” and we’re here for the popcorn.
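Why would “send changes, not totals” ease memory pressure? A simplified model, assuming the usual explanation rather than anything Airbnb published as code: a cumulative counter has to remember every series it has ever seen so it can keep reporting running totals, while a delta counter can forget a series as soon as it has been exported. `CumulativeCounter` and `DeltaCounter` below are hypothetical illustrations, not OpenTelemetry SDK classes.

```python
class CumulativeCounter:
    """Reports running totals, so state for every series is kept forever."""

    def __init__(self):
        self.totals = {}

    def add(self, labels, n=1):
        self.totals[labels] = self.totals.get(labels, 0) + n

    def export(self):
        # State is retained: the next export must repeat every series.
        return dict(self.totals)


class DeltaCounter:
    """Reports only the change since the last export, then forgets it."""

    def __init__(self):
        self.pending = {}

    def add(self, labels, n=1):
        self.pending[labels] = self.pending.get(labels, 0) + n

    def export(self):
        # Hand over the pending deltas and start from an empty map.
        out, self.pending = self.pending, {}
        return out
```

With millions of short-lived label combinations, the cumulative map only grows, while the delta map is bounded by what shows up between two exports. The cost moves downstream: something later in the pipeline has to add the deltas back up.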

Key Points

  • Airbnb migrated from a StatsD/Veneur-based metrics pipeline to an OpenTelemetry (OTLP) and Prometheus-based stack.
  • They standardized protocols: OTLP for internal services, Prometheus for OSS workloads, and StatsD (DogStatsD) as a legacy fallback.
  • A dual-write strategy via a shared metrics library (covering ~40% of services) enabled low-friction rollout of OTLP alongside existing StatsD.
  • Measured benefits included CPU time for metrics processing dropping from ~10% to under 1%, improved reliability over UDP StatsD, and adoption of Prometheus exponential histograms.
  • High-volume, high-cardinality services saw memory and GC regressions when enabling OTLP; mitigated by configuring delta temporality in the common metrics library.
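
The dual-write idea from the points above can be sketched as a thin facade in a shared library: callers keep making one call, and the library fans it out to both pipelines. This is a guess at the shape, not Airbnb's implementation. `DualWriteMetrics`, `legacy_emit`, and `otlp_emit` are hypothetical names, and only the counter line format (`name:value|c|#tag:value`) is taken from DogStatsD.

```python
def format_dogstatsd_counter(name, value, tags):
    """Render a counter as a DogStatsD line, e.g. 'hits:1|c|#region:us'."""
    line = f"{name}:{value}|c"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


class DualWriteMetrics:
    """Hypothetical shared-library facade that writes to both pipelines.

    legacy_emit: callable taking a StatsD line (stand-in for a UDP send).
    otlp_emit:   callable taking a data-point dict (stand-in for an OTLP
                 exporter), or None while a service has not opted in.
    """

    def __init__(self, legacy_emit, otlp_emit=None):
        self.legacy_emit = legacy_emit
        self.otlp_emit = otlp_emit

    def increment(self, name, value=1, tags=None):
        tags = tags or {}
        # The existing path keeps working unchanged during the migration.
        self.legacy_emit(format_dogstatsd_counter(name, value, tags))
        # The new path is additive and can be enabled per service.
        if self.otlp_emit is not None:
            self.otlp_emit({"name": name, "value": value, "attributes": dict(tags)})
```

Because service code only ever calls `increment`, flipping a service onto OTLP (or back off it) is a library configuration change rather than a code change in every caller, which is presumably what made the rollout low-friction.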

Hottest takes

"Reliable is not a word I would associate with it" — codeduck
"The irony is that this may be a $0 revenue user for Grafana Labs" — dig1
"Curious why the team choose Grafana Mirmir over VM cluster?" — jameson
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.