LiteLLM (YC W23): Founding Reliability Engineer – $200K-$270K and 0.5-1.0% equity

Big paycheck, bigger pager: internet split on this “own-it-all” reliability gig

TLDR: LiteLLM is hiring its first reliability engineer to keep a massive AI gateway from crashing, offering $200K–$270K and 0.5–1% equity. The community is split between “career-making challenge with honest scope” and “one-person blast shield,” with heated debates over on-call stress, equity, and sleep.

LiteLLM, the open-source AI traffic cop used by NASA, Netflix, and Stripe, just posted a first-ever Reliability Engineer role—$200K–$270K plus 0.5–1% equity—and the comments lit up like a 3 a.m. on-call alert. Fans cheered the rare honesty: the listing spells out exactly what breaks, from memory leaks to database slowdowns, and promises full ownership. Skeptics clutched their pagers, calling it “you are the blast shield for half the AI internet.”

One camp says the pay is strong for a 10-person, $7M ARR startup with 36K GitHub stars, and the chance to harden a system routing hundreds of millions of AI calls is a career rocket. Others argue the scope is “Staff+++ with pager tax” and want bigger equity or separate on-call cash. The NASA name-drop spawned “Mission Control” memes, while “this will add 50ms at P99” became an instant t-shirt. People explained P99 in plain terms—“the slowest 1% of users”—and joked that the real job is “wrestling Kubernetes at stupid o’clock.”

Bottom line: the community is torn between “dream challenge, massive impact” and “sleep-with-your-laptop stress.” Whether you see LiteLLM as a resume rocket or a burnout blender depends on how much you love firefighting with a smile—and a stopwatch.

Key Points

  • LiteLLM is hiring a Founding Reliability Engineer with $200K–$270K salary and 0.5–1.0% equity.
  • LiteLLM routes hundreds of millions of LLM API calls daily for customers including NASA, Adobe, Netflix, Stripe, and Nvidia; $7M ARR, 10-person team, YC W23.
  • Role mix is ~60% operational reliability and ~40% performance engineering, owning production health end-to-end.
  • Technical focus includes Python async memory management, Postgres at scale, and compatibility across 100+ provider APIs (OpenAI, Anthropic, Bedrock, Vertex).
  • Targets and practices include <10ms overhead at 5K+ RPS, P50/P95/P99 benchmarks gating releases, structured logging, distributed tracing, Prometheus metrics, canary/rollback, and SLOs.
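The “percentile benchmarks gating releases” idea in the last bullet can be sketched in a few lines of Python. This is an illustrative sketch only: the `percentile` and `gate_release` helpers and the budget numbers are assumptions for the example, not LiteLLM's actual tooling or targets.

```python
import random

def percentile(samples, pct):
    """Nearest-rank percentile: the value at or below which pct% of samples fall."""
    ordered = sorted(samples)
    k = -(-len(ordered) * pct // 100)  # ceiling division gives the 1-indexed rank
    return ordered[max(k, 1) - 1]

def gate_release(latencies_ms, budgets_ms=None):
    """Pass the release only if every percentile stays within its latency budget.

    The default budgets are made-up numbers for illustration.
    """
    if budgets_ms is None:
        budgets_ms = {"p50": 5.0, "p95": 8.0, "p99": 10.0}
    measured = {name: percentile(latencies_ms, int(name[1:])) for name in budgets_ms}
    passed = all(measured[name] <= budget for name, budget in budgets_ms.items())
    return passed, measured

# Simulated per-request gateway overhead: mostly fast, with a slow 1% tail
random.seed(7)
samples = [random.uniform(1.0, 6.0) for _ in range(990)] + \
          [random.uniform(8.0, 9.5) for _ in range(10)]
ok, measured = gate_release(samples)
print(ok, {name: round(value, 2) for name, value in measured.items()})
```

The point of gating on P95/P99 rather than the average is exactly the one commenters raised: the slow tail is what users notice, and an average hides it.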

Hottest takes

“That’s not a job, that’s a firehose with stock options” — crashloopcoffee
“0.5% at 10 people and $7M ARR is real; add on-call pay and I’m in” — comp_goblin
“‘Blameless post-mortems’ but your phone will be very blamed at 3 a.m.” — sleeplessSRE
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.