Continuous batching from first principles (2025)

Speed boost for chatbots ignites fairness fight and 'first principles' memes

TLDR: The post pitches a speed trick: process many chats at once so AI replies faster. Commenters love the efficiency but clash over fairness and worst-case waits, asking how mixed long/short requests are scheduled and joking that “first principles” is now a meme—because speed means nothing if some users get stranded.

The blog dives into how chatbots answer one word at a time and how “continuous batching” packs multiple conversations together so GPUs don’t sit idle. It walks from attention (the part that makes words relate) to KV caching (the model’s short-term memory), then lands on a simple promise: squeeze more chats in, get more speed. Neat! But the comments? A whole other show.
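
If you want the gist in code, here is a minimal Python sketch of the loop the post describes: every active chat advances one token per step, finished chats free their slot immediately, and waiting chats are admitted mid-flight instead of queuing behind a whole batch. All names here (Request, toy_decode_step, continuous_batching_loop) are ours, and the model call is a stand-in, not the post's implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: list[int]                       # prompt token ids
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)


def toy_decode_step(req: Request) -> int:
    """Stand-in for one forward pass of the model; returns a dummy next-token id."""
    return len(req.prompt) + len(req.generated)


def continuous_batching_loop(waiting: deque, max_batch: int) -> list[Request]:
    done: list[Request] = []
    active: list[Request] = []
    while waiting or active:
        # Admit waiting requests whenever a slot frees up (the "continuous" part:
        # nobody waits for the whole batch to drain first).
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One decode step: every active conversation advances by exactly one token.
        for req in active:
            req.generated.append(toy_decode_step(req))
        # Retire finished conversations immediately so their slots are reused.
        next_active = []
        for req in active:
            if len(req.generated) >= req.max_new_tokens:
                done.append(req)
            else:
                next_active.append(req)
        active = next_active
    return done


# Usage: three requests of different lengths share the same decode loop.
finished = continuous_batching_loop(
    deque([Request([1, 2, 3], 5), Request([4], 2), Request([5, 6], 8)]),
    max_batch=2,
)
```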

Engineers swarmed the thread with fairness vs. throughput drama. User charcircuit asked the killer question: what happens when different chats need different “experts” (think specialized parts of a big model)? Translation: who gets a seat on the memory bus, and who gets booted to the curb? Suddenly, it’s an airline boarding brawl for tokens. Others echoed the worry: great averages don’t matter if a few users get stuck watching the typing cursor blink.

The sharpest heat came from umairnadeem123, poking the sore spot of tail latency, the worst waits users actually feel. Big contexts vs. tiny prompts: do you group by expected length or just run first-come, first-served? Without “aging” (a policy that bumps a request’s priority the longer it waits, so nobody starves), they’ve seen p95 latency improve on paper while a few unlucky tenants get forgotten entirely. Meanwhile, asteroidburger brought the popcorn, joking that “first principles” is sliding into meme territory. Verdict: cool theory, but the crowd wants receipts, knobs, and proof this won’t turn into a theme-park line from hell. Read the post and the comments for the real entertainment.
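
For readers wondering what “aging” even means in practice, here is a hypothetical sketch: rank waiting requests by a score that favors short prompts at first but decays with time spent in the queue, so nobody can be deferred forever. The scoring rule and its weight are our illustration, not something specified in the post or the comments.

```python
import time


def admission_score(prompt_len: int, enqueued_at: float,
                    age_weight: float = 50.0, now: float | None = None) -> float:
    """Lower score = admitted sooner. Short prompts win early; long waits win eventually."""
    waited = (now if now is not None else time.monotonic()) - enqueued_at
    return prompt_len - age_weight * waited   # the aging term erodes the length penalty


def pick_next(queue: list) -> int:
    """Index of the next (prompt_len, enqueued_at) entry to admit."""
    now = time.monotonic()
    return min(range(len(queue)),
               key=lambda i: admission_score(queue[i][0], queue[i][1], now=now))


# Usage: a fresh short prompt vs. a big context that has waited 90 seconds.
queue = [(32, time.monotonic()), (4096, time.monotonic() - 90.0)]
print(pick_next(queue))   # prints 1: the old long request has aged past the short one
```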

Key Points

  • LLM generation is autoregressive and token-by-token, which is why first-token latency is noticeable and outputs arrive as a stream.
  • Inference is computationally heavy: every generated token requires a full forward pass through billions of parameters.
  • Continuous batching optimizes throughput by running multiple conversations in parallel and swapping out completed ones.
  • Attention is the locus of cross-token interaction, computed via Q, K, V projections and QKᵀ with quadratic complexity in sequence length.
  • A boolean attention mask controls allowable token interactions; KV caching is cited as a foundation for deriving continuous batching (see the sketch after this list).
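
To make the last two bullets concrete, here is a compact NumPy sketch of masked attention: per-token Q, K, V projections, the QKᵀ score matrix that is quadratic in sequence length, and a boolean mask (causal here) that zeroes out disallowed interactions. The function name, shapes, and causal choice are ours for illustration; the post's own notation may differ.

```python
import numpy as np


def masked_attention(x: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    T, d = x.shape
    Q, K, V = x @ Wq, x @ Wk, x @ Wv             # per-token projections
    scores = Q @ K.T / np.sqrt(d)                # (T, T): quadratic in sequence length
    mask = np.tril(np.ones((T, T), dtype=bool))  # True = interaction allowed (causal)
    scores = np.where(mask, scores, -np.inf)     # masked positions get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V


# Tiny usage example with random weights.
rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.normal(size=(T, d))
W = [rng.normal(size=(d, d)) for _ in range(3)]
out = masked_attention(x, *W)                    # shape (4, 8)
```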

Hottest takes

Scheduling gets very complicated — charcircuit
How long until “first principles” is a meme — asteroidburger
p95 improve while a few tenants get starved — umairnadeem123