Postgres Postmaster does not scale

On-the-hour meeting rush jams logins; the crowd shouts: use a bouncer

TLDR: Recall.ai hit a quirky database bottleneck: Postgres’s single-threaded “doorman” slowed new logins during on-the-hour traffic waves. Commenters split between “throw a connection proxy at it,” “rethink the data layout with sharding,” and “add jitter so everything doesn’t start at :00,” with a few joking about replacing the doorman entirely.

Millions of meetings start on the dot, Recall.ai’s servers surge, and suddenly the Internet is arguing about the world’s grumpiest doorman: Postgres’s single-threaded “postmaster.” The company found its database’s gatekeeper choking during those on-the-hour stampedes, making new connections wait a painful 10–15 seconds. They even built a giant, synchronized test to prove it. Result: the doorman was maxing out a whole CPU core just spawning new connections. Ouch.

That’s when the comments lit up. Team Pragmatic showed up first: “This is why you put a gate in front of the gate,” said folks like vel0city, pointing to connection pools like pgbouncer and Amazon’s RDS Proxy—basically a velvet rope that stops the crush from hitting the doorman all at once. Team Big Architecture fired back with “why not split the crowd?” Atherton wondered if they’re writing to a single database and suggested sharding per customer. Meanwhile, Team Chaos Tamer dropped a simple life hack: don’t do stuff at round hours—add jitter! One commenter even linked a guide on avoiding round-hour traffic spikes.
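For the “gate in front of the gate” idea, a minimal pgbouncer setup in transaction-pooling mode might look like the sketch below. The database name, host, and pool sizes are placeholder assumptions, not Recall.ai’s actual configuration; the point is that thousands of client connections get funneled into a small, reusable pool of server connections, so the postmaster rarely has to fork a fresh backend.

```ini
[databases]
; hypothetical database name and host
appdb = host=db.internal port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling hands a server connection to a client
; only for the duration of a transaction, then reclaims it
pool_mode = transaction
; many clients, few actual Postgres backends
max_client_conn = 10000
default_pool_size = 50
```

Clients then connect to port 6432 instead of Postgres directly; the on-the-hour crush hits the bouncer, not the doorman.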

Then came the spice. One brave soul asked, can’t we just replace the doorman altogether? Cue veteran eye-rolls. Another chimed in with the meme-y promise that “PgDog” will fix it all, prompting equal parts curiosity and side-eye. The vibe: use a bouncer now, rethink the club layout tomorrow, and stop scheduling parties at midnight.

Key Points

  • Recall.ai experiences extreme synchronized load spikes as most meetings start on the hour, requiring immediate compute readiness.
  • They observed sporadic 10–15s delays in PostgreSQL connection setup despite normal resource metrics and successful TCP handshakes.
  • Investigation identified the PostgreSQL postmaster’s single-threaded main loop as a bottleneck under high worker churn.
  • The postmaster can saturate a CPU core, slowing backend forking, connection establishment, and parallel worker handling.
  • A production-like reproduction environment using Redis pub/sub and 3,000+ EC2 instances replicated the delay for instrumentation and analysis.
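The jitter remedy from the thread can be sketched in a few lines of Python. This is an illustrative sketch, not Recall.ai’s code; the 300-second window is an arbitrary assumption.

```python
import random

def jittered_start(scheduled_ts: float, max_jitter_s: float = 300.0) -> float:
    """Offset a scheduled timestamp by a random delay so that jobs
    all scheduled for the same instant (e.g. the top of the hour)
    do not hit the database simultaneously."""
    return scheduled_ts + random.uniform(0.0, max_jitter_s)

# Example: 1,000 jobs all scheduled for the same :00 timestamp
# end up spread across a 5-minute window instead of one spike.
top_of_hour = 1_700_000_000.0
starts = [jittered_start(top_of_hour) for _ in range(1000)]
```

The trade-off is latency: work starts up to five minutes “late,” which is why commenters pitched jitter for background load, and pooling for the meetings that genuinely must begin on the dot.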

Hottest takes

"tools like pgbouncer were designed to solve" — vel0city
"cant we replace postmaster with something better?" — vivzkestrel
"One of the many problems PgDog will solve" — levkk
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.