Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Netflix blames CPUs for slow starts; commenters cry 'AI blog' and ask 'why two sites'

TLDR: Netflix found a CPU-level chokepoint: starting containers triggers so many filesystem mounts that servers lock up and streams start slowly. Comments split between fans of the behind‑the‑scenes story and snark about AI‑style writing, the two tech blogs, and whether the real fix is disabling hyper‑threading or finally moving beyond Docker.

Hit play on Netflix and, behind the curtain, hundreds of tiny “containers” spin up like popcorn. Netflix says a recent runtime upgrade slammed into a surprise bottleneck: the CPU and its file “mount” traffic cop. Each container image has dozens of layers, and building them means thousands of mounts and unmounts. When too many containers start at once, the operating system grabs a global lock to process these mounts—think one toll booth for a freeway—and everything stalls. Health checks time out, servers freeze, and your binge gets a speed bump. Chaos peaks when new servers and containers all launch at once.
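
The toll-booth effect can be sketched in a few lines of Python. This is a toy model, not kernel code: `mount_lock` stands in for the single global lock, and `start_container`, `simulate`, and the counts passed to them are hypothetical names and numbers chosen for illustration.

```python
import threading

mount_lock = threading.Lock()  # stand-in for the one global mount lock
in_booth = 0                   # threads currently inside the locked section
max_in_booth = 0               # highest concurrency ever seen inside it
total_ops = 0                  # simulated mount operations completed


def start_container(mounts_per_container: int) -> None:
    """Simulate one container start: every mount passes through the lock."""
    global in_booth, max_in_booth, total_ops
    for _ in range(mounts_per_container):
        with mount_lock:
            in_booth += 1
            max_in_booth = max(max_in_booth, in_booth)
            total_ops += 1
            in_booth -= 1


def simulate(containers: int, mounts_per_container: int) -> tuple[int, int]:
    """Start all containers at once; return (total ops, peak concurrency)."""
    global in_booth, max_in_booth, total_ops
    in_booth = max_in_booth = total_ops = 0
    threads = [
        threading.Thread(target=start_container, args=(mounts_per_container,))
        for _ in range(containers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total_ops, max_in_booth
```

However many containers launch together, the peak concurrency inside the locked section stays at one, so the work queues up behind the lock instead of spreading across cores.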

The comments went full soap opera. One camp cheers the behind‑the‑scenes peek, while another drags the blog style as AI‑written and formulaic. Confusion explodes over Netflix having two tech sites: netflixtechblog.medium.com vs netflixtechblog.com. Folks also ask if it really took this long to move from Docker to containerd. Meanwhile, a hardware hot take lands: disabling HT (hyper‑threading, the CPU trick that runs two tasks per core) might help when locks pile up. The room divides between “cool detective story,” “who wrote this, ChatGPT?,” and “please, fewer layers.” The impact is relatable: fewer startup stalls means faster streams for you.

Key Points

  • Netflix encountered severe mount lock contention when rapidly starting many containers, leading to health check timeouts and system stalls.
  • The issue correlated with images having many layers (50+), especially on AWS r5.metal instances.
  • containerd’s per-layer operations (open_tree, mount_setattr, move_mount) and overlayfs construction generate numerous bind mounts and unmounts.
  • The kernel VFS takes global mount table locks, so contention (and therefore startup time) grows with the number of container image layers.
  • For 100 containers with 50 layers each, startup can require about 20,200 mount operations, because containerd runs the mount sequence twice per container.
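
The 20,200 figure can be reproduced with a quick back-of-the-envelope calculation. The breakdown below is an assumption, not something this summary spells out: one bind mount and one unmount per layer, plus a single overlayfs mount, with the whole sequence run twice per container. The function name `mount_operations` is hypothetical.

```python
def mount_operations(containers: int, layers: int, passes: int = 2) -> int:
    """Estimate mount-related operations for a burst of container starts.

    Assumed breakdown per pass: one bind mount and one unmount for every
    layer, plus one overlayfs mount that stacks the layers together.
    """
    bind_mounts = layers
    unmounts = layers
    overlay_mounts = 1
    per_pass = bind_mounts + unmounts + overlay_mounts
    return containers * passes * per_pass


print(mount_operations(containers=100, layers=50))  # 100 * 2 * 101 = 20200
```

Under these assumptions the total grows linearly with both the number of containers and the number of layers, which is why "please, fewer layers" is a reasonable reaction.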

Hottest takes

"Why is this so badly AI written?" — owenthejumper
"Interesting, another case of removing HT improving performance" — haneul
"It took them this long to move from docker to containerd?" — parliament32