November 24, 2025
Big Kube Energy
Building the largest known Kubernetes cluster, with 130k nodes
Google flexes 130k-node mega cluster; commenters ask why and shout AWS did it first
TLDR: Google tested a 130,000-node Kubernetes cluster to flex AI scale, but commenters question the need, poke at unimpressive control-plane numbers, and point to AWS’s earlier 100k claim. It matters because AI is hitting power limits, and improvements for mega-users could make everyday clusters more reliable.
Google just bragged about spinning up a 130,000-node Kubernetes cluster (think: a giant farm of servers) to prove Google Kubernetes Engine (GKE) can handle monster AI workloads. They say it pushed out 1,000 “Pods” (little app containers) per second and stored over a million objects, all while hinting that the real bottleneck now isn’t chips—it’s electricity. One NVIDIA GB200 chip slurps 2,700 watts, so mega-clusters could eat hundreds of megawatts. Google teased tools like MultiKueue (to juggle jobs across clusters) and faster networks to keep the AI beast fed.
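The "hundreds of megawatts" claim is easy to sanity-check with back-of-envelope math. A rough sketch; the 2,700 W and 130,000-node figures come from the article, while the one-GB200-per-node assumption (and ignoring CPU, networking, and cooling overhead) is purely illustrative:

```python
# Back-of-envelope power estimate for a 130k-node cluster.
# 2,700 W per NVIDIA GB200 and the node count are from the article;
# assuming one GB200 per node is an illustrative simplification,
# and real facility draw would be higher (CPUs, networking, cooling).

NODES = 130_000          # cluster size (from the article)
WATTS_PER_GB200 = 2_700  # per-chip draw (from the article)

total_watts = NODES * WATTS_PER_GB200
total_megawatts = total_watts / 1e6

print(f"~{total_megawatts:.0f} MW of accelerator power alone")
```

Even under these conservative assumptions the accelerators alone land in the hundreds-of-megawatts range, which is why the article frames electricity, not chips, as the next bottleneck.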
But the comments? Pure chaos. The top vibe: skeptical side-eye. rvz basically says if you “require” 130k nodes, maybe your architecture is the problem. hazz99 isn’t wowed, arguing the control plane’s QPS (queries per second) sounds… mid. Others drag Google’s storage mount (“GCS fuse”) as wobbly in the real world, and zoobab drops a meme: “The new mainframe.” Then the competitive spice hits—blurrybird points out AWS + Anthropic claimed 100k nodes, igniting a cloud power-lifting contest.
So is this a historic scale moment or just cloud peacocking? The crowd’s split between “cool science” and “marketing math,” with jokes about mega-clusters being mega-stress. Still, even haters admit: hardening for the extremes could make normal clusters faster and sturdier for everyone.
Key Points
- Google Cloud ran a 130,000-node GKE cluster in experimental mode, double the officially supported 65,000 nodes.
- The test sustained 1,000 Pods per second and stored over 1 million objects in optimized distributed storage.
- Demand for large clusters is driven by AI workloads, with many customers already operating 20,000–65,000 nodes and expectations approaching 100,000.
- Power constraints loom large: a single NVIDIA GB200 superchip draws 2,700 W, implying that mega-clusters may reach hundreds of megawatts and require multi-data-center orchestration.
- Architectural innovations include Kubernetes read-scalability features (KEP-2340, KEP-4988), investments in MultiKueue, managed DRANET, and improved topology awareness.
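The throughput numbers above can also be put in perspective with simple arithmetic. A hedged sketch; the 1,000 Pods/s rate and node count are from the article, while the one-pod-per-node workload shape is an assumption for illustration:

```python
# How long would filling the whole cluster take at the reported rate?
# Rate and node count are from the article; assuming one large AI pod
# per node is an illustrative workload shape, not a reported figure.

PODS_PER_SECOND = 1_000  # sustained scheduling throughput (from the article)
NODES = 130_000          # cluster size (from the article)
PODS_PER_NODE = 1        # assumption: one big AI pod per node

total_pods = NODES * PODS_PER_NODE
seconds = total_pods / PODS_PER_SECOND

print(f"{seconds:.0f} s (~{seconds / 60:.1f} min) to place {total_pods:,} pods")
```

In other words, at the reported rate a full cluster's worth of pods could be placed in a couple of minutes, which is the kind of churn speed that matters when an AI training job needs to be rescheduled wholesale.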