Ask HN: Scheduling stateful nodes when MMAP makes memory accounting a lie

Server meltdown sparks 'just use Kubernetes' vs 'let nodes say no' brawl

TLDR: A coordinator misread “low rows” as low load and kept hammering a nearly full server, causing a retry loop. Commenters split between Kubernetes-style reservations, latency-based backpressure, OS signals like PSI, and cheeky calls for ML to babysit metrics, because naive measurements can crash real systems.

A spicy Ask HN confessional lit up the crowd: a coordinator kept shoving data onto a “quiet” server because it had fewer rows, but the box was actually stuffed to the brim with chunky data and near out-of-memory. Cue chaos: the coordinator ignored the “I’m full” signals and basically DDoS’d its own node. The community’s verdict? Row count is a lie, memory is a drama queen, and your scheduler needs therapy.

Team Kubernetes swaggered in first: declare memory reservations so the scheduler treats your capacity as hard facts, regardless of lazy-loaded trickery. The performance purists snapped back: let latency be the truth. If a node gets slow or jittery, feed it less and close the loop back to the balancer. The OS whisperers pulled out Pressure Stall Information (PSI), a Linux kernel metric that reports how long tasks are stalled waiting on CPU, memory, or I/O, with a “look at active pages” wink. Then the chaos agents piled on with the hottest take: “NP-hard? Perfect for machine learning,” turning the “God Equation” meme into “let AI parent your cluster.”
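The PSI suggestion is concrete enough to sketch. `/proc/pressure/memory` is a real Linux interface (kernel 4.20+); the parser below handles its documented two-line format, while the shed-load threshold is an assumption for illustration, not a tuned value.

```python
def parse_psi(text):
    """Parse PSI output such as:
    some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    full avg10=7.50 avg60=2.00 avg300=0.40 total=900
    Returns {"some": {...}, "full": {...}} with numeric values."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        stats = {}
        for field in fields:
            key, value = field.split("=")
            # 'total' is cumulative microseconds (int); avgN are percentages
            stats[key] = int(value) if key == "total" else float(value)
        out[kind] = stats
    return out

def memory_is_stalled(psi, full_avg10_threshold=5.0):
    # "full" means *all* runnable tasks were stalled on memory; a sustained
    # nonzero avg10 indicates thrashing, not mere page-cache pressure.
    # The 5.0 threshold is a hypothetical knob, not a kernel default.
    return psi["full"]["avg10"] >= full_avg10_threshold
```

On a live node the input would come from `open("/proc/pressure/memory").read()`; the same format exists for `cpu` and `io`.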

Meanwhile, pragmatists cheered a “dumb coordinator, smart nodes” plan: assign segments by free disk space, let workers answer 429 (Too Many Requests) when stressed, and separate disk balancing from memory-heavy query work. Peak meme: “mmap is gaslighting your RAM.” Checkmate.
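The “smart nodes” half of that plan can be sketched in a few lines: the worker itself gates admission and returns an HTTP-style 429 when a new segment would eat its safety margin. The headroom figure and the function shape are assumptions for illustration; any local signal (PSI, a heap watermark) could drive the same check.

```python
# HTTP status codes the coordinator is expected to respect.
OK = 200
TOO_MANY_REQUESTS = 429

def admit_segment(free_ram_bytes, segment_bytes, headroom_bytes=8 << 30):
    """Worker-side admission control: accept a segment only if loading it
    still leaves `headroom_bytes` of slack (hypothetical 8 GiB default).
    Returning 429 tells the coordinator to back off and retry elsewhere."""
    if free_ram_bytes - segment_bytes < headroom_bytes:
        return TOO_MANY_REQUESTS
    return OK
```

The design point is that the node owns the refusal: the coordinator stays dumb, and the feedback loop closes at the only place that actually knows how full the box is.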

Key Points

  • A distributed stateful engine uses a Coordinator to assign data segments to Worker Nodes, with heavy reliance on mmap and lazy loading.
  • A failure occurred when the Coordinator misread Node A’s low logical row count as underutilization and repeatedly tried to load new segments.
  • Node A was near OOM (~197GB RAM) due to very wide rows and large blobs, making row count a poor proxy for resource usage.
  • OS page cache and lazy loading made application-level RSS and disk metrics unreliable for memory-aware scheduling.
  • The author proposes options: rely on node-enforced backpressure (HTTP 429), build per-segment cost models, or decouple storage balancing from query/memory balancing, and seeks references.
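The core failure in the points above, row count as a proxy for resource usage, can be shown in a tiny per-segment cost model. The node names, segment shapes, and byte sizes here are invented for illustration; the point is only that ranking by estimated bytes and ranking by rows disagree exactly when rows are wide.

```python
def estimated_bytes(segment):
    # Approximate resident cost as rows * average row width; wide rows
    # and large blobs dominate even at low row counts.
    return segment["rows"] * segment["avg_row_bytes"]

def pick_node_by_bytes(nodes):
    """nodes: {name: [segments]} -> name of the least-loaded node,
    measured in estimated resident bytes rather than logical rows."""
    return min(nodes, key=lambda n: sum(estimated_bytes(s) for s in nodes[n]))

def pick_node_by_rows(nodes):
    # The naive scheduler from the incident: fewest rows "wins".
    return min(nodes, key=lambda n: sum(s["rows"] for s in nodes[n]))

# Hypothetical cluster mirroring the Ask HN story:
nodes = {
    "A": [{"rows": 1_000, "avg_row_bytes": 1_000_000}],   # few, huge rows
    "B": [{"rows": 1_000_000, "avg_row_bytes": 100}],     # many, tiny rows
}
```

A row-count scheduler keeps feeding Node A (only 1,000 rows!) even though it holds ~1 GB to Node B's ~100 MB, which is precisely the loop the author described.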

Hottest takes

"Ok, so you are dealing with a classic - you measure A, but what matters is B." — majke
"Latency backpressure is a pretty conventional thing to do." — bcoates
"That's perfect for machine learning." — wmf
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.