June 3, 2026

Free memory? Plot twist: nope

When does fragmentation occur in the CUDA caching allocator?

Your GPU says there’s room left, then suddenly acts full — and commenters are DONE

TLDR: PyTorch explained why a GPU can appear to have space left and still fail an allocation: its memory manager can split free space into pieces that are each too small for the next request. Commenters weren’t shocked. They said this happens constantly in real AI systems, and some have already abandoned the default approach entirely.

PyTorch dropped a deep dive on a deeply annoying mystery: why your graphics card can look like it still has free memory, then refuse to load one more thing anyway. The short version is that memory can get chopped into awkward little pieces, so even if the total adds up, there’s no single chunk big enough to use. That’s the official explanation. But in the comments, the vibe is less “interesting technical detail” and more “this is my daily villain origin story.”
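
For the skeptics, here is the "total adds up, nothing fits" problem as a tiny Python sketch. The layout is made up for illustration (three free 2 MiB holes with live data between them), and `can_allocate` is a hypothetical helper, not anything from PyTorch.

```python
MiB = 1024 * 1024


def can_allocate(free_extents, nbytes):
    """True if any single free extent is large enough for the request."""
    return any(size >= nbytes for _, size in free_extents)


# Hypothetical layout: (offset, size) of each free hole.
# Live allocations occupy the gaps between them.
free_extents = [(0 * MiB, 2 * MiB), (4 * MiB, 2 * MiB), (8 * MiB, 2 * MiB)]

total_free = sum(size for _, size in free_extents)
largest_free = max(size for _, size in free_extents)

print(total_free // MiB)                    # 6 MiB free in total
print(largest_free // MiB)                  # but the biggest single hole is 2 MiB
print(can_allocate(free_extents, 4 * MiB))  # False: no hole fits a 4 MiB request
```

Six MiB free, and a 4 MiB request still bounces. That gap between total free and largest contiguous free is the whole story.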

The strongest reaction came from people running large language models, the chatbots and AI systems that already live on the edge of maxing out expensive hardware. One commenter basically said this “edge case” is not an edge case at all; it’s the normal miserable state of modern AI serving. Their fix? Stop trusting the built-in memory manager for the KV cache (the per-request state a model keeps while it generates) and reserve one giant region at startup instead, the way vLLM does. That’s the kind of comment that lands like a mic drop: less “thanks for the explanation,” more “we gave up and built around it.”
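
What "grab one giant chunk up front" can look like, as a rough pure-Python sketch: reserve one region, carve it into equal-size blocks, and hand out block ids. `BlockPool` and its methods are hypothetical names for illustration. This is not vLLM's code and it never touches a GPU; it only models the bookkeeping.

```python
KiB = 1024


class BlockPool:
    """Toy pool: one up-front reservation carved into equal-size blocks."""

    def __init__(self, total_bytes, block_bytes):
        self.block_bytes = block_bytes
        self.num_blocks = total_bytes // block_bytes
        # Every block starts free; ids stand in for offsets into the region.
        self.free_ids = list(range(self.num_blocks))

    def allocate(self, nbytes):
        """Return block ids covering nbytes, or None if too few blocks are free."""
        needed = -(-nbytes // self.block_bytes)  # ceiling division
        if needed > len(self.free_ids):
            return None
        return [self.free_ids.pop() for _ in range(needed)]

    def release(self, block_ids):
        """Give blocks back; they are reusable immediately, wherever they sit."""
        self.free_ids.extend(block_ids)


pool = BlockPool(total_bytes=8 * KiB, block_bytes=1 * KiB)
a, b, c, d = (pool.allocate(2 * KiB) for _ in range(4))  # pool is now full

pool.release(a)
pool.release(c)               # the freed blocks are not next to each other

e = pool.allocate(4 * KiB)    # still succeeds: blocks need not be contiguous
print(sorted(e))
print(pool.allocate(1 * KiB)) # None: the pool is genuinely full now
```

The design choice doing the work here is that a request is a list of same-size blocks rather than one contiguous range, so any free block is as good as any other and there is nothing left to fragment.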

And that’s where the drama sits. PyTorch is saying, “Here’s why this happens, and here’s a feature that helps in some cases.” The community response is: “Some cases?” People dealing with nonstop request traffic say the mess never really goes away, because long-lived data squats in memory while short-lived data constantly shuffles around it. It’s classic tech-comment-section energy: one side offering a careful mental model, the other replying with the internet’s favorite bug report — “cool story, still broken in production.”

Key Points

  • The article explains that fragmentation in PyTorch’s CUDA caching allocator can cause allocation failures even when total free GPU memory is theoretically sufficient.
  • It identifies high-utilization GPU workloads, especially LLM serving with CUDA graphs and KV cache, as common situations where fragmentation becomes visible.
  • The allocator manages memory as segments obtained from CUDA and blocks within those segments, with allocation implemented through block splitting and deallocation through adjacent-block merging.
  • Free blocks can merge only when they sit next to each other in the same segment with no active allocation between them, so the order of allocations and frees determines whether freed memory becomes reusable.
  • Without expandable segments, each `cudaMalloc` call creates a separate segment, and blocks in different segments cannot merge. Allocations from 1 MiB to 10 MiB get a 20 MiB segment; allocations of 10 MiB or more are rounded up to a multiple of 2 MiB.
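
The sizing, splitting, and merging rules in these points can be sketched as a toy model in plain Python. Treat it as an illustration under assumptions, not the allocator's real code: `segment_size` and `Segment` are made-up names, the 2 MiB segment for requests of 1 MiB or less is an assumption the summary does not state, the exact boundary handling is a guess, and the search is first-fit for simplicity.

```python
MiB = 1024 * 1024


def segment_size(nbytes):
    """Toy version of the segment size requested from CUDA for one allocation."""
    if nbytes <= 1 * MiB:
        return 2 * MiB  # assumption: small requests share a 2 MiB segment
    if nbytes < 10 * MiB:
        return 20 * MiB
    return -(-nbytes // (2 * MiB)) * (2 * MiB)  # round up to a multiple of 2 MiB


class Segment:
    """One segment, kept as an address-ordered list of [size, in_use] blocks."""

    def __init__(self, size):
        self.blocks = [[size, False]]

    def alloc(self, nbytes):
        """Split the first free block that fits; return its offset or None."""
        offset = 0
        for i, (size, in_use) in enumerate(self.blocks):
            if not in_use and size >= nbytes:
                self.blocks[i] = [nbytes, True]
                if size > nbytes:
                    self.blocks.insert(i + 1, [size - nbytes, False])
                return offset
            offset += size
        return None

    def free(self, offset):
        """Mark the block at offset free, then merge it with free neighbours."""
        pos = 0
        for i, block in enumerate(self.blocks):
            if pos == offset and block[1]:
                block[1] = False
                if i + 1 < len(self.blocks) and not self.blocks[i + 1][1]:
                    block[0] += self.blocks.pop(i + 1)[0]
                if i > 0 and not self.blocks[i - 1][1]:
                    self.blocks[i - 1][0] += self.blocks.pop(i)[0]
                return
            pos += block[0]
        raise ValueError("no live block at that offset")


seg = Segment(segment_size(4 * MiB))  # a 4 MiB request gets a 20 MiB segment
a = seg.alloc(8 * MiB)
b = seg.alloc(4 * MiB)                # a small block lands in the middle
c = seg.alloc(8 * MiB)

seg.free(a)
seg.free(c)                           # 16 MiB free, but b sits between the holes
blocked = seg.alloc(12 * MiB)         # None: neither 8 MiB hole is big enough

seg.free(b)                           # now all three neighbours merge into 20 MiB
unblocked = seg.alloc(12 * MiB)       # succeeds at offset 0
print(blocked, unblocked)
```

One 4 MiB block in the wrong spot is enough to make 16 MiB of free memory useless for a 12 MiB request, which is the allocation-order effect described above.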

Hottest takes

"I treat the failure mode ... as the steady state, not an edge case" — keynha
"I’ve stopped relying on the caching allocator for KV at all" — keynha
"vLLM reserves one big region at startup" — keynha