Virtualizing Nvidia HGX B200 GPUs with Open Source

DIY GPU VMs wow devs while skeptics shout “not open”

TLDR: Engineers showed how to run virtual machines on NVIDIA’s HGX B200 using open-source tooling. Comments erupted over whether this is truly “open,” how safe multi-tenant isolation is, and if AMD and InfiniBand fit—mixing excitement with skepticism and turning a niche GPU guide into community drama.

A new guide claims you can spin up virtual machines on NVIDIA's monster HGX B200 boards using open-source tools. The author jumps in saying the real headache was NVLink (NVIDIA's super-fast GPU-to-GPU highway) because it makes isolating tenants tricky while keeping speed. Folks loved the transparency, with AMA invites flying, but that's where the comment drama revved up.

One skeptic asked point-blank whether NVIDIA's Fabric Manager (the control software for these GPU clusters) is even open source, quipping that this felt more like "open kimono" than open source, and pressed hard on the security boundaries of multi-tenant setups. Another thread spun into the eternal Team Green vs. Team Red meme, with a commenter noting you can swap "nvidia" for "amdgpu" and much of the how-to still works; cue the AMD stans. A practical crowd asked whether you can slice one GPU across multiple VMs without relying on NVIDIA's MIG (their built-in partitioning tech), and a speed freak wanted to know if InfiniBand (the fast network used in supercomputers) runs at full speed inside each VM.

The vibe: builders are hyped, skeptics smell marketing, and everyone's joking about cloud giants guarding secrets while indie hackers leak the DIY playbook. Chaos, memes, and GPU heat. Classic.

Key Points

  • The article explains how to virtualize NVIDIA HGX B200 GPUs using open-source tools on Linux.
  • HGX B200 uses SXM modules with NVLink/NVSwitch, creating a uniform high-bandwidth fabric that complicates virtualization.
  • Host setup involves permanently binding the GPUs to the vfio-pci driver (switching them away from the NVIDIA driver) and aligning software versions between host and guest.
  • A PCI topology mismatch can block device initialization; QEMU is used to craft custom PCI layouts for guests.
  • A large-BAR stall issue is worked around by upgrading to QEMU 10.1+ or disabling BAR mmap with x-no-mmap=true; GPU partitioning is managed through NVIDIA Fabric Manager and its API.
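The vfio-pci binding and QEMU workaround above can be sketched roughly as follows. This is a minimal illustration, not the article's exact commands: the PCI address is a hypothetical placeholder (find yours with lspci), the manual sysfs sequence shown here does not persist across reboots (persistent binding typically uses driverctl or kernel command-line options), and everything after the -device flag on the QEMU line is an illustrative stand-in for a real guest configuration.

```shell
# Hypothetical PCI address -- list yours with: lspci -nn | grep -i nvidia
GPU=0000:17:00.0

# Detach the GPU from whatever driver currently owns it (if any)
echo "$GPU" | sudo tee /sys/bus/pci/devices/$GPU/driver/unbind

# Ask the kernel to let vfio-pci claim this device on the next bind...
echo vfio-pci | sudo tee /sys/bus/pci/devices/$GPU/driver_override

# ...and bind it, so QEMU can pass it through to a guest
echo "$GPU" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind

# Passing the device to a guest; x-no-mmap=true is the large-BAR
# workaround mentioned above (only needed on QEMU before 10.1).
# Machine type, memory, and the rest of the command line are
# illustrative placeholders, not a complete working invocation.
qemu-system-x86_64 \
  -machine q35 -enable-kvm -m 64G \
  -device vfio-pci,host=0000:17:00.0,x-no-mmap=true \
  ...
```

Host-configuration fragment: these commands touch real PCI devices and require matching hardware, so they are shown for orientation rather than as a copy-paste recipe.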

Hottest takes

"The hardest part was virtualizing GPUs with NVLink in the mix" — ben_s
"It’s not clear that anything in this article relates to Open Source at all" — otterley
"Replace any 'nvidia' for 'amdgpu' for Team Red" — mindcrash
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.