February 21, 2026

Snail-speed superbrain, big-time drama

Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

Yes, it runs a giant AI on a gamer GPU — but bring a snack

TLDR: A new engine streams a huge 70B AI from SSD to a single RTX 3090, skipping the CPU, but it only manages ~0.2 tokens/sec. Comments split between ‘cool demo’ and ‘too slow,’ while others push for Windows-style direct GPU loads and tiered expert models.

Hacker News just melted down over a wild demo: a C++/CUDA engine that streams a massive Llama 70B model off a fast SSD straight into a single RTX 3090’s memory, skipping the CPU entirely. The headline number had everyone clutching their keyboards: ~0.2 tokens per second (think: a word every few seconds). Cue the split. One camp cheered the engineering flex and the 33x speedup over the naive mmap-from-disk baseline; the other camp rolled their eyes, saying it’s cool science-fair stuff but not remotely “chat” ready. As one skeptic put it, just run a smaller model that fits in GPU memory — the 8B version spits out ~49 tokens/sec and actually feels instant.

Then the dreamers arrived with fireworks. Windows gamers wondered if Microsoft’s DirectX “load straight to GPU” tricks (think DirectStorage) could rescue the speed, while hardware heads drooled over faster PCIe lanes: with a beefier connection, they noted, this could climb to about 0.5 tokens/sec. Power users took it further: what if you shuffle only the “experts” of a giant model in and out of memory, MoE-style, so the hot parts live on the GPU and the rest nap on SSD? The memes wrote themselves — “brew coffee between tokens,” “my 3090 finally doing cardio.” Verdict? A glorious stunt or the first step toward big-brain models on budget rigs. Pick your fighter.
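
To make that expert-shuffling idea concrete, here is a toy sketch of what commenters are picturing. Nothing in it comes from the post and every name in it is hypothetical: hot experts sit behind a small LRU cache in VRAM, and a miss evicts the coldest expert and faults the requested one in from the slower tiers.

    // Toy LRU cache of MoE expert weights (hypothetical, not from the post):
    // hits stay on the GPU, misses evict the coldest expert and fault the
    // new one in from SSD/RAM.
    #include <cstddef>
    #include <list>
    #include <unordered_map>

    struct ExpertCache {
        size_t capacity;                                       // how many experts fit in VRAM
        std::list<int> lru;                                    // most recently used at the front
        std::unordered_map<int, std::list<int>::iterator> pos; // expert id -> position in lru

        void* get(int expert_id) {
            auto it = pos.find(expert_id);
            if (it != pos.end()) {                             // hit: expert already resident
                lru.splice(lru.begin(), lru, it->second);
                return vram_ptr(expert_id);
            }
            if (lru.size() >= capacity) {                      // miss with a full cache: evict
                int victim = lru.back();
                evict_from_vram(victim);
                pos.erase(victim);
                lru.pop_back();
            }
            upload_from_ssd(expert_id);                        // NVMe -> pinned RAM -> VRAM
            lru.push_front(expert_id);
            pos[expert_id] = lru.begin();
            return vram_ptr(expert_id);
        }

        // Stubs standing in for the real tiered-storage plumbing.
        void* vram_ptr(int)        { return nullptr; }
        void  evict_from_vram(int) {}
        void  upload_from_ssd(int) {}
    };

The usage idea: per token, the router picks a handful of experts per MoE layer and the forward pass calls get() for each, so only the hot experts ever occupy VRAM while the rest stay on SSD.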

Key Points

  • NTransformer runs Llama 3.1 70B (Q6_K) on a single NVIDIA RTX 3090 by streaming layers via PCIe and optionally using NVMe direct I/O that bypasses the CPU.
  • Tiered caching (VRAM + pinned RAM + NVMe/mmap) yields a 33× speedup over an mmap baseline for 70B, achieving ~0.2 tok/s on RTX 3090 + 48 GB RAM with 23.1 GB VRAM used.
  • For Llama 3.1 8B (Q8_0), VRAM-resident and auto-tiered modes deliver ~49 tok/s using ~10 GB VRAM; for 70B, an mmap baseline reached 0.006 tok/s due to page cache thrashing.
  • The pipeline overlaps NVMe reads, PCIe H2D DMA, and GPU compute (SLEP), with userspace NVMe (gpu-nvme-direct) reading GGUF weights directly into pinned GPU-accessible memory (simplified sketch after this list).
  • PCIe H2D bandwidth at Gen3 x8 (~6.5 GB/s) is the bottleneck; on Gen4 x16 (B550/X570), tier-B would become compute-bound with an expected ~0.5 tok/s (napkin math after this list).
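
For the curious, here is a minimal sketch of the overlap idea from the pipeline bullet above. It is not NTransformer’s code: the VRAM/RAM cache tiers are left out, every layer gets re-streamed, and nvme_read_layer / run_layer are hypothetical stand-ins for the real gpu-nvme-direct reads and layer kernels. The point is the shape of it: double-buffered pinned memory, a copy stream doing the PCIe H2D DMA, and a compute stream chewing on the previous layer while the next one is in flight.

    // Sketch only: stream decoder layers one by one while the GPU computes.
    #include <cuda_runtime.h>
    #include <cstddef>

    // Hypothetical placeholders for the real engine's I/O and kernels.
    static void nvme_read_layer(int, void*, size_t) { /* userspace NVMe read into pinned RAM */ }
    static void run_layer(int, const void*, cudaStream_t) { /* launch this layer's kernels */ }

    static void decode_one_token(int n_layers, size_t layer_bytes) {
        cudaStream_t copy_s, compute_s;
        cudaStreamCreate(&copy_s);
        cudaStreamCreate(&compute_s);

        void *host[2], *dev[2];                 // double buffers: one computing, one filling
        cudaEvent_t copied[2], computed[2];
        for (int b = 0; b < 2; ++b) {
            cudaHostAlloc(&host[b], layer_bytes, cudaHostAllocDefault); // pinned => async DMA
            cudaMalloc(&dev[b], layer_bytes);
            cudaEventCreate(&copied[b]);
            cudaEventCreate(&computed[b]);
        }

        // Prime the pipeline with layer 0.
        nvme_read_layer(0, host[0], layer_bytes);
        cudaMemcpyAsync(dev[0], host[0], layer_bytes, cudaMemcpyHostToDevice, copy_s);
        cudaEventRecord(copied[0], copy_s);

        for (int i = 0; i < n_layers; ++i) {
            int cur = i & 1, nxt = cur ^ 1;

            // Run layer i as soon as its weights have landed on the GPU.
            cudaStreamWaitEvent(compute_s, copied[cur], 0);
            run_layer(i, dev[cur], compute_s);
            cudaEventRecord(computed[cur], compute_s);

            // Meanwhile, stage layer i+1: wait until the other buffer's previous
            // user has finished, then NVMe-read + DMA it behind the running layer.
            if (i + 1 < n_layers) {
                cudaEventSynchronize(computed[nxt]);   // no-op if that event was never recorded
                nvme_read_layer(i + 1, host[nxt], layer_bytes);
                cudaMemcpyAsync(dev[nxt], host[nxt], layer_bytes,
                                cudaMemcpyHostToDevice, copy_s);
                cudaEventRecord(copied[nxt], copy_s);
            }
        }
        cudaStreamSynchronize(compute_s);
    }

    int main() {
        decode_one_token(80, 1 << 20); // 80 decoder layers, tiny dummy size for the sketch
        return 0;
    }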
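
And the napkin math behind those numbers. The 23.1 GB resident figure and the 6.5 GB/s Gen3 x8 rate come from the benchmark above; the ~6.6 bits/weight for Q6_K and the ~25 GB/s practical Gen4 x16 rate are round-number assumptions of mine, so treat the output as a sanity check, not a measurement.

    // Rough sanity check of "PCIe H2D is the bottleneck" (assumptions marked below).
    #include <cstdio>

    int main() {
        const double model_gb = 70e9 * 6.6 / 8 / 1e9; // ~58 GB of Q6_K weights (6.6 bits/weight is approximate)
        const double resident = 23.1;                  // GB kept in VRAM (benchmark figure)
        const double streamed = model_gb - resident;   // GB that must cross PCIe for every token
        const double gen3_x8  = 6.5;                   // GB/s, reported H2D bandwidth
        const double gen4_x16 = 25.0;                  // GB/s, rough practical figure (assumption)

        printf("Gen3 x8  ceiling: %.2f tok/s\n", gen3_x8 / streamed);  // ~0.19, matching the measured ~0.2
        printf("Gen4 x16 ceiling: %.2f tok/s\n", gen4_x16 / streamed); // ~0.7; the post expects ~0.5 once compute-bound
        return 0;
    }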

Hottest takes

"0.2 tok/s is fine for experimentation" — randomtoast
"running a 1T model ... JIT predict-ahead expert swaps" — rl3
"could this be used for multi-tier MoE?" — Wuzado
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.