A 35B MoE on a 16 GB GPU, without the offload tax

Your old graphics card just got invited back to the big leagues, and the comments are losing it

TLDR: A giant AI model that used to be too big for a 16 GB graphics card can now run on one without a huge slowdown. The community is split between celebrating a win for regular users and questioning whether the trick will hold up outside the demo, which is exactly why people care.

The big headline here is simple: a model so huge it normally wouldn’t even load on a 16 GB graphics card can now squeeze in and still run fast. That’s the kind of claim that makes the internet instantly split into two camps: the “this changes everything for home users” crowd and the “show me the real-world catch” skeptics. And wow, the reactions are loud. Fans are calling it the moment older cards like the RTX 3090 got their “main character comeback arc,” because instead of needing a monster setup, you can suddenly run something much bigger on hardware people already own.

The article says the trick is basically keeping only the parts the model uses most often on the card, while parking the rarely used parts in normal computer memory without causing the usual giant speed drop. Translation: less wasted space, almost the same speed. In the comments, that sparked instant bragging rights, disbelief, and a lot of “my dusty 3090 is so back” energy. Some users joked this is the tech equivalent of fitting a king-size mattress into a studio apartment and still having room to dance.

But not everyone is cheering. Critics are side-eyeing the phrase “learns from live traffic”, asking what happens when usage suddenly changes, or whether this only looks magical in cherry-picked tests. Others are roasting the whole field with jokes like, “every week AI invents a new way to say ‘we moved stuff around and prayed.’” Still, even the doubters seem to agree on one thing: when the alternative is “does not run at all,” getting near full speed on cheaper hardware is the kind of stunt that gets the community talking.

Key Points

  • The article says Luce Spark enables 33B–35B MoE models to run on a 16 GB GPU by keeping frequently routed experts resident and offloading the rest to system RAM.
  • Measured on an RTX 3090, Qwen3.6 35B-A3B is reported at 13.3 GiB versus about 20.5 GiB before, and Laguna XS.2 33B-A3B at 14.6 GiB versus 18.8 GiB.
  • For A3B models, only about 3B parameters are active per token, with the router selecting roughly 8 of 256 experts at each layer, despite total model sizes of 33B–35B parameters.
  • Spark learns expert placement from live routing traffic, stores the profile with the model, and reloads it on restart without requiring an offline calibration dataset.
  • The article reports that fused-graph decoding preserves near all-GPU performance, reaching about 100 tok/s at 60% residency versus an all-GPU ceiling of about 119 tok/s and a naive offload result of 66 tok/s.

Hottest takes

"my dusty 3090 just got a second career" — @tensortrash
"this is basically stuffing a mansion into a closet and calling it optimization" — @cachemeoutside
"when the other option is ‘does not run,’ the haters sound very academic" — @vramgoblin
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.