Two Qwen3 models on one DGX Spark: the residency math

He tried to cram two AI brains into one box, and the comments turned into a tech group chat meltdown

TLDR: The setup only worked once the larger AI model was swapped for a version that actually emits its tool calls instead of just “thinking” silently. In the comments, readers turned it into a debate about whether local AI is freedom, a money pit, or an excuse to buy even more hardware.

This story starts with a very relatable modern crisis: one machine was supposed to handle a growing little army of AI helpers, and suddenly the easy setup was no longer enough. The writer wanted one big, slow-and-smart model for heavy thinking and one smaller, faster model for quick replies, both living on the same powerful computer. In theory, neat. In practice? A weekend of chaos, crashes, and the kind of silent failure that makes developers stare into the void. The biggest plot twist: the larger model wasn’t “broken” in the usual way — it was happily thinking to itself and then just not actually doing the tool action it was supposed to do.
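For the curious, here is a minimal sketch of how that kind of silent failure could be flagged on the client side. It assumes an OpenAI-style chat completion message passed in as a plain dict, with reasoning either inline in `<think>` tags or in a separate `reasoning_content` field; the field names, the `diagnose` helper, and the `get_weather` tool are illustrative assumptions, not taken from the article.

```python
def split_reasoning(content: str) -> tuple[str, str]:
    """Split raw content into (reasoning, visible_text).

    Handles a closed <think>...</think> block, a bare </think> (some chat
    templates put the opening tag in the prompt), and an unclosed <think>.
    """
    if "</think>" in content:
        head, _, tail = content.rpartition("</think>")
        return head.replace("<think>", "").strip(), tail.strip()
    if "<think>" in content:
        before, _, after = content.partition("<think>")
        return after.strip(), before.strip()
    return "", content.strip()


def diagnose(message: dict) -> str:
    """Classify one assistant message (hypothetical helper).

    Returns "tool_call", "reasoned_without_tool_call", "plain_answer" or "empty".
    """
    if message.get("tool_calls"):
        return "tool_call"
    reasoning, visible = split_reasoning(message.get("content") or "")
    # Assumed field: some servers return parsed reasoning separately.
    reasoning = reasoning or (message.get("reasoning_content") or "").strip()
    if reasoning:
        return "reasoned_without_tool_call"
    return "plain_answer" if visible else "empty"


if __name__ == "__main__":
    stuck = {"content": "<think>I should call get_weather.</think>", "tool_calls": []}
    print(diagnose(stuck))
```

A check like this only makes the failure visible (so an agent loop can retry or log it); it does not fix the model's behavior.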

And the comments absolutely ran with it. The author jumped in to shut down one brewing theory, stressing that this was not some simple parser bug but a deeper “the model decides and then never says it out loud” problem. That sparked the thread’s main mood: part war story, part support group, part shopping debate. One commenter sounded like half the internet in 2026, saying they want local AI because online services feel overpriced, but also fear buying hardware that becomes “old news” in months and don’t want to babysit it. Another chimed in with the classic community energy: “Have you tried this other stack?” Meanwhile, power users were flexing speed numbers like street racers comparing engines. The vibe was equal parts DIY dream, buyer’s remorse panic, and benchmark bragging rights — with a side of “my side-hustle agents better justify this purchase.”

Key Points

  • The article documents moving a multi-agent local LLM backend from Ollama to vLLM on a DGX Spark to support two co-resident Qwen3 models behind a LiteLLM proxy.
  • Initial memory configurations for the 80B model failed because vLLM's gpu_memory_utilization is based on total GPU memory, not currently free memory.
  • The author reports that co-resident vLLM processes should keep combined GPU memory fractions below roughly 0.95 to leave room for CUDA framework overhead.
  • A working configuration for the 80B model on the Spark was gpu_memory_utilization 0.80, max_model_len 32k, and max_num_seqs 2, yielding about 20.8 GiB of KV cache after weights.
  • Tool-calling failed with a Qwen3-Next Thinking checkpoint under tool_choice auto because the model reasoned in <think> but did not emit tool calls; replacing it with an Instruct checkpoint resolved the issue.
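
The memory points above boil down to simple arithmetic, sketched here in Python. This is a simplified model, not vLLM's own code: the helper names are hypothetical, the 0.95 ceiling is the author's rough figure, and every number passed in the usage lines is a placeholder rather than a measurement from the article.

```python
def requested_gib(total_gib: float, fraction: float) -> float:
    """Budget one process asks for: a fraction of TOTAL memory, not of free memory."""
    return total_gib * fraction


def can_start(total_gib: float, free_gib: float, fraction: float) -> bool:
    """Simplified startup check: the requested budget must fit in what is free now."""
    return requested_gib(total_gib, fraction) <= free_gib


def check_co_residency(fractions: dict[str, float], ceiling: float = 0.95) -> float:
    """Sum the per-process fractions and reject totals above the ceiling."""
    combined = sum(fractions.values())
    if combined > ceiling:
        raise ValueError(f"combined fraction {combined:.2f} exceeds {ceiling:.2f}")
    return combined


def kv_cache_gib(total_gib: float, fraction: float, weights_gib: float) -> float:
    """Rough KV cache room: the process budget minus the model weights."""
    return requested_gib(total_gib, fraction) - weights_gib


if __name__ == "__main__":
    total = 100.0  # placeholder total memory in GiB, not the Spark's real figure
    print(can_start(total, free_gib=50.0, fraction=0.80))    # second process, memory already in use
    print(check_co_residency({"large": 0.80, "small": 0.10}))
    print(kv_cache_gib(total, fraction=0.80, weights_gib=60.0))
```

The first call shows why a fraction that looks safe can still fail: with another process already resident, the request is sized against total memory but has to fit in what is left.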

Hottest takes

"the model reasons inside <think>, decides, and never emits the tool call" — devashish86
"worried anything I get will be obsolete in a couple months" — shireboy
"that was enough of a taste for me to jump into 2 sparks" — wolttam