How to run Qwen 3.5 locally

Local Qwen 3.5 hype: ‘Runs on my GPU!’ vs ‘Why won’t it?’

TLDR: Qwen3.5 can run locally with llama.cpp and Unsloth, promising big features on consumer hardware. Some users report shockingly fast performance on mid-range GPUs, while others struggle with GPU offloading and memory limits, sparking a lively debate over hardware, tools, and whether local AI now beats many paid cloud services.

Alibaba’s new Qwen3.5 guide promises big-brain AI at home: tiny to huge models, a massive 256K context window (think: giant memory), and speedy local runs with llama.cpp and Unsloth’s smart “shrink-to-fit” model files.

The community immediately split into two camps: the “it flies on my machine” crew and the “why is my GPU crying” crowd. One user flexed that the 35B model (a big one) hums along on an 8GB RTX 3050 and even codes well, while another swore they’ve compiled it “umpteen ways” and still can’t get GPU offloading to behave on an older 4GB card: Ollama worked, llama.cpp didn’t. Meanwhile, a speed demon bragged that the 9B model hits ~100 tokens per second on a consumer GPU and “beats most paid services,” triggering a wave of “cloud who?” replies.

There’s practical advice too: make sure your memory can handle the model or it’ll slog from your hard drive, and yes, a “presence penalty” can help stop repetition, though it might dent quality. The thread even birthed a meme: “you can use ‘true’ and ‘false’ interchangeably,” which, in AI-land, feels a little too real. Bottom line: Qwen3.5 is thrilling, chaotic, and wildly personal; your setup decides if it’s a rocket or a roadblock.
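The presence-penalty tip mentioned above has a simple mechanic: the sampler subtracts a flat penalty from the logit of every token that has already appeared in the output, making exact repeats less likely (at some cost to quality, as the thread notes). Here is a minimal illustrative sketch in Python; the function name and numbers are made up for illustration and are not llama.cpp’s actual implementation:

```python
def apply_presence_penalty(logits, generated_ids, penalty=1.0):
    """Subtract `penalty` from the logit of every token id that has
    already appeared in the generated sequence. Presence penalty is
    flat: it does not scale with how many times a token appeared
    (that would be a frequency penalty)."""
    penalized = list(logits)
    for token_id in set(generated_ids):
        penalized[token_id] -= penalty
    return penalized

# Example: token id 2 has already been generated twice, so its score
# drops once by the flat penalty; all other logits are untouched.
logits = [0.5, 1.2, 3.0, 0.1]
adjusted = apply_presence_penalty(logits, generated_ids=[2, 2], penalty=1.0)
print(adjusted)  # token 2's logit falls from 3.0 to 2.0
```

In llama.cpp this is exposed as a sampling parameter rather than something you implement yourself; the sketch just shows why a too-large penalty can hurt quality, since it suppresses legitimately repeated tokens as well.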

Key Points

  • Qwen3.5 is Alibaba’s multimodal hybrid reasoning LLM family, spanning large (35B-A3B, 27B, 122B-A10B, 397B-A17B) and small (0.8B, 2B, 4B, 9B) models.
  • The models support up to 256K context and 201 languages, with thinking and non-thinking modes for different settings.
  • Unsloth Dynamic 2.0 quantized GGUF files (Dynamic 4-bit with selective upcasting) are used for local inference and fine-tuning.
  • llama.cpp is the recommended inference backend, with downloads via Hugging Face Hub and quantization options such as Q4_K_M, UD-Q4_K_XL, and UD-Q2_K_XL.
  • Hardware guidance includes ensuring total memory exceeds the quantized model size; 35B/27B can run on ~22GB devices, and Small series can run near full precision on ~12GB.
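The memory rule of thumb in the last bullet can be sanity-checked with back-of-the-envelope arithmetic: a quantized model needs roughly parameter-count × bits-per-weight ÷ 8 bytes, plus headroom for the KV cache and runtime. A rough estimator (the bits-per-weight figure is an assumption, not an exact GGUF file size):

```python
def quantized_size_gb(params_billion, bits_per_weight):
    """Rough in-memory size of a quantized model in gigabytes:
    parameters * bits per weight / 8 bits per byte."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 35B model at ~4.5 bits/weight (roughly Q4_K_M territory) needs
# about 19.7 GB of weights alone, which is why ~22GB of combined
# RAM + VRAM is the suggested floor for the 35B/27B models.
print(round(quantized_size_gb(35, 4.5), 1))  # 19.7
```

If total memory falls short of this estimate, the runtime pages weights from disk, which explains the “it’ll slog from your hard drive” warnings in the thread.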

Hottest takes

“run the 35B-A3B model on an 8GB RTX 3050, it’s pretty responsive” — Twirrim
“compiled it umpteen ways and still haven’t gotten GPU offloading working properly” — Curiositry
“gives a stable ~100 tok/s. This outperforms the majority of online llm services” — moqizhengz
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.