A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)

Guy makes a clunky old server write at human speed and the comments lose it

TLDR: A developer got a modern AI system working on a 2016 server with no graphics card, which is surprising because these tools usually want much newer hardware. The comments loved the stunt but instantly demanded hard speed numbers, cheaper setups, and verdicts on whether even older junk machines could still pull it off.

A scrappy developer rolled into Hacker News with a wild flex: he got a modern AI model running on a 10-year-old recycled server with no graphics card at all. Translation for normal humans: this is like entering a rusty minivan into a street race and somehow not embarrassing yourself. The machine is old, slow, and packed with outdated memory, but with a mountain of custom tweaks, it still managed to spit out text at roughly reading speed. And yes, the community immediately turned the post into a mix of applause, interrogation, and basement-lab boasting.

The loudest reaction was basically: "Cool story, but show us the real numbers." One commenter zeroed in on the fuzzy phrase "reading speed," asking what that actually means in tokens per second — because in tech circles, if you don’t post benchmark numbers, someone will absolutely show up with a calculator and a raised eyebrow. Others instantly treated the post like a challenge run. One person wanted a version that fits into 64GB so they could still have room left over for "other tasks," while another was already plotting how to squeeze even more speed out of a newer system by making the processor and graphics chip team up.
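
For anyone who wants to pin "reading speed" to a number, here is a rough back-of-the-envelope conversion in Python. Both inputs are assumptions rather than figures from the post: a silent reading pace somewhere between 150 and 400 words per minute, and about 1.3 tokens per English word (a common rule of thumb that shifts with the tokenizer). The function name is made up for illustration.

```python
def tokens_per_second(words_per_minute: float, tokens_per_word: float = 1.3) -> float:
    """Convert a reading pace in words per minute into tokens per second.

    tokens_per_word is an assumed rule-of-thumb ratio, not a measured value.
    """
    return words_per_minute * tokens_per_word / 60.0


# Assumed slow, typical, and fast reading paces (words per minute).
for wpm in (150, 250, 400):
    print(f"{wpm} wpm is about {tokens_per_second(wpm):.1f} tokens/s")
```

Under those assumptions the answer lands at roughly 3 to 9 tokens per second, which is exactly why the commenter wanted the phrase swapped for a benchmark.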

And then came the beautiful chaos: the proud owner of an even older relic asked if their truly ancient machine could run it at all. That’s the vibe here — part admiration, part skepticism, part digital junkyard Olympics. The article says old hardware still has life left in it; the comments say prove it, optimize it, and please tell us if our prehistoric home server can join the party.

Key Points

  • The article describes running a quantized Gemma 4 26B A4B model with an MTP drafter on a 2016 Xeon E5-2620 v4 server with 128 GB DDR3 RAM and no GPU.
  • It explains that LLM decoding is primarily limited by memory bandwidth because model weights must be repeatedly moved from RAM into CPU cache for each generated token.
  • The author says mainstream tools such as Ollama and standard llama.cpp do not provide the required model support or low-level tuning controls for this workload.
  • The article provides a detailed `llama-cli` command using an `ik_llama.cpp` implementation and multiple optimization flags to make CPU-only inference viable on old hardware.
  • A key optimization in the setup is speculative decoding with an MTP drafter, configured to draft up to three tokens and use autotuning.
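
The bandwidth point and the drafter point can be sketched with a little arithmetic. The Python below is a simplified model, not the author's method: it assumes "A4B" means about 4 billion active parameters per token, and the bits per weight, memory bandwidth, and per-token acceptance probability are hypothetical placeholders to be replaced with measured values. It also ignores the drafter's own cost and the extra experts a multi-token verification pass may touch in a mixture-of-experts model, so treat the results as optimistic ceilings.

```python
def decode_ceiling_tps(active_params: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every active weight is read from RAM once per token."""
    bytes_per_token = active_params * bits_per_weight / 8.0
    return bandwidth_gb_s * 1e9 / bytes_per_token


def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens produced per verification pass in speculative decoding.

    Assumes each drafted token is accepted independently with probability
    accept_prob; the pass always yields at least one token from the main model.
    """
    if accept_prob >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


# All values below are hypothetical placeholders, not numbers from the article.
ACTIVE_PARAMS = 4e9        # assumed meaning of "A4B"
BITS_PER_WEIGHT = 4.5      # assumed quantization level
BANDWIDTH_GB_S = 50.0      # assumed usable memory bandwidth
ACCEPT_PROB = 0.7          # assumed drafter acceptance rate
DRAFT_LEN = 3              # "up to three tokens" from the key points

base = decode_ceiling_tps(ACTIVE_PARAMS, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
gain = expected_tokens_per_pass(ACCEPT_PROB, DRAFT_LEN)
print(f"bandwidth ceiling: {base:.1f} tokens/s")
print(f"tokens per verification pass: {gain:.2f}")
print(f"optimistic ceiling with drafting: {base * gain:.1f} tokens/s")
```

The takeaway is the shape of the trade, not the numbers: fewer bytes per token or more accepted draft tokens per pass both raise the ceiling, which is why quantization and the drafter matter so much on old memory.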

Hottest takes

"reading speed" but that varies for everyone, no? — Eonexus
"fits into 64GB and leaves a couple GB free for other tasks ?" — potus_kushner
"I have an ancient DDR3 Xeon... You reckon it would even build / run at all?" — asimovDev