April 20, 2026
Fast tokens, faster backlash
We got 207 tok/s with Qwen3.5-27B on an RTX 3090
Dev claims ‘207 tokens per second’ on old GPU and sparks a full-on AI comment war
TLDR: A small team claims a big speed boost for running AI on a home gaming graphics card, shouting about “207 tokens per second” like a high score. Commenters quickly split between fans of the speedup and critics calling it overhyped, Nvidia‑locked, and potentially lower‑quality — exposing deep rifts over who local AI really serves.
A tiny open‑source team showed off a turbo‑charged way to run a big AI model on an aging gaming graphics card (the RTX 3090, a 2020‑era part), bragging about “207 tokens per second” like it’s a new land‑speed record. But while the code nerds were busy drooling over charts, the comments turned into a full‑blown custody battle over truth, hype, and who AI is really for.
One camp is hyped: GreenGames drops numbers like a race announcer, claiming over 5× speed‑ups and “no more cloud bills” if you’ve got a beefy gaming card at home. But almost immediately lostmsu jumps in with the bucket of cold water: you didn’t really hit 207; you used a shortcut called “speculative decoding” that may make the AI’s answers worse. In plain English: a small, cheap model guesses several words ahead, and the big model checks those guesses in one batched pass, so it talks faster by betting the guesses are usually right.
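For the curious, here is the shape of that shortcut. This is a minimal toy sketch of speculative decoding in general, not Lucebox’s DFlash code: the five‑word vocabulary, the two stand‑in “models,” and the simplified rejection fallback are all invented purely for illustration.

```python
import random

# Toy sketch of speculative decoding in general -- NOT Lucebox's DFlash code.
# The vocabulary, the two stand-in "models", and the simplified rejection
# fallback are all invented purely for illustration.
VOCAB = ["the", "cat", "sat", "on", "mat"]
CHAIN = {"the": "cat", "cat": "sat", "sat": "on", "on": "mat", "mat": "the"}

def draft_probs(ctx):
    # Cheap draft model: a uniform guess over the vocabulary.
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def target_probs(ctx):
    # Expensive target model: strongly prefers the next word in a fixed chain.
    nxt = CHAIN[ctx[-1]]
    return {t: (0.80 if t == nxt else 0.05) for t in VOCAB}

def sample(probs):
    return random.choices(list(probs), weights=list(probs.values()))[0]

def speculative_step(ctx, k=4):
    # 1) The draft model cheaply proposes k tokens ahead.
    proposed = []
    for _ in range(k):
        proposed.append(sample(draft_probs(ctx + proposed)))
    # 2) The target model verifies them. In a real system all k target
    #    distributions come from ONE batched forward pass -- that batching
    #    is the entire speedup. Each draft token is kept with probability
    #    min(1, p_target / p_draft).
    accepted = []
    for tok in proposed:
        p_t = target_probs(ctx + accepted)[tok]
        p_d = draft_probs(ctx + accepted)[tok]
        if random.random() < min(1.0, p_t / p_d):
            accepted.append(tok)
        else:
            # 3) On the first rejection, fall back to the target model's own
            #    pick (a simplification of the usual residual-sampling rule).
            fallback = target_probs(ctx + accepted)
            accepted.append(max(fallback, key=fallback.get))
            break
    return accepted

random.seed(0)
ctx = ["the"]
for _ in range(6):
    ctx += speculative_step(ctx)
print(" ".join(ctx))  # several tokens can land per expensive verify step
```

Worth noting: textbook speculative decoding with exact residual sampling is mathematically lossless, so quality complaints like lostmsu’s usually target relaxed acceptance rules or draft/verifier mismatches that trade fidelity for extra speed.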
Then the politics of hardware kick off. The project’s slogan says local AI should be for everyone, but dirtikiti fires back that it only runs on Nvidia’s pricey cards: if you care about freedom, why no support for other backends like Vulkan, which would open the door to AMD and Intel GPUs? SilverElfin wonders why they optimized for a gamer GPU instead of normal laptops. And Aurornis brings peak cynicism, calling it yet another “vibecoded” repo spawned by AI tools, implying it’s more hype than breakthrough. The real benchmark isn’t just speed; it’s how fast the comments turned toxic.
Key Points
- Lucebox released two targeted LLM inference projects: Megakernel for Qwen3.5-0.8B and DFlash 27B for Qwen3.5-27B, each with benchmarks and writeups.
- Megakernel runs all 24 layers in a single CUDA dispatch, achieving 1.87 tok/J at 220 W and 413 tok/s decode, outperforming llama.cpp BF16 and PyTorch HF baselines (a quick arithmetic check of those figures follows this list).
- DFlash 27B is the first GGUF port of DFlash speculative decoding, reaching up to 135.8 tok/s on an RTX 3090 and fitting 128K context within 24 GB.
- The DFlash port uses a Q4_K_M GGUF target with a BF16 draft and a DDTree verifier (budget=22), plus Q4_0 KV compression and a sliding feature buffer to stay inside the card’s 24 GB (a back-of-envelope memory estimate also follows this list).
- The projects advocate for efficient local AI with MIT-licensed source, detailed writeups, reproducible benchmarks, and quickstart commands.
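Two of the Megakernel numbers can be cross-checked against each other: tokens per joule times joules per second (watts) gives tokens per second. The figures are from the writeup; the helper function is ours.

```python
# Sanity check: tok/J * W (where W = J/s) should give tok/s.
def tokens_per_second(tok_per_joule: float, watts: float) -> float:
    return tok_per_joule * watts

print(tokens_per_second(1.87, 220))  # 411.4 -- consistent with the reported 413 tok/s
```

The small gap is what you’d expect from rounding: 413 / 220 ≈ 1.877, reported as 1.87 tok/J.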
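And here is a rough memory estimate behind the “128K context within 24 GB” claim. The GGUF sizes are standard llama.cpp (Q4_0 packs 32 weights into 18 bytes, about 4.5 bits per weight; Q4_K_M averages roughly 4.85), but the layer and head counts below are hypothetical stand-ins; the writeup is the authority on Qwen3.5-27B’s actual shape.

```python
# Rough budget for "27B weights + 128K-token KV cache inside 24 GiB".
# Real GGUF sizes: Q4_0 ~= 4.5 bits/weight, Q4_K_M ~= 4.85 bits/weight.
# The architecture numbers below are GUESSES for illustration only.
GiB = 1024**3

n_params = 27e9                              # 27B parameters
weights_gib = n_params * (4.85 / 8) / GiB    # Q4_K_M target weights

layers, kv_heads, head_dim = 48, 8, 128      # hypothetical model shape
ctx_tokens = 128 * 1024                      # 128K context
kv_elems = 2 * layers * kv_heads * head_dim * ctx_tokens  # K and V caches

kv_fp16_gib = kv_elems * 2 / GiB             # 2 bytes per element
kv_q40_gib = kv_elems * (4.5 / 8) / GiB      # Q4_0-compressed cache

print(f"weights (Q4_K_M): {weights_gib:.1f} GiB")
print(f"KV cache @ FP16:  {kv_fp16_gib:.1f} GiB  (already ~the whole card)")
print(f"KV cache @ Q4_0:  {kv_q40_gib:.1f} GiB")
print(f"total:            {weights_gib + kv_q40_gib:.1f} GiB of 24 GiB")
```

Under these assumed dimensions an FP16 KV cache alone would roughly fill the entire card, which is presumably why the port leans on Q4_0 KV compression and the sliding feature buffer to make 128K fit.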