April 4, 2026
AI goes faster, drama goes turbo
TurboQuant model weight compression support added to llama.cpp
Dev drops huge speed upgrade, comment section declares ‘Witchcraft’ and demands benchmarks
TL;DR: A new TurboQuant upgrade makes big AI models run faster and smaller without hurting performance, and the dev swears every test passes. The comments explode into a battle over trust, benchmarks, and Apple vs Nvidia bragging rights, turning a technical patch into a full-blown speed-cult showdown.
TurboQuant just squeezed big AI models into smaller, faster packages without breaking anything, and the community is acting like someone overclocked reality. The dev shows charts saying “no slowdowns, no accuracy loss,” and one commenter immediately yells “Pics or it didn’t happen—post the full benchmark CSV!” Bench nerds arrive with calculators in hand, zooming in on every tiny number like it’s the Zapruder film of AI.
The biggest drama? A change that quietly disables an upstream feature because it caused crashes on a popular model. Half the comments are cheering, calling it “bug exorcism,” while others accuse the project of forking into a “Turbo-only cult.” One person summed it up: “So we’re just turning features off now and calling it optimization?” and instantly got dogpiled by users pointing to the “ALL TESTS PASS” line like it’s a legal defense.
Meanwhile, CUDA vs Apple Silicon has become a full-on meme war. Apple fans are flexing their laptops like race cars, bragging that their chips are “free speed upgrades.” PC gamers on giant GPUs clap back with “cool, enjoy your 99% of Q8, I’ll enjoy actually gaming.” In between, the dev calmly drops deep technical explanations—while the crowd mostly argues about who gets more tokens per second and whether this is “black magic compression” or just very good engineering.
Key Points
- TurboQuant TQ3_1S and TQ4_1S weight compression has been integrated into a llama.cpp fork with extensive tests on five models, two hardware platforms, and four KV cache configurations showing no speed regressions.
- Perplexity measurements on wikitext-2 for Qwen and Phi-4 models show no meaningful regressions, with all PPL values matching known-good references and differences within noise levels.
- Restructuring of ggml's Metal operations (MUL_MAT and MUL_MAT_ID dispatch, including MoE paths) is verified to preserve existing functionality, with all Metal tests passing.
- An upstream graph-level Hadamard rotation for KV cache quantization caused crashes on Phi-4 and was redundant with TurboQuant's kernel-level WHT, so it is disabled by default in the fork, resolving the crash while keeping TurboQuant rotation unaffected.
- CUDA support for TQ3_1S/TQ4_1S includes a dequantization path using cuBLAS MUL_MAT and a fused mul_mat_vec kernel with warp-shuffle WHT; performance gaps to q4_0 and Metal are analyzed and further CUDA optimizations are proposed.
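For readers wondering what "weight compression" means mechanically: formats in this family store one floating-point scale per block of weights plus a low-bit integer per weight. The actual TQ3_1S/TQ4_1S layouts are not described in the post, so this is a generic sketch in the spirit of ggml's q4_0 (symmetric 4-bit, one scale per block), with hypothetical function names:

```python
def quantize_block(weights):
    """Symmetric 4-bit block quantization sketch (q4_0-style idea).

    One fp scale per block, one signed 4-bit integer per weight.
    The real TQ3_1S/TQ4_1S layouts differ; this only shows the idea.
    """
    amax = max(abs(w) for w in weights)
    scale = amax / 7 if amax else 1.0          # map largest magnitude to +/-7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate weights: one multiply per value."""
    return [scale * qi for qi in q]
```

Round-trip error is bounded by half the scale per weight, which is why a single outlier in a block (it inflates the scale) hurts everyone else's precision, and why rotations that spread outliers help.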
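The perplexity numbers the bench nerds are squinting at reduce to one quantity: the exponential of the mean negative log-likelihood per token. This is not llama.cpp's perplexity tool itself (which evaluates wikitext-2 in sliding windows), just the core formula:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood).

    `token_logprobs` holds the model's natural-log probability assigned
    to each observed token; lower perplexity means the model was less
    "surprised" by the text.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token scores PPL 4.0,
# which is why tiny PPL deltas vs. a known-good reference are treated
# as noise rather than regressions.
```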
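The Hadamard rotation that caused the drama (and the kernel-level WHT that made the upstream one redundant) is, at its core, the standard fast Walsh-Hadamard transform: a butterfly network that mixes every value in a block so outliers get spread out before quantization. A minimal CPU sketch of that butterfly, with no claim to match the fork's kernel:

```python
def fwht(vec):
    """In-place fast Walsh-Hadamard transform (unnormalized).

    Length must be a power of two. Since H @ H = n * I, applying the
    transform twice and dividing by the length recovers the input,
    which is what lets quantized-then-rotated values be undone exactly.
    """
    n = len(vec)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, h * 2):        # one butterfly stage
            for j in range(i, i + h):
                a, b = vec[j], vec[j + h]
                vec[j], vec[j + h] = a + b, a - b
        h *= 2
    return vec
```

Each stage pairs elements at distance `h`, which is exactly the access pattern that maps onto CUDA's XOR warp shuffles in a fused kernel: lane `j` exchanges with lane `j ^ h` instead of indexing shared memory.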