January 28, 2026
Pure C, pure chaos
I have written gemma3 inference in pure C
No Python, no GPU: one file runs AI in pure C — and the comments are raging
TLDR: A dev wrote a pure C program that runs Google's Gemma 3 model on a regular CPU, no Python or GPU required. Commenters split between praising the minimalist, dependency-free craft and dismissing it as slow, outdated, or pointless without CPU optimizations; either way, it proves big AI can be simple and portable.
An indie dev just dropped gemma3.c, a tiny, from-scratch program that runs Google’s Gemma 3 AI model on a plain old CPU: no Python, no fancy graphics card, just pure C. It streams replies, chats interactively, and keeps its core math in a single lean, readable file. Sounds heroic, right? Cue the comment-section fireworks.
The top snark: “Did we need proof?” sneered one skeptic, arguing the headline flex was obvious. Another engineer brought the math hammer, saying the real world runs on SIMD — those special CPU tricks — and that pure C loops are cute until you feel the speed hit. Translation: nice demo, but without those optimizations, it’s slo-mo. Meanwhile, the nihilists showed up with “but why tho?” pointing out Gemma 3 isn’t the latest and won’t power production apps anyway.
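For the curious, here is roughly what that SIMD gripe means. The sketch below is illustrative only, not code from gemma3.c: the same dot product written once as a plain C loop and once with hand-written AVX2/FMA intrinsics that chew through eight floats per step. Compilers can sometimes auto-vectorize the plain version, but hand-tuned kernels of this sort are what the "real world runs on SIMD" crowd is asking for.

```c
/* Illustrative sketch only (not code from gemma3.c): the kind of inner loop
 * the SIMD crowd means. Names and sizes are made up.
 * The scalar path builds anywhere; the intrinsic path needs AVX2+FMA, e.g.:
 *   gcc -O2 -mavx2 -mfma dot.c -o dot */
#include <stddef.h>
#include <stdio.h>

/* Plain C: one multiply-add per iteration. */
static float dot_scalar(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
/* Hand-written AVX2/FMA: eight multiply-adds per iteration, the "special CPU
 * tricks" commenters say serious CPU inference leans on. */
static float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), acc);
    float lane[8];
    _mm256_storeu_ps(lane, acc);
    float sum = lane[0] + lane[1] + lane[2] + lane[3]
              + lane[4] + lane[5] + lane[6] + lane[7];
    for (; i < n; i++)   /* scalar tail for leftover elements */
        sum += a[i] * b[i];
    return sum;
}
#endif

int main(void) {
    enum { N = 1024 };
    static float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 0.001f * (float)i; b[i] = 0.002f * (float)i; }
    printf("scalar: %f\n", dot_scalar(a, b, N));
#if defined(__AVX2__) && defined(__FMA__)
    printf("avx2:   %f\n", dot_avx2(a, b, N));
#endif
    return 0;
}
```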
But it wasn’t all doom. One romantic coder fell in love with the minimalist vibe, swooning over a single ~600‑line kernel file and a clean, readable design. Fans cheered the “no dependencies” purity, the throwback hacker energy, and the fact that it actually runs: about 1–3 tokens per second, 8 GB of weight files, and a chat mode that works on Mac/Linux (Windows via WSL).
Verdict: a provocation disguised as a project — half the crowd calls it art, the other half asks for benchmarks and a reason.
Key Points
- gemma3.c is a CPU-only inference engine for the Gemma 3 4B IT model implemented entirely in C11 with zero external dependencies.
- It supports the full Gemma 3 architecture, including GQA, hybrid attention (5:1 local:global), and SwiGLU (sketched in plain C after this list), with a native SentencePiece tokenizer (262,208 vocab).
- Weights are loaded via memory-mapped BF16 SafeTensors (also sketched after this list), and the project provides streaming token-by-token output, an interactive chat mode, and both CLI and library APIs.
- The included Python script handles Hugging Face authentication, shard downloading, resume, and integrity checks; manual alternatives include huggingface-cli and git lfs.
- Performance is ~2–5 tokens/s prefill and ~1–3 tokens/s generation on CPU; memory usage is ~8 GB disk for weights and ~3 GB RAM, with configurable context size; code is MIT-licensed and weights use Google’s Gemma license.
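Since "SwiGLU" is doing a lot of work in that architecture bullet, here is a minimal, hedged sketch of what a SwiGLU feed-forward block looks like in plain C. Every name, dimension, and weight below is invented for illustration; the project's real kernels live in its ~600-line kernel file and will differ.

```c
/* Hedged sketch of a SwiGLU feed-forward block (the gated MLP named above).
 * Dimensions, names, and weights are invented for illustration.
 * Build with: gcc -O2 swiglu.c -lm -o swiglu */
#include <math.h>
#include <stdio.h>

/* y = W x, row-major (rows x cols): the plain-C loop at the heart of CPU
 * inference, and exactly the loop the SIMD crowd wants vectorized. */
static void matvec(const float *W, const float *x, float *y, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++)
            acc += W[r * cols + c] * x[c];
        y[r] = acc;
    }
}

static float silu(float z) { return z / (1.0f + expf(-z)); }  /* z * sigmoid(z) */

/* out = W_down( silu(W_gate x) * (W_up x) ) */
static void swiglu_ffn(const float *W_gate, const float *W_up, const float *W_down,
                       const float *x, float *out, int d_model, int d_ff,
                       float *gate, float *up) {
    matvec(W_gate, x, gate, d_ff, d_model);
    matvec(W_up,   x, up,   d_ff, d_model);
    for (int i = 0; i < d_ff; i++)
        gate[i] = silu(gate[i]) * up[i];      /* gated activation */
    matvec(W_down, gate, out, d_model, d_ff);
}

int main(void) {
    enum { D = 4, F = 8 };                    /* toy sizes, not Gemma's */
    float W_gate[F * D], W_up[F * D], W_down[D * F];
    float x[D] = { 0.1f, -0.2f, 0.3f, 0.4f }, out[D], gate[F], up[F];
    for (int i = 0; i < F * D; i++) {
        W_gate[i] = 0.01f * (float)i;
        W_up[i]   = 0.02f;
        W_down[i] = 0.01f;
    }
    swiglu_ffn(W_gate, W_up, W_down, x, out, D, F, gate, up);
    for (int i = 0; i < D; i++)
        printf("out[%d] = %f\n", i, out[i]);
    return 0;
}
```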
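And here is what "memory-mapped BF16 SafeTensors" boils down to, again as a hedged sketch with a made-up file name rather than gemma3.c's actual loader (which also has to parse the SafeTensors header). The arithmetic lines up, too: roughly 4 billion parameters at 2 bytes each in BF16 is about 8 GB of weight files, and mmap's lazy paging is presumably what keeps resident RAM nearer 3 GB.

```c
/* Hedged sketch of "memory-mapped BF16 weights" (not gemma3.c's real loader:
 * a real SafeTensors file also starts with a JSON header giving tensor names,
 * dtypes, and byte offsets). POSIX-only, so Mac/Linux, like the project. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* BF16 is just the top 16 bits of an IEEE-754 float32, so widening is a shift. */
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(int argc, char **argv) {
    const char *path = (argc > 1) ? argv[1] : "weights.bin";  /* hypothetical file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)sizeof(uint16_t)) {
        fprintf(stderr, "bad file\n"); close(fd); return 1;
    }

    /* The OS pages weights in lazily, which is how resident RAM can sit far
     * below the size of the weight files on disk. */
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    const uint16_t *bf16 = (const uint16_t *)base;
    size_t count = (size_t)st.st_size / sizeof(uint16_t);
    for (size_t i = 0; i < count && i < 4; i++)   /* peek at a few values */
        printf("w[%zu] = %f\n", i, bf16_to_f32(bf16[i]));

    munmap(base, (size_t)st.st_size);
    close(fd);
    return 0;
}
```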