January 28, 2026
Pure C, pure chaos
I have written gemma3 inference in pure C
No Python, no GPU: one file runs AI in pure C — and the comments are raging
TLDR: A dev wrote a pure C program that runs Google's Gemma 3 model on a regular CPU, no Python or GPU required. Commenters split between praising the minimalist, dependency-free craft and dismissing it as slow, outdated, or pointless without CPU optimizations; either way, it proves big AI can be simple and portable.
An indie dev just dropped gemma3.c, a tiny, from-scratch program that runs Google’s Gemma 3 AI model on a plain old CPU: no Python, no fancy graphics card, just pure C. It streams replies, chats interactively, and keeps its core math in a single lean, readable file. Sounds heroic, right? Cue the comment-section fireworks.
The top snark: “Did we need proof?” sneered one skeptic, arguing the headline flex was obvious. Another engineer brought the math hammer, saying the real world runs on SIMD — those special CPU tricks — and that pure C loops are cute until you feel the speed hit. Translation: nice demo, but without those optimizations, it’s slo-mo. Meanwhile, the nihilists showed up with “but why tho?” pointing out Gemma 3 isn’t the latest and won’t power production apps anyway.
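For the curious, here is roughly what that SIMD gripe means. The sketch below is illustrative only, not code from gemma3.c: the same dot product written once as a plain C loop and once with hand-written AVX2/FMA intrinsics that chew through eight floats per step. Compilers can sometimes auto-vectorize the plain version, but hand-tuned kernels of this sort are what the "real world runs on SIMD" crowd is asking for.

```c
/* Illustrative sketch only (not code from gemma3.c): the kind of inner loop
 * the SIMD crowd means. Names and sizes are made up.
 * The scalar path builds anywhere; the intrinsic path needs AVX2+FMA, e.g.:
 *   gcc -O2 -mavx2 -mfma dot.c -o dot */
#include <stddef.h>
#include <stdio.h>

/* Plain C: one multiply-add per iteration. */
static float dot_scalar(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
/* Hand-written AVX2/FMA: eight multiply-adds per iteration, the "special CPU
 * tricks" commenters say serious CPU inference leans on. */
static float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), acc);
    float lane[8];
    _mm256_storeu_ps(lane, acc);
    float sum = lane[0] + lane[1] + lane[2] + lane[3]
              + lane[4] + lane[5] + lane[6] + lane[7];
    for (; i < n; i++)   /* scalar tail for leftover elements */
        sum += a[i] * b[i];
    return sum;
}
#endif

int main(void) {
    enum { N = 1024 };
    static float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 0.001f * (float)i; b[i] = 0.002f * (float)i; }
    printf("scalar: %f\n", dot_scalar(a, b, N));
#if defined(__AVX2__) && defined(__FMA__)
    printf("avx2:   %f\n", dot_avx2(a, b, N));
#endif
    return 0;
}
```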
But it wasn’t all doom. One romantic coder fell in love with the minimalist vibe, swooning over a single ~600‑line kernel file and a clean, readable design. Fans cheered the “no dependencies” purity, the throwback hacker energy, and the fact that it actually runs: about 1–3 tokens per second, 8 GB of weight files, and a chat mode that works on Mac/Linux (Windows via WSL).
Verdict: a provocation disguised as a project — half the crowd calls it art, the other half asks for benchmarks and a reason.
Key Points
- gemma3.c is a CPU-only inference engine for the Gemma 3 4B IT model implemented entirely in C11 with zero external dependencies.
- It supports the full Gemma 3 architecture, including GQA, hybrid attention (5:1 local:global), and SwiGLU (sketched in plain C after this list), with a native SentencePiece tokenizer (262,208 vocab).
- Weights are loaded via memory-mapped BF16 SafeTensors (also sketched after this list), and the project provides streaming token-by-token output, an interactive chat mode, and both CLI and library APIs.
- The included Python script handles Hugging Face authentication, shard downloading, resume, and integrity checks; manual alternatives include huggingface-cli and git lfs.
- Performance is ~2–5 tokens/s prefill and ~1–3 tokens/s generation on CPU; memory usage is ~8 GB disk for weights and ~3 GB RAM, with configurable context size; code is MIT-licensed and weights use Google’s Gemma license.
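Since "SwiGLU" is doing a lot of work in that architecture bullet, here is a minimal, hedged sketch of what a SwiGLU feed-forward block looks like in plain C. Every name, dimension, and weight below is invented for illustration; the project's real kernels live in its ~600-line kernel file and will differ.

```c
/* Hedged sketch of a SwiGLU feed-forward block (the gated MLP named above).
 * Dimensions, names, and weights are invented for illustration.
 * Build with: gcc -O2 swiglu.c -lm -o swiglu */
#include <math.h>
#include <stdio.h>

/* y = W x, row-major (rows x cols): the plain-C loop at the heart of CPU
 * inference, and exactly the loop the SIMD crowd wants vectorized. */
static void matvec(const float *W, const float *x, float *y, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++)
            acc += W[r * cols + c] * x[c];
        y[r] = acc;
    }
}

static float silu(float z) { return z / (1.0f + expf(-z)); }  /* z * sigmoid(z) */

/* out = W_down( silu(W_gate x) * (W_up x) ) */
static void swiglu_ffn(const float *W_gate, const float *W_up, const float *W_down,
                       const float *x, float *out, int d_model, int d_ff,
                       float *gate, float *up) {
    matvec(W_gate, x, gate, d_ff, d_model);
    matvec(W_up,   x, up,   d_ff, d_model);
    for (int i = 0; i < d_ff; i++)
        gate[i] = silu(gate[i]) * up[i];      /* gated activation */
    matvec(W_down, gate, out, d_model, d_ff);
}

int main(void) {
    enum { D = 4, F = 8 };                    /* toy sizes, not Gemma's */
    float W_gate[F * D], W_up[F * D], W_down[D * F];
    float x[D] = { 0.1f, -0.2f, 0.3f, 0.4f }, out[D], gate[F], up[F];
    for (int i = 0; i < F * D; i++) {
        W_gate[i] = 0.01f * (float)i;
        W_up[i]   = 0.02f;
        W_down[i] = 0.01f;
    }
    swiglu_ffn(W_gate, W_up, W_down, x, out, D, F, gate, up);
    for (int i = 0; i < D; i++)
        printf("out[%d] = %f\n", i, out[i]);
    return 0;
}
```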
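And here is what "memory-mapped BF16 SafeTensors" boils down to, again as a hedged sketch with a made-up file name rather than gemma3.c's actual loader (which also has to parse the SafeTensors header). The arithmetic lines up, too: roughly 4 billion parameters at 2 bytes each in BF16 is about 8 GB of weight files, and mmap's lazy paging is presumably what keeps resident RAM nearer 3 GB.

```c
/* Hedged sketch of "memory-mapped BF16 weights" (not gemma3.c's real loader:
 * a real SafeTensors file also starts with a JSON header giving tensor names,
 * dtypes, and byte offsets). POSIX-only, so Mac/Linux, like the project. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* BF16 is just the top 16 bits of an IEEE-754 float32, so widening is a shift. */
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(int argc, char **argv) {
    const char *path = (argc > 1) ? argv[1] : "weights.bin";  /* hypothetical file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size < (off_t)sizeof(uint16_t)) {
        fprintf(stderr, "bad file\n"); close(fd); return 1;
    }

    /* The OS pages weights in lazily, which is how resident RAM can sit far
     * below the size of the weight files on disk. */
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    const uint16_t *bf16 = (const uint16_t *)base;
    size_t count = (size_t)st.st_size / sizeof(uint16_t);
    for (size_t i = 0; i < count && i < 4; i++)   /* peek at a few values */
        printf("w[%zu] = %f\n", i, bf16_to_f32(bf16[i]));

    munmap(base, (size_t)st.st_size);
    close(fd);
    return 0;
}
```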