May 5, 2026
Fast tokens, faster drama
Accelerating Gemma 4: faster inference with multi-token prediction drafters
Google says Gemma got way faster, and the crowd is already asking where the catch is
TLDR: Google says Gemma 4 can now generate text much faster by using a small helper model, with no drop in answer quality. Commenters are excited but also frustrated, asking why Google’s own services don’t showcase it better and why some tools already show broken or unusable options.
Google just announced a big speed boost for Gemma 4, its open AI model family, claiming some setups can now run up to 3x faster while giving the same answers and reasoning as before. In plain English: the model gets a tiny helper that tries to guess a few words ahead, and the bigger model quickly checks the work instead of doing every single step the slow way. Google says this could make chat apps snappier, make coding tools feel more alive, and even help smaller devices save battery.
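If the "tiny helper" idea sounds hand-wavy, the whole trick fits in a few lines of code. Here's a minimal sketch in Python with toy stand-in models; draft_next, target_next, and the k=4 draft length are hypothetical placeholders for illustration, not Gemma 4's actual interface. The drafter guesses a few tokens cheaply, and the target keeps only the longest prefix it agrees with, which is why the output matches what the big model would have produced on its own.

```python
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_next(ctx):
    """Cheap drafter: a tiny model guessing the next token fast (here: randomly)."""
    return random.choice(VOCAB)

def target_next(ctx):
    """Expensive target model: the authoritative next-token choice (toy rule)."""
    return VOCAB[len(ctx) % len(VOCAB)]

def speculative_step(ctx, k=4):
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    draft, tmp = [], list(ctx)
    for _ in range(k):            # k cheap autoregressive drafter steps
        tok = draft_next(tmp)
        draft.append(tok)
        tmp.append(tok)
    out, tmp = [], list(ctx)
    for tok in draft:             # in a real system this verification is one
        truth = target_next(tmp)  # batched forward pass, not a Python loop
        if tok == truth:
            out.append(tok)       # accepted: identical to the target's output
            tmp.append(tok)
        else:
            out.append(truth)     # first mismatch: keep the target's token, stop
            break
    return out                    # always at least 1 token, so decoding advances

ctx = ["the"]
for _ in range(6):
    ctx += speculative_step(ctx)
print(" ".join(ctx))
```

The verify step is where the "no drop in quality" claim comes from: every accepted token is exactly what the target would have generated anyway, so the only variable is how often the drafter guesses right.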
But the real popcorn moment is in the comments, where the community instantly split into hyped, skeptical, and mildly annoyed camps. One of the loudest reactions wasn’t even about speed — it was basically, “Cool, but why isn’t Google pushing this harder on its own cloud?” That kicked off the classic tech-world side-eye: if this is so great, why does using it still feel like a scavenger hunt? Meanwhile, practical users jumped in with the most relatable drama of all: the button is there, but it doesn’t work. One commenter asked why LM Studio shows the option if it can’t actually be enabled, which is the kind of software tease that starts blood feuds.
Others were ready to benchmark immediately, with one user admitting Gemma had been losing to a rival model on pure speed, and declaring that once you cross 100 tokens per second, “some magic happens.” And then came the nostalgia comedian, comparing today’s AI text generation to watching text crawl in over a 300 baud modem back in the internet’s ancient past. Translation: yes, faster is nice, but everyone still knows this is only one stop on the road to “wow, we used to tolerate that?”
Key Points
- Google announced Multi-Token Prediction drafters for the Gemma 4 model family to accelerate inference.
- The article says the new drafters can provide up to a 3x speedup without degrading output quality or reasoning.
- Google explains that standard LLM inference is memory-bandwidth bound: token generation is slow because the model’s parameters must be moved from VRAM to the compute units again and again (see the back-of-envelope sketch after this list).
- The approach uses speculative decoding, where a lightweight drafter predicts multiple tokens and the larger target model verifies them in parallel.
- Google says the implementation includes shared activations, a shared KV cache, and additional embedder-clustering optimizations for the E2B and E4B edge models.
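To make the memory-bandwidth point concrete, here's a back-of-envelope calculation; every number in it is an illustrative assumption, not a figure from the announcement. Plain decoding streams the full set of weights from VRAM for every single token, so bandwidth divided by model size caps the token rate, while verifying several drafted tokens reuses one weight stream.

```python
# All numbers are illustrative assumptions, not measured or announced figures.
params = 4e9             # assume a ~4B-parameter edge model
bytes_per_param = 2      # fp16/bf16 weights
bandwidth = 100e9        # assume ~100 GB/s device memory bandwidth

weight_bytes = params * bytes_per_param      # bytes streamed per decode step
ceiling = bandwidth / weight_bytes           # plain decode: 1 token per stream
print(f"plain decode ceiling: ~{ceiling:.1f} tokens/sec")  # ~12.5

# Speculative decoding verifies k drafted tokens in a single weight stream;
# if ~3 of them are accepted on average, effective throughput roughly triples,
# which is the shape of the "up to 3x" claim.
accepted_per_pass = 3    # assumed average acceptance, for illustration
print(f"with a drafter: ~{ceiling * accepted_per_pass:.1f} tokens/sec")
```

The same arithmetic hints at the battery angle: fewer weight streams per generated token means less memory traffic, which accounts for a large share of inference energy on edge devices.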