
Google launched its Gemma 4 open models this spring, promising new levels of power and performance for local AI. Google's grip on AI may get even tighter with the release of Multi-Token Prediction (MTP) drafters for Gemma. Google says these experimental models use a form of speculative decoding to predict future tokens, which can speed up generation compared to letting the main model produce every token on its own.
The latest Gemma models are built on the same underlying technology that powers Google's frontier Gemini AI, but they are designed to run locally. Gemini is optimized for Google's custom TPU chips, which operate in huge clusters with super-fast interconnects and memory. A single high-end AI accelerator can run the largest Gemma 4 models at full precision, and quantization allows them to run on consumer GPUs.
Gemma allows users to tinker with AI on their own hardware instead of sending their data to Google's (or anyone else's) cloud AI systems. Google also changed the Gemma 4 license to Apache 2.0, which is far more permissive than the custom Gemma license Google used for previous releases. However, most people's hardware imposes hard limits on running local AI models. This is where MTP comes in.
LLMs like Gemma (or Gemini) generate tokens autoregressively, that is, they produce one token at a time, with each new token conditioned on all the tokens before it. Every token costs just as much compute as the last, whether the output is a filler word or a pivotal step in a complex logic problem.
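To make the cost concrete, here is a minimal sketch of that autoregressive loop. The `model` callable is a hypothetical stand-in for a full Gemma forward pass; the point is that the loop body runs once per token, at full price every time:

```python
import numpy as np

def generate(model, prompt_tokens, max_new_tokens):
    """Plain autoregressive decoding: one full forward pass per token.

    `model` is a hypothetical callable mapping a token sequence to
    next-token logits; it stands in for the real (expensive) LLM.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)               # full model pass, every token
        next_token = int(np.argmax(logits))  # greedy pick for simplicity
        tokens.append(next_token)
    return tokens
```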
The problem with rolling your own AI is that consumer memory is far slower than the high-bandwidth memory (HBM) used in enterprise hardware. As a result, the processor spends most of each token's generation streaming model weights from VRAM to its compute units, and compute cycles sit idle while it waits.
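A rough back-of-envelope estimate shows why bandwidth, not raw compute, sets the ceiling: each generated token requires streaming roughly the full set of weights through the chip once. The figures below are illustrative assumptions, not measured numbers for any specific GPU:

```python
# Illustrative bandwidth ceiling on decoding speed (all numbers assumed).
params = 26e9           # hypothetical 26B-parameter model
bytes_per_param = 2     # FP16/BF16 weights
weight_bytes = params * bytes_per_param   # ~52 GB streamed per token

hbm_bandwidth = 3.0e12    # ~3 TB/s, ballpark for datacenter HBM
gddr_bandwidth = 0.9e12   # ~0.9 TB/s, ballpark for high-end consumer GDDR

print(f"HBM ceiling:  ~{hbm_bandwidth / weight_bytes:.0f} tokens/s")
print(f"GDDR ceiling: ~{gddr_bandwidth / weight_bytes:.0f} tokens/s")
```

Under these assumptions the same model decodes several times slower on consumer memory, even if the GPU's arithmetic units could keep up.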
[Chart: Gemma 4 26B on an NVIDIA RTX PRO 6000, standard decoding (left) versus the MTP drafter (right), in tokens per second. Same output quality, half the latency.]
MTP uses that idle time to bypass the heavy model and generate speculative tokens with a lightweight drafter. While the draft models are small (just 74 million parameters for Gemma 4 E2B), they have also been optimized in several ways to speed up speculative generation. For example, the drafter shares the main model's key-value cache (essentially the LLM's working memory for the current context), so it does not need to recompute context the main model has already processed. The E2B and E4B drafters also use sparse decoding techniques to prune the set of candidate tokens.
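Here is a simplified sketch of the draft-then-verify loop at the heart of speculative decoding. It uses greedy acceptance for clarity (production schemes accept drafts probabilistically, and Google's drafters add the KV-cache sharing described above); `main_model` and `draft_model` are hypothetical callables returning logits for every position:

```python
import numpy as np

def speculative_step(main_model, draft_model, tokens, k=4):
    """One round of draft-then-verify speculative decoding (greedy variant).

    The tiny drafter cheaply proposes k tokens; the big model then checks
    all k positions in a single forward pass, filling the compute cycles
    that plain one-token-at-a-time decoding leaves idle.
    """
    # 1. Drafter proposes k tokens, one at a time (cheap).
    draft = list(tokens)
    for _ in range(k):
        draft.append(int(np.argmax(draft_model(draft))))

    # 2. Main model scores every drafted position in ONE pass (expensive,
    #    but paid once instead of k times). Shape: (len(draft), vocab).
    logits = main_model(draft)

    # 3. Accept drafted tokens while they match the main model's choice;
    #    on the first mismatch, keep the main model's correction and stop.
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        verified = int(np.argmax(logits[i - 1]))  # logits[i-1] predicts token i
        accepted.append(verified)
        if verified != draft[i]:
            break
    return accepted
```

Because every accepted token is one the main model would have produced anyway, output quality is unchanged; the speedup depends only on how often the drafter guesses right.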