Google’s latest DiffusionGemma open AI model comes with a 4x speed boost

Another day, another AI model from Google. This time, Google DeepMind has released a new member of the Gemma 4 open model family, but it is fundamentally different from the rest of the lineup. DiffusionGemma does not generate outputs linearly like most AI models. Instead, it can produce an entire block of text in parallel. Google says this makes it faster and more efficient when running on local hardware like Nvidia DGX or simple gaming GPUs.

Most AI models are autoregressive: they generate text one token at a time, from left to right. DiffusionGemma has more in common with image generation models, which start from random noise and iteratively refine it into the desired content. The model starts with a canvas of placeholder tokens and makes multiple passes over it, generating candidate tokens and using them to improve its estimates of the others. At the end of the process, the model finalizes its output as one large block: the “denoised” text canvas.
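To make the idea concrete, here is a minimal toy sketch of a parallel decoding loop in Python. It is not Google's implementation: the `toy_predict` function is a hypothetical stand-in for the neural network (it simply looks up the answer from a known target sentence), and the `MASK` symbol, `tokens_per_step` parameter, and confidence rule are all assumptions made for illustration. The only point is the shape of the loop: every pass scores all positions at once, and several tokens are committed per pass instead of one.

```python
import random

MASK = "_"  # hypothetical placeholder token; real models use a special mask id


def toy_predict(canvas, target, rng):
    """Stand-in for the model: propose a token and a confidence for every position.

    This toy cheats by reading the answer from `target`. Confidence rises with
    the number of already-filled neighbours, loosely mimicking how settled
    tokens help the model estimate the remaining ones.
    """
    proposals = []
    for i in range(len(canvas)):
        neighbours = [canvas[j] for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        known = sum(n != MASK for n in neighbours)
        confidence = known + rng.random()  # random part only breaks ties
        proposals.append((target[i], confidence))
    return proposals


def denoise(target, tokens_per_step=4, seed=0):
    """Fill a fully masked canvas, committing several tokens per pass."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    steps = 0
    while MASK in canvas:
        proposals = toy_predict(canvas, target, rng)  # one pass over the whole canvas
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        # Commit the most confident proposals among the still-masked positions.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:tokens_per_step]:
            canvas[i] = proposals[i][0]
        steps += 1
    return canvas, steps


words = "the quick brown fox jumps over the lazy dog today".split()
result, steps = denoise(words)
print(" ".join(result), "| passes:", steps)
```

In this toy, ten tokens are filled in three passes, where a strictly left-to-right generator would need ten. A real diffusion model can also revise tokens it has already placed, which this sketch leaves out.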

DiffusionGemma is large by the standards of Google’s open models. It is a mixture-of-experts (MoE) model with a total of 26 billion parameters, but only 3.8 billion are active during inference. This means it must fit into an 18GB memory allocation on a high-end GPU. In testing with an RTX 5090, DiffusionGemma produces about 700 tokens per second. With a single Nvidia H100 AI accelerator, it can produce more than 1,000 tokens per second. That is approximately four times the output of an autoregressive Gemma model of similar size.
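For a sense of scale, the figures above can be run through some quick arithmetic. The snippet below is only a back-of-the-envelope sketch: it assumes the "four times" figure applies equally to both GPUs, which the reported numbers do not confirm, so the implied baselines are rough estimates rather than measurements.

```python
total_params_b = 26.0   # total parameters, in billions (from the article)
active_params_b = 3.8   # parameters active per inference step, in billions

# Share of the model that is actually used for each token.
active_fraction = active_params_b / total_params_b

# Reported DiffusionGemma throughput, in tokens per second.
reported_tps = {"RTX 5090": 700, "H100": 1000}

# Assumption: the ~4x speedup holds on both GPUs.
assumed_speedup = 4
implied_baseline_tps = {gpu: tps / assumed_speedup for gpu, tps in reported_tps.items()}

print(f"Active share of parameters: {active_fraction:.1%}")
for gpu, tps in implied_baseline_tps.items():
    print(f"{gpu}: implied autoregressive baseline of about {tps:.0f} tokens/s")
```

Under those assumptions, roughly 15 percent of the parameters are active at any time, and a comparable autoregressive model would land somewhere around 175 to 250 tokens per second.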

[Animation: DiffusionGemma solving a Sudoku puzzle, before and after]

This approach to text generation shifts the bottleneck from memory bandwidth to computation, allowing up to 256 tokens to be generated in parallel. Google says this provides measurable improvements in nonlinear tasks such as in-line editing, molecular sequencing, and mathematical graphs. The animation above shows how DiffusionGemma was tuned to solve Sudoku puzzles, an extremely challenging task for standard autoregressive AI models because each token can depend on tokens that have not been generated yet. DiffusionGemma’s ability to continuously self-correct large sets of tokens makes this easy.


