
GenAI image generators like Stable Diffusion do not create images from left to right pixel by pixel. They start with noise and refine the entire image in parallel until it converges, a process called propagation. For years, applying the same principles to text generation has been largely out of reach.
Standard language models work like a typewriter: one token at a time, left to right, with no ability to modify the committed output. That pattern works in the cloud, where the batch size keeps the GPU saturated. For local inference or low-concurrency deployments, the GPU remains idle most of the time.
Google’s DiffusionGemma, released this week, is an open source experimental model that applies diffusion to text generation at production scale. Built on the Gemma 4 backbone and released under the Apache 2.0 license, it is the first diffusion language model natively supported in the open source VLLM inference platform. It generates 256-token blocks in parallel rather than sequentially, with each token position linked to each other. Google says DiffusionGemma generates text up to 4 times faster than the standard model on GPUs. At batch size 1 on a single Nvidia H100, the FP8 version reaches 1,008 tokens per second. According to VLLM benchmark results published today, on the H200, it reaches 1,288 – almost six times higher than the standard autoregressive baseline.
Despite the increase in speed, Google did not sell much of the release. The company’s launch post directly acknowledged that DiffusionGemma’s overall output quality is lower than the standard Gemma 4. "For applications that demand maximum quality, we recommend deploying standard Gemma 4."
What does DiffusionGemma do?
DiffusionGemma does not generate tokens in order. It starts with a block of 256 random placeholder tokens, effectively a blank canvas, and runs multiple refinement passes on the entire block at once. On each pass, it evaluates each situation and locks the ones it is most confident about. Uncertain situations are randomized and reconsidered on the next pass, with the model using what was solved in the previous round to inform the next attempt. The block progressively converges until a state is stable enough to stabilize the rest.
Two things follow from that architecture.
- Self-improvement. An autoregressive model that commits to the wrong token is stuck, because subsequent tokens are already based on the mistake. DiffusionGemma can identify low-confidence situations and reevaluate them on the next pass.
-
Bidirectional reference. Each position simultaneously acts on every other position in the block, including tokens that appear later in the sequence. This makes the model better suited for structurally limited generation tasks where left-to-right generation fails.
Google demonstrated both properties with a sophisticated Sudoku solver. The base model solved zero puzzles. After fine-tuning on the Sudoku dataset, it reached 80% success rate and converged to 12 denoising steps instead of 48. The gains in efficiency came directly from the model’s ability to self-correct and stop quickly.
How was it built?
DiffusionGemma runs as a 26b mixture of experts model that activates only 3.8b parameters during estimation. Quantized, this fits within 18GB of VRAM on consumer hardware, including the Nvidia RTX 4090 and 5090. Google and NVIDIA have also optimized for enterprise Hopper and Blackwell servers using the NVFP4 kernel.
VLLM integration requires new work because DiffusionGemma does not fit the standard serving model. A typical VLLM batch applies the same attention type to every request. DiffusionGemma requests alternate between causal and bidirectional attention as they cycle through prompt reads, canvas refinement, and block commits. The team built per-request attention switching into both the Triton and FlashAttention 4 backends and reused existing speculative decoding paths for the refinement loop.
The new ModelState interface created by the team for this integration is designed to support additional diffusion models in VLLM as they emerge.
Where speed wins and where it doesn’t
DiffusionGemma’s speed advantage is real but conditional. Where this applies depends entirely on the deployment context.
Number. At batch size 1 on a single H100, published benchmarks of VLLM put the FP8 model at about five times the standard autoregressive baseline. On the H200, about six times. Those extreme figures reflect optimal conditions: single user, dedicated hardware, FP8 quantization.
Where it wins. Local estimation, single-user applications, and low-concurrency service. In those situations the GPU has extra computation and the memory is the bandwidth bottleneck. DiffusionGemma’s parallel block generation fills that gap.
Where this does not happen. High-throughput cloud service. When a server is batching hundreds of concurrent requests, autoregressive models already saturate the available compute and DiffusionGemma’s parallel decoding provides diminishing returns.
Quality roofing. AI researcher Guilherme O’Tina put a better point on X. "Local artifacts vs hallucinations are different problems and that’s what decides where it really wins," O’Tina wrote.
How does it compare
Diffusion language models are not new. Researchers have created them on a small scale for several years, and Inception Labs’ Mercury Coder implemented the approach commercially for coding tasks in 2025. What DiffusionGemma has added is scale – a 26B MOE backbone, native VLLM service, and a general-purpose instruction-tuned model rather than a domain-specific one.
The more useful comparison for engineers evaluating it against existing estimation tooling is speculative decoding, and the difference matters. Speculative decoding keeps a standard autoregressive target model and uses a smaller draft model to predict many further tokens. The target model verifies them one at a time. If sampling is correct, the output distribution remains the same as the target. The architecture is unchanged.
Andrew Kuncevich, an ML and AI researcher who focuses on production AI systems, puts it straight to X. "DiffusionGemma is different. It doesn’t just predict the future of the token. It creates a noisy 256-token canvas and repeatedly maps the entire block in parallel. So it’s not just a decoding trick – it’s a different generational paradigm," Kuncevich wrote.
Compared to the standard Gemma 4, the trade is speed for quality. Google’s benchmark data shows DiffusionGemma below standard Gemma 4 on common output quality metrics, with the difference varying by task.
On structured constrained tasks, including problems requiring code infilling, template generation, and bidirectional constraint propagation, the architecture has a structural advantage that fine-tuning can surface, as Sudoku results demonstrate. On the open-ended generation, the standard Gemma 4 remains the more robust choice.
What does this mean for enterprises
DiffusionGemma works through a standard VLLM OpenAI-compliant endpoint that does not require any Diffusion-specific pipeline changes.
This is not a general purpose model upgrade.
For teams running local or low-concurrency inference, architecture options have expanded. Until now, cutting generation latency on dedicated GPU hardware meant using a smaller model and accepting the quality trade-off. DiffusionGemma provides a third path on consumer hardware, with VLLM support, at the same parameter footprint.
For limited generation workloads, bidirectional attention is worth evaluating. Code stuffing, structured data generation, and tasks where the correct output depends on context that has not yet been generated have a structural edge in this architecture.
The ModelState interface built for this integration is designed to be generalized as additional diffusion models emerge.
Quality compromise is real and Google accepts it. For teams doing local inference on dedicated GPU hardware, this is worth testing.
<a href