Google launches DiffusionGemma for faster local AI inference
In brief
- DiffusionGemma generates whole blocks of text in parallel via diffusion instead of producing one token at a time
- The model activates only 3.8B of its 26B parameters during inference and, when quantized, fits in 18GB of VRAM
- Google targets developers building latency-sensitive tools like inline editing and code infilling
- Standard Gemma 4 remains the recommended choice for applications that need maximum output quality
How DiffusionGemma works
DiffusionGemma generates entire blocks of text at once instead of predicting one token at a time. This lets the model self-correct and format complex Markdown in real time. The architecture generates 256 tokens in parallel, which the model then refines over multiple passes.
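To make the idea of filling a block over several passes concrete, here is a toy sketch in Python. It is not Google's algorithm: the function names are hypothetical, the "model" simply copies from a known target string, and positions are chosen at random rather than by confidence. It also only commits tokens and never revises them, so it does not show the self-correction described above. It illustrates the shape of the loop, in which many positions are filled per pass instead of one.

```python
import random

def render(block):
    """Show uncommitted positions as underscores."""
    return "".join("_" if tok is None else tok for tok in block)

def toy_block_refinement(target, passes=4, seed=0):
    """Fill a fully masked block over several passes (toy illustration only)."""
    rng = random.Random(seed)
    block = [None] * len(target)  # None marks a position not yet committed
    snapshots = [render(block)]
    for step in range(passes):
        open_positions = [i for i, tok in enumerate(block) if tok is None]
        if not open_positions:
            break
        # Commit an equal share of the open positions on each pass,
        # so the final pass fills whatever is left.
        passes_left = passes - step
        count = -(-len(open_positions) // passes_left)  # ceiling division
        for i in rng.sample(open_positions, count):
            # Assumption: a real model would predict this token; the toy copies it.
            block[i] = target[i]
        snapshots.append(render(block))
    return snapshots

for snapshot in toy_block_refinement("hello world"):
    print(snapshot)
```

The first snapshot is fully masked and the last one equals the target. Which positions fill in between depends on the seed. An autoregressive loop over the same 11 characters would need 11 sequential steps, while this one finishes in four passes.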
Bidirectional attention is central to this approach. Every token in a block can attend to every other token in that block, which Google said could help in areas like math graphs, amino acid sequences, and structured editing. In traditional autoregressive models, by contrast, tokens are predicted one after another, and each token can attend only to the tokens that came before it.
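The difference between the two attention patterns can be shown with a pair of boolean masks. This is a minimal sketch with hypothetical helper names, covering a single block only. How DiffusionGemma handles attention across blocks is not described here, so the sketch makes no claim about it.

```python
def causal_mask(n):
    """Autoregressive pattern: position i sees positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Bidirectional pattern: every position sees every position in the block."""
    return [[True] * n for _ in range(n)]

def visible_counts(mask):
    """Number of positions each row (query token) is allowed to attend to."""
    return [sum(row) for row in mask]

print(visible_counts(causal_mask(4)))         # [1, 2, 3, 4]
print(visible_counts(bidirectional_mask(4)))  # [4, 4, 4, 4]
```

Under the causal mask, the first token sees only itself. Under the bidirectional mask it sees the whole block, which is what lets later content influence earlier positions during refinement.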
Performance and deployment
DiffusionGemma can generate more than 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on an NVIDIA GeForce RTX 5090. Google said the model can deliver up to four times faster output on dedicated GPUs.
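For a rough sense of what those throughput figures mean for an interactive tool, the back-of-envelope calculation below converts them into generation time. It treats the reported numbers as floors, so the results are upper bounds. The 500-token response length is an arbitrary assumption, and the estimate ignores prompt processing and any startup cost.

```python
# Reported floors from the article ("more than" each figure).
H100_TOKENS_PER_SEC = 1000
RTX_5090_TOKENS_PER_SEC = 700

def max_seconds(tokens, tokens_per_sec):
    """Upper bound on generation time, given a lower bound on throughput."""
    return tokens / tokens_per_sec

response_tokens = 500  # assumed response length, for illustration only
print(f"H100:     under {max_seconds(response_tokens, H100_TOKENS_PER_SEC):.2f} s")
print(f"RTX 5090: under {max_seconds(response_tokens, RTX_5090_TOKENS_PER_SEC):.2f} s")
```

At these rates a 500-token response would take under 0.50 seconds on the H100 and under about 0.71 seconds on the RTX 5090.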
Google is positioning the model for researchers and developers building latency-sensitive tools, including inline editing, code infilling, rapid iteration, and non-linear text generation. The model is most useful for local, low-concurrency inference, while traditional autoregressive models may remain more efficient in high-volume cloud deployments.
Open access and caveats
DiffusionGemma is available through Hugging Face with support for MLX, vLLM, Hugging Face Transformers, Unsloth, and NVIDIA NeMo. The model is released under an Apache 2.0 license. Google said official llama.cpp support is coming soon.
Google was careful to frame DiffusionGemma as experimental. The company's standard Gemma 4 models remain the better option for applications requiring maximum output quality. For developers exploring interactive local AI systems where speed matters, though, DiffusionGemma offers a different trade-off, giving up some output quality in exchange for faster generation.


