Google launches DiffusionGemma for faster local AI inference

By Khal · Jun 10, 2026, 6:13 p.m.

In brief

DiffusionGemma generates blocks of text in parallel via diffusion instead of token by token
The model activates only 3.8B of its 26B parameters during inference and runs in 18GB of VRAM when quantized
Google targets developers building latency-sensitive tools such as inline editing and code infilling
Standard Gemma 4 remains the recommended choice for applications that need maximum output quality
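
The parameter and memory figures above can be sanity-checked with quick arithmetic. The sketch below computes the share of parameters active during inference and the average storage per parameter implied by the 18GB figure. It assumes, which Google has not stated, that the full 18GB is weight storage and that 1GB means 10^9 bytes; the real footprint also covers activations and caches, so the bits-per-parameter number is an upper bound rather than a reported quantization level.

```python
# Back-of-the-envelope check of the "In brief" figures.
TOTAL_PARAMS = 26e9    # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9  # parameters active during inference (from the article)
VRAM_BYTES = 18e9      # 18GB quantized; ASSUMES 1GB = 10**9 bytes


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for each inference step."""
    return active / total


def bits_per_param(vram_bytes: float, total_params: float) -> float:
    """Average bits available per parameter if ALL memory held weights (assumption)."""
    return vram_bytes * 8 / total_params


print(f"active share: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")      # 14.6%
print(f"bits per parameter: {bits_per_param(VRAM_BYTES, TOTAL_PARAMS):.2f}")    # 5.54
```

Roughly one parameter in seven is active at a time, and the memory budget works out to about five and a half bits per stored parameter at most.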

How DiffusionGemma works

DiffusionGemma generates entire blocks of text at once instead of predicting one token at a time. The architecture produces a block of 256 tokens in parallel, then refines that block over multiple passes. Because the whole block is revisited on each pass, the model can self-correct and format complex markdown as it generates.
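
To make the idea of parallel generation with multi-pass refinement concrete, here is a toy sketch of iterative unmasking, one common decoding scheme in discrete text diffusion. The article does not say which scheme DiffusionGemma actually uses, so treat this as an illustration only. The `toy_denoiser` function is a hypothetical stand-in for the model: it cheats by reading the target text and attaches a random confidence to each proposal.

```python
import random

MASK = "_"  # placeholder for a position that has not been decided yet


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for the model.

    Proposes a token and a confidence for every masked position at once.
    It cheats by copying from `target`; a real model would predict tokens.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(block) if tok == MASK}


def generate_block(target, passes=4, seed=0):
    """Fill a whole block over several passes, most confident slots first."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = [block.copy()]
    for step in range(passes):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break
        # Commit enough positions per pass that the final pass fills the rest.
        remaining_passes = passes - step
        k = -(-len(proposals) // remaining_passes)  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            block[i] = proposals[i][0]
        history.append(block.copy())
    return block, history


words = "every slot in the block is proposed at once".split()
final, steps = generate_block(words, passes=3)
for snapshot in steps:
    print(" ".join(snapshot))
```

Each printed line is one pass: all positions are proposed together, and the block fills in out of order rather than left to right. A block of 256 tokens would work the same way with a longer `target`.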

Bidirectional attention is central to this approach. Every token in a block can attend to every other token in that block, which Google said could help in areas like math graphs, amino acid sequences, and structured editing. Traditional autoregressive models, by contrast, predict one token after another, and each token can attend only to the tokens that come before it.
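
The difference between the two attention patterns is easiest to see as a mask over positions. The sketch below builds a causal mask, as used by autoregressive models, and a fully bidirectional mask for a single block. It is a generic illustration with hypothetical helper names; how DiffusionGemma handles attention across blocks or to earlier context is not described in the article.

```python
def causal_mask(n):
    """Position i may attend only to positions 0..i (autoregressive)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every position in the block."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """Indices that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


n = 4
print(visible_positions(causal_mask(n), 1))         # [0, 1]
print(visible_positions(bidirectional_mask(n), 1))  # [0, 1, 2, 3]
```

Under the causal mask, the second token cannot see anything after it. Under the bidirectional mask it sees the whole block, which is what allows later text to influence earlier positions during refinement.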

Performance and deployment

DiffusionGemma can generate more than 1,000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on an NVIDIA GeForce RTX 5090. The company said the model can deliver up to four times faster output on dedicated GPUs.
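
For a sense of what those throughput figures mean for an interactive tool, the sketch below converts tokens per second into wall-clock time. Because Google's numbers are stated as "more than", the results are upper bounds on generation time. The 256-token size matches one block as described earlier; the 1,000-token reply length is an arbitrary example, and the helper name is hypothetical.

```python
# Reported lower bounds on throughput, in tokens per second (from the article).
THROUGHPUT = {"NVIDIA H100": 1000, "NVIDIA GeForce RTX 5090": 700}


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to emit `tokens` at a steady rate; ignores prompt processing."""
    return tokens / tokens_per_second


for gpu, tps in THROUGHPUT.items():
    block = seconds_for(256, tps)    # one 256-token block
    reply = seconds_for(1000, tps)   # ASSUMED example reply length
    print(f"{gpu}: 256 tokens in under {block:.3f}s, 1,000 tokens in under {reply:.2f}s")
```

At these rates a full 256-token block arrives in roughly a quarter to a third of a second, which is the range where inline edits and code infills start to feel immediate.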

Google is positioning the model for researchers and developers building latency-sensitive tools, including inline editing, code infilling, rapid iteration, and non-linear text generation. The model is most useful for local and low-concurrency inference, while traditional autoregressive models may remain more efficient in high-volume cloud deployments.

Open access and caveats

DiffusionGemma is available through Hugging Face with support for MLX, vLLM, Hugging Face Transformers, Unsloth, and NVIDIA NeMo. The model is released under an Apache 2.0 license. Google said official llama.cpp support is coming soon.

Google was careful to frame DiffusionGemma as experimental. The company's standard Gemma 4 models remain the better option for applications requiring maximum output quality. For developers exploring interactive local AI systems where speed matters, though, DiffusionGemma offers a different trade-off.
