Google Open-Sources DiffusionGemma: A 4× Faster Text Model That Ditches Autoregression

Google DeepMind and NVIDIA have jointly released DiffusionGemma 26B-A4B, the first large-scale, open-weight text diffusion language model. Where GPT, Claude, and conventional Gemma generate text one token at a time, left-to-right, DiffusionGemma denoises 256-token blocks in parallel. The result, according to Google’s benchmarks, is more than 1,000 tokens per second on a single H100 and over 700 tokens per second on a consumer NVIDIA RTX 5090 — roughly 3.5–4× the throughput of the autoregressive Gemma 4.

The model is released under Apache 2.0 rather than the more restrictive Gemma licence. For NZ developers and researchers, that means it can be deployed commercially without an enterprise agreement.

🔍 THE BOTTOM LINE

DiffusionGemma is a real paradigm shift wrapped in a single release: the same physical hardware now produces four times the text throughput, with a measurable but bounded quality cost. For batch workloads (code infilling, document editing, structured output, code completion), the trade-off is well worth it. For open-ended generation, autoregressive still wins.

What is text diffusion, and why does it matter?

Standard language models generate text autoregressively — one token at a time, each conditioned on all the previous tokens. The chain of dependencies creates an inherent sequential bottleneck. No matter how fast the GPU, you’re limited by the fact that token N+1 cannot be produced before token N.

Diffusion language models (DLMs) adapt the iterative denoising used by image generators such as Stable Diffusion to text:

Initialise a canvas of 256 mask/noise tokens at the target length
Run multiple denoising steps that refine the entire canvas in parallel
Commit tokens that become confident at each step; re-refine the remainder
After 48 or so denoising steps, the canvas contains a coherent 256-token block
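
The steps above can be sketched as a toy loop. This is a minimal illustration under loud assumptions, not DiffusionGemma's actual sampler: `toy_predict` is a hypothetical stand-in for the model that simply proposes the right token with a random confidence, and the `threshold` value and the "always commit the best guess" rule are choices made only for this example.

```python
import random

MASK = "<mask>"

def toy_predict(canvas, target, rng):
    """Hypothetical stand-in for the model.

    A real model would score the whole vocabulary for every position. Here each
    masked slot just receives its target token plus a random confidence.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(canvas) if tok == MASK}

def denoise_block(target, max_steps=48, threshold=0.7, seed=0):
    """Fill a fully masked canvas, committing confident tokens at each step."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    steps_used = 0
    for _ in range(max_steps):
        proposals = toy_predict(canvas, target, rng)
        if not proposals:
            break  # every position is already committed
        steps_used += 1
        confident = {i: tok for i, (tok, conf) in proposals.items()
                     if conf >= threshold}
        if not confident:
            # Commit the single most confident guess so each step makes progress.
            best = max(proposals, key=lambda i: proposals[i][1])
            confident = {best: proposals[best][0]}
        for i, tok in confident.items():
            canvas[i] = tok  # committed tokens are not revisited
    return canvas, steps_used

target = [f"tok{i}" for i in range(256)]
canvas, steps_used = denoise_block(target)
print(f"steps used: {steps_used}, still masked: {canvas.count(MASK)}")
```

The point of the sketch is the shape of the loop: every step looks at the whole canvas at once, and the number of steps depends on how quickly positions become confident rather than on the number of tokens.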

Because every position in the block is refined in parallel, the model emits roughly 15-20 tokens per forward pass, which is how it reaches more than 1,000 tokens per second on a single H100. That parallelism is what unlocks the roughly 4× speedup.
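
A quick back-of-envelope check shows how these figures relate. The sketch below assumes throughput is simply tokens per pass multiplied by passes per second, which ignores prompt processing and batching; the helper names are made up for this example.

```python
def passes_per_second(tokens_per_second, tokens_per_pass):
    """Forward passes per second needed to sustain a given token throughput."""
    return tokens_per_second / tokens_per_pass

def passes_per_block(block_size, tokens_per_pass):
    """Average number of forward passes spent on one block."""
    return block_size / tokens_per_pass

BLOCK_SIZE = 256          # block length quoted in the article
H100_TOKENS_PER_S = 1000  # lower bound quoted for a single H100

for tpp in (15, 20):      # tokens per forward pass, as quoted
    print(f"{tpp} tokens/pass: "
          f"{passes_per_second(H100_TOKENS_PER_S, tpp):.1f} passes/s, "
          f"{passes_per_block(BLOCK_SIZE, tpp):.1f} passes per block")

# For comparison: spending all 48 steps on every block would average this many
# tokens per pass.
print(f"{BLOCK_SIZE / 48:.1f} tokens/pass at 48 steps per block")
```

Taken at face value, 15-20 tokens per pass works out to roughly 50 to 67 forward passes per second and about 13 to 17 passes per block. That is fewer than the 48 or so steps mentioned earlier, which would fit a sampler that commits confident tokens early, though the published figures do not spell this out.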

Architecture and specifications

DiffusionGemma is a Mixture-of-Experts (MoE) model with 25.2 billion total parameters, of which only 3.8 billion activate per token. The MoE design gives it the knowledge capacity of a 26B model at a compute cost closer to that of a 4B model. Key specifications:

Total parameters: 25.2B (MoE)
Active parameters per token: 3.8B
Context window: 256K tokens
Languages: 140+ (text input)
Input modalities: text, image, video (interleaved)
Output modality: text
License: Apache 2.0
Knowledge cutoff: January 2025
VRAM requirement: ~18GB (fits on RTX 5090/4090 at quantised precision)
Frameworks: Hugging Face Transformers, vLLM, Unsloth, MLX (day-zero support); llama.cpp in progress
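
The gap between total and active parameters in the list above comes from sparse expert routing: a small router picks a few experts for each token and the rest sit idle. The sketch below shows generic top-k routing for a single token. The expert count, `top_k` value, and random router scores are hypothetical and say nothing about DiffusionGemma's real configuration.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and renormalise their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

# Hypothetical sizes chosen for illustration only.
NUM_EXPERTS, TOP_K = 8, 2
rng = random.Random(0)
logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]

gates = route_token(logits, TOP_K)
print(f"experts used for this token: {sorted(gates)} of {NUM_EXPERTS}")
print(f"article's ratio: {3.8 / 25.2:.1%} of parameters active per token")
```

Only the chosen experts run for that token, so compute tracks the active parameter count, while every expert still has to be held in memory.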

The NVFP4 (4-bit floating-point) data format is natively supported on NVIDIA Blackwell GPUs, dramatically accelerating compute throughput.
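
To see why precision matters for the VRAM figure, here is a rough estimate of the memory taken by the weights alone at different bit widths. It is a sketch that assumes all 25.2B parameters are resident on the GPU and ignores the KV cache, activations, and the per-block scaling factors that low-bit formats add, so real usage will be higher.

```python
def weight_memory_gb(total_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return total_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 25.2e9  # total parameter count quoted in the article

for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

At 4 bits the weights come to about 12.6 GB, which leaves room under the quoted ~18GB for the overheads this estimate leaves out. At 16 bits the weights alone would need about 50 GB, beyond any single consumer card.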

The quality trade-off — and how to think about it

Google is candid about the cost. On standard knowledge and reasoning benchmarks, DiffusionGemma trails the autoregressive Gemma 4:

MMLU Pro: 77.6 (DiffusionGemma) vs 82.6 (AR Gemma 4)
GPQA: 73.2 vs 82.3
MMMU Pro: 54.3 vs 73.8
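
For a sense of scale, the same scores can be expressed as point gaps and as a percentage of the autoregressive score. This is plain arithmetic on the figures listed above; the `gap` helper is just a convenience for this example.

```python
# (DiffusionGemma, autoregressive Gemma 4) scores as quoted in the article
scores = {
    "MMLU Pro": (77.6, 82.6),
    "GPQA": (73.2, 82.3),
    "MMMU Pro": (54.3, 73.8),
}

def gap(diffusion, autoregressive):
    """Return (points behind, percent behind relative to the AR score)."""
    points = autoregressive - diffusion
    return round(points, 1), round(100 * points / autoregressive, 1)

for name, (diffusion, autoregressive) in scores.items():
    points, percent = gap(diffusion, autoregressive)
    print(f"{name}: {points} points behind ({percent}% lower)")
```

The gap is smallest on MMLU Pro (5.0 points, about 6% lower) and largest on the multimodal MMMU Pro (19.5 points, about 26% lower).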

That is a meaningful gap on hard knowledge and reasoning. But the trade-off inverts on the tasks where diffusion shines:

Code infilling — filling in a function body given the surrounding context. Diffusion’s bidirectional attention is structurally advantaged.
Document editing — modifying a paragraph in place. AR models have to re-generate everything to the right; diffusion can rewrite the block directly.
Structured outputs — JSON, XML, code with strict syntax. The block-level refinement is more reliable than left-to-right generation.
In-line correction — fixing a typo in the middle of a long document. Diffusion can edit one region without disrupting others.
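
The structural advantage in infilling comes down to which positions a masked slot is allowed to look at. The sketch below compares a causal (left-to-right) attention mask with a bidirectional one on a toy token list. It is a simplified picture: the token list and helper names are invented for illustration, and autoregressive models can also be prompted for infilling with special fill-in-the-middle formats.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def visible_context(mask, tokens, position):
    """Tokens that the given position can see, excluding itself."""
    return [tokens[j] for j, allowed in enumerate(mask[position])
            if allowed and j != position]

tokens = ["def", "add", "(a, b):", "<mask>", "return", "result"]
hole = tokens.index("<mask>")

print("causal:       ", visible_context(causal_mask(len(tokens)), tokens, hole))
print("bidirectional:", visible_context(bidirectional_mask(len(tokens)), tokens, hole))
```

With the causal mask the hole sees only the function signature to its left. With the bidirectional mask it also sees the `return result` that follows, which is the context an infilled body has to agree with.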

For batch, structured, or editing workloads, the 4× throughput is a real productivity unlock. For open-ended creative writing or long-form reasoning, AR remains the better default.

Why this matters for New Zealand

DiffusionGemma is genuinely free to use, runs locally on a 24GB consumer GPU, and is fast enough to be a practical daily tool rather than a research curiosity. For NZ developers and small businesses, that changes the procurement default:

A freelancer or agency can run DiffusionGemma on an RTX 5090-equipped workstation for code infilling and document editing at near-instant speed
A research lab can run a serious open-weight model without sending data to a third party — important under the Privacy Act 2020 and the Biometric Processing Privacy Code 2025
A university course on LLMs can teach both paradigms in a single semester

The wider signal is bigger than DiffusionGemma. Throughput on consumer hardware was the dominant benchmark in 2024-2025; the dominant benchmark in 2026 is becoming tasks completed per dollar on a $2,000 consumer GPU. For New Zealand’s digital-sovereignty debate, where the gap between Australia and NZ is documented in our Australia vs NZ AI regulation comparison, the procurement default is shifting.

What comes next

The architecture is new enough that the open-source ecosystem is still catching up: llama.cpp support is in progress, and local inference tooling is less mature than for AR models. The quality gap on hard reasoning also means AR models won’t disappear; the two approaches are complementary rather than competing.

The interesting question for the next 12-18 months: does the quality gap close? If open-weight diffusion models reach AR parity on standard benchmarks while maintaining the 4× throughput, the entire batch-vs-stream trade-off in AI infrastructure gets rewritten. Releases like Cohere’s Command A+ (open-sourced May 20, 218B total / 25B active, Apache 2.0) and the Mistral line have shown the open-weight pattern works. DiffusionGemma is the first signal that the same pattern can work for a non-autoregressive architecture.

❓ FAQ

Is DiffusionGemma better than GPT-5.5 or Claude Opus 4.8? No. On hard knowledge, reasoning, and agentic tasks, the closed-source frontier models still win. DiffusionGemma is a tool for a different job: high-throughput batch generation, structured output, and local-first deployment.

Can I use it commercially? Yes. Apache 2.0 allows commercial use, modification, and redistribution. The main conditions are to include a copy of the licence, keep existing copyright and attribution notices, and mark any files you have modified.

What hardware do I need? About 18GB of VRAM at quantised precision. A 24GB RTX 4090 or a 32GB RTX 5090 handles it comfortably. Day-zero support is on Hugging Face Transformers, vLLM, Unsloth, and MLX.

Why is it 4× faster if it has 26B parameters? The 4× number is throughput, not quality. The model is a Mixture-of-Experts with 3.8B active parameters per token, so each forward pass costs about as much compute as a small dense model. The 4× speedup itself comes from refining a 256-token block in parallel rather than producing 256 tokens one after another.

Will diffusion replace autoregression? Not in the next 12-18 months. The two paradigms are complementary. AR for open-ended generation and long-form reasoning; diffusion for batch, structured, and editing tasks. The most interesting 2027 product will be a model that uses both depending on the task.

🔍 THE BOTTOM LINE

DiffusionGemma is a real release, not a research demo. Apache 2.0, 18GB VRAM, 4× faster than autoregressive Gemma 4, 140+ languages, day-zero Hugging Face support. The quality gap is real but bounded, and the cost of trying it is zero. For NZ developers, researchers, and procurement teams, this is the kind of release that should change the default conversation about which models are even on the table.