Google Open-Sources DiffusionGemma, 4x Faster LLM
Google has released DiffusionGemma, an open-source large language model built on a text diffusion architecture. It generates text four times faster than conventional autoregressive LLMs while using less RAM. Instead of emitting one token at a time, it produces a block of 256 tokens in parallel, reaching more than 1,000 tokens per second on an Nvidia H100 GPU.
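To show why parallel generation can cut latency, here is a toy sketch that counts network forward passes under two decoding schemes. It assumes a generic masked-diffusion sampler that commits a fixed share of a masked block on each step. The stub `toy_forward` stands in for a real model, and the step count of 32 is a hypothetical value. Google has not described DiffusionGemma's sampler in these terms, so this illustrates the general idea, not its actual method.

```python
import math

MASK = None  # placeholder for a not-yet-decided position


def toy_forward(seq):
    # Hypothetical stand-in for one network forward pass: proposes a
    # (token, confidence) pair for every position in the sequence.
    return [(i % 50, 1.0 / (i + 1)) for i in range(len(seq))]


def autoregressive_decode(n_tokens):
    # Conventional decoding: one forward pass per generated token.
    seq, passes = [], 0
    for _ in range(n_tokens):
        proposals = toy_forward(seq + [MASK])
        passes += 1
        seq.append(proposals[-1][0])
    return seq, passes


def masked_diffusion_decode(n_tokens, steps):
    # Start from a fully masked block; each pass refines the whole block
    # and commits the most confident masked positions.
    seq, passes = [MASK] * n_tokens, 0
    per_step = math.ceil(n_tokens / steps)
    while MASK in seq:
        proposals = toy_forward(seq)
        passes += 1
        masked = [i for i, t in enumerate(seq) if t is MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            seq[i] = proposals[i][0]
    return seq, passes


_, ar_passes = autoregressive_decode(256)
_, md_passes = masked_diffusion_decode(256, steps=32)
print(ar_passes, md_passes)  # 256 32
```

Fewer passes do not translate directly into a fixed speedup. Each diffusion pass processes the whole block, and real samplers tune the step count against output quality. The reported 4x figure is a measured result and cannot be derived from this toy.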
Based on Google's Gemma 4 26B model, DiffusionGemma uses a mixture-of-experts architecture with 26 billion total parameters, of which only 3.8 billion are active for each token. This lets it run on consumer-grade GPUs such as the GeForce RTX 5090. The model is available on Hugging Face under an open-source license.
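The sketch below shows how a mixture-of-experts layer touches only a subset of its weights for each token. It uses a generic top-k router. The class `ToyMoELayer`, the expert count of 16, and the top-k value of 2 are hypothetical and are not DiffusionGemma's actual configuration. In real models the attention and embedding weights are always active, so the overall 3.8B/26B ratio is not simply top_k divided by the expert count.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class ToyMoELayer:
    """Hypothetical sparse MoE layer: each expert is a d_model x d_model matrix."""

    def __init__(self, d_model, n_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one score vector per expert.
        self.router = [[rng.gauss(0, 1) for _ in range(d_model)]
                       for _ in range(n_experts)]
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(d_model)]
                         for _ in range(d_model)]
                        for _ in range(n_experts)]

    def forward(self, x):
        # Score every expert for this token, then keep only the top_k.
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in self.router]
        top = sorted(range(len(scores)), key=lambda e: scores[e],
                     reverse=True)[: self.top_k]
        gates = softmax([scores[e] for e in top])
        out = [0.0] * len(x)
        # Only the selected experts' weights are read for this token.
        for g, e in zip(gates, top):
            for i, row in enumerate(self.experts[e]):
                out[i] += g * sum(w * xi for w, xi in zip(row, x))
        return out, top

    def active_fraction(self):
        # Share of expert weights used per token (ignores shared layers).
        return self.top_k / len(self.experts)


layer = ToyMoELayer(d_model=8, n_experts=16, top_k=2)
out, used = layer.forward([1.0] * 8)
print(len(used), layer.active_fraction())  # 2 0.125
print(round(3.8 / 26, 3))  # 0.146, the article's active-parameter ratio
```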
