What it is
Tiny helper models that ride alongside Gemma 4 and guess 4-8 tokens ahead; the main model then verifies the whole guess in a single forward pass. Guess right, and you bank several tokens for the cost of one pass. Guess wrong, and you still keep the token the target actually chose, so it degrades to normal decoding at worst. No quality loss, because the big model signs off on every token.
Same speculative-decoding trick frontier labs use internally to make their APIs cheap. Now exposed for open-weight serving, Apache 2.0, drop-in. Drafter weights are released alongside Gemma 4 itself, so anyone already serving Gemma can swap them in.
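The verify-or-fall-back loop is easy to see in miniature. Below is a hedged sketch of greedy speculative decoding with toy stand-ins for the models: `target_next` and `draft_next` are hypothetical one-token-at-a-time functions, not the real Gemma 4 or drafter APIs. The point it demonstrates is the no-quality-loss property: the output is token-for-token identical to decoding with the target alone, whether the drafter is perfect or useless.

```python
def speculative_decode(target_next, draft_next, prefix, n_new, k=4):
    """Greedy speculative decoding sketch.

    target_next(ctx) -> next token from the big model (assumed greedy).
    draft_next(ctx)  -> next token from the cheap drafter.
    Output matches pure target decoding exactly; the drafter only
    changes how many tokens get verified per round, never which ones.
    """
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        # Drafter runs ahead and proposes k tokens autoregressively.
        guesses, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            guesses.append(t)
            ctx.append(t)
        # Target verifies the guesses (in a real engine: one batched
        # forward pass). We always keep the *target's* token, so a
        # wrong guess still yields one correct token, as in normal decoding.
        for g in guesses:
            t = target_next(out)
            out.append(t)
            if t != g or len(out) - len(prefix) >= n_new:
                break  # first mismatch invalidates the remaining guesses
    return out[len(prefix):][:n_new]
```

With a perfect drafter each round accepts all k tokens; with a hopeless one each round accepts exactly one, which is plain autoregressive decoding. Real engines verify the k guesses in a single batched target forward pass, which is where the speedup comes from.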
Why it matters
Gemma is the most-deployed open-weight family. A 2.7-3.5x throughput jump on long-context generation, for free, changes the math for anyone running inference at scale. Cheaper agents, faster RAG, smaller GPU bills.
The drafters share the target model's KV cache, so there is no recomputation overhead. Weights live on Hugging Face. Works with vLLM, SGLang, TGI, MLX, Transformers, and Ollama out of the box. 362 points on Hacker News the day it landed: the open-source serving crowd noticed immediately.