Z-Lab (Chen, Liang, Liu) shipped DFlash this week. 3.6k GitHub stars, +671 in a single day. It’s an inference speedup layer for any LLM, and the trick is genuinely new.
What’s actually different
Speculative decoding has been around for a while: a small draft model guesses N tokens ahead, and the big target model verifies them all in a single forward pass. EAGLE-3 is the current champ, but its draft side still generates token by token, and that serial drafting is what caps the speedup at around 2-3x.
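To make the bottleneck concrete, here's a toy greedy-acceptance version of that loop. This is a sketch, not any real library's API, and production systems use a rejection-sampling acceptance rule rather than exact greedy matching:

```python
# Toy sketch of classic speculative decoding (greedy acceptance only).
# draft_step and target_forward are stand-ins for model calls.

def speculative_step(prefix, draft_step, target_forward, n_draft=4):
    # 1) Draft n_draft tokens *serially*: one small-model forward pass
    #    per token. This loop is the cap the post is talking about.
    draft = []
    for _ in range(n_draft):
        draft.append(draft_step(prefix + draft))

    # 2) Verify all drafted tokens with ONE big-model forward pass,
    #    yielding the target's greedy pick at each drafted position
    #    (n_draft + 1 picks, including one past the last draft token).
    target_picks = target_forward(prefix + draft)

    # 3) Keep the longest agreeing prefix; on the first mismatch, take
    #    the target's token, so every step emits at least one token.
    out = list(prefix)
    for guess, pick in zip(draft, target_picks):
        if guess != pick:
            out.append(pick)
            break
        out.append(guess)
    else:
        out.append(target_picks[-1])  # all accepted: bonus token for free
    return out
```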
DFlash swaps the draft for a lightweight block diffusion model. Instead of drafting one token at a time, it drafts a whole block in parallel in a single forward pass. Result: 2.5x faster than EAGLE-3 on most workloads, ~4.5x on reasoning models with thinking mode on. Evals are published on GSM8K, MATH500, HumanEval, MBPP, and MT-Bench, not just throughput numbers.
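Here's the shape of that change as a toy: the serial drafting loop above becomes a small, fixed number of parallel denoising passes over a masked block. DFlash's actual sampler lives in the repo; this only illustrates why draft cost stops scaling with block size:

```python
# Hypothetical contrast with the serial loop above: a block-diffusion
# drafter fills every slot at once and refines over a few parallel
# denoising steps. A toy of the idea, not DFlash's actual sampler.

MASK = None  # placeholder for a "still masked" slot

def block_draft(prefix, denoise_step, block_size=8, n_steps=2):
    block = [MASK] * block_size
    for _ in range(n_steps):
        # denoise_step predicts ALL masked slots in ONE forward pass,
        # so cost is ~n_steps passes regardless of block_size
        block = denoise_step(prefix, block)
    return block  # handed to the target for the same one-pass verification
```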
How you actually run it
Fully open source, with four backends wired up out of the box: vLLM, SGLang, Transformers, and MLX. Pre-trained DFlash variants for Qwen3, Qwen3.5, Qwen-Coder, Gemma-4, and LLaMA-3.1 sit on Hugging Face. You drop one in as the draft model on top of your existing target, with no fine-tuning and no retraining.
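For a feel of the drop-in pattern, here's the generic version with stock Transformers assisted generation. Two caveats: the draft checkpoint ID below is a placeholder, not a confirmed repo name, and vanilla assisted generation still drafts token by token, so the full block-parallel speedup requires DFlash's own backend integrations:

```python
# Sketch of "draft model on top of your existing target" using stock
# Transformers assisted generation. The draft ID is hypothetical; check
# the DFlash repo / Hugging Face for real checkpoint names and the
# vLLM/SGLang/MLX integrations.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "Qwen/Qwen3-8B"          # your existing target model
draft_id = "z-lab/DFlash-Qwen3-8B"   # placeholder DFlash draft checkpoint

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")

prompt = tok("Explain block diffusion drafting in two sentences.",
             return_tensors="pt").to(target.device)
out = target.generate(**prompt, assistant_model=draft, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```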
If you’re serving LLMs in production and latency is the thing keeping you up, this is the repo to read this week.