Top AI Product

Every day, hundreds of new AI tools launch across Product Hunt, Hacker News, and GitHub. We dig through the noise so you don't have to — surfacing only the ones worth your attention with honest, no-fluff reviews. Explore our latest picks, deep dives, and curated collections to find your next favorite AI tool.


DFlash beats EAGLE-3 by 2.5x using block diffusion as the speculative draft model

Z-Lab (Chen, Liang, Liu) shipped DFlash this week. 3.6k GitHub stars, +671 in a single day. It’s an inference speedup layer for any LLM, and the trick is genuinely new.

What’s actually different

Speculative decoding has been around for a while: a small draft model guesses N tokens ahead, and the big model verifies all of them in a single forward pass. EAGLE-3 is the current champ, but its draft side still generates token by token, and that sequential drafting is what caps the speedup at around 2-3x.
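The draft-then-verify loop is simple enough to sketch. Here's a toy version with greedy acceptance (real implementations use rejection sampling over token probabilities); `draft_model` and `target_model` are stand-in callables for illustration, not DFlash code.

```python
def speculative_step(prefix, draft_model, target_model, n_draft=4):
    """One round of speculative decoding with greedy acceptance.

    draft_model(tokens) -> next token (cheap; called n_draft times).
    target_model(tokens) -> the target's next-token choice at EVERY
        position of `tokens`, computed in one forward pass.
    """
    # 1. Draft: the small model guesses n_draft tokens sequentially.
    guesses = []
    ctx = list(prefix)
    for _ in range(n_draft):
        t = draft_model(ctx)
        guesses.append(t)
        ctx.append(t)

    # 2. Verify: the big model scores prefix + guesses in one pass.
    verified = target_model(list(prefix) + guesses)

    # 3. Accept the longest prefix of guesses the target agrees with;
    #    at the first mismatch, substitute the target's own token, so
    #    every round yields at least one target-quality token.
    accepted = []
    for i, g in enumerate(guesses):
        target_choice = verified[len(prefix) + i - 1]
        if g == target_choice:
            accepted.append(g)
        else:
            accepted.append(target_choice)
            break
    else:
        # All guesses accepted: take the target's bonus token too.
        accepted.append(verified[-1])
    return list(prefix) + accepted
```

When the draft and target agree, each round emits n_draft + 1 tokens for the price of one big-model pass; a disagreement truncates the round early, which is why draft quality matters.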

DFlash swaps the draft for a lightweight block diffusion model. Instead of one token at a time, it parallel-drafts a whole block in a single forward pass. Result: 2.5x faster than EAGLE-3 on most workloads, ~4.5x on reasoning models with thinking mode on. Evals are published on GSM8K, MATH500, HumanEval, MBPP, and MT-Bench, not just raw throughput numbers.
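The structural change is confined to the draft side. A toy sketch of the cost difference, counting forward passes (these classes are stand-ins for illustration; a real block diffusion draft runs a few parallel denoising steps over masked positions rather than literally one call):

```python
class ARDraft:
    """Autoregressive draft: one forward pass per drafted token."""
    def __init__(self):
        self.passes = 0

    def draft(self, prefix, n):
        block, ctx = [], list(prefix)
        for _ in range(n):
            self.passes += 1        # one pass per token
            t = ctx[-1] + 1         # stand-in for a real model
            block.append(t)
            ctx.append(t)
        return block


class BlockDraft:
    """Block-parallel draft: all n positions filled in one pass."""
    def __init__(self):
        self.passes = 0

    def draft(self, prefix, n):
        self.passes += 1            # one pass for the whole block
        last = prefix[-1]
        return [last + i for i in range(1, n + 1)]
```

For a block of 8, the autoregressive draft costs 8 forward passes where the block draft costs 1; verification is a single target-model pass either way, so the sequential-drafting bottleneck that caps EAGLE-3 disappears.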

How you actually run it

Fully open source, with four backends wired up out of the box: vLLM, SGLang, Transformers, and MLX. Pre-trained DFlash variants for Qwen3, Qwen3.5, Qwen-Coder, Gemma-4, and LLaMA-3.1 sit on Hugging Face. You drop one in as the draft model on top of your existing target; no fine-tuning, no retraining.

If you’re serving LLMs in production and latency is the thing keeping you up, this is the repo to read this week.

