DeepSeek didn’t release a new model this time. It made the existing one run faster. DSpark is a semi-parallel speculative decoding framework, and it’s already serving live traffic on DeepSeek-V4 Flash and Pro. It reached the Hacker News front page with 600+ points.
What it actually does
Speculative decoding drafts several tokens cheaply with a small model, then has the full model verify them in one pass. The problem: acceptance rates decay across a block, so tokens drafted later in the block are less likely to survive verification. DSpark bolts a lightweight sequential module onto the parallel draft head to model token dependencies inside each block, plus a confidence head that scores how likely each token is to survive verification. Reported results: throughput up 51%–400%, lower latency, and acceptance length 16.3%–30.9% higher than Eagle3 and DFlash. Overall inference speed jumps as much as 80%.
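To make the draft-then-verify loop concrete, here is a minimal sketch of one greedy speculative decoding step with a confidence cutoff. It is not DSpark's code or API: `speculative_step`, `toy_draft`, `toy_target`, and the `min_conf` threshold are hypothetical names invented for illustration, and the toy models are plain functions rather than neural networks. The sketch assumes greedy decoding, where a drafted token is accepted only if it matches the token the target model would have produced.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interfaces for illustration only (not DSpark's API).
# A draft model returns (token, confidence) pairs for the next k positions.
DraftFn = Callable[[List[Token], int], List[Tuple[Token, float]]]
# A target model returns its greedy next token for a given context.
TargetFn = Callable[[List[Token]], Token]


def speculative_step(
    context: List[Token],
    draft: DraftFn,
    target: TargetFn,
    block_size: int = 4,
    min_conf: float = 0.0,
) -> Tuple[List[Token], int]:
    """Run one draft-then-verify step.

    Returns (new_tokens, num_draft_tokens_sent_to_verification).
    """
    proposals = draft(context, block_size)

    # Confidence cutoff: stop the block at the first token the draft
    # itself considers unlikely to survive verification.
    kept: List[Token] = []
    for tok, conf in proposals:
        if conf < min_conf:
            break
        kept.append(tok)

    # Verification. A real system scores all kept positions in one
    # batched forward pass; this loop computes the same result serially.
    accepted: List[Token] = []
    for tok in kept:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted, len(kept)
        accepted.append(tok)

    # Every kept token matched, so the target's next token comes for free.
    accepted.append(target(context + accepted))
    return accepted, len(kept)


def toy_target(ctx: List[Token]) -> Token:
    # Toy "large model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token], k: int) -> List[Tuple[Token, float]]:
    # Toy draft: right for the first two positions, wrong afterwards,
    # with confidence that decays along the block.
    out: List[Tuple[Token, float]] = []
    last = ctx[-1]
    for i in range(k):
        tok = (last + 1) % 10 if i < 2 else (last + 2) % 10
        out.append((tok, 0.9 ** (i + 1)))
        last = tok
    return out


if __name__ == "__main__":
    print(speculative_step([1], toy_draft, toy_target, 4, 0.0))
    print(speculative_step([1], toy_draft, toy_target, 4, 0.75))
```

In this toy setup both calls emit the same three tokens, but the second one sends only two drafted tokens to verification instead of four. That is the kind of saving a confidence score is meant to buy: less verification work spent on tokens that were going to be rejected anyway.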
Why it matters
No retraining, no new weights: DSpark ships as a module attached to existing checkpoints. DeepSeek also open-sourced DeepSpec, the full codebase for training and evaluating draft models, and it works on Qwen and Gemma too. That makes everyone’s models cheaper to run, for free.
