If you’ve been following the LLM inference optimization space at all, you know speculative decoding is table stakes at this point. A small draft model guesses tokens, a big model verifies them in a batch — it works, and it’s everywhere. But here’s the thing that always bugged people: the drafting and the verification still happen one after the other. You draft, then you verify, then you draft again. Sequential. Waiting.
[Saguaro](https://arxiv.org/abs/2603.03251) says: what if we just… don’t wait?
Published on March 3rd by Tanishq Kumar (Stanford PhD student), Tri Dao (yes, the FlashAttention guy, now at Princeton and Together AI), and Avner May, this paper introduces what they call “speculative speculative decoding” — and before you roll your eyes at the name, the idea is genuinely clever. While the target model is busy verifying a draft, the draft model doesn’t sit idle. Instead, it predicts what the verification outcome might be and starts preparing the *next* round of speculation ahead of time. If the actual verification result matches one of those pre-computed branches, boom — the next speculation is ready to go instantly. Zero drafting latency for that round.
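To make that control flow concrete, here's a minimal Python sketch of how the "speculate on the verification" loop might look. Everything model-related is a toy stand-in: the `draft_next_tokens`, `guess_correction`, and `verify` helpers are invented for illustration, not Saguaro's actual API, and the real speculative-decoding bookkeeping (rejection sampling, bonus tokens, KV-cache management) is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
import random

K = 4  # draft tokens per round (assumed, not from the paper)

def draft_next_tokens(prefix):
    """Small-draft-model stand-in: propose K plausible next tokens."""
    return [random.randrange(50_000) for _ in range(K)]

def guess_correction(prefix):
    """Draft-model guess at the token the target would emit after a rejection."""
    return random.randrange(50_000)

def verify(prefix, draft):
    """Target-model stand-in: accept some prefix of the draft and, if it
    rejected anything, emit its own token at the first rejected position."""
    accepted = random.randint(0, len(draft))
    correction = None if accepted == len(draft) else random.randrange(50_000)
    return accepted, correction

def generate(prompt, rounds=8):
    tokens = list(prompt)
    draft = draft_next_tokens(tokens)
    with ThreadPoolExecutor() as pool:
        for _ in range(rounds):
            # 1. Send the current draft to the target model for verification.
            verifying = pool.submit(verify, tokens, draft)

            # 2. While the target is busy, pre-draft the next round for a few
            #    guessed verification outcomes, keyed by (accepted, correction).
            branches = {
                (len(draft), None): pool.submit(draft_next_tokens, tokens + draft)
            }
            for a in (len(draft) - 1, len(draft) - 2):
                guess = guess_correction(tokens + draft[:a])
                branches[(a, guess)] = pool.submit(
                    draft_next_tokens, tokens + draft[:a] + [guess]
                )

            # 3. Collect the real verification outcome and commit tokens.
            accepted, correction = verifying.result()
            tokens += draft[:accepted]
            if correction is not None:
                tokens.append(correction)

            # 4. If the outcome matches a pre-drafted branch, the next draft is
            #    already done (or in flight): zero drafting latency this round.
            #    Otherwise fall back to drafting from scratch.
            hit = branches.get((accepted, correction))
            draft = hit.result() if hit is not None else draft_next_tokens(tokens)
    return tokens

if __name__ == "__main__":
    print(generate(prompt=[1, 2, 3]))
```

The gamble is cheap: the extra branches only burn draft-model compute that would otherwise sit idle while the target verifies, and a missed guess just degrades back to ordinary speculative decoding for that round.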
The results are hard to ignore: up to 2x faster than already-optimized speculative decoding baselines, and roughly 5x faster than vanilla autoregressive generation. The paper has already been [accepted at ICLR 2026](https://openreview.net/forum?id=aL1Wnml9Ef), which gives it some serious credibility beyond just an arXiv drop.
It picked up traction on [Hacker News](https://news.ycombinator.com/item?id=47242637) the same day it dropped, with about 33 points and a handful of comments — one of which was, predictably, a “Yo Dawg” meme about speculating on speculation. Fair enough. But the more substantive discussion highlighted how Saguaro essentially combines the branching logic from tree-based speculation with the pipelining of draft and verify stages, which is a combination nobody had really nailed before.
Having Tri Dao’s name on the paper definitely helps with visibility, but the contribution stands on its own. If you’re running any kind of LLM serving infrastructure and speculative decoding is already in your stack, Saguaro looks like a pretty compelling next step. That 2x speedup over an already-tuned speculative decoding setup is real speed you’re leaving on the table.
