Orthrus is a dual-architecture framework that wraps a frozen Qwen3-8B base model with a lightweight trainable diffusion module. It delivers up to 7.8x more tokens per forward pass while exactly preserving the base model's zero-shot accuracy: no sampling drift, no quality regression.
## How it works
Most speculative decoding methods (EAGLE-3, DFlash) trade some output drift for speed. Orthrus instead uses an exact intra-model consensus mechanism: its output matches the base model's predictive distribution token for token. In head-to-head comparisons it verifies significantly more tokens per forward pass than EAGLE-3 or DFlash, and the published benchmarks bear this out across reasoning and coding tasks.
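The post doesn't spell out how the consensus mechanism achieves exactness, but the standard way speculative methods guarantee a distribution-exact match to the base model is the accept/reject rule from speculative sampling: accept a drafted token with probability min(1, p_target/p_draft), and on rejection resample from the normalized residual max(0, p_target − p_draft). A minimal sketch of that rule (toy distributions, not Orthrus's actual code):

```python
import random

def speculative_accept(draft_token, p_target, p_draft, rng):
    """Standard speculative-sampling verification step (hypothetical sketch,
    not Orthrus's implementation). Guarantees the returned token is an exact
    sample from p_target, regardless of the draft distribution p_draft."""
    pt, pd = p_target[draft_token], p_draft[draft_token]
    # Accept the drafted token with probability min(1, pt / pd).
    if pd == 0 or rng.random() < min(1.0, pt / pd):
        return draft_token
    # On rejection, resample from the normalized residual max(0, pt - pd).
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    return rng.choices(range(len(p_target)), weights=residual)[0]

# Empirical check: even with a mismatched draft distribution,
# the accepted/resampled tokens follow p_target exactly.
rng = random.Random(0)
p_target = [0.6, 0.3, 0.1]
p_draft = [0.3, 0.5, 0.2]
N = 100_000
counts = [0, 0, 0]
for _ in range(N):
    drafted = rng.choices(range(3), weights=p_draft)[0]
    counts[speculative_accept(drafted, p_target, p_draft, rng)] += 1
freqs = [c / N for c in counts]
print(freqs)  # close to [0.6, 0.3, 0.1]
```

This is why "exact" is achievable in principle: correctness comes from the verification rule, while speed comes from how many drafted tokens survive it per forward pass, which is the number Orthrus claims to improve.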
## The release
The code is open source on GitHub at chiennv2000/orthrus, with pretrained weights on Hugging Face as chiennv/Orthrus-Qwen3-8B and Orthrus-Qwen3-4B. The project ships a complete training and inference codebase plus benchmark scripts. At the time of writing it sits at #2 on Hacker News with 211+ points.
## Why it matters
The persistent tradeoff in LLM inference has been: fast output or exact output, pick one. Orthrus is the first method credibly claiming both at this magnitude. If your serving bill scales with tokens per second per GPU, a 7.8x effective throughput gain on the same hardware rewrites your unit economics overnight.