# Orthrus-Qwen3 hits 7.8x tokens-per-forward on Qwen3-8B with identical output distribution

Orthrus is a dual-architecture framework that wraps a frozen Qwen3-8B base model with a lightweight trainable diffusion module. It delivers up to 7.8x more tokens per forward pass while exactly preserving the base model's output distribution and zero-shot accuracy: no sampling drift, no quality regression.

## How it works

Most speculative decoding methods (EAGLE-3, DFlash) trade some output drift for speed. Orthrus instead uses an exact intra-model consensus mechanism: its output matches the original base model's predictive distribution token for token. In head-to-head comparisons it verifies significantly more tokens per forward pass than either EAGLE-3 or DFlash, and the published benchmarks bear this out across reasoning and coding tasks.
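
Orthrus's verification code isn't reproduced in this post, but the draft-and-verify loop it accelerates follows a familiar pattern: a cheap module proposes a block of tokens, the frozen base model scores the whole block in one forward pass, and only the prefix the base model would have produced itself survives. Here is a minimal greedy-decoding sketch in PyTorch; `draft_module.propose` and the call signatures are hypothetical stand-ins, not Orthrus's actual API.

```python
import torch

@torch.no_grad()
def draft_and_verify(base_model, draft_module, input_ids, block_size=8):
    """One speculative step: draft a block cheaply, verify it exactly.

    Hypothetical interfaces: `draft_module.propose` and the model call
    are illustrative, not Orthrus's real API. Greedy decoding shown.
    """
    prompt_len = input_ids.shape[-1]

    # 1) The lightweight module proposes `block_size` candidate tokens.
    draft = draft_module.propose(input_ids, n=block_size)  # (1, block_size)

    # 2) One base-model forward pass scores prompt + all drafted tokens.
    logits = base_model(torch.cat([input_ids, draft], dim=-1)).logits

    # The base model's own greedy choice at each drafted position, plus
    # one bonus prediction after the final draft token.
    preds = logits[:, prompt_len - 1 :, :].argmax(-1)  # (1, block_size + 1)

    # 3) Accept the longest prefix where the draft agrees with the base
    #    model; everything after the first mismatch is discarded, so the
    #    emitted stream is identical to plain autoregressive decoding.
    agree = (preds[:, :block_size] == draft).long().squeeze(0)
    n_accept = int(agree.cumprod(0).sum())

    # The base model's token at the first mismatch (or the bonus token if
    # every draft was accepted) comes free from the same forward pass.
    next_tok = preds[:, n_accept : n_accept + 1]
    return torch.cat([input_ids, draft[:, :n_accept], next_tok], dim=-1)
```

Because every emitted token is one the base model itself predicted, the output is provably unchanged; the speedup depends entirely on how many drafted tokens survive verification per pass, which is where the 7.8x figure comes from.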

## The release

The code is open source on GitHub at chiennv2000/orthrus and ships with a complete training and inference codebase plus benchmark scripts. Pretrained weights are available on Hugging Face as chiennv/Orthrus-Qwen3-8B and Orthrus-Qwen3-4B. At the time of writing, the launch thread sits at #2 on Hacker News with 211+ points.
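
This post doesn't show the project's loading code, so treat the snippet below as an assumption: if the Hugging Face checkpoints expose a standard transformers-style interface (the repo README is the authority on the real entry point), getting started would look roughly like this.

```python
# Hypothetical loading sketch: assumes a transformers-compatible
# checkpoint; check the chiennv2000/orthrus README for the real API.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("chiennv/Orthrus-Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained(
    "chiennv/Orthrus-Qwen3-8B",
    torch_dtype="auto",
    trust_remote_code=True,  # custom dual-architecture code, if any, ships with the weights
)

inputs = tok("Explain speculative decoding in one sentence.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```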

## Why it matters

The persistent tradeoff in LLM inference has been faster output or exact output: pick one. Orthrus is the first method credibly claiming both at this magnitude. If your serving bill scales with tokens per second per GPU, a 7.8x gain in effective throughput on the same hardware rewrites your unit economics overnight.
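
To make that concrete, here is a back-of-the-envelope sketch. The GPU rate and baseline decode speed are illustrative assumptions, not measured numbers, and tokens-per-forward is an upper bound on wall-clock speedup because the draft module adds its own overhead.

```python
# Illustrative numbers only: substitute your own measured throughput and rates.
GPU_COST_PER_HOUR = 2.50      # assumed hourly rate for one GPU
BASELINE_TOK_PER_SEC = 40.0   # assumed base-model decode speed
SPEEDUP = 7.8                 # Orthrus's reported tokens-per-forward gain

def cost_per_million_tokens(tok_per_sec: float) -> float:
    seconds = 1_000_000 / tok_per_sec
    return GPU_COST_PER_HOUR * seconds / 3600

print(f"baseline: ${cost_per_million_tokens(BASELINE_TOK_PER_SEC):.2f} per 1M tokens")
print(f"orthrus:  ${cost_per_million_tokens(BASELINE_TOK_PER_SEC * SPEEDUP):.2f} per 1M tokens")
# baseline: $17.36 per 1M tokens
# orthrus:  $2.23 per 1M tokens
```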

