Top AI Product



RTPurbo turns a full-attention LLM sparse in a few hundred training steps: up to 9.36x prefill speedup at 1M context

“Full Attention Strikes Back” introduces RTPurbo, a method that converts a standard full-attention LLM into a sparse-attention one with only a few hundred training steps. The reported result is near-lossless accuracy with large gains in inference speed.

## The numbers

The headline results are up to 9.36x prefill speedup at 1M-token context and about 2.01x decode speedup. The trick: keep the full KV cache only for “retrieval heads” (the attention heads that actually do long-range lookups) and add a lightweight token indexer to sparsify the rest.
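To make the split between retrieval heads and sparsified heads concrete, here is a minimal sketch of one decode step. It is an illustration under stated assumptions, not RTPurbo's implementation: `index_top_k` is a hypothetical stand-in for the token indexer (it ranks tokens by a cheap dot product over the first `d_index` dimensions), the set of retrieval heads is simply given, and every head still holds its cache here, with only the number of tokens read per head changing.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values, positions):
    """Softmax attention for one query over the selected cache positions."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[p]) * scale for p in positions])
    dim = len(values[0])
    return [sum(w * values[p][j] for w, p in zip(weights, positions))
            for j in range(dim)]

def index_top_k(q, keys, k, d_index):
    """Hypothetical indexer: rank tokens by a cheap score on the first d_index dims."""
    scores = [dot(q[:d_index], key[:d_index]) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:k])

def decode_step(queries, keys, values, retrieval_heads, k=8, d_index=4):
    """One decode step: full attention for retrieval heads, top-k for the rest."""
    outputs, tokens_read = [], []
    for h, q in enumerate(queries):
        if h in retrieval_heads:
            positions = list(range(len(keys[h])))  # read the whole cache
        else:
            positions = index_top_k(q, keys[h], k, d_index)  # read k tokens
        outputs.append(attend(q, keys[h], values[h], positions))
        tokens_read.append(len(positions))
    return outputs, tokens_read

# Random toy data: 4 heads, 64 cached tokens, head dimension 16.
random.seed(0)
n_heads, n_tokens, d_head = 4, 64, 16
keys = [[[random.gauss(0, 1) for _ in range(d_head)] for _ in range(n_tokens)]
        for _ in range(n_heads)]
values = [[[random.gauss(0, 1) for _ in range(d_head)] for _ in range(n_tokens)]
          for _ in range(n_heads)]
queries = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(n_heads)]

outputs, tokens_read = decode_step(queries, keys, values, retrieval_heads={0})
print(tokens_read)  # [64, 8, 8, 8]
```

Only head 0 reads all 64 cached tokens. The other heads read 8 each, which is where the savings in this toy version come from.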

## The core insight

Full-attention LLMs are already intrinsically sparse. Most attention heads don’t need the whole context most of the time. RTPurbo doesn’t retrain from scratch to get sparsity; it surfaces the sparsity that’s already there with minimal adaptation. That’s why a few hundred steps suffice where other approaches need full retraining.
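One rough way to picture “intrinsically sparse” is to ask how many tokens a head needs to cover most of its attention mass. The sketch below uses synthetic scores rather than measurements from any model; `tokens_for_mass` and the 90% target are illustrative choices, not something from the paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def tokens_for_mass(weights, target=0.9):
    """Smallest number of tokens whose attention weights sum to at least target."""
    running = 0.0
    for count, w in enumerate(sorted(weights, reverse=True), start=1):
        running += w
        if running >= target:
            return count
    return len(weights)

n_tokens = 1024
# A "peaked" head: 8 tokens get a high score, the rest get a low one.
peaked = softmax([8.0 if i % 128 == 0 else 0.0 for i in range(n_tokens)])
# A "flat" head: every token gets the same score.
flat = softmax([0.0] * n_tokens)

print(tokens_for_mass(peaked), tokens_for_mass(flat))  # 8 922
```

A head that behaves like the peaked case loses little if it only reads its top few tokens. A head that behaves like the flat case would need almost the whole context.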

## Why it matters

Long-context inference is expensive because the cost of attention grows quadratically with sequence length. Methods that retrofit sparsity onto existing models cheaply and without quality loss are how you make 1M-context models economical to serve. A near-10x prefill speedup is the difference between “long context is a demo” and “long context is in production.”
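A quick back-of-the-envelope calculation shows what quadratic scaling means in practice. This counts only query-key pairs under causal full attention during prefill. It ignores everything else in the model, so it is not a prediction of end-to-end latency or of the speedups quoted above.

```python
def causal_pairs(n_tokens):
    """Query-key pairs scored by causal full attention over n_tokens."""
    return n_tokens * (n_tokens + 1) // 2

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9} tokens -> {causal_pairs(n):.3e} pairs per head per layer")

# Going from 128K to 1M tokens is about 7.8x more tokens
# but about 61x more query-key pairs.
ratio = causal_pairs(1_000_000) / causal_pairs(128_000)
print(round(ratio, 1))  # 61.0
```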

