“Full Attention Strikes Back” introduces RTPurbo, a method that converts a standard full-attention LLM into a sparse-attention one with only a few hundred training steps, while keeping accuracy near-lossless and delivering large efficiency gains.
## The numbers
Up to 9.36x prefill speedup at 1M-token context, and about 2.01x decode speedup. The trick: keep the full KV cache only for “retrieval heads” — the attention heads that actually do long-range lookups — and add a lightweight token indexer to sparsify the rest.
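To make the head-wise split concrete, here is a minimal pure-Python sketch of a single decode step. It is one reading of the description above, not the paper's implementation: it assumes "the rest" means the non-retrieval heads, that one shared indexer picks a top-k token set for all of them, and that the indexer is a plain dot product over small per-token index vectors. The names `select_tokens`, `decode_step`, `index_q`, `index_keys`, and `top_k` are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values, positions):
    """Softmax attention of one query over the cached tokens at `positions`."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[p]) * scale for p in positions])
    dim = len(values[0])
    return [sum(w * values[p][d] for w, p in zip(weights, positions))
            for d in range(dim)]

def select_tokens(index_q, index_keys, top_k):
    """Hypothetical indexer: score every cached token cheaply, keep the top_k."""
    scores = [dot(index_q, k) for k in index_keys]
    ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:top_k])  # return positions in sequence order

def decode_step(heads, index_q, index_keys, top_k):
    """heads: list of dicts with keys 'q', 'keys', 'values', 'is_retrieval'."""
    sparse_positions = select_tokens(index_q, index_keys, top_k)
    outputs = []
    for head in heads:
        if head["is_retrieval"]:
            # Assumed behavior: retrieval heads read the full KV cache.
            positions = list(range(len(head["keys"])))
        else:
            # Assumed behavior: other heads read only the indexer's picks.
            positions = sparse_positions
        outputs.append(attend(head["q"], head["keys"], head["values"], positions))
    return outputs
```

In this sketch the retrieval heads still touch every cached token, while each remaining head touches only `top_k` of them. Under these assumptions, that is where the savings at long context would come from.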
## The core insight
Full-attention LLMs are already intrinsically sparse. Most attention heads don’t need the whole context most of the time. RTPurbo doesn’t retrain from scratch to get sparsity; it surfaces the sparsity that’s already there with minimal adaptation. That’s why a few hundred steps suffice where other approaches need full retraining.
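One way to picture "intrinsically sparse" is to ask how much of a head's softmax attention mass lands on its top few tokens. The sketch below is an illustrative diagnostic on toy vectors, not a procedure taken from the paper; `attention_weights` and `top_k_mass` are hypothetical helper names.

```python
import math

def attention_weights(q, keys):
    """Softmax attention weights of one query over all keys."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_mass(weights, k):
    """Fraction of total attention mass captured by the k largest weights."""
    return sum(sorted(weights, reverse=True)[:k])

# Toy example: one key aligned with the query, 99 unrelated keys.
q = [4.0, 0.0]
keys = [[4.0, 0.0]] + [[0.0, 1.0]] * 99
mass = top_k_mass(attention_weights(q, keys), k=1)
print(f"top-1 token holds {mass:.2%} of the attention mass")
```

If a head's top-k mass stays close to 1 for small k across real prompts, dropping the remaining tokens changes its output very little. That is the kind of pre-existing sparsity the post describes.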
## Why it matters
Long-context inference is expensive because the cost of full attention scales quadratically with context length. Methods that retrofit sparsity onto existing models cheaply and without quality loss are how you make 1M-context models economical to serve. A near-10x prefill speedup is the difference between “long context is a demo” and “long context is in production.”
