Gated DeltaNet-2, from the NVIDIA and MIT team behind the original, fixes a subtle flaw in how linear-attention models manage memory. Prior delta-rule models (Gated DeltaNet, KDA) used a single scalar gate to do two jobs at once — erasing old content and writing new content. v2 decouples them, and the gains show up exactly where you’d expect: long-context retrieval.
## The core problem
Linear attention squeezes an unbounded KV cache into a fixed-size recurrent state. The hard part isn’t just deciding what to forget — it’s editing that compressed memory without scrambling the associations already stored. One gate forced to both erase and write makes clean edits impossible. Two gates fix it.
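To make the one-gate problem concrete, here is a minimal sketch of a delta-rule state update in plain Python. It is not the paper's formulation: the function names, the `erase` and `write` parameters, and the toy 2x2 state are all assumptions for illustration. Tying `erase` and `write` to one value mimics the coupled gate of earlier delta-rule models; passing them separately is one plausible reading of what decoupling allows.

```python
# Illustrative sketch only: names and gate parameterization are hypothetical,
# not taken from the Gated DeltaNet-2 paper.

def read(state, key):
    """Retrieve the value stored under `key`: o = S k."""
    return [sum(s * k for s, k in zip(row, key)) for row in state]


def delta_update(state, key, value, decay, erase, write):
    """One step: S <- decay * (S - erase * (S k) k^T) + write * v k^T."""
    old = read(state, key)  # what the memory currently returns for this key
    return [
        [decay * (s - erase * o * k) + write * v * k for s, k in zip(row, key)]
        for row, o, v in zip(state, old, value)
    ]


def coupled_update(state, key, value, decay, beta):
    """Earlier-style update: a single scalar beta both erases and writes."""
    return delta_update(state, key, value, decay, erase=beta, write=beta)


# Store two associations under orthogonal unit keys.
k1, k2 = [1.0, 0.0], [0.0, 1.0]
state = [[0.0, 0.0], [0.0, 0.0]]
state = coupled_update(state, k1, [3.0, 4.0], decay=1.0, beta=1.0)
state = coupled_update(state, k2, [7.0, 8.0], decay=1.0, beta=1.0)

# Coupled gate at beta=0.5: half-erase and half-write are forced together,
# so the slot for k1 ends up as a blend of old and new values.
blended = coupled_update(state, k1, [5.0, 6.0], decay=1.0, beta=0.5)
print(read(blended, k1))  # [4.0, 5.0]

# Decoupled gates: fully erase the old value, then write the new one softly.
clean = delta_update(state, k1, [5.0, 6.0], decay=1.0, erase=1.0, write=0.5)
print(read(clean, k1))  # [2.5, 3.0]
print(read(clean, k2))  # [7.0, 8.0], the other association is untouched
```

In this toy setup the coupled update cannot clear a slot without also writing at full strength, while the decoupled call picks the two strengths independently. Real models use learned, input-dependent gates and non-orthogonal keys, so treat this as intuition rather than a reference implementation.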
## The numbers
Gated DeltaNet-2 beats KDA and Mamba-3, two of the strongest recent recurrent architectures, head to head at 1.3B parameters. The biggest gains are on RULER long-context retrieval: S-NIAH-3 jumps from 63 for KDA to 90, and multi-key needle retrieval climbs from 28 to 38.
## Why it matters
The original Gated DeltaNet already got picked up by Qwen3.5. Linear-attention architectures are how you get cheap long context without quadratic attention cost — and retrieval quality has been their weak spot. If v2’s editing improvements hold at scale, the next generation of efficient long-context models has a new default building block.
