DelTA is a new method for reinforcement learning from verifiable rewards (RLVR), the training technique behind most of today’s reasoning models. Its key insight: the policy-gradient update in RLVR implicitly acts as a linear discriminator over token-gradient vectors, deciding which token probabilities go up and which go down.
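To make the "linear discriminator" reading concrete, here is a minimal sketch on a toy policy. It is an illustration of the first-order argument, not code from the DelTA authors. Everything in it is an assumption: the policy is a context-free softmax over five tokens, and the names `token_grad`, `sampled`, `adv`, and `w` are hypothetical. After one policy-gradient step, each token's log-probability changes by roughly the learning rate times the dot product of its gradient vector with the update direction `w`, so the sign of that dot product decides whether the token goes up or down.

```python
import numpy as np

def log_softmax(theta):
    # Numerically stable log-softmax over a 1-D logit vector.
    z = theta - theta.max()
    return z - np.log(np.exp(z).sum())

# Toy assumption: the "policy" is a single softmax over 5 tokens.
V = 5
theta = np.array([0.2, -0.1, 0.4, 0.0, -0.3])
pi = np.exp(log_softmax(theta))

def token_grad(v):
    # Gradient of log pi(v) w.r.t. the logits: one-hot(v) - pi.
    g = -pi.copy()
    g[v] += 1.0
    return g

# Hypothetical batch: sampled tokens with token-level advantages
# (+1 from a rewarded answer, -1 from an unrewarded one).
sampled = [0, 2, 2, 4]
adv = np.array([1.0, 1.0, -1.0, -1.0])
lr = 1e-3

# Policy-gradient direction: advantage-weighted sum of token gradients.
# Here it works out to e_0 - e_4, since token 2 appears on both sides.
w = sum(a * token_grad(v) for a, v in zip(adv, sampled))
theta_new = theta + lr * w

# Linear score <g_v, w> for every token in the vocabulary.
scores = np.array([token_grad(v) @ w for v in range(V)])

predicted = lr * scores                               # first-order prediction
actual = log_softmax(theta_new) - log_softmax(theta)  # true change after the step
```

For a small learning rate, `predicted` and `actual` agree closely, and a token's probability rises exactly when its score against `w` is positive. Token 2, which shows up in both a rewarded and an unrewarded sample, contributes nothing to `w`.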
## The problem it fixes
That discriminator gets dominated by shared high-frequency patterns — formatting tokens, boilerplate — which dilutes the sparse, genuinely discriminative directions that separate a high-reward answer from a low-reward one. In plain terms: RL training wastes signal reinforcing the model’s punctuation and structure instead of its reasoning.
## What DelTA does
It reweights the RLVR surrogate using token coefficients derived from the positive-negative discriminator induced by the update itself — concentrating the learning signal on the tokens that actually distinguish good answers from bad.
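The post does not give DelTA's exact formula, so the following is only a hedged sketch of what "token coefficients derived from the discriminator" could look like. The functions `discriminator_coefficients` and `reweighted_surrogate`, the clipping at zero, and the mean-one normalization are all assumptions made for illustration. The idea shown: take the direction induced by the update, score each token's gradient against it, and upweight tokens whose gradients align with that direction on the side matching their advantage.

```python
import numpy as np

def discriminator_coefficients(G, adv, eps=1e-8):
    """Hypothetical token coefficients (assumed form, not the paper's).

    G:   (T, d) array of token-gradient vectors.
    adv: (T,) array of token-level advantages.
    """
    # Direction induced by the plain policy-gradient update.
    w = adv @ G
    w = w / (np.linalg.norm(w) + eps)
    # Signed margin: positive-advantage tokens should point along w,
    # negative-advantage tokens against it.
    margin = np.sign(adv) * (G @ w)
    coef = np.maximum(margin, 0.0)
    # Normalize so the coefficients average to about 1.
    return coef / (coef.mean() + eps)

def reweighted_surrogate(logp, adv, coef):
    # Token-weighted version of the usual surrogate mean(adv * logp).
    return np.mean(coef * adv * logp)

# Toy data: dimension 0 is a large component shared by every token
# (think formatting); dimension 1 separates tokens 0 and 3.
G = np.array([
    [5.0,  1.0, 0.0],
    [5.0,  0.0, 0.0],
    [5.0,  0.0, 0.0],
    [5.0, -1.0, 0.0],
])
adv = np.array([1.0, 1.0, -1.0, -1.0])
logp = np.log(np.array([0.5, 0.9, 0.9, 0.5]))

coef = discriminator_coefficients(G, adv)
loss_term = reweighted_surrogate(logp, adv, coef)
```

In this toy batch the shared component cancels out of the direction, so tokens 1 and 2, which carry only the shared pattern, receive a coefficient of zero, while tokens 0 and 3 receive all of the weight. A real implementation would need a cheaper proxy for per-token gradients than materializing `G`; the sketch ignores that cost.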
## Why it matters
RLVR is how labs turn base models into reasoners right now. A credit-assignment fix that costs nothing at inference and sharpens what the model learns from each reward is the kind of unglamorous improvement that compounds across every training run. Expect it to quietly show up in next-generation reasoning models.
