DenoiseRL is a reinforcement learning method that trains reasoning models on something most pipelines throw away: wrong answers. Instead of leaning on a stronger teacher model for supervision, it learns directly from the failures of weak models, turning bad reasoning traces into training signal.
## Conditioning on mistakes
The trick is to start the model from a broken place on purpose. DenoiseRL conditions the policy on incorrect reasoning prefixes, derived from structured perturbations of weak-model failures, and explicitly optimises it to notice the mistake, revise, and still reach a verified correct answer. Rather than only rewarding clean trajectories, it rewards recovery. That removes the dependence on an expensive stronger-teacher model, which makes the approach more scalable: failed traces from weak models are abundant and cheap to produce.
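To make the setup concrete, here is a minimal Python sketch of the two pieces described above: building a prompt that starts from an incorrect prefix, and rewarding only completions that recover to the verified answer. Everything in it is an illustrative assumption rather than the paper's implementation. The function names are hypothetical, plain truncation of a failed trace stands in for the structured perturbations (which are not specified here), and the binary reward and the `Answer:` format are simplifications.

```python
import re
from typing import Optional


def incorrect_prefix(failed_trace: str, keep_steps: int) -> str:
    """Keep the first `keep_steps` non-empty lines of a failed weak-model trace.

    Assumption: simple truncation stands in for the paper's structured
    perturbations, whose details are not given in this post.
    """
    steps = [line for line in failed_trace.split("\n") if line.strip()]
    return "\n".join(steps[:keep_steps])


def build_prompt(question: str, bad_prefix: str) -> str:
    """Condition the policy on the question plus the flawed reasoning so far."""
    return f"Question: {question}\nReasoning so far:\n{bad_prefix}\n"


def extract_answer(completion: str) -> Optional[str]:
    """Return the last 'Answer: <number>' in the completion, if any."""
    matches = re.findall(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return matches[-1] if matches else None


def recovery_reward(completion: str, gold: str) -> float:
    """Hypothetical binary reward: 1.0 only if the final answer is verified correct."""
    return 1.0 if extract_answer(completion) == gold else 0.0


# A weak model's failed trace: step 2 contains an arithmetic slip.
question = "What is 12 * 3 + 5?"
failed_trace = "Step 1: 12 * 3 = 36\nStep 2: 36 + 5 = 42\nAnswer: 42"

prefix = incorrect_prefix(failed_trace, keep_steps=2)
prompt = build_prompt(question, prefix)

# Two possible continuations from the policy being trained.
recovers = "Wait, 36 + 5 is 41, not 42.\nAnswer: 41"
repeats_error = "Answer: 42"

reward_recovers = recovery_reward(recovers, gold="41")
reward_repeats = recovery_reward(repeats_error, gold="41")
```

In this toy case the continuation that catches the slip earns the reward and the one that carries the error forward earns nothing, which is the recovery signal in miniature. A real pipeline would feed these rewards into an RL update, which is out of scope for this sketch.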
## Why it matters
A reasoning model that only ever sees good reasoning is brittle the moment its own chain goes off the rails mid-problem. Training explicitly for recovery builds the self-correction muscle that long reasoning chains need. The paper reports DenoiseRL consistently beating strong on-policy RL baselines on competitive math and general reasoning, with self-corrective behaviour getting stronger as problems get harder — which is exactly when recovery matters most.
