AXPO — Agent Explorative Policy Optimization — targets a quiet failure mode in how multimodal agents are trained with reinforcement learning: the Thinking-Acting Gap. Standard RL recipes like GRPO end up underweighting the tool-call moments that decide whether the agent succeeds.
## Why GRPO drops the ball on tool use
The diagnostic data is sharp. Under GRPO training, the multimodal agent attempts a tool call on only about 30% of rollouts; most of the time it just thinks and answers, because thinking is the safer default. When it does try a tool, roughly 40% of the time the tool-using rollouts within a group are all wrong, so the group contains no successful tool call to reinforce. That collapses the learning signal exactly where it matters most: at the boundary between thinking and acting.
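To make the collapse concrete, here is a minimal sketch of the group-normalized advantage commonly used in GRPO (each reward minus the group mean, divided by the group standard deviation). The function name, the `eps` stabilizer, and the reward values are illustrative assumptions, not details from the AXPO work.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps is an assumed stabilizer so an all-equal group does not divide by zero
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed outcomes: correct rollouts get positive advantage
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout is wrong: every advantage is zero
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the all-wrong group every advantage is zero, so under this formulation those rollouts contribute no gradient, however informative the failed tool calls might have been.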
## Resample where it counts
AXPO fixes that with a targeted intervention. It keeps the thinking prefix fixed and resamples the tool call and its continuation, and pairs this with uncertainty-based prefix selection. The gradient budget then concentrates on the high-variance acting step rather than getting absorbed by the easy thinking step. Across nine multimodal benchmarks on Qwen3-VL-Thinking 2B/4B/8B, SFT + AXPO beats SFT + GRPO on average, with Pass@4 gains of +2.8, +2.3, and +1.8 percentage points.
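The resampling idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the AXPO implementation: the post does not say how uncertainty is measured, so the entropy-of-success-rate rule below is hypothetical, as are all function names, the prefix strings, and the made-up tool names.

```python
import math
import random

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def select_prefix(success_rate_by_prefix):
    """Hypothetical rule: pick the thinking prefix whose outcome is most uncertain."""
    return max(
        success_rate_by_prefix,
        key=lambda prefix: binary_entropy(success_rate_by_prefix[prefix]),
    )

def resample_from_prefix(prefix, sample_continuation, n, rng):
    """Hold the thinking prefix fixed; draw n fresh tool calls plus continuations."""
    return [prefix + sample_continuation(rng) for _ in range(n)]

def toy_continuation(rng):
    """Stand-in for the policy: emits a made-up tool call string."""
    tool = rng.choice(["crop", "zoom", "ocr"])
    return f"<tool>{tool}</tool>"

# Assumed per-prefix success rates observed in earlier rollouts
rates = {"<think>A</think>": 0.0, "<think>B</think>": 0.5, "<think>C</think>": 0.9}
chosen = select_prefix(rates)
rollouts = resample_from_prefix(chosen, toy_continuation, n=4, rng=random.Random(0))
```

Because every resampled rollout shares the same prefix, any difference in reward within the new group comes from the tool call and what follows it, which is where the post says the signal is needed.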
## Why it matters
Multimodal agentic reasoning lives or dies on tool calls. A training method that pinpoints the gradient leak at the act step — and patches it without redesigning the whole RL pipeline — is the kind of unglamorous fix that compounds.
