NVIDIA’s AXPO Closes the Thinking-Acting Gap in Multimodal Agentic RL

AXPO (Agent Explorative Policy Optimization) targets a quiet failure mode in how multimodal agents are trained with reinforcement learning: the Thinking-Acting Gap. Standard RL recipes like GRPO (Group Relative Policy Optimization) end up underweighting the tool-call moments that decide whether the agent succeeds.

## Why GRPO drops the ball on tool use

The diagnostic data is sharp. Under GRPO training, the multimodal agent attempts a tool call on only about 30% of rollouts; most of the time it just thinks and answers, because thinking is the safer default. When it does try a tool, the tool-using rollouts within a group are all wrong roughly 40% of the time, which collapses the learning signal exactly where it matters most: at the boundary between thinking and acting.
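To make that collapse concrete, here is a minimal sketch of group-relative advantage normalization in the style of GRPO. The reward values, the `think`/`tool` tags, and the helper name `group_relative_advantages` are illustrative assumptions, not taken from the AXPO work. It shows one way to read the claim: when every tool-using rollout in a group fails, they all receive the same advantage, so nothing separates a good tool call from a bad one.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: two thinking-only rollouts, two tool-using rollouts.
# Both tool-using rollouts fail (reward 0.0).
rollouts = [("think", 1.0), ("think", 0.0), ("tool", 0.0), ("tool", 0.0)]
advantages = group_relative_advantages([reward for _, reward in rollouts])

# Advantages of the tool-using rollouts only.
tool_adv = [a for (kind, _), a in zip(rollouts, advantages) if kind == "tool"]

# Degenerate case: every rollout in the group is wrong.
all_wrong_adv = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("tool-call advantages:", tool_adv)        # identical negative values
print("all-wrong group advantages:", all_wrong_adv)  # all zeros
```

In the first group the tool-using rollouts share one negative advantage, which only discourages acting in general. In the all-wrong group every advantage is zero, so the group contributes no gradient at all.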

## Resample where it counts

AXPO fixes that with a targeted intervention. It keeps the thinking prefix fixed and resamples the tool call and its continuation, and it pairs this with uncertainty-based prefix selection, so the gradient budget concentrates on the high-variance acting step rather than being absorbed by the easy thinking step. Across nine multimodal benchmarks on Qwen3-VL-Thinking 2B/4B/8B, SFT + AXPO beats SFT + GRPO, with average Pass@4 gains of +2.8, +2.3, and +1.8 percentage points across the three model sizes.
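The idea of fixing the prefix and resampling the act step can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the write-up does not say how uncertainty is measured, so the Bernoulli variance `p * (1 - p)` of each prefix's observed success rate is used as a stand-in, and `select_prefix`, `resample_from_prefix`, `toy_continuation`, the prefix strings, and the tool names are all hypothetical.

```python
import random

def bernoulli_variance(successes, trials):
    """Assumed uncertainty score: variance of the observed success rate."""
    p = successes / trials
    return p * (1.0 - p)

def select_prefix(prefix_stats):
    """Pick the thinking prefix whose outcomes are most uncertain."""
    return max(prefix_stats, key=lambda name: bernoulli_variance(*prefix_stats[name]))

def resample_from_prefix(prefix, sample_continuation, n, rng):
    """Hold the prefix fixed and draw n fresh tool-call continuations."""
    return [prefix + sample_continuation(prefix, rng) for _ in range(n)]

def toy_continuation(prefix, rng):
    """Stand-in for the policy: emits a random hypothetical tool call."""
    tool = rng.choice(["crop", "zoom", "ocr"])
    return f" <tool_call>{tool}</tool_call>"

# Hypothetical (successes, trials) observed for each thinking prefix.
prefix_stats = {
    "<think>The label is too small to read.</think>": (2, 4),
    "<think>The answer is obvious.</think>": (4, 4),
    "<think>No idea what this is.</think>": (0, 4),
}

rng = random.Random(0)
chosen = select_prefix(prefix_stats)
group = resample_from_prefix(chosen, toy_continuation, n=4, rng=rng)

for rollout in group:
    print(rollout)
```

Prefixes that always succeed or always fail score zero under this stand-in measure, so the resampling budget goes to the prefix with mixed outcomes, where different tool calls are most likely to produce different rewards.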

## Why it matters

Multimodal agentic reasoning lives or dies on tool calls. A training method that pinpoints the gradient leak at the act step — and patches it without redesigning the whole RL pipeline — is the kind of unglamorous fix that compounds.

