π-Bench is a new benchmark testing something most agent evaluations skip: can an AI assistant anticipate what you need before you spell it out? It comprises 100 multi-turn tasks across 5 domain-specific user personas, and the headline finding is sobering — proactive assistance remains hard for frontier agents.
## What it tests
Users rarely state requests fully. They begin underspecified, and their needs emerge gradually across a conversation. π-Bench builds in hidden user intents, inter-task dependencies, and cross-session continuity, then measures whether an agent can identify and act on unstated intent over extended interactions. Crucially, it scores proactivity and task completion separately.
## The key distinction
The paper draws a sharp line between completeness (did the agent finish the task?) and proactivity (did it reduce the user’s burden?). Frontier agents often achieve the former while falling short on the latter: they execute what you asked but don’t anticipate what you’ll need next. The benchmark also shows that prior interaction helps, with agents resolving proactive intent better in later tasks once they’ve accumulated context.
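To make the distinction concrete, here is a minimal sketch of what scoring the two axes separately could look like. It is not the paper's actual metric: the `TaskResult` fields, the intent labels, and both rate functions are hypothetical, and it assumes each task comes annotated with a set of hidden intents.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Outcome of one task. All field names are hypothetical, not from π-Bench."""
    completed: bool                 # did the agent finish the stated task?
    hidden_intents: set[str]        # needs the user never spelled out
    addressed_unprompted: set[str] = field(default_factory=set)  # acted on without being asked


def completion_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks the agent finished."""
    return sum(r.completed for r in results) / len(results)


def proactivity_rate(results: list[TaskResult]) -> float:
    """Fraction of hidden intents the agent handled without being asked."""
    total = sum(len(r.hidden_intents) for r in results)
    hits = sum(len(r.hidden_intents & r.addressed_unprompted) for r in results)
    return hits / total if total else 0.0


# Made-up illustrative data: the agent finishes most tasks but rarely anticipates.
results = [
    TaskResult(True, {"book_return_flight"}),
    TaskResult(True, {"add_calendar_hold"}, {"add_calendar_hold"}),
    TaskResult(True, {"notify_team"}),
    TaskResult(False, {"renew_passport"}),
]

print(completion_rate(results))   # 0.75
print(proactivity_rate(results))  # 0.25
```

Keeping the two numbers apart is the point: a single blended score would hide an agent that completes three of four tasks while picking up only one of four unstated needs.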
## Why it matters
The whole pitch of a “personal AI agent” rests on proactivity: an assistant you must fully instruct every time isn’t much of an assistant. π-Bench quantifies how far the field still is from that goal, and gives builders a concrete target to optimize against rather than a vibe.
