Testing whether an AI agent actually behaves the way you intended has been one of the messier parts of shipping with LLMs. Microsoft ASSERT — Adaptive Spec-driven Scoring for Evaluation and Regression Testing, open-sourced in early June — tries to make that automatic. It’s an MIT-licensed framework that reads a plain-language description of how an agent should and shouldn’t behave, then writes the test suite for you.
## How ASSERT works
You describe the rules in normal English — say, a research agent must never email people outside the company, and should keep confidential details to senior staff. ASSERT turns that into a structured set of acceptable and unacceptable behaviors, generates problem scenarios, runs them against your system, and scores the results. It also records the agent’s intermediate steps and tool calls, so you can see exactly where a run went wrong.
## Why it matters
ASSERT works across LangChain, CrewAI, LiteLLM, OpenAI and more, and its LLM-judge scoring reportedly lands within 80–90% of human annotators. For teams who’ve been hand-writing evals, turning written intent straight into executable regression tests is the kind of plumbing that makes agents safer to ship.

Leave a comment