AssemblyAI just planted a flag in voice agent territory. Their new Voice Agent API is a single endpoint that takes speech in and gives speech out, with speech-to-text (STT), dialogue orchestration, and the rest of the pipeline managed for you. It’s the same job Vapi, Retell, and LiveKit Agents have been splitting up between them, but this one comes from the company that actually owns the speech-to-text layer.
What you actually get
It’s not a no-code builder. It’s an API for developers who want a production voice agent without stitching together a Deepgram + GPT + ElevenLabs stack themselves. AssemblyAI’s own Universal-Streaming STT sits underneath, clocking 307ms median latency against Deepgram Nova-3’s 516ms, and it is nearly 2× faster at P99 (1,012ms vs 1,907ms). Streaming runs $0.15/hour, billed by session duration. Language coverage already includes English, Spanish, French, German, Italian, and Portuguese, with more on the roadmap.
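To make those numbers concrete, here is a quick back-of-the-envelope sketch in Python. It only does arithmetic on the figures quoted above; it does not call any AssemblyAI API. The workload (10,000 four-minute calls) is a made-up example, and the cost function assumes the $0.15/hour rate applies linearly to session duration with no rounding, minimums, or separate LLM and text-to-speech charges, none of which the figures above spell out.

```python
# Figures quoted in the article.
STREAMING_RATE_USD_PER_HOUR = 0.15
MEDIAN_LATENCY_MS = {"universal_streaming": 307, "nova_3": 516}
P99_LATENCY_MS = {"universal_streaming": 1012, "nova_3": 1907}


def session_cost_usd(duration_seconds: float,
                     rate_per_hour: float = STREAMING_RATE_USD_PER_HOUR) -> float:
    """Cost of one session, assuming purely linear per-second billing."""
    return duration_seconds / 3600 * rate_per_hour


def speedup(slower_ms: float, faster_ms: float) -> float:
    """How many times lower the faster latency is than the slower one."""
    return slower_ms / faster_ms


median_speedup = speedup(MEDIAN_LATENCY_MS["nova_3"],
                         MEDIAN_LATENCY_MS["universal_streaming"])
p99_speedup = speedup(P99_LATENCY_MS["nova_3"],
                      P99_LATENCY_MS["universal_streaming"])

# Hypothetical workload: 10,000 calls of four minutes each.
workload_cost = session_cost_usd(10_000 * 4 * 60)

print(f"median: {median_speedup:.2f}x, P99: {p99_speedup:.2f}x")
print(f"10,000 four-minute calls: ${workload_cost:.2f}")
```

Under those assumptions the median gap works out to about 1.68× and the P99 gap to about 1.88×, and the hypothetical workload comes to roughly $100 of streaming time.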
Why this one matters
Vapi and Retell build on top of someone else’s STT, usually Deepgram or AssemblyAI itself. AssemblyAI is now shipping its own end-to-end stack, and STT is the part that breaks first when a voice agent stalls mid-sentence. If you’re building support bots, outbound call agents, or live-translation flows where speech in, speech out is the whole product, there’s now a single API that owns the slowest link in the chain.