Microsoft is leaning harder on its own models, and voice is a clear example. At Build 2026 on June 2, the company introduced MAI-Voice-2, the second generation of its in-house text-to-speech model, built to make speech a native interface for Azure Copilot.
## What MAI-Voice-2 does
The model delivers expressive speech synthesis across 15 languages, with voice cloning and a wide range of emotional styles, including angry, confused, embarrassed, and whispered registers. It pairs low-latency speech recognition with natural text-to-speech, so an application can hold a real-time, two-way voice conversation rather than just reading text aloud. Two headline additions are identity preservation and voice prompting, both delivered in a single unified system.
## Built for enterprise
MAI-Voice-2 targets the workplace. It is fine-tuned on enterprise terminology, from cybersecurity alerts to DevOps commands, and aims for accuracy in noisy office settings. Governance is built in: every voice interaction is logged, encrypted, and held to the same policies as text-based Copilot use. Consent is enforced at the system level, so only authorized, licensed voices can be synthesized and unlicensed cloning is blocked in production.
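
To make the consent and logging idea concrete, here is a minimal Python sketch of a gate that sits in front of a speech model. It is not Microsoft's implementation or API: `VoiceGate`, `ConsentError`, `request_synthesis`, and the voice IDs are all hypothetical names chosen for illustration, the synthesis step is a placeholder string, and encryption of the log is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


class ConsentError(Exception):
    """Raised when a requested voice has no recorded license or consent."""


@dataclass
class VoiceGate:
    # Hypothetical registry of voice IDs that are cleared for synthesis.
    licensed_voices: set[str]
    # In-memory audit trail; a real deployment would use encrypted storage.
    audit_log: list[dict] = field(default_factory=list)

    def request_synthesis(self, user: str, voice_id: str, text: str) -> str:
        allowed = voice_id in self.licensed_voices
        # Record every attempt, whether or not it is allowed.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "voice_id": voice_id,
            "chars": len(text),
            "allowed": allowed,
        })
        if not allowed:
            raise ConsentError(f"voice {voice_id!r} is not licensed")
        # Placeholder: a real system would call the speech model here.
        return f"[synthesized:{voice_id}] {text}"
```

The check runs before any synthesis call, and the log entry is written before the check can raise, so refused requests still leave an audit record.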
