Most text-to-speech still sounds like reading aloud. Higgs Audio v3, a 4B open model from Boson AI released in early June, is built for conversation instead: it starts streaming expressive speech before a sentence is even finished, which is what voice agents need to feel real-time.
## What Higgs Audio v3 does
It covers 100+ languages (85 at production quality with sub-5% error rates) and does zero-shot voice cloning from a short clip, with the cloned voice carrying across languages. The headline is control: inline tokens in the text set emotion (21 types), styles like singing, shouting, or whispering, sound effects, pauses, and prosody, so you don't have to wrestle with separate settings. Architecturally it’s an autoregressive decoder on a Qwen3-4B backbone that interleaves text and audio tokens, encoding audio into 8 codebooks at 25 frames per second and decoding to a 24 kHz waveform.
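To get a feel for what those codec numbers imply, here is a rough back-of-the-envelope sketch in Python. Only the three figures (8 codebooks, 25 frames per second, 24 kHz output) come from the description above. The `CodecSpec` class and its method names are hypothetical helpers made up for illustration, not part of any Higgs Audio API, and the sketch assumes every frame carries exactly one entry per codebook.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecSpec:
    """Hypothetical container for the codec figures quoted in the article."""

    codebooks: int = 8            # codebooks per audio frame (from the article)
    frame_rate_hz: int = 25       # audio frames per second (from the article)
    sample_rate_hz: int = 24_000  # output waveform sample rate (from the article)

    @property
    def tokens_per_second(self) -> int:
        # Assumption: one codebook entry per codebook per frame.
        return self.codebooks * self.frame_rate_hz

    @property
    def samples_per_frame(self) -> int:
        # Waveform samples that a single audio frame expands to.
        return self.sample_rate_hz // self.frame_rate_hz

    @property
    def frame_duration_ms(self) -> float:
        # Length of audio covered by one frame, in milliseconds.
        return 1000 / self.frame_rate_hz

    def tokens_for(self, seconds: float) -> int:
        # Round up to whole frames, then count one entry per codebook.
        frames = math.ceil(seconds * self.frame_rate_hz)
        return frames * self.codebooks


if __name__ == "__main__":
    spec = CodecSpec()
    print("codebook entries per second:", spec.tokens_per_second)
    print("waveform samples per frame:", spec.samples_per_frame)
    print("frame duration (ms):", spec.frame_duration_ms)
    print("codebook entries for 10 s of speech:", spec.tokens_for(10))
```

Under these assumptions each frame covers 40 ms of audio. Note that this counts codebook entries, not decoding steps: how the model schedules the 8 codebooks within a step is not stated here, so the number of autoregressive steps per second may differ.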
## Licensing and availability
Open weights are available under a research/non-commercial license, with a hosted API in free public preview; commercial use needs a separate license. For builders, a small, controllable, streaming TTS that clones a voice and emotes on command is the missing layer between a smart agent and one that sounds like a person.
