Most text-to-speech systems convert audio into discrete tokens at some stage of the pipeline. dots.tts, a new open-source TTS model from RedNote’s Hilab team, throws that step out entirely. It’s a 2B-parameter, fully continuous, end-to-end autoregressive system: a semantic encoder, an LLM, and a flow-matching acoustic head running over a 48 kHz AudioVAE, with no discrete tokens anywhere.
## What dots.tts does
dots.tts generates speech with voice cloning and emotional expressiveness. It also ships with an interleaved “1T1A” mode that alternates one text token with one audio step, which enables low-latency streaming when paired with a duplex dialogue model. On standard benchmarks it posts open-source state-of-the-art results, including the highest average speaker similarity across a 24-language multilingual test.
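
To make the 1T1A idea concrete, here is a minimal sketch of the interleaving schedule only. It is not the dots.tts API: `interleave_1t1a` and the string stand-ins for tokens and audio steps are hypothetical. How the real model handles the tail once one stream runs out is also an assumption here; the sketch simply keeps emitting whichever stream still has items.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple

_DONE = object()  # sentinel marking an exhausted stream


def interleave_1t1a(
    text_tokens: Iterable[str], audio_steps: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Alternate one text token ("T") with one audio step ("A").

    Hypothetical helper for illustration. Once one stream is exhausted,
    the remaining items of the other are emitted on their own (assumed
    behavior, not taken from the dots.tts release).
    """
    for tok, step in zip_longest(text_tokens, audio_steps, fillvalue=_DONE):
        if tok is not _DONE:
            yield ("T", tok)
        if step is not _DONE:
            yield ("A", step)


# Two text tokens, three audio steps: the extra audio step trails at the end.
schedule = list(interleave_1t1a(["he", "llo"], ["a0", "a1", "a2"]))
print(schedule)
```

Because audio can start after the first text token rather than after the full sentence, a consumer of this schedule could begin playback almost immediately, which is the latency benefit the interleaved mode is aiming for.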
## Why it matters
The whole thing ships under Apache-2.0 with pretrained checkpoints plus full inference and fine-tuning code, so it can be used commercially rather than serving only as a demo. For anyone building voice agents or dubbing pipelines, a continuous, streamable, openly licensed model that holds its own against the best closed options is a practical new option. Singing and combined speech-plus-sound generation aren’t covered in this release.
