Top AI Product

We track trending AI tools across Product Hunt, Hacker News, GitHub, and more — then write honest, opinionated takes on the ones that actually matter. No press releases, no sponsored content. Just real picks, published daily. Subscribe to stay ahead without drowning in hype.


Hume AI open-sources TADA — an LLM-based TTS with zero hallucinations and 0.09 RTF

LLM-based text-to-speech systems have a dirty secret: they hallucinate. Words get skipped, phrases get invented, and entire sentences sometimes come out garbled. The root cause is a fundamental mismatch — text and audio operate on completely different timescales, and when you force a language model to bridge that gap with hundreds of audio tokens per second of speech, things go wrong. Hume AI’s newly open-sourced TADA model attacks this problem at the architectural level, and the results are striking.

The Hallucination Problem in LLM-Based TTS

Traditional TTS systems (concatenative, parametric) were reliable but robotic. The newer wave of LLM-based TTS models — think Bark, VALL-E, and their descendants — brought natural-sounding speech but introduced a new class of failures. Because these models compress audio into discrete tokens at rates of 12.5 to 75 tokens per second, the language model has to predict long sequences where a single misstep cascades into skipped words, repeated phrases, or outright fabricated content.

The numbers tell the story. In standardized testing on the LibriTTS-R dataset (1,000+ samples), competing models show measurable hallucination rates: FireRedTTS-2 produced 41 hallucinated samples, Higgs Audio V2 had 24, and VibeVoice 1.5B showed 17. These aren’t edge cases — they’re systemic failures that make LLM-based TTS unreliable for production use cases like audiobook narration, voice agents, or accessibility tools where every word matters.
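For scale, those counts translate to rough per-sample failure rates. A quick sketch (assuming exactly 1,000 samples, since the article only says "1,000+", so these are slight overestimates):

```python
samples = 1000  # article says "1,000+"; exact count not published
counts = {"FireRedTTS-2": 41, "Higgs Audio V2": 24, "VibeVoice 1.5B": 17}

for model, n in counts.items():
    # Convert raw hallucination counts into a percentage of test samples.
    print(f"{model}: {n / samples:.1%} of samples hallucinated")
```

Even a ~2-4% failure rate is fatal for an audiobook pipeline, where a single invented sentence means a human has to re-listen to the whole output.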

There’s also the speed problem. High token rates mean high compute costs and slow generation. Most LLM-based TTS systems struggle to hit real-time speeds on reasonable hardware, which limits their usefulness for interactive applications.
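Real-time factor, which comes up again below, is simply generation time divided by the duration of audio produced; values under 1.0 mean faster-than-real-time synthesis. A minimal sketch (the timing numbers here are illustrative, not measured):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time spent generating / duration of the audio produced.
    RTF < 1.0 means the system synthesizes faster than playback."""
    return generation_seconds / audio_seconds

# Illustrative: producing 10 s of audio in 0.9 s of compute
rtf = real_time_factor(0.9, 10.0)
print(f"RTF = {rtf:.2f}, speedup vs playback = {1 / rtf:.1f}x")
```

An interactive voice agent generally needs RTF well below 1.0 with headroom to spare, which is why high-token-rate systems struggle here.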

How TADA Solves It: One Text Token, One Acoustic Vector

TADA stands for Text-Acoustic Dual Alignment, and the core idea is deceptively simple: instead of converting audio into a long sequence of discrete tokens, TADA aligns one continuous acoustic vector per text token. Text and speech move in lockstep through the language model.
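The alignment can be pictured with a toy sketch (this is a stand-in for the idea, not Hume’s actual implementation): the model emits exactly one continuous acoustic vector per input text token, so the output length is fixed by the input length.

```python
import numpy as np

def align_acoustics(text_tokens: list[str], dim: int = 64) -> np.ndarray:
    """Toy stand-in for TADA-style alignment: one continuous acoustic
    vector per text token. In the real model these vectors come from the
    LLM; here they are random placeholders to show the shape constraint."""
    rng = np.random.default_rng(0)
    # Shape (num_tokens, dim): row i corresponds to text_tokens[i].
    return rng.standard_normal((len(text_tokens), dim))

tokens = "the quick brown fox".split()
acoustics = align_acoustics(tokens)
# One-to-one by construction: a word can't be skipped or invented,
# because every output row is tied to exactly one input token.
assert acoustics.shape == (len(tokens), 64)
```

Contrast this with discrete-token TTS, where the model must decide how many audio tokens each word gets — and any miscount shows up as skipped or repeated speech.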

This architectural choice has cascading benefits:

Zero hallucinations by construction. Because there’s a strict one-to-one mapping between text tokens and acoustic output, the model structurally cannot skip words or invent content. It’s not that TADA has a low hallucination rate — it’s that hallucination is architecturally impossible. Across those same 1,000+ LibriTTS-R test samples, TADA produced exactly zero hallucinated outputs.

Dramatically fewer tokens per second of audio. TADA operates at just 2-3 frames per second of audio, compared to 12.5-75 in conventional approaches. This means a standard 2,048-token context window can accommodate roughly 700 seconds of audio — nearly 12 minutes. A conventional system maxes out at about 70 seconds in the same window. That’s a 10x improvement in context efficiency.

Speed that leaves competitors behind. With fewer tokens to generate, TADA achieves a real-time factor (RTF) of 0.09 — meaning it generates speech more than 11x faster than real-time playback. Hume AI claims this makes it over 5x faster than comparable LLM-based TTS systems.
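Putting the numbers together: the TADA rate comes from the article, while the ~30 tokens/s conventional rate is my back-of-envelope assumption chosen to match the article’s ~70-second figure (actual systems range from 12.5 to 75 tokens/s).

```python
CONTEXT = 2048  # tokens in a standard context window

def seconds_of_audio(context_tokens: int, tokens_per_second: float) -> float:
    """How much audio fits in a context window at a given token rate."""
    return context_tokens / tokens_per_second

tada = seconds_of_audio(CONTEXT, 3)            # ~683 s, about 11.4 minutes
conventional = seconds_of_audio(CONTEXT, 30)   # ~68 s (assumed mid-range rate)
print(f"TADA: {tada:.0f} s, conventional: {conventional:.0f} s, "
      f"ratio: {tada / conventional:.0f}x")
```

That roughly 10x gap in context efficiency is what makes long-form generation (audiobook chapters, extended narration) feasible without chunking tricks.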

What the Benchmarks Actually Show

Raw speed and zero hallucinations would be meaningless if TADA sounded terrible. So how does voice quality hold up?

In human evaluations on the EARS dataset (which tests expressive, long-form speech), TADA scored:

  • Speaker similarity: 4.18 out of 5.0
  • Naturalness: 3.78 out of 5.0

These scores placed TADA second overall in the evaluation — competitive with systems that sacrifice speed and reliability for quality. It’s a strong showing, especially considering the model’s focus on efficiency and reliability over raw audio fidelity.

The model comes in two sizes:

  • TADA-1B: 1 billion parameters, English only
  • TADA-3B-ML: 3 billion parameters, 8 languages (multilingual)

Both are built on the Llama architecture, which means they benefit from the extensive tooling and optimization work the open-source community has already done for Llama-based models.

TADA vs. the Open-Source TTS Landscape

The open-source TTS space has gotten crowded in 2026, so where does TADA fit relative to the alternatives?

Kokoro remains the lightweight champion at just 82 million parameters — small enough to run on almost anything. It delivers impressively low word error rates, but it can’t clone voices and operates in a fundamentally different weight class than TADA.

Sesame CSM (Conversational Speech Model) excels at conversational scenarios with its strong handling of non-verbal cues and tonal shifts between speakers. Its 1B parameter model is purpose-built for dialogue, which gives it an edge in that specific use case.

Fish Audio S2 is the heavyweight at 4 billion parameters, trained on 10 million hours of audio. It offers inline emotion control and achieved an 81.88% win rate against GPT-4o-mini-tts on EmergentTTS-Eval. But it requires substantially more compute.

TADA carves out a distinct niche: it’s the reliability-first choice. If your application absolutely cannot tolerate hallucinated or skipped words — voice agents handling medical information, accessibility tools for visually impaired users, or production audiobook pipelines — TADA’s zero-hallucination guarantee is uniquely valuable. The speed advantage also makes it the strongest candidate for on-device deployment, where compute budgets are tight.

Who Built This and Why It’s Open-Source

Hume AI, founded in 2021 by CEO Alan Cowen, has raised over $80 million in funding (including a $50M Series B led by EQT Ventures) to build AI with emotional intelligence. The company is best known for its Empathic Voice Interface and the Octave TTS model available through its commercial API.

Open-sourcing TADA is a strategic move. By releasing the foundational TTS architecture as open source while keeping their more expressive, emotionally aware models (like Octave) as commercial offerings, Hume AI gets community adoption and ecosystem building without cannibalizing their paid products. The TADA models, code, and a companion arXiv paper were all released simultaneously on March 10, 2026, appearing on Hacker News and gaining traction across Hugging Face and GitHub.

The timing matters too. With ElevenLabs, OpenAI, and Google all pushing proprietary TTS, and open-source alternatives multiplying rapidly, releasing a competitive open-source model helps Hume AI establish technical credibility in a space where brand recognition drives API revenue.

Limitations Worth Knowing

TADA’s architectural trade-offs aren’t free:

  • Naturalness ceiling. The 3.78/5.0 naturalness score, while competitive, trails behind models like Fish Audio S2 that use higher token rates and larger training sets. If raw audio quality is your top priority and you have the compute budget, other options may sound better.
  • No voice cloning details. The initial release focuses on the core TTS capability. Voice cloning workflows and fine-tuning documentation are limited compared to more mature open-source projects.
  • Multilingual coverage. The 3B multilingual model supports 8 languages — decent, but well behind Fish Audio S2’s 50+ language coverage.
  • Early ecosystem. As a brand-new open-source release, community tooling, tutorials, and third-party integrations are still catching up.

FAQ

Is Hume AI TADA free to use?
Yes. TADA is open-sourced with pre-trained models available on Hugging Face (1B and 3B variants) and code on GitHub. You can download and run it locally without any API costs.

How does TADA compare to ElevenLabs or OpenAI TTS?
TADA targets a different use case. ElevenLabs and OpenAI offer polished, high-quality voice APIs with extensive voice libraries and fine-grained controls. TADA prioritizes zero hallucination and speed, making it better suited for applications where reliability and on-device deployment matter more than voice variety.

Can TADA run on consumer hardware?
The 1B parameter model is designed to be lightweight enough for on-device deployment. Specific hardware requirements depend on your inference setup, but the low token rate (2-3 per second of audio) means compute demands are significantly lower than competing LLM-based TTS systems.
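As a rough sizing check (standard parameter-count arithmetic, not figures published by Hume), the weight memory at common precisions works out to:

```python
def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone (excludes activations,
    KV cache, and runtime overhead, which add to the real footprint)."""
    return params * bytes_per_param / 1e9

for name, params in [("TADA-1B", 1e9), ("TADA-3B-ML", 3e9)]:
    fp16 = weight_memory_gb(params, 2)   # half precision
    int8 = weight_memory_gb(params, 1)   # 8-bit quantized
    print(f"{name}: ~{fp16:.0f} GB fp16, ~{int8:.0f} GB int8")
```

By this estimate the 1B model fits comfortably in ~2 GB at fp16 (or ~1 GB quantized to int8), which is consistent with the on-device positioning, though actual requirements depend on your runtime.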

What languages does TADA support?
The 1B model is English-only. The 3B multilingual model supports 8 languages. For broader language coverage, alternatives like Fish Audio S2 (50+ languages) or Kokoro may be more appropriate.

What’s the difference between TADA and Hume AI’s Octave?
Octave is Hume AI’s commercial TTS product available through their API, focused on emotionally expressive speech. TADA is the open-source foundational model emphasizing speed and reliability. Think of TADA as the engine and Octave as the fully loaded car.

