Text-to-speech has been good enough to read your emails aloud for years. But getting AI voices to actually sound like they mean what they’re saying? That’s been the frustrating part. You want a whisper here, a confident tone there, maybe a laugh mid-sentence — and most TTS tools either ignore you or make you jump through hoops with SSML tags that feel like writing XML in 2006.
Fish Audio’s newly open-sourced S2 model takes a different approach: just tell it what you want, in plain English, right inside your text.
The Problem With Current TTS Tools
Anyone who has spent time with text-to-speech APIs knows the drill. You get impressively natural-sounding voices — until you need fine-grained control over how something is said. Most platforms offer a global “style” or “emotion” setting, but that applies to the entire utterance. Want the speaker to shift from cheerful to serious mid-paragraph? Good luck.
ElevenLabs, MiniMax, and even OpenAI’s TTS offerings give you high-quality output, but the control mechanisms are limited. SSML markup exists, but it’s clunky and only supports predefined tags. The gap between “sounds natural” and “sounds like a real person who changes tone mid-conversation” has been wide open.
That’s exactly the gap Fish Audio S2 targets.
How Fish Audio S2 Tackles Inline Emotion Control
S2’s headline feature is what Fish Audio calls inline emotion control. Instead of selecting a global voice style, you embed natural-language instructions directly into your text at the exact positions where you want the tone to shift.
Here’s what that looks like in practice:
[whisper in small voice] Don't tell anyone, but [normal tone] the quarterly numbers are actually great.

Welcome to the show! [professional broadcast tone] Today's top story...

[pitch up] Really? [laugh] I can't believe that happened.
These aren’t predefined tags from a fixed list. S2 accepts free-form textual descriptions — meaning you can write [excited but trying to stay calm] or [sarcastic] and the model interprets your intent. This is a fundamental shift from the dropdown-menu approach that most TTS platforms use.
The system also supports multi-speaker dialogue generation with consistent timbre across turns, which makes it particularly useful for podcast-style content, audiobook production, and conversational AI agents.
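Because the control channel is just text, scripting these instructions amounts to string composition; no SDK is needed to prepare the input. A minimal sketch (the `tagged` helper and the second tag phrase are illustrative; only the bracket-tag format comes from the examples above):

```python
# Compose S2 input text with inline emotion/style instructions.
# S2 accepts free-form natural-language descriptions in brackets,
# so the tag phrases below are arbitrary, not a fixed vocabulary.

def tagged(instruction: str, text: str) -> str:
    """Prefix a text span with an inline instruction in S2's bracket format."""
    return f"[{instruction}] {text}"

lines = [
    tagged("whisper in small voice", "Don't tell anyone, but")
    + " " + tagged("normal tone", "the quarterly numbers are actually great."),
    tagged("excited but trying to stay calm", "We beat every forecast."),
]
script = "\n".join(lines)
print(script)
```

The resulting string is what you would hand to the model as input text, with tone shifts anchored at exact character positions rather than set globally.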
What’s Under the Hood: 4B Parameters and 10 Million Hours of Audio
S2 runs on a Dual-Autoregressive (Dual-AR) architecture with roughly 4 billion parameters along the time axis and 400 million along the depth axis. The design splits the work:
- Slow AR handles the temporal sequence, predicting the primary semantic codebook
- Fast AR fills in the remaining 9 residual codebooks at each time step, capturing fine acoustic detail
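The division of labor can be sketched as a toy decode loop, with random stand-ins for the two transformers (the 1 semantic + 9 residual codebook layout follows the description above; the vocabulary size and sequence length are illustrative):

```python
import random

NUM_RESIDUAL_CODEBOOKS = 9   # fast AR fills these at each time step
CODEBOOK_SIZE = 1024         # stand-in codebook vocabulary size
NUM_STEPS = 5                # toy sequence length

def slow_ar_step(history):
    """Stand-in for the slow AR (~4B params): next semantic token."""
    return random.randrange(CODEBOOK_SIZE)

def fast_ar_fill(semantic_token):
    """Stand-in for the fast AR (~400M params): 9 residual tokens."""
    return [random.randrange(CODEBOOK_SIZE) for _ in range(NUM_RESIDUAL_CODEBOOKS)]

frames = []
semantic_history = []
for t in range(NUM_STEPS):
    s = slow_ar_step(semantic_history)   # advances along the time axis
    residuals = fast_ar_fill(s)          # expands along the depth axis
    semantic_history.append(s)
    frames.append([s] + residuals)       # 10 tokens per acoustic frame
```

The point of the split is that only the small fast AR runs 10 times per frame; the large slow AR runs once per time step, which keeps per-frame latency low.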
The model was trained on over 10 million hours of audio across approximately 50 languages (with support for around 80 languages total). For post-training alignment, Fish Audio used Group Relative Policy Optimization (GRPO) — the same family of RL techniques that have become popular in LLM alignment — to fine-tune for semantic accuracy, instruction adherence, acoustic quality, and speaker similarity.
Voice cloning requires just 15 seconds of reference audio. The system places reference audio tokens in the prompt and uses SGLang’s RadixAttention to cache KV states, hitting an 86.4% average prefix-cache hit rate — which is why voice cloning requests after the first one are significantly faster.
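The speedup comes from reusing cached KV states for the shared reference-audio prefix. A toy illustration of that accounting, not SGLang's actual RadixAttention (which matches prefixes with a radix tree over token sequences):

```python
# Toy prefix cache: counts how many prompt tokens can be served from
# KV states computed for earlier requests. Shows why two cloning
# requests with the same 15s reference prefix make the second cheap.

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    def __init__(self):
        self.stored = []  # prompts whose KV states are already cached

    def lookup(self, prompt):
        """Return how many leading tokens hit the cache, then store prompt."""
        hit = max((common_prefix_len(p, prompt) for p in self.stored), default=0)
        self.stored.append(prompt)
        return hit

ref_audio = ["ref"] * 100  # stand-in tokens for the reference audio
cache = PrefixCache()
first = cache.lookup(ref_audio + ["text", "A"])   # cold: 0 tokens cached
second = cache.lookup(ref_audio + ["text", "B"])  # warm: shared prefix hits
print(first, second)
```

Here the second request reuses 101 of its 102 prompt tokens, which is the same mechanism behind the 86.4% average hit rate Fish Audio reports at scale.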
On a single NVIDIA H200, S2 achieves a real-time factor of 0.195 with time-to-first-audio around 100ms and throughput exceeding 3,000 acoustic tokens per second.
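The real-time factor is easier to feel as arithmetic: an RTF of 0.195 means each second of audio takes 0.195 seconds of compute, so a single H200 synthesizes roughly five seconds of speech per wall-clock second:

```python
rtf = 0.195                      # seconds of compute per second of audio
speedup = 1 / rtf                # how much faster than real time
audio_minutes_per_hour = 60 / rtf
print(f"{speedup:.2f}x real time, "
      f"~{audio_minutes_per_hour:.0f} min of audio per wall-clock hour")
```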
The Benchmark Numbers: S2 vs. the Competition
This is where things get interesting. Fish Audio didn’t just release a model — they released benchmark results that put S2 ahead of both open-source and closed-source competitors:
EmergentTTS-Eval: S2 achieves an 81.88% win rate against gpt-4o-mini-tts — the highest among all evaluated models, including closed-source systems from Google and OpenAI.
Audio Turing Test: S2 scores a posterior mean of 0.515 with instruction rewriting, beating Seed-TTS (0.417) by 24% and MiniMax-Speech (0.387) by 33%.
Seed-TTS Eval (Word Error Rate, lower is better):

| Model | Chinese | English |
| --- | --- | --- |
| S2 | 0.54% | 0.99% |
| MiniMax Speech-02 | 0.99% | 1.90% |
| Seed-TTS | 1.12% | 2.25% |
MiniMax Multilingual Testset (24 languages): S2 achieves the best WER in 11 languages and the best speaker similarity in 17 languages, outperforming both MiniMax and ElevenLabs across the majority of tested languages.
Fish Instruction Benchmark (emotion/style control quality): 4.51 out of 5.0.
These numbers position S2 as arguably the strongest TTS model available right now — and it’s fully open-source.
What You Actually Get: The Full Open-Source Package
Fish Audio didn’t just drop model weights and call it a day. The S2 release includes:
- Model weights on HuggingFace
- Fine-tuning code for custom voice and style training
- Production-ready inference stack built on SGLang-Omni for streaming
The GitHub repository (fishaudio/fish-speech) has accumulated over 25,000 stars, making it one of the most popular open-source TTS projects. The complete package means you can self-host the entire system — no API dependency, no per-character costs, full control over your deployment.
For those who prefer a managed service, Fish Audio also offers S2 through their API with pay-as-you-go pricing. Their existing pricing structure uses a credit system (1 credit per Chinese character, 0.5 credits per character for other languages), though specific S2 API pricing may differ as the model just launched.
Limitations and What to Watch
S2 is impressive on paper, but a few things are worth noting:
- Hardware requirements: Running the full 4B parameter model locally requires serious GPU resources. The benchmarks were run on NVIDIA H200 hardware — not exactly consumer-grade.
- Emotion control consistency: Some GitHub issues from earlier Fish Speech versions suggest that emotion tags don’t always produce the expected results, especially for subtle emotional shifts. S2 should improve on this, but community feedback is still early.
- Language coverage depth: While 80 languages are nominally supported, performance will vary significantly. The strongest results are in English, Chinese, Japanese, and other well-represented languages in the training data.
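On the hardware point, a back-of-envelope estimate for the weights alone gives a floor for VRAM (KV cache and activations add more on top; this is plain arithmetic, not a published Fish Audio requirement):

```python
params = 4.4e9  # ~4B slow AR + ~400M fast AR, per the architecture above

for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.1f} GB for weights")
```

So the unquantized model wants roughly 9 GB for weights before any cache, which explains why comfortable deployment targets data-center GPUs rather than typical consumer cards.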
Compared to ElevenLabs, Fish Audio S2 offers the advantage of being fully open-source and self-hostable, plus the inline emotion control is more flexible. ElevenLabs still has a more polished consumer-facing platform and a larger voice library.
Compared to OpenAI’s TTS, S2 beats gpt-4o-mini-tts on benchmarks and offers far more granular control, but OpenAI’s integration with the broader GPT ecosystem is a significant advantage for developers already in that stack.
Compared to MiniMax Speech, the benchmark numbers favor S2 across the board, though MiniMax’s tight integration with their video and multimodal models may matter for certain use cases.
Who Should Pay Attention
- Developers building voice AI agents who need dynamic, context-aware speech with real-time emotion shifts
- Content creators producing podcasts, audiobooks, or video narration who want fine-grained control without SSML headaches
- Companies looking to self-host TTS to avoid per-character API costs and maintain data privacy
- Researchers working on speech synthesis, voice cloning, or multilingual TTS
FAQ
Is Fish Audio S2 really free to use?
Yes. The model weights, fine-tuning code, and inference stack are fully open-sourced. You can self-host without any licensing fees. Fish Audio also offers a paid API for those who don’t want to manage infrastructure.
How does Fish Audio S2 compare to ElevenLabs?
S2 outperforms ElevenLabs on multilingual benchmarks (best speaker similarity in 17 of 24 tested languages) and offers inline emotion control that ElevenLabs doesn’t support. ElevenLabs has a more polished UI and larger community voice library.
What hardware do I need to run Fish Audio S2 locally?
The full model has roughly 4B parameters and was benchmarked on NVIDIA H200 GPUs. For production use, you’ll want at least an H100 or equivalent. Smaller GPU setups may work with quantized versions, but expect reduced quality.
How many languages does Fish Audio S2 support?
Approximately 80 languages, trained on 10 million hours of audio across about 50 languages. Best performance is in well-represented languages like English, Chinese, Japanese, Korean, Spanish, and Arabic.
Can Fish Audio S2 clone any voice from just 15 seconds of audio?
Yes, S2 supports zero-shot voice cloning with as little as 15 seconds of reference audio. The cloned voice maintains consistent timbre across different emotions and styles, though quality depends on the clarity of the reference audio.