SwanBench-Speech is a benchmark for the side of speech generation that single-utterance demos hide: long-form synthesis. Most TTS evaluations score short clips on naturalness and prosody. They tend not to measure what breaks when you push to minutes of continuous speech: drift, identity loss, expressive monotony.
## 1,101 samples across 17 scenarios
The benchmark covers 1,101 samples spanning 17 common speech scenarios, evaluating acoustic quality, semantic faithfulness, and expressiveness — the three places long-form synthesis tends to crack. Seventeen scenarios spread the test surface beyond audiobook reads to dialogue, news, lectures, narration, and the kinds of situational shifts that catch a model leaning too heavily on whatever style it picked first.
## Why it matters
Long-form synthesis is the production bar for podcasts, audiobooks, dubbed video, character voices in games, and any voice agent that has to maintain identity across a real conversation. A clip that sounds great at five seconds can fall apart at five minutes — pitch wanders, energy flattens, the speaker stops sounding like the speaker. Benchmarks shape what labs train against. A focused long-form, multi-scenario suite gives the field something specific to push on, instead of optimising the same MOS scores on short samples and quietly stalling on the use cases that actually pay.
