Top AI Product

Every day, hundreds of new AI tools launch across Product Hunt, Hacker News, and GitHub. We dig through the noise so you don't have to — surfacing only the ones worth your attention with honest, no-fluff reviews. Explore our latest picks, deep dives, and curated collections to find your next favorite AI tool.


Kitten TTS: 15 Million Parameters, 25MB on Disk, and Zero GPU Required for Natural Speech

Most text-to-speech models worth using weigh hundreds of megabytes and expect a GPU. Kitten TTS, from the KittenML team, takes a different bet — what if you could get genuinely expressive voice synthesis from a model small enough to fit on a smartwatch?

On March 19, KittenML announced three new models on Hacker News, and the post promptly racked up 430+ points with 160 comments. The smallest variant packs 15 million parameters into under 25MB of disk space. It runs on pure CPU, outputs 24 kHz audio, and according to the team, hits state-of-the-art expressivity among models of comparable size. The GitHub repo now sits at 12.2K stars and 659 forks.

That is a bold claim for something smaller than a single high-resolution photo. But the numbers behind Kitten TTS tell an interesting story about where edge AI is heading.

Three Models, One Philosophy: Smaller Is Better

The v0.8 release introduced a tiered model lineup, each targeting a different point on the size-quality spectrum:

Model   Parameters   Disk Size (quantized)   Target Use Case
Mini    80M          ~80 MB                  Desktop/server, highest quality
Micro   40M          ~41 MB                  Mobile apps, balanced trade-off
Nano    15M          ~25 MB (int8)           Edge devices, browsers, wearables

All three models share the same architecture: a lightweight transformer encoder paired with a neural vocoder. Both components were trained with quantization-aware training (QAT), meaning the model learned to compensate for precision loss during training rather than having quantization bolted on afterward. This is a key reason the Nano model punches above its weight — it was designed from the ground up to work at int8+fp16 precision, not forced into it.
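To see what the model has to tolerate at int8, here is a generic sketch of the symmetric quantize-dequantize round-trip that QAT simulates during the forward pass. This is an illustration of the technique, not KittenML's actual training code:

```python
import random

def quantize_int8(weights):
    # Symmetric per-tensor int8 quantization: map floats onto [-127, 127]
    # integer steps, then dequantize back to float.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return [v * scale for v in q], scale

random.seed(0)
w = [random.uniform(-1.0, 1.0) for _ in range(1000)]
deq, scale = quantize_int8(w)
max_err = max(abs(a - b) for a, b in zip(w, deq))
print(f"step size {scale:.5f}, worst-case error {max_err:.5f}")
```

The worst-case error is half a quantization step. A QAT model sees exactly this kind of rounding during training, so its weights settle into values where the error barely matters, which is why post-hoc quantization of a model trained at full precision tends to degrade quality more.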

Each model ships with eight built-in English voices — Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, and Leo — split evenly between male and female. The output is 24 kHz WAV audio with adjustable speed control and built-in text preprocessing for numbers, currencies, and units.

The entire stack runs on ONNX Runtime, which means deployment is straightforward across Linux, macOS, and Windows. Python 3.8+ is the only hard requirement. No CUDA, no GPU drivers, no cloud API keys.

Why a 25MB TTS Model Matters Right Now

The obvious question: who actually needs a TTS model this small? The answer turns out to be a surprisingly long list.

Offline-first applications. Voice assistants that work without internet access have been a white whale for years. A 25MB model that runs on a Raspberry Pi or a $100 Android phone changes the math entirely. One Hacker News commenter already built a low-latency streaming voice assistant on CPU using Kitten TTS as the speech layer — complete with voice activity detection and an async pipeline to minimize latency.

Browser-based TTS. The ONNX Runtime Web backend means Kitten TTS can synthesize speech directly in the browser. Multiple community projects have already ported the model to run client-side using WebAssembly. For web apps that need voice output without sending user text to a server, this is a significant privacy win.

Embedded and IoT. Smart home devices, kiosks, educational toys, accessibility tools — all areas where adding voice output previously meant either a cloud dependency or a mechanical-sounding espeak fallback. A model under 25MB with natural-sounding output fits where nothing else could.

Cost. Cloud TTS APIs charge per character. For high-volume applications — audiobook generators, e-learning platforms, automated call systems — the cost adds up fast. A self-hosted model with an Apache 2.0 license and zero inference cost is compelling.
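Some back-of-envelope arithmetic makes the cost argument concrete. The figures here are hypothetical: $16 per million characters is a common neural cloud-TTS price point, and the volume is an assumed example, not a measured workload:

```python
# Illustrative figures only: $16 per 1M characters is a typical
# neural cloud-TTS price point; the monthly volume is hypothetical.
price_per_million_chars = 16.0
chars_per_month = 50_000_000

monthly = price_per_million_chars * chars_per_month / 1_000_000
print(f"cloud: ${monthly:,.0f}/month, ${monthly * 12:,.0f}/year")
# self-hosted Kitten TTS: $0 in per-character inference fees
```

At that hypothetical volume, the cloud bill runs to hundreds of dollars a month against zero marginal cost for a self-hosted model, before accounting for the (usually modest) CPU hardware to run it.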

The broader trend here is clear. As AI models get deployed on devices rather than data centers, size becomes a first-class engineering constraint. Kitten TTS is betting that for many real-world use cases, “good enough in 25MB” beats “perfect in 2GB.”

Kitten TTS vs. Piper, Kokoro, and Coqui: Where Does It Actually Stand?

No TTS model exists in a vacuum. Here is how Kitten TTS compares to the other open-source options developers are evaluating:

Piper TTS is Kitten TTS’s closest competitor in the lightweight space. Developed by the Rhasspy project, Piper offers over 900 voices across multiple languages and has been the go-to choice for Raspberry Pi TTS for years. On raw CPU speed, Piper is significantly faster — benchmark data from a community comparison on GitHub shows Piper fp16 achieving an RTF (real-time factor) of 0.192 versus Kitten TTS fp16 at 0.693. In plain terms, Piper generates speech about 3.6x faster on equivalent hardware. Piper’s int8 variant is also slightly smaller on disk (22MB vs. 25MB). However, Piper has a critical weakness on ultra-constrained devices: testers found that Piper crashed on a Galaxy A14 smartphone because Chrome killed the tab when memory exceeded 80MB, while Kitten TTS completed every test without a single out-of-memory error. If your target is truly resource-constrained hardware, Kitten TTS has the edge in reliability.
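RTF numbers are easier to reason about in wall-clock terms. Plugging the community benchmark's figures into the definition (RTF = synthesis time divided by audio duration, so below 1.0 is faster than real time) shows what the gap means in practice:

```python
def synthesis_seconds(rtf, audio_seconds):
    # RTF (real-time factor) = time to synthesize / duration of audio.
    # Below 1.0 means faster than real time.
    return rtf * audio_seconds

piper_fp16_rtf = 0.192   # from the community benchmark
kitten_fp16_rtf = 0.693

minute = 60
print(f"Piper:  {synthesis_seconds(piper_fp16_rtf, minute):.1f}s per minute of audio")
print(f"Kitten: {synthesis_seconds(kitten_fp16_rtf, minute):.1f}s per minute of audio")
print(f"Ratio:  {kitten_fp16_rtf / piper_fp16_rtf:.1f}x")
```

Both models comfortably beat real time on CPU; the difference matters mainly for latency-sensitive streaming pipelines and batch jobs like audiobook generation.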

Kokoro TTS (82M parameters) sits at the other end of the trade-off curve. It produces notably higher-quality speech with 21 expressive voices, but at 330MB on disk and an RTF of 1.880 (or 3.564 for int8), it is larger and slower than both Kitten and Piper. Kokoro is the right choice when quality is the top priority and you have a decent CPU or GPU available. It is not a realistic option for edge deployment.

Coqui XTTS was once the most popular open-source TTS project, particularly for voice cloning. But the Coqui organization shut down in late 2023, and while the code remains available, nobody from the original team maintains it. XTTS-v2 also requires a GPU for practical use. For preset voices and edge deployment, Kitten TTS is the better bet; for voice cloning specifically, the community still reaches for Coqui forks.

espeak-ng deserves a mention as the ultra-minimal option. It is extraordinarily tiny and runs anywhere, but it produces formant-synthesized speech that sounds unmistakably robotic. Kitten TTS fills the gap between espeak’s mechanical output and Piper/Kokoro’s larger footprint.

The competitive picture is nuanced. Kitten TTS is not the fastest, not the highest quality, and not the most feature-rich. What it does better than anything else is fit natural-sounding speech into the smallest possible package while staying reliable on devices where competitors fail.

Community Reception: Enthusiasm with Caveats

The Hacker News thread tells a story of genuine excitement tempered by early-stage friction.

On the positive side, developers praised the model’s quality relative to its size and immediately started building on it. Community projects emerged quickly — a CLI wrapper called “purr,” browser ports using WebAssembly, and a self-hosted API server with a web UI and audiobook processing support.

But the criticism was also pointed. The biggest complaint: dependency bloat. Despite the project's tiny-model marketing, one developer reported that a default pip install pulled 7.1GB of dependencies, primarily NVIDIA CUDA libraries that are completely unnecessary for CPU-only inference. The community-built “purr” wrapper addressed this by stripping out torch and GPU packages, but it highlights a gap between the model’s lightweight philosophy and its current packaging.

Other issues raised by users:

  • Number pronunciation requires manual text preprocessing workarounds. The model struggles with reading digits, dates, and formatted numbers naturally.
  • Voice diversity is limited. Several users requested more professional, business-appropriate voices, noting the current options skew toward an anime/cartoon aesthetic. The team committed to releasing additional voices within weeks.
  • Documentation gaps around GPU usage, advanced configuration, and integration patterns.
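The number-pronunciation issue is the kind of thing users can work around today by normalizing text before it reaches the model. The sketch below is a generic, illustrative preprocessor (names and coverage are my own, not KittenML's built-in text handling), spelling out small integers and simple dollar amounts:

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def int_to_words(n: int) -> str:
    # Spell out an integer from 0 to 999 in English words.
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rem = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rem] if rem else "")
    hundreds, rem = divmod(n, 100)
    out = ONES[hundreds] + " hundred"
    return out + (" " + int_to_words(rem) if rem else "")

def preprocess(text: str) -> str:
    # Expand "$N" to "N dollars", then spell out remaining bare integers.
    text = re.sub(r"\$(\d+)", r"\1 dollars", text)
    return re.sub(r"\d+", lambda m: int_to_words(int(m.group())), text)

print(preprocess("It costs $25"))
```

A production normalizer would also need dates, decimals, ordinals, and units, which is exactly the scope of the built-in preprocessing the article mentions above and the area users say still needs work.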

To their credit, the KittenML team was highly responsive in the thread. The creator engaged with nearly every concern, committing to dependency fixes, mobile SDKs, multilingual support (targeted for April), and voice cloning capabilities by May. The project is tagged as “developer preview,” so API changes between releases are expected.

What Is on the Roadmap

KittenML has outlined several upcoming features:

  • Multilingual support — currently English-only, with community-contributed German and Spanish checkpoints available but less polished. Official multilingual models are planned.
  • Mobile SDKs — native integration for iOS and Android beyond the current Python-based workflow.
  • Voice cloning — the ability to generate custom voice profiles.
  • KittenASR — a companion automatic speech recognition module, suggesting the team is building toward a complete edge voice pipeline.
  • Optimized inference engine — potential speed improvements to close the gap with Piper.

The Apache 2.0 license means commercial use is unrestricted. For teams needing custom voices, enterprise licensing, or integration support, KittenML offers commercial support separately.

FAQ

Is Kitten TTS free for commercial use?
Yes. The project is released under the Apache 2.0 license, which permits unrestricted commercial use, modification, and redistribution. KittenML also offers paid commercial support for custom voice development, integration assistance, and enterprise licensing, but the core models and code are free.

How does Kitten TTS compare to cloud TTS services like Google Cloud TTS or Amazon Polly?
Cloud services generally produce higher-quality speech with more language and voice options. The trade-offs are cost (per-character pricing), latency (network round-trip), privacy (text is sent to external servers), and offline availability (none). Kitten TTS runs locally with zero ongoing cost, full privacy, and no internet requirement. For applications where those factors outweigh peak audio quality — embedded devices, offline apps, high-volume synthesis — Kitten TTS is a strong alternative.

What languages does Kitten TTS support?
Currently, only English is officially supported with eight voices. Community members have contributed experimental German and Spanish checkpoints. The KittenML team has announced plans for official multilingual support in an upcoming release.

Can Kitten TTS clone voices or generate custom voices?
Not yet. The current release includes eight preset voices only. Voice cloning is on the roadmap and is expected to arrive in a future update. For voice cloning today, Coqui XTTS forks or Kokoro remain the primary open-source options.

What hardware do I need to run Kitten TTS?
The Nano model (25MB) runs on virtually any hardware with a CPU — Raspberry Pi, budget smartphones, older laptops, and even web browsers via WebAssembly. No GPU is required. Python 3.8 or later is needed, and the runtime supports Linux, macOS, and Windows. The Mini model (80MB) benefits from a more capable CPU but still does not require a GPU.

