Top AI Product

Every day, hundreds of new AI tools launch across Product Hunt, Hacker News, and GitHub. We dig through the noise so you don't have to — surfacing only the ones worth your attention with honest, no-fluff reviews. Explore our latest picks, deep dives, and curated collections to find your next favorite AI tool.


Cohere Transcribe Tops the Open ASR Leaderboard With a 5.42% Word Error Rate — and It’s Fully Open Source

Speech recognition has been one of those AI fields where open-source models consistently trailed behind proprietary offerings. OpenAI’s Whisper changed the game in 2022, but even Whisper Large v3 couldn’t match the accuracy of closed-source APIs from the likes of Google and Deepgram. That gap just narrowed significantly. Cohere dropped Transcribe on March 26, 2026 — a 2B-parameter ASR model that immediately claimed the #1 spot on the HuggingFace Open ASR Leaderboard, beating every open-source and closed-source dedicated ASR model on the board. And it ships under an Apache 2.0 license.

What the Numbers Actually Show

Let’s talk benchmarks, because Cohere Transcribe’s leaderboard performance isn’t a marginal win — it’s a clear lead across the board.

The model achieves an average word error rate (WER) of 5.42% across the Open ASR Leaderboard’s eight benchmark datasets. Here’s how it breaks down:

  • AMI: 8.13%
  • Earnings22: 10.86%
  • GigaSpeech: 9.34%
  • LibriSpeech Clean: 1.25%
  • LibriSpeech Other: 2.37%
  • SPGISpeech: 3.08%
  • TED-LIUM: 2.49%
  • VoxPopuli: 5.87%

For comparison, Whisper Large v3 sits at 7.44% average WER. ElevenLabs Scribe v2 comes in at 5.83%. Qwen3-ASR-1.7B scores 5.76%. Cohere Transcribe beats all of them.

Beyond automated benchmarks, Cohere ran human preference evaluations pitting Transcribe head-to-head against competitors. The highlights: a 64% win rate against Whisper Large v3, 67% against NVIDIA Canary Qwen 2.5B, and 78% against IBM Granite 4.0. Across the full set of compared models (not just the three highlighted here), the average win rate was 61% — a sign this isn't just a benchmark-gaming exercise, since human evaluators consistently preferred Transcribe's output for accuracy, coherence, and usability.

The Architecture: Why 90% of Parameters Live in the Encoder

Cohere Transcribe takes an unconventional design approach. It’s a 2B encoder-decoder model built on a Fast-Conformer encoder paired with a lightweight Transformer decoder using cross-attention (X-attention). The key architectural decision: over 90% of the model’s parameters are concentrated in the encoder.

This matters for practical deployment. Traditional autoregressive decoders are the bottleneck in most speech models — they generate tokens one at a time, which means inference speed scales poorly with output length. By front-loading the computation into the encoder and keeping the decoder minimal, Cohere Transcribe pushes most of the heavy lifting into a parallelizable stage. The result is what Cohere describes as “best-in-class throughput within the 1B+ parameter model cohort.”
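A toy latency model makes the trade-off concrete (the numbers below are illustrative assumptions, not Cohere's measured figures): the encoder processes the whole clip in one parallelizable pass, while an autoregressive decoder pays a fixed cost per output token, so decoder cost grows with transcript length.

```python
def transcription_latency(encoder_pass_ms: float, per_token_ms: float, n_tokens: int) -> float:
    """Toy cost model: one parallel encoder pass plus sequential token-by-token decoding."""
    return encoder_pass_ms + per_token_ms * n_tokens

# Hypothetical configurations for a 400-token transcript:
# a heavy encoder with a tiny decoder vs. a light encoder with a heavy decoder.
encoder_heavy = transcription_latency(encoder_pass_ms=120.0, per_token_ms=0.5, n_tokens=400)
decoder_heavy = transcription_latency(encoder_pass_ms=40.0, per_token_ms=2.0, n_tokens=400)
print(encoder_heavy)  # 320.0 ms
print(decoder_heavy)  # 840.0 ms
```

The longer the output transcript, the more the decoder-heavy design loses, which is why concentrating parameters in the encoder pays off for throughput.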

The model was trained on 500,000 hours of curated audio-transcript pairs using cross-entropy loss. At 2B parameters, it’s small enough to run on consumer-grade GPUs for self-hosting — a significant advantage for teams that need on-premise transcription for compliance or privacy reasons.
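The cross-entropy objective mentioned above is the standard sequence-to-sequence recipe: minimize the negative log-probability the decoder assigns to each token of the reference transcript. A stdlib-only sketch on made-up toy probabilities (a real model outputs a full distribution over the vocabulary at every step):

```python
import math

# Decoder's predicted probability for each correct token of a toy
# reference transcript (made-up values for illustration only).
p_correct = [0.9, 0.7, 0.95, 0.6]

# Cross-entropy loss = mean negative log-likelihood of the reference tokens.
loss = -sum(math.log(p) for p in p_correct) / len(p_correct)
print(round(loss, 3))
```

Pushing every `p_correct` toward 1.0 drives the loss toward zero, which is exactly what training on 500,000 hours of paired audio and transcripts optimizes for.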

14 Languages, Enterprise Focus

Cohere Transcribe currently supports 14 languages: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Chinese, Japanese, Korean, Vietnamese, and Arabic. That’s a smaller footprint than Whisper’s 100+ languages, but Cohere’s language selection is clearly enterprise-driven — these are the languages most commonly needed in multinational business contexts.

In multilingual benchmarks, Cohere reports that Transcribe outperforms or matches the best open-source model in each of its 13 non-English languages across tracked benchmarks. For companies operating across European and Asian markets, this focused multilingual quality could matter more than Whisper’s broader but thinner language coverage.

The enterprise angle runs deeper than just language selection. Cohere has positioned this model as a building block for “speech intelligence” — not just raw transcription, but integration with downstream NLP tasks like summarization, entity extraction, and sentiment analysis using Cohere’s text models. If you’re already in Cohere’s ecosystem for text processing, adding Transcribe to your pipeline creates a natural end-to-end workflow.

How It Compares to the Competition

The ASR landscape in 2026 is more crowded than ever. Here’s where Cohere Transcribe fits:

vs. Whisper Large v3: Whisper remains the default open-source choice with its massive 100+ language support and battle-tested reliability. But at 7.44% average WER, it’s noticeably less accurate than Transcribe on English benchmarks. Whisper also lacks native real-time streaming support, which limits some use cases.

vs. ElevenLabs Scribe v2: Scribe v2 is a strong competitor at 5.83% WER, but it’s a proprietary API — no self-hosting option. For teams that need data sovereignty, Transcribe’s Apache 2.0 license is a decisive advantage.

vs. Qwen3-ASR-1.7B: Alibaba’s entry scores 5.76% WER with fewer parameters (1.7B vs 2B), making it an efficient choice. But Cohere Transcribe still edges it out on accuracy and offers broader language support.

vs. Gemini Native Audio: Google’s approach differs fundamentally — Gemini processes audio as a general-purpose LLM task rather than using a dedicated ASR architecture. This gives it advantages in diarization and handling accented speech, but it comes with higher latency and cost. For pure transcription accuracy and throughput, a dedicated model like Transcribe has the architectural edge.

vs. Deepgram Nova-3: Deepgram remains a top choice for real-time streaming and speed-critical applications. Their API is mature and well-documented. But Deepgram’s models are proprietary — if you need to self-host, Transcribe is the stronger option.

Pricing and Access

Cohere offers Transcribe through multiple access paths:

  • Free API: Available immediately with rate limits, suitable for experimentation and low-volume use
  • Model Vault: Dedicated cloud inference without rate limits, with pricing calculated per hour-instance and discounts for longer commitments
  • Self-hosted: Download the weights from HuggingFace and run it on your own infrastructure — no license fees, no API costs, no data leaving your servers

The self-hosting option under Apache 2.0 is the real differentiator here. For organizations processing millions of audio hours — call centers, legal firms, healthcare providers — the ability to run inference on their own GPUs without per-minute API charges can translate to massive cost savings.
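A back-of-the-envelope break-even calculation shows the shape of those savings. Every number below is a hypothetical placeholder (neither Cohere's nor any vendor's actual pricing); substitute your real API rate, GPU cost, and throughput:

```python
# Hypothetical inputs -- replace with your actual figures.
api_price_per_audio_hour = 0.36   # $/audio hour charged by a transcription API
gpu_cost_per_wall_hour = 1.20     # $/hour for a rented consumer-class GPU
realtime_factor = 30.0            # audio hours transcribed per GPU wall-clock hour

# Effective self-hosting cost per audio hour.
self_host_cost_per_audio_hour = gpu_cost_per_wall_hour / realtime_factor  # $0.04

# Annual savings at 1,000,000 audio hours per year.
savings_per_audio_hour = api_price_per_audio_hour - self_host_cost_per_audio_hour
annual_savings = savings_per_audio_hour * 1_000_000
print(f"${annual_savings:,.0f}")  # $320,000
```

The point isn't the specific figures but the structure: self-hosting turns a per-minute variable cost into a fixed infrastructure cost, so savings scale with volume.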

Specific Model Vault pricing isn’t publicly listed; enterprise customers need to contact Cohere’s sales team for custom quotes.

Why This Release Matters

Cohere has carved out a niche as the enterprise-focused AI company — less consumer flash than OpenAI, more emphasis on tools that slot into existing business workflows. Transcribe fits that strategy perfectly. While competitors chase general-purpose audio understanding (voice assistants, audio reasoning), Cohere built a model that does one thing — transcription — and does it better than anything else on the leaderboard.

The Apache 2.0 licensing is a strategic move, too. By making the weights freely available, Cohere lowers the barrier for developers to build on Transcribe, which feeds back into their broader platform ecosystem. Try the model for free, build your pipeline around it, then scale up with Cohere’s API and text models when you need enterprise-grade infrastructure.

For the open-source ASR community, this is a significant moment. A well-funded enterprise company releasing a state-of-the-art model under a permissive license raises the floor for what “good enough” transcription looks like. If you’re building a speech-to-text pipeline in 2026 and you’re not evaluating Cohere Transcribe alongside Whisper, you’re leaving accuracy on the table.

FAQ

Is Cohere Transcribe really free to use?
Yes, in two ways. The model weights are available on HuggingFace under Apache 2.0, so you can download and run it without any license fees. Cohere also offers a free API tier with rate limits for quick experimentation. For production workloads without rate limits, you’ll need to either self-host or pay for Cohere’s Model Vault service.

How does Cohere Transcribe compare to Whisper for non-English languages?
Transcribe supports 14 languages compared to Whisper’s 100+. If you need transcription in a language Transcribe doesn’t cover, Whisper is still your best open-source option. But for the 14 languages Transcribe does support — which cover most major enterprise markets — it matches or outperforms the best open-source alternatives in each language.

Can I run Cohere Transcribe on my own hardware?
Yes. At 2B parameters, the model is designed to run on consumer-grade GPUs. This makes it practical for on-premise deployment where data privacy or compliance requirements prevent sending audio to external APIs.

Does Cohere Transcribe support real-time streaming transcription?
Cohere’s documentation focuses on batch transcription rather than real-time streaming. If low-latency streaming is your primary requirement, you may want to evaluate dedicated streaming solutions alongside Transcribe for batch and near-real-time workloads.

What makes Cohere Transcribe different from using Gemini or GPT for transcription?
Large language models like Gemini can transcribe audio, but they treat it as a general-purpose task — attaching audio to a text prompt. Cohere Transcribe is a purpose-built ASR model with an architecture optimized specifically for transcription accuracy and throughput. This specialization means lower latency, lower cost per audio hour, and better accuracy on standard benchmarks compared to using a general-purpose LLM for the same task.

