Cohere Transcribe: state-of-the-art speech recognition

Cohere is announcing Transcribe, a state-of-the-art automatic speech recognition (ASR) model that is open source and available for download today.

Speech is rapidly becoming a core tool for AI-enabled workloads and automation – from transcription and speech analytics to real-time customer support agents.

Our objective was simple: to push the limits of dedicated ASR model accuracy under practical conditions. The model was trained from scratch with a deliberate focus on minimizing word error rate (WER) while keeping production readiness in mind. In other words, not just a research artifact, but a system designed for everyday use.

Cohere Transcribe reflects that intention. It is available for open-source use with full infrastructure control, maintains a manageable inference footprint suitable for practical GPU and local use, provides best-in-class serving efficiency, and is also available through Model Vault – Cohere’s secure, fully managed model inference platform.

Cohere Transcribe is currently ranked #1 for accuracy on the Hugging Face Open ASR Leaderboard, setting a new benchmark for real-world transcription performance.

This marks our zero-to-one in bringing high-performance speech recognition to enterprise AI workflows. Read on for more details.

Model overview

Name: cohere-transcribe-03-2026
Architecture: Conformer-based encoder-decoder
Input: audio waveform → log-mel spectrogram
Output: transcribed text
Model size: 2B
Model: A large Conformer encoder extracts acoustic representations, followed by a lighter Conformer decoder for token generation.
Training objective: standard supervised cross-entropy on output tokens; trained from scratch
Languages: trained on 14 languages:

  • European: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish
  • APAC: Chinese (Mandarin), Japanese, Korean, Vietnamese
  • MENA: Arabic

License: Apache 2.0

Image 1: Cohere Transcribe is an open-weight Conformer ASR model that converts speech audio to text in 14 supported languages.

Model performance

Accuracy

Cohere Transcribe sets a new standard for English speech recognition accuracy. It leads the Hugging Face Open ASR Leaderboard with an average word error rate of only 5.42%, outperforming all open- and closed-source dedicated ASR alternatives, including Whisper Large v3, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B. This reflects the model’s versatility across real-world speech tasks, such as robustness to multi-speaker environments, boardroom-style acoustics (e.g. the AMI dataset), and diverse accents (e.g. the VoxPopuli dataset).
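The 5.42% headline figure is consistent with an unweighted mean of the eight per-dataset scores in the leaderboard table in this section. The sketch below reproduces that arithmetic; the `macro_average` helper is illustrative only and is not part of any leaderboard tooling.

```python
# Per-dataset WER (%) for Cohere Transcribe, copied from the leaderboard table.
per_dataset_wer = {
    "AMI": 8.13,
    "Earnings22": 10.86,
    "GigaSpeech": 9.34,
    "LS clean": 1.25,
    "LS other": 2.37,
    "SPGISpeech": 3.08,
    "TED-LIUM": 2.49,
    "VoxPopuli": 5.87,
}


def macro_average(scores):
    """Unweighted mean: each dataset counts equally, regardless of its size."""
    return sum(scores.values()) / len(scores)


average_wer = macro_average(per_dataset_wer)
print(f"{average_wer:.2f}")  # prints 5.42
```

Averaging this way means a small dataset such as AMI moves the headline number as much as a large one, which is why a model can lead on average without leading on every column.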

Model | Average WER | AMI | Earnings22 | GigaSpeech | LS clean | LS other | SPGISpeech | TED-LIUM | VoxPopuli
Cohere Transcribe | 5.42 | 8.13 | 10.86 | 9.34 | 1.25 | 2.37 | 3.08 | 2.49 | 5.87
Zoom Scribe v1 | 5.47 | 10.03 | 9.53 | 9.61 | 1.63 | 2.81 | 1.59 | 3.22 | 5.37
IBM Granite 4.0 1B Speech | 5.52 | 8.44 | 8.48 | 10.14 | 1.42 | 2.85 | 3.89 | 3.10 | 5.84
NVIDIA Canary Qwen 2.5B | 5.63 | 10.19 | 10.45 | 9.43 | 1.61 | 3.10 | 1.90 | 2.71 | 5.66
Qwen3-ASR-1.7B | 5.76 | 10.56 | 10.25 | 8.74 | 1.63 | 3.40 | 2.84 | 2.28 | 6.35
ElevenLabs Scribe v2 | 5.83 | 11.86 | 9.43 | 9.11 | 1.54 | 2.83 | 2.68 | 2.37 | 6.80
Kyutai STT 2.6B | 6.40 | 12.17 | 10.99 | 9.81 | 1.70 | 4.32 | 2.03 | 3.35 | 6.79
OpenAI Whisper Large v3 | 7.44 | 15.95 | 11.29 | 10.02 | 2.01 | 3.91 | 2.94 | 3.86 | 9.54
Voxtral Mini 4B Realtime 2602 | 7.68 | 17.07 | 11.84 | 10.38 | 2.08 | 5.52 | 2.42 | 3.79 | 8.34

Image 2: Hugging Face Open ASR Leaderboard as of 03.26.2026. It is a widely used, standardized benchmark that evaluates automatic speech recognition systems across curated datasets using word error rate (WER) as the primary metric, calculated on normalized reference-hypothesis alignments, where lower WER indicates higher transcription fidelity. View the live leaderboard here.
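For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal sketch of that calculation. The `normalize` function is a simplified stand-in (lowercasing and punctuation stripping) and is an assumption, not the leaderboard's actual text normalizer; the example sentences are made up.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation. A simplified, assumed normalizer."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Rolling-row Levenshtein distance over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def wer(pairs):
    """Corpus-level WER: total word errors over total reference words."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


pairs = [
    ("The quick brown fox.", "the quick brown box"),  # 1 substitution, 4 words
    ("Hello world", "hello there world"),             # 1 insertion, 2 words
]
print(f"{wer(pairs):.1%}")  # prints 33.3%
```

Because insertions count as errors but do not add reference words, WER can exceed 100% on badly hallucinated output.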

Critically, these gains are not limited to benchmark datasets. We see the same state-of-the-art performance in human evaluation, where trained reviewers assess transcription quality on real-world audio for accuracy, consistency, and usability. The agreement between both evaluation methods reinforces that Cohere Transcribe’s performance reliably translates from controlled benchmarks to practical enterprise settings.

Bar chart showing transcription win rates (%) by model: ElevenLabs Scribe v2 (51%), Qwen3-ASR-1.7B (55%), Voxtral Mini 3B Realtime 2507 (55%), Zoom Scribe v1 (56%), OpenAI Whisper Large v3 (64%), NVIDIA Canary Qwen 2.5B (67%), IBM Granite 4.0 1B Speech (78%), with an average of 61%.
Image 3: Human preference evaluation of model transcripts in English. In pairwise comparisons, annotators were asked to express preferences for generations that primarily preserved meaning – but also avoided hallucinations, correctly identified named entities, and provided verbatim transcripts with proper formatting. A score of 50% or greater indicates that Cohere Transcribe's transcripts were preferred on average over those of the model it was compared against.
Bar chart showing transcription win rates (%) against three ASR models (Qwen3-ASR-1.7B, OpenAI Whisper Large v3, and Voxtral Mini 4B Realtime) in six languages: Italian (60%, 55%, 58%), French (51%, 51%, 54%), German (44%, 52%, 49%), Spanish (48%, 52%, 43%), Portuguese (48%, 41%, 40%), and Japanese (70%, 66%, 64%).
Image 4: Human evaluation of ASR accuracy for a selection of supported languages. A score of 50% or greater indicates that Cohere Transcribe's transcripts were preferred on average over those of the model it was compared against.
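A pairwise win rate of this kind can be computed from per-sample annotator judgments roughly as follows. This is a sketch under stated assumptions: the label names (`"ours"`, `"theirs"`, `"tie"`), the choice to count a tie as half a win, and the sample counts are all hypothetical. The source does not say how ties were handled or how many samples were judged.

```python
from collections import Counter


def win_rate(judgments):
    """Fraction of pairwise comparisons won, counting each tie as half a win.

    `judgments` is an iterable of "ours", "theirs", or "tie" labels
    (hypothetical label names chosen for this example).
    """
    counts = Counter(judgments)
    total = counts["ours"] + counts["theirs"] + counts["tie"]
    if total == 0:
        raise ValueError("no judgments provided")
    return (counts["ours"] + 0.5 * counts["tie"]) / total


# Made-up example: 11 wins, 7 losses, 2 ties out of 20 comparisons.
judgments = ["ours"] * 11 + ["theirs"] * 7 + ["tie"] * 2
print(f"{win_rate(judgments):.0%}")  # prints 60%
```

Under this convention, 50% means annotators had no overall preference between the two systems, which matches how the captions above read the threshold.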

Throughput

In production settings, ASR systems must operate under strict latency and throughput constraints. However accurate it is, slow or resource-intensive transcription can directly impact user experience, operational efficiency, and cost.

Cohere Transcribe extends the Pareto frontier, delivering state-of-the-art accuracy (low WER) while maintaining best-in-class throughput (high RTFx) within the 1B+ parameter model group.
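A model sits on the Pareto frontier when no other model is at least as good on both axes and strictly better on one. The sketch below shows that test for the (WER, RTFx) trade-off. The model names and numbers are placeholders invented for illustration and are not measurements from this post.

```python
def pareto_frontier(models):
    """Return names of models not dominated on (lower WER, higher RTFx).

    `models` maps a name to a (wer, rtfx) tuple.
    """
    frontier = []
    for name, (wer, rtfx) in models.items():
        dominated = any(
            o_wer <= wer and o_rtfx >= rtfx and (o_wer < wer or o_rtfx > rtfx)
            for other, (o_wer, o_rtfx) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (WER %, RTFx) points, not real benchmark results.
models = {
    "model_a": (5.4, 500.0),  # most accurate
    "model_b": (5.6, 400.0),  # worse than model_a on both axes
    "model_c": (6.4, 900.0),  # fastest
    "model_d": (7.4, 150.0),  # worse than model_a on both axes
}
print(pareto_frontier(models))  # prints ['model_a', 'model_c']
```

"Extending" the frontier then means adding a point that no existing model dominates, so that at some throughput level a lower WER becomes available than before.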

Scatter plot comparing seven ASR models by word error rate (accuracy, lower is better) versus throughput. Cohere Transcribe, NVIDIA Canary Qwen 2.5B, and IBM Granite show high throughput at low error rates, while Whisper Large v3 and Voxtral Realtime have high error rates at low throughput.
Image 5: Throughput (RTFx) versus accuracy (WER) plot for key models larger than 1B in size. RTFx (real-time factor multiple) measures how fast an audio model processes its input relative to real time.
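Concretely, RTFx is the duration of the audio divided by the wall-clock time spent transcribing it, so an RTFx of 300 means ten minutes of audio are processed in two seconds. A minimal sketch of measuring it follows. The `transcribe` argument is a hypothetical callable standing in for whatever inference function is in use, and the durations in the example are made up.

```python
import time


def rtfx(audio_seconds, processing_seconds):
    """Real-time factor multiple: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds


def measure_rtfx(transcribe, audio, audio_seconds):
    """Time one call to `transcribe` (a hypothetical callable) and return its RTFx."""
    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start
    return rtfx(audio_seconds, elapsed)


# Made-up example: 600 s of audio transcribed in 2 s of wall-clock time.
print(rtfx(600.0, 2.0))  # prints 300.0
```

For a benchmark-level figure, total audio duration is usually divided by total processing time across the whole test set rather than averaging per-file ratios, since short files would otherwise be over-weighted.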

“We are really impressed with what Cohere has created with Transcribe. The speed is extraordinary – turning minutes of audio into a usable transcript in seconds – and it immediately opens up new possibilities for real-time products and workflows.

In our testing, the model handled everyday speech very well and delivered strong, reliable transcription quality. The overall experience has been intuitive and easy to work with. We are excited to partner with Cohere and continue exploring what we can create with this technology.”

Paige Dickey, Vice President, Radical Ventures

Zero to one, and beyond.

We are working towards a deeper integration of Cohere Transcribe with North, Cohere’s AI agent orchestration platform. With planned updates, Cohere Transcribe will evolve from a high-accuracy transcription model into a comprehensive foundation for enterprise speech intelligence.

Get started

Cohere Transcribe is now available for download on Hugging Face. Follow the setup instructions to run the model locally, or even in an edge environment.

You can also access Cohere Transcribe through our API for free, low-setup use, subject to rate limits. See the documentation for usage details and integration guidance.

For production deployments without rate limits, provision a dedicated Model Vault. This enables low-latency, private cloud inference without having to manage infrastructure. Pricing is based on an hourly rate, with discounted plans for longer-term commitments. Contact our team to discuss your requirements.

Major Contributors: Julian Mack (Member of Technical Staff), Ekagra Ranjan (Member of Technical Staff), Cassie Kao (Product Manager), Bharat Venkatesh (Manager of Technical Staff), Pierre Harvey Richmond (Manager of Technical Staff).


