Cohere's open-weight ASR model hits 5.4% word error rate — low enough to replace speech APIs in production pipelines

Enterprises building voice-enabled workflows have had limited options for production-grade transcription: closed APIs with data residency risks, or open models that trade accuracy for deployability. Cohere's new open-weight ASR model, Transcribe, is built to compete on four key differentiators: contextual accuracy, latency, control, and cost.

Cohere says Transcribe outperforms existing leaders in terms of accuracy – and unlike closed APIs, it can run on an organization’s own infrastructure.

Transcribe, which can be accessed via Cohere's API or in Cohere's model vault as cohere-transcribe-03-2026, has 2 billion parameters and is licensed under Apache 2.0. The company said Transcribe's average word error rate (WER) is just 5.42%, meaning it makes fewer mistakes than comparable models.
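For readers unfamiliar with the metric, WER counts the word-level substitutions, insertions, and deletions needed to turn a transcript into the reference, divided by the reference length. A minimal sketch of the standard computation (Levenshtein edit distance over words; this is an illustration, not Cohere's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level Levenshtein edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit-distance table
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

By this measure, a 5.42% WER means roughly one word in eighteen is transcribed incorrectly on the benchmark test sets.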

It is trained on 14 languages: English, French, German, Italian, Spanish, Greek, Dutch, Polish, Portuguese, Chinese, Japanese, Korean, Vietnamese, and Arabic. The company did not disclose which Chinese variety the model was trained on.

Cohere said it trained the model “with a deliberate focus on reducing WER, while keeping production readiness in mind.” According to Cohere, the result is a model that enterprises can plug directly into voice-powered automation, transcription pipelines, and audio search workflows.

Self-hosted transcription for production pipelines

Until recently, enterprise transcription has been a trade-off: closed APIs offered accuracy but locked data into a vendor's infrastructure, while open models offered control but lagged in performance. Unlike Whisper, which launched as a research model under an MIT license, Transcribe is available for commercial use from release and can run on an organization's own local GPU infrastructure. Early users have found the commercially licensed, open-weight approach worthwhile for enterprise deployment.

Organizations can deploy Transcribe on their own instances, as Cohere said the model has a manageable inference footprint for local GPUs. The company said it was able to do this because the model “extends the Pareto frontier, delivering state-of-the-art accuracy (low WER) while maintaining best-in-class throughput (high RTFX) within the 1B+ parameter model group.”
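RTFX (inverse real-time factor) is the ratio of audio duration to wall-clock processing time; values above 1 mean a model transcribes faster than real time. A hedged sketch of how a team evaluating local deployment might measure it, where `transcribe` is a stand-in for any transcription callable, not a specific Cohere API:

```python
import time

def measure_rtfx(transcribe, audio, audio_seconds):
    """Time a transcription call and report RTFX (inverse real-time factor).

    RTFX = seconds of audio / seconds of wall-clock compute. An RTFX of 10
    means one hour of audio is transcribed in about six minutes.
    `transcribe` is any callable taking audio and returning text.
    """
    start = time.perf_counter()
    text = transcribe(audio)
    elapsed = time.perf_counter() - start
    return text, audio_seconds / elapsed
```

Benchmarking this way on the target GPU, rather than relying on vendor figures, is the usual sanity check before committing a transcription workload to local hardware.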

How does Transcribe stack up?

Transcribe competes with established speech models, including OpenAI's Whisper, which powers ChatGPT's voice feature, and ElevenLabs' Scribe, which many large retail brands deploy. It currently tops the Hugging Face Open ASR Leaderboard with an average word error rate of 5.42%, ahead of Whisper Large v3 at 7.44%, ElevenLabs Scribe v2 at 5.83%, and Qwen3-ASR-1.7b at 5.76%.

Transcribe also performed well on the other datasets in the Hugging Face benchmark suite. On the AMI dataset, which measures meeting transcription and dialogue analysis, Transcribe recorded a WER of 8.15%. On the VoxPopuli dataset, which tests understanding of varied pronunciations, the model scored 5.87%, surpassed only by Zoom Scribe.

Early users have flagged accuracy and local deployment as standout factors – especially for teams that are routing audio data through external APIs and want to bring that workload in-house.

For engineering teams building RAG pipelines or agent workflows with audio input, Transcribe provides a path to production-grade transcription without the data residency and latency penalties of closed APIs.
