What is this: MiMo-V2.5-ASR is Xiaomi MiMo’s 8B open-source speech recognition model. It is MIT-licensed, available on Hugging Face, and designed for bilingual Chinese-English transcription across dialects, noisy audio, code-switched speech, and song lyrics.
Problem: Most ASR models are benchmarked on clean studio data but deployed in the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products quietly fail.
Solution: Staged training that combines large-scale mid-training, supervised fine-tuning, and a reinforcement learning stage, aimed specifically at the scenarios where traditional models break down. The model also infers basic punctuation from prosody, so transcripts are ready to use as they come out.
What makes it different: On the Open ASR Leaderboard, MiMo-V2.5-ASR posts a 5.73% average WER on English, versus 7.44% for Whisper Large-v3 (lower is better). On WooBid it scores 19.55% against FunASR-1.5’s 29.08%. On lyrics, it reaches 3.95% on M4Singer against 4.25% for Gemini 2.5 Pro. These are not cherry-picked easy cases; they are the difficult ones.
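For readers who want to see what these percentages measure, here is a minimal sketch of word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are illustrative and not part of any MiMo tooling. The whitespace tokenization is an assumption that suits English only; Chinese is commonly scored per character instead, and leaderboards also normalize text before scoring, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: whitespace tokenization (English only)
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which the candidate's error rate undercuts the baseline's."""
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # WooBid figures quoted above: 29.08% baseline vs 19.55%.
    print(round(relative_reduction(29.08, 19.55), 4))
```

Read this way, the WooBid figures quoted above correspond to roughly a one-third relative reduction in errors, which is a more telling number than the absolute gap alone.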
Key features:
- Eight Chinese dialects are natively supported, including Wu, Cantonese, Hokkien, and Sichuanese
- Chinese-English code-switching without any language tags
- Song transcription under accompaniment and pitch variation
- Robustness to multi-speaker and noisy environments
- Native punctuation, no post-processing required
- MIT License, Python API, Gradio demo, self-hosted
Benefits:
- Production-grade accuracy on audio conditions that actually exist in the field
- One model replaces multiple regional or domain-specific ASR solutions
- Self-hosting eliminates per-call API costs and keeps data on your infra
- Ready-to-use punctuation output cuts one step from each downstream pipeline
Who it’s for: ML engineers and voice product teams building bilingual or Chinese-language transcription pipelines that need to stay accurate outside the lab.
Open-source ASR has been catching up to closed models for years. MiMo-V2.5-ASR is a data point that the gap is now much smaller and, in some scenarios, has closed.