Bilingual ASR for dialects, code-switching, and songs – MiMo-V2.5 Voice

Whisper changed what people expected from open-source ASR. Three years later, the leaderboard looks very different.

What it is: MiMo-V2.5-ASR is Xiaomi MiMo’s 8B open-source speech recognition model, MIT-licensed and available on Hugging Face, built for bilingual Chinese-English transcription across dialects, noisy audio, code-switched speech, and song lyrics.

Problem: Most ASR models are benchmarked on clean studio data but deployed in the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products quietly fail.

Solution: Staged training that combines large-scale mid-training, supervised fine-tuning, and reinforcement learning, specifically targeting the scenarios where conventional models break down. Punctuation inferred from prosody means the transcripts come out ready to use.

What makes it different: On the Open ASR leaderboard, MiMo-V2.5-ASR posts a 5.73% average English WER, versus 7.44% for Whisper Large-v3. On WooBid it scores 19.55% vs. FunASR-1.5’s 29.08%; on lyrics, 3.95% on M4Singer vs. 4.25% for Gemini 2.5 Pro. These are not cherry-picked scenarios – they are the difficult ones.
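WER, the metric behind these numbers, is word-level edit distance (substitutions + deletions + insertions) divided by the reference word count. A minimal sketch of the computation – not the leaderboard’s exact scoring pipeline, which also applies text normalization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of four reference words:
print(wer("switch languages mid sentence", "switch language mid sentence"))  # 0.25
```

A 5.73% average WER means roughly one word-level error per 17 reference words, averaged across the leaderboard’s test sets.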

Key features:

  • Native support for eight Chinese dialects, including Wu, Cantonese, Hokkien, and Sichuanese

  • Chinese-English code-switching without any language tags

  • Song transcription under accompaniment and pitch variation

  • Robustness to multi-speaker and noisy environments

  • Native punctuation, no post-processing required

  • MIT license, Python API, Gradio demo, self-hostable
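Because code-switched output carries no language tags, a downstream step that needs language spans has to derive them from the text itself. A rough sketch of one way to do that (the helper is my own, not part of the model’s API), splitting a mixed transcript into contiguous Chinese and non-Chinese runs by Unicode script:

```python
import re

# CJK Unified Ideographs; a script heuristic, not full language identification.
CJK = re.compile(r'[\u4e00-\u9fff]+')

def script_spans(text: str) -> list[tuple[str, str]]:
    """Split text into ("zh", chunk) / ("other", chunk) runs by script."""
    spans, pos = [], 0
    for m in CJK.finditer(text):
        if m.start() > pos:
            chunk = text[pos:m.start()].strip()
            if chunk:
                spans.append(("other", chunk))
        spans.append(("zh", m.group()))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        spans.append(("other", tail))
    return spans

print(script_spans("我们下周 deploy 新的 ASR model"))
# [('zh', '我们下周'), ('other', 'deploy'), ('zh', '新的'), ('other', 'ASR model')]
```

This kind of post-hoc tagging is only needed if a later stage (say, per-language TTS or translation) wants explicit spans; the transcript itself is usable as-is.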

Benefits:

  • Production-grade accuracy on audio conditions that actually exist in the field

  • One model replaces multiple regional or domain-specific ASR solutions

  • Self-hosting eliminates per-call API costs and keeps data on your infra

  • Ready-to-use punctuation output cuts one step from each downstream pipeline

Who it’s for: ML engineers and voice product teams building bilingual or Chinese-language transcription pipelines that need accuracy outside the lab.

Open-source ASR has been catching up to closed models for years. MiMo-V2.5-ASR is a data point showing that the gap is now much smaller – and in some scenarios, gone.


