Bilingual ASR for dialects, code-switching, and songs – MiMo-V2.5 Voice

Whisper changed what people expected from open-source ASR. Three years later, the leaderboard looks very different.

What it is: MiMo-V2.5-ASR is Xiaomi MiMo’s 8B open-source speech recognition model, MIT-licensed and available on Hugging Face, built for bilingual Chinese-English transcription across dialects, noisy audio, code-switched speech, and song lyrics.

Problem: Most ASR models are benchmarked on clean studio data but deployed in the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products quietly fail.

Solution: Staged training that combines large-scale mid-training, supervised fine-tuning, and reinforcement learning, specifically targeting the scenarios where conventional models break down. Punctuation inferred from prosody means the transcripts come out ready to use.

What makes it different: On the Open ASR leaderboard, MiMo-V2.5-ASR posts a 5.73% average English WER, versus 7.44% for Whisper Large-v3. On WooBid it scores 19.55% vs. FunASR-1.5’s 29.08%; on lyrics, 3.95% on M4Singer vs. 4.25% for Gemini 2.5 Pro. These are not cherry-picked scenarios – they are the difficult ones.
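WER, the metric behind these numbers, is word-level edit distance (substitutions + deletions + insertions) divided by the reference word count. A minimal sketch of the computation – not the leaderboard’s exact scoring pipeline, which also applies text normalization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of four reference words:
print(wer("switch languages mid sentence", "switch language mid sentence"))  # 0.25
```

A 5.73% average WER means roughly one word-level error per 17 reference words, averaged across the leaderboard’s test sets.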

Key features:

  • Native support for eight Chinese dialects, including Wu, Cantonese, Hokkien, and Sichuanese

  • Chinese-English code-switching without any language tags

  • Song transcription under accompaniment and pitch variation

  • Robustness to multi-speaker and noisy environments

  • Native punctuation, no post-processing required

  • MIT license, Python API, Gradio demo, self-hostable
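Because code-switched output carries no language tags, a downstream step that needs language spans has to derive them from the text itself. A rough sketch of one way to do that (the helper is my own, not part of the model’s API), splitting a mixed transcript into contiguous Chinese and non-Chinese runs by Unicode script:

```python
import re

# CJK Unified Ideographs; a script heuristic, not full language identification.
CJK = re.compile(r'[\u4e00-\u9fff]+')

def script_spans(text: str) -> list[tuple[str, str]]:
    """Split text into ("zh", chunk) / ("other", chunk) runs by script."""
    spans, pos = [], 0
    for m in CJK.finditer(text):
        if m.start() > pos:
            chunk = text[pos:m.start()].strip()
            if chunk:
                spans.append(("other", chunk))
        spans.append(("zh", m.group()))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        spans.append(("other", tail))
    return spans

print(script_spans("我们下周 deploy 新的 ASR model"))
# [('zh', '我们下周'), ('other', 'deploy'), ('zh', '新的'), ('other', 'ASR model')]
```

This kind of post-hoc tagging is only needed if a later stage (say, per-language TTS or translation) wants explicit spans; the transcript itself is usable as-is.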

Benefits:

  • Production-grade accuracy on audio conditions that actually exist in the field

  • One model replaces multiple regional or domain-specific ASR solutions

  • Self-hosting eliminates per-call API costs and keeps data on your infra

  • Ready-to-use punctuation output cuts one step from each downstream pipeline

Who it’s for: ML engineers and voice product teams building bilingual or Chinese-language transcription pipelines that need accuracy outside the lab.

Open-source ASR has been catching up to closed models for years. MiMo-V2.5-ASR is a data point showing that the gap is now much smaller – and in some scenarios, gone.


