StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction

Qianheng Xu

Abstract: More than 70 million people worldwide experience stuttering, yet most automated speech systems misinterpret incoherent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on hand-crafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often increase distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech by jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, while StutterFormer integrates a dual-stream transformer with shared acoustic-linguistic representation. Both architectures are trained on synthesized paired stuttering-fluent data from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero achieved a 24% reduction in Word Error Rate (WER) and a 31% improvement in Semantic Similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results with a 28% reduction in WER and a 34% improvement in BERTScore. The results directly validate the feasibility of end-to-end stuttering-to-fluent speech conversion, providing new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.
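For readers unfamiliar with the headline metric, the following is a minimal sketch of how Word Error Rate is conventionally computed: word-level edit distance divided by the number of reference words. It is not the paper's evaluation code. The function names `word_error_rate` and `relative_reduction`, the example sentences, and the numeric values are all hypothetical, and the abstract does not state whether the reported reductions are relative or absolute; the sketch assumes relative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative drop from a baseline error rate (assumed interpretation)."""
    return (baseline - improved) / baseline


# Hypothetical example: word repetitions count as insertions.
fluent_reference = "i want to go home"
stuttered_transcript = "i i want to to go home"
print(word_error_rate(fluent_reference, stuttered_transcript))

# Hypothetical error rates chosen only to illustrate the arithmetic.
print(relative_reduction(0.50, 0.38))
```

Under this reading, a system that removes the repeated words before transcription lowers the edit distance to the fluent reference, which is what a WER reduction measures.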

Submission History

From: Qianheng Xu
[v1] Tue, 21 Oct 2025 17:54:36 UTC (9,663 KB)
[v2] Wed, 5 Nov 2025 00:00:48 UTC (9,657 KB)


