When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
Pehuén Moure, Niclas Pokel, Bilal Bounajma, Yingqiang Gao, Roman Boehringer, et al.
TLDR
Audio-language models struggle to leverage clinical context for dysarthric speech recognition, but context-dependent fine-tuning significantly improves performance.
Key contributions
- Current ALMs fail to meaningfully use clinical context for dysarthric ASR, often degrading performance.
- Introduced a new benchmark on the SAP dataset to test context utilization for dysarthric speech.
- Context-dependent LoRA fine-tuning achieved a WER of 0.066, a 52% relative reduction over the frozen baseline.
- Fine-tuning showed significant gains for speakers with Down syndrome and mild-severity dysarthria.
Why it matters
This paper highlights a critical limitation of current audio-language models in handling dysarthric speech, even with clinical context. It provides a crucial benchmark and a promising fine-tuning approach to make ASR more inclusive for atypical speech.
Original Abstract
Automatic speech recognition (ASR) systems remain brittle on dysarthric and other atypical speech. Recent audio-language models raise the possibility of improving performance by conditioning on additional clinical context at inference time, but it is unclear whether these models can make use of such information. We introduce a benchmark built on the Speech Accessibility Project (SAP) dataset that tests whether diagnosis labels, clinician-derived speech ratings, and progressively richer clinical descriptions improve transcription accuracy for dysarthric speech. Across matched comparisons on nine models, we find that current models do not meaningfully use this context: diagnosis-informed and clinically detailed prompts yield negligible improvements and often degrade word error rate. We complement the prompting analysis with context-dependent fine-tuning, showing that LoRA adaptation with a mixture of clinical prompt formats achieves a WER of 0.066, a 52% relative reduction over the frozen baseline, while preserving performance when context is unavailable. Subgroup analyses reveal significant gains for Down syndrome and mild-severity speakers. These results clarify where current models fall short and provide a testbed for measuring progress toward more inclusive ASR.
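The headline numbers above are word error rates (WER): the word-level edit distance between a reference transcript and the model's hypothesis, normalized by the reference length. As a minimal illustrative sketch (not code from the paper), WER can be computed with a standard dynamic-programming edit distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of four -> WER of 0.25
print(wer("the quick brown fox", "the quick brown box"))  # → 0.25
```

A "relative reduction" compares two such scores: `(baseline_wer - adapted_wer) / baseline_wer`, so the paper's 0.066 after LoRA adaptation is roughly half the frozen baseline's error rate.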