ArXiv TLDR

Text-Utilization for Encoder-dominated Speech Recognition Models

2604.26514

Albert Zeyer, Tim Posielek, Ralf Schlüter, Hermann Ney

cs.CL cs.AI cs.NE

TLDR

This paper improves speech recognition by efficiently using text-only data in encoder-dominated models, showing that simple configurations often outperform more complex alternatives.

Key contributions

  • Investigates efficient text-only data utilization for encoder-dominated speech recognition models.
  • Compares modality matching and dynamic downsampling for integrating text data into the encoder.
  • Shows larger encoders paired with smaller decoders perform comparably to, or better than, setups with larger decoders.
  • Demonstrates simple configurations, like random duration models, often outperform complex alternatives.
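To make the "random duration model" idea concrete: when injecting text-only data into a speech encoder, each text token can be repeated a random number of times so the upsampled sequence mimics the frame rate of speech features. The sketch below is illustrative only — the function name, duration range, and use of NumPy are assumptions, not the authors' exact recipe.

```python
import numpy as np

def upsample_with_random_durations(token_embeddings, min_dur=1, max_dur=4, seed=0):
    """Repeat each text-token embedding a random number of times so the
    resulting sequence roughly matches speech-feature frame rates.
    Illustrative sketch; duration range is a hypothetical choice."""
    rng = np.random.default_rng(seed)
    # One random duration per token, sampled uniformly in [min_dur, max_dur]
    durations = rng.integers(min_dur, max_dur + 1, size=len(token_embeddings))
    # Repeat each embedding row according to its sampled duration
    upsampled = np.repeat(token_embeddings, durations, axis=0)
    return upsampled, durations
```

The appeal of such a scheme is that it needs no learned duration predictor — randomness alone bridges the length mismatch between text tokens and speech frames.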

Why it matters

This research offers practical strategies to enhance speech recognition by leveraging text data more effectively, particularly for faster, encoder-heavy models. It simplifies training pipelines by showing that simpler methods can yield superior results, making advanced speech models more accessible and efficient.

Original Abstract

This paper investigates efficient methods for utilizing text-only data to improve speech recognition, focusing on encoder-dominated models that facilitate faster recognition. We provide a comprehensive comparison of techniques to integrate text-only data, including modality matching and dynamic downsampling to reach text-level representations within the encoder. Our experiments on the LibriSpeech corpus show that a larger encoder with a smaller decoder can equal or surpass the performance of architectures with larger decoders. We demonstrate that simple configurations, such as random duration models, are often more effective than complex alternatives, significantly simplifying the training pipeline. All code and recipes are made publicly available.
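The abstract's "dynamic downsampling to reach text-level representations" can be pictured as pooling runs of encoder frames into single vectors at predicted segment boundaries. The snippet below is a minimal sketch under assumed conventions (a 0/1 boundary mask marking segment starts, simple average pooling); it is not the paper's published implementation.

```python
import numpy as np

def downsample_by_boundaries(frames, boundaries):
    """Average-pool encoder frames into segments delimited by a boundary
    mask (1 = this frame starts a new segment; boundaries[0] must be 1),
    shrinking the sequence toward text-token length.
    Illustrative sketch of dynamic downsampling, not the exact method."""
    segment_ids = np.cumsum(boundaries) - 1   # segment index for each frame
    n_seg = segment_ids[-1] + 1
    pooled = np.zeros((n_seg, frames.shape[1]))
    counts = np.zeros(n_seg)
    for t, s in enumerate(segment_ids):
        pooled[s] += frames[t]
        counts[s] += 1
    return pooled / counts[:, None]           # mean over each segment
```

In an encoder-dominated model, compressing the sequence this way inside the encoder lets a small decoder operate over a text-length sequence, which is where the text-only data can be most directly shared.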
