Evaluation of Automatic Speech Recognition Using Generative Large Language Models

2604.21928

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek + 4 more

cs.CL

TLDR

Generative LLMs significantly improve ASR evaluation by semantically assessing hypotheses, outperforming traditional WER and other semantic metrics.

Key contributions

  • Evaluates generative LLMs for semantic ASR evaluation using three distinct approaches.
  • LLMs achieve 92-94% human agreement for hypothesis selection, far surpassing WER's 63%.
  • Demonstrates generative LLM embeddings perform comparably to encoder models for semantic distance.
  • Offers a promising, interpretable direction for future semantic ASR evaluation methods.
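The hypothesis-selection figures above are agreement rates: for each pair of candidate transcripts, a metric "agrees" when it prefers the same candidate as the human annotator. A minimal sketch of that computation follows. The tuple layout, the tie handling, and the `word_mismatch` stand-in metric are assumptions made for illustration, not the paper's protocol or the HATS data format.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical item layout: (reference, hypothesis_a, hypothesis_b, human_choice),
# where human_choice is "a" or "b".
Item = Tuple[str, str, str, str]


def agreement_rate(items: Sequence[Item],
                   distance: Callable[[str, str], float]) -> float:
    """Fraction of items where the metric prefers the hypothesis the human chose.

    `distance(reference, hypothesis)` must return a lower-is-better score.
    """
    if not items:
        raise ValueError("items must not be empty")
    agree = 0
    for ref, hyp_a, hyp_b, human in items:
        d_a, d_b = distance(ref, hyp_a), distance(ref, hyp_b)
        if d_a == d_b:
            continue  # assumption: a tie counts as disagreement in this sketch
        metric_choice = "a" if d_a < d_b else "b"
        agree += metric_choice == human
    return agree / len(items)


def word_mismatch(ref: str, hyp: str) -> float:
    """Toy stand-in metric: position-wise word mismatches plus length difference."""
    r, h = ref.split(), hyp.split()
    return sum(x != y for x, y in zip(r, h)) + abs(len(r) - len(h))


# Made-up examples, not taken from HATS.
toy_items = [
    ("the cat sat", "the cat sat", "a dog sat", "a"),
    ("turn left now", "turn left cow", "turn right now", "a"),
]
print(agreement_rate(toy_items, word_mismatch))
```

Any lower-is-better scorer (WER, an embedding distance, or a wrapper around an LLM's pairwise choice) can be passed as `distance`, which is why the metric is a parameter here.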

Why it matters

This paper evaluates generative LLMs as ASR evaluators, addressing the limitations of meaning-insensitive metrics such as WER. By showing that LLM judgments agree closely with human annotators, it points toward more accurate and meaningful assessment of speech recognition systems. This could in turn lead to more robust and user-friendly ASR technologies.

Original Abstract

Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
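To make the "insensitive to meaning" point concrete, here is a small sketch of WER in its standard form: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The whitespace tokenization and the example sentences are assumptions for illustration; real evaluations typically normalize text first, and the sentences are not drawn from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "i do not like it"
meaning_flipped = "i do like it"      # drops "not", reversing the meaning
meaning_kept = "i don't like it"      # contraction, same meaning

print(wer(reference, meaning_flipped))  # 1 edit over 5 words
print(wer(reference, meaning_kept))     # 2 edits over 5 words
```

In this made-up pair, WER scores the hypothesis that reverses the meaning as better than the one that preserves it, which is the kind of case a semantic judgment is meant to handle.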
