wav2vec: Unsupervised Pre-training for Speech Recognition
Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli
TLDR
wav2vec introduces an unsupervised pre-training method for speech recognition that learns audio representations from raw data, significantly improving performance with limited labeled data.
Key contributions
- Proposes a simple multi-layer convolutional neural network pre-trained on unlabeled audio via a noise contrastive binary classification task: distinguishing a true future latent representation from distractor samples.
- Demonstrates up to a 36% reduction in word error rate (WER) over a strong character-based log-mel filterbank baseline on WSJ when only a few hours of transcribed data are available.
- Achieves 2.43% WER on the nov92 test set, outperforming Deep Speech 2, the best reported character-based system, while using two orders of magnitude less labeled training data.
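The contrastive objective above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: for each time step, a context vector should assign a higher score to the true future latent than to distractor latents drawn from the same sequence, trained as binary classification. The function name, shapes, and the omission of the paper's step-specific transform h_k are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def contrastive_loss(context, latents, k=1, n_negatives=10):
    """Noise contrastive binary classification, wav2vec-style (sketch).

    context: (T, D) context vectors c_t; latents: (T, D) latents z_t.
    For each t, c_t should score the true future latent z_{t+k} as
    positive and randomly drawn distractor latents as negative.
    (Hypothetical shapes; the paper also applies a step-specific
    affine transform h_k to c_t, omitted here for brevity.)
    """
    T, _ = latents.shape
    losses = []
    for t in range(T - k):
        c = context[t]                       # context vector c_t
        pos = latents[t + k]                 # true future latent z_{t+k}
        # distractors sampled uniformly from the same sequence
        neg_idx = rng.integers(0, T, size=n_negatives)
        negs = latents[neg_idx]
        pos_logit = c @ pos
        neg_logits = negs @ c
        # binary cross-entropy: positive toward 1, negatives toward 0
        # (note sigmoid(-x) == 1 - sigmoid(x))
        loss = -np.log(sigmoid(pos_logit)) - np.log(sigmoid(-neg_logits)).sum()
        losses.append(loss)
    return float(np.mean(losses))
```

In the paper, the resulting pre-trained representations replace log-mel filterbank features as input to the supervised acoustic model.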
Why it matters
This paper matters because it shows that unsupervised pre-training on raw audio can drastically reduce the need for expensive labeled speech data while improving recognition accuracy, making speech technology more accessible and scalable.
Original Abstract
We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature while using two orders of magnitude less labeled training data.