Tacotron: Towards End-to-End Speech Synthesis
Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss + 9 more
TLDR
Tacotron is an end-to-end text-to-speech model that synthesizes natural-sounding speech directly from text characters without requiring complex intermediate components.
Key contributions
- Introduces a sequence-to-sequence model that learns text-to-speech mapping from scratch using paired text and audio data.
- Achieves a mean opinion score (MOS) of 3.82 on a 5-point scale for US English, outperforming a production parametric system in naturalness.
- Generates speech at the frame level, enabling faster synthesis than sample-level autoregressive approaches.
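To give a rough sense of the frame-level point above, the sketch below counts how many autoregressive decoder steps are needed for a clip of a given length under each approach. The concrete numbers (a 24 kHz sample rate, a 12.5 ms frame shift, and two frames emitted per decoder step) and the function names are illustrative assumptions, not values stated in this summary. Step counts are also only a proxy for speed: a frame-level model still needs a separate stage to turn frames into a waveform.

```python
import math


def sample_level_steps(duration_s: float, sample_rate_hz: int) -> int:
    """One autoregressive step per audio sample."""
    return int(round(duration_s * sample_rate_hz))


def frame_level_steps(duration_s: float, frame_shift_ms: float,
                      frames_per_step: int = 1) -> int:
    """One autoregressive step per group of spectrogram frames."""
    n_frames = math.ceil(duration_s * 1000 / frame_shift_ms)
    return math.ceil(n_frames / frames_per_step)


if __name__ == "__main__":
    # All values below are assumptions chosen for illustration.
    duration_s = 2.0
    per_sample = sample_level_steps(duration_s, sample_rate_hz=24_000)
    per_frame = frame_level_steps(duration_s, frame_shift_ms=12.5,
                                  frames_per_step=2)
    print(per_sample)              # 48000
    print(per_frame)               # 80
    print(per_sample / per_frame)  # 600.0
```

Under these assumed settings, a two-second clip takes 80 decoder steps at the frame level versus 48,000 at the sample level, which is the intuition behind the speed claim.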
Why it matters
Tacotron simplifies the text-to-speech pipeline by replacing handcrafted intermediate modules with a single model trained end to end, which reduces the domain expertise required and avoids the brittle design choices such modules can introduce. By showing that this kind of model can produce natural-sounding speech efficiently, it paves the way for more accessible and scalable speech synthesis.
Original Abstract
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.