Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

Weiss, Ron J.; Skerry-Ryan, RJ; Battenberg, Eric; Mariooryad, Soroosh; Kingma, Diederik P.


arXiv:2011.03568 (cs)

[Submitted on 6 Nov 2020 (v1), last revised 5 Feb 2021 (this version, v2)]



Abstract: We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within each block are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding blocks. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed.
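
The block-wise factorization described in the abstract can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the paper's architecture: the function names, the elementwise affine flow, and the `conditioner` callback are all hypothetical, and the conditioning here uses only the previous block, whereas the actual model conditions on text through a Tacotron-style decoder. The sketch only shows how a waveform is cut into non-overlapping fixed-length blocks and how a per-block flow likelihood (via the change-of-variables formula) is summed autoregressively over blocks.

```python
import numpy as np


def split_into_blocks(waveform, block_size):
    """Zero-pad a 1-D waveform and reshape it into non-overlapping blocks."""
    pad = (-len(waveform)) % block_size
    padded = np.pad(waveform, (0, pad))
    return padded.reshape(-1, block_size)


def affine_flow_log_likelihood(block, shift, log_scale):
    """Log-density of one block under a toy elementwise affine flow.

    The flow maps the block to z = (block - shift) * exp(-log_scale),
    with z assumed to follow a standard normal. By change of variables:
    log p(block) = log N(z; 0, I) - sum(log_scale).
    """
    z = (block - shift) * np.exp(-log_scale)
    log_prior = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))
    return log_prior - np.sum(log_scale)


def sequence_log_likelihood(blocks, conditioner):
    """Sum per-block log-likelihoods, conditioning each on the previous block.

    `conditioner` is a hypothetical stand-in for the decoder: it maps the
    previous block to the (shift, log_scale) parameters of the next flow.
    """
    total = 0.0
    prev = np.zeros(blocks.shape[1])
    for block in blocks:
        shift, log_scale = conditioner(prev)
        total += affine_flow_log_likelihood(block, shift, log_scale)
        prev = block
    return total


def toy_conditioner(prev_block):
    """Hypothetical conditioner: predict half the previous block, unit scale."""
    return 0.5 * prev_block, np.zeros_like(prev_block)
```

Calling `sequence_log_likelihood(split_into_blocks(x, 4), toy_conditioner)` on a waveform `x` returns a single scalar that could be maximized directly, which mirrors the maximum-likelihood training objective the abstract mentions. Because all blocks are known at training time, the per-block terms could also be evaluated in parallel rather than in a loop.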

Comments:	6 pages including supplement, 3 figures. Accepted to ICASSP 2021
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2011.03568 [cs.CL]
	(or arXiv:2011.03568v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2011.03568

Submission history

From: Ron J Weiss
[v1] Fri, 6 Nov 2020 19:30:07 UTC (506 KB)
[v2] Fri, 5 Feb 2021 19:07:32 UTC (716 KB)
