Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

Karlapati, Sri; Abbas, Ammar; Hodari, Zack; Moinet, Alexis; Joly, Arnaud; Karanasou, Penny; Drugman, Thomas

arXiv:2011.02252 (eess)

[Submitted on 4 Nov 2020]

Abstract: In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of $13.2\%$ in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.
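The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the Kathaka architecture: the per-bin statistics "encoder", the linear "sampler", and every function name below are hypothetical stand-ins chosen for illustration. Stage I is pictured as mapping a mel-spectrogram to a sentence-level Gaussian and drawing a latent from it; Stage II is pictured as predicting a point in that latent space from text-derived context features (standing in for the BERT and graph-attention features the abstract mentions).

```python
import math
import random


def encode_prosody(mel):
    """Toy Stage I stand-in (assumption): summarise a mel-spectrogram,
    given as a list of frames, into the mean and log-variance of a
    diagonal Gaussian over a sentence-level latent."""
    n_frames = len(mel)
    n_bins = len(mel[0])
    mean = [sum(frame[b] for frame in mel) / n_frames for b in range(n_bins)]
    var = [
        sum((frame[b] - mean[b]) ** 2 for frame in mel) / n_frames
        for b in range(n_bins)
    ]
    # Small constant keeps the logarithm finite for constant bins.
    log_var = [math.log(v + 1e-6) for v in var]
    return mean, log_var


def sample_latent(mean, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) per dimension."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


def predict_latent(context, weights):
    """Toy Stage II stand-in (assumption): a linear map from context
    features to a point in the Stage I latent space."""
    return [sum(w * c for w, c in zip(row, context)) for row in weights]


# Usage with made-up numbers: two frames, two mel bins.
mel = [[0.0, 1.0], [2.0, 3.0]]
mean, log_var = encode_prosody(mel)
z_train = sample_latent(mean, log_var, random.Random(0))

# At synthesis time no mel-spectrogram exists, so the latent is
# predicted from text-derived context instead (identity weights here).
z_infer = predict_latent([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
```

The point of the split is that the latent comes from audio during training but must come from text alone at synthesis time, which is why a second, text-conditioned stage is needed.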

Comments:	5 pages and 3 figures
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2011.02252 [eess.AS]
	(or arXiv:2011.02252v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2011.02252

Submission history

From: Sri Karlapati
[v1] Wed, 4 Nov 2020 12:20:21 UTC (272 KB)
