Improving Trajectory Modelling for DNN-based Speech Synthesis by using Stacked Bottleneck Features and Minimum Generation Error Training

Wu, Zhizheng; King, Simon

doi:10.1109/TASLP.2016.2551865

arXiv:1602.06727 (cs)

[Submitted on 22 Feb 2016 (v1), last revised 5 Apr 2016 (this version, v3)]

Abstract: We propose two novel techniques, stacking bottleneck features and a minimum generation error training criterion, to improve the performance of deep neural network (DNN)-based speech synthesis. The techniques address two related issues within current typical DNN-based synthesis frameworks: frame-by-frame independence, and neglect of the relationship between static and dynamic features. Stacking bottleneck features, which are an acoustically informed linguistic representation, provides an efficient way to include more detailed linguistic context at the input. The minimum generation error training criterion minimises the overall output trajectory error across an utterance, rather than minimising the error of each frame independently, and thus takes into account the interaction between static and dynamic features. The two techniques can be easily combined to further improve performance. We present both objective and subjective results that demonstrate the effectiveness of the proposed techniques. The subjective results show that combining the two techniques leads to significantly more natural synthetic speech than that produced by conventional DNN or long short-term memory (LSTM) recurrent neural network (RNN) systems.
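The two ideas in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation, and every name in it (`stack_context`, `delta_matrix`, `generation_error`) is hypothetical. It assumes a one-dimensional static feature, a simple delta window of 0.5 * (c[t+1] - c[t-1]) with clamped edges, unit variances in parameter generation, and edge-repetition padding when stacking; none of these choices is stated in the abstract.

```python
import numpy as np

def stack_context(bottleneck, context=2):
    """Concatenate each frame's bottleneck vector with its +/- `context` neighbours.
    Edge frames are padded by repeating the first/last frame (an assumption)."""
    T, _ = bottleneck.shape
    padded = np.pad(bottleneck, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + T] for k in range(2 * context + 1)])

def delta_matrix(T):
    """Build W mapping a static trajectory c of length T to interleaved
    [static_0, delta_0, static_1, delta_1, ...] observations of length 2T."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                          # static row
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5     # delta row, clamped at the edges
        W[2 * t + 1, max(t - 1, 0)] -= 0.5
    return W

def generate_trajectory(pred, W):
    """Least-squares parameter generation with unit variances (a simplification):
    c = argmin_c ||W c - pred||^2."""
    c, *_ = np.linalg.lstsq(W, pred, rcond=None)
    return c

def generation_error(pred, natural_static):
    """Utterance-level squared error between the generated and natural trajectories."""
    W = delta_matrix(natural_static.shape[0])
    c = generate_trajectory(pred, W)
    return float(np.sum((c - natural_static) ** 2))

# Toy utterance: predictions whose static part is perfect but whose deltas are biased.
natural = np.sin(np.linspace(0.0, 3.0, 20))
W = delta_matrix(natural.shape[0])
consistent_pred = W @ natural
biased_pred = consistent_pred.copy()
biased_pred[1::2] += 0.3                           # corrupt only the delta predictions

static_frame_error = float(np.sum((biased_pred[0::2] - natural) ** 2))
trajectory_error = generation_error(biased_pred, natural)
```

In this toy case the per-frame error on the static features is zero, yet the generated trajectory deviates from the natural one because the delta predictions are inconsistent with the statics. A trajectory-level criterion penalises that inconsistency, while a per-frame criterion on the statics alone does not see it.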

Comments:	submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing 2016 (AQ)
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Cite as:	arXiv:1602.06727 [cs.SD]
	(or arXiv:1602.06727v3 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.1602.06727
Related DOI:	https://doi.org/10.1109/TASLP.2016.2551865

Submission history

From: Zhizheng Wu
[v1] Mon, 22 Feb 2016 11:11:04 UTC (192 KB)
[v2] Mon, 4 Apr 2016 11:18:07 UTC (199 KB)
[v3] Tue, 5 Apr 2016 11:31:02 UTC (199 KB)
