
Learning and Modeling Unit Embeddings Using Deep Neural Networks for Unit-Selection-Based Mandarin Speech Synthesis

Published: 09 January 2020

Abstract

A method of learning and modeling unit embeddings using deep neural networks (DNNs) is presented in this article for unit-selection-based Mandarin speech synthesis. Here, a unit embedding is defined as a fixed-length vector representing a phone-sized unit candidate in a corpus. Modeling phone-sized embedding vectors instead of frame-sized acoustic features can better capture the long-term dependencies among consecutive units in an utterance. First, a DNN with an embedding layer is built to learn the embedding vectors of all unit candidates in the corpus from scratch. To enable the extracted embedding vectors to carry both acoustic and linguistic information about unit candidates, a multitarget learning strategy is designed for this DNN; its optional prediction targets include frame-level acoustic features, unit durations, monophone and tone identifiers, and context classes. Then, two further DNNs are constructed to map linguistic features to the extracted embedding vectors. One of them takes as input the unit vectors of the preceding phones in addition to the linguistic features of the current phone. At synthesis time, the distances between the unit vectors predicted by these two DNNs and the ones derived from unit candidates are used as part of the target cost and part of the concatenation cost, respectively. Our experiments on a Mandarin speech synthesis corpus demonstrate that learning and modeling unit embeddings improves the naturalness of hidden Markov model (HMM)-based unit-selection speech synthesis. Furthermore, according to our subjective evaluation results, integrating multiple targets for learning unit embeddings achieves better performance than using acoustic targets alone.
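
To make the approach concrete, the following is a minimal PyTorch sketch of the multitarget unit-embedding network described above. It is an illustrative reconstruction rather than the authors' implementation: the lookup-table embedding over unit-candidate indices, all layer sizes, feature dimensions, target inventories, and the Euclidean distance used for the costs are assumptions consistent with the abstract, and every identifier below is hypothetical.

import torch
import torch.nn as nn

class UnitEmbeddingModel(nn.Module):
    # One trainable vector per unit candidate (the "embedding layer"),
    # learned from scratch by backpropagating through several optional
    # prediction heads (the multitarget learning strategy).
    def __init__(self, n_units, emb_dim=64, hidden=256, acoustic_dim=60,
                 n_phones=100, n_tones=5, n_context_classes=32):
        super().__init__()
        self.embed = nn.Embedding(n_units, emb_dim)
        # frame-level branch: embedding plus a normalized frame-position
        # input predicts frame-level acoustic features
        self.frame_net = nn.Sequential(
            nn.Linear(emb_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, acoustic_dim))
        # unit-level heads: duration, monophone and tone identifiers,
        # and context classes
        self.duration_head = nn.Linear(emb_dim, 1)
        self.phone_head = nn.Linear(emb_dim, n_phones)
        self.tone_head = nn.Linear(emb_dim, n_tones)
        self.context_head = nn.Linear(emb_dim, n_context_classes)

    def forward(self, unit_idx, frame_pos):
        # unit_idx: (batch,) candidate indices; frame_pos: (batch, 1)
        e = self.embed(unit_idx)
        return e, {
            "acoustic": self.frame_net(torch.cat([e, frame_pos], dim=-1)),
            "duration": self.duration_head(e),
            "phone": self.phone_head(e),
            "tone": self.tone_head(e),
            "context": self.context_head(e),
        }

def embedding_cost(predicted, candidate):
    # Distance between a DNN-predicted unit vector and a candidate's
    # stored vector, usable as the embedding-based component of the
    # target or concatenation cost (Euclidean distance is an assumption).
    return torch.norm(predicted - candidate, dim=-1)

After training, the rows of the lookup table are the unit embeddings; the two prediction DNNs mentioned above would then be trained separately to regress from linguistic features (optionally together with the embeddings of preceding phones) to these vectors, and a function like embedding_cost would supply the embedding-based components of the target and concatenation costs during the search over unit candidates.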

References

[1]
Alan W. Black and Nick Campbell. 1995. Optimising selection of units from speech databases for concatenative synthesis. In EUROSPEECH. International Speech Communication Association, 581--584.
[2]
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
[3]
Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop (SSW9 ’16). 125--125.
[4]
Andrew J. Hunt and Alan W. Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996 (ICASSP ’96), Conference Proceedings., Vol. 1. IEEE, 373--376.
[5]
Y. Jiang, X. Zhou, C. Ding, Ya-Jun Hu, Zhen-Hua Ling, and Li-Rong Dai. 2018. The USTC system for blizzard challenge 2018. In Blizzard Challenge Workshop.
[6]
H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne. 1999. Restructuring speech representations using pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication 27 (1999), 187--207.
[7]
Diederik P. Kingma and Jimmy Ba. 2014. ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[8]
Tao Lin. 1962. The relationship between light tone and syntactic structure of modern Chinese. Studies of the Chinese Language 7, 6 (1962), 301–304.
[9]
Zhen-Hua Ling, Shi-Yin Kang, Heiga Zen, Andrew Senior, Mike Schuster, Xiao-Jun Qian, Helen M. Meng, and Li Deng. 2015. Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Processing Magazine 32, 3 (2015), 35--52.
[10]
Zhen-Hua Ling and Ren-Hua Wang. 2006. HMM-based unit selection using frame sized speech segments. In 9th International Conference on Spoken Language Processing.
[11]
Zhen-Hua Ling and Ren-Hua Wang. 2007. HMM-based hierarchical unit selection combining Kullback-Leibler divergence with likelihood criterion. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2007 (ICASSP ’07), Vol. 4. IEEE, IV--1245.
[12]
Zhen-Hua Ling and Zhi-Ping Zhou. 2018. Unit selection speech synthesis using frame-sized speech segments and neural network based acoustic models. Journal of Signal Processing Systems 90, 7 (2018), 1053–1062.
[13]
Li-Juan Liu, C. Ding, Y. Jiang, M. Zhou, and S. Wei. 2017. The IFLYTEK system for blizzard challenge 2017. In Blizzard Challenge Workshop.
[14]
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov. (2008), 2579--2605.
[15]
Thomas Merritt, Robert A. J. Clark, Zhizheng Wu, Junichi Yamagishi, and Simon King. 2016. Deep neural network-guided unit selection synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’16), 2016. IEEE, 5145--5149.
[16]
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[17]
Antoine Perquin, Gwénolé Lecorvé, Damien Lolive, and Laurent Amsaleg. 2018. Phone-level embeddings for unit selection speech synthesis. In International Conference on Statistical Language and Speech Processing. Springer, 21--31.
[18]
Vincent Pollet, Enrico Zovato, Sufian Irhimeh, and Pier Batzu. 2017. Unit selection with hierarchical cascaded long short term memory bidirectional recurrent neural nets. In Proceedings of Interspeech 2017. 3966--3970.
[19]
Y. Qian, F. Soong, and Z.-J. Yan. 2013. A unified trajectory tiling approach to high quality speech rendering. IEEE Transactions on Audio, Speech and Language Processing 21, 2 (2013), 280--290.
[20]
Hiroaki Sakoe and Seibi Chiba. 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 1 (1978), 43--49.
[21]
Stan Salvador and Philip Chan. 2007. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis 11, 5 (2007), 561--580.
[22]
Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, Rif A. Saurous, Yannis Agiomvrgiannakis, and Yonghui Wu. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’18). IEEE, 4779–4783.
[23]
Koichi Shinoda and Takao Watanabe. 2000. MDL-based context-dependent subword modeling for speech recognition. Acoustical Science and Technology 21, 2 (2000), 79--86.
[24]
Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, and Tomoki Toda. 2017. Speaker-dependent WaveNet vocoder. In Interspeech. 1118--1122.
[25]
Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao Kobayashi, and Tadashi Kitamura. 2000. Speech parameter generation algorithms for HMM-based speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), Vol. 3. IEEE, 1315--1318.
[26]
Vincent Wan, Yannis Agiomyrgiannakis, Hanna Silen, and Jakub Vit. 2017. Google’s next-generation real-time unit-selection synthesizer using sequence-to-sequence LSTM-based autoencoders. In Proceedings of the Interspeech. 1143--1147.
[27]
Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King. 2015. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’15), 2015. IEEE, 4460--4464.
[28]
Xian-Jun Xia, Zhen-Hua Ling, Yuan Jiang, and Li-Rong Dai. 2014. HMM-based unit selection speech synthesis using log likelihood ratios derived from perceptual data. Speech Communication 63 (2014), 27--37.
[29]
T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. 1999. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In EUROSPEECH. 2347--2350.
[30]
Heiga Zen, Andrew Senior, and Mike Schuster. 2013. Statistical parametric speech synthesis using deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’13), 2013. IEEE, 7962--7966.
[31]
Heiga Zen, Keiichi Tokuda, and Alan W. Black. 2009. Statistical parametric speech synthesis. Speech Communication 51, 11 (2009), 1039--1064.
[32]
Xiao Zhou, Zhen-Hua Ling, Zhi-Ping Zhou, and Li-Rong Dai. 2018. Learning and modeling unit embeddings for improving HMM-based unit selection speech synthesis. In Proceedings of Interspeech 2018. 2509--2513.

Cited By

  • (2022) A robust voice spoofing detection system using novel CLS-LBP features and LSTM. Journal of King Saud University - Computer and Information Sciences 34, 9 (Oct. 2022), 7300-7312. DOI: 10.1016/j.jksuci.2022.02.024
  • (2022) Research on digital media animation control technology based on recurrent neural network using speech technology. International Journal of System Assurance Engineering and Management 13, S1 (Mar. 2022), 564-575. DOI: 10.1007/s13198-021-01540-x
  • (2021) UnitNet: A Sequence-to-Sequence Acoustic Model for Concatenative Speech Synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 2643-2655. DOI: 10.1109/TASLP.2021.3093823

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 3
    May 2020, 228 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3378675

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 January 2020
    Accepted: 01 October 2019
    Revised: 01 July 2019
    Received: 01 April 2019
    Published in TALLIP Volume 19, Issue 3

    Author Tags

    1. Speech synthesis
    2. deep neural network
    3. hidden Markov model
    4. multitarget learning
    5. unit embedding
    6. unit selection

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Key R&D Program of China
    • National Natural Science Foundation of China
    • Key Science and Technology Project of Anhui Province
