Abstract
Text-to-speech (TTS) is currently a major research area within speech synthesis. It is used in conversational artificial intelligence (AI), chatbots, voice interaction, and a wide variety of other application scenarios. Thanks to the application of deep learning and end-to-end techniques, state-of-the-art TTS systems can synthesize speech that is close to a real human voice. However, their application to Mandarin Chinese is hindered by the diversity of pronunciations of Chinese characters and the complexity of Chinese grammar. This article follows the development of speech synthesis technology to review the research and development of Mandarin speech synthesis and discusses the open problems in current research. This work is intended to help further improve the performance of end-to-end Mandarin Chinese systems and to point out valuable research directions that can bring Mandarin speech synthesis research into a new stage.
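The pronunciation diversity mentioned above stems largely from polyphonic characters: a single character such as 行 is read xíng ("to walk") or háng ("row; bank") depending on the word it appears in, so a Mandarin TTS front end must disambiguate before phoneme conversion. A minimal sketch of this lookup idea is shown below; the tiny lexicon and word-keyed rules are hand-written toy data for illustration, not a real grapheme-to-phoneme system (which would use a full lexicon and a trained disambiguation model):

```python
# Toy illustration of polyphone disambiguation in a Mandarin TTS front end.
# Readings are written in pinyin with tone digits (e.g. "hang2" = háng).

# Context rules: a polyphonic character's reading keyed by the containing word.
POLYPHONES = {
    "行": {"银行": "hang2", "行走": "xing2"},
    "乐": {"音乐": "yue4", "快乐": "le4"},
}
# Fallback reading when no word-level rule matches.
DEFAULT = {"行": "xing2", "乐": "le4"}

def grapheme_to_phoneme(word: str) -> list[str]:
    """Return one pinyin reading per character, using the containing
    word as context to resolve polyphonic characters."""
    readings = []
    for ch in word:
        if ch in POLYPHONES:
            readings.append(POLYPHONES[ch].get(word, DEFAULT[ch]))
        else:
            readings.append("?")  # a real system consults a full lexicon here
    return readings

print(grapheme_to_phoneme("银行"))  # the same 行 resolves to "hang2" here...
print(grapheme_to_phoneme("行走"))  # ...and to "xing2" here
```

End-to-end Mandarin systems either keep such a rule/model-based front end or try to learn the disambiguation implicitly from character input, which is one of the difficulties this survey discusses.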
© 2021 Springer Nature Singapore Pte Ltd.
Gong, W., Hong, Y., Liu, H., He, Y. (2021). A Review of End-to-End Chinese – Mandarin Speech Synthesis Techniques. In: Cao, W., Ozcan, A., Xie, H., Guan, B. (eds) Computing and Data Science. CONF-CDS 2021. Communications in Computer and Information Science, vol 1513. Springer, Singapore. https://doi.org/10.1007/978-981-16-8885-0_31
Print ISBN: 978-981-16-8884-3
Online ISBN: 978-981-16-8885-0