
A Review of End-to-End Chinese–Mandarin Speech Synthesis Techniques

  • Conference paper

Computing and Data Science (CONF-CDS 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1513)

Abstract

Text-to-speech (TTS) is a major research area within speech synthesis. It is used in conversational artificial intelligence (AI), chatbots, voice interaction, and a wide variety of other application scenarios. Thanks to the adoption of deep learning and end-to-end techniques, state-of-the-art TTS systems can synthesize speech that is close to a real human voice. However, their application to Mandarin Chinese is hindered by the diversity of pronunciations of Chinese characters and the complexity of Chinese grammar. This article follows the development of speech synthesis technology to review the research on Mandarin speech synthesis and discusses the open problems in current work. This survey is intended to help further improve the performance of end-to-end Mandarin TTS systems and to point out valuable research directions that can bring Mandarin speech synthesis research into a new stage.
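The "diversity of pronunciations of Chinese characters" mentioned above refers to polyphones: a single character can have several pinyin readings, and the correct one usually depends on the word it appears in. The following sketch (not from the paper; the toy dictionaries are hand-made for illustration) shows why a naive per-character grapheme-to-phoneme (G2P) lookup fails and why real Mandarin front ends segment words first:

```python
# Toy illustration of the Mandarin polyphone problem in G2P conversion.
# Per-character dictionary: 行 has two common readings.
POLYPHONE_DICT = {
    "行": ["xing2", "hang2"],   # xíng "to walk" vs. háng "row / firm"
    "银": ["yin2"],
    "人": ["ren2"],
}

# Word-level overrides: the correct reading depends on the word,
# which is why real front ends do word segmentation before G2P.
WORD_DICT = {
    "银行": ["yin2", "hang2"],  # "bank" uses the háng reading
    "行人": ["xing2", "ren2"],  # "pedestrian" uses the xíng reading
}

def naive_g2p(text: str) -> list[str]:
    """Character-by-character lookup that always takes the first reading."""
    return [POLYPHONE_DICT[ch][0] for ch in text]

def word_aware_g2p(text: str) -> list[str]:
    """Greedy longest-match over the word dictionary, falling back to
    the per-character dictionary for unmatched characters."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest word first
            if text[i:j] in WORD_DICT:
                out.extend(WORD_DICT[text[i:j]])
                i = j
                break
        else:
            out.extend(naive_g2p(text[i]))
            i += 1
    return out

print(naive_g2p("银行"))       # ['yin2', 'xing2'] -- wrong reading of 行
print(word_aware_g2p("银行"))  # ['yin2', 'hang2'] -- correct
```

Production systems replace these toy dictionaries with large lexica and statistical or neural disambiguation models, but the structure (segment, then look up readings in context) is the same.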



Author information

Correspondence to Wenzhuo Gong.


Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Gong, W., Hong, Y., Liu, H., He, Y. (2021). A Review of End-to-End Chinese–Mandarin Speech Synthesis Techniques. In: Cao, W., Ozcan, A., Xie, H., Guan, B. (eds.) Computing and Data Science. CONF-CDS 2021. Communications in Computer and Information Science, vol. 1513. Springer, Singapore. https://doi.org/10.1007/978-981-16-8885-0_31


  • DOI: https://doi.org/10.1007/978-981-16-8885-0_31

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-8884-3

  • Online ISBN: 978-981-16-8885-0

  • eBook Packages: Computer Science; Computer Science (R0)
