Abstract
Text-to-speech (TTS) is currently a major research area within speech synthesis. It is used in conversational artificial intelligence (AI), chatbots, voice interaction, and a wide variety of other application scenarios. Thanks to the application of deep learning and end-to-end techniques, state-of-the-art TTS systems can synthesize speech that is close to a real human voice. However, their application to Mandarin Chinese is hindered by the diversity of pronunciations of Chinese characters and the complexity of Chinese grammar. This article follows the development of speech synthesis technology to review the research and development of Mandarin speech synthesis and discusses the open problems in current research. This work is intended to help further improve the performance of end-to-end Mandarin Chinese systems and to point out valuable research directions that can bring Mandarin speech synthesis research into a new stage.
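The pronunciation diversity mentioned above stems largely from polyphonic characters: a single character such as 行 is read xíng ("to walk") or háng ("row; bank") depending on the word it appears in, so a Mandarin TTS front end must disambiguate before phoneme conversion. A minimal sketch of this lookup idea is shown below; the tiny lexicon and word-keyed rules are hand-written toy data for illustration, not a real grapheme-to-phoneme system (which would use a full lexicon and a trained disambiguation model):

```python
# Toy illustration of polyphone disambiguation in a Mandarin TTS front end.
# Readings are written in pinyin with tone digits (e.g. "hang2" = háng).

# Context rules: a polyphonic character's reading keyed by the containing word.
POLYPHONES = {
    "行": {"银行": "hang2", "行走": "xing2"},
    "乐": {"音乐": "yue4", "快乐": "le4"},
}
# Fallback reading when no word-level rule matches.
DEFAULT = {"行": "xing2", "乐": "le4"}

def grapheme_to_phoneme(word: str) -> list[str]:
    """Return one pinyin reading per character, using the containing
    word as context to resolve polyphonic characters."""
    readings = []
    for ch in word:
        if ch in POLYPHONES:
            readings.append(POLYPHONES[ch].get(word, DEFAULT[ch]))
        else:
            readings.append("?")  # a real system consults a full lexicon here
    return readings

print(grapheme_to_phoneme("银行"))  # the same 行 resolves to "hang2" here...
print(grapheme_to_phoneme("行走"))  # ...and to "xing2" here
```

End-to-end Mandarin systems either keep such a rule/model-based front end or try to learn the disambiguation implicitly from character input, which is one of the difficulties this survey discusses.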
© 2021 Springer Nature Singapore Pte Ltd.
Gong, W., Hong, Y., Liu, H., He, Y. (2021). A Review of End-to-End Chinese – Mandarin Speech Synthesis Techniques. In: Cao, W., Ozcan, A., Xie, H., Guan, B. (eds) Computing and Data Science. CONF-CDS 2021. Communications in Computer and Information Science, vol 1513. Springer, Singapore. https://doi.org/10.1007/978-981-16-8885-0_31
Print ISBN: 978-981-16-8884-3
Online ISBN: 978-981-16-8885-0