Abstract
Prosodic break prediction from text is fundamental to achieving naturalness in text-to-speech applications. In this work we build a data-driven break predictor from linguistic features, namely Part-of-Speech (POS) tags and the forward and backward word distances to punctuation marks, and we use a basic Recurrent Neural Network (RNN) to exploit the sequential dependency between decisions. In the experiments we evaluate the performance of a logistic regression model and of the recurrent model. The results show that the logistic regression outperforms the CART baseline by \(9.5\,\%\) in F-score, and that adding the recurrent layer further improves over the baseline by \(11\,\%\).
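To make the setup concrete, below is a minimal sketch of the kind of model the abstract describes: per-word POS-tag and punctuation-distance features fed to a simple recurrent layer that outputs a break/no-break probability at each word position. The feature set follows the abstract, but the layer sizes, sequence length, tag-inventory size, and the use of the Keras API are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (illustrative, not the paper's exact configuration):
# per-word POS tags plus forward/backward word distances to punctuation
# are fed to a basic recurrent layer that predicts a break probability
# at every word position. Sizes below are placeholder assumptions.
from tensorflow.keras.layers import (Input, Embedding, Concatenate,
                                     SimpleRNN, TimeDistributed, Dense)
from tensorflow.keras.models import Model

SEQ_LEN = 50        # words per padded sentence (assumed)
NUM_POS_TAGS = 40   # size of the POS-tag inventory (assumed)

# Inputs: POS-tag ids and the two punctuation-distance features per word.
pos_in = Input(shape=(SEQ_LEN,), dtype="int32", name="pos_tags")
dist_in = Input(shape=(SEQ_LEN, 2), name="punct_distances")  # forward, backward

pos_emb = Embedding(input_dim=NUM_POS_TAGS, output_dim=16)(pos_in)
features = Concatenate()([pos_emb, dist_in])

# The recurrent layer exploits the dependency between consecutive decisions.
hidden = SimpleRNN(64, return_sequences=True)(features)

# One break / no-break probability per word position.
break_prob = TimeDistributed(Dense(1, activation="sigmoid"))(hidden)

model = Model(inputs=[pos_in, dist_in], outputs=break_prob)
model.compile(optimizer="sgd", loss="binary_crossentropy")
model.summary()
```

Training would proceed on padded sentences with per-word binary break labels; the logistic-regression variant compared in the abstract corresponds roughly to removing the recurrent layer and applying the sigmoid output directly to the input features.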
Acknowledgments
This work was supported by the Spanish Ministerio de Economía y Competitividad and European Regional Development Fund, contract TEC2015-69266-P (MINECO/FEDER, UE).
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Pascual, S., Bonafonte, A. (2016). Prosodic Break Prediction with RNNs. In: Abad, A., et al. (eds.) Advances in Speech and Language Technologies for Iberian Languages. IberSPEECH 2016. Lecture Notes in Computer Science, vol. 10077. Springer, Cham. https://doi.org/10.1007/978-3-319-49169-1_7
DOI: https://doi.org/10.1007/978-3-319-49169-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49168-4
Online ISBN: 978-3-319-49169-1
eBook Packages: Computer Science (R0)