
A Hidden Semi-Markov Model-Based Speech Synthesis System

Published: 01 May 2007

Abstract

A statistical speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, the spectrum, excitation, and durations of speech are modeled simultaneously by context-dependent HMMs, and speech parameter vector sequences are generated from the HMMs themselves. The system defines the speech synthesis problem in a generative model framework and solves it under the maximum likelihood (ML) criterion. However, it contains an inconsistency: state duration probability density functions (PDFs) are used explicitly in the synthesis part of the system but are not incorporated into its training part. This inconsistency can degrade the naturalness of the synthesized speech. In this paper, we propose a statistical speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit state duration PDFs. Because HSMMs allow the state duration PDFs to be incorporated explicitly into both the training and synthesis parts of the system, they resolve the above inconsistency. Subjective listening test results show that the use of HSMMs improves the naturalness of the synthesized speech.
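The distinction the abstract draws between implicit and explicit duration modeling can be illustrated with a short sketch. In a standard HMM, a state's duration follows the geometric PDF implied by its self-transition probability, so one-frame durations are always the most likely outcome; an HSMM instead attaches an explicit duration PDF to each state (a Gaussian is used here purely as one common choice). This is an illustrative Python sketch, not the paper's implementation, and all function names are hypothetical:

```python
import random

random.seed(0)

def hmm_duration(self_loop_p):
    """Sample a state duration from the implicit geometric PDF of a
    standard HMM with self-transition probability self_loop_p."""
    d = 1
    while random.random() < self_loop_p:
        d += 1
    return d

def hsmm_duration(mean, std):
    """Sample a state duration from an explicit Gaussian duration PDF,
    as attached per state in an HSMM (truncated to at least one frame)."""
    return max(1, round(random.gauss(mean, std)))

# Both models below have the same mean duration (5 frames), but the
# geometric PDF is monotonically decreasing, so the HMM's most likely
# duration is a single frame, while the HSMM places its mode at the mean.
p = 1 - 1 / 5.0  # geometric mean = 1 / (1 - p) = 5 frames
hmm = [hmm_duration(p) for _ in range(10000)]
hsmm = [hsmm_duration(5.0, 1.5) for _ in range(10000)]
print("HMM   mean %.2f, share of 1-frame states %.2f"
      % (sum(hmm) / len(hmm), hmm.count(1) / len(hmm)))
print("HSMM  mean %.2f, share of 1-frame states %.2f"
      % (sum(hsmm) / len(hsmm), hsmm.count(1) / len(hsmm)))
```

Even with equal means, roughly a fifth of the HMM's sampled states last only one frame, while such degenerate durations are rare under the Gaussian HSMM PDF; this mismatch is one way the implicit geometric model hurts duration quality.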



Published In

IEICE Transactions on Information and Systems, Volume E90-D, Issue 5
May 2007, 80 pages
ISSN: 0916-8532, EISSN: 1745-1361

Publisher

Oxford University Press, Inc.

United States

Author Tags

  1. HMM-based speech synthesis
  2. hidden Markov model
  3. hidden semi-Markov model


Cited By

  • “Conventional and contemporary approaches used in text to speech synthesis: a review,” Artificial Intelligence Review, vol.56, no.7, pp.5837–5880, 2022. doi:10.1007/s10462-022-10315-0
  • “A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.42, no.4, pp.765–779, 2020. doi:10.1109/TPAMI.2018.2884469
  • “Mel-Cepstrum-Based Quantization Noise Shaping Applied to Neural-Network-Based Speech Waveform Synthesis,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.26, no.7, pp.1173–1180, 2018. doi:10.1109/TASLP.2018.2818408
  • “A Log Domain Pulse Model for Parametric Speech Synthesis,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.26, no.1, pp.57–70, 2018. doi:10.1109/TASLP.2017.2761546
  • “Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin,” Journal of Signal Processing Systems, vol.90, no.7, pp.1039–1052, 2018. doi:10.1007/s11265-017-1290-2
  • “A spectral algorithm for inference in hidden semi-Markov models,” The Journal of Machine Learning Research, vol.18, no.1, pp.1164–1202, 2017. doi:10.5555/3122009.3122044
  • “Duration-Controlled LSTM for Polyphonic Sound Event Detection,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.25, no.11, pp.2059–2070, 2017. doi:10.1109/TASLP.2017.2740002
  • “Simultaneous Optimization of Multiple Tree-Based Factor Analyzed HMM for Speech Synthesis,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.25, no.9, pp.1836–1845, 2017. doi:10.1109/TASLP.2017.2721219
  • “Preserving Word-Level Emphasis in Speech-to-Speech Translation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.25, no.3, pp.544–556, 2017. doi:10.1109/TASLP.2016.2643280
  • “Video-realistic expressive audio-visual speech synthesis for the Greek language,” Speech Communication, vol.95, pp.137–152, 2017. doi:10.1016/j.specom.2017.08.011
