
A Hidden Semi-Markov Model-Based Speech Synthesis System

Published: 01 May 2007

Abstract

A statistical speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, the spectrum, excitation, and durations of speech are modeled simultaneously by context-dependent HMMs, and speech parameter vector sequences are generated from the HMMs themselves. The system defines the speech synthesis problem in a generative model framework and solves it under the maximum likelihood (ML) criterion. However, it contains an inconsistency: state duration probability density functions (PDFs) are used explicitly in the synthesis part of the system but are not incorporated into its training part. This inconsistency can degrade the naturalness of the synthesized speech. In this paper, we propose a statistical speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit state duration PDFs. Because HSMMs allow the state duration PDFs to be incorporated explicitly into both the training and synthesis parts of the system, they resolve the above inconsistency. Subjective listening test results show that the use of HSMMs improves the naturalness of the synthesized speech.
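The distinction the abstract draws between implicit and explicit duration modeling can be illustrated with a short sketch. In a standard HMM, a state's duration follows the geometric PDF implied by its self-transition probability, so one-frame durations are always the most likely outcome; an HSMM instead attaches an explicit duration PDF to each state (a Gaussian is used here purely as one common choice). This is an illustrative Python sketch, not the paper's implementation, and all function names are hypothetical:

```python
import random

random.seed(0)

def hmm_duration(self_loop_p):
    """Sample a state duration from the implicit geometric PDF of a
    standard HMM with self-transition probability self_loop_p."""
    d = 1
    while random.random() < self_loop_p:
        d += 1
    return d

def hsmm_duration(mean, std):
    """Sample a state duration from an explicit Gaussian duration PDF,
    as attached per state in an HSMM (truncated to at least one frame)."""
    return max(1, round(random.gauss(mean, std)))

# Both models below have the same mean duration (5 frames), but the
# geometric PDF is monotonically decreasing, so the HMM's most likely
# duration is a single frame, while the HSMM places its mode at the mean.
p = 1 - 1 / 5.0  # geometric mean = 1 / (1 - p) = 5 frames
hmm = [hmm_duration(p) for _ in range(10000)]
hsmm = [hsmm_duration(5.0, 1.5) for _ in range(10000)]
print("HMM   mean %.2f, share of 1-frame states %.2f"
      % (sum(hmm) / len(hmm), hmm.count(1) / len(hmm)))
print("HSMM  mean %.2f, share of 1-frame states %.2f"
      % (sum(hsmm) / len(hsmm), hsmm.count(1) / len(hsmm)))
```

Even with equal means, roughly a fifth of the HMM's sampled states last only one frame, while such degenerate durations are rare under the Gaussian HSMM PDF; this mismatch is one way the implicit geometric model hurts duration quality.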



Published In

IEICE Transactions on Information and Systems, Volume E90-D, Issue 5
May 2007, 80 pages
ISSN: 0916-8532, EISSN: 1745-1361

Publisher

Oxford University Press, Inc.

United States

Author Tags

  1. HMM-based speech synthesis
  2. hidden Markov model
  3. hidden semi-Markov model


Cited By

  • “Conventional and contemporary approaches used in text to speech synthesis: a review,” Artificial Intelligence Review, vol.56, no.7, pp.5837–5880, 2022. doi:10.1007/s10462-022-10315-0
  • “A Hybrid RNN-HMM Approach for Weakly Supervised Temporal Action Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.42, no.4, pp.765–779, 2020. doi:10.1109/TPAMI.2018.2884469
  • “Mel-Cepstrum-Based Quantization Noise Shaping Applied to Neural-Network-Based Speech Waveform Synthesis,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.26, no.7, pp.1173–1180, 2018. doi:10.1109/TASLP.2018.2818408
  • “A Log Domain Pulse Model for Parametric Speech Synthesis,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.26, no.1, pp.57–70, 2018. doi:10.1109/TASLP.2017.2761546
  • “Investigating Deep Neural Network Adaptation for Generating Exclamatory and Interrogative Speech in Mandarin,” Journal of Signal Processing Systems, vol.90, no.7, pp.1039–1052, 2018. doi:10.1007/s11265-017-1290-2
  • “A spectral algorithm for inference in hidden semi-Markov models,” The Journal of Machine Learning Research, vol.18, no.1, pp.1164–1202, 2017. doi:10.5555/3122009.3122044
  • “Duration-Controlled LSTM for Polyphonic Sound Event Detection,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.25, no.11, pp.2059–2070, 2017. doi:10.1109/TASLP.2017.2740002
  • “Simultaneous Optimization of Multiple Tree-Based Factor Analyzed HMM for Speech Synthesis,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.25, no.9, pp.1836–1845, 2017. doi:10.1109/TASLP.2017.2721219
  • “Preserving Word-Level Emphasis in Speech-to-Speech Translation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol.25, no.3, pp.544–556, 2017. doi:10.1109/TASLP.2016.2643280
  • “Video-realistic expressive audio-visual speech synthesis for the Greek language,” Speech Communication, vol.95, pp.137–152, 2017. doi:10.1016/j.specom.2017.08.011
