
Word Error Rate Minimization Using an Integrated Confidence Measure

Published: 01 May 2007

Abstract

This paper describes a new criterion for speech recognition that uses an integrated confidence measure to minimize the word error rate (WER). Conventional criteria for WER minimization estimate the expected WER of a sentence hypothesis merely by comparing it with the other hypotheses in an n-best list. The proposed criterion instead estimates the expected WER using an integrated confidence measure based on word posterior probabilities for a given acoustic input. The integrated confidence measure, implemented as a classifier based on maximum entropy (ME) modeling or support vector machines (SVMs), yields probabilities that each word hypothesis is correct. The classifier combines a variety of confidence measures and can handle a temporal sequence of them to attain a more reliable confidence. The proposed criterion achieved a WER of 9.8%, a 3.9% relative reduction over conventional n-best rescoring methods, in transcribing Japanese broadcast news under varied conditions, including noisy field and spontaneous speech.
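As a rough illustration of the minimum expected-WER idea described above, the sketch below scores each n-best hypothesis by its expected number of word errors, taking each word's confidence as the probability that the word is correct and picking the hypothesis with the lowest expected error rate. The function names and toy data are illustrative assumptions, not from the paper; in the actual system the per-word confidences would come from the ME- or SVM-based classifier over multiple confidence features.

```python
# Minimal sketch of minimum expected-WER selection from an n-best list.
# Assumes each word hypothesis carries a confidence score in [0, 1]
# interpreted as P(word is correct); names and data are illustrative.

def expected_word_errors(word_confidences):
    """Expected number of word errors: sum of (1 - P(correct)) over words."""
    return sum(1.0 - c for c in word_confidences)

def select_min_expected_wer(nbest):
    """Pick the hypothesis minimizing expected errors per word.

    `nbest` is a list of (words, confidences) pairs, one per hypothesis.
    """
    def score(hyp):
        words, confs = hyp
        # Normalize by hypothesis length so short and long candidates compare fairly.
        return expected_word_errors(confs) / max(len(words), 1)
    return min(nbest, key=score)

# Toy two-hypothesis n-best list: the second word differs between candidates.
nbest = [
    (["the", "cat", "sat"], [0.9, 0.6, 0.8]),
    (["the", "cap", "sat"], [0.9, 0.3, 0.8]),
]
best_words, _ = select_min_expected_wer(nbest)
```

Here the first hypothesis wins because its middle word carries the higher confidence; a simple 1-best decoder would never revisit that choice, whereas the expected-WER criterion rescores all candidates.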

References

[1]
A. Ando, T. Imai, A. Kobayashi, S. Homma, J. Goto, N. Seiyama, T. Mishima, T. Kobayakawa, S. Sato, K. Onoe, H. Segi, A. Imai, A. Matsui, A. Nakamura, H. Tanaka, T. Takagi, E. Miyasaka, and H. Isono, “Simultaneous subtitling system for broadcast news programs with a speech recognizer,” IEICE Trans. Inf. & Syst., vol.E86-D, no.1, pp.15–25, Jan. 2003.
[2]
G. Riccardi and D. Hakkani-Tür, “Active and unsupervised learning for automatic speech recognition,” Proc. Eurospeech, pp.1825–1828, 2003.
[3]
M. Nakano, “Using untranscribed user utterances for improving language models based on confidence scoring,” Proc. Eurospeech, pp.417–420, 2003.
[4]
A. Stolcke, Y. Konig, and M. Weintraub, “Explicit word error minimization in N-best list rescoring,” Proc. Eurospeech, pp.163–166, 1997.
[5]
V. Goel, W.J. Byrne, and S.P. Khudanpur, “LVCSR rescoring with modified loss functions: A decision-theoretic perspective,” Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pp.425–428, 1998.
[6]
F. Wessel, R. Schlüter, K. Macherey, and H. Ney, “Confidence measures for large vocabulary continuous speech recognition,” IEEE Trans. Speech Audio Process., vol.9, no.3, pp.288–298, 2001.
[7]
T. Kemp and T. Schaaf, “Estimating confidence using word lattices,” Proc. Eurospeech, pp.827–830, 1997.
[8]
G. Evermann and P. Woodland, “Posterior probability decoding, confidence estimation and system combination,” Proc. NIST Speech Transcription Workshop, http://www.nist.gov/speech/publications/tw00/html/cp230/cp230.htm, 2000.
[9]
G. Riccardi and D. Hakkani-Tür, “Active learning: Theory and applications to automatic speech recognition,” IEEE Trans. Speech Audio Process., vol.13, no.4, pp.504–511, 2005.
[10]
A. Berger, S. Della Pietra, and V. Della Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, vol.22, pp.39–71, 1996.
[11]
J. Darroch and D. Ratcliff, “Generalized iterative scaling for log-linear models,” The Annals of Mathematical Statistics, pp.1470–1480, 1972.
[12]
S.F. Chen and R. Rosenfeld, “A Gaussian prior for smoothing maximum entropy models,” Technical Report CMU-CS-99-108, Carnegie Mellon University, 1999.
[13]
T.J. Hazen, S. Seneff, and J. Polifroni, “Recognition confidence scoring and its use in speech understanding systems,” Comput. Speech Lang., vol.16, pp.49–67, 2002.
[14]
P.J. Moreno, B. Logan, and B. Raj, “A boosting approach for confidence scoring,” Proc. Eurospeech, pp.2109–2112, 2001.
[15]
T. Joachims, Learning to classify text using support vector machines, Kluwer Academic Publishers, Boston, 2002.
[16]
T. Joachims, “Introduction to support vector learning,” in Advances in Kernel Methods, ed. B. Schölkopf, C.J. Burges, and A.J. Smola, MIT Press, 1999.
[17]
J. Platt, “Probabilistic outputs for support vector machines and comparison to regularized likelihood methods,” in Advances in Large Margin Classifiers, ed. A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, pp.61–74, MIT Press, 2000.
[18]
T. Joachims, “Making large-scale SVM learning practical,” in Advances in Kernel Methods, ed. B. Schölkopf, C.J. Burges, and A.J. Smola, MIT Press, 1999.
[19]
A. Stolcke, “SRILM - An extensible language modeling toolkit,” Proc. Int. Conf. Spoken Language Processing, pp.901–904, 2002.
[20]
T. Zeppenfeld, M. Finke, M. Westphal, K. Ries, and A. Waibel, “Recognition of conversational telephone speech using the Janus speech engine,” Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pp.1815–1818, 1997.
[21]
F. Weng, A. Stolcke, and A. Sankar, “Efficient lattice representation and generation,” Proc. Int. Conf. Spoken Language Processing, pp.2531–2534, 1998.
[22]
H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans. Speech Audio Process., vol.2, pp.578–589, 1994.
[23]
A. Kobayashi, K. Onoe, T. Imai, and A. Ando, “Time dependent language model for broadcast news transcription and its post-correction,” Proc. Int. Conf. Spoken Language Processing, pp.2435–2438, 1998.
[24]
L. Gillick and S.J. Cox, “Some statistical issues in the comparison of speech recognition algorithms,” Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pp.532–535, 1989.

Published In

IEICE Transactions on Information and Systems, Volume E90-D, Issue 5
May 2007
80 pages
ISSN:0916-8532
EISSN:1745-1361

Publisher

Oxford University Press, Inc.

United States

Author Tags

  1. maximum entropy
  2. n-best rescoring
  3. support vector machines
  4. word error rate minimization
