Turbo automatic speech recognition

Published: 01 May 2016

Abstract

The performance of automatic speech recognition (ASR) systems can be significantly improved by integrating further sources of information, such as additional modalities, acoustic channels, or acoustic models. The resulting information fusion problem exhibits striking parallels to problems in digital communications, where the discovery of turbo codes by Berrou et al. was a groundbreaking innovation. In this paper, we show how to successfully apply the turbo principle to the domain of ASR and thereby provide solutions to the above-mentioned information fusion problem. The contribution of our work is fourfold: First, we review the turbo-decoding forward-backward algorithm (FBA), giving detailed insights into turbo ASR and providing a new interpretation and formulation of the so-called extrinsic information passed between the recognizers. Second, we present a real-time-capable turbo-decoding Viterbi algorithm suitable for practical information fusion and recognition tasks. Third, we present simulation results for a multimodal example of information fusion. Finally, we demonstrate the suitability of both our turbo FBA and our turbo Viterbi algorithm for a single-channel multi-model recognition task obtained by using two acoustic feature extraction methods. On a small-vocabulary task (challenging, since spelling is included), our proposed turbo ASR approach outperforms even the best reference system, averaged over all SNR conditions and investigated noise types, by a relative word error rate (WER) reduction of 22.4% (audio-visual task) and 18.2% (audio-only task), respectively.
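
To make the exchange of extrinsic information concrete, the following is a minimal sketch of a turbo forward-backward loop between two HMM-based recognizers. It assumes, purely for illustration, that both recognizers share a single state space; the function names (forward_backward, turbo_fba) and the weighting exponent kappa are hypothetical choices, not the paper's actual formulation.

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_A, log_pi, log_b):
    """Standard HMM forward-backward pass.

    log_A  : (S, S) log transition probabilities
    log_pi : (S,)   log initial state probabilities
    log_b  : (T, S) per-frame log observation likelihoods
    Returns (T, S) per-frame normalized log state posteriors (gamma).
    """
    T, S = log_b.shape
    log_alpha = np.empty((T, S))
    log_beta = np.zeros((T, S))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):
        log_alpha[t] = log_b[t] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(log_A + (log_b[t + 1] + log_beta[t + 1])[None, :], axis=1)
    log_gamma = log_alpha + log_beta
    return log_gamma - logsumexp(log_gamma, axis=1, keepdims=True)

def turbo_fba(log_A, log_pi, log_b1, log_b2, iterations=4, kappa=0.5):
    """Sketch of iterative ("turbo") fusion of two decoders over a
    shared state space: each decoder folds the other's extrinsic
    information into its observation scores, then subtracts that
    a priori part again so only extrinsic knowledge is passed on.
    """
    T, S = log_b1.shape
    ext_21 = np.zeros((T, S))  # extrinsic info flowing decoder 2 -> decoder 1
    for _ in range(iterations):
        post_1 = forward_backward(log_A, log_pi, log_b1 + kappa * ext_21)
        ext_12 = post_1 - kappa * ext_21  # strip decoder 2's contribution
        ext_12 -= logsumexp(ext_12, axis=1, keepdims=True)
        post_2 = forward_backward(log_A, log_pi, log_b2 + kappa * ext_12)
        ext_21 = post_2 - kappa * ext_12  # strip decoder 1's contribution
        ext_21 -= logsumexp(ext_21, axis=1, keepdims=True)
    return post_1, post_2
```

A turbo Viterbi variant in the spirit of the paper's second contribution would replace the summations in forward_backward with maximizations while keeping the same exchange loop.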

References

[1]
R. P. Lippmann, "Speech recognition by machines and humans," Speech Commun., vol. 22, no. 1, pp. 1--15, Jul. 1997.
[2]
C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," in Proc. IEEE Int. Conf. Commun. (ICC), Geneva, Switzerland, May 1993, pp. 1064--1070.
[3]
C. Berrou, R. Pyndiah, P. Adde, C. Douillard, and R. Le Bidan, "An overview of turbo codes and their applications," in Proc. IEEE Eur. Conf. Wireless Technol., Paris, France, Oct. 2005, pp. 1--9.
[4]
C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379--423, Jul. 1948.
[5]
S. Lin and D. J. Costello, Jr., Error Control Coding. Englewood Cliffs, NJ, USA: Prentice-Hall, 1983.
[6]
R. Johannesson and K. S. Zigangirov, Fundamentals of Convolutional Coding. Hoboken, NJ, USA: Wiley/IEEE Press, 1999.
[7]
G. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, no. 3, pp. 268--278, Mar. 1973.
[8]
L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[9]
L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 284--287, Mar. 1974.
[10]
L. Bahl, F. Jelinek, and R. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-5, no. 2, pp. 179--190, Mar. 1983.
[11]
J. Hagenauer and P. Hoeher, "A Viterbi algorithm with soft-decision outputs and its applications," in Proc. GLOBECOM, Dallas, TX, USA, Nov. 1989, pp. 1680--1686.
[12]
J. Huber and A. Rüppel, "Zuverlässigkeitsschätzung für die Ausgangssymbole von Trellis-Decodern," AEÜ, vol. 44, no. 1, pp. 8--21, Jan. 1990. (in German).
[13]
H. Jiang, "Confidence measures for speech recognition: A survey," Speech Commun., vol. 45, pp. 455--470, 2005.
[14]
F. Wessel, R. Schlüter, K. Macherey, and H. Ney, "Confidence measures for large vocabulary continuous speech recognition," IEEE Trans. Speech Audio Process., vol. 9, no. 3, pp. 288--298, Mar. 2001.
[15]
J. Hagenauer, "The turbo principle: Tutorial introduction and state of the art," in Proc. Int. Symp. Turbo Codes Related Topics, Brest, France, Sep. 1997, pp. 1--11.
[16]
F. Faubel and M. Wölfel, "Coupling particle filters with automatic speech recognition for speech feature enhancement," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Pittsburgh, PA, USA, Sep. 2006, pp. 37--40.
[17]
Z.-J. Yan, F. Soong, and R.-H. Wang, "Word graph based feature enhancement for noisy speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Honolulu, HI, USA, Apr. 2007, vol. 4, pp. IV-373--IV-376.
[18]
L. Deng, J. Droppo, and A. Acero, "Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition," IEEE Trans. Speech Audio Process., vol. 11, no. 6, pp. 568--580, Nov. 2003.
[19]
S. Windmann and R. Haeb-Umbach, "Approaches to iterative speech feature enhancement and recognition," IEEE Trans. Audio Speech Lang. Process., vol. 17, no. 5, pp. 974--984, Jul. 2009.
[20]
M. Paulik, S. Stüker, C. Fügen, T. Schultz, T. Schaaf, and A. Waibel, "Speech translation enhanced automatic speech recognition," in Proc. Workshop Automat. Speech Recog. Understand. (ASRU), Cancún, Mexico, Nov. 2005, pp. 121--126.
[21]
H. Bourlard and S. Dupont, "A new ASR approach based on independent processing and recombination of partial frequency bands," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Philadelphia, PA, USA, Oct. 1996, pp. 426--429.
[22]
H. Hermansky, S. Tibrewala, and M. Pavel, "Towards ASR on partially corrupted speech," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Philadelphia, PA, USA, Oct. 1996, pp. 462--465.
[23]
W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Amer., vol. 26, no. 2, pp. 212--215, Mar. 1954.
[24]
D. G. Stork, M. E. Hennecke, and K. V. Prasad, "Visionary speech: Looking ahead to practical speechreading systems," in Speechreading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds. New York, NY, USA: Springer, 1996.
[25]
C. Neti et al., "Audio-visual speech recognition," Center Lang. Speech Process., Johns Hopkins Univ., Baltimore, MD, USA, Tech. Rep. EPFL-Report-82633, IDIAP, 2000.
[26]
G. Potamianos, C. Neti, G. Iyengar, and E. Helmuth, "Large-vocabulary audio-visual speech recognition by machines and humans," in Proc. Eurospeech, Aalborg, Denmark, Sep. 2001, pp. 1027--1030.
[27]
G. Potamianos, C. Neti, J. Luettin, and I. Matthews, "Audio-visual automatic speech recognition: An overview," in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson and P. Perrier, Eds. Cambridge, MA, USA: MIT Press, 2004, pp. 356--396.
[28]
J. Kratt, F. Metze, R. Stiefelhagen, and A. Waibel, "Large vocabulary audio-visual speech recognition using the Janus speech recognition toolkit," in Proc. DAGM-Symp., Tübingen, Germany, Aug. 2004, pp. 488--495.
[29]
U. Jain et al., "Recognition of continuous broadcast news with multiple unknown speakers and environments," in Proc. ARPA Speech Recog. Workshop, Harriman, NY, USA, Feb. 1996, pp. 61--66.
[30]
J. Ming, P. Hanna, D. Stewart, M. Owens, and F. J. Smith, "Improving speech recognition performance by using multi-model approaches," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Phoenix, AZ, USA, Mar. 1999, pp. 161--164.
[31]
J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. Workshop Automat. Speech Recog. Understand. (ASRU), Santa Barbara, CA, USA, Dec. 1997, pp. 347--352.
[32]
L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: Word error minimization and other applications of confusion networks," Comput. Speech Lang., vol. 14, no. 4, pp. 373--400, Oct. 2000.
[33]
S. Lucey, T. Chen, S. Sridharan, and V. Chandran, "Integration strategies for audio-visual speech processing: Applied to text-dependent speaker recognition," IEEE Trans. Multimedia, vol. 7, no. 3, pp. 495--506, Jun. 2005.
[34]
A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP J. Appl. Signal Process., vol. 11, pp. 1--15, 2002.
[35]
A. Garg, G. Potamianos, C. Neti, and T. S. Huang, "Frame-dependent multi-stream reliability indicators for audio-visual speech recognition," in Proc. Int. Conf. Multimedia Expo (ICME), Baltimore, MD, USA, Jul. 2003, pp. 605--608.
[36]
D. Kolossa, S. Zeiler, A. Vorwerk, and R. Orglmeister, "Audiovisual speech recognition with missing or unreliable data," in Proc. Auditory Visual Speech Process. (AVSP), Norwich, U.K., Sep. 2009, pp. 117--122.
[37]
J. Luettin, G. Potamianos, and C. Neti, "Asynchronous stream modeling for large vocabulary audio-visual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Salt Lake City, UT, USA, May 2001, pp. 169--172.
[38]
A. Abdelaziz, S. Zeiler, and D. Kolossa, "Learning dynamic stream weights for coupled-HMM-based audio-visual speech recognition," IEEE Trans. Audio Speech Lang. Process., vol. 23, no. 5, pp. 863--876, Mar. 2015.
[39]
S. Shivappa, B. Rao, and M. Trivedi, "An iterative decoding algorithm for fusion of multimodal information," EURASIP J. Adv. Signal Process., vol. 2008, pp. 1--10, 2008.
[40]
S. Shivappa, B. Rao, and M. Trivedi, "Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Las Vegas, NV, USA, Apr. 2008, pp. 2241--2244.
[41]
S. Shivappa, M. Trivedi, and B. Rao, "Audiovisual information fusion in human-computer interfaces and intelligent environments: A survey," Proc. IEEE, vol. 98, no. 10, pp. 1692--1715, Oct. 2010.
[42]
S. Shivappa, M. Trivedi, and B. Rao, "Person tracking with audio-visual cues using the iterative decoding framework," in Proc. IEEE 5th Int. Conf. Adv. Video Signal Based Surveillance (AVSS), Santa Fe, NM, USA, Sep. 2008, pp. 260--267.
[43]
S. Shivappa, B. Rao, and M. Trivedi, "Audiovisual fusion and tracking with multilevel iterative decoding: Framework and experimental evaluation," IEEE J. Sel. Topics Signal Process., vol. 4, no. 5, pp. 882--894, Oct. 2010.
[44]
D. Divsalar and F. Pollara, "Turbo codes for deep-space communications," Jet Propul. Lab., Pasadena, CA, USA, Telecommun. Data Acquis. Progress Rep. 42-120, Feb. 1995.
[45]
R. G. Gallager, "Low-density parity-check codes," IRE Trans. Inf. Theory, vol. 8, no. 1, pp. 21--28, Jan. 1962.
[46]
J. Lodge, R. Young, P. Hoeher, and J. Hagenauer, "Separable MAP filters for the decoding of product and concatenated codes," in Proc. IEEE Int. Conf. Commun. (ICC), Geneva, Switzerland, May 1993, pp. 1740--1745.
[47]
S. ten Brink, "Convergence behaviour of iteratively decoded parallel concatenated codes," IEEE Trans. Commun., vol. 49, no. 10, pp. 1727--1737, Oct. 2001.
[48]
D. Scheler, S. Walz, and T. Fingscheidt, "On iterative exchange of soft state information in two-channel automatic speech recognition," in Proc. ITG-Fachtagung Sprachkommunikation, Sep. 2012, pp. 55--58.
[49]
S. Receveur and T. Fingscheidt, "A turbo-decoding weighted forward-backward algorithm for multimodal speech recognition," in Proc. Int. Workshop Spoken Dialog Syst. (IWSDS), Napa Valley, CA, USA, Jan. 2014, pp. 4--15.
[50]
S. Receveur and T. Fingscheidt, "A compact formulation of turbo audiovisual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Florence, Italy, May 2014, pp. 5554--5558.
[51]
S. Receveur, R. Weiss, and T. Fingscheidt, "Multimodal ASR by turbo decoding vs. feature concatenation: Where to perform information integration?" in Proc. 11th ITG Conf. Speech Commun., Erlangen, Germany, Sep. 2014, pp. 21--24.
[52]
C. Douillard et al., "Iterative correction of intersymbol interference: Turbo-equalization," Eur. Trans. Telecommun., vol. 6, no. 5, pp. 507--511, May 1995.
[53]
R. Zhang and A. Rudnicky, "Word level confidence annotation using combinations of features," in Proc. 7th Eur. Conf. Speech Commun. Technol., Aalborg, Denmark, Sep. 2001, pp. 2105--2108.
[54]
A. C. Reid, T. A. Gulliver, and D. P. Taylor, "Convergence and errors in turbo-decoding," IEEE Trans. Commun., vol. 49, no. 12, pp. 2045--2051, Dec. 2001.
[55]
ETSI STQ Aspects: Distributed Speech Recognition; Advanced Front-End Feature Extraction Algorithm; Compression Algorithms, ETSI ES 202 050, Oct. 2002.
[56]
M. R. Schädler, B. T. Meyer, and B. Kollmeier, "Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition," J. Acoust. Soc. Amer., vol. 131, no. 5, pp. 4134--4151, May 2012.
[57]
B. Hoffmeister, T. Klein, R. Schlüter, and H. Ney, "Frame based system combination and a comparison with weighted ROVER and CNC," in Proc. INTERSPEECH, Pittsburgh, PA, USA, Sep. 2006, pp. 537--540.
[58]
K. Audhkhasi, A. M. Zavou, P. G. Georgiou, and S. S. Narayanan, "Theoretical analysis of diversity in an ensemble of automatic speech recognition systems," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 3, pp. 711--726, Mar. 2014.
[59]
K. Audhkhasi, A. M. Zavou, P. G. Georgiou, and S. S. Narayanan, "Empirical link between hypothesis diversity and fusion performance in an ensemble of automatic speech recognition systems," in Proc. INTERSPEECH, Lyon, France, Aug. 2013, pp. 3082--3086.
[60]
M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," J. Acoust. Soc. Amer., vol. 120, no. 5, pp. 2421--2424, Nov. 2006.
[61]
H. G. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in Proc. ISCA Workshop Automat. Speech Recog. (ASR), Paris, France, Sep. 2000, pp. 1--8.
[62]
ITU, "ITU-T Recommendation P.56, Objective measurement of active speech level," Dec. 2011.
[63]
T. G. Kolda, R. M. Lewis, and V. Torczon, "A generating set direct search augmented Lagrangian algorithm for optimization with a combination of general and linear constraints," Sandia National Lab., Albuquerque, NM, USA, Tech. Rep. SAND2006-5315, Aug. 2006.
[64]
J. Kliewer, S. X. Ng, and L. Hanzo, "Efficient computation of EXIT functions for nonbinary iterative decoding," IEEE Trans. Commun., vol. 54, no. 12, pp. 2133--2136, Dec. 2006.

Cited By

  • (2018) Comparing Fusion Models for DNN-Based Audiovisual Continuous Speech Recognition, IEEE/ACM Trans. Audio Speech Lang. Process., vol. 26, no. 3, pp. 475--484, Mar. 2018, doi: 10.1109/TASLP.2017.2783545.

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 24, Issue 5, May 2016, 157 pages
ISSN: 2329-9290
EISSN: 2329-9304
Editor: Haizhou Li

Publisher

IEEE Press

Publication History

Published: 01 May 2016
Published in TASLP Volume 24, Issue 5

Author Tags

  1. hidden Markov models
  2. iterative decoding
  3. multimedia systems
  4. robustness
  5. speech recognition

Qualifiers

  • Research-article
