Turbo automatic speech recognition

Published: 01 May 2016

Abstract

The performance of automatic speech recognition (ASR) systems can be significantly improved by integrating further sources of information, such as additional modalities, acoustic channels, or acoustic models. The resulting information fusion problem exhibits striking parallels to problems in digital communications, where the discovery of turbo codes by Berrou et al. was a groundbreaking innovation. In this paper, we show how to successfully apply the turbo principle to the domain of ASR and thereby provide solutions to the above-mentioned information fusion problem. The contribution of our work is fourfold: First, we review the turbo-decoding forward-backward algorithm (FBA), giving detailed insights into turbo ASR and providing a new interpretation and formulation of the so-called extrinsic information passed between the recognizers. Second, we present a real-time-capable turbo-decoding Viterbi algorithm suitable for practical information fusion and recognition tasks. Third, we present simulation results for a multimodal example of information fusion. Finally, we demonstrate the suitability of both our turbo FBA and our turbo Viterbi algorithm for a single-channel multi-model recognition task obtained by using two acoustic feature extraction methods. On a small-vocabulary task (challenging, since spelling is included), our proposed turbo ASR approach outperforms even the best reference system, averaged over all SNR conditions and investigated noise types, by a relative word error rate (WER) reduction of 22.4% (audio-visual task) and 18.2% (audio-only task), respectively.
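
To make the exchange of extrinsic information concrete, the following is a minimal sketch of a turbo forward-backward loop between two HMM-based recognizers. It assumes, purely for illustration, that both recognizers share a single state space; the function names (forward_backward, turbo_fba) and the weighting exponent kappa are hypothetical choices, not the paper's actual formulation.

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_A, log_pi, log_b):
    """Standard HMM forward-backward pass.

    log_A  : (S, S) log transition probabilities
    log_pi : (S,)   log initial state probabilities
    log_b  : (T, S) per-frame log observation likelihoods
    Returns (T, S) per-frame normalized log state posteriors (gamma).
    """
    T, S = log_b.shape
    log_alpha = np.empty((T, S))
    log_beta = np.zeros((T, S))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):
        log_alpha[t] = log_b[t] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(log_A + (log_b[t + 1] + log_beta[t + 1])[None, :], axis=1)
    log_gamma = log_alpha + log_beta
    return log_gamma - logsumexp(log_gamma, axis=1, keepdims=True)

def turbo_fba(log_A, log_pi, log_b1, log_b2, iterations=4, kappa=0.5):
    """Sketch of iterative ("turbo") fusion of two decoders over a
    shared state space: each decoder folds the other's extrinsic
    information into its observation scores, then subtracts that
    a priori part again so only extrinsic knowledge is passed on.
    """
    T, S = log_b1.shape
    ext_21 = np.zeros((T, S))  # extrinsic info flowing decoder 2 -> decoder 1
    for _ in range(iterations):
        post_1 = forward_backward(log_A, log_pi, log_b1 + kappa * ext_21)
        ext_12 = post_1 - kappa * ext_21  # strip decoder 2's contribution
        ext_12 -= logsumexp(ext_12, axis=1, keepdims=True)
        post_2 = forward_backward(log_A, log_pi, log_b2 + kappa * ext_12)
        ext_21 = post_2 - kappa * ext_12  # strip decoder 1's contribution
        ext_21 -= logsumexp(ext_21, axis=1, keepdims=True)
    return post_1, post_2
```

A turbo Viterbi variant in the spirit of the paper's second contribution would replace the summations in forward_backward with maximizations while keeping the same exchange loop.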

References

[1]
R. P. Lippmann, "Speech recognition by machines and humans," Speech Commun., vol. 22, no. 1, pp. 1--15, Jul. 1997.
[2]
C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," in Proc. IEEE Int. Conf. Commun. (ICC), Geneva, Switzerland, May 1993, pp. 1064--1070.
[3]
C. Berrou, R. Pyndiah, P. Adde, C. Douillard, and R. Le Bidan, "An overview of turbo codes and their applications," in Proc. IEEE Eur. Conf. Wireless Technol., Paris, France, Oct. 2005, pp. 1--9.
[4]
C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379--423, Jul. 1948.
[5]
S. Lin and D. J. Costello, Jr., Error Control Coding. Englewood Cliffs, NJ, USA: Prentice-Hall, 1983.
[6]
R. Johannesson and K. S. Zigangirov, Fundamentals of Convolutional Coding. Hoboken, NJ, USA: Wiley/IEEE Press, 1999.
[7]
G. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, no. 3, pp. 268--278, Mar. 1973.
[8]
L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[9]
L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 284--287, Mar. 1974.
[10]
L. Bahl, F. Jelinek, and R. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-5, no. 2, pp. 179--190, Mar. 1983.
[11]
J. Hagenauer and P. Hoeher, "A Viterbi algorithm with soft-decision outputs and its applications," in Proc. GLOBECOM, Dallas, TX, USA, Nov. 1989, pp. 1680--1686.
[12]
J. Huber and A. Rüppel, "Zuverlässigkeitsschätzung für die Ausgangssymbole von Trellis-Decodern," AEÜ, vol. 44, no. 1, pp. 8--21, Jan. 1990. (in German).
[13]
H. Jiang, "Confidence measures for speech recognition: A survey," Speech Commun., vol. 45, pp. 455--470, 2005.
[14]
F. Wessel, R. Schlüter, K. Macherey, and H. Ney, "Confidence measures for large vocabulary continuous speech recognition," IEEE Trans. Speech Audio Process., vol. 9, no. 3, pp. 288--298, Mar. 2001.
[15]
J. Hagenauer, "The turbo principle: Tutorial introduction and state of the art," in Proc. Int. Symp. Turbo Codes Related Topics, Brest, France, Sep. 1997, pp. 1--11.
[16]
F. Faubel and M. Wölfel, "Coupling particle filters with automatic speech recognition for speech feature enhancement," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Pittsburgh, PA, USA, Sep. 2006, pp. 37--40.
[17]
Z.-J. Yan, F. Soong, and R.-H. Wang, "Word graph based feature enhancement for noisy speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Honolulu, HI, USA, Apr. 2007, vol. 4, pp. IV-373--IV-376.
[18]
L. Deng, J. Droppo, and A. Acero, "Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition," IEEE Trans. Speech Audio Process., vol. 11, no. 6, pp. 568--580, Nov. 2003.
[19]
S. Windmann and R. Haeb-Umbach, "Approaches to iterative speech feature enhancement and recognition," IEEE Trans. Audio Speech Lang. Process., vol. 17, no. 5, pp. 974--984, Jul. 2009.
[20]
M. Paulik, S. Stüker, C. Fügen, T. Schultz, T. Schaaf, and A. Waibel, "Speech translation enhanced automatic speech recognition," in Proc. Workshop Automat. Speech Recog. Understand. (ASRU), Cancún, Mexico, Nov. 2005, pp. 121--126.
[21]
H. Bourlard and S. Dupont, "A new ASR approach based on independent processing and recombination of partial frequency bands," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Philadelphia, PA, USA, Oct. 1996, pp. 426--429.
[22]
H. Hermansky, S. Tibrewala, and M. Pavel, "Towards ASR on partially corrupted speech," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Philadelphia, PA, USA, Oct. 1996, pp. 462--465.
[23]
W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Amer., vol. 26, no. 2, pp. 212--215, Mar. 1954.
[24]
D. G. Stork, M. E. Hennecke, and K. V. Prasad, "Visionary speech: Looking ahead to practical speechreading systems," in Speechreading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds. New York, NY, USA: Springer, 1996.
[25]
C. Neti et al., "Audio-visual speech recognition," Center Lang. Speech Process., Johns Hopkins Univ., Baltimore, MD, USA, Tech. Rep. EPFL-Report-82633, IDIAP, 2000.
[26]
G. Potamianos, C. Neti, G. Iyengar, and E. Helmuth, "Large-vocabulary audio-visual speech recognition by machines and humans," in Proc. Eurospeech, Aalborg, Denmark, Sep. 2001, pp. 1027--1030.
[27]
G. Potamianos, C. Neti, J. Luettin, and I. Matthews, "Audio-visual automatic speech recognition: An overview," in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson and P. Perrier, Eds. Cambridge, MA, USA: MIT Press, 2004, pp. 356--396.
[28]
J. Kratt, F. Metze, R. Stiefelhagen, and A. Waibel, "Large vocabulary audio-visual speech recognition using the Janus speech recognition toolkit," in Proc. DAGM-Symp., Tübingen, Germany, Aug. 2004, pp. 488--495.
[29]
U. Jain et al., "Recognition of continuous broadcast news with multiple unknown speakers and environments," in Proc. ARPA Speech Recog. Workshop, Harriman, NY, USA, Feb. 1996, pp. 61--66.
[30]
J. Ming, P. Hanna, D. Stewart, M. Owens, and F. J. Smith, "Improving speech recognition performance by using multi-model approaches," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Phoenix, AZ, USA, Mar. 1999, pp. 161--164.
[31]
J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. Workshop Automat. Speech Recog. Understand. (ASRU), Santa Barbara, CA, USA, Dec. 1997, pp. 347--352.
[32]
L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: Word error minimization and other applications of confusion networks," Comput. Speech Lang., vol. 14, no. 4, pp. 373--400, Oct. 2000.
[33]
S. Lucey, T. Chen, S. Sridharan, and V. Chandran, "Integration strategies for audio-visual speech processing: Applied to text-dependent speaker recognition," IEEE Trans. Multimedia, vol. 7, no. 3, pp. 495--506, Jun. 2005.
[34]
A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP J. Appl. Signal Process., vol. 11, pp. 1--15, 2002.
[35]
A. Garg, G. Potamianos, C. Neti, and T. S. Huang, "Frame-dependent multi-stream reliability indicators for audio-visual speech recognition," in Proc. Int. Conf. Multimedia Expo (ICME), Baltimore, MD, USA, Jul. 2003, pp. 605--608.
[36]
D. Kolossa, S. Zeiler, A. Vorwerk, and R. Orglmeister, "Audiovisual speech recognition with missing or unreliable data," in Proc. Auditory Visual Speech Process. (AVSP), Norwich, U.K., Sep. 2009, pp. 117--122.
[37]
J. Luettin, G. Potamianos, and C. Neti, "Asynchronous stream modeling for large vocabulary audio-visual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Salt Lake City, UT, USA, May 2001, pp. 169--172.
[38]
A. Abdelaziz, S. Zeiler, and D. Kolossa, "Learning dynamic stream weights for coupled-HMM-based audio-visual speech recognition," IEEE Trans. Audio Speech Lang. Process., vol. 23, no. 5, pp. 863--876, Mar. 2015.
[39]
S. Shivappa, B. Rao, and M. Trivedi, "An iterative decoding algorithm for fusion of multimodal information," EURASIP J. Adv. Signal Process., vol. 2008, pp. 1--10, 2008.
[40]
S. Shivappa, B. Rao, and M. Trivedi, "Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Las Vegas, NV, USA, Apr. 2008, pp. 2241--2244.
[41]
S. Shivappa, M. Trivedi, and B. Rao, "Audiovisual information fusion in human-computer interfaces and intelligent environments: A survey," Proc. IEEE, vol. 98, no. 10, pp. 1692--1715, Oct. 2010.
[42]
S. Shivappa, M. Trivedi, and B. Rao, "Person tracking with audio-visual cues using the iterative decoding framework," in Proc. IEEE 5th Int. Conf. Adv. Video Signal Based Surveillance (AVSS), Santa Fe, NM, USA, Sep. 2008, pp. 260--267.
[43]
S. Shivappa, B. Rao, and M. Trivedi, "Audiovisual fusion and tracking with multilevel iterative decoding: Framework and experimental evaluation," IEEE J. Sel. Topics Signal Process., vol. 4, no. 5, pp. 882--894, Oct. 2010.
[44]
D. Divsalar and F. Pollara, "Turbo codes for deep-space communications," Jet Propul. Lab., Pasadena, CA, USA, Telecommun. Data Acquis. Progress Rep. 42-120, Feb. 1995.
[45]
R. G. Gallager, "Low-density parity-check codes," IRE Trans. Inf. Theory, vol. 8, no. 1, pp. 21--28, Jan. 1962.
[46]
J. Lodge, R. Young, P. Hoeher, and J. Hagenauer, "Separable MAP filters for the decoding of product and concatenated codes," in Proc. IEEE Int. Conf. Commun. (ICC), Geneva, Switzerland, May 1993, pp. 1740--1745.
[47]
S. ten Brink, "Convergence behaviour of iteratively decoded parallel concatenated codes," IEEE Trans. Commun., vol. 49, no. 10, pp. 1727--1737, Oct. 2001.
[48]
D. Scheler, S. Walz, and T. Fingscheidt, "On iterative exchange of soft state information in two-channel automatic speech recognition," in Proc. ITG-Fachtagung Sprachkommunikation, Sep. 2012, pp. 55--58.
[49]
S. Receveur and T. Fingscheidt, "A turbo-decoding weighted forward-backward algorithm for multimodal speech recognition," in Proc. Int. Workshop Spoken Dialog Syst. (IWSDS), Napa Valley, CA, USA, Jan. 2014, pp. 4--15.
[50]
S. Receveur and T. Fingscheidt, "A compact formulation of turbo audiovisual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Florence, Italy, May 2014, pp. 5554--5558.
[51]
S. Receveur, R. Weiss, and T. Fingscheidt, "Multimodal ASR by turbo decoding vs. feature concatenation: Where to perform information integration?" in Proc. 11th ITG Conf. Speech Commun., Erlangen, Germany, Sep. 2014, pp. 21--24.
[52]
C. Douillard et al., "Iterative correction of intersymbol interference: Turbo-equalization," Eur. Trans. Telecommun., vol. 6, no. 5, pp. 507--511, May 1995.
[53]
R. Zhang and A. Rudnicky, "Word level confidence annotation using combinations of features," in Proc. 7th Eur. Conf. Speech Commun. Technol., Aalborg, Denmark, Sep. 2001, pp. 2105--2108.
[54]
A. C. Reid, T. A. Gulliver, and D. P. Taylor, "Convergence and errors in turbo-decoding," IEEE Trans. Commun., vol. 49, no. 12, pp. 2045--2051, Dec. 2001.
[55]
ETSI STQ Aspects: Distributed Speech Recognition; Advanced Front-End Feature Extraction Algorithm; Compression Algorithms, ETSI ES 202 050, Oct. 2002.
[56]
M. R. Schädler, B. T. Meyer, and B. Kollmeier, "Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition," J. Acoust. Soc. Amer., vol. 131, no. 5, pp. 4134--4151, May 2012.
[57]
B. Hoffmeister, T. Klein, R. Schlüter, and H. Ney, "Frame based system combination and a comparison with weighted ROVER and CNC," in Proc. INTERSPEECH, Pittsburgh, PA, USA, Sep. 2006, pp. 537--540.
[58]
K. Audhkhasi, A. M. Zavou, P. G. Georgiou, and S. S. Narayanan, "Theoretical analysis of diversity in an ensemble of automatic speech recognition systems," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 3, pp. 711--726, Mar. 2014.
[59]
K. Audhkhasi, A. M. Zavou, P. G. Georgiou, and S. S. Narayanan, "Empirical link between hypothesis diversity and fusion performance in an ensemble of automatic speech recognition systems," in Proc. INTERSPEECH, Lyon, France, Aug. 2013, pp. 3082--3086.
[60]
M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," J. Acoust. Soc. Amer., vol. 120, no. 5, pp. 2421--2424, Nov. 2006.
[61]
H. G. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in Proc. ISCA Workshop Automat. Speech Recog. (ASR), Paris, France, Sep. 2000, pp. 1--8.
[62]
ITU, "ITU-T Recommendation P.56, Objective measurement of active speech level," Dec. 2011.
[63]
T. G. Kolda, R. M. Lewis, and V. Torczon, "A generating set direct search augmented Lagrangian algorithm for optimization with a combination of general and linear constraints," Sandia National Lab., Albuquerque, NM, USA, Tech. Rep. SAND2006-5315, Aug. 2006.
[64]
J. Kliewer, S. X. Ng, and L. Hanzo, "Efficient computation of EXIT functions for nonbinary iterative decoding," IEEE Trans. Commun., vol. 54, no. 12, pp. 2133--2136, Dec. 2006.

Cited By

  • (2018) Comparing Fusion Models for DNN-Based Audiovisual Continuous Speech Recognition, IEEE/ACM Trans. Audio Speech Lang. Process., vol. 26, no. 3, pp. 475--484, Mar. 2018, doi: 10.1109/TASLP.2017.2783545.

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 24, Issue 5, May 2016, 157 pages
ISSN: 2329-9290
EISSN: 2329-9304
Editor: Haizhou Li

Publisher

IEEE Press

Publication History

Published: 01 May 2016
Published in TASLP Volume 24, Issue 5

Author Tags

  1. hidden Markov models
  2. iterative decoding
  3. multimedia systems
  4. robustness
  5. speech recognition

Qualifiers

  • Research-article
