Research article, DOI: 10.1145/3474085.3481542

L2RS: A Learning-to-Rescore Mechanism for Hybrid Speech Recognition

Published: 17 October 2021

Abstract

This paper aims to advance the performance of industrial ASR systems by exploring a more effective method for N-best rescoring, a critical step that greatly affects the final recognition accuracy. Existing rescoring approaches suffer from two issues: (i) their performance is limited because they optimize an unnecessarily hard problem, namely predicting accurate grammatical legitimacy scores for the N-best hypotheses rather than directly predicting their partial order with respect to a specific acoustic input; (ii) they find it hard to incorporate the varied information produced by advanced natural language processing (NLP) models such as BERT so as to achieve a comprehensive evaluation of each N-best candidate. To address these drawbacks, we propose a simple yet effective mechanism, Learning-to-Rescore (L2RS), which empowers ASR systems with state-of-the-art information retrieval (IR) techniques. Specifically, L2RS draws on a wide range of textual information from state-of-the-art NLP models and automatically decides the weights of these features, directly learning the ranking order of the N-best hypotheses with respect to a specific acoustic input. To train the rescoring model, we incorporate various features, including BERT sentence embeddings, topic vectors, and perplexity scores produced by an n-gram language model (LM), a topic modeling LM, BERT, and an RNNLM. Experimental results on a public dataset show that L2RS outperforms not only traditional rescoring methods but also its deep neural network counterparts by a substantial margin of 20.85% in terms of NDCG@10. The L2RS toolkit has been successfully deployed for many online commercial services at WeBank Co., Ltd, China's leading digital bank. The efficacy and applicability of L2RS are validated on real-life online customer datasets.
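To make the ranking view of rescoring concrete, the following is a minimal sketch, not the authors' implementation. It reorders a toy N-best list by a weighted sum of per-hypothesis features and evaluates the result with NDCG@k. The feature names, weights, relevance labels, and hypothesis texts are all hypothetical. The weights are fixed by hand here, whereas L2RS learns them. The abstract does not say which NDCG gain function is used, so the common exponential gain (2^rel - 1) is assumed.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def rescore(hypotheses, weights):
    """Sort hypotheses by a weighted sum of their features, best first."""
    def score(hyp):
        return sum(weights[name] * value
                   for name, value in hyp["features"].items())
    return sorted(hypotheses, key=score, reverse=True)

# Hypothetical 3-best list in first-pass decoder order. "relevance" is a
# made-up graded label (higher means closer to the reference transcript).
nbest = [
    {"text": "wreck a nice beach", "relevance": 0,
     "features": {"ngram_logprob": -9.0, "rnnlm_logprob": -8.0}},
    {"text": "recognize speech", "relevance": 2,
     "features": {"ngram_logprob": -7.5, "rnnlm_logprob": -4.0}},
    {"text": "recognise peach", "relevance": 1,
     "features": {"ngram_logprob": -8.0, "rnnlm_logprob": -6.0}},
]

# Hand-picked weights for illustration only; a learning-to-rank model
# would fit these from training data.
weights = {"ngram_logprob": 0.4, "rnnlm_logprob": 0.6}

reranked = rescore(nbest, weights)
ndcg_before = ndcg_at_k([h["relevance"] for h in nbest], k=10)
ndcg_after = ndcg_at_k([h["relevance"] for h in reranked], k=10)
print([h["text"] for h in reranked], ndcg_before, ndcg_after)
```

In this toy case the reranked list places the most relevant hypothesis first, so its NDCG@10 is higher than that of the first-pass order.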

Cited By

  • (2024) Speech-to-SQL: toward speech-driven SQL query generation from natural language question. The VLDB Journal 33:4 (1179-1201). DOI: 10.1007/s00778-024-00837-0. Online publication date: 16-Feb-2024

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. automatic speech recognition
2. information retrieval
3. learning-to-rank
4. n-best rescoring

Qualifiers

• Research-article

Funding Sources

• Science, Technology & Information Technology Bureau of Guangzhou Development Zone

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%
