Abstract
Phonotactic and acoustic spoken language recognition (SLR) systems are currently the most widely used approaches to language recognition. Parallel phone recognition followed by vector space modeling (PPRVSM) is a typical phonotactic system. To improve performance, researchers have sought to extract more complementary information from the training data by combining multiple language-specific phone recognizers, different acoustic models, and different acoustic features. These methods achieve good performance, but they usually come at high computational cost and exploit only the complementary information in the training data. In this paper, we explore a novel approach to discriminative vector space model (VSM) training that uses a boosting framework to exploit the discriminative information in the test data effectively, in which an ensemble of VSMs is trained sequentially. The effectiveness of our boosting variant comes from its emphasis on high-confidence test data, which yields discriminatively trained models. Our variant also retains the original training data during VSM training. The discriminative boosting algorithm (DBA) is applied to the National Institute of Standards and Technology (NIST) language recognition evaluation (LRE) 2009 task and shows consistent performance improvements. The experimental results demonstrate that the proposed DBA achieves relative reductions in equal error rate (EER) of 1.8 %, 11.72 %, and 15.35 % over the baseline system for 30 s, 10 s, and 3 s test utterances, respectively.
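The abstract only outlines the sequential training loop, so the following is a minimal, hypothetical sketch of one plausible reading of it. Everything here is an illustrative assumption rather than the authors' implementation: LinearSVC stands in for the VSM classifier (the real system operates on phonotactic n-gram statistics from parallel phone recognizers), and the softmax confidence measure, CONF_THRESHOLD, and N_ROUNDS are invented for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC  # stand-in for the paper's VSM classifier

CONF_THRESHOLD = 0.9  # assumed confidence cutoff; not specified in the abstract
N_ROUNDS = 3          # assumed number of boosting rounds

def softmax(scores):
    """Convert raw decision scores to pseudo-probabilities per utterance."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def discriminative_boosting(train_x, train_y, test_x):
    """Sequentially train an ensemble of VSMs. Each round pseudo-labels the
    test set, keeps only high-confidence utterances, and retrains on the
    original training data augmented with those utterances.
    Assumes a multiclass task (>2 languages), so decision_function returns
    one score per language."""
    ensemble = []
    aug_x, aug_y = train_x, train_y
    for _ in range(N_ROUNDS):
        vsm = LinearSVC().fit(aug_x, aug_y)
        ensemble.append(vsm)
        scores = vsm.decision_function(test_x)   # shape: (n_test, n_languages)
        confidence = softmax(scores).max(axis=1)
        pseudo_y = vsm.predict(test_x)
        keep = confidence > CONF_THRESHOLD
        # The original training data is always retained, as the abstract stresses;
        # only high-confidence test utterances are added on top of it.
        aug_x = np.vstack([train_x, test_x[keep]])
        aug_y = np.concatenate([train_y, pseudo_y[keep]])
    return ensemble
```

The abstract does not specify how the ensemble members' scores are fused into a final language decision, so that step is omitted from the sketch.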
Additional information
This project is supported by the National Natural Science Foundation of China (Nos. 61273268, 61370034, and 61403224).
Cite this article
Liu, WW., Cai, M., Zhang, WQ. et al. Discriminative Boosting Algorithm for Diversified Front-End Phonotactic Language Recognition. J Sign Process Syst 82, 229–239 (2016). https://doi.org/10.1007/s11265-015-1017-1