
Discriminative Boosting Algorithm for Diversified Front-End Phonotactic Language Recognition

Journal of Signal Processing Systems

Abstract

Phonotactic and acoustic spoken language recognition (SLR) systems are currently the most widely used approaches to language recognition. Parallel phone recognition followed by vector space modeling (PPRVSM) is a typical phonotactic SLR system. To improve performance, previous work has extracted complementary information from the training data by combining multiple language-specific phone recognizers, different acoustic models, and different acoustic features. These methods achieve good performance, but they usually incur a high computational cost and exploit only the complementary information in the training data. In this paper, we explore a novel approach to discriminative vector space model (VSM) training that uses a boosting framework to exploit the discriminative information in test data effectively: an ensemble of VSMs is trained sequentially, and the effectiveness of our boosting variant comes from its emphasis on high-confidence test data when training discriminative models. Our variant of boosting also retains the original training data in VSM training. The discriminative boosting algorithm (DBA) is applied to the National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) 2009 task and yields performance improvements: relative to the baseline system, the proposed DBA achieves 1.8 %, 11.72 %, and 15.35 % relative reductions in equal error rate (EER) on 30 s, 10 s, and 3 s test utterances, respectively.
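To make the boosting loop concrete, here is a minimal sketch of the sequential VSM ensemble described above, not the paper's exact DBA: it assumes precomputed phonotactic feature vectors, scikit-learn's LinearSVC as a stand-in for the VSM classifier, a fixed decision-margin threshold for selecting high-confidence test utterances, and simple score averaging as the fusion rule. The names train_dba_ensemble, conf_thresh, and n_rounds are illustrative, not from the paper.

```python
# Minimal sketch of a sequential VSM ensemble (not the paper's exact DBA).
# Assumptions: rows of X_* are precomputed phonotactic feature vectors, there
# are more than two target languages (so decision_function returns an
# (n_samples, n_classes) array), and LinearSVC stands in for the VSM classifier.
import numpy as np
from sklearn.svm import LinearSVC


def train_dba_ensemble(X_train, y_train, X_test, n_rounds=3, conf_thresh=1.0):
    """Sequentially train an ensemble of vector space models (VSMs)."""
    X_aug, y_aug = X_train, y_train
    models = []
    for _ in range(n_rounds):
        clf = LinearSVC()
        clf.fit(X_aug, y_aug)
        models.append(clf)

        # Per-language decision scores for the test set.
        scores = clf.decision_function(X_test)
        confident = scores.max(axis=1) > conf_thresh  # high-confidence trials
        pseudo_labels = clf.classes_[scores.argmax(axis=1)]

        # Next round: the original training data is always retained, and the
        # high-confidence test utterances are added with pseudo labels.
        X_aug = np.vstack([X_train, X_test[confident]])
        y_aug = np.concatenate([y_train, pseudo_labels[confident]])
    return models


def score_ensemble(models, X):
    """Fuse the ensemble by averaging decision scores (one simple choice)."""
    return np.mean([m.decision_function(X) for m in models], axis=0)
```

In this sketch, conf_thresh controls the trade-off in the pseudo-labels: a larger margin threshold admits fewer, but more reliable, test utterances into the next round's training set.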




Author information


Corresponding author

Correspondence to Wei-Qiang Zhang.

Additional information

This project is supported by the National Natural Science Foundation of China (No. 61273268, No. 61370034, and No. 61403224).


About this article


Cite this article

Liu, WW., Cai, M., Zhang, WQ. et al. Discriminative Boosting Algorithm for Diversified Front-End Phonotactic Language Recognition. J Sign Process Syst 82, 229–239 (2016). https://doi.org/10.1007/s11265-015-1017-1

