
Text-independent speaker recognition using LSTM-RNN and speech enhancement

Multimedia Tools and Applications

Abstract

The rapid progress in speaker recognition has led to the inclusion of speaker recognition modules in several commercial products. Most published algorithms for speaker recognition focus on text-dependent speaker recognition. Text-independent speaker recognition is more advantageous, however, because the client can talk freely to the system. This paper considers text-independent speaker recognition in the presence of degradation effects such as noise and reverberation. Mel-Frequency Cepstral Coefficients (MFCCs), the spectrum, and the log-spectrum are used for feature extraction from the speech signals. These features are fed to a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), which serves as the classifier for the speaker recognition task. The network learns to recognize speakers efficiently in a text-independent manner when the recording conditions are the same. The recognition rate reaches 95.33% with MFCCs and increases to 98.7% with the spectrum or log-spectrum. However, the system has difficulty recognizing speakers recorded in different environments. Hence, speech enhancement techniques such as spectral subtraction and wavelet denoising are used to improve the recognition performance to some extent. The proposed approach outperforms the algorithm of R. Togneri and D. Pullella (2011).
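To make the feature and enhancement steps named in the abstract more concrete, the following is a minimal NumPy sketch of framing, magnitude and log-spectrum extraction, and basic magnitude spectral subtraction. It is not the authors' implementation: the function names, frame length, hop size, spectral floor, and the synthetic tone-plus-noise test signal are all assumptions chosen for illustration, and the noise spectrum is estimated from a noise-only signal, which a real system would have to estimate from non-speech segments.

```python
import numpy as np


def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping, Hann-windowed frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hanning(frame_len)


def magnitude_spectrum(frames):
    """Magnitude of the one-sided FFT of each frame."""
    return np.abs(np.fft.rfft(frames, axis=1))


def log_spectrum(mag, eps=1e-10):
    """Natural-log spectrum; eps avoids log(0)."""
    return np.log(mag + eps)


def spectral_subtraction(mag, noise_mag, floor=0.01):
    """Subtract a noise magnitude estimate, keeping a small spectral floor."""
    return np.maximum(mag - noise_mag, floor * mag)


# Synthetic example (assumed data): a 440 Hz tone in white noise at 8 kHz.
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.3 * rng.standard_normal(fs)
noisy = clean + noise

noisy_mag = magnitude_spectrum(frame_signal(noisy))
# Assumption: the average noise spectrum is known from a noise-only recording.
noise_mag = magnitude_spectrum(frame_signal(noise)).mean(axis=0)

enhanced_mag = spectral_subtraction(noisy_mag, noise_mag)
features = log_spectrum(enhanced_mag)  # one feature vector per frame
```

Each row of `features` is one frame, so the matrix forms a time sequence of the kind a recurrent classifier such as an LSTM would consume. MFCC extraction and wavelet denoising are omitted here to keep the sketch short.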


References

  1. Abd El-Samie FE (2011) Information security for automatic speaker identification. Springer briefs in electrical and computer engineering. Springer, New York

  2. Baccouche M et al (2010) Action classification in soccer videos with long short-term memory recurrent neural networks. Springer-Verlag Berlin Heidelberg, pp 154–159

  3. Bhattacharya G, Alam J, Stafylakis T, Kenny P (2016) Deep neural network based text-dependent speaker recognition: preliminary results. Odyssey 2016, pp 9–15

  4. Campbell JP (1997) Speaker recognition: a tutorial. Proceedings of the IEEE 85(9)

  5. Das A, Jena MR, Barik KK (2014) Mel-frequency cepstral coefficient (MFCC) a novel method for speaker recognition. Digital Technologies 1:1–3


  6. Dennis J, Dat T, Li H (2011) Spectrogram image feature for sound event classification in mismatched conditions. IEEE Signal Processing Letters 18(2):130–133


  7. Dominguez JG et al (2014) Automatic language identification using long short-term memory recurrent neural networks. Interspeech 2014, pp 2155–2159

  8. Evans NWD, Mason JS, Liu WM, Fauve B (2005) On the fundamental limitations of spectral subtraction: an assessment by automatic speech recognition. IEEE, European Signal Processing Conference, 2005

  9. Gish H, Schmidt M (1994) Text-independent speaker identification. IEEE Signal Process Mag 11:18–32


  10. Kaladharan N (2014) Speech enhancement by spectral subtraction method. International Journal of Computer Applications 96(13):45–48


  11. Karam M et al (2014) Noise removal in speech processing using spectral subtraction. Journal of Signal and Information Processing 5:32–41


  12. Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Comm 52:12–40


  13. Kumari VSR, Devarakonda DK (2013) A wavelet based denoising of speech signal. Int J Eng Trends Technol 5(2):107–115


  14. Larsson J (2014) Optimizing text-independent speaker recognition using an LSTM neural network. Master Thesis in Robotics

  15. Li KP, Wrench KH (1983) An approach to text-independent speaker recognition with short utterances. IEEE, pp 555–558

  16. Mihov SG (2009) Denoising speech signals by wavelet transform. Annual Journal of Electronics

  17. Nilufar S, Ray N, Islam Molla MK, Hirose K (2012) Spectrogram based features selection using multiple kernel learning for speech/music discrimination. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 501–504


  18. Parada PP et al (2014) Reverberant speech recognition: a phoneme analysis. In: Proc. IEEE global Conf. Signal Inf. Process, pp 567–571

  19. Sant’Ana R et al (2006) Text-independent speaker recognition based on the Hurst parameter and the multidimensional fractional Brownian motion model. IEEE Trans Audio Speech Lang Process 14(3):931–940


  20. Seo Y, Huh J (2019) Automatic emotion-based music classification for supporting intelligent IoT applications. Electronics 8:164

  21. Sharma A, Singh SP, Kumar V (2005) Text-independent speaker identification using Back propagation MLP network classifier for a closed set of speaker. IEEE International Symposium on Signal Processing and Information Technology, 2005

  22. Togneri R, Pullella D (2011) An overview of speaker identification: accuracy and robustness issues. IEEE Circuits and Systems Magazine 11:23–61


  23. Yegnanarayana B, Murthy PS (2000) Enhancement of reverberant speech using LP residual signal. IEEE Trans Speech Audio Processing 8:267–281


  24. Zazo R (2016) Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLoS One 11:e0146917



Author information


Corresponding author

Correspondence to Samia Abd El-Moneim.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

El-Moneim, S.A., Nassar, M.A., Dessouky, M.I. et al. Text-independent speaker recognition using LSTM-RNN and speech enhancement. Multimed Tools Appl 79, 24013–24028 (2020). https://doi.org/10.1007/s11042-019-08293-7


  • DOI: https://doi.org/10.1007/s11042-019-08293-7
