Abstract
Advances in speaker recognition have led to the inclusion of speaker recognition modules in several commercial products. Most published speaker recognition algorithms are text-dependent. Text-independent speaker recognition is more advantageous, as the client can talk freely to the system. In this paper, text-independent speaker recognition is considered in the presence of degradation effects such as noise and reverberation. Mel-Frequency Cepstral Coefficients (MFCCs), the spectrum, and the log-spectrum are used as features extracted from the speech signals. These features are fed to a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN), which acts as the classifier for the speaker recognition task. The network learns to recognize the speakers efficiently in a text-independent manner when the recording circumstances are the same. The recognition rate reaches 95.33% using MFCCs, and it increases to 98.7% when using the spectrum or log-spectrum. However, the system has difficulty recognizing speakers recorded in different environments. Hence, speech enhancement techniques, such as spectral subtraction and wavelet denoising, are used to improve the recognition performance to some extent. The proposed approach outperforms the algorithm of R. Togneri and D. Pullella (2011).
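As an illustration of two steps named in the abstract, the following minimal sketch computes log-spectrum features from framed speech and applies a basic magnitude spectral subtraction beforehand. It is not the authors' implementation: the frame length, hop size, Hamming window, spectral floor, and the choice to estimate noise from the first few frames are all assumptions made for this example, and a naive DFT from the standard library stands in for the FFT that would normally be used.

```python
import cmath
import math
import random


def frame_signal(x, frame_len=64, hop=32):
    """Split x into overlapping frames and apply a Hamming window (assumed choice)."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[x[s + n] * window[n] for n in range(frame_len)]
            for s in range(0, len(x) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """One-sided DFT magnitude; naive O(N^2) DFT, an FFT would be used in practice."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectral_subtraction(mags, noise_frames=4, floor=0.01):
    """Subtract a per-bin noise estimate (mean of the first noise_frames frames,
    assumed to contain no speech) and clamp each bin at floor * original."""
    bins = len(mags[0])
    noise = [sum(m[k] for m in mags[:noise_frames]) / noise_frames
             for k in range(bins)]
    return [[max(m[k] - noise[k], floor * m[k]) for k in range(bins)]
            for m in mags]


def log_spectrum(mag, eps=1e-10):
    """Natural log of the magnitude spectrum; eps avoids log(0)."""
    return [math.log(v + eps) for v in mag]


# Synthetic test signal (hypothetical): noise only for the first 160 samples,
# then a sinusoid centred on DFT bin 8 plus the same noise.
random.seed(0)
frame_len, hop = 64, 32
signal = [random.gauss(0.0, 0.1) for _ in range(512)]
for t in range(160, 512):
    signal[t] += math.sin(2 * math.pi * 8 * t / frame_len)

mags = [magnitude_spectrum(f) for f in frame_signal(signal, frame_len, hop)]
cleaned = spectral_subtraction(mags, noise_frames=4)
features = [log_spectrum(m) for m in cleaned]  # one feature vector per frame
```

Each row of `features` is a per-frame vector of the kind that could be passed, frame by frame, to a sequence classifier such as an LSTM. The spectral floor keeps every bin positive so the logarithm stays finite after subtraction.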
References
Abd El-Samie FE (2011) Information security for automatic speaker identification. Springer briefs in electrical and computer engineering. Springer, New York
Baccouche M et al (2010) Action classification in soccer videos with long short-term memory recurrent neural networks. Springer-Verlag Berlin Heidelberg, pp 154–159
Bhattacharya G, Alam J, Stafylakis T, Kenny P (2016) Deep neural network based text-dependent speaker recognition: preliminary results. Odyssey 2016, pp 9–15
Campbell JP (1997) Speaker recognition: a tutorial. In: Proceedings of the IEEE, vol 85, no 9
Das A, Jena MR, Barik KK (2014) Mel-frequency cepstral coefficient (MFCC) a novel method for speaker recognition. Digital Technologies 1:1–3
Dennis J, Dat T, Li H (2011) Spectrogram image feature for sound event classification in mismatched conditions. IEEE Signal Processing Letters 18(2):130–133
Dominguez JG et al (2014) Automatic language identification using long short-term memory recurrent neural networks. Interspeech 2014, pp 2155–2159
Evans NWD, Mason JS, Liu WM, Fauve B (2005) On the fundamental limitations of spectral subtraction: an assessment by automatic speech recognition. In: European Signal Processing Conference, 2005
Gish H, Schmidt M (1994) Text-independent speaker identification. IEEE Signal Process Mag 11:18–32
Kaladharan N (2014) Speech enhancement by spectral subtraction method. International Journal of Computer Applications 96(13):45–48
Karam M et al (2014) Noise removal in speech processing using spectral subtraction. Journal of Signal and Information Processing 5:32–41
Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Comm 52:12–40
Kumari VSR, Devarakonda DK (2013) A wavelet based denoising of speech signal. Int J Eng Trends Technol 5(2):107–115
Larsson J (2014) Optimizing text-independent speaker recognition using an LSTM neural network. Master Thesis in Robotics
Li KP, Wrench KH (1983) An approach to text-independent speaker recognition with short utterances. IEEE, pp 555–558
Mihov SG (2009) Denoising speech signals by wavelet transform. Annual Journal of Electronics
Nilufar S, Ray N, Islam Molla MK, Hirose K (2012) Spectrogram based features selection using multiple kernel learning for speech/music discrimination. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 501–504
Parada PP et al (2014) Reverberant speech recognition: a phoneme analysis. In: Proc. IEEE Global Conf. Signal Inf. Process, pp 567–571
Sant’Ana R et al (2006) Text-independent speaker recognition based on the Hurst parameter and the multidimensional fractional Brownian motion model. IEEE Trans Audio Speech Lang Process 14(3):931–940
Seo Y, Huh J (2019) Automatic emotion-based music classification for supporting intelligent IoT applications. Electronics 8:164
Sharma A, Singh SP, Kumar V (2005) Text-independent speaker identification using Back propagation MLP network classifier for a closed set of speaker. IEEE International Symposium on Signal Processing and Information Technology, 2005
Togneri R, Pullella D (2011) An overview of speaker identification: accuracy and robustness issues. IEEE Circuits and Systems Magazine 11:23–61
Yegnanarayana B, Murthy PS (2000) Enhancement of reverberant speech using LP residual signal. IEEE Trans Speech Audio Processing 8:267–281
Zazo R (2016) Language identification in short utterances using long short-term memory (LSTM) recurrent neural networks. PLoS One 11:e0146917
Cite this article
El-Moneim, S.A., Nassar, M.A., Dessouky, M.I. et al. Text-independent speaker recognition using LSTM-RNN and speech enhancement. Multimed Tools Appl 79, 24013–24028 (2020). https://doi.org/10.1007/s11042-019-08293-7
DOI: https://doi.org/10.1007/s11042-019-08293-7