Deep Learning in Speaker Recognition

Development and Analysis of Deep Learning Architectures

Part of the book series: Studies in Computational Intelligence ((SCI,volume 867))

Abstract

Speaker Recognition (SR) rests on the assumption that every person has a unique voice, which can serve as an identity in place of, or in addition to, other identifiers such as fingerprint, face, or iris. Although neural networks were first applied to SR long ago, recent advances in computing hardware, new deep learning (DL) architectures and training methods, and access to large amounts of training data have encouraged the research community to adopt DL in SR, as in a wide variety of other signal processing applications. This chapter first briefly reviews the traditional principal techniques in SR and discusses the signal processing aspects of these techniques that DL can potentially improve. It then introduces the most successful recent DL architectures used in SR and includes some illustrative experiments from the authors.

Acknowledgements

This work is partially supported by the Spanish project DeepVoice under grant number TEC2015-69266-P.

Author information

Corresponding author

Correspondence to Omid Ghahabi.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Ghahabi, O., Safari, P., Hernando, J. (2020). Deep Learning in Speaker Recognition. In: Pedrycz, W., Chen, SM. (eds) Development and Analysis of Deep Learning Architectures. Studies in Computational Intelligence, vol 867. Springer, Cham. https://doi.org/10.1007/978-3-030-31764-5_6
