
Coupling a Generative Model With a Discriminative Learning Framework for Speaker Verification

Published: 19 November 2021

Abstract

The task of speaker verification (SV) is to decide whether an utterance was spoken by a target speaker or by an impostor. In most SV studies, a log-likelihood ratio (LLR) score is estimated from a generative probability model of speaker features and compared with a threshold to make the decision. However, a generative model usually focuses on individual feature distributions, lacks discriminative feature selection, and is easily distracted by nuisance features. SV, as a hypothesis test, can instead be formulated as a binary discrimination task to which neural-network-based discriminative learning can be applied, where nuisance features can be suppressed with the help of label supervision. However, discriminative learning concentrates on classification boundaries and is prone to overfitting the training set, which may lead to poor generalization on a test set. In this paper, we propose a hybrid learning framework that couples the structure and parameters of a joint Bayesian (JB) generative model with a neural discriminative learning framework for SV. In this framework, a two-branch Siamese neural network is built with dense layers that are coupled through factorized affine transforms, as used in the JB model, and the LLR score estimation of the JB model is formulated as the distance metric of the discriminative learning framework. After initializing the two-branch network with the generatively learned parameters of the JB model, we further train its parameters on pairwise samples as a binary discrimination task. Moreover, a direct evaluation metric (DEM) for SV based on minimum empirical Bayes risk (EBR) is designed and integrated as an objective function in the discriminative learning. We carried out SV experiments on the Speakers in the Wild (SITW) and VoxCeleb datasets. Experimental results showed that the proposed model improved performance by a large margin compared with state-of-the-art SV models.
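
To make the generative side of the abstract concrete, the sketch below computes the closed-form JB verification score. It is a minimal NumPy illustration only, assuming the identity covariance S_mu and within-speaker covariance S_eps have already been estimated (e.g., by EM on length-normalized embeddings); the 1/2 factor and log-determinant constant of the LLR are dropped, so the score is correct only up to an affine calibration. The paper's re-parameterization of this bilinear form into a trainable Siamese network is not reproduced here.

```python
import numpy as np

def jb_llr_params(S_mu, S_eps):
    """Closed-form score matrices for the joint Bayesian (JB) model.

    Model: x = mu + eps, with identity component mu ~ N(0, S_mu) and
    within-speaker variation eps ~ N(0, S_eps). For a trial pair (x1, x2),
    the LLR between the same-speaker and different-speaker hypotheses is,
    up to an affine calibration:

        r(x1, x2) = x1' A x1 + x2' A x2 - 2 x1' G x2
    """
    Sigma = S_mu + S_eps                      # covariance of a single feature
    Sigma_inv = np.linalg.inv(Sigma)
    # Invert the same-speaker joint covariance [[Sigma, S_mu], [S_mu, Sigma]]
    # blockwise: its diagonal block is (F + G), its off-diagonal block is G.
    F_plus_G = np.linalg.inv(Sigma - S_mu @ Sigma_inv @ S_mu)
    G = -Sigma_inv @ S_mu @ F_plus_G
    A = Sigma_inv - F_plus_G
    return A, G

def jb_llr(x1, x2, A, G):
    """Unnormalized LLR score for one verification trial."""
    return float(x1 @ A @ x1 + x2 @ A @ x2 - 2.0 * x1 @ G @ x2)
```

Thresholding jb_llr(x1, x2, A, G) reproduces the conventional generative decision rule the abstract contrasts with discriminative training; in the proposed hybrid framework, the two-branch network is initialized from these generatively learned parameters and then fine-tuned on pairwise samples.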
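The abstract's DEM objective is a training criterion derived from minimum empirical Bayes risk; its exact formulation is given in the paper body and is not reproduced here. The snippet below shows one standard way such a risk can be made differentiable, by relaxing the hard miss/false-alarm indicators of a detection cost with sigmoid approximations. The function name, the smoothing constant alpha, and the cost weights are illustrative assumptions, not the paper's definitions.

```python
import torch

def soft_bayes_risk(scores, labels, threshold=0.0, alpha=10.0,
                    c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Sigmoid-smoothed empirical Bayes risk over a batch of trial scores.

    labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    The hard decision 1[score > threshold] is relaxed to
    sigmoid(alpha * (score - threshold)) so that the miss and false-alarm
    rates, and hence the Bayes risk, become differentiable in the scores.
    """
    tgt = labels.float()
    p_accept = torch.sigmoid(alpha * (scores - threshold))  # soft accept
    p_miss = ((1.0 - p_accept) * tgt).sum() / tgt.sum().clamp(min=1.0)
    p_fa = (p_accept * (1.0 - tgt)).sum() / (1.0 - tgt).sum().clamp(min=1.0)
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
```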

Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 29, 2021
3717 pages
ISSN: 2329-9290
EISSN: 2329-9304

Publisher

IEEE Press