research-article

A bilevel framework for joint optimization of session compensation and classification for speaker identification

Published: 01 June 2019

Abstract

The i-vector framework is one of the most popular approaches to speaker identification (SID). In such systems, session compensation is usually applied first, followed by a classifier. Every session-compensated representation of an i-vector yields a corresponding identification result, so the two stages are coupled. In current SID systems, however, session compensation and the classifier are usually optimized independently. Because the session compensation stage is then trained with incomplete knowledge of the identification task, it may introduce uncertainty into the identification result. In this paper, we propose a bilevel framework that jointly optimizes session compensation and the classifier to strengthen the relationship between the two stages. In this framework, we use sparse coding (SC) to obtain the session-compensated feature by learning an overcomplete dictionary, and we employ a softmax classifier and a support vector machine (SVM), respectively, for classification. We jointly optimize the dictionary and the classifier parameters under a discriminative criterion for the classifier, subject to the conditions for SC. The proposed methods are evaluated on the King-ASR-010, VoxCeleb and RSR2015 databases. Compared with typical session compensation techniques, such as linear discriminant analysis (LDA) and nonparametric discriminant analysis (NDA), our methods can be more robust to complex session variability. Compared with the typical classifiers in the i-vector framework, i.e. cosine distance scoring (CDS) and probabilistic linear discriminant analysis (PLDA), our methods can be more suitable for SID, which is a multiclass task.
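
The two levels described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact objective, so the elastic-net penalty, the ISTA solver, and all names and hyperparameters (`sparse_code`, `softmax_loss`, `lam1`, `lam2`) are assumptions chosen for illustration. The lower level computes a sparse code of an i-vector over an overcomplete dictionary `D`; the upper level scores that code with a softmax classifier, whose loss would drive the joint update of `D` and the classifier parameters.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(x, D, a, lam1, lam2):
    """Assumed lower-level cost: reconstruction error plus elastic-net penalty."""
    r = x - D @ a
    return 0.5 * r @ r + lam1 * np.abs(a).sum() + 0.5 * lam2 * a @ a

def sparse_code(x, D, lam1=0.1, lam2=0.0, n_iter=200):
    """Lower level: minimise objective(x, D, a, lam1, lam2) over a with ISTA."""
    # Lipschitz constant of the gradient of the smooth part of the cost.
    L = np.linalg.norm(D, 2) ** 2 + lam2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x) + lam2 * a
        a = soft_threshold(a - grad / L, lam1 / L)
    return a

def softmax_loss(W, b, a, y):
    """Upper level: cross-entropy of a softmax classifier applied to code a."""
    z = W @ a + b
    z = z - z.max()                      # stabilise the exponentials
    log_p = z - np.log(np.exp(z).sum())
    return -log_p[y]

# Toy usage with random data standing in for an i-vector (hypothetical sizes).
rng = np.random.default_rng(0)
dim, n_atoms, n_speakers = 8, 20, 5      # n_atoms > dim: overcomplete dictionary
D = rng.standard_normal((dim, n_atoms))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
x = rng.standard_normal(dim)
code = sparse_code(x, D, lam1=0.1, lam2=0.01)
W = 0.01 * rng.standard_normal((n_speakers, n_atoms))
loss = softmax_loss(W, np.zeros(n_speakers), code, y=2)
```

In a bilevel scheme of this kind, the classifier loss is differentiated with respect to both `W` and `D` (through the sparse code), which is what distinguishes joint optimization from training the two stages separately; the sketch above only evaluates the two levels and omits that gradient step.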

Published In

Digital Signal Processing, Volume 89, Issue C
Jun 2019
187 pages

Publisher

Academic Press, Inc.

United States

Author Tags

  1. Speaker identification
  2. Session compensation
  3. Joint optimization
  4. Bilevel framework
