research-article

A bilevel framework for joint optimization of session compensation and classification for speaker identification

Authors:

Jiqing Han

Volume 89, Issue C

Pages 104 - 115

https://doi.org/10.1016/j.dsp.2019.03.008

Published: 01 June 2019

Abstract

Systems based on the i-vector framework are among the most popular in speaker identification (SID). In such systems, session compensation is usually applied first, followed by a classifier. Every session-compensated representation of an i-vector yields a corresponding identification result, so the two stages are coupled. In current SID systems, however, session compensation and the classifier are usually optimized independently, so the session compensation stage is learned with incomplete knowledge of the identification task, which may introduce uncertainty. In this paper, we propose a bilevel framework that jointly optimizes session compensation and the classifier to strengthen the relationship between the two stages. In this framework, sparse coding (SC) with a learned overcomplete dictionary produces the session-compensated feature, and classification is performed with either a softmax classifier or a support vector machine (SVM). We jointly optimize the dictionary and the classifier parameters under a discriminative criterion for the classifier, subject to the SC conditions. The proposed methods are evaluated on the King-ASR-010, VoxCeleb and RSR2015 databases. Compared with typical session compensation techniques, such as linear discriminant analysis (LDA) and nonparametric discriminant analysis (NDA), our methods can be more robust to complex session variability. Compared with the typical classifiers of the i-vector framework, namely cosine distance scoring (CDS) and probabilistic linear discriminant analysis (PLDA), our methods can be better suited to SID, which is a multiclass task.
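The nested structure described in the abstract can be illustrated with a minimal sketch, under several assumptions that are not taken from the paper: the lower level is written as an l1-regularized least-squares problem solved with ISTA, the upper level is a softmax negative log-likelihood on the resulting code, and the data are random stand-ins for length-normalized i-vectors. The names `sparse_code`, `softmax_nll`, `lam` and all sizes are hypothetical. Only the classifier is updated here; the joint dictionary update proposed in the paper is not reproduced.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, lam=0.1, n_iter=300):
    """Lower level (assumed form): approximate
    argmin_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - step * (D.T @ (D @ a - x)), step * lam)
    return a

def softmax_nll(a, W, b, y):
    """Upper level: softmax negative log-likelihood of speaker y given code a.
    Returns the loss and the vector of class probabilities."""
    z = W @ a + b
    z = z - z.max()                          # stabilise the exponentials
    log_p = z - np.log(np.exp(z).sum())
    return -log_p[y], np.exp(log_p)

rng = np.random.default_rng(0)
dim, n_atoms, n_speakers = 20, 40, 5         # toy sizes; n_atoms > dim (overcomplete)
D = rng.standard_normal((dim, n_atoms))
D /= np.linalg.norm(D, axis=0)               # unit-norm dictionary atoms
x = rng.standard_normal(dim)
x /= np.linalg.norm(x)                       # random stand-in for an i-vector
y = 2                                        # hypothetical speaker label

a = sparse_code(x, D)                        # lower-level solution for fixed D
W = np.zeros((n_speakers, n_atoms))
b = np.zeros(n_speakers)
loss_before, p = softmax_nll(a, W, b, y)

# One gradient step on the classifier only; the dictionary D stays fixed here.
grad_z = p.copy()
grad_z[y] -= 1.0                             # gradient of the loss w.r.t. the logits
W -= 0.05 * np.outer(grad_z, a)
b -= 0.05 * grad_z
loss_after, _ = softmax_nll(a, W, b, y)

print(f"nonzeros in code: {np.count_nonzero(a)}/{n_atoms}")
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

In a bilevel scheme of this kind, the upper-level loss depends on the dictionary only through the lower-level solution, which is why optimizing the two stages independently ignores part of the objective.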




Published In


Digital Signal Processing Volume 89, Issue C

Jun 2019

187 pages

ISSN: 1051-2004


Elsevier Inc.

Publisher

Academic Press, Inc.

United States
