Abstract
This paper presents a generalized i-vector representation framework with phonetic tokenization and tandem features for both text independent and text dependent speaker verification. In the conventional i-vector framework, the tokens for calculating the zero-order and first-order Baum-Welch statistics are Gaussian Mixture Model (GMM) components trained on acoustic level MFCC features. Beyond MFCC, we believe that phonetic information offers another direction that can benefit system performance. Our contribution lies in integrating phonetic information into the i-vector representation through several extensions, forming a more generalized i-vector framework. First, the tokens for calculating the zero-order statistics are extended from the MFCC trained GMM components to phonemes, trigrams and tandem feature trained GMM components, using phoneme posterior probabilities. Second, given the zero-order statistics (posterior probabilities on tokens), the feature used to calculate the first-order statistics is also extended from MFCC to tandem features, and need not be the same feature employed by the tokenizer. Third, the zero-order and first-order statistics vectors are concatenated and represented by the simplified supervised i-vector approach, followed by the standard Probabilistic Linear Discriminant Analysis (PLDA) back-end. We study different token and feature combinations, and we show that the feature level fusion of acoustic level MFCC features and phonetic level tandem features with GMM based i-vector representation achieves the best performance for text independent speaker verification. Furthermore, we demonstrate that the phonetic level phoneme constraints introduced by the tandem features help the text dependent speaker verification system to reject wrong password trials and improve performance dramatically.
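The decoupling of tokenizer and feature described above can be illustrated with a minimal sketch. It assumes the per-frame token posteriors are already available from some tokenizer (a GMM, a phoneme recognizer, or similar) and accumulates uncentered statistics; the function name `baum_welch_stats` and the toy values are hypothetical, and the centering, normalization and i-vector extraction steps of a full system are omitted.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def baum_welch_stats(posteriors: Matrix, features: Matrix) -> Tuple[List[float], Matrix]:
    """Accumulate zero- and first-order statistics for one utterance.

    posteriors[t][c]: posterior probability of token c at frame t.
    features[t][d]:   feature vector at frame t; it may come from a
                      different front end than the tokenizer's input.
    Returns (zero, first) with zero[c] = sum_t posteriors[t][c] and
    first[c][d] = sum_t posteriors[t][c] * features[t][d] (uncentered).
    """
    if not posteriors or len(posteriors) != len(features):
        raise ValueError("posteriors and features must cover the same non-empty frame range")
    num_tokens = len(posteriors[0])
    dim = len(features[0])
    zero = [0.0] * num_tokens
    first = [[0.0] * dim for _ in range(num_tokens)]
    for gamma_t, x_t in zip(posteriors, features):
        for c, gamma in enumerate(gamma_t):
            zero[c] += gamma  # soft count of frames assigned to token c
            for d, value in enumerate(x_t):
                first[c][d] += gamma * value  # posterior-weighted feature sum
    return zero, first


# Toy example (made-up numbers): 3 frames, 2 tokens, 2-dimensional features.
toy_posteriors = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
toy_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zero_stats, first_stats = baum_welch_stats(toy_posteriors, toy_features)
```

Because the function only needs frame-aligned posteriors and features, swapping the tokenizer or the feature stream changes the inputs but not the accumulation itself.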
Experimental results are reported on the NIST SRE 2010 common condition 5 female part task and the RSR 2015 part 1 female part task for text independent and text dependent speaker verification, respectively. For the text independent speaker verification task, the proposed generalized i-vector representation outperforms the i-vector baseline by a relative 53 % in terms of equal error rate (EER) and normalized minDCF values. For the text dependent speaker verification task, our proposed approach also reduces the EER significantly, by 23 % to 90 % relative, across different types of trials.
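For readers unfamiliar with the convention, a relative improvement is measured as a fraction of the baseline value rather than as an absolute difference. The short sketch below shows the arithmetic; the helper name `relative_reduction` and the input values are illustrative assumptions and are not results from the paper.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Return the reduction of an error metric as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


# Made-up values: an EER falling from 4.0 % to 2.0 % is a 50 % relative
# reduction, although the absolute difference is only 2 percentage points.
example = relative_reduction(4.0, 2.0)
```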
Acknowledgments
This research is funded in part by the National Natural Science Foundation of China (NSFC 61401524), the Natural Science Foundation of Guangdong Province (2014A030313123), the SYSU-CMU Shunde International Joint Research Institute, and the CMU-SYSU Collaborative Innovation Research Center.
Cite this article
Li, M., Liu, L., Cai, W. et al. Generalized I-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification. J Sign Process Syst 82, 207–215 (2016). https://doi.org/10.1007/s11265-015-1019-z