Voiceprint Identification for Limited Dataset Using the Deep Migration Hybrid Model Based on Transfer Learning
Abstract
1. Introduction
- We propose a TLCNN-RBM model with high accuracy and low computational cost for small voiceprint samples; it consists of a 5-layer CNN, a 2-layer RBM, and a Softmax layer.
- The source and target voiceprint datasets differ, which complicates the transfer process. To deal with this difficulty, we replace the fully connected layers of the CNN with an RBM and a Softmax classifier, both of which are re-trained on the target samples (see the sketch after this list).
- We introduce a novel algorithm (FBN) to speed up and simplify training of the network. On the NIST 2008 SRE benchmark, the carefully designed CNN with FBN reduces training time by 48.04% compared with the same network without FBN.
- We develop software for voiceprint identification using the proposed algorithm and build an intelligent mailbox that unlocks based on voiceprint identification.
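The architecture in these bullets can be pictured with a short PyTorch sketch. This is a minimal reconstruction under stated assumptions (AlexNet-style convolution widths for the 227 × 227 input, single-channel spectrograms, Bernoulli RBMs, and 40 target speakers are our placeholders), not the authors' released code:

```python
import torch
import torch.nn as nn

class RBM(nn.Module):
    """Bernoulli RBM: pre-trained with CD-1, then used as a feed-forward layer."""
    def __init__(self, n_vis, n_hid):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_vis, n_hid) * 0.01)
        self.vb = nn.Parameter(torch.zeros(n_vis))   # visible bias
        self.hb = nn.Parameter(torch.zeros(n_hid))   # hidden bias

    def forward(self, v):
        # Deterministic up-pass once the RBM is stacked into the network.
        return torch.sigmoid(v @ self.W + self.hb)

    @torch.no_grad()
    def cd1_step(self, v0, lr=0.05):
        # One contrastive-divergence (CD-1) update on a batch of target features.
        h0 = torch.sigmoid(v0 @ self.W + self.hb)
        v1 = torch.sigmoid(h0.bernoulli() @ self.W.t() + self.vb)
        h1 = torch.sigmoid(v1 @ self.W + self.hb)
        self.W += lr * (v0.t() @ h0 - v1.t() @ h1) / v0.size(0)
        self.vb += lr * (v0 - v1).mean(0)
        self.hb += lr * (h0 - h1).mean(0)

# 5-layer conv stack (AlexNet-style widths assumed, matching the 227 x 227 input),
# pre-trained on the large source set and then frozen for transfer.
conv = nn.Sequential(
    nn.Conv2d(1, 96, 11, stride=4), nn.BatchNorm2d(96), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(96, 256, 5, padding=2), nn.BatchNorm2d(256), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
)
for p in conv.parameters():
    p.requires_grad = False                           # keep source-domain features fixed

rbm1, rbm2 = RBM(256 * 6 * 6, 1024), RBM(1024, 512)   # widths are placeholders
classifier = nn.Linear(512, 40)                       # 40 = hypothetical speaker count

def logits(x):
    # Only the RBMs and this softmax head see the small target dataset.
    return classifier(rbm2(rbm1(conv(x))))            # CrossEntropyLoss adds log-softmax
```

Freezing the convolution stack is what makes the small target set sufficient: only the two RBMs (via `cd1_step`) and the softmax layer are re-trained.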
2. Small Sample Voiceprint Identification Algorithm
2.1. Pre-Processing
2.2. Pre-Training CNN Network Based on the Source of Large Sample Voiceprint Data
2.3. Data Augmentation
- (1) Input the audio data file.
- (2) Apply the short-time Fourier transform to generate the speech spectrogram.
- (3) Following the principle of convex-lens imaging, placing the object point P at distance L1 (F < L1 < 2F) yields an image larger than the original, as shown in Figure 4a.
- (4) Placing P at distance L2 (L2 = 2F) yields an image the same size as the original, as shown in Figure 4b.
- (5) Placing P at distance L3 (L3 > 2F) yields an image smaller than the original, as shown in Figure 4c.
- (6) The three transformations produce multiple speech spectrograms, and all images are finally normalized to 227 × 227 as input to the convolutional neural network (see the sketch after this list).
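A minimal Python sketch of steps (2) through (6); the window length, overlap, and the concrete scale factors 1.3/1.0/0.7 are our assumptions, since the paper fixes only the size relations of the three lens cases:

```python
import numpy as np
from scipy.signal import stft
from PIL import Image

def spectrogram_image(wave, fs):
    """Step (2): short-time Fourier transform -> log-magnitude grayscale image."""
    _, _, Z = stft(wave, fs=fs, nperseg=512, noverlap=384)   # window sizes assumed
    db = 20 * np.log10(np.abs(Z) + 1e-10)
    db = (db - db.min()) / (db.max() - db.min())             # rescale to [0, 1]
    return Image.fromarray((db * 255).astype(np.uint8))

def lens_views(img, factors=(1.3, 1.0, 0.7), out=227):
    """Steps (3)-(6): magnified (F < L1 < 2F), equal-size (L2 = 2F), and reduced
    (L3 > 2F) views, all placed on a fixed 227 x 227 canvas for the CNN."""
    views = []
    for k in factors:
        size = max(1, int(out * k))
        scaled = img.resize((size, size))
        canvas = Image.new("L", (out, out))      # zero-padded background
        off = (out - size) // 2                  # negative offset -> center crop
        canvas.paste(scaled, (off, off))
        views.append(canvas)
    return views
```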
2.4. Re-Training the TLCNN-RBM Hybrid Model Based on the Target of Voiceprint Data
2.4.1. Restricted Boltzmann Machine Retraining
2.4.2. TLCNN-RBM-FBN Hybrid Model Self-Adaptability
2.5. Voiceprint Identification
3. Experiment by Hold-Out Validation
3.1. Dataset
3.2. Experiment Settings
3.2.1. Experimental Operation Platform and Experiment Settings
3.2.2. Experimental Procedure
3.3. Pre-Training and Testing Results
3.3.1. Comparison of Recognition Performance Based on the Source Dataset (NIST 2008)
3.3.2. Discussion Based on Recognition Performance
3.3.3. Comparison of Network Pre-Training Time and Convergence Speed
3.3.4. Discussion Based on the Convergence Speed and Training Time
3.4. Re-Training and Testing Results
3.4.1. Comparison of Recognition Performance after Transferring the Pre-Trained Model
3.4.2. Comparison of Accuracy Based on the Different Number of Target Training Samples
3.4.3. Discussion Based on Recognition Performance
4. Experiment in Real Scenes
4.1. Dataset
4.2. Experiment Settings and Results
4.3. Discussion
4.4. Application
5. Conclusions and Future Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Islam, M.A. Frequency domain linear prediction-based robust text-dependent speaker identification. In Proceedings of the International Conference on Innovations in Science, Engineering and Technology (ICISET), Dhaka, Bangladesh, 28–29 October 2016; pp. 1–4. [Google Scholar]
- Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1533–1545. [Google Scholar] [CrossRef]
- Huang, J.T.; Li, J.; Gong, Y. An analysis of convolutional neural networks for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 4989–4993. [Google Scholar]
- Lukic, Y.; Vogt, C.; Dürr, O. Speaker identification and clustering using convolutional neural networks. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Salerno, Italy, 13–16 September 2016; pp. 1–6. [Google Scholar]
- Krishnamoorthy, P.; Jayanna, H.; Prasanna, S. Speaker recognition under limited data condition by noise addition. Expert Syst. Appl. 2011, 38, 13487–13490. [Google Scholar] [CrossRef]
- Oquab, M.; Bottou, L.; Laptev, I. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1717–1724. [Google Scholar]
- Azmy, M.M. Classification of lung sounds based on linear prediction cepstral coefficients and support vector machine. In Proceedings of the 2015 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT), Amman, Jordan, 3–5 November 2015; pp. 1–5. [Google Scholar]
- Wang, Y.; Lawlor, B. Speaker recognition based on MFCC and BP neural networks. In Proceedings of the Irish Signals and Systems Conference (ISSC), Killarney, Ireland, 20–21 June 2017; pp. 1–4. [Google Scholar]
- Garcia-Romero, D.; McCree, A. Supervised domain adaptation for I-vector based speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 4047–4051. [Google Scholar]
- Kenny, P.; Boulianne, G.; Ouellet, P. Speaker and session variability in GMM-based speaker verification. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1448–1460. [Google Scholar] [CrossRef]
- Xiong, Z.; Zheng, T.F.; Song, Z. A tree-based kernel selection approach to efficient Gaussian mixture model–universal background model based speaker identification. Speech Commun. 2006, 48, 1273–1282. [Google Scholar] [CrossRef]
- Ferras, M.; Leung, C.C.; Barras, C. Comparison of speaker adaptation methods as feature extraction for SVM-based speaker recognition. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 1366–1378. [Google Scholar] [CrossRef]
- Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef] [PubMed]
- Lecun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Maas, A.L.; Qi, P.; Xie, Z. Building DNN acoustic models for large vocabulary speech recognition. Comput. Speech Lang. 2017, 41, 195–213. [Google Scholar] [CrossRef]
- Tóth, L.; Grósz, T. A comparison of deep neural network training methods for large vocabulary speech recognition. In Proceedings of the International Conference on Text, Speech, and Dialogue, Pilsen, Czech Republic, 1–5 September 2013; pp. 36–43. [Google Scholar]
- Chang, J.; Wang, D.L. Robust speaker recognition based on DNN/i-Vectors and speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5415–5419. [Google Scholar]
- Zhang, C.; Woodland, P.C. DNN speaker adaptation using parameterized sigmoid and ReLU hidden activation functions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5300–5304. [Google Scholar]
- Peddinti, V.; Wang, Y.; Povey, D.; Khudanpur, S. Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Process. Lett. 2018, 25, 373–377. [Google Scholar] [CrossRef]
- Hong, Q.; Zhang, J.; Li, L. Transfer learning method for PLDA-based speaker verification. Speech Commun. 2017, 92, 90–99. [Google Scholar] [CrossRef]
- Huang, Z.; Siniscalchi, S.M.; Lee, C.H. A unified approach to transfer learning of deep neural networks with applications to speaker adaptation in automatic speech recognition. Neurocomputing 2016, 218, 448–459. [Google Scholar] [CrossRef]
- Lim, B.P.; Wong, F.; Li, Y. Transfer learning with bottleneck feature networks for whispered speech recognition. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 1578–1582. [Google Scholar]
- Ghahabi, O.; Hernando, J. Restricted Boltzmann machines for vector representation of speech in speaker recognition. Comput. Speech Lang. 2018, 47, 16–29. [Google Scholar] [CrossRef]
- Zhu, L.Z.; Chen, L.M.; Zhao, D.H. Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN. Sensors 2017, 17, 1694. [Google Scholar] [CrossRef] [PubMed]
- Le, Q.V. Building high-level features using large scale unsupervised learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013; pp. 8595–8598. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
- Dutta, T. Dynamic time warping based approach to text-dependent speaker identification using spectrograms. In Proceedings of the 2008 Congress on Image and Signal Processing (CISP 2008), Hainan, China, 27–30 May 2008; pp. 354–360. [Google Scholar]
- Niu, Y.F.; Zou, D.S.; Niu, Y.D.; He, Z.S.; Tan, H. A breakthrough in speech emotion recognition using deep retinal convolution neural networks. arXiv 2017, arXiv:1707.09917. [Google Scholar]
- Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Comput. 2002, 14, 1771–1800. [Google Scholar] [CrossRef] [PubMed]
- Jing, G.; Du, W.; Guo, Y. Studies on prediction of separation percent in electrodialysis process via BP neural networks and improved BP algorithms. Desalination 2012, 291, 78–93. [Google Scholar] [CrossRef]
- Li, J.; Qiu, T.; Wen, C.; Xie, K.; Wen, F.-Q. Robust Face Recognition Using the Deep C2D-CNN Model Based on Decision-Level Fusion. Sensors 2018, 18, 2080. [Google Scholar] [CrossRef] [PubMed]
- NIST Multimodal Information Group. 2008 NIST Speaker Recognition Evaluation Training Set Part 1 LDC2011S05; Linguistic Data Consortium: Philadelphia, PA, USA, 2011. [Google Scholar]
- DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. Available online: https://catalog.ldc.upenn.edu/ldc93s1 (accessed on 25 December 2017).
| Parameter | CNN | CNN + BN | CNN + FBN | TLCNN-RBM + FBN |
|---|---|---|---|---|
| Learning rate | 0.01 | 0.05 | 0.05 | 0.05 |
| Dropout | 0.5 | --- | --- | --- |
| Weight decay | 10⁻³ | 10⁻³ | 10⁻³ | 10⁻³ |
| Momentum | 0.9 | 0.9 | 0.9 | 0.9 |
| No. of epochs | 80 | 20 | 20 | 20 |
| Activation function | ReLU | ReLU | ReLU | ReLU |
| Cost function | Cross-entropy | Cross-entropy | Cross-entropy | Cross-entropy |
| No. of conv. layers | 5 | 5 | 5 | 5 |
| No. of RBM layers | 0 | 0 | 0 | 2 |
| Input size | 227 × 227 | 227 × 227 | 227 × 227 | 227 × 227 |
| Dataset | NIST 2008 SRE | NIST 2008 SRE | NIST 2008 SRE | TIMIT |
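The hyperparameters in the table above correspond to a standard SGD setup; a minimal PyTorch sketch for the CNN + FBN column, where `model` and `train_loader` are placeholders for the network and the spectrogram batches:

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()          # cross-entropy cost function
optimizer = optim.SGD(model.parameters(),
                      lr=0.05,             # learning rate
                      momentum=0.9,        # momentum
                      weight_decay=1e-3)   # weight decay 10^-3

for epoch in range(20):                    # 20 epochs with batch normalization
    for x, y in train_loader:              # 227 x 227 spectrogram inputs
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```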
| No. of Epochs | Accuracy (%) | EER (%) |
|---|---|---|
| 5 | 88.73 | 5.42 |
| 10 | 90.25 | 4.47 |
| 15 | 96.12 | 1.63 |
| 20 | 97.80 | 1.12 |
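The EER column reports the equal error rate, the operating point at which the false-acceptance rate equals the false-rejection rate. A small NumPy sketch of how it can be estimated from verification scores; the `genuine` and `impostor` score arrays are placeholders:

```python
import numpy as np

def eer(genuine, impostor):
    """Sweep thresholds over all scores and find where FAR meets FRR."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    i = np.argmin(np.abs(far - frr))                              # crossing point
    return (far[i] + frr[i]) / 2
```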
| Training Phase | Pre-Training | Retraining |
|---|---|---|
| Model | CNN + FBN | TLCNN-RBM + FBN |
| Learning rate | 0.05 | 0.05 |
| Weight decay | 10⁻³ | 10⁻³ |
| No. of epochs | 15 | 15 |
| Activation function | ReLU | ReLU |
| Cost function | Cross-entropy | Cross-entropy |
| No. of conv. layers | 5 | 5 |
| No. of RBM layers | 0 | 2 |
| Input size | 227 × 227 | 227 × 227 |
| Dataset | NIST 2008 SRE | Self-built database |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).