Abstract
The role of emotion in everyday human interaction cannot be over-emphasized, yet building an accurate and efficient model for speech emotion classification in affective computing remains a challenging task. Researchers have proposed several approaches for speech emotion classification (SEC) in recent times, but the scarcity of training data continues to limit their performance. This work therefore proposes a deep transfer learning model for SEC, a technique that has produced state-of-the-art results in computer vision. Our approach uses a pre-trained and optimized Visual Geometry Group (VGGNet) convolutional neural network, fine-tuned for optimal performance. Each speech signal is converted, using mel filterbanks and the Fast Fourier Transform (FFT), into a mel-spectrogram image of size 224 × 224 × 3 suitable as input to the deep learning model. After feature extraction by the deep network, a multi-layer perceptron (MLP) is adopted as the classifier. Speech pre-processing was carried out on the Toronto Emotional Speech Set (TESS) corpus used for the study to prevent degraded model performance. Evaluated on the TESS dataset, the proposed model achieves an improved SEC result, with an accuracy of 96.1% and a specificity of 97.4%.
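The mel-spectrogram front end described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the idea (framing, Hann window, FFT, mel filterbank projection), not the authors' implementation; the sample rate, FFT size, hop length, and 64 mel bands are illustrative assumptions, and in the paper the resulting spectrogram is rendered as a 224 × 224 × 3 image before being fed to VGGNet.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=64):
    # Frame the signal, apply a Hann window, take the FFT power spectrum,
    # then project onto the mel filterbank and convert to a log (dB) scale.
    win = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + n_fft] * win)) ** 2
        frames.append(spec)
    power = np.array(frames).T                       # (n_fft//2 + 1, n_frames)
    mel = mel_filterbank(n_mels, n_fft, sr) @ power  # (n_mels, n_frames)
    return 10.0 * np.log10(mel + 1e-10)

# One second of a synthetic 440 Hz tone stands in for a TESS utterance.
sr = 16000
t = np.arange(sr) / sr
spec = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(spec.shape)  # (64, 61): 64 mel bands over 61 frames
```

In practice a library routine (e.g. `librosa.feature.melspectrogram`) would replace this sketch, and the log-mel matrix would be color-mapped and resized to the 224 × 224 × 3 shape that VGGNet expects.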
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Akinpelu, S., Viriri, S. (2022). A Robust Deep Transfer Learning Model for Accurate Speech Emotion Classification. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2022. Lecture Notes in Computer Science, vol 13599. Springer, Cham. https://doi.org/10.1007/978-3-031-20716-7_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20715-0
Online ISBN: 978-3-031-20716-7