Abstract
Speech emotion recognition (SER) has attracted a great deal of research interest, as it plays a critical role in human-machine interaction. Unlike typical visual tasks, SER becomes intractable when convolutional neural networks (CNNs) are employed, owing to their limitations in handling log-mel spectrograms. It is therefore useful to establish a feature-extraction backbone that allows CNNs to maintain the information integrity of speech utterances when utilizing log-mel spectrograms. Moreover, a neural network with a deep stack of layers can suffer performance degradation due to various challenges, including information loss, overfitting, and vanishing gradients. Many studies employ hybrid/multi-modal methods or specialized network designs to mitigate these obstacles; however, such methods are often unstable, hard to configure, and poorly adaptive to different tasks. In this research, we propose a reusable backbone of CNN blocks for SER tasks, inspired by the FishNet model. Denoted deep-shallow convolution with RNN (DSCRNN), the proposed backbone preserves features from both deep and shallow layers, which is effective in improving the quality of features extracted from log-mel spectrograms. Simulation results indicate that the proposed DSCRNN backbone improves accuracy by 2% and 11% over a baseline model with traditional CNN blocks in speaker-independent evaluations on the RAVDESS dataset with 4 and 8 classes, respectively.
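To make the idea concrete, the following is a minimal PyTorch sketch of a deep-shallow backbone in the spirit described above: shallow and deep CNN features computed over a log-mel spectrogram are flattened per time frame, concatenated, and fed to a recurrent layer. The block depths, channel sizes, concatenation-based fusion rule, and bidirectional GRU head are illustrative assumptions, not the authors' exact DSCRNN configuration; the linked repository contains the reference implementation.

# Illustrative sketch only: the channel sizes, the concatenation-based
# deep-shallow fusion, and the GRU head are assumptions inspired by the
# paper's description, not the authors' exact DSCRNN architecture.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU -> frequency-only pooling."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=(2, 1)),  # halve frequency, keep time
        )

    def forward(self, x):
        return self.net(x)

class DSCRNN(nn.Module):
    """Deep and shallow CNN features fused and fed to a recurrent layer."""
    def __init__(self, n_mels=64, n_classes=8):
        super().__init__()
        self.block1 = ConvBlock(1, 16)    # shallow features
        self.block2 = ConvBlock(16, 32)
        self.block3 = ConvBlock(32, 64)   # deep features
        # Frequency bins remaining after one and after three (2, 1) poolings.
        f1, f3 = n_mels // 2, n_mels // 8
        rnn_in = 16 * f1 + 64 * f3        # shallow + deep features per frame
        self.rnn = nn.GRU(rnn_in, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, x):                          # x: (batch, 1, n_mels, time)
        shallow = self.block1(x)                   # (B, 16, n_mels/2, T)
        deep = self.block3(self.block2(shallow))   # (B, 64, n_mels/8, T)

        # Flatten channels x frequency, keep time as the sequence axis,
        # then concatenate shallow and deep descriptors per frame.
        def to_seq(t):
            b, c, f, frames = t.shape
            return t.reshape(b, c * f, frames).transpose(1, 2)  # (B, T, C*F)

        seq = torch.cat([to_seq(shallow), to_seq(deep)], dim=-1)
        out, _ = self.rnn(seq)
        return self.fc(out[:, -1])                 # last step -> class logits

# Example: a batch of log-mel spectrograms (64 mel bands, 100 frames).
model = DSCRNN(n_mels=64, n_classes=8)
logits = model(torch.randn(4, 1, 64, 100))
print(logits.shape)  # torch.Size([4, 8])

Pooling only along the frequency axis keeps the time dimension intact, so the recurrent layer receives one fused feature vector per spectrogram frame rather than a downsampled sequence.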








Data availability
The code associated with the paper is available at https://github.com/devpriyagoel/deep-shallow-convolution-with-recurrent-neural-network.
Code availability
Yes.
References
Abdullah SMSA, Ameen SYA, Sadeeq MA, Zeebaree S (2021) Multimodal emotion recognition using deep learning. J Appl Sci Technol Trends 2(02):52–58
Bänziger T, Scherer KR (2005) The role of intonation in emotional expressions. Speech Commun 46(3–4):252–267
Bechara A, Damasio H, Damasio AR (2000) Emotion, decision making and the orbitofrontal cortex. Cereb Cortex 10(3):295–307
Breazeal C (2002) Regulation and entrainment in human–robot interaction. Int J Robot Res 21(10–11):883–902. https://doi.org/10.1177/0278364902021010096
Cen L, Wu F, Yu ZL, Hu F (2016) A real-time speech emotion recognition system and its application in online learning. In: Emotions, technology, design, and learning. Elsevier, pp 27–46
Chen L, Mao X, Xue Y, Cheng LL (2012) Speech emotion recognition: features and classification models. Digit Signal Process 22(6):1154–1160. https://doi.org/10.1016/j.dsp.2012.05.007
Chen M, He X, Yang J, Zhang H (2018) 3D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process Lett 25(10):1440–1444. https://doi.org/10.1109/LSP.2018.2860246
Cowie R (2009) Perceiving emotion: towards a realistic understanding of the task. Philos Trans R Soc Lond Ser B Biol Sci 364:3515–3525. https://doi.org/10.1098/rstb.2009.0139
Cowie R, Douglas-Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor JG (2001) Emotion recognition in human–computer interaction. IEEE Signal Process Mag 18(1):32–80
El Ayadi M, Kamel MS, Karray F (2011) Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recognit 44(3):572–587. https://doi.org/10.1016/j.patcog.2010.09.020
El Ayadi MMH, Kamel MS, Karray F (2007) Speech emotion recognition using Gaussian mixture vector autoregressive models. In: 2007 IEEE international conference on acoustics, speech and signal processing (ICASSP 2007), vol 4, pp IV-957–IV-960
Giannopoulos P, Perikos I, Hatzilygeroudis I (2018) Deep learning approaches for facial emotion recognition: a case study on FER-2013. In: Advances in hybridization of intelligent methods. Springer, pp 1–16
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, pp 7132–7141. https://doi.org/10.1109/CVPR.2018.00745
Ingale AB, Chaudhari D (2012) Speech emotion recognition. Int J Soft Comput Eng (IJSCE) 2(1):235–238
Jalal M, Loweimi E, Moore R, Hain T (2019) Learning temporal clusters using capsule routing for speech emotion recognition. In: Interspeech 2019, pp 1701–1705. https://doi.org/10.21437/Interspeech.2019-3068
Jones C, Sutherland J (2008) Acoustic emotion recognition for affective computer gaming. In: Affect and emotion in human–computer interaction. Springer, pp 209–219
Lee C, Narayanan S, Pieraccini R (2002) Classifying emotions in human-machine spoken dialogs. In: Proceedings of the IEEE international conference on multimedia and expo (ICME), vol 1, pp 737–740. https://doi.org/10.1109/ICME.2002.1035887
Lee J, Tashev I (2015) High-level feature representation using recurrent neural network for speech emotion recognition. Interspeech 2015. ISCA: international speech communication association
Lim W, Jang D, Lee T (2016) Speech emotion recognition using convolutional and recurrent neural networks. In: 2016 Asia-Pacific signal and information processing association annual summit and conference (APSIPA), pp 1–4
Livingstone SR, Russo FA (2018) The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5):e0196391
Mao X, Chen L, Fu L (2009) Multi-level speech emotion recognition based on HMM and ANN. In: 2009 WRI world congress on computer science and information engineering, vol 7, pp 225–229
Meng H, Yan T, Yuan F, Wei H (2019) Speech emotion recognition from 3D log-mel spectrograms with deep learning network. IEEE Access 7:125868–125881. https://doi.org/10.1109/ACCESS.2019.2938007
Nwe T, Foo S, De Silva L (2003) Speech emotion recognition using hidden Markov models. Speech Commun 41:603–623. https://doi.org/10.1016/S0167-6393(03)00099-2
Osawa H, Orszulak J, Godfrey KM, Coughlin JF (2010) Maintaining learning motivation of older people by combining household appliance with a communication robot. In: 2010 IEEE/RSJ international conference on intelligent robots and systems, pp 5310–5316
Petrushin V (1999) Emotion in speech: recognition and application to call centers. In: Proceedings of artificial neural networks in engineering (710, 22)
Ranganathan H, Chakraborty S, Panchanathan S (2016) Multimodal emotion recognition using deep learning architectures. In: 2016 IEEE winter conference on applications of computer vision (WACV), pp 1–9
Ren M, Nie W, Liu A, Su Y (2019) Multi-modal correlated network for emotion recognition in speech. Vis Inform 3(3):150–155
Rozgic V, Ananthakrishnan S, Saleem S, Kumar R, Vembu A, Prasad R (2012) Emotion recognition using acoustic and lexical features. In: 13th annual conference of the international speech communication association, INTERSPEECH 2012
Schuller B, Rigoll G, Lang M (2003) Hidden Markov model-based speech emotion recognition. In: 2003 international conference on multimedia and expo (ICME ’03), vol 1, p I-401. https://doi.org/10.1109/ICME.2003.1220939
Schuller B, Rigoll G, Lang M (2004) Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In: 2004 IEEE international conference on acoustics, speech, and signal processing, vol 1, pp I–577
Schuller B, Vlasenko B, Eyben F, Rigoll G, Wendemuth A (2009) Acoustic emotion recognition: a benchmark comparison of performances. In: 2009 IEEE workshop on automatic speech recognition and understanding, pp 552–557
Song P, Jin Y, Zha C, Zhao L (2015) Speech emotion recognition method based on hidden factor analysis. Electron Lett 51(1):112–114
Sun S, Pang J, Shi J, Yi S, Ouyang W (2019) FishNet: a versatile backbone for image, region, and pixel level prediction. arXiv preprint arXiv:1901.03495
Tokuno S, Tsumatori G, Shono S, Takei E, Yamamoto T, Suzuki G, Shimura M (2011) Usage of emotion recognition in military health care. In: 2011 defense science research conference and expo (dsr), pp 1–5
Zeng H, Wu Z, Zhang J, Yang C, Zhang H, Dai G, Kong W (2019) EEG emotion classification using an improved SincNet-based deep learning model. Brain Sci. https://doi.org/10.3390/brainsci9110326
Zhang Q, Chen X, Zhan Q, Yang T, Xia S (2017) Respiration-based emotion recognition with deep learning. Comput Ind 92:84–90
Zhao Z, Zheng Y, Zhang Z, Wang H, Zhao Y, Li C (2018) Exploring spatio-temporal representations by integrating attention-based bidirectional-LSTM-RNNs and FCNs for speech emotion recognition. In: Interspeech 2018
Zheng WQ, Yu JS, Zou YX (2015) An experimental study of speech emotion recognition based on deep convolutional neural networks. In: 2015 international conference on affective computing and intelligent interaction (ACII), pp 827–831. https://doi.org/10.1109/ACII.2015.7344669
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
DPG and KM implemented the algorithm and wrote the paper. NDN, NS, and CPL provided guidance and revised the paper.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Consent for publication
Yes.
Consent to participate
Yes.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Goel, D.P., Mahajan, K., Nguyen, N.D. et al. Towards an efficient backbone for preserving features in speech emotion recognition: deep-shallow convolution with recurrent neural network. Neural Comput & Applic 35, 2457–2469 (2023). https://doi.org/10.1007/s00521-022-07723-2
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-022-07723-2