Abstract
Most prior work on Speech Emotion Recognition (SER) has adopted a monolingual approach. The current study extends monolingual SER to a bilingual setting. However, building a generalized emotion recognition system across different languages requires selecting appropriate feature extraction and classification methods, which remains an open question. To address this issue, a promising method is proposed and evaluated. On the one hand, the proposed method relies on a statistics-based parameterization framework that represents each speech utterance as a fixed-length vector. On the other hand, it employs a deep learning approach that combines three convolutional neural network architectures. Based on these, monolingual and bilingual emotion recognition experiments were conducted using the English RAVDESS and Italian EMOVO corpora. The experiments demonstrate the effectiveness of the proposed SER model compared to state-of-the-art monolingual methods, with average accuracies of 87.08% on RAVDESS and 83.90% on EMOVO.
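The abstract does not detail the exact parameterization pipeline, but the core idea of a statistics-based fixed-length representation can be sketched as follows: frame-level acoustic features (e.g. MFCCs) of any duration are collapsed into one vector by stacking per-coefficient statistics. The feature dimensionality (13 coefficients) and the choice of statistics (mean, standard deviation, min, max) here are illustrative assumptions, not the paper's confirmed configuration.

```python
import numpy as np

def statistical_parameterization(frame_features: np.ndarray) -> np.ndarray:
    """Collapse a variable-length sequence of frame-level features
    (shape: n_frames x n_coeffs) into a single fixed-length vector
    by concatenating per-coefficient statistics."""
    stats = [
        frame_features.mean(axis=0),  # average over frames
        frame_features.std(axis=0),   # spread over frames
        frame_features.min(axis=0),
        frame_features.max(axis=0),
    ]
    return np.concatenate(stats)

# Two utterances of different durations map to vectors of the same length,
# which is what makes fixed-input classifiers (e.g. CNNs) applicable.
short_utt = np.random.randn(120, 13)  # 120 frames of 13 coefficients
long_utt = np.random.randn(480, 13)   # 480 frames of 13 coefficients
v1 = statistical_parameterization(short_utt)
v2 = statistical_parameterization(long_utt)
print(v1.shape, v2.shape)  # (52,) (52,)
```

In practice the frame-level features would be extracted from audio (e.g. with a tool such as librosa) rather than generated randomly as in this self-contained sketch.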
Data Availability
The RAVDESS and EMOVO data that support the findings of this study are publicly available at https://zenodo.org/record/1188976#.YwSinnbMK3A and http://voice.fub.it/activities/corpora/emovo/index.html, respectively.
Acknowledgements
The research presented in this paper was supported by the Ministry of Higher Education, Scientific Research and Innovation, the Digital Development Agency (ADA) and the CNRST of Morocco (Alkhawarizmi/2020/01).
Ethics declarations
Conflict of Interests
The authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sekkate, S., Khalil, M. & Adib, A. A statistical feature extraction for deep speech emotion recognition in a bilingual scenario. Multimed Tools Appl 82, 11443–11460 (2023). https://doi.org/10.1007/s11042-022-14051-z