Human-Computer Interaction with Detection of Speaker Emotions Using Convolution Neural Networks

Published: 01 January 2022

Abstract

Emotions play an essential role in human relationships, and many real-time applications rely on interpreting the speaker’s emotion from their speech. Speech emotion recognition (SER) modules support human-computer interaction (HCI) applications, but they are challenging to implement because balanced training data are scarce and it is unclear which features are sufficient for categorization. This research examines how the choice of classification approach, the most appropriate combination of features, and data augmentation affect speech emotion detection accuracy. Selecting the right combination of handcrafted features for the classifier plays an integral part in reducing computational complexity. The proposed classification model, a one-dimensional convolutional neural network (1D CNN), outperforms traditional machine learning approaches. Unlike most earlier studies, which examined emotions primarily through a single-language lens, our analysis covers data sets in several languages. With the most discriminating features and data augmentation, our technique achieves 97.09%, 96.44%, and 83.33% accuracy on the BAVED, ANAD, and SAVEE data sets, respectively.
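The pipeline the abstract describes (handcrafted acoustic features, waveform-level data augmentation, and a 1D CNN classifier) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the use of librosa and Keras, the 40-coefficient MFCC setting, the noise/pitch-shift augmentations, and the three-class output are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code): a 1D CNN over handcrafted MFCC
# features for speech emotion recognition. Library choices (librosa, Keras),
# the 40-coefficient MFCC setting, the augmentation steps, and the 3-class
# label count are illustrative assumptions only.
import numpy as np
import librosa
import tensorflow as tf

def extract_features(path, sr=16000, n_mfcc=40):
    """Load a clip and return a fixed-length MFCC feature vector."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # average over time -> shape (n_mfcc,)

def augment(y, sr=16000):
    """Simple waveform-level augmentation: additive noise and pitch shift."""
    noisy = y + 0.005 * np.random.randn(len(y))
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    return [noisy, shifted]

def build_1d_cnn(n_features=40, n_classes=3):
    """1D CNN classifier over the handcrafted feature vector."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features, 1)),
        tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_1d_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train[..., np.newaxis], y_train, validation_split=0.1, epochs=50)
```

Mean-pooling the MFCCs over time gives a fixed-length vector, so the Conv1D input shape stays constant across clips of different durations; the authors' actual feature set and augmentation strategy may differ.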



Published In

Computational Intelligence and Neuroscience, Volume 2022
ISSN: 1687-5265
EISSN: 1687-5273
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Publisher

Hindawi Limited

London, United Kingdom

