Human-Computer Interaction with Detection of Speaker Emotions Using Convolution Neural Networks

Published: 01 January 2022

Abstract

Emotions play an essential role in human relationships, and many real-time applications rely on interpreting the speaker’s emotion from their speech. Speech emotion recognition (SER) modules support human-computer interaction (HCI) applications, but they are challenging to implement because balanced training data are scarce and it is unclear which features are sufficient for categorization. This research examines how the choice of classification approach, the most appropriate combination of features, and data augmentation affect speech emotion detection accuracy. Selecting the right combination of handcrafted features for the classifier plays an integral part in reducing computational complexity. The proposed classification model, a one-dimensional convolutional neural network (1D CNN), outperforms traditional machine learning approaches. Unlike most earlier studies, which examined emotions primarily through a single-language lens, our analysis covers data sets in several languages. With the most discriminating features and data augmentation, our technique achieves 97.09%, 96.44%, and 83.33% accuracy on the BAVED, ANAD, and SAVEE data sets, respectively.
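The pipeline the abstract describes (handcrafted acoustic features, waveform-level data augmentation, and a 1D CNN classifier) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the use of librosa and Keras, the 40-coefficient MFCC setting, the noise/pitch-shift augmentations, and the three-class output are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code): a 1D CNN over handcrafted MFCC
# features for speech emotion recognition. Library choices (librosa, Keras),
# the 40-coefficient MFCC setting, the augmentation steps, and the 3-class
# label count are illustrative assumptions only.
import numpy as np
import librosa
import tensorflow as tf

def extract_features(path, sr=16000, n_mfcc=40):
    """Load a clip and return a fixed-length MFCC feature vector."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # average over time -> shape (n_mfcc,)

def augment(y, sr=16000):
    """Simple waveform-level augmentation: additive noise and pitch shift."""
    noisy = y + 0.005 * np.random.randn(len(y))
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    return [noisy, shifted]

def build_1d_cnn(n_features=40, n_classes=3):
    """1D CNN classifier over the handcrafted feature vector."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features, 1)),
        tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_1d_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train[..., np.newaxis], y_train, validation_split=0.1, epochs=50)
```

Mean-pooling the MFCCs over time gives a fixed-length vector, so the Conv1D input shape stays constant across clips of different durations; the authors' actual feature set and augmentation strategy may differ.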



Published In

Computational Intelligence and Neuroscience, Volume 2022
ISSN: 1687-5265
EISSN: 1687-5273
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Publisher

Hindawi Limited

London, United Kingdom

