
Dimensionality Reduction and Attention Mechanisms for Extracting Affective State from Sound Spectrograms

Conference paper
Pattern Recognition Applications and Methods (ICPRAM 2020)

Abstract

Emotion recognition (ER) has drawn the interest of many researchers in the field of human-computer interaction, as it is central to applications such as assisted living and personalized content suggestion. For ER-capable systems to be widely adopted in daily life, they must work on data collected in an unobtrusive way. Among the possible data modalities for affective state analysis, which include video and biometrics, speech is considered the least intrusive and has therefore drawn the focus of many research efforts. In this chapter, we discuss methods for analyzing the non-linguistic component of vocalized speech for the purposes of ER. In particular, we propose a method for producing lower-dimensional representations of sound spectrograms that respect their temporal structure. We then explore methods for analyzing such representations, including shallow methods, recurrent neural networks, and attention mechanisms. Our models are evaluated on popular, public datasets for emotion analysis, with promising results.
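The pipeline the abstract outlines (spectrogram extraction, per-frame dimensionality reduction, and soft attention over the temporal axis) can be illustrated with a minimal Python sketch. This is not the authors' implementation; the layer sizes, the seven-class output, and all function and parameter names are assumptions chosen purely for illustration.

```python
# Minimal, illustrative sketch of spectrogram-based ER with attention pooling.
# NOT the chapter's actual models; all dimensions and names are assumptions.
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import spectrogram


def frames_from_audio(signal, fs=16000, n_fft=512):
    """Log-magnitude spectrogram as a (time, frequency) sequence of frames."""
    _, _, S = spectrogram(signal, fs=fs, nperseg=n_fft)
    return np.log(S.T + 1e-8)  # shape: (n_frames, n_fft // 2 + 1)


class AttentivePooling(nn.Module):
    """Reduces each frame to a lower-dimensional vector, then pools the
    sequence with soft attention: a scalar relevance score per frame,
    normalized by softmax, weights a temporal average."""

    def __init__(self, in_dim, hidden=64, n_classes=7):
        super().__init__()
        self.reduce = nn.Linear(in_dim, hidden)  # per-frame dim. reduction
        self.score = nn.Linear(hidden, 1)        # scalar relevance per frame
        self.clf = nn.Linear(hidden, n_classes)  # hypothetical emotion classes

    def forward(self, x):                         # x: (batch, frames, in_dim)
        h = torch.tanh(self.reduce(x))            # (batch, frames, hidden)
        a = torch.softmax(self.score(h), dim=1)   # attention weights over time
        pooled = (a * h).sum(dim=1)               # (batch, hidden)
        return self.clf(pooled)


# Example: one second of noise standing in for a speech clip.
frames = frames_from_audio(np.random.randn(16000).astype(np.float32))
x = torch.from_numpy(frames).unsqueeze(0).float()  # (1, n_frames, freq_bins)
logits = AttentivePooling(in_dim=x.shape[-1])(x)
print(logits.shape)                                # torch.Size([1, 7])
```

The softmax scores act as per-frame relevance weights, so the pooled vector emphasizes the frames most informative for the predicted affective state; this weighted temporal averaging is the core idea behind attention-based pooling of spectrogram sequences.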



Author information


Corresponding author

Correspondence to George Pikramenos.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Pikramenos, G., Kechagias, K., Psallidas, T., Smyrnis, G., Spyrou, E., Perantonis, S. (2020). Dimensionality Reduction and Attention Mechanisms for Extracting Affective State from Sound Spectrograms. In: De Marsico, M., Sanniti di Baja, G., Fred, A. (eds) Pattern Recognition Applications and Methods. ICPRAM 2020. Lecture Notes in Computer Science, vol. 12594. Springer, Cham. https://doi.org/10.1007/978-3-030-66125-0_3


  • DOI: https://doi.org/10.1007/978-3-030-66125-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66124-3

  • Online ISBN: 978-3-030-66125-0

  • eBook Packages: Computer Science (R0)
