DOI: 10.1145/3536220.3558038

Investigating Transformer Encoders and Fusion Strategies for Speech Emotion Recognition in Emergency Call Center Conversations.

Published: 07 November 2022

Abstract

There has been growing interest in using deep learning techniques to recognize emotions from speech. However, real-life emotion datasets collected in call centers are relatively rare and small, making the use of deep learning techniques quite challenging. This research focuses on Transformer-based models to improve speech emotion recognition on patients' speech in French emergency call center dialogues. The experiments were conducted on the CEMO corpus, collected in a French emergency call center, which includes telephone conversations with more than 800 callers and 6 agents. Four emotion classes were selected for these experiments: Anger, Fear, Positive and Neutral state. We compare different Transformer encoders based on the wav2vec2 and BERT models, and explore their fine-tuning as well as fusion of the encoders for emotion recognition from speech. Our objective is to explore how to use these pre-trained models to improve model robustness in the context of a real-life application. We show that using specific pre-trained Transformer encoders improves performance for emotion recognition on the CEMO corpus. The Unweighted Accuracy (UA) of the French pre-trained wav2vec2 adapted to our task is 73.1%, whereas the UA of our baseline model (a temporal CNN-LSTM without pre-training) is 55.8%. We also tested BERT-based encoders: in particular, FlauBERT obtained good performance on both manual (67.1% UA) and automatic (67.9% UA) transcripts. Late and model-level fusion of the speech and text models further improve performance (77.1% UA for late fusion, 76.9% UA for model-level fusion) compared to our best pre-trained speech model at 73.1% UA. To situate our work within the scientific community, we also report results on the widely used IEMOCAP corpus with our best fusion strategy: 70.8% UA. Our results are promising for constructing more robust speech emotion recognition systems for real-world applications.
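
For readers less familiar with the terminology above, the short sketch below illustrates two of the ingredients: late fusion, shown here as a weighted average of the class posteriors produced by the speech and text classifiers, and Unweighted Accuracy (UA), the mean of per-class recalls (macro-averaged recall). This is a generic, illustrative sketch rather than the authors' implementation; the posteriors, the fusion weight alpha, and the class ordering are placeholder assumptions.

# Illustrative sketch only: late fusion of class posteriors and Unweighted Accuracy (UA).
# The arrays below are toy placeholders, not outputs of the paper's models.
import numpy as np
from sklearn.metrics import recall_score

CLASSES = ["Anger", "Fear", "Positive", "Neutral"]  # the four classes studied in the paper

def late_fusion(speech_probs, text_probs, alpha=0.5):
    """Weighted average of per-utterance class posteriors from a speech
    classifier (e.g. wav2vec2-based) and a text classifier (e.g. FlauBERT-based)."""
    return alpha * np.asarray(speech_probs) + (1.0 - alpha) * np.asarray(text_probs)

def unweighted_accuracy(y_true, y_pred):
    """UA = mean of per-class recalls (macro recall), robust to class imbalance."""
    return recall_score(y_true, y_pred, average="macro")

# Three toy utterances with posteriors over the four classes.
speech_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                         [0.2, 0.5, 0.1, 0.2],
                         [0.1, 0.2, 0.2, 0.5]])
text_probs = np.array([[0.4, 0.3, 0.2, 0.1],
                       [0.1, 0.7, 0.1, 0.1],
                       [0.3, 0.1, 0.1, 0.5]])
y_true = np.array([0, 1, 3])  # gold labels as indices into CLASSES
y_pred = late_fusion(speech_probs, text_probs).argmax(axis=1)
print(f"UA: {unweighted_accuracy(y_true, y_pred):.3f}")

Model-level fusion, by contrast, combines the encoders' hidden representations (for example, concatenating pooled wav2vec2 and FlauBERT embeddings) before a joint classifier, rather than merging the final class posteriors.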



          Published In

          ICMI '22 Companion: Companion Publication of the 2022 International Conference on Multimodal Interaction
          November 2022
          225 pages
          ISBN:9781450393898
          DOI:10.1145/3536220
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 07 November 2022


          Author Tags

          1. Transformer-based models
          2. emergency call center
          3. late fusion
          4. model-level fusion
          5. real-life emotional corpus
          6. speech emotion recognition

          Qualifiers

          • Short-paper
          • Research
          • Refereed limited

          Conference

          ICMI '22

          Acceptance Rates

          Overall Acceptance Rate 453 of 1,080 submissions, 42%

          Bibliometrics & Citations

          Bibliometrics

          Article Metrics

• Downloads (Last 12 months): 61
• Downloads (Last 6 weeks): 3
          Reflects downloads up to 24 Dec 2024

          Citations

          Cited By

• (2024) Les émotions ‹in the wild› des appelants d’un centre d’appels d’urgence : vers un système de détection des émotions dans la voix. Langages, N° 234:2, 117-134. https://doi.org/10.3917/lang.234.0117. Online publication date: 29-May-2024
• (2024) Improving Performance of Speech Emotion Recognition Application using Extreme Learning Machine and Utterance-level. 2024 International Seminar on Intelligent Technology and Its Applications (ISITIA), 466-470. https://doi.org/10.1109/ISITIA63062.2024.10668153. Online publication date: 10-Jul-2024
• (2024) Multimodal evaluation of customer satisfaction from voicemails using speech and language representations. Digital Signal Processing, 104820. https://doi.org/10.1016/j.dsp.2024.104820. Online publication date: Oct-2024
• (2024) Spanish MEACorpus 2023: A multimodal speech–text corpus for emotion analysis in Spanish from natural environments. Computer Standards & Interfaces, 90, 103856. https://doi.org/10.1016/j.csi.2024.103856. Online publication date: Aug-2024
• (2024) Combining a multi-feature neural network with multi-task learning for emergency calls severity prediction. Array, 21, 100333. https://doi.org/10.1016/j.array.2023.100333. Online publication date: Mar-2024
• (2024) An effective speaker adaption using deep learning for the identification of speakers in emergency situation. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-024-19373-8. Online publication date: 2-Jul-2024
• (2023) Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations. International Conference on Multimodal Interaction, 337-343. https://doi.org/10.1145/3610661.3616189. Online publication date: 9-Oct-2023
• (2023) GCFormer: A Graph Convolutional Transformer for Speech Emotion Recognition. Proceedings of the 25th International Conference on Multimodal Interaction, 307-313. https://doi.org/10.1145/3577190.3614177. Online publication date: 9-Oct-2023
• (2023) Designing for Control in Nurse-AI Collaboration During Emergency Medical Calls. Proceedings of the 2023 ACM Designing Interactive Systems Conference, 1339-1352. https://doi.org/10.1145/3563657.3596110. Online publication date: 10-Jul-2023
• (2023) Exploring Attention Mechanisms for Multimodal Emotion Recognition in an Emergency Call Center Corpus. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. https://doi.org/10.1109/ICASSP49357.2023.10096112. Online publication date: 4-Jun-2023
