DOI: 10.1145/3536220.3558038

Investigating Transformer Encoders and Fusion Strategies for Speech Emotion Recognition in Emergency Call Center Conversations.

Published: 07 November 2022

Abstract

There has been growing interest in using deep learning techniques to recognize emotions from speech. However, real-life emotion datasets collected in call centers are relatively rare and small, making the use of deep learning techniques quite challenging. This research focuses on Transformer-based models to improve speech emotion recognition on patients' speech in French emergency call center dialogues. The experiments were conducted on the CEMO corpus, collected in a French emergency call center, which includes telephone conversations with more than 800 callers and 6 agents. Four emotion classes were selected for these experiments: Anger, Fear, Positive and Neutral state. We compare different Transformer encoders based on the wav2vec2 and BERT models, and explore their fine-tuning as well as fusion of the encoders for emotion recognition from speech. Our objective is to explore how to use these pre-trained models to improve model robustness in the context of a real-life application. We show that using specific pre-trained Transformer encoders improves performance for emotion recognition on the CEMO corpus. The Unweighted Accuracy (UA) of the French pre-trained wav2vec2 adapted to our task is 73.1%, whereas the UA of our baseline model (a temporal CNN-LSTM without pre-training) is 55.8%. We also tested BERT-based encoders: in particular, FlauBERT obtained good performance on both manual (67.1% UA) and automatic (67.9% UA) transcripts. Late and model-level fusion of the speech and text models further improve performance (77.1% UA for late fusion, 76.9% UA for model-level fusion) compared to our best pre-trained speech model at 73.1% UA. To situate our work within the scientific community, we also report results on the widely used IEMOCAP corpus with our best fusion strategy: 70.8% UA. Our results are promising for constructing more robust speech emotion recognition systems for real-world applications.
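
For readers less familiar with the terminology above, the short sketch below illustrates two of the ingredients: late fusion, shown here as a weighted average of the class posteriors produced by the speech and text classifiers, and Unweighted Accuracy (UA), the mean of per-class recalls (macro-averaged recall). This is a generic, illustrative sketch rather than the authors' implementation; the posteriors, the fusion weight alpha, and the class ordering are placeholder assumptions.

# Illustrative sketch only: late fusion of class posteriors and Unweighted Accuracy (UA).
# The arrays below are toy placeholders, not outputs of the paper's models.
import numpy as np
from sklearn.metrics import recall_score

CLASSES = ["Anger", "Fear", "Positive", "Neutral"]  # the four classes studied in the paper

def late_fusion(speech_probs, text_probs, alpha=0.5):
    """Weighted average of per-utterance class posteriors from a speech
    classifier (e.g. wav2vec2-based) and a text classifier (e.g. FlauBERT-based)."""
    return alpha * np.asarray(speech_probs) + (1.0 - alpha) * np.asarray(text_probs)

def unweighted_accuracy(y_true, y_pred):
    """UA = mean of per-class recalls (macro recall), robust to class imbalance."""
    return recall_score(y_true, y_pred, average="macro")

# Three toy utterances with posteriors over the four classes.
speech_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                         [0.2, 0.5, 0.1, 0.2],
                         [0.1, 0.2, 0.2, 0.5]])
text_probs = np.array([[0.4, 0.3, 0.2, 0.1],
                       [0.1, 0.7, 0.1, 0.1],
                       [0.3, 0.1, 0.1, 0.5]])
y_true = np.array([0, 1, 3])  # gold labels as indices into CLASSES
y_pred = late_fusion(speech_probs, text_probs).argmax(axis=1)
print(f"UA: {unweighted_accuracy(y_true, y_pred):.3f}")

Model-level fusion, by contrast, combines the encoders' hidden representations (for example, concatenating pooled wav2vec2 and FlauBERT embeddings) before a joint classifier, rather than merging the final class posteriors.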



          Published In

          ICMI '22 Companion: Companion Publication of the 2022 International Conference on Multimodal Interaction
          November 2022
          225 pages
          ISBN:9781450393898
          DOI:10.1145/3536220
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 07 November 2022


          Author Tags

          1. Transformer-based models
          2. emergency call center
          3. late fusion
          4. model-level fusion
          5. real-life emotional corpus
          6. speech emotion recognition

          Qualifiers

          • Short-paper
          • Research
          • Refereed limited

          Conference

          ICMI '22

          Acceptance Rates

          Overall Acceptance Rate 453 of 1,080 submissions, 42%

          Bibliometrics & Citations

          Bibliometrics

          Article Metrics

• Downloads (Last 12 months): 61
• Downloads (Last 6 weeks): 3
          Reflects downloads up to 24 Dec 2024

          Citations

          Cited By

• (2024) Les émotions ‹in the wild› des appelants d’un centre d’appels d’urgence : vers un système de détection des émotions dans la voix. Langages, N° 234:2, 117-134. https://doi.org/10.3917/lang.234.0117. Online publication date: 29-May-2024
• (2024) Improving Performance of Speech Emotion Recognition Application using Extreme Learning Machine and Utterance-level. 2024 International Seminar on Intelligent Technology and Its Applications (ISITIA), 466-470. https://doi.org/10.1109/ISITIA63062.2024.10668153. Online publication date: 10-Jul-2024
• (2024) Multimodal evaluation of customer satisfaction from voicemails using speech and language representations. Digital Signal Processing, 104820. https://doi.org/10.1016/j.dsp.2024.104820. Online publication date: Oct-2024
• (2024) Spanish MEACorpus 2023: A multimodal speech–text corpus for emotion analysis in Spanish from natural environments. Computer Standards & Interfaces, 90, 103856. https://doi.org/10.1016/j.csi.2024.103856. Online publication date: Aug-2024
• (2024) Combining a multi-feature neural network with multi-task learning for emergency calls severity prediction. Array, 21, 100333. https://doi.org/10.1016/j.array.2023.100333. Online publication date: Mar-2024
• (2024) An effective speaker adaption using deep learning for the identification of speakers in emergency situation. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-024-19373-8. Online publication date: 2-Jul-2024
• (2023) Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations. International Conference on Multimodal Interaction, 337-343. https://doi.org/10.1145/3610661.3616189. Online publication date: 9-Oct-2023
• (2023) GCFormer: A Graph Convolutional Transformer for Speech Emotion Recognition. Proceedings of the 25th International Conference on Multimodal Interaction, 307-313. https://doi.org/10.1145/3577190.3614177. Online publication date: 9-Oct-2023
• (2023) Designing for Control in Nurse-AI Collaboration During Emergency Medical Calls. Proceedings of the 2023 ACM Designing Interactive Systems Conference, 1339-1352. https://doi.org/10.1145/3563657.3596110. Online publication date: 10-Jul-2023
• (2023) Exploring Attention Mechanisms for Multimodal Emotion Recognition in an Emergency Call Center Corpus. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. https://doi.org/10.1109/ICASSP49357.2023.10096112. Online publication date: 4-Jun-2023
