DOI: 10.1145/3551876.3554813
Research Article

Leveraging Multi-modal Interactions among the Intermediate Representations of Deep Transformers for Emotion Recognition

Published: 10 October 2022

Abstract

Multi-modal emotion recognition aims to recognize emotional states from multi-modal inputs. Existing end-to-end models typically fuse the uni-modal representations in the last layers without leveraging the multi-modal interactions among the intermediate representations. In this paper, we propose the multi-modal Recurrent Intermediate-Layer Aggregation (RILA) model to explore the effectiveness of leveraging the multi-modal interactions among the intermediate representations of deep pre-trained transformers for end-to-end emotion recognition. At the heart of our model is the Intermediate-Representation Fusion Module (IRFM), which consists of a multi-modal aggregation gating module and a multi-modal token attention module. Specifically, at each layer, we first use the multi-modal aggregation gating module to capture the utterance-level interactions across the modalities and layers. We then utilize the multi-modal token attention module to leverage the token-level multi-modal interactions. The experimental results on IEMOCAP and CMU-MOSEI show that our model achieves state-of-the-art performance, benefiting from fully exploiting the multi-modal interactions among the intermediate representations.
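The abstract describes the IRFM only at a high level, so the PyTorch sketch below is a hypothetical reading of that description rather than the authors' implementation: the class name IRFMSketch, the sigmoid gate over mean-pooled utterance features, the cross-modal multi-head attention, and all dimensions are assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn


class IRFMSketch(nn.Module):
    """Hypothetical sketch of the Intermediate-Representation Fusion Module (IRFM).

    Only the two sub-modules named in the abstract are modelled:
      (1) a multi-modal aggregation gating module for utterance-level interactions,
      (2) a multi-modal token attention module for token-level interactions.
    The concrete formulation below is an assumption, not the paper's equations.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Utterance-level gate computed from mean-pooled features of both modalities.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Token-level cross-modal attention: queries from modality A, keys/values from B.
        self.token_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hid_a: torch.Tensor, hid_b: torch.Tensor) -> torch.Tensor:
        # hid_a, hid_b: (batch, seq_len, dim) intermediate representations of two
        # modalities taken from the same transformer layer.
        pooled = torch.cat([hid_a.mean(dim=1), hid_b.mean(dim=1)], dim=-1)
        gate = self.gate(pooled).unsqueeze(1)                  # (batch, 1, dim)
        gated_a = gate * hid_a                                 # utterance-level interaction
        attended, _ = self.token_attn(gated_a, hid_b, hid_b)   # token-level interaction
        return self.norm(gated_a + attended)


# Illustrative usage: fuse layer-wise text and audio hidden states of matching width.
if __name__ == "__main__":
    irfm = IRFMSketch(dim=768)
    text_hidden = torch.randn(2, 50, 768)    # e.g. a text-transformer intermediate layer
    audio_hidden = torch.randn(2, 200, 768)  # e.g. an audio-transformer intermediate layer
    fused = irfm(text_hidden, audio_hidden)
    print(fused.shape)                       # torch.Size([2, 50, 768])
```

In the full model, such a fusion module would presumably be applied repeatedly across the intermediate layers of the pre-trained uni-modal transformers before the final emotion classifier; those details are not specified in the abstract shown on this page.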



Published In

MuSe '22: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge
October 2022
118 pages
ISBN: 9781450394840
DOI: 10.1145/3551876

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2022


Author Tags

  1. multi-modal emotion recognition
  2. multi-modal fusion

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China
  • National Key R&D Program of China

Conference

MM '22

Acceptance Rates

MuSe '22 Paper Acceptance Rate: 14 of 17 submissions, 82%
Overall Acceptance Rate: 14 of 17 submissions, 82%


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 207
  • Downloads (last 6 weeks): 8
Reflects downloads up to 27 Jan 2025


Citations

Cited By

  • (2024) DQ-Former: Querying Transformer with Dynamic Modality Priority for Cognitive-aligned Multimodal Emotion Recognition in Conversation. Proceedings of the 32nd ACM International Conference on Multimedia, 4795-4804. https://doi.org/10.1145/3664647.3681599. Online publication date: 28-Oct-2024.
  • (2024) A Review of Multimodal Sentiment Analysis: Modal Fusion and Representation. 2024 International Wireless Communications and Mobile Computing (IWCMC), 0049-0054. https://doi.org/10.1109/IWCMC61514.2024.10592484. Online publication date: 27-May-2024.
  • (2024) Automatic movie genre classification & emotion recognition via a BiProjection Multimodal Transformer. Information Fusion, 113:C. https://doi.org/10.1016/j.inffus.2024.102641. Online publication date: 21-Nov-2024.
  • (2024) Dataset Construction for Fine-Grained Emotion Analysis in Catering Review Data. Web Information Systems and Applications, 264-276. https://doi.org/10.1007/978-981-97-7707-5_23. Online publication date: 11-Sep-2024.
  • (2023) Transformer-Based Self-Supervised Multimodal Representation Learning for Wearable Emotion Recognition. IEEE Transactions on Affective Computing, 15:1, 157-172. https://doi.org/10.1109/TAFFC.2023.3263907. Online publication date: 3-Apr-2023.
