DOI: 10.1145/3343031.3351039
Research article

Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition

Published: 15 October 2019

Abstract

Emotion recognition in dyadic communication is challenging because: 1. Extracting informative modality-specific representations requires disparate feature-extractor designs due to the heterogeneous input data formats. 2. Effectively and efficiently fusing unimodal features and learning associations between dyadic utterances is critical to model generalization in real-world scenarios. 3. Disagreeing annotations prevent previous approaches from precisely predicting emotions in context. To address these issues, we propose an efficient dyadic fusion network that relies only on an attention mechanism to select representative vectors, fuse modality-specific features, and learn the sequence information. Our approach has three distinct characteristics: 1. Instead of using a recurrent neural network to extract temporal associations as in most previous research, we introduce multiple sub-view attention layers to compute the relevant dependencies among sequential utterances; this significantly improves model efficiency. 2. To improve fusion performance, we design a learnable mutual correlation factor inside each attention layer to compute associations across different modalities. 3. To overcome the label-disagreement issue, we embed the labels from all annotators into a k-dimensional vector and transform the categorical problem into a regression problem; this provides more accurate annotation information and makes full use of the entire dataset. We evaluate the proposed model on two published multimodal emotion recognition datasets: IEMOCAP and MELD. Our model significantly outperforms previous state-of-the-art research by 3.8%-7.5% accuracy while being more efficient.
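To make the second characteristic concrete, the cross-modal step can be pictured as attention in which a learnable correlation matrix scores how strongly the acoustic features of one position relate to the lexical features of another. The snippet below is a minimal NumPy sketch of that idea; the function name, tensor shapes, and the concatenation-based fusion at the end are illustrative assumptions, not the paper's exact layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_correlation_attention(audio, text, corr_factor):
    """Hypothetical cross-modal attention with a learnable correlation factor.

    audio:       (T, d) acoustic utterance features
    text:        (T, d) lexical utterance features
    corr_factor: (d, d) learnable mutual correlation matrix
    All names and shapes are assumptions made for this sketch.
    """
    d = audio.shape[-1]
    # Score every audio position against every text position through the factor.
    scores = audio @ corr_factor @ text.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)        # attention over text positions
    attended = weights @ text                 # text summary attended per audio step
    return np.concatenate([audio, attended], axis=-1)  # simple concatenation fusion

# Toy usage
rng = np.random.default_rng(0)
T, d = 4, 8
fused = mutual_correlation_attention(rng.normal(size=(T, d)),
                                     rng.normal(size=(T, d)),
                                     0.1 * rng.normal(size=(d, d)))
print(fused.shape)  # (4, 16)
```

In the paper the factor sits inside each sub-view attention layer; a single matrix stands in for it here to keep the example short.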

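The third characteristic, handling annotator disagreement, can likewise be illustrated in a few lines: the labels from all annotators are pooled into a k-dimensional normalized vote vector and the model is trained by regression against that soft target. This is a sketch of the idea as stated in the abstract; the emotion label set and the mean-squared-error loss are assumptions.

```python
import numpy as np

EMOTIONS = ["angry", "happy", "sad", "neutral"]  # illustrative label set

def annotations_to_soft_label(annotator_labels):
    """Embed the labels from all annotators into a k-dimensional vector
    (normalized vote counts), turning hard classification into regression."""
    counts = np.zeros(len(EMOTIONS))
    for label in annotator_labels:
        counts[EMOTIONS.index(label)] += 1
    return counts / counts.sum()

# Three annotators disagree on one utterance:
target = annotations_to_soft_label(["happy", "happy", "neutral"])
print(target)  # approx. [0.  0.67  0.  0.33]

# A regression loss against the soft target uses every annotation instead of
# discarding utterances that lack majority agreement.
prediction = np.array([0.10, 0.60, 0.05, 0.25])
mse = float(np.mean((prediction - target) ** 2))
print(round(mse, 4))
```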

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2019

Author Tags

  1. attention mechanism
  2. dyadic communication
  3. multimodal fusion network
  4. mutual correlation attentive factor
  5. speech emotion recognition

Qualifiers

  • Research-article

Funding Sources

  • National Institutes of Health

Conference

MM '19
Acceptance Rates

MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 34
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 23 Dec 2024

Citations

Cited By

  • (2023) Theoretical and methodological approaches to identifying deep accumulations of oil and gas in oil and gas basins of the Russian Federation. Frontiers in Earth Science, 11. DOI: 10.3389/feart.2023.1192051. Online publication date: 17-May-2023
  • (2023) Counterfactual Scenario-relevant Knowledge-enriched Multi-modal Emotion Reasoning. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(5s), 1-25. DOI: 10.1145/3583690. Online publication date: 7-Jun-2023
  • (2023) StyleBERT: Text-audio sentiment analysis with Bi-directional Style Enhancement. Information Systems, 114, 102147. DOI: 10.1016/j.is.2022.102147. Online publication date: Mar-2023
  • (2022) Dynamic Emotion Modeling With Learnable Graphs and Graph Inception Network. IEEE Transactions on Multimedia, 24, 780-790. DOI: 10.1109/TMM.2021.3059169. Online publication date: 2022
  • (2022) Temporal Relation Inference Network for Multimodal Speech Emotion Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(9), 6472-6485. DOI: 10.1109/TCSVT.2022.3163445. Online publication date: Sep-2022
  • (2022) Multi-Classifier Interactive Learning for Ambiguous Speech Emotion Recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing, 30, 695-705. DOI: 10.1109/TASLP.2022.3145287. Online publication date: 25-Jan-2022
  • (2022) A Commonsense Knowledge Enhanced Network with Retrospective Loss for Emotion Recognition in Spoken Dialog. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7027-7031. DOI: 10.1109/ICASSP43922.2022.9746909. Online publication date: 23-May-2022
  • (2021) C-GCN: Correlation Based Graph Convolutional Network for Audio-Video Emotion Recognition. IEEE Transactions on Multimedia, 23, 3793-3804. DOI: 10.1109/TMM.2020.3032037. Online publication date: 2021
  • (2021) Progressive Co-Teaching for Ambiguous Speech Emotion Recognition. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6264-6268. DOI: 10.1109/ICASSP39728.2021.9414494. Online publication date: 6-Jun-2021
  • (2021) DA-GCN: A Dependency-Aware Graph Convolutional Network for Emotion Recognition in Conversations. Neural Information Processing, 470-481. DOI: 10.1007/978-3-030-92238-2_39. Online publication date: 5-Dec-2021
