DOI: 10.1145/3343031.3351039
Research article

Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition

Published: 15 October 2019

Abstract

Emotion recognition in dyadic communication is challenging because: 1. Extracting informative modality-specific representations requires disparate feature-extractor designs due to the heterogeneous input data formats. 2. Effectively and efficiently fusing unimodal features and learning associations between dyadic utterances is critical to model generalization in real-world scenarios. 3. Disagreeing annotations prevent previous approaches from precisely predicting emotions in context. To address these issues, we propose an efficient dyadic fusion network that relies only on an attention mechanism to select representative vectors, fuse modality-specific features, and learn the sequence information. Our approach has three distinct characteristics: 1. Instead of using a recurrent neural network to extract temporal associations as in most previous research, we introduce multiple sub-view attention layers to compute the relevant dependencies among sequential utterances; this significantly improves model efficiency. 2. To improve fusion performance, we design a learnable mutual correlation factor inside each attention layer to compute associations across different modalities. 3. To overcome the label-disagreement issue, we embed the labels from all annotators into a k-dimensional vector and transform the categorical problem into a regression problem; this provides more accurate annotation information and makes full use of the entire dataset. We evaluate the proposed model on two published multimodal emotion recognition datasets: IEMOCAP and MELD. Our model significantly outperforms previous state-of-the-art research by 3.8%-7.5% accuracy while being more efficient.
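To make the second characteristic concrete, the cross-modal step can be pictured as attention in which a learnable correlation matrix scores how strongly the acoustic features of one position relate to the lexical features of another. The snippet below is a minimal NumPy sketch of that idea; the function name, tensor shapes, and the concatenation-based fusion at the end are illustrative assumptions, not the paper's exact layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mutual_correlation_attention(audio, text, corr_factor):
    """Hypothetical cross-modal attention with a learnable correlation factor.

    audio:       (T, d) acoustic utterance features
    text:        (T, d) lexical utterance features
    corr_factor: (d, d) learnable mutual correlation matrix
    All names and shapes are assumptions made for this sketch.
    """
    d = audio.shape[-1]
    # Score every audio position against every text position through the factor.
    scores = audio @ corr_factor @ text.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)        # attention over text positions
    attended = weights @ text                 # text summary attended per audio step
    return np.concatenate([audio, attended], axis=-1)  # simple concatenation fusion

# Toy usage
rng = np.random.default_rng(0)
T, d = 4, 8
fused = mutual_correlation_attention(rng.normal(size=(T, d)),
                                     rng.normal(size=(T, d)),
                                     0.1 * rng.normal(size=(d, d)))
print(fused.shape)  # (4, 16)
```

In the paper the factor sits inside each sub-view attention layer; a single matrix stands in for it here to keep the example short.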

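The third characteristic, handling annotator disagreement, can likewise be illustrated in a few lines: the labels from all annotators are pooled into a k-dimensional normalized vote vector and the model is trained by regression against that soft target. This is a sketch of the idea as stated in the abstract; the emotion label set and the mean-squared-error loss are assumptions.

```python
import numpy as np

EMOTIONS = ["angry", "happy", "sad", "neutral"]  # illustrative label set

def annotations_to_soft_label(annotator_labels):
    """Embed the labels from all annotators into a k-dimensional vector
    (normalized vote counts), turning hard classification into regression."""
    counts = np.zeros(len(EMOTIONS))
    for label in annotator_labels:
        counts[EMOTIONS.index(label)] += 1
    return counts / counts.sum()

# Three annotators disagree on one utterance:
target = annotations_to_soft_label(["happy", "happy", "neutral"])
print(target)  # approx. [0.  0.67  0.  0.33]

# A regression loss against the soft target uses every annotation instead of
# discarding utterances that lack majority agreement.
prediction = np.array([0.10, 0.60, 0.05, 0.25])
mse = float(np.mean((prediction - target) ** 2))
print(round(mse, 4))
```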

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2019

Author Tags

  1. attention mechanism
  2. dyadic communication
  3. multimodal fusion network
  4. mutual correlation attentive factor
  5. speech emotion recognition

Qualifiers

  • Research-article

Funding Sources

  • National Institutes of Health

Conference

MM '19
Acceptance Rates

MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 34
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 23 Dec 2024

Citations

Cited By

  • (2023) Theoretical and methodological approaches to identifying deep accumulations of oil and gas in oil and gas basins of the Russian Federation. Frontiers in Earth Science, 11. DOI: 10.3389/feart.2023.1192051. Online publication date: 17-May-2023
  • (2023) Counterfactual Scenario-relevant Knowledge-enriched Multi-modal Emotion Reasoning. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(5s), 1-25. DOI: 10.1145/3583690. Online publication date: 7-Jun-2023
  • (2023) StyleBERT: Text-audio sentiment analysis with Bi-directional Style Enhancement. Information Systems, 114, 102147. DOI: 10.1016/j.is.2022.102147. Online publication date: Mar-2023
  • (2022) Dynamic Emotion Modeling With Learnable Graphs and Graph Inception Network. IEEE Transactions on Multimedia, 24, 780-790. DOI: 10.1109/TMM.2021.3059169. Online publication date: 2022
  • (2022) Temporal Relation Inference Network for Multimodal Speech Emotion Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 32(9), 6472-6485. DOI: 10.1109/TCSVT.2022.3163445. Online publication date: Sep-2022
  • (2022) Multi-Classifier Interactive Learning for Ambiguous Speech Emotion Recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing, 30, 695-705. DOI: 10.1109/TASLP.2022.3145287. Online publication date: 25-Jan-2022
  • (2022) A Commonsense Knowledge Enhanced Network with Retrospective Loss for Emotion Recognition in Spoken Dialog. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7027-7031. DOI: 10.1109/ICASSP43922.2022.9746909. Online publication date: 23-May-2022
  • (2021) C-GCN: Correlation Based Graph Convolutional Network for Audio-Video Emotion Recognition. IEEE Transactions on Multimedia, 23, 3793-3804. DOI: 10.1109/TMM.2020.3032037. Online publication date: 2021
  • (2021) Progressive Co-Teaching for Ambiguous Speech Emotion Recognition. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6264-6268. DOI: 10.1109/ICASSP39728.2021.9414494. Online publication date: 6-Jun-2021
  • (2021) DA-GCN: A Dependency-Aware Graph Convolutional Network for Emotion Recognition in Conversations. Neural Information Processing, 470-481. DOI: 10.1007/978-3-030-92238-2_39. Online publication date: 5-Dec-2021
