DOI: 10.1145/3503161.3548137

Dynamically Adjust Word Representations Using Unaligned Multimodal Information

Published: 10 October 2022

Abstract

Multimodal Sentiment Analysis is a promising research area for modeling multiple heterogeneous modalities. Two major challenges in this area are that a) multimodal data is unaligned in nature because each modality is sampled at a different rate, and b) long-range dependencies exist between elements across modalities. These challenges make efficient multimodal fusion difficult. In this work, we propose a novel end-to-end network named the Cross Hyper-modality Fusion Network (CHFN). CHFN is an interpretable Transformer-based neural model that provides an efficient framework for fusing unaligned multimodal sequences. The heart of our model is to dynamically adjust word representations in different non-verbal contexts using unaligned multimodal sequences. It captures the influence of non-verbal behavioral information at the scale of entire utterances and then integrates this influence into the verbal expression. We conducted experiments on the publicly available multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results demonstrate that our model surpasses state-of-the-art models. In addition, we visualize the learned interactions between the language modality and non-verbal behavioral information and explore the underlying dynamics of multimodal language data.
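The abstract describes the adjustment mechanism only at a high level. The sketch below is a minimal, hypothetical PyTorch illustration of the general idea, not the authors' CHFN implementation: language tokens act as attention queries over unaligned audio and visual sequences (so no pre-alignment is required), and the attended non-verbal context is used to shift the word representations. All module and parameter names here (CrossModalWordAdjuster, attn_audio, gate, and the toy sequence lengths) are invented for illustration.

```python
# Hypothetical sketch: adjust word representations with unaligned non-verbal context.
# Not the authors' released code; names and dimensions are illustrative only.
import torch
import torch.nn as nn


class CrossModalWordAdjuster(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        # Language tokens are queries; audio/visual frames are keys/values,
        # so the three sequences may have different lengths (sampling rates).
        self.attn_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(3 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, words, audio, visual):
        # words: (B, Lw, d), audio: (B, La, d), visual: (B, Lv, d); La, Lv need not equal Lw.
        a_ctx, _ = self.attn_audio(words, audio, audio)    # per-word acoustic context
        v_ctx, _ = self.attn_visual(words, visual, visual) # per-word visual context
        # Gate the non-verbal influence and shift the word vectors.
        shift = torch.tanh(self.gate(torch.cat([words, a_ctx, v_ctx], dim=-1)))
        return self.norm(words + shift)                    # adjusted word representations


if __name__ == "__main__":
    B, Lw, La, Lv, d = 2, 20, 375, 150, 128  # unaligned lengths from different sampling rates
    adjuster = CrossModalWordAdjuster(d_model=d)
    out = adjuster(torch.randn(B, Lw, d), torch.randn(B, La, d), torch.randn(B, Lv, d))
    print(out.shape)  # torch.Size([2, 20, 128])
```

Because the queries come from the language stream and the keys/values from the non-verbal streams, the attention maps themselves can be inspected to visualize which frames influence each word, which is the kind of interpretability the abstract refers to.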




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2022


Author Tags

  1. multimodal fusion
  2. multimodal representations
  3. multimodal sentiment analysis

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China for Intergovernmental International Science and Technology Innovation Cooperation Project
  • National Natural Science Foundation of China
  • Key Laboratory of Brain Machine Collaborative Intelligence

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%


Cited By

  • (2024) Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1-23. https://doi.org/10.1145/3640343. Online publication date: 7-Feb-2024
  • (2024) Deep Modular Co-Attention Shifting Network for Multimodal Sentiment Analysis. ACM Transactions on Multimedia Computing, Communications, and Applications 20(4), 1-23. https://doi.org/10.1145/3634706. Online publication date: 11-Jan-2024
  • (2024) Multimodal Sentiment Analysis for Movie Scenes Based on a Few-Shot Learning Approach. International Journal of Pattern Recognition and Artificial Intelligence 38(05). https://doi.org/10.1142/S0218001424510091. Online publication date: 8-May-2024
  • (2024) UniMF: A Unified Multimodal Framework for Multimodal Sentiment Analysis in Missing Modalities and Unaligned Multimodal Sequences. IEEE Transactions on Multimedia 26, 5753-5768. https://doi.org/10.1109/TMM.2023.3338769. Online publication date: 2024
  • (2024) A Review of Multimodal Sentiment Analysis: Modal Fusion and Representation. 2024 International Wireless Communications and Mobile Computing (IWCMC), 0049-0054. https://doi.org/10.1109/IWCMC61514.2024.10592484. Online publication date: 27-May-2024
  • (2024) Customising General Large Language Models for Specialised Emotion Recognition Tasks. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 11326-11330. https://doi.org/10.1109/ICASSP48485.2024.10447044. Online publication date: 14-Apr-2024
  • (2024) PSPN: Pseudo-Siamese Pyramid Network for multimodal emotion analysis. Cognitive Neurodynamics. https://doi.org/10.1007/s11571-024-10123-y. Online publication date: 28-May-2024
  • (2023) UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization. Proceedings of the 31st ACM International Conference on Multimedia, 8749-8759. https://doi.org/10.1145/3581783.3613767. Online publication date: 26-Oct-2023
  • (2023) Detecting Facial Action Units From Global-Local Fine-Grained Expressions. IEEE Transactions on Circuits and Systems for Video Technology 34(2), 983-994. https://doi.org/10.1109/TCSVT.2023.3288903. Online publication date: 23-Jun-2023
  • (2023) Recent Advancements and Challenges in Multimodal Sentiment Analysis: A Survey. 2023 International Conference on Machine Learning and Cybernetics (ICMLC), 464-469. https://doi.org/10.1109/ICMLC58545.2023.10327944. Online publication date: 9-Jul-2023
