
Deep Modular Co-Attention Shifting Network for Multimodal Sentiment Analysis

Published: 11 January 2024

    Abstract

    Human Multimodal Sentiment Analysis (MSA) is an attractive research area that studies sentiment expressed through multiple heterogeneous modalities. While transformer-based methods have achieved great success, designing an effective “co-attention” model to associate the text modality with nonverbal modalities remains challenging. There are two main problems: 1) the dominant role of the text modality is underutilized, and 2) the interactions between modalities are not sufficiently explored. This paper proposes a deep modular Co-Attention Shifting Network (CoASN) for MSA. A Cross-modal Modulation Module based on Co-attention (CMMC) and an Advanced Modality-mixing Adaptation Gate (AMAG) are constructed. The CMMC consists of Text-guided Co-Attention (TCA) and Interior Transformer Encoder (ITE) units that capture inter-modal and intra-modal features, respectively. With the text modality as the core, the CMMC module guides and promotes the expression of emotion in the nonverbal modalities, while the nonverbal modalities enrich the text-based multimodal sentiment information. In addition, the AMAG module is introduced to explore the dynamic correlations among all modalities. In particular, this efficient module first captures the nonverbal shifted representations and then combines them to compute the shifted word embedding representations used for the final MSA tasks. Extensive experiments on two commonly used datasets, CMU-MOSI and CMU-MOSEI, demonstrate that our proposed method outperforms state-of-the-art approaches.
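
    To make the shifting mechanism concrete, the sketch below shows one plausible way a modality-mixing adaptation gate could compute nonverbal shift vectors and displace word embeddings, in the spirit of the AMAG description above. It is a minimal, hypothetical PyTorch illustration; the class name, layer choices, and the beta scaling factor are assumptions made for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AdaptationGateSketch(nn.Module):
    """Hypothetical modality-mixing adaptation gate (illustrative only):
    gated nonverbal features produce a shift vector that displaces each
    word embedding before the downstream sentiment prediction head."""

    def __init__(self, d_text, d_audio, d_visual, beta=0.5):
        super().__init__()
        # Gates weighing how strongly each nonverbal stream may influence a word.
        self.gate_a = nn.Linear(d_text + d_audio, d_text)
        self.gate_v = nn.Linear(d_text + d_visual, d_text)
        # Projections from the nonverbal feature spaces into the text space.
        self.proj_a = nn.Linear(d_audio, d_text)
        self.proj_v = nn.Linear(d_visual, d_text)
        self.beta = beta  # bounds the magnitude of the nonverbal shift
        self.norm = nn.LayerNorm(d_text)

    def forward(self, text, audio, visual):
        # text:   (batch, seq, d_text)   word embeddings
        # audio:  (batch, seq, d_audio)  word-aligned acoustic features
        # visual: (batch, seq, d_visual) word-aligned visual features
        g_a = torch.sigmoid(self.gate_a(torch.cat([text, audio], dim=-1)))
        g_v = torch.sigmoid(self.gate_v(torch.cat([text, visual], dim=-1)))

        # Nonverbal shift vector expressed in the text embedding space.
        shift = g_a * self.proj_a(audio) + g_v * self.proj_v(visual)

        # Scale the shift relative to the word embedding norm so that the
        # text modality remains dominant, then displace the embeddings.
        scale = self.beta * text.norm(dim=-1, keepdim=True) / (
            shift.norm(dim=-1, keepdim=True) + 1e-6
        )
        return self.norm(text + torch.clamp(scale, max=1.0) * shift)
```

    For instance, with 768-dimensional word embeddings and lower-dimensional word-aligned acoustic and visual features, the gate would be instantiated as AdaptationGateSketch(768, d_audio, d_visual) and applied to the aligned sequences before the regression head (dimensions here are illustrative); the actual AMAG additionally models the dynamic cross-modal correlations described in the abstract.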



    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 20, Issue 4
    April 2024
    676 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3613617
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 January 2024
    Online AM: 27 November 2023
    Accepted: 21 November 2023
    Revised: 09 September 2023
    Received: 27 April 2023
    Published in TOMM Volume 20, Issue 4

    Author Tags

    1. Multimodal sentiment analysis
    2. co-attention
    3. cross-modal modulation
    4. fine-tuning
    5. shifted representations
    6. modality-mixing adaptation gate

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Fundamental Research Funds for the Central Universities of China
    • Provincial Natural Science Research Project
