
Self-Adaptive Representation Learning Model for Multi-Modal Sentiment and Sarcasm Joint Analysis

Published: 11 January 2024

Abstract

Sentiment and sarcasm are intimately related and complex, as sarcasm often deliberately elicits an emotional response to achieve a specific purpose. The main challenges in multi-modal joint sentiment and sarcasm detection are fusing multi-modal representations and modeling the intrinsic relationship between sentiment and sarcasm. To address these challenges, we propose a single-input-stream self-adaptive representation learning model (SRLM) for joint sentiment and sarcasm recognition. Specifically, we divide the image into blocks to learn serialized image features and fuse them with textual features as the input to the target model. We then introduce an adaptive representation learning network that uses a gating mechanism for sarcasm and sentiment classification. In this framework, each task is equipped with a dedicated expert network responsible for learning task-specific information, while shared expert knowledge is acquired and weighted through the gating network. Finally, comprehensive experiments on two publicly available datasets, Memotion and MUStARD, demonstrate the effectiveness of the proposed model compared to state-of-the-art baselines, revealing a notable improvement in performance on both the sentiment and sarcasm tasks.
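The gated multi-task design the abstract describes (per-task dedicated experts plus shared experts mixed by a gating network) resembles a mixture-of-experts head. Below is a minimal PyTorch sketch of such a head, not the authors' released code; the class name GatedMultiTaskHead, the layer sizes, the number of experts, and the exact wiring of shared and dedicated experts are illustrative assumptions.

```python
# Sketch of a gated multi-task head: shared experts weighted per task by a
# gating network, plus one dedicated expert per task. All names and sizes
# are assumptions, not the paper's specification.
import torch
import torch.nn as nn


class GatedMultiTaskHead(nn.Module):
    def __init__(self, dim=768, num_shared_experts=4, hidden=256):
        super().__init__()
        # Shared experts: learn knowledge useful to both tasks.
        self.shared = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
             for _ in range(num_shared_experts)]
        )
        # Dedicated experts: task-specific information for each task.
        self.sent_expert = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.sarc_expert = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        # Per-task gates produce softmax weights over the shared experts.
        self.sent_gate = nn.Linear(dim, num_shared_experts)
        self.sarc_gate = nn.Linear(dim, num_shared_experts)
        self.sent_cls = nn.Linear(hidden, 3)  # e.g. negative / neutral / positive
        self.sarc_cls = nn.Linear(hidden, 2)  # sarcastic / not sarcastic

    def forward(self, x):
        # x: (batch, dim) fused representation of serialized image-block
        # features and text, e.g. the pooled output of a single-input-stream
        # transformer encoder (an assumption about the upstream model).
        shared_out = torch.stack([e(x) for e in self.shared], dim=1)  # (B, E, H)
        sent_w = torch.softmax(self.sent_gate(x), dim=-1).unsqueeze(-1)  # (B, E, 1)
        sarc_w = torch.softmax(self.sarc_gate(x), dim=-1).unsqueeze(-1)
        # Each task combines gated shared knowledge with its dedicated expert.
        sent_feat = (sent_w * shared_out).sum(dim=1) + self.sent_expert(x)
        sarc_feat = (sarc_w * shared_out).sum(dim=1) + self.sarc_expert(x)
        return self.sent_cls(sent_feat), self.sarc_cls(sarc_feat)


# Usage: one fused feature vector per example yields logits for both tasks.
head = GatedMultiTaskHead()
sent_logits, sarc_logits = head(torch.randn(8, 768))
```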


Cited By

  • (2025) Supposititious Sarcasm Detection and Sentiment Analysis Coping Hindi Language in Social Networks Harnessing Zipf-Mandelbrot Probabilistic Optimisation and Perplexity Entropy Learning. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3712061. Online publication date: 16-Jan-2025.
  • (2024) Textual Context guided Vision Transformer with Rotated Multi-Head Attention for Sentiment Analysis. Companion Proceedings of the ACM Web Conference 2024, 1823–1830. DOI: 10.1145/3589335.3651968. Online publication date: 13-May-2024.


        Published In

        ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 5
        May 2024, 650 pages
        EISSN: 1551-6865
        DOI: 10.1145/3613634
        Editor: Abdulmotaleb El Saddik

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 11 January 2024
        Online AM: 05 December 2023
        Accepted: 27 November 2023
        Revised: 23 November 2023
        Received: 27 September 2023
        Published in TOMM Volume 20, Issue 5


        Author Tags

        1. Multi-modal sentiment analysis
        2. sarcasm detection
        3. representation learning
        4. multi-task learning

        Qualifiers

        • Research-article

        Funding Sources

        • Researchers Supporting Project
        • King Saud University, Riyadh, Saudi Arabia
        • Foundation of Key Laboratory of Dependable Service Computing in Cyber-Physical-Society (Ministry of Education), Chongqing University
        • National Science Foundation of China
        • Fellowship from the China Postdoctoral Science Foundation

