Abstract
In current multimodal sentiment analysis, multimodal data fusion has gradually shifted from non-modality-dominant to modality-dominant approaches. Effectively fusing relevant information across different modalities remains one of the crucial challenges in multimodal sentiment analysis. Although existing multimodal fusion methods have made some progress, most of them are still limited in handling the complex dynamic relationships between modalities. This study proposes a text-dominant Multistage Modality Fusion (DMMF) mechanism. In the first stage, the text modality is guided by the auxiliary vision and audio modalities to perform auxiliary-modality fusion at the embedding layer. In the second stage, the extracted features are fused further through self-attention and cross-attention mechanisms, which strengthen the interactions between modalities and yield a better fusion representation and fusion effect across the stages. The model is comprehensively evaluated on the CMU-MOSI and CMU-MOSEI datasets, and the experimental results show a significant improvement over the baseline models on most evaluation metrics. The code is available at https://github.com/mmm587/DMMF.
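To make the two-stage idea concrete, the sketch below shows one way such a text-dominant fusion could be wired in PyTorch: text queries attend over the audio and vision sequences (first stage), and a self-attention pass refines the fused representation (second stage). All module names, dimensions, the residual wiring, and the pooling/regression head are illustrative assumptions, not the authors' DMMF implementation; see the official repository for the actual code.

```python
# Minimal sketch of a text-dominant two-stage fusion block (assumptions, not DMMF itself).
import torch
import torch.nn as nn


class TextDominantFusion(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        # Stage 1: text queries attend over the auxiliary audio and vision sequences.
        self.text_audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_vision_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stage 2: self-attention over the fused sequence to refine cross-modal interactions.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, 1)  # regression head for sentiment intensity

    def forward(self, text, audio, vision):
        # text / audio / vision: (batch, seq_len, dim), already projected to a shared dimension.
        t_a, _ = self.text_audio_attn(text, audio, audio)      # text guided by audio
        t_v, _ = self.text_vision_attn(text, vision, vision)   # text guided by vision
        fused = self.norm(text + t_a + t_v)                    # residual fusion around the text stream
        refined, _ = self.self_attn(fused, fused, fused)       # second-stage refinement
        pooled = refined.mean(dim=1)                           # temporal average pooling
        return self.head(pooled)                               # sentiment score


if __name__ == "__main__":
    B, L, D = 2, 20, 128
    model = TextDominantFusion(dim=D)
    score = model(torch.randn(B, L, D), torch.randn(B, L, D), torch.randn(B, L, D))
    print(score.shape)  # torch.Size([2, 1])
```

Keeping text as the query stream in both cross-attention branches is what makes the fusion "text-dominant": the auxiliary modalities only contribute information that is relevant to the text representation.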
Data availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant Nos. 61602161, 61772180), the Hubei Province Science and Technology Support Project (Grant No. 2020BAB012), the Hubei Provincial Science and Technology Program Project (Grant No. 2023BCB041), and the Fundamental Research Funds for the Research Fund of Hubei University of Technology (HBUT: 2021046, 21060, 21066).
Author information
Authors and Affiliations
Contributions
Conceptualization: [Jun Wu]; Methodology: [Jun Wu, Jiangpeng Wang]; Formal analysis and investigation: [Shilong Jing, Tianfeng Zhang]; Writing—original draft preparation: [Jun Wu, Jiangpeng Wang]; Writing—review and editing: [Jun Wu, Jiangpeng Wang]; Funding acquisition: [Jun Wu, Jinyu Liu]; Resources: [Pengfei Zhan, Min Han]; Supervision: [Gan Zuo].
Corresponding author
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest and that this study raises no ethical issues. The official dataset websites are: CMU-MOSI: http://multicomp.cs.cmu.edu/resources/cmu-mosi-dataset/; CMU-MOSEI: http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset/. The code is available at https://github.com/mmm587/DMMF. All authors agree to the submission of this manuscript.
Additional information
Communicated by Haojie Li.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wu, J., Wang, J., Jing, S. et al. Text-dominant strategy for multistage optimized modality fusion in multimodal sentiment analysis. Multimedia Systems 30, 353 (2024). https://doi.org/10.1007/s00530-024-01518-2