
Text-dominant strategy for multistage optimized modality fusion in multimodal sentiment analysis

  • Regular Paper
  • Published in Multimedia Systems

Abstract

In multimodal sentiment analysis, multimodal data fusion has gradually shifted from non-modality-dominant to modality-dominant approaches, yet effectively fusing the relevant information across modalities remains one of the field's crucial challenges. Although existing multimodal fusion methods have made progress, most still show limitations in handling the complex, dynamic relationships between modalities. This study proposes a text-Dominant Multistage Modality Fusion (DMMF) mechanism. In the first stage, the vision and audio modalities act as auxiliary signals that guide the text modality, performing auxiliary modal fusion at the embedding layer. In the second stage, the extracted features are used to strengthen the interactions between modalities through self-attention and cross-attention mechanisms, yielding better fusion representations and fusion performance across the stages. The model is comprehensively evaluated on the CMU-MOSI and CMU-MOSEI datasets, and the experimental results show significant improvements over the baseline models on most evaluation metrics. The code is available at https://github.com/mmm587/DMMF.
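To make the two-stage idea concrete, the following minimal PyTorch sketch shows one way a text-dominant pipeline of this kind could be wired up. It is an illustrative assumption only, not the released DMMF implementation: the module names, dimensions, gating scheme, and pooling are all hypothetical choices made for this sketch.

```python
import torch
import torch.nn as nn

class TextDominantFusionSketch(nn.Module):
    """Hypothetical two-stage, text-dominant fusion sketch (not the official DMMF code)."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        # Stage 1: inject audio/visual cues into the text embeddings via gated shifts.
        self.gate_a = nn.Linear(2 * d_model, 1)
        self.gate_v = nn.Linear(2 * d_model, 1)
        self.shift_a = nn.Linear(d_model, d_model)
        self.shift_v = nn.Linear(d_model, d_model)
        # Stage 2: self-attention on the fused text, plus cross-attention to each auxiliary stream.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(3 * d_model, 1)  # sentiment intensity regression

    def forward(self, t, a, v):
        # t, a, v: (batch, seq_len, d_model) text, audio, visual features (assumed pre-aligned here).
        # Stage 1: embedding-level fusion, with vision/audio guiding the text modality.
        g_a = torch.sigmoid(self.gate_a(torch.cat([t, a], dim=-1)))
        g_v = torch.sigmoid(self.gate_v(torch.cat([t, v], dim=-1)))
        t_fused = t + g_a * self.shift_a(a) + g_v * self.shift_v(v)
        # Stage 2: self-attention refines the fused text; cross-attention lets it query audio/vision.
        h_t, _ = self.self_attn(t_fused, t_fused, t_fused)
        h_a, _ = self.cross_attn_a(h_t, a, a)  # text queries attend to audio
        h_v, _ = self.cross_attn_v(h_t, v, v)  # text queries attend to vision
        pooled = torch.cat([h_t, h_a, h_v], dim=-1).mean(dim=1)
        return self.head(pooled)

# Usage with random features:
# model = TextDominantFusionSketch()
# score = model(torch.randn(8, 20, 128), torch.randn(8, 20, 128), torch.randn(8, 20, 128))
```

In this sketch the gated shift stands in for the embedding-layer auxiliary fusion of the first stage, and the attention blocks stand in for the second-stage interaction; the repository linked above reflects the authors' actual design.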



Data availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61602161, 61772180), the Hubei Province Science and Technology Support Project (Grant No. 2020BAB012), the Hubei Provincial Science and Technology Program Project (Grant No. 2023BCB041), and the Fundamental Research Funds of Hubei University of Technology (HBUT: 2021046, 21060, 21066).

Author information


Contributions

Conceptualization: Jun Wu; Methodology: Jun Wu, Jiangpeng Wang; Formal analysis and investigation: Shilong Jing, Tianfeng Zhang; Writing—original draft preparation: Jun Wu, Jiangpeng Wang; Writing—review and editing: Jun Wu, Jiangpeng Wang; Funding acquisition: Jun Wu, Jinyu Liu; Resources: Pengfei Zhan, Min Han; Supervision: Gan Zuo.

Corresponding author

Correspondence to Gan Zuo.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest and that this study raises no ethical issues. The official dataset websites are: CMU-MOSI: http://multicomp.cs.cmu.edu/resources/cmu-mosi-dataset/; CMU-MOSEI: http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset/. The code is available at https://github.com/mmm587/DMMF. All authors agree to the submission of this manuscript.

Additional information

Communicated by Haojie Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, J., Wang, J., Jing, S. et al. Text-dominant strategy for multistage optimized modality fusion in multimodal sentiment analysis. Multimedia Systems 30, 353 (2024). https://doi.org/10.1007/s00530-024-01518-2

