In current multimodal sentiment analysis, data fusion has gradually shifted from non-modality-dominant to modality-dominant approaches. Effectively fusing relevant information across modalities remains one of the field's crucial challenges. Although existing multimodal fusion methods have made progress, most still struggle with the complex, dynamic relationships between modalities. This study proposes a text-dominant Multistage Modality Fusion (DMMF) mechanism. In the first stage, the auxiliary visual and audio modalities guide the text modality, performing auxiliary modal fusion at the embedding layer. In the second stage, the extracted features pass through self-attention and cross-attention mechanisms that strengthen the interactions between modalities, yielding a better fused representation across the stages. The model is comprehensively evaluated on the CMU-MOSI and CMU-MOSEI datasets. Experimental results show a significant improvement over the baseline models on most evaluation metrics. The code is available at https://github.com/mmm587/DMMF.
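The two-stage idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that): the function names `attention` and `text_dominant_fusion`, the residual additions, the mean pooling, and the absence of learned projection weights are all assumptions made to keep the example short.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention.

    query: (Tq, d); key and value: (Tk, d). Returns an array of shape (Tq, d).
    With query, key and value all equal this is self-attention; otherwise
    it is cross-attention from the query sequence to the key/value sequence.
    """
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value


def text_dominant_fusion(text, audio, vision):
    """Hypothetical two-stage, text-dominant fusion (illustrative only)."""
    # Stage 1 (assumed form): text tokens query the auxiliary modalities,
    # and the retrieved cues are added to the text embeddings.
    guided = text + attention(text, audio, audio) + attention(text, vision, vision)

    # Stage 2 (assumed form): self-attention over the guided text features,
    # followed by cross-attention to the concatenated auxiliary sequence.
    refined = guided + attention(guided, guided, guided)
    auxiliary = np.concatenate([audio, vision], axis=0)
    fused = refined + attention(refined, auxiliary, auxiliary)

    # Mean-pool over text positions to get one utterance-level vector.
    return fused.mean(axis=0)


# Toy inputs: sequence lengths differ per modality, feature size is shared.
rng = np.random.default_rng(0)
text = rng.normal(size=(6, 8))
audio = rng.normal(size=(10, 8))
vision = rng.normal(size=(4, 8))
fused = text_dominant_fusion(text, audio, vision)
```

In this sketch the text sequence always supplies the queries, so the output keeps the text sequence length before pooling; that is one simple way to make text the dominant modality, and the published model may realize it differently.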
Data availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
This work is supported by the National Natural Science Foundation of China (Grant Nos. 61602161, 61772180), the Hubei Province Science and Technology Support Project (Grant No. 2020BAB012), the Hubei Provincial Science and Technology Program Project (Grant No. 2023BCB041), and the Fundamental Research Funds for the Research Fund of Hubei University of Technology (HBUT: 2021046, 21060, 21066).
Conceptualization: [Jun Wu]; Methodology: [Jun Wu, Jiangpeng Wang]; Formal analysis and investigation: [Shilong Jing, Tianfeng Zhang]; Writing—original draft preparation: [Jun Wu, Jiangpeng Wang]; Writing—review and editing: [Jun Wu, Jiangpeng Wang]; Funding acquisition: [Jun Wu, Jinyu Liu]; Resources: [Pengfei Zhan, Min Han]; Supervision: [Gan Zuo].
All authors declare that they have no conflict of interest and that the study raises no ethical issues. The official dataset websites are CMU-MOSI: http://multicomp.cs.cmu.edu/resources/cmu-mosi-dataset/ and CMU-MOSEI: http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset/. The code is available at https://github.com/mmm587/DMMF.
Communicated by Haojie Li.
Wu, J., Wang, J., Jing, S. et al. Text-dominant strategy for multistage optimized modality fusion in multimodal sentiment analysis. Multimedia Systems 30, 353 (2024). https://doi.org/10.1007/s00530-024-01518-2