
Multimodal Boosting: Addressing Noisy Modalities and Identifying Modality Contribution

Published: 01 January 2024

Abstract

In multimodal representation learning, different modalities do not contribute equally. In particular, when learning with noisy modalities that convey non-discriminative information, predictions based on the multimodal representation are often biased and may even ignore the knowledge carried by informative modalities. In this paper, we aim to address the noisy modality problem and balance the contributions of multiple modalities dynamically and in parallel. Specifically, we construct multiple base learners and formulate our framework as a boosting-like algorithm in which different base learners focus on different aspects of multimodal learning. To identify the contributions of individual base learners, we develop a contribution learning network that dynamically determines the contribution and noise level of each base learner. In contrast to the commonly used attention mechanism, we use a transformation of the predictive loss as the supervision signal for training the contribution learning network, which enables more accurate learning of modality importance. We derive the final prediction by combining the base learners' predictions according to their contributions. Notably, unlike late fusion, we devise a multimodal base learner to explore cross-modal interactions. To update the network, we design a 'complementary update mechanism': for each base learner, we assign higher weights to the samples that the other base learners predict incorrectly. In this way, we leverage the available information to predict each sample correctly to the greatest extent possible and encourage different base learners to capture different aspects of the multimodal information. Extensive experiments demonstrate that the proposed method achieves superior performance on multimodal sentiment analysis and emotion recognition.
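
The abstract describes two mechanisms at a high level: fusing base-learner predictions by learned contribution weights, and a complementary update that, for each base learner, up-weights the samples its peers get wrong. The snippet below is a minimal NumPy sketch of those two ideas only, written from the abstract alone; the function names (fuse_predictions, complementary_sample_weights), the specific weighting formula, and the tensor shapes are illustrative assumptions, not the authors' implementation, and the loss-transformation supervision of the contribution network is not reproduced here.

```python
# Hedged sketch (not the authors' code): toy stand-ins for (i) contribution-weighted
# fusion of base-learner predictions and (ii) the complementary update mechanism
# described in the abstract. All names and formulas are illustrative assumptions.
import numpy as np

def fuse_predictions(base_preds, contribution_weights):
    """Combine base-learner outputs by their learned contributions.

    base_preds: (K, N, C) class scores from K base learners for N samples, C classes.
    contribution_weights: (K, N) per-sample contribution of each base learner,
        assumed normalised over K (e.g. the output of a contribution learning network).
    Returns the fused (N, C) prediction.
    """
    return np.einsum('kn,knc->nc', contribution_weights, base_preds)

def complementary_sample_weights(base_preds, labels):
    """Complementary update idea: for base learner k, up-weight the samples that
    the *other* base learners predict incorrectly, so each learner focuses on
    what its peers miss."""
    K, N, _ = base_preds.shape
    correct = base_preds.argmax(axis=-1) == labels          # (K, N) correctness flags
    weights = np.empty((K, N))
    for k in range(K):
        peers = np.delete(correct, k, axis=0)               # drop learner k's own flags
        peer_errors = (~peers).sum(axis=0)                   # 0 .. K-1 errors per sample
        weights[k] = 1.0 + peer_errors                       # larger where peers fail
        weights[k] /= weights[k].sum()                        # normalise per learner
    return weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, N, C = 3, 8, 2                                        # 3 base learners, 8 samples, 2 classes
    preds = rng.random((K, N, C))
    labels = rng.integers(0, C, size=N)
    contrib = rng.random((K, N))
    contrib /= contrib.sum(axis=0)                            # normalise over learners
    print(fuse_predictions(preds, contrib).shape)             # -> (8, 2)
    print(complementary_sample_weights(preds, labels).shape)  # -> (3, 8)
```

Treating the peer-error count as the sample weight is just one plausible reading of "assign higher weights to samples incorrectly predicted by other base learners"; the paper's exact weighting scheme may differ.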

Information

Published In
IEEE Transactions on Multimedia, Volume 26, 2024, 11427 pages

Publisher
IEEE Press

Qualifiers
• Research-article