DOI: 10.1145/3577190.3614151
Research article · Open access

Multimodal Fusion Interactions: A Study of Human and Automatic Quantification

Published: 09 October 2023
Abstract

    In order to perform multimodal fusion of heterogeneous signals, we need to understand their interactions: how each modality individually provides information useful for a task and how this information changes in the presence of other modalities. In this paper, we perform a comparative study of how humans annotate two categorizations of multimodal interactions: (1) partial labels, where different annotators annotate the label given the first, second, and both modalities, and (2) counterfactual labels, where the same annotator annotates the label given the first modality and is then asked to explicitly reason about how their answer changes when given the second. We further propose an alternative taxonomy based on (3) information decomposition, where annotators annotate the degrees of redundancy: the extent to which modalities individually and together give the same predictions, uniqueness: the extent to which one modality enables a prediction that the other does not, and synergy: the extent to which both modalities enable one to make a prediction that one would not otherwise make using individual modalities. Through experiments and annotations, we highlight several opportunities and limitations of each approach and propose a method to automatically convert annotations of partial and counterfactual labels to information decomposition, yielding an accurate and efficient method for quantifying multimodal interactions.
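
    To make the information-decomposition taxonomy concrete, the sketch below shows one way annotations could be tallied into coarse interaction scores. It is an illustrative heuristic under assumed inputs, not the automatic conversion method proposed in the paper: it assumes each example carries a hypothetical triple (y1, y2, y12) of labels annotated from the first modality alone, the second alone, and both together, and maps their agreement pattern onto redundancy, uniqueness, and synergy counts.

    from collections import Counter

    def interaction_profile(annotations):
        """Tally coarse redundancy / uniqueness / synergy indicators.

        `annotations` is an iterable of (y1, y2, y12) triples, where y1 and y2
        are the labels an annotator assigns from each modality alone and y12 is
        the label assigned from both modalities together (assumed format, for
        illustration only).
        """
        counts = Counter(redundancy=0, uniqueness_modality1=0,
                         uniqueness_modality2=0, synergy=0)
        for y1, y2, y12 in annotations:
            m1, m2 = (y1 == y12), (y2 == y12)
            if m1 and m2:
                counts["redundancy"] += 1            # both modalities alone already give the final label
            elif m1:
                counts["uniqueness_modality1"] += 1  # only modality 1 suffices
            elif m2:
                counts["uniqueness_modality2"] += 1  # only modality 2 suffices
            else:
                counts["synergy"] += 1               # the prediction emerges only from the combination
        total = sum(counts.values()) or 1
        return {k: v / total for k, v in counts.items()}

    # Example: a sarcasm-style case where each modality alone reads "positive"
    # but the combination flips the label, which the tally counts as synergy.
    demo = [("pos", "pos", "pos"),
            ("pos", "neg", "neg"),
            ("pos", "pos", "sarcastic")]
    print(interaction_profile(demo))

    On sarcasm- or humor-style examples, both unimodal labels can agree with each other yet disagree with the multimodal label; the tally above would count such cases toward synergy rather than redundancy.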


    Published In

    ICMI '23: Proceedings of the 25th International Conference on Multimodal Interaction
    October 2023, 858 pages
    ISBN: 9798400700552
    DOI: 10.1145/3577190
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. Affective computing
    2. Multimodal fusion
    3. Multimodal interactions

