M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis

Published: 01 January 2022

Abstract

Multimodal sentiment analysis aims to recognize people's attitudes from multiple communication channels such as verbal content (i.e., text), voice, and facial expressions. It has become a vibrant and important research topic in natural language processing. Much research focuses on modeling the complex intra- and inter-modal interactions between different communication channels. However, current high-performing multimodal models are typically deep-learning-based and operate as black boxes: it is not clear how they utilize multimodal information for sentiment predictions. Despite recent advances in techniques for enhancing the explainability of machine learning models, these techniques often target unimodal scenarios (e.g., images, sentences), and little research has been done on explaining multimodal models. In this paper, we present an interactive visual analytics system, M2Lens, to visualize and explain multimodal models for sentiment analysis. M2Lens provides explanations of intra- and inter-modal interactions at the global, subset, and local levels. Specifically, it summarizes the influence of three typical interaction types (i.e., dominance, complement, and conflict) on the model predictions. Moreover, M2Lens identifies frequent and influential multimodal features and supports multi-faceted exploration of model behaviors across the language, acoustic, and visual modalities. Through two case studies and expert interviews, we demonstrate that M2Lens can help users gain deep insights into multimodal models for sentiment analysis.
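
To make the three interaction types concrete, the sketch below shows one way such labels could be derived from per-modality attribution scores (e.g., SHAP-style values for the language, acoustic, and visual channels). This is a minimal illustration under assumptions: the function, the 0.7 dominance threshold, and the labeling rules are hypothetical and do not reproduce M2Lens's actual criteria.

```python
# Illustrative sketch only (not the paper's implementation): labeling an
# instance as dominance, complement, or conflict from hypothetical signed
# per-modality attribution scores toward the predicted sentiment.
from typing import Dict


def classify_interaction(attributions: Dict[str, float],
                         dominance_ratio: float = 0.7) -> str:
    """attributions: e.g. {"language": 0.8, "acoustic": 0.1, "visual": -0.2}."""
    total = sum(abs(v) for v in attributions.values())
    if total == 0:
        return "none"

    # Which directions (positive/negative sentiment) do the modalities push toward?
    signs = {v > 0 for v in attributions.values() if v != 0}
    top_modality, top_value = max(attributions.items(), key=lambda kv: abs(kv[1]))

    # Conflict: modalities push the prediction in opposite directions.
    if len(signs) > 1:
        return f"conflict (strongest: {top_modality})"
    # Dominance: one modality accounts for most of the total influence.
    if abs(top_value) / total >= dominance_ratio:
        return f"dominance ({top_modality})"
    # Complement: several modalities contribute in the same direction.
    return "complement"


if __name__ == "__main__":
    print(classify_interaction({"language": 0.8, "acoustic": 0.1, "visual": 0.05}))   # dominance (language)
    print(classify_interaction({"language": 0.4, "acoustic": 0.35, "visual": 0.3}))   # complement
    print(classify_interaction({"language": 0.5, "acoustic": -0.4, "visual": 0.1}))   # conflict
```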

Published In

IEEE Transactions on Visualization and Computer Graphics, Volume 28, Issue 1 (January 2022), 1190 pages

Publisher

IEEE Educational Activities Department, United States
