DOI: 10.1145/3615522.3615547
Research article
Open access

Visual Analysis of Scene-Graph-Based Visual Question Answering

Published: 20 October 2023

Abstract

Scene-graph-based Visual Question Answering (VQA) has emerged as a burgeoning field in Deep Learning research, with a growing demand for robust and interpretable VQA systems. In this paper, we present a novel visual analysis approach that addresses two critical objectives in VQA: identifying and correcting prediction issues, and providing insight into model decision-making by visualizing internal information. Our approach builds on the GraphVQA framework, which uses graph neural networks to process scene graphs representing images and which was trained on the widely used GQA dataset. Our analysis tool is aimed at users familiar with the basics of graph-based VQA. By leveraging query-based scene analysis and visualization of crucial internal states, we can detect and pinpoint the reasons for inaccurate predictions, facilitating model refinement and dataset curation. Identifying expressive internal states is itself a challenge. Through rigorous computer-based evaluations and the presentation of a use case, we demonstrate the effectiveness of our analysis tool and model state visualization.
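
To make the pipeline described above concrete for readers new to graph-based VQA, here is a minimal, illustrative sketch. It is not the paper's code: all names, the toy numbers, and the scalar "states" are invented for exposition (real models use learned vector states and nonlinear updates). It shows a GQA-style scene graph (objects with attributes, connected by directed labeled relations) and one toy language-conditioned message-passing step of the kind GraphVQA-style graph neural networks perform over such graphs.

```python
# Illustrative sketch only (not the paper's code): a GQA-style scene graph
# and one toy, question-conditioned message-passing step. All names and
# numbers are made up for exposition.
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str                                             # e.g., "dog"
    attributes: list[str] = field(default_factory=list)   # e.g., ["brown"]
    state: float = 0.0   # toy 1-D hidden state; real models use vectors


@dataclass
class SceneGraph:
    objects: dict[int, SceneObject]
    # directed, labeled edges: (subject_id, relation, object_id)
    relations: list[tuple[int, str, int]]


def message_passing_step(graph: SceneGraph,
                         relation_weights: dict[str, float]) -> None:
    """One toy round of message passing: each object's state is updated
    from its neighbors, weighted by how relevant each relation is to the
    question (in GraphVQA, this conditioning comes from the encoded
    question; here the weights are given by hand)."""
    messages = {i: 0.0 for i in graph.objects}
    for src, rel, dst in graph.relations:
        messages[dst] += relation_weights.get(rel, 0.0) * graph.objects[src].state
    for i, obj in graph.objects.items():
        obj.state += messages[i]  # real models apply a learned, nonlinear update


# A two-object scene: "a brown dog to the left of a red ball"
g = SceneGraph(
    objects={0: SceneObject("dog", ["brown"], state=1.0),
             1: SceneObject("ball", ["red"], state=0.5)},
    relations=[(0, "left of", 1), (1, "right of", 0)],
)
# Suppose the question makes "left of" highly relevant:
message_passing_step(g, {"left of": 0.9, "right of": 0.1})
print({i: round(o.state, 2) for i, o in g.objects.items()})  # {0: 1.05, 1: 1.4}
```

The per-step states and relation weights in this sketch are exactly the kind of internal information the paper's analysis tool visualizes to explain a prediction.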

Supplementary Material

MP4 File (VQA_Paper_Vinci.mp4)
Presentation video

References

[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2425–2433.
[2] X. Chang, P. Ren, P. Xu, Z. Li, X. Chen, and A. Hauptmann. 2023. A comprehensive survey of scene graphs: Generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2023), 1–26.
[3] J. Choo and S. Liu. 2018. Visual analytics for explainable deep learning. IEEE Computer Graphics and Applications 38, 4 (2018), 84–92.
[4] K. A. Cook and J. J. Thomas. 2005. Illuminating the path: The research and development agenda for visual analytics. Technical Report. Pacific Northwest National Laboratory.
[5] V. Damodaran, S. Chakravarthy, A. Kumar, A. Umapathy, T. Mitamura, Y. Nakashima, N. Garcia, and C. Chu. 2021. Understanding the role of scene graphs in visual question answering. arXiv:2101.05479 (2021).
[6] M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, and P. Sen. 2020. A survey of the state of explainable AI for natural language processing. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 447–459.
[7] F. K. Došilović, M. Brčić, and N. Hlupić. 2018. Explainable artificial intelligence: A survey. In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics. IEEE, 0210–0215.
[8] R. Garcia, T. Munz, and D. Weiskopf. 2021. Visual analytics tool for the interpretation of hidden states in recurrent neural networks. Visual Computing for Industry, Biomedicine, and Art 4, 1 (2021), 24.
[9] R. Garcia, A. C. Telea, B. C. da Silva, J. Tørresen, and J. L. D. Comba. 2018. A task-and-technique centered survey on visual analytics for deep learning model engineering. Computers & Graphics 77 (2018), 30–49.
[10] M. Gervautz and W. Purgathofer. 1988. A simple method for color quantization: Octree quantization. In New Trends in Computer Graphics: Proceedings of CG International ’88. Springer, 219–231.
[11] S. Ghosh, G. Burachas, A. Ray, and A. Ziskind. 2019. Generating natural language explanations for visual question answering using scene graphs and visual attention. arXiv:1902.05715 (2019).
[12] Y. Goyal, A. Mohapatra, D. Parikh, and D. Batra. 2016. Towards transparent AI systems: Interpreting visual question answering models. arXiv:1608.08974 (2016).
[13] F. Hohman, M. Kahng, R. Pienta, and D. H. Chau. 2018. Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics 25, 8 (2018), 2674–2693.
[14] Z. Huang, Z. Zeng, B. Liu, D. Fu, and J. Fu. 2020. Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv:2004.00849 (2020).
[15] D. A. Hudson and C. D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6693–6702.
[16] J. Johnson, R. Krishna, M. Stark, L. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei. 2015. Image retrieval using scene graphs. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3668–3678.
[17] M. Kahng, P. Y. Andrews, A. Kalro, and D. H. Chau. 2018. ActiVis: Visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2018), 88–97.
[18] D. Keim, F. Mansmann, J. Schneidewind, J. Thomas, and H. Ziegler. 2008. Visual analytics: Scope and challenges. In Visual Data Mining: Theory, Techniques and Tools for Visual Analytics, S. J. Simoff, M. H. Böhlen, and A. Mazeika (Eds.). Springer, Berlin, Heidelberg, 76–90.
[19] Q. Li, X. Tang, and Y. Jian. 2021. Adversarial learning with bidirectional attention for visual question answering. Sensors 21, 21 (2021), 7164.
[20] W. Liang, Y. Jiang, and Z. Liu. 2021. GraphVQA: Language-guided graph neural networks for graph-based visual question answering. In Proceedings of the Third Workshop on Multimodal Artificial Intelligence. Association for Computational Linguistics, 79–86.
[21] S. Liu, X. Wang, M. Liu, and J. Zhu. 2017. Towards better analysis of machine learning models: A visual analytics perspective. Visual Informatics 1, 1 (2017), 48–56.
[22] Y. Ming, S. Cao, R. Zhang, Z. Li, Y. Chen, Y. Song, and H. Qu. 2017. Understanding hidden memories of recurrent neural networks. In Proceedings of the 2017 IEEE Conference on Visual Analytics Science and Technology. 13–24.
[23] T. Munz, D. Väth, P. Kuznecov, N. T. Vu, and D. Weiskopf. 2022. Visualization-based improvement of neural machine translation. Computers & Graphics 103 (2022), 45–60.
[24] W. Norcliffe-Brown, S. Vafeias, and S. Parisot. 2018. Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc.
[25] J. Pennington, R. Socher, and C. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1532–1543.
[26] N. F. Rajani and R. J. Mooney. 2017. Ensembling visual explanations for VQA. In Proceedings of the NIPS 2017 Workshop on Visually-Grounded Interaction and Language.
[27] A. Ray, M. Cogswell, X. Lin, K. Alipour, A. Divakaran, Y. Yao, and G. Burachas. 2021. Generating and evaluating explanations of attended and error-inducing input regions for VQA models. Applied AI Letters 2, 4 (2021), e51.
[28] N. Schäfer, P. Tilli, T. Munz-Körner, S. Künzel, S. Vidyapu, N. T. Vu, and D. Weiskopf. 2023. Visual analysis system for scene-graph-based visual question answering. https://doi.org/10.18419/darus-3589
[29] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush. 2018. LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2018), 667–676.
[30] M. H. Vu, T. Löfstedt, T. Nyholm, and R. Sznitman. 2020. A question-centric model for visual question answering in medical imaging. IEEE Transactions on Medical Imaging 39, 9 (2020), 2856–2868.
[31] F. Xu, H. Uszkoreit, Y. Du, W. Fan, D. Zhao, and J. Zhu. 2019. Explainable AI: A brief survey on history, research areas, approaches and challenges. In Proceedings of Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019. Springer, 563–574.
[32] J. Yuan, C. Chen, W. Yang, M. Liu, J. Xia, and S. Liu. 2020. A survey of visual analytics techniques for machine learning. Computational Visual Media 7 (2020), 3–36.
[33] R. Yusuf, J. Owusu, H. Wang, K. Qin, Z. Lawal, and Y. Dong. 2022. VQA and visual reasoning: An overview of recent datasets, methods and challenges. arXiv:2212.13296 (2022).

Published In

VINCI '23: Proceedings of the 16th International Symposium on Visual Information Communication and Interaction
September 2023, 308 pages
ISBN: 9798400707513
DOI: 10.1145/3615522
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. VQA
  2. explainable AI
  3. scene graphs
  4. visual analytics

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)

Conference

VINCI 2023

Acceptance Rates

Overall Acceptance Rate 71 of 193 submissions, 37%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 333
  • Downloads (last 12 months): 333
  • Downloads (last 6 weeks): 24

Reflects downloads up to 12 Sep 2024
