3D-Aware Visual Question Answering about Parts, Poses and Occlusions

Wang, Xingrui; Ma, Wufei; Li, Zhuowan; Kortylewski, Adam; Yuille, Alan

arXiv:2310.17914 (cs)

[Submitted on 27 Oct 2023]

Abstract: Despite rapid progress in visual question answering (VQA), existing datasets and models mainly focus on testing reasoning in 2D. However, it is important that VQA models also understand the 3D structure of visual scenes, for example to support tasks like navigation or manipulation. This includes an understanding of the 3D poses of objects, their parts, and occlusions. In this work, we introduce the task of 3D-aware VQA, which focuses on challenging questions that require compositional reasoning over the 3D structure of visual scenes. We address 3D-aware VQA from both the dataset and the model perspective. First, we introduce Super-CLEVR-3D, a compositional reasoning dataset that contains questions about object parts, their 3D poses, and occlusions. Second, we propose PO3D-VQA, a 3D-aware VQA model that marries two powerful ideas: probabilistic neural symbolic program execution for reasoning and deep neural networks with 3D generative representations of objects for robust visual recognition. Our experimental results show that PO3D-VQA significantly outperforms existing methods, but we still observe a significant performance gap compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an important open research area.

Comments:	Accepted by NeurIPS 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2310.17914 [cs.CV]
	(or arXiv:2310.17914v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2310.17914

Submission history

From: Xingrui Wang
[v1] Fri, 27 Oct 2023 06:15:30 UTC (34,045 KB)
