Abstract
Visual relationship modeling plays an indispensable role in visual question answering (VQA). To answer complex reasoning questions about relationships between visual objects, a VQA model must fully understand the visual scene and, in particular, the positional relationships among objects in the image. However, most reasoning models used in current VQA tasks rely on simple attention mechanisms to model these relationships and overlook the rich visual object features available during learning. This work proposes a visual object Relationship Reasoning and Adaptive Fusion (RRAF) model to address these shortcomings. RRAF jointly models the position, appearance, and semantic features of visual objects and uses an adaptive fusion mechanism to achieve fine-grained multimodal reasoning and fusion. Specifically, we design an image encoder that models and learns the relationship between the position and appearance features of visual objects. In the co-attention module, we employ semantic information from the question to attend to critical visual objects. Finally, an adaptive fusion mechanism reweights and fuses the features of the different modalities to predict the answer. Experimental results show that RRAF outperforms current state-of-the-art methods on the VQA 2.0 and GQA datasets, achieving overall accuracies of 71.33% and 57.83%, respectively, with particularly strong gains on object-counting questions. Extensive ablation experiments further confirm the effectiveness of each component of the RRAF model. Code is available at https://github.com/shenxiang-vqa/RRAF.
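To make the adaptive fusion step concrete, the sketch below shows one plausible form of such a gated, per-dimension reweighting of pooled visual and question features in PyTorch. It is a minimal illustration only: the module name, the 512-d feature size, and the 3129-way answer classifier are assumptions for the sketch, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of gated adaptive fusion; names and sizes are illustrative
# assumptions, not the released RRAF code.
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """Fuses visual and question vectors with a learned per-dimension gate."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # The gate is computed from both modalities jointly.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per feature dimension, how much weight the
        # visual modality receives; the question gets the complement.
        g = torch.sigmoid(self.gate(torch.cat([v, q], dim=-1)))
        return g * v + (1.0 - g) * q


# Usage: fuse pooled features before answer prediction.
fusion = AdaptiveFusion(dim=512)
v = torch.randn(8, 512)              # pooled visual features (batch of 8)
q = torch.randn(8, 512)              # pooled question features
joint = fusion(v, q)                 # (8, 512) fused representation
logits = nn.Linear(512, 3129)(joint) # 3129 = common VQA 2.0 answer vocabulary size
```

A gate of this kind lets the model lean on the question representation for linguistically driven questions and on the visual representation for perception-heavy ones such as counting, which is one way the reweighting described in the abstract could behave.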





Acknowledgements
This research was funded in part by the National Natural Science Foundation of China (Grant No. 52331012) and the Natural Science Foundation of Shanghai (Grant No. 21ZR1426500). Additional support was provided by the Hunan Provincial Natural Science Foundation (Grant No. 2022JJ50245) and the Scientific Research Fund of the Hunan Provincial Education Department (Grant No. 22B0753). This work was also supported by the Shanghai Maritime University Top Innovative Talent Training Program for Graduate Students in 2022 (Grant No. 2022YBR014).
Author information
Contributions
Methodology, material preparation, data collection, and analysis were performed by Xiang Shen and Zihan Guo. Xiang Shen wrote the first draft of the manuscript; Liang Zong and Zihan Guo commented on previous versions of the manuscript. Dezhi Han performed the supervision, reviewing, and editing. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors have no competing interests that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Shen, X., Han, D., Zong, L. et al. Relational reasoning and adaptive fusion for visual question answering. Appl Intell 54, 5062–5080 (2024). https://doi.org/10.1007/s10489-024-05437-7