Abstract
Visual question answering (VQA) has received immense attention from two major research communities: computer vision and natural language processing. It is now widely regarded as an AI-complete task that can serve as an alternative to the Visual Turing Test. In its most common form, VQA is a challenging multimodal task in which a computer must produce a correct answer to a natural language question asked about an input image. The task has attracted many deep learning researchers following their remarkable achievements in text, speech and vision technologies. This review extensively and critically examines the current state of VQA research in terms of step-by-step solution methodologies, datasets and evaluation metrics. Finally, the paper discusses future research directions for each of these aspects of VQA separately.
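The task definition above maps onto the joint-embedding pipeline that most surveyed methods build on: encode the image and the question separately, fuse the two vectors, and classify over a fixed answer vocabulary. The following is a minimal illustrative sketch, not any specific method from the literature; the random projections, dimensions, and toy answer set are assumptions standing in for a trained CNN image encoder, an RNN question encoder, Hadamard-product fusion, and a learned classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

ANSWERS = ["yes", "no", "red", "two"]  # toy answer vocabulary (assumed)
D = 8                                   # joint embedding size (assumed)

# Fixed random projections stand in for trained encoders/classifier.
W_img = rng.standard_normal((D, 16))              # "CNN" image encoder
W_txt = rng.standard_normal((D, 16))              # "RNN" question encoder
W_cls = rng.standard_normal((len(ANSWERS), D))    # answer classifier

def encode_image(features: np.ndarray) -> np.ndarray:
    # Project pre-extracted image features into the joint space.
    return np.tanh(W_img @ features)

def encode_question(tokens: list) -> np.ndarray:
    # Crude deterministic bag-of-words embedding in place of
    # learned word vectors.
    v = np.zeros(16)
    for t in tokens:
        v[sum(ord(c) for c in t) % 16] += 1.0
    return np.tanh(W_txt @ v)

def answer(features: np.ndarray, question: str) -> str:
    # Element-wise (Hadamard) product fusion, then argmax over answers.
    fused = encode_image(features) * encode_question(question.split())
    logits = W_cls @ fused
    return ANSWERS[int(np.argmax(logits))]

image_features = rng.standard_normal(16)  # stand-in for CNN output
print(answer(image_features, "what color is the ball"))
```

Real systems replace each random projection with a trained component (e.g. a ResNet image encoder, an LSTM question encoder, and bilinear or attention-based fusion), but the overall encode-fuse-classify structure is the same.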
Manmadhan, S., Kovoor, B.C. Visual question answering: a state-of-the-art review. Artif Intell Rev 53, 5705–5745 (2020). https://doi.org/10.1007/s10462-020-09832-7