SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering
Abstract
1. Introduction
- To the best of our knowledge, this is the first attempt to apply a scene graph as an image representation in TextVQA. We introduce a novel scene graph generation framework for the TextVQA environment.
- We propose scene graph-based semantic relation-aware attention and integrate it with positional relation-aware attention to establish complete interactions between each question word and the visual features; a minimal sketch of the relation-biased attention is given below.
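To make the relation-aware attention concrete, the snippet below is a minimal PyTorch-style sketch of scaled dot-product attention augmented with an additive, relation-dependent bias, which is the general mechanism behind both the SRA and PRA modules; the function name, tensor shapes, and the way the bias is parameterized are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def relation_aware_attention(q, k, v, rel_bias, rel_mask):
    """Scaled dot-product attention with an additive relation bias.

    q, k, v:   (n, d) query/key/value features
    rel_bias:  (num_relations,) learned scalar bias per relation type
    rel_mask:  (num_relations, n, n) binary tensor; rel_mask[r, i, j] = 1
               if relation r holds between item i and item j
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5         # (n, n) attention logits
    bias = (rel_bias.view(-1, 1, 1) * rel_mask).sum(0)  # (n, n) additive relation bias
    attn = F.softmax(scores + bias, dim=-1)              # bias raises/lowers related pairs
    return attn @ v
```

In the full model, each SRA/PRA head is restricted to a subset of semantic or positional relation types, so different heads can specialize in different relations.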
2. Related Work
2.1. TextVQA
2.2. Conventional VQA
2.3. Scene Graph in Visual Language Tasks
3. SceneGATE—Input Representations
3.1. Input Representations
3.2. Scene Graph Construction
3.3. Scene Graph Embedding
3.3.1. Pre-Trained Word Embedding
3.3.2. GCN-Based Embedding
4. SceneGATE—Co-Attention Networks
4.1. Self-Attention Module
4.2. Guided Attention Module
4.3. Semantic Relation-Aware Attention
4.4. Positional Relation-Aware Attention
5. Experiments
5.1. Datasets
5.2. Implementation Details
5.3. Baseline Models
6. Results
6.1. Performance Comparison
6.2. Ablation Studies
6.3. Quality Analysis
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
VQA | Visual Question Answering
OCR | Optical Character Recognition
Faster-RCNN | Faster Region-Based Convolutional Neural Network
SG | Scene Graph
GCN | Graph Convolutional Networks
NER | Named Entity Recognition
SA | Self-Attention
GA | Guided Attention
SRA | Semantic Relation-Aware
PRA | Positional Relation-Aware
MMTE | Multi-Modal Transformer Encoder
SOTA | State-of-the-Art
VCR | Visual Commonsense Reasoning
UMD | University of Maryland
ReLU | Rectified Linear Unit
Appendix A. Training Hyperparameters
Name | Value | Name | Value |
---|---|---|---|
max text token length | 20 | max obj num | 100 |
max ocr num | 50 | max SG obj num | 36 |
max SG ocr num | 100 | batch size | 8 |
learning rate | 0.0001 | # epoch | 100 |
max gradient norm | 0.25 | optimizer | Adam |
lr decay @ step | 14,000, 19,000 | lr decay rate | 0.1 |
warmup factor | 0.2 | warmup iterations | 1000 |
# workers | 0 | distance threshold | 0.5 |
SRA Attention context | 3 | PRA Attention context | 3 |
seed | 0 | obj dropout rate | 0.1 |
ocr dropout rate | 0.1 | hidden size | 768 |
# positional relations | 12 | # semantic relations | 12 |
textual query size | 768 | ocr feature size | 3002 |
obj feature size | 2048 | sg feature size | 300 |
# decoding_steps | 12 | text encoder lr scale | 0.1 |
# text encoder layers | 3 | # PRA layers | 2 |
# SRA layers | 2 | # MMTE layers | 2 |
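As a concrete illustration of how the optimization-related entries in Table A1 fit together, the following is a hypothetical PyTorch sketch that wires the learning rate, warmup, step decay, and gradient clipping into an Adam optimizer and scheduler; the exact warmup formula and all names are our own assumptions, not the authors' released code.

```python
import torch

# Optimization hyperparameters taken from Appendix A.
LR = 1e-4
MAX_GRAD_NORM = 0.25
DECAY_STEPS = [14_000, 19_000]
DECAY_RATE = 0.1
WARMUP_FACTOR = 0.2
WARMUP_ITERS = 1_000

model = torch.nn.Linear(768, 768)  # stand-in for the SceneGATE network
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

def lr_lambda(step: int) -> float:
    """Linear warmup from WARMUP_FACTOR to 1, then decay by DECAY_RATE at each listed step."""
    if step < WARMUP_ITERS:
        alpha = step / WARMUP_ITERS
        return WARMUP_FACTOR * (1 - alpha) + alpha
    return DECAY_RATE ** sum(step >= s for s in DECAY_STEPS)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, gradients are clipped before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
#   optimizer.step(); scheduler.step()
```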
Appendix B. Additional Qualitative Examples
Appendix C. Incorrect Classification Analysis
References
- Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; Batra, D.; Parikh, D.; Rohrbach, M. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8317–8326.
- Hu, R.; Singh, A.; Darrell, T.; Rohrbach, M. Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9992–10002.
- Zhu, Q.; Gao, C.; Wang, P.; Wu, Q. Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3608–3615.
- Gao, D.; Li, K.; Wang, R.; Shan, S.; Chen, X. Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12746–12756.
- Kant, Y.; Batra, D.; Anderson, P.; Schwing, A.; Parikh, D.; Lu, J.; Agrawal, H. Spatially aware multimodal transformers for TextVQA. In Proceedings of Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX; Springer: Berlin/Heidelberg, Germany, 2020; pp. 715–732.
- Han, W.; Huang, H.; Han, T. Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 3118–3131.
- Gao, C.; Zhu, Q.; Wang, P.; Li, H.; Liu, Y.; Hengel, A.v.d.; Wu, Q. Structured Multimodal Attentions for TextVQA. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 9603–9614.
- Johnson, J.; Gupta, A.; Fei-Fei, L. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1219–1228.
- Lu, X.; Fan, Z.; Wang, Y.; Oh, J.; Rosé, C.P. Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 2631–2639.
- Zeng, G.; Zhang, Y.; Zhou, Y.; Yang, X. Beyond OCR + VQA: Involving OCR into the Flow for Robust and Accurate TextVQA. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 376–385.
- Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual Question Answering. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2425–2433.
- Shrestha, R.; Kafle, K.; Kanan, C. Answer them all! Toward universal visual question answering models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10472–10481.
- Ben-Younes, H.; Cadene, R.; Thome, N.; Cord, M. Block: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8102–8109.
- Cadene, R.; Ben-Younes, H.; Cord, M.; Thome, N. Murel: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1989–1998.
- Urooj, A.; Mazaheri, A.; Da Vitoria Lobo, N.; Shah, M. MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020; pp. 4648–4660.
- Han, Y.; Guo, Y.; Yin, J.; Liu, M.; Hu, Y.; Nie, L. Focal and Composed Vision-Semantic Modeling for Visual Question Answering. In Proceedings of the 29th ACM International Conference on Multimedia (MM ’21), Virtual Event, China, 20–24 October 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 4528–4536.
- Hudson, D.A.; Manning, C.D. Compositional Attention Networks for Machine Reasoning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086.
- Yu, Z.; Yu, J.; Cui, Y.; Tao, D.; Tian, Q. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6281–6290.
- Gao, P.; You, H.; Zhang, Z.; Wang, X.; Li, H. Multi-modality latent interaction network for visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5825–5835.
- Nguyen, D.K.; Okatani, T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6087–6096.
- Rahman, T.; Chou, S.H.; Sigal, L.; Carenini, G. An Improved Attention for Visual Question Answering. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 1653–1662.
- Yang, X.; Tang, K.; Zhang, H.; Cai, J. Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10685–10694.
- Gu, J.; Joty, S.; Cai, J.; Zhao, H.; Yang, X.; Wang, G. Unpaired image captioning via scene graph alignments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10323–10332.
- Han, C.; Long, S.; Luo, S.; Wang, K.; Poon, J. VICTR: Visual Information Captured Text Representation for Text-to-Vision Multimodal Tasks. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 3107–3117.
- Wang, S.; Wang, R.; Yao, Z.; Shan, S.; Chen, X. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1508–1517.
- Luo, S.; Han, S.C.; Sun, K.; Poon, J. REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering. In Proceedings of the International Conference on Neural Information Processing, Bangkok, Thailand, 18–22 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 520–532.
- Hudson, D.; Manning, C.D. Learning by abstraction: The neural state machine. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 5903–5916.
- Haurilet, M.; Roitberg, A.; Stiefelhagen, R. It’s Not About the Journey; It’s About the Destination: Following Soft Paths Under Question-Guidance for Visual Reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1930–1939.
- Nuthalapati, S.V.; Chandradevan, R.; Giunchiglia, E.; Li, B.; Kayser, M.; Lukasiewicz, T.; Yang, C. Lightweight Visual Question Answering Using Scene Graphs. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual Event, 1–5 November 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 3353–3357.
- Wang, Z.; You, H.; Li, L.H.; Zareian, A.; Park, S.; Liang, Y.; Chang, K.W.; Chang, S.F. SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning. Proc. AAAI Conf. Artif. Intell. 2022, 36, 5914–5922.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73.
- Almazán, J.; Gordo, A.; Fornés, A.; Valveny, E. Word Spotting and Recognition with Embedded Attributes. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2552–2566.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
- Krasin, I.; Duerig, T.; Alldrin, N.; Ferrari, V.; Abu-El-Haija, S.; Kuznetsova, A.; Rom, H.; Uijlings, J.; Popov, S.; Veit, A.; et al. OpenImages: A Public Dataset for Large-Scale Multi-Label and Multi-Class Image Classification. 2017, 2, 18. Available online: https://github.com/openimages (accessed on 30 June 2023).
- Biten, A.F.; Tito, R.; Mafla, A.; Gomez, L.; Rusinol, M.; Valveny, E.; Jawahar, C.; Karatzas, D. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4291–4301.
- Veit, A.; Matera, T.; Neumann, L.; Matas, J.; Belongie, S. COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images. arXiv 2016, arXiv:1601.07140.
- Gurari, D.; Li, Q.; Stangl, A.J.; Guo, A.; Lin, C.; Grauman, K.; Luo, J.; Bigham, J.P. VizWiz Grand Challenge: Answering Visual Questions from Blind People. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3608–3617.
- Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; Bigorda, L.G.i.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazàn, J.A.; de las Heras, L.P. ICDAR 2013 Robust Reading Competition. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition (ICDAR ’13), Washington, DC, USA, 25–28 August 2013; pp. 1484–1493.
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 Competition on Robust Reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 1156–1160.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Mishra, A.; Alahari, K.; Jawahar, C. Image Retrieval Using Textual Cues. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3040–3047.
- Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6325–6334.
- Johnson, J.; Hariharan, B.; van der Maaten, L.; Fei-Fei, L.; Zitnick, C.L.; Girshick, R. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1988–1997.
- Hudson, D.A.; Manning, C.D. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6693–6702.
Model | Question Feature | Image Feature | Main Model | Dataset |
---|---|---|---|---|
LoRRA [1] | GloVe LSTM | (Object) Faster R-CNN, (OCR) FastText | Self-Attention, OCR Copy Module | TextVQA, VQA 2.0, VizWiz |
MM-GNN [4] | LSTM | (Object) Faster R-CNN, (OCR) FastText/Sigmoid/Cosine | Graph Neural Network, Graph Attention, OCR Copy Module | TextVQA, ST-VQA |
M4C [2] | BERT | (Object) Faster R-CNN, (OCR) Faster R-CNN + FastText + PHOC | Multi-Modal Transformer, Dynamic Pointer Network | TextVQA, ST-VQA, OCR-VQA |
LaAP-Net [6] | BERT | (Object) Faster R-CNN, (OCR) Faster R-CNN + FastText + PHOC | Multi-Modal Transformer, Localization Aware Predictor | TextVQA, ST-VQA, OCR-VQA |
SMA [7] | BERT | (Object) Faster R-CNN, (OCR) Faster R-CNN + FastText + PHOC | Self-Attention, Graph Attention, Dynamic Pointer Network | TextVQA, ST-VQA |
SA-M4C [5] | BERT | (Object) Faster R-CNN, (OCR) Faster R-CNN + FastText + PHOC | Multi-Modal Transformer, Spatial Attention, Dynamic Pointer Network | TextVQA, ST-VQA |
SNE [3] | BERT | (Object) Faster R-CNN, (OCR) Faster R-CNN + FastText + PHOC + Recog-CNN | Feature Summarizing Attention, Multi-Modal Transformer, Dynamic Pointer Network | TextVQA, ST-VQA |
Symbol | Definition
---|---
 | scene graph of an image with node set and edge set
 | object node set of each scene graph
 | attribute node set of each scene graph
 | relationship node set of each scene graph
 | total number of object and attribute nodes in each scene graph
 | total number of object nodes in each scene graph
 | total number of nodes in each scene graph
 | hidden states at layer l of the GCN
 | input scene graph node feature matrix of the GCN
 | adjacency matrix of scene graph nodes
 | degree matrix of scene graph nodes
 | question words
d | embedding size of question words
 | embedding size of OCR tokens
X | generalized input feature matrix into self-attention/guided attention
Y | generalized additional input feature matrix into guided attention
T | self-attended question representation
 | self-attended visual objects, OCR features and decoder hidden states
 | question-guided visual objects, OCR features and decoder hidden states
 | positional relation-aware (PRA) attention layer output
 | semantic relation-aware (SRA) attention layer output
 | subset of relationships that the j-th head of SRA attention attends to
 | number of relationships that each SRA/PRA head attends to
 | bias term introduced in the attention computation of SRA/PRA attention
t | time step at the decoding stage
 | decoder answer token at, before, and after time step t, respectively
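Several of the GCN-related symbols above (hidden states, input node feature matrix, adjacency matrix, and degree matrix) correspond to the standard graph convolutional layer used for the scene graph embedding in Section 3.3.2. As a reference point, the usual Kipf–Welling propagation rule takes the form below; the exact variant used in the paper may differ in detail.

```latex
H^{(l+1)} = \mathrm{ReLU}\!\left(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,H^{(l)}\,W^{(l)}\right),
\qquad \tilde{A} = A + I,\qquad H^{(0)} = X,
```

where A is the adjacency matrix of the scene graph nodes, the degree matrix is computed from the self-loop-augmented adjacency, X is the input node feature matrix, and W^{(l)} is a learnable weight matrix at layer l.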
Dataset | Year | Image Type | Question Type | Answer Type | Size (Img./Q.) |
---|---|---|---|---|---|
VQA v1 [11] | 2015 | photo-realistic images + abstract scenes | asking about different attributes of objects | open-ended/multiple choice | 255 K/760 K |
VQA v2 [44] | 2017 | photo-realistic images | asking about different attributes of objects | open-ended | 204 K/1.1 M |
CLEVR [45] | 2017 | auto-generated synthetic images | compositional questions | open-ended | 100 K/865 K |
Visual Genome [33] | 2017 | photo-realistic images | asking about different attributes of objects | open-ended | 108 K/1.7 M |
GQA [46] | 2019 | photo-realistic images | compositional questions | open-ended | 113 K/22 M |
Text-VQA [1] | 2019 | photo-realistic images containing text | asking about text in the images | open-ended | 28 K/45 K |
ST-VQA [37] | 2019 | photo-realistic images containing text | asking about text in the images | open-ended | 23 K/30 K |
Model | Acc. on Val | Acc. on Test |
---|---|---|
LoRRA | 26.56 | 27.63 |
MM-GNN | 32.92 | 32.46 |
M4C | 39.40 | 39.01 |
LaAP-Net | 40.68 | 40.54 |
SMA | 40.05 | 40.66 |
SA-M4C | 40.71 | 42.61 |
SNE | 40.38 | 40.92 |
SceneGATE (Ours) | 42.37 | 44.02 |
Model | Acc. on Val | ANLS on Val | ANLS on Test |
---|---|---|---|
MM-GNN | - | - | 0.207 |
M4C | 38.05 | 0.472 | 0.462 |
LaAP-Net | 39.74 | 0.497 | 0.485 |
SMA | - | - | 0.466 |
SA-M4C | 37.86 | 0.486 | 0.477 |
SNE | - | - | 0.509 |
SceneGATE (Ours) | 41.29 | 0.525 | 0.516 |
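The ANLS columns report the Average Normalized Levenshtein Similarity used by the ST-VQA benchmark. The following is a small self-contained Python sketch of the standard formulation (threshold 0.5, best score across the reference answers, averaged over questions); the helper names are ours, not part of any released evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if not a:
        return len(b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, ground_truths, tau=0.5):
    """ANLS over a list of predictions and per-question lists of reference answers."""
    scores = []
    for pred, refs in zip(predictions, ground_truths):
        best = 0.0
        for ref in refs:
            nl = levenshtein(pred.lower(), ref.lower()) / max(len(pred), len(ref), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)

# Example: an exact match scores 1.0; a prediction too far from every reference scores 0.
print(anls(["coca cola"], [["coca cola", "coca-cola"]]))  # 1.0
```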
Model | Acc. on Text-VQA | Acc. on ST-VQA
---|---|---
SceneGATE w GloVe | 41.33 | 40.79 |
SceneGATE w GCN | 41.36 | 40.86 |
SceneGATE w FastText | 42.37 | 41.29 |
Model | Acc. on Text-VQA | Acc. on ST-VQA
---|---|---
MM Transformer | 39.13 | 37.78 |
+ Guided Attention | 41.27 | 39.57 |
+ PRA Attention | 41.95 | 40.37 |
+ SRA Attention | 42.37 | 41.29 |
# MMTE | # PRA | # SRA | Acc. on Val |
---|---|---|---|
2 | 1 | 1 | 41.33 |
2 | 1 | 2 | 41.73 |
2 | 1 | 3 | 41.24 |
2 | 2 | 1 | 41.82 |
2 | 2 | 2 | 41.89 |
2 | 2 | 3 | 41.86 |
2 | 3 | 1 | 41.68 |
2 | 3 | 2 | 41.32 |
2 | 3 | 3 | 41.19 |
1 | 2 | 2 | 41.53 |
1 | 2 | 1 | 42.37 |