DOI: 10.5555/3600270.3602846

Towards video text visual question answering: benchmark and baseline

Published: 28 November 2022

Abstract

Several text-based visual question answering (TextVQA) benchmarks have been proposed in recent years to develop machines' ability to answer questions about text appearing in images. However, models developed on these benchmarks cannot work effectively in many real-life scenarios (e.g., traffic monitoring, shopping ads, and e-learning videos) where temporal reasoning is required. To this end, we propose a new task named Video Text Visual Question Answering (ViteVQA for short), which aims to answer questions by spatio-temporally reasoning over both texts and visual information in a given video. On the one hand, we build the first ViteVQA benchmark dataset, named M4-ViteVQA (short for Multi-category Multi-frame Multi-resolution Multi-modal benchmark for ViteVQA), which contains 7,620 video clips covering 9 categories (i.e., shopping, traveling, driving, vlog, sport, advertisement, movie, game, and talking) and 3 resolutions (i.e., 720p, 1080p, and 1176x664), together with 25,123 question-answer pairs. On the other hand, we develop a baseline method named T5-ViteVQA for the ViteVQA task. T5-ViteVQA consists of five transformers: it first extracts optical character recognition (OCR) tokens, question features, and video representations via two OCR transformers, one language transformer, and one video-language transformer, respectively; a multimodal fusion transformer and an answer generation module are then applied to fuse the multimodal information and produce the final prediction. Extensive experiments on M4-ViteVQA demonstrate the superiority of T5-ViteVQA over existing TextVQA and VQA approaches.
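The abstract describes T5-ViteVQA only at a high level: five transformers that encode OCR tokens, the question, and the video, followed by multimodal fusion and answer generation. The sketch below is a minimal, hypothetical PyTorch rendering of that structure, assuming pre-extracted OCR, question, and video features of a shared hidden size; all module choices, names, and hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class ViteVQASketch(nn.Module):
    """Toy stand-in for the five-transformer pipeline described in the abstract."""

    def __init__(self, d_model: int = 768, vocab_size: int = 32128):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Two OCR transformers (e.g., visual and semantic views of the OCR tokens).
        self.ocr_visual_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.ocr_semantic_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        # One language transformer for the question, one video-language transformer.
        self.question_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.video_lang_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        # One multimodal fusion transformer plus a toy answer-generation head.
        self.fusion_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.answer_head = nn.Linear(d_model, vocab_size)

    def forward(self, ocr_feats, question_feats, video_feats):
        # Encode each modality separately, concatenate, fuse, and predict.
        ocr = self.ocr_semantic_enc(self.ocr_visual_enc(ocr_feats))
        question = self.question_enc(question_feats)
        video = self.video_lang_enc(video_feats)
        fused = self.fusion_enc(torch.cat([ocr, question, video], dim=1))
        return self.answer_head(fused[:, 0])  # logits over a toy answer vocabulary


if __name__ == "__main__":
    model = ViteVQASketch()
    ocr = torch.randn(2, 20, 768)       # pre-extracted OCR token features
    question = torch.randn(2, 12, 768)  # question token features
    video = torch.randn(2, 32, 768)     # frame-level video features
    print(model(ocr, question, video).shape)  # torch.Size([2, 32128])
```

In the actual method, the answer generation module is described as generative rather than a single classification head; the linear head here only marks where that module would sit in the pipeline.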

Supplementary Material

Supplemental material: 3600270.3602846_supp.pdf



Published In

NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems, November 2022, 39114 pages.

Publisher: Curran Associates Inc., Red Hook, NY, United States.
