DOI: 10.5555/3600270.3602846

Towards video text visual question answering: benchmark and baseline

Published: 28 November 2022

Abstract

Several text-based visual question answering (TextVQA) benchmarks have been proposed in recent years to develop machines' ability to answer questions about text appearing in images. However, models developed on these benchmarks cannot work effectively in many real-life scenarios (e.g., traffic monitoring, shopping ads, and e-learning videos) where temporal reasoning is required. To this end, we propose a new task named Video Text Visual Question Answering (ViteVQA for short), which aims to answer questions by spatio-temporally reasoning over both texts and visual information in a given video. On the one hand, we build the first ViteVQA benchmark dataset, named M4-ViteVQA (short for Multi-category Multi-frame Multi-resolution Multi-modal benchmark for ViteVQA), which contains 7,620 video clips covering 9 categories (i.e., shopping, traveling, driving, vlog, sport, advertisement, movie, game, and talking) and 3 resolutions (i.e., 720p, 1080p, and 1176x664), together with 25,123 question-answer pairs. On the other hand, we develop a baseline method named T5-ViteVQA for the ViteVQA task. T5-ViteVQA consists of five transformers: it first extracts optical character recognition (OCR) tokens, question features, and video representations via two OCR transformers, one language transformer, and one video-language transformer, respectively; a multimodal fusion transformer and an answer generation module are then applied to fuse the multimodal information and produce the final prediction. Extensive experiments on M4-ViteVQA demonstrate the superiority of T5-ViteVQA over existing TextVQA and VQA approaches.
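The abstract describes T5-ViteVQA only at a high level: five transformers that encode OCR tokens, the question, and the video, followed by multimodal fusion and answer generation. The sketch below is a minimal, hypothetical PyTorch rendering of that structure, assuming pre-extracted OCR, question, and video features of a shared hidden size; all module choices, names, and hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class ViteVQASketch(nn.Module):
    """Toy stand-in for the five-transformer pipeline described in the abstract."""

    def __init__(self, d_model: int = 768, vocab_size: int = 32128):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Two OCR transformers (e.g., visual and semantic views of the OCR tokens).
        self.ocr_visual_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.ocr_semantic_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        # One language transformer for the question, one video-language transformer.
        self.question_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.video_lang_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        # One multimodal fusion transformer plus a toy answer-generation head.
        self.fusion_enc = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.answer_head = nn.Linear(d_model, vocab_size)

    def forward(self, ocr_feats, question_feats, video_feats):
        # Encode each modality separately, concatenate, fuse, and predict.
        ocr = self.ocr_semantic_enc(self.ocr_visual_enc(ocr_feats))
        question = self.question_enc(question_feats)
        video = self.video_lang_enc(video_feats)
        fused = self.fusion_enc(torch.cat([ocr, question, video], dim=1))
        return self.answer_head(fused[:, 0])  # logits over a toy answer vocabulary


if __name__ == "__main__":
    model = ViteVQASketch()
    ocr = torch.randn(2, 20, 768)       # pre-extracted OCR token features
    question = torch.randn(2, 12, 768)  # question token features
    video = torch.randn(2, 32, 768)     # frame-level video features
    print(model(ocr, question, video).shape)  # torch.Size([2, 32128])
```

In the actual method, the answer generation module is described as generative rather than a single classification head; the linear head here only marks where that module would sit in the pipeline.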

Supplementary Material

Supplemental material: 3600270.3602846_supp.pdf



Published In

NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems, November 2022, 39114 pages.

Publisher: Curran Associates Inc., Red Hook, NY, United States.
