Hierarchical Conditional Relation Networks for Multimodal Video Question Answering

Published: 01 November 2021

Abstract

Video Question Answering (Video QA) challenges modelers on multiple fronts. Modeling video necessitates building not only spatio-temporal models for the dynamic visual channel but also multimodal structures for associated information channels such as subtitles or audio. Video QA adds at least two more layers of complexity: selecting relevant content for each channel in the context of the linguistic query, and composing the spatio-temporal concepts and relations hidden in the data in response to the query. To address these requirements, we start with two insights: (a) content selection and relation construction can be jointly encapsulated in a conditional computational structure, and (b) video-length structures can be composed hierarchically. For (a), this paper introduces a general-purpose, reusable neural unit dubbed the Conditional Relation Network (CRN), which takes as input a set of tensorial objects and translates them into a new set of objects that encode relations among the inputs. The generic design of the CRN eases the typically complex model-building process of Video QA through simple block stacking and rearrangement, with the flexibility to accommodate diverse input modalities and conditioning features across both the visual and linguistic domains. We then realize insight (b) by introducing Hierarchical Conditional Relation Networks (HCRN) for Video QA. The HCRN primarily aims at exploiting intrinsic properties of the visual content of a video, as well as its accompanying channels, in terms of compositionality, hierarchy, and near-term and far-term relations. The HCRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content of a video, and long-form, where an additional associated information channel, such as movie subtitles, is present. Our rigorous evaluations show consistent improvements over state-of-the-art methods on well-studied benchmarks, including large-scale real-world datasets such as TGIF-QA and TVQA, demonstrating the strong capabilities of our CRN unit and the HCRN in complex domains such as Video QA. To the best of our knowledge, the HCRN is the first method attempting to handle long-form and short-form multimodal Video QA at the same time.
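To make the CRN idea above concrete, here is a minimal illustrative sketch in Python. It is not the authors' implementation: the mean aggregation and the multiplicative gating stand in for the paper's learned sub-networks, and the subset-sampling parameters (`num_samples`, the toy feature width of 128) are assumptions made purely for illustration.

```python
import itertools
import numpy as np

def crn_unit(objects, condition, num_samples=2, rng=None):
    """Illustrative CRN-style unit: maps a set of feature vectors to a new
    set of vectors, each summarizing a sampled subset of the inputs and
    modulated by a conditioning feature (e.g., an encoded question)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(objects)
    outputs = []
    for k in range(2, n):  # relations over subsets of size 2 .. n-1
        subsets = list(itertools.combinations(range(n), k))
        picks = rng.choice(len(subsets), size=min(num_samples, len(subsets)),
                           replace=False)  # sample a few subsets, not all
        for idx in picks:
            members = np.stack([objects[i] for i in subsets[idx]])
            relation = members.mean(axis=0)       # stand-in for a learned aggregator
            outputs.append(relation * condition)  # stand-in for learned conditioning
    return outputs

# Toy usage: six 128-d clip features conditioned on a question feature; a second
# call over the clip-level outputs gestures at the hierarchical stacking in HCRN.
rng = np.random.default_rng(7)
clips = [rng.standard_normal(128) for _ in range(6)]
question = rng.standard_normal(128)
clip_relations = crn_unit(clips, question, rng=rng)
video_relations = crn_unit(clip_relations[:5], question, rng=rng)
print(len(clip_relations), len(video_relations))
```

Because every such call consumes and produces sets of same-width vectors, units compose by simple stacking, which is the property the hierarchical design relies on.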


Cited By

  • (2024) Fine-Grained Multimodal DeepFake Classification via Heterogeneous Graphs. International Journal of Computer Vision 132(11): 5255–5269. https://doi.org/10.1007/s11263-024-02128-1. Online publication date: 1 Nov 2024.
  • (2024) Diagram Perception Networks for Textbook Question Answering via Joint Optimization. International Journal of Computer Vision 132(5): 1578–1591. https://doi.org/10.1007/s11263-023-01954-z. Online publication date: 1 May 2024.
  • (2023) Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications. https://doi.org/10.1145/3630101. Online publication date: 25 Oct 2023.


Information

        Published In

International Journal of Computer Vision, Volume 129, Issue 11
        Nov 2021
        222 pages

        Publisher

        Kluwer Academic Publishers

        United States

        Publication History

        Published: 01 November 2021
        Accepted: 06 August 2021
        Received: 31 July 2020

        Author Tags

        1. Video QA
        2. Relational networks
        3. Conditional modules
        4. Hierarchy

        Qualifiers

        • Research-article
