
Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering

Published: 24 April 2024

Abstract

Long-term Video Question Answering (VideoQA) is a challenging vision-and-language task that requires semantic understanding of untrimmed long-term videos and diverse free-form questions, together with comprehensive cross-modal reasoning to produce precise answers. Canonical approaches often rely on off-the-shelf feature extractors to sidestep the heavy computational overhead, but this tends to yield domain-agnostic representations that are poorly aligned across modalities. Moreover, the gradient blocking between unimodal comprehension and cross-modal interaction inherent in such pipelines hinders reliable answer generation. Recently emerged video-language pre-training models, in contrast, enable cost-effective end-to-end modeling, yet they fall short in domain-specific reasoning and differ from VideoQA in task formulation. To this end, we present a fully end-to-end solution for long-term VideoQA: the Multi-granularity Contrastive cross-modal collaborative Generation (MCG) model. To derive discriminative representations that capture high-level visual concepts, we introduce Joint Unimodal Modeling (JUM) on a clip-bone architecture and leverage Multi-granularity Contrastive Learning (MCL) to exploit the semantic correspondences exhibited both intrinsically and explicitly. To alleviate the task-formulation discrepancy, we propose a Cross-modal Collaborative Generation (CCG) module that reformulates VideoQA as a generative task rather than the conventional classification scheme, equipping the model with high-level cross-modal semantic fusion and generation so that it can reason toward the answer. Extensive experiments on six publicly available VideoQA datasets demonstrate the superiority of the proposed method.
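As a rough, non-authoritative illustration of the multi-granularity contrastive idea mentioned in the abstract, the sketch below applies a symmetric InfoNCE-style loss at two granularities (clip-text and video-text). The function names, tensor shapes, pooling choice, and the InfoNCE formulation itself are assumptions made for exposition; they are not taken from the paper's implementation.

```python
# Illustrative sketch only: a symmetric InfoNCE-style contrastive loss applied
# at two granularities (clip-text and video-text). Names, shapes, and the use
# of InfoNCE are assumptions for exposition, not the authors' exact method.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multi_granularity_contrastive_loss(clip_emb, text_emb, video_emb,
                                       clip_weight=1.0, video_weight=1.0):
    """clip_emb:  (B, num_clips, D) per-clip visual features
       text_emb:  (B, D) question/text features
       video_emb: (B, D) pooled whole-video features."""
    # Fine granularity: align each video's mean-pooled clip features with its text.
    clip_level = info_nce(clip_emb.mean(dim=1), text_emb)
    # Coarse granularity: align the global video embedding with the same text.
    video_level = info_nce(video_emb, text_emb)
    return clip_weight * clip_level + video_weight * video_level


if __name__ == "__main__":
    B, N, D = 8, 4, 256
    loss = multi_granularity_contrastive_loss(torch.randn(B, N, D),
                                              torch.randn(B, D),
                                              torch.randn(B, D))
    print(loss.item())
```

In this hypothetical formulation, the two loss terms share the same batch of video-question pairs, so tightening clip-level and video-level alignment jointly is what "multi-granularity" would amount to; the paper's actual granularities and weighting may differ.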


Published In

IEEE Transactions on Image Processing, Volume 33, 2024, 5082 pages

Publisher

IEEE Press

Publication History

Published: 24 April 2024

Qualifiers

  • Research-article
