Multichannel Attention Refinement for Video Question Answering

Published: 12 March 2020

Abstract

Video Question Answering (VideoQA) extends image question answering (ImageQA) to the video domain. In this task, a method must analyze the provided video and question and then produce the correct answer. Compared to ImageQA, the most distinctive difference is the media type. Both tasks require understanding visual media, but VideoQA is considerably more challenging, mainly because of the complexity and diversity of videos. In particular, working with video requires modeling its inherent temporal structure and analyzing the diverse information it contains. In this article, we propose to tackle the task from a multichannel perspective. Appearance, motion, and audio features are extracted from the video, and question-guided attentions over these channels are refined to generate expressive clues that support the correct answer. We also incorporate relevant text information acquired from Wikipedia to extend the capability of the method. Experiments on the TGIF-QA and ActivityNet-QA datasets show the advantages of our method over existing methods. We also demonstrate the effectiveness and interpretability of our method by analyzing the refined attention weights during the question-answering procedure.
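
The abstract describes question-guided attention computed over appearance, motion, and audio channels. The snippet below is a minimal, hypothetical NumPy sketch of that general idea, not the authors' released implementation: the bilinear scoring function, the feature dimensions, and the concatenation-based fusion are all illustrative assumptions.

# Hypothetical sketch of question-guided attention over multiple video
# channels (appearance, motion, audio). All names, dimensions, and the
# fusion step are illustrative assumptions, not the paper's actual model.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(channel_feats, question_vec, w):
    """Pool one channel's frame-level features with question-guided weights.

    channel_feats: (T, D) frame/segment features for one channel.
    question_vec:  (D,)  encoded question.
    w:             (D, D) learned bilinear interaction matrix (assumed).
    Returns the attended (D,) channel summary and the (T,) attention weights.
    """
    scores = channel_feats @ w @ question_vec        # (T,) relevance scores
    alpha = softmax(scores)                          # attention distribution
    return alpha @ channel_feats, alpha              # weighted sum over time

# Toy example: three channels with T=20 time steps and D=128 dimensions.
rng = np.random.default_rng(0)
D, T = 128, 20
question = rng.standard_normal(D)
channels = {name: rng.standard_normal((T, D))
            for name in ("appearance", "motion", "audio")}
weights = {name: rng.standard_normal((D, D)) * 0.01 for name in channels}

summaries = []
for name, feats in channels.items():
    summary, alpha = question_guided_attention(feats, question, weights[name])
    summaries.append(summary)

# Fuse channel summaries (here by simple concatenation) before answer decoding.
fused = np.concatenate(summaries)                    # (3 * D,)
print(fused.shape)

In the paper's actual model, the per-channel attention weights are refined under the question's guidance before fusion; here a single attention pass per channel stands in for that refinement.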




Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 16, Issue 1s
Special Issue on Multimodal Machine Learning for Human Behavior Analysis and Special Issue on Computational Intelligence for Biomedical Data and Imaging
January 2020
376 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3388236

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 March 2020
Accepted: 01 October 2019
Revised: 01 October 2019
Received: 01 April 2019
Published in TOMM Volume 16, Issue 1s


Author Tags

  1. Video question answering
  2. attention mechanism
  3. multichannel

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Key Research and Development Program of China
  • Chinese Knowledge Center for Engineering Sciences and Technology
  • Joint Research Program of ZJU & Hikvision Research Institute
  • Fundamental Research Funds for the Central Universities
  • Zhejiang Natural Science Foundation
  • National Natural Science Foundation of China


Cited By

  • (2024) Locate Before Answering: Answer Guided Question Localization for Video Question Answering. IEEE Transactions on Multimedia 26, 4554-4563. DOI: 10.1109/TMM.2023.3323878. Online publication date: 1-Jan-2024.
  • (2024) Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering. IEEE Transactions on Image Processing 33, 3115-3129. DOI: 10.1109/TIP.2024.3390984. Online publication date: 24-Apr-2024.
  • (2024) Video Q&A based on two-stage deep exploration of temporally-evolving features with enhanced cross-modal attention mechanism. Neural Computing and Applications 36, 14, 8055-8071. DOI: 10.1007/s00521-024-09482-8. Online publication date: 1-May-2024.
  • (2023) Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 4, 1-22. DOI: 10.1145/3630101. Online publication date: 11-Dec-2023.
  • (2023) Visual Paraphrase Generation with Key Information Retained. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6, 1-19. DOI: 10.1145/3585010. Online publication date: 27-Feb-2023.
  • (2023) Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5, 1-21. DOI: 10.1145/3579825. Online publication date: 16-Mar-2023.
  • (2023) Multi-Granularity Interaction and Integration Network for Video Question Answering. IEEE Transactions on Circuits and Systems for Video Technology 33, 12, 7684-7695. DOI: 10.1109/TCSVT.2023.3278492. Online publication date: 1-Dec-2023.
  • (2023) Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 13899-13909. DOI: 10.1109/ICCV51070.2023.01282. Online publication date: 1-Oct-2023.
  • (2022) A Differentiable Parallel Sampler for Efficient Video Classification. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 3, 1-18. DOI: 10.1145/3569584. Online publication date: 27-Oct-2022.
  • (2022) On Modality Bias Recognition and Reduction. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 3, 1-22. DOI: 10.1145/3565266. Online publication date: 29-Sep-2022.
