DOI: 10.1145/3474085.3475519
Research article

Discriminative Latent Semantic Graph for Video Captioning

Published: 17 October 2021

Abstract

Video captioning aims to automatically generate natural language sentences that describe the visual content of a given video. Existing generative models, such as encoder-decoder frameworks, cannot explicitly explore object-level interactions and frame-level information in complex spatio-temporal data to generate semantically rich captions. Our main contribution is to identify three key problems and address them in a joint framework that can also serve future video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional Graph that fuses spatio-temporal information into latent object proposals. 2) Visual Knowledge: Latent Proposal Aggregation is proposed to dynamically extract visual words at a higher semantic level. 3) Sentence Validation: a novel Discriminative Language Validator is proposed to verify generated captions so that key semantic concepts are effectively preserved. Experiments on two public datasets (MSVD and MSR-VTT) show significant improvements over state-of-the-art approaches on all metrics, especially BLEU-4 and CIDEr. Our code is available at https://github.com/baiyang4/D-LSG-Video-Caption.
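To make the second component above more concrete, the sketch below illustrates one plausible way a latent proposal aggregation step could work: a small set of learned latent nodes attends over detected object proposals that have been conditioned on frame-level context, yielding compact "visual word" representations. This is a minimal illustrative sketch in PyTorch, not the authors' implementation; the class name, dimensions, and parameters (LatentProposalAggregation, num_latent, latent_dim, etc.) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentProposalAggregation(nn.Module):
    """Hypothetical sketch: pools N object proposals into K latent 'visual word'
    nodes via learned attention, conditioned on a frame-level context vector."""

    def __init__(self, obj_dim, ctx_dim, latent_dim, num_latent=10):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, latent_dim)   # project object (region) features
        self.ctx_proj = nn.Linear(ctx_dim, latent_dim)   # project pooled frame-level features
        self.latent_query = nn.Parameter(torch.randn(num_latent, latent_dim))  # K latent nodes

    def forward(self, obj_feats, ctx_feat):
        # obj_feats: (B, N, obj_dim) object proposals pooled over the clip
        # ctx_feat:  (B, ctx_dim)    frame-level appearance/motion context
        nodes = self.obj_proj(obj_feats)                 # (B, N, D)
        ctx = self.ctx_proj(ctx_feat).unsqueeze(1)       # (B, 1, D)
        nodes = nodes + ctx                              # condition object nodes on video context
        # each latent query attends over the conditioned object nodes
        attn = torch.einsum('kd,bnd->bkn', self.latent_query, nodes)
        attn = F.softmax(attn / nodes.size(-1) ** 0.5, dim=-1)   # (B, K, N)
        return torch.bmm(attn, nodes)                    # (B, K, D) latent visual words


if __name__ == "__main__":
    B, N, K = 2, 36, 10
    lpa = LatentProposalAggregation(obj_dim=2048, ctx_dim=1536, latent_dim=512, num_latent=K)
    out = lpa(torch.randn(B, N, 2048), torch.randn(B, 1536))
    print(out.shape)  # torch.Size([2, 10, 512])
```

In the framework described above, such latent nodes would feed the caption decoder, while the Discriminative Language Validator scores the generated sentence against these visual concepts; those parts are omitted from this sketch.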




Information

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021


Author Tags

  1. discriminative
  2. graph neural networks
  3. semantic proposals
  4. video captioning

Qualifiers

  • Research-article

Funding Sources

  • Medical Research Council (MRC) Fellowship
  • Engineering and Physical Sciences Research Council (EPSRC) Project CRITiCaL: Combatting cRiminals In The CLoud

Conference

MM '21
Sponsor:
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 1,291 of 5,076 submissions, 25%


Article Metrics

  • Downloads (Last 12 months): 30
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 15 Feb 2025


Cited By

  • (2025) Visual Commonsense-Aware Representation Network for Video Captioning. IEEE Transactions on Neural Networks and Learning Systems, 36(1):1092-1103. DOI: 10.1109/TNNLS.2023.3323491. Online publication date: Jan 2025.
  • (2025) CroCaps: A CLIP-assisted cross-domain video captioner. Expert Systems with Applications, 268:126296. DOI: 10.1016/j.eswa.2024.126296. Online publication date: Apr 2025.
  • (2024) Chinese Title Generation for Short Videos: Dataset, Metric and Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(7):5192-5208. DOI: 10.1109/TPAMI.2024.3365739. Online publication date: Jul 2024.
  • (2024) CLIP-based Semantic Enhancement and Vocabulary Expansion for Video Captioning Using Reinforcement Learning. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10651205. Online publication date: 30 Jun 2024.
  • (2024) Multi-level video captioning method based on semantic space. Multimedia Tools and Applications, 83(28):72113-72130. DOI: 10.1007/s11042-024-18372-z. Online publication date: 8 Feb 2024.
  • (2023) Focus and Align: Learning Tube Tokens for Video-Language Pre-Training. IEEE Transactions on Multimedia, 25:8036-8050. DOI: 10.1109/TMM.2022.3231108. Online publication date: 1 Jan 2023.
  • (2023) Concept-Aware Video Captioning: Describing Videos With Effective Prior Information. IEEE Transactions on Image Processing, 32:5366-5378. DOI: 10.1109/TIP.2023.3307969. Online publication date: 1 Jan 2023.
  • (2023) Towards Few-shot Image Captioning with Cycle-based Compositional Semantic Enhancement Framework. 2023 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN54540.2023.10191558. Online publication date: 18 Jun 2023.
  • (2023) Community-Aware Federated Video Summarization. 2023 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN54540.2023.10191101. Online publication date: 18 Jun 2023.
  • (2023) Action knowledge for video captioning with graph neural networks. Journal of King Saud University - Computer and Information Sciences, 35(4):50-62. DOI: 10.1016/j.jksuci.2023.03.006. Online publication date: 1 Apr 2023.
