DOI: 10.1145/3474085.3475519
Research article

Discriminative Latent Semantic Graph for Video Captioning

Published: 17 October 2021

Abstract

Video captioning aims to automatically generate natural language sentences that describe the visual content of a given video. Existing generative models, such as encoder-decoder frameworks, cannot explicitly explore object-level interactions and frame-level information in complex spatio-temporal data to generate semantically rich captions. Our main contribution is to identify three key problems and address them in a joint framework that can also serve future video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional Graph that fuses spatio-temporal information into latent object proposals. 2) Visual Knowledge: Latent Proposal Aggregation is proposed to dynamically extract visual words at a higher semantic level. 3) Sentence Validation: a novel Discriminative Language Validator is proposed to verify generated captions so that key semantic concepts are effectively preserved. Experiments on two public datasets (MSVD and MSR-VTT) show significant improvements over state-of-the-art approaches on all metrics, especially BLEU-4 and CIDEr. Our code is available at https://github.com/baiyang4/D-LSG-Video-Caption.
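To make the second component above more concrete, the sketch below illustrates one plausible way a latent proposal aggregation step could work: a small set of learned latent nodes attends over detected object proposals that have been conditioned on frame-level context, yielding compact "visual word" representations. This is a minimal illustrative sketch in PyTorch, not the authors' implementation; the class name, dimensions, and parameters (LatentProposalAggregation, num_latent, latent_dim, etc.) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentProposalAggregation(nn.Module):
    """Hypothetical sketch: pools N object proposals into K latent 'visual word'
    nodes via learned attention, conditioned on a frame-level context vector."""

    def __init__(self, obj_dim, ctx_dim, latent_dim, num_latent=10):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, latent_dim)   # project object (region) features
        self.ctx_proj = nn.Linear(ctx_dim, latent_dim)   # project pooled frame-level features
        self.latent_query = nn.Parameter(torch.randn(num_latent, latent_dim))  # K latent nodes

    def forward(self, obj_feats, ctx_feat):
        # obj_feats: (B, N, obj_dim) object proposals pooled over the clip
        # ctx_feat:  (B, ctx_dim)    frame-level appearance/motion context
        nodes = self.obj_proj(obj_feats)                 # (B, N, D)
        ctx = self.ctx_proj(ctx_feat).unsqueeze(1)       # (B, 1, D)
        nodes = nodes + ctx                              # condition object nodes on video context
        # each latent query attends over the conditioned object nodes
        attn = torch.einsum('kd,bnd->bkn', self.latent_query, nodes)
        attn = F.softmax(attn / nodes.size(-1) ** 0.5, dim=-1)   # (B, K, N)
        return torch.bmm(attn, nodes)                    # (B, K, D) latent visual words


if __name__ == "__main__":
    B, N, K = 2, 36, 10
    lpa = LatentProposalAggregation(obj_dim=2048, ctx_dim=1536, latent_dim=512, num_latent=K)
    out = lpa(torch.randn(B, N, 2048), torch.randn(B, 1536))
    print(out.shape)  # torch.Size([2, 10, 512])
```

In the framework described above, such latent nodes would feed the caption decoder, while the Discriminative Language Validator scores the generated sentence against these visual concepts; those parts are omitted from this sketch.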




Information

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2021


Author Tags

  1. discriminative
  2. graph neural networks
  3. semantic proposals
  4. video captioning

Qualifiers

  • Research-article

Funding Sources

  • Medical Research Council (MRC) Fellowship
  • Engineering and Physical Sciences Research Council (EPSRC) Project CRITiCaL: Combatting cRiminals In The CLoud

Conference

MM '21
Sponsor:
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 1,291 of 5,076 submissions, 25%


Article Metrics

  • Downloads (Last 12 months): 30
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 15 Feb 2025


Cited By

  • (2025) Visual Commonsense-Aware Representation Network for Video Captioning. IEEE Transactions on Neural Networks and Learning Systems, 36(1):1092-1103. DOI: 10.1109/TNNLS.2023.3323491. Online publication date: Jan 2025.
  • (2025) CroCaps: A CLIP-assisted cross-domain video captioner. Expert Systems with Applications, 268:126296. DOI: 10.1016/j.eswa.2024.126296. Online publication date: Apr 2025.
  • (2024) Chinese Title Generation for Short Videos: Dataset, Metric and Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(7):5192-5208. DOI: 10.1109/TPAMI.2024.3365739. Online publication date: Jul 2024.
  • (2024) CLIP-based Semantic Enhancement and Vocabulary Expansion for Video Captioning Using Reinforcement Learning. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10651205. Online publication date: 30 Jun 2024.
  • (2024) Multi-level video captioning method based on semantic space. Multimedia Tools and Applications, 83(28):72113-72130. DOI: 10.1007/s11042-024-18372-z. Online publication date: 8 Feb 2024.
  • (2023) Focus and Align: Learning Tube Tokens for Video-Language Pre-Training. IEEE Transactions on Multimedia, 25:8036-8050. DOI: 10.1109/TMM.2022.3231108. Online publication date: 1 Jan 2023.
  • (2023) Concept-Aware Video Captioning: Describing Videos With Effective Prior Information. IEEE Transactions on Image Processing, 32:5366-5378. DOI: 10.1109/TIP.2023.3307969. Online publication date: 1 Jan 2023.
  • (2023) Towards Few-shot Image Captioning with Cycle-based Compositional Semantic Enhancement Framework. 2023 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN54540.2023.10191558. Online publication date: 18 Jun 2023.
  • (2023) Community-Aware Federated Video Summarization. 2023 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN54540.2023.10191101. Online publication date: 18 Jun 2023.
  • (2023) Action knowledge for video captioning with graph neural networks. Journal of King Saud University - Computer and Information Sciences, 35(4):50-62. DOI: 10.1016/j.jksuci.2023.03.006. Online publication date: 1 Apr 2023.
