Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning

Published: 04 March 2022

Abstract

Fully mining visual cues to aid content understanding is crucial for video captioning. However, most state-of-the-art video captioning methods generate captions purely from straightforward frame-level information and ignore scenario and context information. To fill this gap, we propose a novel, simple yet effective scenario-aware recurrent transformer (SART) model for video captioning. Our model contains a “scenario understanding” module that builds a global perspective across multiple frames, providing a specific scenario that keeps the description goal-directed. Moreover, to achieve narrative continuity in the generated paragraph, a unified recurrent transformer is adopted. To demonstrate the effectiveness of SART, we conduct comprehensive experiments on several large-scale video description datasets, including ActivityNet, YouCookII, and VideoStory. We additionally extend a story-oriented evaluation framework to assess the quality of the generated captions more precisely. The superior performance shows that SART generates correct, deliberate, and narratively coherent video descriptions.
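To make the mechanism described in the abstract concrete, below is a minimal PyTorch-style sketch of the general idea: frame features are pooled into a scenario vector, a recurrent memory state carries that scenario across consecutive video segments, and a transformer decoder attends to both while generating each sentence. All module names, shapes, and the GRU-based memory update are illustrative assumptions, not the authors' published SART implementation.

```python
# Illustrative sketch only: a scenario vector pooled from frame features
# conditions a transformer decoder, and a recurrent memory state is carried
# across segments so consecutive sentences stay coherent. Names and shapes
# are assumptions for exposition, not the paper's actual architecture.
import torch
import torch.nn as nn

class ScenarioAwareRecurrentDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # "Scenario understanding": project the mean-pooled frame features.
        self.scenario_proj = nn.Linear(d_model, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Recurrent memory carried between segments (sentence by sentence).
        self.memory_gate = nn.GRUCell(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, tokens, prev_state=None):
        # frame_feats: (B, T, d_model) visual features for the current segment
        # tokens:      (B, L) word ids of the sentence generated so far
        scenario = self.scenario_proj(frame_feats.mean(dim=1))   # (B, d_model)
        if prev_state is None:
            prev_state = torch.zeros_like(scenario)
        state = self.memory_gate(scenario, prev_state)           # (B, d_model)
        # Decode against frame features plus the scenario/memory token.
        memory = torch.cat([frame_feats, state.unsqueeze(1)], dim=1)
        x = self.embed(tokens)
        L = tokens.size(1)
        causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal_mask)
        return self.out(h), state                                # logits, new memory


# Toy usage: two consecutive segments of the same video share the memory state.
model = ScenarioAwareRecurrentDecoder(vocab_size=1000)
feats1 = torch.randn(2, 16, 512)
feats2 = torch.randn(2, 16, 512)
words = torch.randint(0, 1000, (2, 12))
logits1, state = model(feats1, words)           # first sentence
logits2, state = model(feats2, words, state)    # next sentence, context carried over
```

The toy usage shows the point of the recurrence: the second segment is decoded with the memory state produced by the first, which is how the sketch approximates the paragraph-level narrative continuity the abstract describes.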

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 4
    November 2022
    497 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3514185
    • Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 March 2022
    Accepted: 01 December 2021
    Revised: 01 October 2021
    Received: 01 August 2021
    Published in TOMM Volume 18, Issue 4

    Author Tags

    1. Transformer
    2. video captioning
    3. scenario-aware
    4. long-time dependency

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Open Fund of Intelligent Terminal Key Laboratory of Sichuan Province
    • Zhejiang Lab’s International Talent Fund for Young Professionals

    Cited By

    • (2024) Deep Multimodal Data Fusion. ACM Computing Surveys 56, 9, 1–36. https://doi.org/10.1145/3649447. Online publication date: 24-Apr-2024.
    • (2024) Continuous Image Outpainting with Neural ODE. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1–16. https://doi.org/10.1145/3648367. Online publication date: 25-Apr-2024.
    • (2024) Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 4, 1–24. https://doi.org/10.1145/3633334. Online publication date: 11-Jan-2024.
    • (2024) GG-LLM: Geometrically Grounding Large Language Models for Zero-shot Human Activity Forecasting in Human-Aware Task Planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 568–574. https://doi.org/10.1109/ICRA57147.2024.10611090. Online publication date: 13-May-2024.
    • (2024) Quality Enhancement Based Video Captioning in Video Communication Systems. IEEE Access 12, 40989–40999. https://doi.org/10.1109/ACCESS.2024.3378313. Online publication date: 2024.
    • (2023) Characters Link Shots: Character Attention Network for Movie Scene Segmentation. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 4, 1–23. https://doi.org/10.1145/3630257. Online publication date: 11-Dec-2023.
    • (2023) Modeling Long-range Dependencies and Epipolar Geometry for Multi-view Stereo. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6, 1–17. https://doi.org/10.1145/3596445. Online publication date: 12-Jul-2023.
    • (2023) Semantic Enhanced Video Captioning with Multi-feature Fusion. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6, 1–21. https://doi.org/10.1145/3588572. Online publication date: 20-Mar-2023.
    • (2023) Video Captioning by Learning from Global Sentence and Looking Ahead. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5s, 1–20. https://doi.org/10.1145/3587252. Online publication date: 7-Jun-2023.
    • (2023) Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5, 1–21. https://doi.org/10.1145/3579825. Online publication date: 16-Mar-2023.
