Scenario-Aware Recurrent Transformer for Goal-Directed Video Captioning

Published: 04 March 2022

Abstract

Fully mining visual cues to aid content understanding is crucial for video captioning. However, most state-of-the-art video captioning methods generate captions purely from straightforward frame-level information and ignore scenario and context information. To fill this gap, we propose a novel, simple yet effective scenario-aware recurrent transformer (SART) model for video captioning. Our model contains a “scenario understanding” module that builds a global perspective across multiple frames, providing a specific scenario that keeps the description goal-directed. Moreover, to achieve narrative continuity in the generated paragraph, a unified recurrent transformer is adopted. To demonstrate the effectiveness of SART, we conduct comprehensive experiments on several large-scale video description datasets, including ActivityNet, YouCookII, and VideoStory. We additionally extend a story-oriented evaluation framework to assess the quality of the generated captions more precisely. The superior performance shows that SART generates correct, deliberate, and narratively coherent video descriptions.
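To make the mechanism described in the abstract concrete, below is a minimal PyTorch-style sketch of the general idea: frame features are pooled into a scenario vector, a recurrent memory state carries that scenario across consecutive video segments, and a transformer decoder attends to both while generating each sentence. All module names, shapes, and the GRU-based memory update are illustrative assumptions, not the authors' published SART implementation.

```python
# Illustrative sketch only: a scenario vector pooled from frame features
# conditions a transformer decoder, and a recurrent memory state is carried
# across segments so consecutive sentences stay coherent. Names and shapes
# are assumptions for exposition, not the paper's actual architecture.
import torch
import torch.nn as nn

class ScenarioAwareRecurrentDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # "Scenario understanding": project the mean-pooled frame features.
        self.scenario_proj = nn.Linear(d_model, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        # Recurrent memory carried between segments (sentence by sentence).
        self.memory_gate = nn.GRUCell(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, tokens, prev_state=None):
        # frame_feats: (B, T, d_model) visual features for the current segment
        # tokens:      (B, L) word ids of the sentence generated so far
        scenario = self.scenario_proj(frame_feats.mean(dim=1))   # (B, d_model)
        if prev_state is None:
            prev_state = torch.zeros_like(scenario)
        state = self.memory_gate(scenario, prev_state)           # (B, d_model)
        # Decode against frame features plus the scenario/memory token.
        memory = torch.cat([frame_feats, state.unsqueeze(1)], dim=1)
        x = self.embed(tokens)
        L = tokens.size(1)
        causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal_mask)
        return self.out(h), state                                # logits, new memory


# Toy usage: two consecutive segments of the same video share the memory state.
model = ScenarioAwareRecurrentDecoder(vocab_size=1000)
feats1 = torch.randn(2, 16, 512)
feats2 = torch.randn(2, 16, 512)
words = torch.randint(0, 1000, (2, 12))
logits1, state = model(feats1, words)           # first sentence
logits2, state = model(feats2, words, state)    # next sentence, context carried over
```

The toy usage shows the point of the recurrence: the second segment is decoded with the memory state produced by the first, which is how the sketch approximates the paragraph-level narrative continuity the abstract describes.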

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 4
    November 2022
    497 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3514185
    • Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 04 March 2022
    Accepted: 01 December 2021
    Revised: 01 October 2021
    Received: 01 August 2021
    Published in TOMM Volume 18, Issue 4

    Author Tags

    1. Transformer
    2. video captioning
    3. scenario-aware
    4. long-time dependency

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Open Fund of Intelligent Terminal Key Laboratory of Sichuan Province
    • Zhejiang Lab’s International Talent Fund for Young Professionals

    Cited By

    • (2024) Deep Multimodal Data Fusion. ACM Computing Surveys 56, 9, 1–36. https://doi.org/10.1145/3649447. Online publication date: 24-Apr-2024.
    • (2024) Continuous Image Outpainting with Neural ODE. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1–16. https://doi.org/10.1145/3648367. Online publication date: 25-Apr-2024.
    • (2024) Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 4, 1–24. https://doi.org/10.1145/3633334. Online publication date: 11-Jan-2024.
    • (2024) GG-LLM: Geometrically Grounding Large Language Models for Zero-shot Human Activity Forecasting in Human-Aware Task Planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 568–574. https://doi.org/10.1109/ICRA57147.2024.10611090. Online publication date: 13-May-2024.
    • (2024) Quality Enhancement Based Video Captioning in Video Communication Systems. IEEE Access 12, 40989–40999. https://doi.org/10.1109/ACCESS.2024.3378313. Online publication date: 2024.
    • (2023) Characters Link Shots: Character Attention Network for Movie Scene Segmentation. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 4, 1–23. https://doi.org/10.1145/3630257. Online publication date: 11-Dec-2023.
    • (2023) Modeling Long-range Dependencies and Epipolar Geometry for Multi-view Stereo. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6, 1–17. https://doi.org/10.1145/3596445. Online publication date: 12-Jul-2023.
    • (2023) Semantic Enhanced Video Captioning with Multi-feature Fusion. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6, 1–21. https://doi.org/10.1145/3588572. Online publication date: 20-Mar-2023.
    • (2023) Video Captioning by Learning from Global Sentence and Looking Ahead. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5s, 1–20. https://doi.org/10.1145/3587252. Online publication date: 7-Jun-2023.
    • (2023) Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5, 1–21. https://doi.org/10.1145/3579825. Online publication date: 16-Mar-2023.
