Rich Visual and Language Representation with Complementary Semantics for Video Captioning

Published: 05 June 2019

Abstract

Translating a video into natural description sentences based on its content is an interesting and challenging task. In this work, an advanced framework is built to generate coherent sentences with rich semantic expression for video captioning. A long short-term memory (LSTM) network with an improved factored way is first developed, drawing inspiration from the LSTM with a conventional factored way and from the common practice of feeding multi-modal features into the LSTM at the first time step for visual description. Then, the LSTM networks with the proposed improved factored way and with the un-factored way are combined, and a voting strategy is utilized to predict candidate words. In addition, for robust and abstract visual and language representation, residual connections, inspired by the residual network (ResNet), are employed to enhance the gradient signals, and a deeper LSTM network is constructed. Furthermore, three convolutional neural network (CNN) features, extracted from GoogLeNet, ResNet-101, and ResNet-152, are fused to capture more comprehensive and complementary visual information. Experiments are conducted on two benchmark datasets, MSVD and MSR-VTT2016, and the proposed techniques obtain competitive performance compared to other state-of-the-art methods.
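To make the described pipeline concrete, the following is a minimal PyTorch-style sketch (an illustration under assumed layer sizes, concatenation-based fusion, and a simple mean vote; it is not the authors' implementation): the three CNN features are fused, the fused visual feature is fed to an LSTM decoder at the first time step, and the word distributions of two decoders are averaged to vote on candidate words.

# Illustrative sketch only (assumed PyTorch; not the authors' code): fuse three CNN
# features, feed the fused visual feature to an LSTM decoder at the first time step,
# and "vote" by averaging the word distributions of two decoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.visual_embed = nn.Linear(feat_dim, embed_dim)   # project fused CNN features
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused_feats, captions):
        # Visual feature only at the first time step, then word embeddings.
        v = self.visual_embed(fused_feats).unsqueeze(1)       # (B, 1, E)
        w = self.word_embed(captions)                         # (B, T, E)
        out, _ = self.lstm(torch.cat([v, w], dim=1))          # (B, T+1, H)
        return self.classifier(out[:, 1:, :])                 # logits per caption position

def fuse_features(googlenet_feat, resnet101_feat, resnet152_feat):
    # Complementary visual information via simple concatenation.
    return torch.cat([googlenet_feat, resnet101_feat, resnet152_feat], dim=-1)

def vote(logits_a, logits_b):
    # Combine the two decoders by averaging their per-word probabilities.
    return 0.5 * (F.softmax(logits_a, dim=-1) + F.softmax(logits_b, dim=-1))

if __name__ == "__main__":
    B, T, vocab = 2, 8, 1000
    g, r101, r152 = torch.randn(B, 1024), torch.randn(B, 2048), torch.randn(B, 2048)
    feats = fuse_features(g, r101, r152)                      # (B, 5120)
    caps = torch.randint(0, vocab, (B, T))
    dec_a = CaptionDecoder(feats.size(-1), vocab)             # stands in for the improved factored decoder
    dec_b = CaptionDecoder(feats.size(-1), vocab)             # stands in for the un-factored decoder
    probs = vote(dec_a(feats, caps), dec_b(feats, caps))      # (B, T, vocab) voted word distribution
    print(probs.shape)

In the actual model, the two decoders differ structurally (improved factored versus un-factored) and the LSTM is deepened with residual connections; the sketch only illustrates the fusion and voting flow.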

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 2
    May 2019
    375 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3339884

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 05 June 2019
    Accepted: 01 December 2018
    Revised: 01 September 2018
    Received: 01 June 2018
    Published in TOMM Volume 15, Issue 2


    Author Tags

    1. Video captioning
    2. complementary features
    3. convolutional neural network
    4. long short term memory
    5. sequential voting

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning
    • Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing
    • IBM Shared University Research Awards Program

    Cited By

    • (2024) Center-enhanced video captioning model with multimodal semantic alignment. Neural Networks 180, 106744. DOI: 10.1016/j.neunet.2024.106744. Online publication date: Dec-2024.
    • (2024) Video emotional description with fact reinforcement and emotion awaking. Journal of Ambient Intelligence and Humanized Computing 15, 6, 2839–2852. DOI: 10.1007/s12652-024-04779-x. Online publication date: 20-Apr-2024.
    • (2023) Semantic Enhanced Video Captioning with Multi-feature Fusion. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6, 1–21. DOI: 10.1145/3588572. Online publication date: 20-Mar-2023.
    • (2023) Video Captioning by Learning from Global Sentence and Looking Ahead. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5s, 1–20. DOI: 10.1145/3587252. Online publication date: 7-Jun-2023.
    • (2023) Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5, 1–21. DOI: 10.1145/3579825. Online publication date: 16-Mar-2023.
    • (2023) Boosting Relationship Detection in Images with Multi-Granular Self-Supervised Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 2s, 1–18. DOI: 10.1145/3556978. Online publication date: 17-Feb-2023.
    • (2023) A Decoupled Kernel Prediction Network Guided by Soft Mask for Single Image HDR Reconstruction. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 2s, 1–23. DOI: 10.1145/3550277. Online publication date: 17-Feb-2023.
    • (2023) Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 2, 1–18. DOI: 10.1145/3550276. Online publication date: 6-Feb-2023.
    • (2023) Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering. ACM Transactions on Software Engineering and Methodology 32, 3, 1–40. DOI: 10.1145/3550271. Online publication date: 26-Apr-2023.
    • (2023) DIRE and its Data: Neural Decompiled Variable Renamings with Respect to Software Class. ACM Transactions on Software Engineering and Methodology 32, 2, 1–34. DOI: 10.1145/3546946. Online publication date: 29-Mar-2023.
