
Constrained LSTM and Residual Attention for Image Captioning

Published: 05 July 2020

Abstract

Visual structure and syntactic structure are essential in images and texts, respectively. Visual structure depicts both the entities in an image and their interactions, whereas syntactic structure in texts reflects the part-of-speech constraints between adjacent words. Most existing methods either use a global visual representation to guide the language model or generate captions without considering the relationships between different entities or adjacent words. Thus, their language models lack relevance to both visual and syntactic structure. To solve this problem, we propose a model that aligns the language model to a given visual structure and also constrains it with a specific part-of-speech template. In addition, most methods exploit the latent relationship between words in a sentence and pre-extracted visual regions in an image yet ignore the effect of unextracted regions on the predicted words. We develop a residual attention mechanism that simultaneously attends to the pre-extracted visual objects and the unextracted regions of an image. Residual attention captures the precise regions of an image corresponding to the predicted words by considering the effects of both visual objects and unextracted regions. The effectiveness of our entire framework and of each proposed module is verified on two classical datasets: MSCOCO and Flickr30k. Our framework is on par with or better than state-of-the-art methods and achieves superior performance on the COCO captioning leaderboard.
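The abstract's residual attention mechanism attends jointly over detector regions and the remaining image area. A minimal sketch of how such a module might combine pre-extracted object features with grid features for unextracted regions is given below; the class, argument names, dimensions, and the PyTorch framing are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualAttention(nn.Module):
        """Sketch: joint attention over pre-extracted object regions and
        grid cells covering unextracted regions (hypothetical layout)."""
        def __init__(self, obj_dim, grid_dim, hidden_dim, att_dim):
            super().__init__()
            self.obj_proj = nn.Linear(obj_dim, att_dim)    # project detector features
            self.grid_proj = nn.Linear(grid_dim, att_dim)  # project CNN grid features
            self.h_proj = nn.Linear(hidden_dim, att_dim)   # project LSTM hidden state
            self.score = nn.Linear(att_dim, 1)             # scalar attention score per region

        def forward(self, obj_feats, grid_feats, h):
            # obj_feats:  (B, N_obj, obj_dim)   pre-extracted object regions
            # grid_feats: (B, N_grid, grid_dim) grid cells for unextracted regions
            # h:          (B, hidden_dim)       current language-model hidden state
            regions = torch.cat([self.obj_proj(obj_feats),
                                 self.grid_proj(grid_feats)], dim=1)  # (B, N_obj+N_grid, att_dim)
            scores = self.score(torch.tanh(regions + self.h_proj(h).unsqueeze(1)))  # (B, N, 1)
            alpha = F.softmax(scores, dim=1)               # one distribution over all regions
            context = (alpha * regions).sum(dim=1)         # (B, att_dim) attended context
            return context, alpha.squeeze(-1)

At each decoding step, the returned context vector would be fed to the caption LSTM alongside the word embedding, so that predicted words can draw on both object and non-object regions.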


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 16, Issue 3
August 2020
364 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3409646

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 July 2020
Online AM: 07 May 2020
Accepted: 01 March 2020
Revised: 01 September 2019
Received: 01 June 2019
Published in TOMM Volume 16, Issue 3


Author Tags

  1. Image captioning
  2. LSTM
  3. object detection
  4. visual attention
  5. visual skeleton

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • National Key R&D Program of China
  • Natural Science Foundation of Guangdong
  • Fundamental Research Funds for the Central Universities of China
  • Science and Technology Program of Guangzhou

