
Image Captioning via Semantic Guidance Attention and Consensus Selection Strategy

Published: 10 October 2018

Abstract

Recently, a series of attempts have incorporated spatial attention mechanisms into the task of image captioning, achieving a remarkable improvement in the quality of generated captions. However, the traditional spatial attention mechanism relies on latent and delayed semantic representations to decide which regions deserve more attention, which leads to inaccurate semantic guidance and introduces redundant information. To optimize the spatial attention mechanism, we propose the Semantic Guidance Attention (SGA) mechanism in this article. Specifically, SGA uses semantic word representations to provide intuitive semantic guidance that focuses accurately on semantically related regions. Moreover, we reduce the difficulty of generating fluent sentences by updating the attention information in time. Meanwhile, the beam search algorithm is widely used to predict words during sequence generation. Because it generates a sentence according to the probabilities of individual words, it tends to produce generic sentences and discard more distinctive captions. To overcome this limitation, we design the Consensus Selection (CS) strategy, which chooses the most descriptive and informative caption according to the semantic similarity of captions rather than the probabilities of words. The consensus caption is the candidate with the highest cumulative semantic similarity with respect to the reference captions. Our proposed model (SGA-CS) is validated on Flickr30k and MSCOCO and outperforms state-of-the-art approaches. To the best of our knowledge, SGA-CS is the first attempt to jointly produce semantic attention guidance and select descriptive captions for image captioning, and it achieves one of the best performances among cross-entropy training methods.
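
A minimal sketch of the Consensus Selection idea described above, under stated assumptions: captions are embedded by averaging pre-trained word vectors and compared with cosine similarity, the candidate set comes from beam search, and the reference caption set is supplied by the caller. The function names and these modeling choices are illustrative, not the authors' implementation.

import numpy as np

def embed_caption(caption, word_vectors, dim=300):
    """Average word embeddings as a simple sentence-level representation
    (an assumption; any semantic sentence embedding could be substituted)."""
    vecs = [word_vectors[w] for w in caption.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    """Cosine similarity, used here as the semantic-similarity measure."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom > 0 else 0.0

def consensus_selection(candidates, references, word_vectors):
    """Return the beam-search candidate with the highest cumulative semantic
    similarity to the reference captions (the consensus caption), rather than
    the candidate with the highest word-level probability."""
    ref_embs = [embed_caption(r, word_vectors) for r in references]
    scores = [sum(cosine(embed_caption(c, word_vectors), r) for r in ref_embs)
              for c in candidates]
    return candidates[int(np.argmax(scores))]

With hypothetical inputs, for example candidates produced by beam search and references drawn from captions of semantically similar images, consensus_selection re-ranks by semantic agreement rather than by sentence probability, which is the behavior the Consensus Selection strategy aims for.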



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 14, Issue 4
Special Section on Deep Learning for Intelligent Multimedia Analytics
November 2018
221 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3282485
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2018
Accepted: 01 August 2018
Revised: 01 June 2018
Received: 01 April 2018
Published in TOMM Volume 14, Issue 4

Author Tags

  1. Image captioning
  2. beam search
  3. consensus selection strategy
  4. semantic guidance attention mechanism
  5. spatial attention mechanism

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Guangdong Province
  • Science and Technology Program of Guangzhou
  • Fundamental Research Funds for the Central Universities of China

Cited By

  • (2023) Real-time Computational Cinematographic Editing for Broadcasting of Volumetric-captured events: an Application to Ultimate Fighting. Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games, 1-8. DOI: 10.1145/3623264.3624468. Online publication date: 15-Nov-2023.
  • (2023) AMSA: Adaptive Multimodal Learning for Sentiment Analysis. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 3s, 1-21. DOI: 10.1145/3572915. Online publication date: 24-Feb-2023.
  • (2023) Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 2, 1-18. DOI: 10.1145/3550276. Online publication date: 6-Feb-2023.
  • (2023) Boosting Scene Graph Generation with Visual Relation Saliency. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 1, 1-17. DOI: 10.1145/3514041. Online publication date: 5-Jan-2023.
  • (2022) Content-Based Collaborative Filtering With Predictive Error Reduction-Based CNN Using IPU Model. International Journal of Information Security and Privacy 16, 2, 1-19. DOI: 10.4018/IJISP.308309. Online publication date: 16-Sep-2022.
  • (2022) Shoot360: Normal View Video Creation from City Panorama Footage. ACM SIGGRAPH 2022 Conference Proceedings, 1-9. DOI: 10.1145/3528233.3530702. Online publication date: 27-Jul-2022.
  • (2021) Online Multi-Granularity Distillation for GAN Compression. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 6773-6783. DOI: 10.1109/ICCV48922.2021.00672. Online publication date: Oct-2021.
  • (2020) Adaptive Attention-based High-level Semantic Introduction for Image Caption. ACM Transactions on Multimedia Computing, Communications, and Applications 16, 4, 1-22. DOI: 10.1145/3409388. Online publication date: 17-Dec-2020.
  • (2020) Learning From Music to Visual Storytelling of Shots: A Deep Interactive Learning Mechanism. Proceedings of the 28th ACM International Conference on Multimedia, 102-110. DOI: 10.1145/3394171.3413985. Online publication date: 12-Oct-2020.
  • (2020) Joint Attention for Automated Video Editing. Proceedings of the 2020 ACM International Conference on Interactive Media Experiences, 55-64. DOI: 10.1145/3391614.3393656. Online publication date: 17-Jun-2020.
