
Image Captioning With Visual-Semantic Double Attention

Published: 23 January 2019
    Abstract

    In this article, we propose a novel Visual-Semantic Double Attention (VSDA) model for image captioning. VSDA consists of two parts: a modified visual attention model that extracts sub-region image features, and a new SEmantic Attention (SEA) model that distills semantic features. Traditional attribute-based models tend to neglect the distinctive importance of each attribute word and feed all of them into the recurrent neural network, introducing many irrelevant semantic features. In contrast, at each timestep our model selects the attribute word most relevant to the current context. In other words, the real strength of VSDA lies in its ability not only to leverage semantic features but also to suppress irrelevant attribute words, making the semantic guidance more precise. Furthermore, our approach addresses the limitation that visual attention models offer little help in generating non-visual words. Since visual and semantic features are complementary, our model leverages both to strengthen the generation of visual and non-visual words alike. Extensive experiments are conducted on two widely used datasets, MS COCO and Flickr30k. The results show that VSDA outperforms other methods and achieves promising performance.
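
    To make the mechanism concrete, the sketch below illustrates one decoding step of a double-attention layer: a soft attention over sub-region visual features and a second attention over attribute-word embeddings, both conditioned on the decoder hidden state. It is a minimal PyTorch-style sketch written from the abstract alone; the module names, dimensions, and the soft-attention formulation are illustrative assumptions, not the paper's exact VSDA/SEA equations.

        # Minimal sketch of the double-attention idea described above.
        # All names and dimensions are assumptions for illustration.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class DoubleAttention(nn.Module):
            def __init__(self, feat_dim, attr_dim, hidden_dim):
                super().__init__()
                # Visual attention over sub-region CNN features
                self.v_feat = nn.Linear(feat_dim, hidden_dim)
                self.v_hid = nn.Linear(hidden_dim, hidden_dim)
                self.v_score = nn.Linear(hidden_dim, 1)
                # Semantic attention over attribute-word embeddings
                self.a_feat = nn.Linear(attr_dim, hidden_dim)
                self.a_hid = nn.Linear(hidden_dim, hidden_dim)
                self.a_score = nn.Linear(hidden_dim, 1)

            def forward(self, regions, attrs, h):
                # regions: (B, R, feat_dim) sub-region image features
                # attrs:   (B, A, attr_dim) embeddings of detected attribute words
                # h:       (B, hidden_dim)  current decoder (LSTM) hidden state
                v_e = self.v_score(torch.tanh(self.v_feat(regions) + self.v_hid(h).unsqueeze(1)))
                v_w = F.softmax(v_e.squeeze(-1), dim=1)            # weights over regions
                v_ctx = (v_w.unsqueeze(-1) * regions).sum(dim=1)   # visual context

                a_e = self.a_score(torch.tanh(self.a_feat(attrs) + self.a_hid(h).unsqueeze(1)))
                a_w = F.softmax(a_e.squeeze(-1), dim=1)            # weights over attribute words
                # The weights concentrate on the attribute word most relevant to the
                # current context, letting the decoder down-weight irrelevant
                # attributes instead of mixing them all.
                a_ctx = (a_w.unsqueeze(-1) * attrs).sum(dim=1)     # semantic context

                # Both contexts are returned to the decoder, which can rely on the
                # visual context for visual words and on the semantic context for
                # non-visual words.
                return v_ctx, a_ctx

    In use, `regions` would come from a CNN encoder's spatial feature maps, `attrs` from an attribute detector's word embeddings, and the two contexts would typically be combined with the previous word embedding as the decoder input at each step; those integration details are likewise assumptions here.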




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 15, Issue 1
    February 2019
    265 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3309717

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 January 2019
    Accepted: 01 November 2018
    Revised: 01 September 2018
    Received: 01 April 2018
    Published in TOMM Volume 15, Issue 1


    Author Tags

    1. Visual-semantic double attention
    2. image captioning
    3. semantic attention

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Science and Technology Program of Guangzhou of China
    • Fundamental Research Funds for the Central Universities of China
    • National Natural Science Foundation of China
    • Natural Science Foundation of Guangdong Province

    Article Metrics

    • Downloads (last 12 months): 15
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 27 Jul 2024


    Cited By

    • (2024) NPoSC-A3: A novel part of speech clues-aware adaptive attention mechanism for image captioning. Engineering Applications of Artificial Intelligence 131, 107732. DOI: 10.1016/j.engappai.2023.107732. Online publication date: May 2024.
    • (2024) Attribute guided fusion network for obtaining fine-grained image captions. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-19410-6. Online publication date: 27 May 2024.
    • (2023) Bottom-up and Top-down Object Inference Networks for Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5, 1-18. DOI: 10.1145/3580366. Online publication date: 16 March 2023.
    • (2023) Context-Adaptive-Based Image Captioning by Bi-CARU. IEEE Access 11, 84934-84943. DOI: 10.1109/ACCESS.2023.3302512. Online publication date: 2023.
    • (2023) A Multiheaded Attention-Based Model for Generating Hindi Captions. Proceedings of Third Emerging Trends and Technologies on Intelligent Systems, 677-684. DOI: 10.1007/978-981-99-3963-3_51. Online publication date: 20 September 2023.
    • (2023) Benefit from AMR: Image Captioning with Explicit Relations and Endogenous Knowledge. Web and Big Data, 363-376. DOI: 10.1007/978-981-97-2390-4_25. Online publication date: 6 October 2023.
    • (2022) Hybrid of Deep Learning and Word Embedding in Generating Captions: Image-Captioning Solution for Geological Rock Images. Journal of Imaging 8, 11, 294. DOI: 10.3390/jimaging8110294. Online publication date: 22 October 2022.
    • (2022) Augmentation of Deep Learning Models for Multistep Traffic Speed Prediction. Applied Sciences 12, 19, 9723. DOI: 10.3390/app12199723. Online publication date: 27 September 2022.
    • (2021) Bi-Directional Co-Attention Network for Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4, 1-20. DOI: 10.1145/3460474. Online publication date: 12 November 2021.
    • (2021) Looking Back and Forward: Enhancing Image Captioning with Global Semantic Guidance. 2021 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN52387.2021.9533418. Online publication date: 2021.
