
Image Captioning With Visual-Semantic Double Attention

Published: 23 January 2019
    Abstract

    In this article, we propose a novel Visual-Semantic Double Attention (VSDA) model for image captioning. VSDA consists of two parts: a modified visual attention model that extracts sub-region image features, and a new SEmantic Attention (SEA) model that distills semantic features. Traditional attribute-based models tend to neglect the distinctive importance of each attribute word and feed all of them into the recurrent neural network, introducing many irrelevant semantic features. In contrast, at each timestep our model selects the attribute word most relevant to the current context. In other words, the real strength of VSDA lies in its ability not only to leverage semantic features but also to suppress irrelevant attribute words, making the semantic guidance more precise. Furthermore, our approach addresses the limitation that visual attention models offer little help in generating non-visual words. Since visual and semantic features are complementary, our model leverages both to strengthen the generation of visual and non-visual words alike. Extensive experiments are conducted on two widely used datasets, MS COCO and Flickr30k. The results show that VSDA outperforms other methods and achieves promising performance.
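
    To make the mechanism concrete, the sketch below illustrates one decoding step of a double-attention layer: a soft attention over sub-region visual features and a second attention over attribute-word embeddings, both conditioned on the decoder hidden state. It is a minimal PyTorch-style sketch written from the abstract alone; the module names, dimensions, and the soft-attention formulation are illustrative assumptions, not the paper's exact VSDA/SEA equations.

        # Minimal sketch of the double-attention idea described above.
        # All names and dimensions are assumptions for illustration.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class DoubleAttention(nn.Module):
            def __init__(self, feat_dim, attr_dim, hidden_dim):
                super().__init__()
                # Visual attention over sub-region CNN features
                self.v_feat = nn.Linear(feat_dim, hidden_dim)
                self.v_hid = nn.Linear(hidden_dim, hidden_dim)
                self.v_score = nn.Linear(hidden_dim, 1)
                # Semantic attention over attribute-word embeddings
                self.a_feat = nn.Linear(attr_dim, hidden_dim)
                self.a_hid = nn.Linear(hidden_dim, hidden_dim)
                self.a_score = nn.Linear(hidden_dim, 1)

            def forward(self, regions, attrs, h):
                # regions: (B, R, feat_dim) sub-region image features
                # attrs:   (B, A, attr_dim) embeddings of detected attribute words
                # h:       (B, hidden_dim)  current decoder (LSTM) hidden state
                v_e = self.v_score(torch.tanh(self.v_feat(regions) + self.v_hid(h).unsqueeze(1)))
                v_w = F.softmax(v_e.squeeze(-1), dim=1)            # weights over regions
                v_ctx = (v_w.unsqueeze(-1) * regions).sum(dim=1)   # visual context

                a_e = self.a_score(torch.tanh(self.a_feat(attrs) + self.a_hid(h).unsqueeze(1)))
                a_w = F.softmax(a_e.squeeze(-1), dim=1)            # weights over attribute words
                # The weights concentrate on the attribute word most relevant to the
                # current context, letting the decoder down-weight irrelevant
                # attributes instead of mixing them all.
                a_ctx = (a_w.unsqueeze(-1) * attrs).sum(dim=1)     # semantic context

                # Both contexts are returned to the decoder, which can rely on the
                # visual context for visual words and on the semantic context for
                # non-visual words.
                return v_ctx, a_ctx

    In use, `regions` would come from a CNN encoder's spatial feature maps, `attrs` from an attribute detector's word embeddings, and the two contexts would typically be combined with the previous word embedding as the decoder input at each step; those integration details are likewise assumptions here.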




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 15, Issue 1
    February 2019
    265 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3309717

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 January 2019
    Accepted: 01 November 2018
    Revised: 01 September 2018
    Received: 01 April 2018
    Published in TOMM Volume 15, Issue 1


    Author Tags

    1. Visual-semantic double attention
    2. image captioning
    3. semantic attention

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Science and Technology Program of Guangzhou of China
    • Fundamental Research Funds for the Central Universities of China
    • National Natural Science Foundation of China
    • Natural Science Foundation of Guangdong Province

    Article Metrics

    • Downloads (last 12 months): 15
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 27 Jul 2024


    Cited By

    • (2024) NPoSC-A3: A novel part of speech clues-aware adaptive attention mechanism for image captioning. Engineering Applications of Artificial Intelligence 131, 107732. DOI: 10.1016/j.engappai.2023.107732. Online publication date: May 2024.
    • (2024) Attribute guided fusion network for obtaining fine-grained image captions. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-19410-6. Online publication date: 27 May 2024.
    • (2023) Bottom-up and Top-down Object Inference Networks for Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5, 1-18. DOI: 10.1145/3580366. Online publication date: 16 March 2023.
    • (2023) Context-Adaptive-Based Image Captioning by Bi-CARU. IEEE Access 11, 84934-84943. DOI: 10.1109/ACCESS.2023.3302512. Online publication date: 2023.
    • (2023) A Multiheaded Attention-Based Model for Generating Hindi Captions. Proceedings of Third Emerging Trends and Technologies on Intelligent Systems, 677-684. DOI: 10.1007/978-981-99-3963-3_51. Online publication date: 20 September 2023.
    • (2023) Benefit from AMR: Image Captioning with Explicit Relations and Endogenous Knowledge. Web and Big Data, 363-376. DOI: 10.1007/978-981-97-2390-4_25. Online publication date: 6 October 2023.
    • (2022) Hybrid of Deep Learning and Word Embedding in Generating Captions: Image-Captioning Solution for Geological Rock Images. Journal of Imaging 8, 11, 294. DOI: 10.3390/jimaging8110294. Online publication date: 22 October 2022.
    • (2022) Augmentation of Deep Learning Models for Multistep Traffic Speed Prediction. Applied Sciences 12, 19, 9723. DOI: 10.3390/app12199723. Online publication date: 27 September 2022.
    • (2021) Bi-Directional Co-Attention Network for Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 4, 1-20. DOI: 10.1145/3460474. Online publication date: 12 November 2021.
    • (2021) Looking Back and Forward: Enhancing Image Captioning with Global Semantic Guidance. 2021 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN52387.2021.9533418. Online publication date: 2021.
