DOI: 10.1145/3126686.3126714

Image Caption with Synchronous Cross-Attention

Published: 23 October 2017

Abstract

Image captioning aims to translate images into descriptive sentences, drawing on both visual and textual resources. Deep Neural Network (DNN) based models are widely applied to this task because of their impressive performance in computer vision and natural language processing. In particular, attention mechanisms allow models to focus on the essential parts of an image. However, previous models ignore both the correlation between attention at different time steps and the supervisory role of words in attention selection. This paper proposes an Image Caption model with Synchronous Cross-Attention (IC-SCA), which captures a visual sequence of attention informed by word information. Our IC-SCA model has two stages, visual and textual, which jointly model the multimodal information to generate descriptions. The model is evaluated on one of the largest image captioning datasets, MS-COCO. Experimental results on the BLEU-1 through BLEU-4, METEOR, and CIDEr metrics demonstrate that our IC-SCA model outperforms the benchmarks. Attention visualizations further verify the effectiveness of the proposed mechanism.
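The core idea in the abstract, visual attention computed synchronously with word information, can be made concrete with a small sketch. The following PyTorch module is a minimal illustration under assumed names and dimensions, not the authors' exact IC-SCA architecture: it scores each image region using the decoder's hidden state together with the embedding of the previously emitted word, so that word information helps supervise attention selection.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WordConditionedAttention(nn.Module):
        # Illustrative sketch only: layer names and sizes (2048-d region
        # features, 512-d hidden state and word embedding) are assumptions.
        def __init__(self, feat_dim=2048, hid_dim=512, emb_dim=512, att_dim=512):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, att_dim)  # project region features
            self.hid_proj = nn.Linear(hid_dim, att_dim)    # project decoder state
            self.emb_proj = nn.Linear(emb_dim, att_dim)    # project word embedding
            self.score = nn.Linear(att_dim, 1)             # scalar score per region

        def forward(self, feats, hidden, word_emb):
            # feats: (B, R, feat_dim) region features; hidden: (B, hid_dim);
            # word_emb: (B, emb_dim) embedding of the previously emitted word.
            e = torch.tanh(self.feat_proj(feats)
                           + self.hid_proj(hidden).unsqueeze(1)
                           + self.emb_proj(word_emb).unsqueeze(1))
            alpha = F.softmax(self.score(e).squeeze(-1), dim=1)  # (B, R) weights
            context = (alpha.unsqueeze(-1) * feats).sum(dim=1)   # weighted region sum
            return context, alpha

    # Example: 49 regions from a 7x7 CNN feature map, batch of 2.
    att = WordConditionedAttention()
    context, alpha = att(torch.randn(2, 49, 2048),
                         torch.randn(2, 512),
                         torch.randn(2, 512))

Because the word embedding enters the attention score, the map at each step depends on what was just emitted, which is the distinction the abstract draws against purely visual attention.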





    Published In

    Thematic Workshops '17: Proceedings of the on Thematic Workshops of ACM Multimedia 2017
    October 2017
    558 pages
    ISBN:9781450354165
    DOI:10.1145/3126686

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 October 2017


    Author Tags

    1. convolutional neural network
    2. deep learning
    3. image caption
    4. long short-term memory
    5. multimodal learning

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Fund
    • China 111 Project

    Conference

MM '17: ACM Multimedia Conference
October 23-27, 2017
Mountain View, California, USA


    Article Metrics

• Downloads (last 12 months): 4
• Downloads (last 6 weeks): 0
    Reflects downloads up to 23 Dec 2024


    Cited By

• (2024) A Novel Energy Saving Algorithm for Network Deep Learning Tasks. 2024 Sixth International Conference on Next Generation Data-driven Networks (NGDN), 339-343. DOI: 10.1109/NGDN61651.2024.10744084. Online publication date: 26-Apr-2024.
• (2023) Multi-Granularity Cross-Attention Network for Visual Question Answering. 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2098-2103. DOI: 10.1109/TrustCom60117.2023.00291. Online publication date: 1-Nov-2023.
• (2022) A reference-based model using deep learning for image captioning. Multimedia Systems 29(3), 1665-1681. DOI: 10.1007/s00530-022-00937-3. Online publication date: 9-May-2022.
• (2021) Attention-guided Image Captioning with Adaptive Global and Local Feature Fusion. Journal of Visual Communication and Image Representation, 103138. DOI: 10.1016/j.jvcir.2021.103138. Online publication date: Jun-2021.
• (2020) Reference-based model using multimodal gated recurrent units for image captioning. Multimedia Tools and Applications. DOI: 10.1007/s11042-020-09539-5. Online publication date: 15-Aug-2020.
• (2019) Training Efficient Saliency Prediction Models with Knowledge Distillation. Proceedings of the 27th ACM International Conference on Multimedia, 512-520. DOI: 10.1145/3343031.3351089. Online publication date: 15-Oct-2019.
