DOI: 10.1145/3126686.3126714

Image Caption with Synchronous Cross-Attention

Published: 23 October 2017

Abstract

Image captioning aims to translate images into descriptive sentences, drawing on both visual and textual resources. Deep Neural Network (DNN) based models are widely applied to this task because of their impressive performance in computer vision and natural language processing. In particular, attention mechanisms allow models to focus on the essential parts of an image. However, previous models ignore both the correlation between attention at different time steps and the supervisory role of words in attention selection. This paper proposes an Image Caption model with Synchronous Cross-Attention (IC-SCA), which captures a visual sequence of attention informed by word information. Our IC-SCA model has two stages, visual and textual, which jointly model the multimodal information to generate descriptions. The model is evaluated on one of the largest image captioning datasets, MS-COCO. Experimental results on the BLEU-1 through BLEU-4, METEOR, and CIDEr metrics demonstrate that our IC-SCA model outperforms the benchmarks. Attention visualizations further verify the effectiveness of the proposed mechanism.
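The core idea in the abstract, visual attention computed synchronously with word information, can be made concrete with a small sketch. The following PyTorch module is a minimal illustration under assumed names and dimensions, not the authors' exact IC-SCA architecture: it scores each image region using the decoder's hidden state together with the embedding of the previously emitted word, so that word information helps supervise attention selection.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WordConditionedAttention(nn.Module):
        # Illustrative sketch only: layer names and sizes (2048-d region
        # features, 512-d hidden state and word embedding) are assumptions.
        def __init__(self, feat_dim=2048, hid_dim=512, emb_dim=512, att_dim=512):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, att_dim)  # project region features
            self.hid_proj = nn.Linear(hid_dim, att_dim)    # project decoder state
            self.emb_proj = nn.Linear(emb_dim, att_dim)    # project word embedding
            self.score = nn.Linear(att_dim, 1)             # scalar score per region

        def forward(self, feats, hidden, word_emb):
            # feats: (B, R, feat_dim) region features; hidden: (B, hid_dim);
            # word_emb: (B, emb_dim) embedding of the previously emitted word.
            e = torch.tanh(self.feat_proj(feats)
                           + self.hid_proj(hidden).unsqueeze(1)
                           + self.emb_proj(word_emb).unsqueeze(1))
            alpha = F.softmax(self.score(e).squeeze(-1), dim=1)  # (B, R) weights
            context = (alpha.unsqueeze(-1) * feats).sum(dim=1)   # weighted region sum
            return context, alpha

    # Example: 49 regions from a 7x7 CNN feature map, batch of 2.
    att = WordConditionedAttention()
    context, alpha = att(torch.randn(2, 49, 2048),
                         torch.randn(2, 512),
                         torch.randn(2, 512))

Because the word embedding enters the attention score, the map at each step depends on what was just emitted, which is the distinction the abstract draws against purely visual attention.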





    Published In

    Thematic Workshops '17: Proceedings of the on Thematic Workshops of ACM Multimedia 2017
    October 2017
    558 pages
    ISBN:9781450354165
    DOI:10.1145/3126686

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 October 2017


    Author Tags

    1. convolutional neural network
    2. deep learning
    3. image caption
    4. long short-term memory
    5. multimodal learning

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Fund
    • China 111 Project

    Conference

MM '17: ACM Multimedia Conference
October 23-27, 2017
Mountain View, California, USA


    Article Metrics

• Downloads (last 12 months): 4
• Downloads (last 6 weeks): 0
    Reflects downloads up to 23 Dec 2024


    Cited By

• (2024) A Novel Energy Saving Algorithm for Network Deep Learning Tasks. 2024 Sixth International Conference on Next Generation Data-driven Networks (NGDN), 339-343. DOI: 10.1109/NGDN61651.2024.10744084. Online publication date: 26-Apr-2024.
• (2023) Multi-Granularity Cross-Attention Network for Visual Question Answering. 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2098-2103. DOI: 10.1109/TrustCom60117.2023.00291. Online publication date: 1-Nov-2023.
• (2022) A reference-based model using deep learning for image captioning. Multimedia Systems 29(3), 1665-1681. DOI: 10.1007/s00530-022-00937-3. Online publication date: 9-May-2022.
• (2021) Attention-guided Image Captioning with Adaptive Global and Local Feature Fusion. Journal of Visual Communication and Image Representation, 103138. DOI: 10.1016/j.jvcir.2021.103138. Online publication date: Jun-2021.
• (2020) Reference-based model using multimodal gated recurrent units for image captioning. Multimedia Tools and Applications. DOI: 10.1007/s11042-020-09539-5. Online publication date: 15-Aug-2020.
• (2019) Training Efficient Saliency Prediction Models with Knowledge Distillation. Proceedings of the 27th ACM International Conference on Multimedia, 512-520. DOI: 10.1145/3343031.3351089. Online publication date: 15-Oct-2019.
