
Image Captioning with Memorized Knowledge

Abstract

Image captioning, which aims to automatically generate text descriptions of given images, has received much attention from researchers. Most existing approaches adopt a recurrent neural network (RNN) as a decoder to generate captions conditioned on the input image information. However, traditional RNNs process the sequence recurrently, squeezing the information of all previous words into hidden cells and updating the context by fusing the hidden states with the current word information, which may lose rich knowledge about words generated far in the past. In this paper, we propose a memory-enhanced captioning model for image captioning. We first introduce an external memory to store the past knowledge, i.e., the information of all generated words. When predicting the next word, the decoder retrieves this past knowledge through a selective reading mechanism. Furthermore, to better exploit the knowledge stored in the memory, we introduce several variants that consider different types of past knowledge. To verify the effectiveness of the proposed model, we conduct extensive experiments and comparisons on the well-known MS COCO image captioning dataset. Compared with state-of-the-art captioning models, the proposed memory-enhanced captioning model achieves a significant improvement in performance (a 3.5% gain in CIDEr), demonstrating its effectiveness and superiority over existing methods.
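To make the mechanism described in the abstract concrete, the sketch below shows one plausible decoding step in which an LSTM decoder is paired with an external memory of previously generated words and an attention-style selective read. It is a minimal illustration only: the layer names, dimensions, the choice to store word embeddings, and the dot-product read are assumptions made for exposition and do not reproduce the paper's exact architecture or its variants.

```python
# Minimal sketch (not the authors' exact model) of a memory-enhanced decoding step:
# an LSTM cell whose input is augmented with knowledge selectively read from an
# external memory holding the embeddings of all previously generated words.
import torch
import torch.nn as nn


class MemoryEnhancedDecoderStep(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512, image_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # word embeddings
        self.img_proj = nn.Linear(image_dim, hidden_dim)    # project the image feature
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim + embed_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, embed_dim)        # hidden state -> memory query
        self.out = nn.Linear(hidden_dim, vocab_size)         # next-word scores

    def read_memory(self, h, memory):
        # Selective reading: attend over the stored embeddings of past words and
        # return their weighted sum as the retrieved "past knowledge".
        if memory is None:  # nothing has been generated yet
            return torch.zeros(h.size(0), self.embed.embedding_dim, device=h.device)
        q = self.query(h)                                     # (batch, embed_dim)
        scores = torch.bmm(memory, q.unsqueeze(2)).squeeze(2) # (batch, mem_len)
        alpha = torch.softmax(scores, dim=1)                  # read weights
        return torch.bmm(alpha.unsqueeze(1), memory).squeeze(1)

    def forward(self, prev_word, image_feat, h, c, memory):
        w = self.embed(prev_word)                             # previous word embedding
        v = self.img_proj(image_feat)                         # image context
        r = self.read_memory(h, memory)                       # knowledge read from memory
        h, c = self.lstm(torch.cat([w, v, r], dim=1), (h, c)) # fuse and update the state
        logits = self.out(h)                                  # predict the next word
        # Write the current word's embedding into the memory for later steps.
        entry = w.unsqueeze(1)                                # (batch, 1, embed_dim)
        memory = entry if memory is None else torch.cat([memory, entry], dim=1)
        return logits, h, c, memory
```

In use, the caller would invoke this step once per time step, feeding back the predicted word and the returned memory so that the store of past knowledge grows as the caption is generated.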

Author information

Corresponding author

Correspondence to Guiguang Ding.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the National Key R&D Program of China (No. 2018YFC0806900) and the National Natural Science Foundation of China (No. 61571269).

About this article

Cite this article

Chen, H., Ding, G., Lin, Z. et al. Image Captioning with Memorized Knowledge. Cogn Comput 13, 807–820 (2021). https://doi.org/10.1007/s12559-019-09656-w
