Deep Visual-Semantic Alignments for Generating Image Descriptions

Published: 01 April 2017

Abstract

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks (RNNs) over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K, and MSCOCO datasets. We then show that the generated descriptions outperform retrieval baselines on both full images and on a new dataset of region-level annotations. Finally, we conduct a large-scale analysis of our RNN language model on the Visual Genome dataset of 4.1 million captions and highlight the differences between image- and region-level caption statistics.
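The alignment model described above scores an image-sentence pair by embedding CNN region features and bidirectional-RNN word vectors in a shared multimodal space and crediting each word with its best-matching region. The sketch below is a hedged illustration of that idea under stated assumptions, not the authors' released code: the names (SentenceEncoder, alignment_score, ranking_loss), the dimensions, and the use of a plain RNN with a fixed margin are choices made for the example, and the CNN region features are taken as precomputed inputs.

# Minimal sketch of a region-word alignment score and margin ranking loss.
# Assumes precomputed image region features already projected to the joint space.
import torch
import torch.nn as nn


class SentenceEncoder(nn.Module):
    """Bidirectional RNN that maps a sentence's word indices into the joint embedding space."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, joint_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.brnn = nn.RNN(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, joint_dim)

    def forward(self, words):                    # words: LongTensor of shape (T,)
        h, _ = self.brnn(self.embed(words).unsqueeze(0))   # (1, T, 2*hidden_dim)
        return self.proj(h.squeeze(0))           # (T, joint_dim): one vector per word


def alignment_score(regions, words):
    """Image-sentence score: each word is credited with its best-matching region."""
    sims = words @ regions.t()                   # (T, R) word-region inner products
    return sims.max(dim=1).values.sum()          # sum over words of the best similarity


def ranking_loss(region_feats, word_feats, margin=1.0):
    """Structured margin loss: corresponding pairs should outscore mismatched pairs."""
    n = len(region_feats)                        # n images paired with n sentences
    scores = torch.stack([
        torch.stack([alignment_score(region_feats[k], word_feats[l]) for l in range(n)])
        for k in range(n)
    ])                                           # scores[k, l] = image k vs. sentence l
    diag = scores.diag().unsqueeze(1)            # scores of the true pairs
    cost_sent = torch.clamp(margin + scores - diag, min=0)     # wrong sentences for an image
    cost_img = torch.clamp(margin + scores - diag.t(), min=0)  # wrong images for a sentence
    off_diag = 1.0 - torch.eye(n)
    return ((cost_sent + cost_img) * off_diag).sum()

In use, one would pass, for each image in a minibatch, a (num_regions, joint_dim) tensor of projected region features together with the SentenceEncoder output for its caption, and minimize ranking_loss so that true image-sentence pairs outscore mismatched ones in both retrieval directions.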

Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Issue 4, April 2017, 208 pages

Publisher

IEEE Computer Society, United States

Cited By

• (2025) "Exploring Vision-Language Foundation Model for Novel Object Captioning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 91–102, DOI: 10.1109/TCSVT.2024.3452437. Online publication date: 1 Jan. 2025.
• (2025) "Image segmentation review," Information Fusion, vol. 114, DOI: 10.1016/j.inffus.2024.102608. Online publication date: 1 Feb. 2025.
• (2025) "Multi-granularity semantic relational mapping for image caption," Expert Systems with Applications: An International Journal, vol. 264, DOI: 10.1016/j.eswa.2024.125847. Online publication date: 10 Mar. 2025.
• (2024) "NExT-GPT," Proceedings of the 41st International Conference on Machine Learning, pp. 53366–53397, DOI: 10.5555/3692070.3694257. Online publication date: 21 Jul. 2024.
• (2024) "Structure-CLIP," Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, pp. 2417–2425, DOI: 10.1609/aaai.v38i3.28017. Online publication date: 20 Feb. 2024.
• (2024) "Noise-aware image captioning with progressively exploring mismatched words," Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, pp. 12091–12099, DOI: 10.1609/aaai.v38i11.29097. Online publication date: 20 Feb. 2024.
• (2024) "Breaking Through the Noisy Correspondence: A Robust Model for Image-Text Matching," ACM Transactions on Information Systems, vol. 42, no. 6, pp. 1–26, DOI: 10.1145/3662732. Online publication date: 19 Aug. 2024.
• (2024) "SMART: Syntax-Calibrated Multi-Aspect Relation Transformer for Change Captioning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 7, pp. 4926–4943, DOI: 10.1109/TPAMI.2024.3365104. Online publication date: 13 Feb. 2024.
• (2024) "Multimodal Composition Example Mining for Composed Query Image Retrieval," IEEE Transactions on Image Processing, vol. 33, pp. 1149–1161, DOI: 10.1109/TIP.2024.3359062. Online publication date: 1 Jan. 2024.
• (2024) "EgoCap and EgoFormer," Pattern Recognition Letters, vol. 181, pp. 50–56, DOI: 10.1016/j.patrec.2024.03.012. Online publication date: 1 May 2024.
