Deep Visual-Semantic Alignments for Generating Image Descriptions

Published: 01 April 2017

Abstract

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks (RNNs) over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K, and MSCOCO datasets. We then show that the generated descriptions outperform retrieval baselines on both full images and on a new dataset of region-level annotations. Finally, we conduct a large-scale analysis of our RNN language model on the Visual Genome dataset of 4.1 million captions and highlight the differences between image- and region-level caption statistics.
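The alignment model described above scores an image-sentence pair by embedding CNN region features and bidirectional-RNN word vectors in a shared multimodal space and crediting each word with its best-matching region. The sketch below is a hedged illustration of that idea under stated assumptions, not the authors' released code: the names (SentenceEncoder, alignment_score, ranking_loss), the dimensions, and the use of a plain RNN with a fixed margin are choices made for the example, and the CNN region features are taken as precomputed inputs.

# Minimal sketch of a region-word alignment score and margin ranking loss.
# Assumes precomputed image region features already projected to the joint space.
import torch
import torch.nn as nn


class SentenceEncoder(nn.Module):
    """Bidirectional RNN that maps a sentence's word indices into the joint embedding space."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, joint_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.brnn = nn.RNN(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, joint_dim)

    def forward(self, words):                    # words: LongTensor of shape (T,)
        h, _ = self.brnn(self.embed(words).unsqueeze(0))   # (1, T, 2*hidden_dim)
        return self.proj(h.squeeze(0))           # (T, joint_dim): one vector per word


def alignment_score(regions, words):
    """Image-sentence score: each word is credited with its best-matching region."""
    sims = words @ regions.t()                   # (T, R) word-region inner products
    return sims.max(dim=1).values.sum()          # sum over words of the best similarity


def ranking_loss(region_feats, word_feats, margin=1.0):
    """Structured margin loss: corresponding pairs should outscore mismatched pairs."""
    n = len(region_feats)                        # n images paired with n sentences
    scores = torch.stack([
        torch.stack([alignment_score(region_feats[k], word_feats[l]) for l in range(n)])
        for k in range(n)
    ])                                           # scores[k, l] = image k vs. sentence l
    diag = scores.diag().unsqueeze(1)            # scores of the true pairs
    cost_sent = torch.clamp(margin + scores - diag, min=0)     # wrong sentences for an image
    cost_img = torch.clamp(margin + scores - diag.t(), min=0)  # wrong images for a sentence
    off_diag = 1.0 - torch.eye(n)
    return ((cost_sent + cost_img) * off_diag).sum()

In use, one would pass, for each image in a minibatch, a (num_regions, joint_dim) tensor of projected region features together with the SentenceEncoder output for its caption, and minimize ranking_loss so that true image-sentence pairs outscore mismatched ones in both retrieval directions.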

Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 39, Issue 4, April 2017, 208 pages

Publisher

IEEE Computer Society, United States

Cited By

• (2025) "Exploring Vision-Language Foundation Model for Novel Object Captioning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 91–102, DOI: 10.1109/TCSVT.2024.3452437. Online publication date: 1 Jan. 2025.
• (2025) "Image segmentation review," Information Fusion, vol. 114, DOI: 10.1016/j.inffus.2024.102608. Online publication date: 1 Feb. 2025.
• (2025) "Multi-granularity semantic relational mapping for image caption," Expert Systems with Applications: An International Journal, vol. 264, DOI: 10.1016/j.eswa.2024.125847. Online publication date: 10 Mar. 2025.
• (2024) "NExT-GPT," Proceedings of the 41st International Conference on Machine Learning, pp. 53366–53397, DOI: 10.5555/3692070.3694257. Online publication date: 21 Jul. 2024.
• (2024) "Structure-CLIP," Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, pp. 2417–2425, DOI: 10.1609/aaai.v38i3.28017. Online publication date: 20 Feb. 2024.
• (2024) "Noise-aware image captioning with progressively exploring mismatched words," Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, pp. 12091–12099, DOI: 10.1609/aaai.v38i11.29097. Online publication date: 20 Feb. 2024.
• (2024) "Breaking Through the Noisy Correspondence: A Robust Model for Image-Text Matching," ACM Transactions on Information Systems, vol. 42, no. 6, pp. 1–26, DOI: 10.1145/3662732. Online publication date: 19 Aug. 2024.
• (2024) "SMART: Syntax-Calibrated Multi-Aspect Relation Transformer for Change Captioning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 7, pp. 4926–4943, DOI: 10.1109/TPAMI.2024.3365104. Online publication date: 13 Feb. 2024.
• (2024) "Multimodal Composition Example Mining for Composed Query Image Retrieval," IEEE Transactions on Image Processing, vol. 33, pp. 1149–1161, DOI: 10.1109/TIP.2024.3359062. Online publication date: 1 Jan. 2024.
• (2024) "EgoCap and EgoFormer," Pattern Recognition Letters, vol. 181, pp. 50–56, DOI: 10.1016/j.patrec.2024.03.012. Online publication date: 1 May 2024.
