DOI: 10.5555/2380816.2380907
research-article

Midge: generating image descriptions from computer vision detections

Published: 23 April 2012

Abstract

This paper introduces a novel generation system that composes humanlike descriptions of images from computer vision detections. By leveraging syntactically informed word co-occurrence statistics, the generator filters and constrains the noisy detections output from a vision system to generate syntactic trees that detail what the computer vision system sees. Results show that the generation system outperforms state-of-the-art systems, automatically generating some of the most natural image descriptions to date.

      Published In

      EACL '12: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics
      April 2012
      884 pages
      ISBN: 9781937284190

      Publisher

      Association for Computational Linguistics

      United States

      Qualifiers

      • Research-article

      Acceptance Rates

      Overall Acceptance Rate 100 of 360 submissions, 28%

      Cited By

      • (2024) Improving Web Accessibility through Artificial Intelligence: A Focus on Image Description Generation: Améliorer l'Accessibilité des Sites Web grâce à l'Intelligence Artificielle : Focus sur la Génération de Descriptions d'Images. Proceedings of the 35th International Francophone Conference on Human-Computer Interaction, pp. 1-13. DOI: 10.1145/3650104.3652908. Online publication date: 25-Mar-2024.
      • (2022) Improving Image Captioning via Enhancing Dual-Side Context Awareness. Proceedings of the 2022 International Conference on Multimedia Retrieval, pp. 389-397. DOI: 10.1145/3512527.3531379. Online publication date: 27-Jun-2022.
      • (2021) Trends in Integration of Vision and Language Research. Journal of Artificial Intelligence Research, 71, pp. 1183-1317. DOI: 10.1613/jair.1.11688. Online publication date: 10-Sep-2021.
      • (2021) Distributed Attention for Grounded Image Captioning. Proceedings of the 29th ACM International Conference on Multimedia, pp. 1966-1975. DOI: 10.1145/3474085.3475354. Online publication date: 17-Oct-2021.
      • (2021) Chinese Image Captioning via Fuzzy Attention-based DenseNet-BiLSTM. ACM Transactions on Multimedia Computing, Communications, and Applications, 17:1s, pp. 1-18. DOI: 10.1145/3422668. Online publication date: 31-Mar-2021.
      • (2020) "I Hope This Is Helpful". Proceedings of the ACM on Human-Computer Interaction, 4:CSCW2, pp. 1-26. DOI: 10.1145/3415176. Online publication date: 15-Oct-2020.
      • (2019) Adaptively aligned image captioning via adaptive attention time. Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 8942-8951. DOI: 10.5555/3454287.3455089. Online publication date: 8-Dec-2019.
      • (2019) A Novel Pedestrian Positioning System Using Monocular in-Vehicle Cameras. Proceedings of the 2019 International Conference on Artificial Intelligence and Computer Science, pp. 548-554. DOI: 10.1145/3349341.3349466. Online publication date: 12-Jul-2019.
      • (2019) The effect of explanations and algorithmic accuracy on visual recommender systems of artistic images. Proceedings of the 24th International Conference on Intelligent User Interfaces, pp. 408-416. DOI: 10.1145/3301275.3302274. Online publication date: 17-Mar-2019.
      • (2019) A Comprehensive Survey of Deep Learning for Image Captioning. ACM Computing Surveys, 51:6, pp. 1-36. DOI: 10.1145/3295748. Online publication date: 4-Feb-2019.
