Midge: generating image descriptions from computer vision detections

Authors:

Margaret Mitchell,

Kota Yamaguchi,

Hal Daumé III

EACL '12: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

Pages 747 - 756

Published: 23 April 2012

Abstract

This paper introduces a novel generation system that composes humanlike descriptions of images from computer vision detections. By leveraging syntactically informed word co-occurrence statistics, the generator filters and constrains the noisy detections output from a vision system to generate syntactic trees that detail what the computer vision system sees. Results show that the generation system outperforms state-of-the-art systems, automatically generating some of the most natural image descriptions to date.

Published In

EACL '12: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

April 2012

884 pages

ISBN: 9781937284190

General Chair:
Walter Daelemans
University of Antwerp, Belgium

Publisher

Association for Computational Linguistics

United States

Qualifiers

Research-article

Acceptance Rates

Overall Acceptance Rate 100 of 360 submissions, 28%
