ImageSpirit: Verbal Guided Image Parsing

Authors: Ming-Ming Cheng, Niloy J. Mitra, and Philip Torr

ACM Transactions on Graphics (TOG), Volume 34, Issue 1

Article No.: 3, Pages 1 - 11

https://doi.org/10.1145/2682628

Published: 29 December 2014

Abstract

Humans describe images in terms of nouns and adjectives, while algorithms operate on images represented as sets of pixels. Bridging the gap between how humans would like to access images and how images are typically represented is the goal of image parsing, which involves assigning object and attribute labels to pixels. In this article, we propose treating nouns as object labels and adjectives as visual attribute labels. This allows us to formulate the image parsing problem as one of jointly estimating per-pixel object and attribute labels from a set of training images. We propose an efficient (interactive-time) solution. Using the extracted labels as handles, our system empowers a user to verbally refine the results. This enables hands-free parsing of an image into pixel-wise object/attribute labels that correspond to human semantics. Verbally selecting objects of interest enables a novel and natural interaction modality that could be used to interact with new-generation devices (e.g., smartphones, Google Glass, living room devices). We demonstrate our system on a large number of real-world images of varying complexity. To help understand the trade-offs compared to traditional mouse-based interactions, we report results for both a large-scale quantitative evaluation and a user study.
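
To make the idea of labels as verbal handles concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a parse has already produced one object label and a set of attribute labels for every pixel, and it shows how the nouns and adjectives in a spoken phrase could then select matching pixels. The vocabularies, the function names `parse_command` and `select_pixels`, and the toy label maps are all hypothetical; the joint estimation step described in the abstract is not modeled here.

```python
# Hypothetical vocabularies; the article's actual label sets are not listed here.
OBJECT_NOUNS = {"bed", "wall", "floor", "chair"}
ATTRIBUTE_ADJECTIVES = {"wooden", "white", "cotton", "painted"}


def parse_command(command):
    """Split a phrase into (object nouns, attribute adjectives) it mentions."""
    words = command.lower().split()
    nouns = {w for w in words if w in OBJECT_NOUNS}
    adjectives = {w for w in words if w in ATTRIBUTE_ADJECTIVES}
    return nouns, adjectives


def select_pixels(object_map, attribute_map, command):
    """Return a binary mask of pixels matching the phrase.

    Assumed layout: object_map[y][x] is a single object label, and
    attribute_map[y][x] is a set of attribute labels for that pixel.
    A pixel matches if its object is one of the named nouns (or no noun
    was named) and it carries every named adjective.
    """
    nouns, adjectives = parse_command(command)
    mask = []
    for obj_row, attr_row in zip(object_map, attribute_map):
        row = []
        for obj, attrs in zip(obj_row, attr_row):
            noun_ok = not nouns or obj in nouns
            adj_ok = adjectives <= attrs
            row.append(1 if noun_ok and adj_ok else 0)
        mask.append(row)
    return mask


# Toy 2x3 "image" with made-up labels.
object_map = [["wall", "bed", "bed"],
              ["floor", "bed", "chair"]]
attribute_map = [[{"white", "painted"}, {"wooden"}, {"cotton", "white"}],
                 [{"wooden"}, {"wooden"}, {"wooden"}]]

mask = select_pixels(object_map, attribute_map, "select the wooden bed")
print(mask)
```

In this sketch a noun narrows the selection by object class and each adjective narrows it further by attribute, so a phrase with only an adjective selects every pixel carrying that attribute regardless of object.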

Supplementary Material

MP4 File (a3.mp4), 15.87 MB

Index Terms

ImageSpirit: Verbal Guided Image Parsing
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object recognition
  2. Computer graphics
    1. Graphics systems and interfaces
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction devices
      1. Graphics input devices

Published In

ACM Transactions on Graphics Volume 34, Issue 1

November 2014

153 pages

ISSN:0730-0301

EISSN:1557-7368

DOI:10.1145/2702692

Editor:
Julie Dorsey
Yale University

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 December 2014

Accepted: 01 May 2014

Revised: 01 May 2014

Received: 01 December 2013

Published in TOG Volume 34, Issue 1

Qualifiers

Research-article
Research
Refereed
