
ImageSpirit: Verbal Guided Image Parsing

Published: 29 December 2014
  • Abstract

    Humans describe images in terms of nouns and adjectives, while algorithms operate on images represented as sets of pixels. Bridging this gap between how humans would like to access images and how images are typically represented is the goal of image parsing, which involves assigning object and attribute labels to pixels. In this article we propose treating nouns as object labels and adjectives as visual attribute labels. This allows us to formulate the image parsing problem as one of jointly estimating per-pixel object and attribute labels from a set of training images. We propose an efficient (interactive-time) solution. Using the extracted labels as handles, our system empowers a user to verbally refine the results. This enables hands-free parsing of an image into pixel-wise object/attribute labels that correspond to human semantics. Verbally selecting objects of interest enables a novel and natural interaction modality that could be used to interact with new-generation devices (e.g., smartphones, Google Glass, living-room devices). We demonstrate our system on a large number of real-world images of varying complexity. To help understand the trade-offs compared to traditional mouse-based interactions, we report results for both a large-scale quantitative evaluation and a user study.
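    As a rough illustration of what "using the extracted labels as handles" could look like, the sketch below stores one object label (a noun) and a set of attribute labels (adjectives) per pixel, then selects pixels by a noun/adjective query. The label names, the tiny 2x3 image, and the `select` helper are all hypothetical and chosen for illustration; this is not the article's inference method, only a toy view of the output representation the abstract describes.

```python
# Toy illustration only: label names and the select() helper are hypothetical,
# not the article's implementation.

# Per-pixel object labels (nouns) for a 2x3 image.
object_labels = [
    ["wall", "wall", "cabinet"],
    ["floor", "cabinet", "cabinet"],
]

# Per-pixel attribute labels (adjectives); a pixel may carry several.
attribute_labels = [
    [{"painted"}, {"painted"}, {"wooden"}],
    [{"wooden"}, {"wooden", "glossy"}, {"wooden"}],
]


def select(noun, adjectives=()):
    """Return (row, col) of pixels whose object label is `noun` and whose
    attribute set contains every adjective in `adjectives`."""
    wanted = set(adjectives)
    return [
        (r, c)
        for r, row in enumerate(object_labels)
        for c, label in enumerate(row)
        if label == noun and wanted <= attribute_labels[r][c]
    ]


# A spoken phrase such as "the wooden cabinet" would map to this query.
wooden_cabinet = select("cabinet", ["wooden"])
glossy_cabinet = select("cabinet", ["glossy"])
```

    Keeping objects and attributes as separate per-pixel labels is what lets a single query combine a noun with any number of adjectives.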

    Supplementary Material

    MP4 File (a3.mp4)








    Published In

    ACM Transactions on Graphics  Volume 34, Issue 1
    November 2014
    153 pages
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 December 2014
    Accepted: 01 May 2014
    Revised: 01 May 2014
    Received: 01 December 2013
    Published in TOG Volume 34, Issue 1



    Author Tags

    1. Image parsing
    2. multilabel CRF
    3. natural language control
    4. object class segmentation
    5. speech interface
    6. visual attributes


    • Research-article
    • Research
    • Refereed


    Cited By

    • (2023) Language-based Photo Color Adjustment for Graphic Designs. ACM Transactions on Graphics 42:4 (1-16). doi:10.1145/3592111. Online publication date: 26-Jul-2023
    • (2023) Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (4380-4389). doi:10.1109/WACV56688.2023.00437. Online publication date: Jan-2023
    • (2023) Image Synthesis from Themes Captured in Poems using Latent Diffusion Models. 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC) (655-660). doi:10.1109/ICAAIC56838.2023.10141274. Online publication date: 4-May-2023
    • (2023) Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (10681-10692). doi:10.1109/CVPR52729.2023.01029. Online publication date: Jun-2023
    • (2022) Entity-enhanced Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence (1-16). doi:10.1109/TPAMI.2022.3186410. Online publication date: 2022
    • (2022) M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (10503-10512). doi:10.1109/CVPR52688.2022.01026. Online publication date: Jun-2022
    • (2022) Cartoon Image Processing: A Survey. International Journal of Computer Vision 130:11 (2733-2769). doi:10.1007/s11263-022-01645-1. Online publication date: 1-Nov-2022
    • (2022) Language-Driven Artistic Style Transfer. Computer Vision – ECCV 2022 (717-734). doi:10.1007/978-3-031-20059-5_41. Online publication date: 29-Oct-2022
    • (2021) RubySlippers: Supporting Content-based Voice Navigation for How-to Videos. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (1-14). doi:10.1145/3411764.3445131. Online publication date: 6-May-2021
    • (2021) Semantic Hierarchy Emerges in Deep Generative Representations for Scene Synthesis. International Journal of Computer Vision. doi:10.1007/s11263-020-01429-5. Online publication date: 10-Feb-2021
