CapVis: Toward Better Understanding of Visual-Verbal Saliency Consistency

Published: 28 November 2018
Abstract

When looking at an image, humans shift their attention toward interesting regions, making sequences of eye fixations. When describing an image, they also come up with simple sentences that highlight the key elements in the scene. What is the correlation between where people look and what they describe in an image? To investigate this problem intuitively, we develop a visual analytics system, CapVis, to look into visual attention and image captioning, two types of subjective annotations that are relatively task-free and natural. Using these annotations, we propose a word-weighting scheme to extract visual and verbal saliency ranks that can be compared against each other. In our approach, a number of low-level and semantic-level features relevant to visual-verbal saliency consistency are proposed and visualized for a better understanding of image content. Our method also shows the different ways that a human and a computational model look at and describe images, which provides reliable information for a captioning model. Experiments also show that the visualized features can be integrated into a computational model to effectively predict the consistency between the two modalities on an image dataset with both types of annotations.
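
To make the comparison concrete, the following minimal sketch illustrates how a visual saliency ranking derived from eye fixations and a verbal saliency ranking derived from caption mentions could be compared with a rank correlation. It is not the article's actual word-weighting scheme; the object labels, fixation shares, and mention counts are hypothetical.

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical per-object annotations for a single image.
    # fixation_share: fraction of eye fixations landing on each object (visual saliency).
    # mention_count: times each object is mentioned across the image's captions (verbal saliency).
    objects = ["dog", "frisbee", "grass", "fence"]
    fixation_share = np.array([0.55, 0.30, 0.10, 0.05])
    mention_count = np.array([5, 4, 1, 0])

    # Spearman's rho measures how consistently the two modalities rank the same objects:
    # values near 1 mean people look at and describe the same things.
    rho, p_value = spearmanr(fixation_share, mention_count)
    print(f"Visual-verbal saliency consistency (Spearman rho): {rho:.2f}, p = {p_value:.3f}")

The article derives verbal saliency from a word-weighting scheme over captions rather than raw mention counts; the counts above stand in only for illustration.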


Cited By

• (2023) A survey on automatic generation of medical imaging reports based on deep learning. BioMedical Engineering OnLine 22:1. DOI: 10.1186/s12938-023-01113-y. Online publication date: 18 May 2023.



      Published In

ACM Transactions on Intelligent Systems and Technology, Volume 10, Issue 1
Special Issue on Visual Analytics
January 2019, 235 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3295616

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 28 November 2018
      Accepted: 01 March 2018
      Revised: 01 March 2018
      Received: 01 August 2017
      Published in TIST Volume 10, Issue 1


      Author Tags

      1. Image captioning
      2. visual analytics
      3. visual saliency

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • National Science Foundation of China

