CapVis: Toward Better Understanding of Visual-Verbal Saliency Consistency

Published: 28 November 2018
Abstract

When looking at an image, humans shift their attention toward interesting regions, making sequences of eye fixations. When describing an image, they also come up with simple sentences that highlight the key elements in the scene. What is the correlation between where people look and what they describe in an image? To investigate this problem intuitively, we develop a visual analytics system, CapVis, to look into visual attention and image captioning, two types of subjective annotations that are relatively task-free and natural. Using these annotations, we propose a word-weighting scheme to extract visual and verbal saliency ranks that can be compared against each other. In our approach, a number of low-level and semantic-level features relevant to visual-verbal saliency consistency are proposed and visualized for a better understanding of image content. Our method also shows the different ways that a human and a computational model look at and describe images, which provides reliable information for a captioning model. Experiments also show that the visualized features can be integrated into a computational model to effectively predict the consistency between the two modalities on an image dataset with both types of annotations.
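
To make the comparison concrete, the following minimal sketch illustrates how a visual saliency ranking derived from eye fixations and a verbal saliency ranking derived from caption mentions could be compared with a rank correlation. It is not the article's actual word-weighting scheme; the object labels, fixation shares, and mention counts are hypothetical.

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical per-object annotations for a single image.
    # fixation_share: fraction of eye fixations landing on each object (visual saliency).
    # mention_count: times each object is mentioned across the image's captions (verbal saliency).
    objects = ["dog", "frisbee", "grass", "fence"]
    fixation_share = np.array([0.55, 0.30, 0.10, 0.05])
    mention_count = np.array([5, 4, 1, 0])

    # Spearman's rho measures how consistently the two modalities rank the same objects:
    # values near 1 mean people look at and describe the same things.
    rho, p_value = spearmanr(fixation_share, mention_count)
    print(f"Visual-verbal saliency consistency (Spearman rho): {rho:.2f}, p = {p_value:.3f}")

The article derives verbal saliency from a word-weighting scheme over captions rather than raw mention counts; the counts above stand in only for illustration.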


Cited By

• (2023) A survey on automatic generation of medical imaging reports based on deep learning. BioMedical Engineering OnLine 22:1. DOI: 10.1186/s12938-023-01113-y. Online publication date: 18 May 2023.



      Published In

ACM Transactions on Intelligent Systems and Technology, Volume 10, Issue 1
Special Issue on Visual Analytics
January 2019, 235 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3295616

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 28 November 2018
      Accepted: 01 March 2018
      Revised: 01 March 2018
      Received: 01 August 2017
      Published in TIST Volume 10, Issue 1


      Author Tags

      1. Image captioning
      2. visual analytics
      3. visual saliency

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • National Science Foundation of China

