Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization

Selvaraju, Ramprasaath R.; Das, Abhishek; Vedantam, Ramakrishna; Cogswell, Michael; Parikh, Devi; Batra, Dhruv

arXiv:1610.02391v2 (cs)

[Submitted on 7 Oct 2016 (v1), revised 30 Dec 2016 (this version, v2), latest version 3 Dec 2019 (v4)]

Abstract: We propose a technique for making CNN-based models more transparent by visualizing the input image regions that are important for these models' predictions, thereby producing visual explanations. Our approach, called Gradient-weighted Class Activation Mapping (Grad-CAM), uses the class-specific gradient information flowing into the final convolutional layer of a CNN to produce a coarse localization map of the image regions important for each class. Grad-CAM is a strict generalization of Class Activation Mapping (CAM). Unlike CAM, Grad-CAM applies to any CNN-based architecture and needs no re-training. We show how Grad-CAM may be combined with pixel-space visualizations (such as Guided Backprop) to create a high-resolution, class-discriminative visualization (Guided Grad-CAM). We generate Grad-CAM and Guided Grad-CAM visualizations to better understand off-the-shelf image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insight into the models' failure modes, and (b) outperform pixel-space gradient visualizations on the ILSVRC-15 weakly supervised localization task. For image captioning and VQA, our visualizations expose the somewhat surprising insight that common CNN+LSTM models are good at localizing discriminative input image regions despite not being trained on grounded image-text pairs. Finally, through human studies we show that our explanations help users establish trust in the predictions made by deep networks. Interestingly, we find that Guided Grad-CAM helps untrained users successfully discern a stronger deep network from a weaker one even when both make identical decisions. Our code is available at this http URL and a demo is available at this http URL. Video of the demo can be found at this http URL.
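
As an illustration of the computation the abstract describes, here is a minimal NumPy sketch; it is not the authors' released code. It assumes the feature maps of the final convolutional layer, and the gradients of one class score with respect to them, have already been extracted from a network as arrays of shape (K, H, W). The function names grad_cam and guided_grad_cam are hypothetical. The channel weighting (spatially averaged gradients, followed by a ReLU over the weighted sum) follows the formulation in the paper. Two choices are simplifications made here: scaling the map to [0, 1] is only a display convenience, and nearest-neighbour upsampling stands in for smoother interpolation.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Coarse class-localization map from final-conv-layer tensors.

    activations: (K, H, W) feature maps of the final conv layer.
    gradients:   (K, H, W) gradient of the class score w.r.t. those maps.
    Both inputs are assumed to be precomputed by the caller.
    """
    # One importance weight per channel: spatial average of its gradient.
    weights = gradients.mean(axis=(1, 2))              # shape (K,)
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    # Keep only features with a positive influence on the class score.
    cam = np.maximum(cam, 0.0)
    # Assumption for display: scale to [0, 1] when the map is non-zero.
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def guided_grad_cam(cam, guided_backprop):
    """Fuse a coarse map with a pixel-space map by pointwise product.

    cam:             (h, w) map returned by grad_cam.
    guided_backprop: (H, W) pixel-space map, where H and W are assumed
                     to be integer multiples of h and w.
    """
    fy = guided_backprop.shape[0] // cam.shape[0]
    fx = guided_backprop.shape[1] // cam.shape[1]
    # Nearest-neighbour upsampling (a simplification, see the note above).
    upsampled = np.repeat(np.repeat(cam, fy, axis=0), fx, axis=1)
    return upsampled * guided_backprop
```

In practice the two input arrays would come from a forward and backward pass through the model of interest, for example via framework hooks on the last convolutional layer. Because only that layer's activations and gradients are needed, the model itself is left unchanged, which matches the abstract's point that no re-training is required.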

Comments:	22 pages, 19 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:1610.02391 [cs.CV]
	(or arXiv:1610.02391v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1610.02391

Submission history

From: Ramprasaath Ramasamy Selvaraju
[v1] Fri, 7 Oct 2016 19:54:24 UTC (8,245 KB)
[v2] Fri, 30 Dec 2016 07:19:35 UTC (8,596 KB)
[v3] Tue, 21 Mar 2017 23:48:00 UTC (9,133 KB)
[v4] Tue, 3 Dec 2019 02:13:03 UTC (7,321 KB)
