to multiple research efforts to make a model's attributions robust. Chen et al. (2019) proposed a regularization-based approach, where an explicit regularizer term is added to the loss function to maintain the model's attributions (IG, in particular) across input perturbations while training the DNN model. This was subsequently extended by (Sarkar, Sarkar, and Balasubramanian 2021; Singh et al. 2020; Wang et al. 2020), all of whom provide different training strategies and regularizers to improve the attributional robustness of models. Each of these methods, including Ghorbani, Abid, and Zou (2019), measures the change in attribution before and after input perturbation using the same metrics: top-k intersection, and/or rank correlations such as Spearman's ρ and Kendall's τ. Such metrics have recently, in fact, further been used to understand issues surrounding attributional robustness (Wang and Kong 2022). Other efforts that quantify the stability of attributions in tabular data also use Euclidean distance (or its variants) between the original and perturbed attribution maps (Alvarez-Melis and Jaakkola 2018; Yeh et al. 2019; Agarwal et al. 2022). Each of these metrics looks for dimension-wise correlation or pixel-level matching between attribution maps before and after perturbation, and thus penalizes even a minor change in attribution (say, even by one pixel coordinate location). This results in a false sense of fragility, and could even be misleading. In this work, we highlight the need to revisit such metrics, and propose variants based on locality and diversity that can be easily integrated into existing metrics.

Other Related Work. In other related efforts that have studied similar properties of attribution-based explanations, Carvalho, Pereira, and Cardoso (2019) and Bhatt, Weller, and Moura (2020) stated that stable explanations should not vary too much between similar input samples, unless the model's prediction changes drastically. The abovementioned attributional attacks and defense methods (Ghorbani, Abid, and Zou 2019; Sarkar, Sarkar, and Balasubramanian 2021; Singh et al. 2020; Wang et al. 2020) maintain this property, since they focus on input perturbations that change the attribution without changing the model prediction itself. Similarly, Arun et al. (2020) and Fel et al. (2022) introduced the notions of repeatability/reproducibility and generalizability, respectively, both of which focus on the desired property that a trustworthy explanation must point to similar evidence across similar input images. In this work, we provide a practical metric to study this notion of similarity by considering locality-sensitive metrics.

Figure 2: From top to bottom, we plot average top-k intersection (currently used metric), 3-LENS-recall@k and 3-LENS-recall@k-div (proposed metrics) against the ℓ∞-norm of attributional attack perturbations for Simple Gradients (SG) (left) and Integrated Gradients (IG) (right) of a SqueezeNet model on ImageNet. We use k = 1000 and three attributional attack variants proposed by Ghorbani, Abid, and Zou (2019). Evidently, the proposed metrics show more robustness under the same attacks.

3 Locality-sENSitive Metrics (LENS) for Attributional Robustness

As a motivating example, Figure 2 presents the results obtained using (Ghorbani, Abid, and Zou 2019) with Simple Gradients (SG) and Integrated Gradients (IG) of an NN model trained on ImageNet. The top row, which reports the currently followed top-k intersection measure of attributional robustness, shows a significant drop in robustness performance even for the random sign attack (green line). The subsequent rows, which report our metrics for the same experiments, show significant improvements in robustness, especially when combining the notions of locality and diversity. Observations made on current metrics could thus lead to a false sense of fragility, since they overpenalize even an attribution shift by 1-2 pixels. A detailed description of our experimental setup for these results is available in Appendix C. Motivated by these observations, we explore improved measures for attributional robustness that maintain the overall requirements of robustness, but do not overpenalize minor deviations.

3.1 Defining LENS Metrics for Attributions

To begin with, we propose an extension of existing similarity measures to incorporate the locality of pixel attributions in images to derive more practical and useful measures of attributional robustness. Let a_ij(x) denote the attribution value or importance assigned to the (i, j)-th pixel in an input image x, and let S_k(x) denote the set of k pixel positions with the largest attribution values. Let N_w(i, j) = {(p, q) : i − w ≤ p ≤ i + w, j − w ≤ q ≤ j + w} be the neighboring pixel positions within a (2w + 1) × (2w + 1) window around the (i, j)-th pixel. By a slight abuse of notation, we use N_w(S_k(x)) to denote the union of N_w(i, j) over all (i, j) ∈ S_k(x), that is, the set of all pixel positions that lie in the union of (2w + 1) × (2w + 1) windows around the top-k pixels.

For a given attributional perturbation Att(·), let T_k = S_k(x + Att(x)) denote the top-k pixels in attribution values after applying the attributional perturbation Att(x). The currently used top-k intersection metric is then computed as |S_k(x) ∩ T_k(x)| / k. To address the abovementioned issues, we instead propose Locality-sENSitive top-k metrics (LENS-top-k) as |N_w(S_k(x)) ∩ T_k(x)| / k and |S_k(x) ∩ N_w(T_k(x))| / k, which are also closer to more widely used metrics such as precision and recall in ranking methods.
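To make these definitions concrete, the following is a minimal NumPy/SciPy sketch of the LENS-top-k scores for two single-channel attribution maps of the same shape. It is an illustration under our own assumptions rather than a released implementation: the function names, the use of scipy.ndimage.maximum_filter to realize N_w as a binary dilation, and the assignment of the two intersection sizes to the names prec and recall are assumptions made for this sketch.

```python
import numpy as np
from scipy.ndimage import maximum_filter


def topk_mask(attr, k):
    """Boolean mask S_k: the k pixel positions with the largest attribution values."""
    flat = attr.ravel()
    idx = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[idx] = True
    return mask.reshape(attr.shape)


def neighborhood(mask, w):
    """N_w(.): union of (2w+1) x (2w+1) windows around every pixel set in `mask`."""
    if w == 0:
        return mask
    # Binary dilation with a square window; constant padding keeps N_w inside the image.
    return maximum_filter(mask.astype(np.uint8), size=2 * w + 1,
                          mode="constant", cval=0).astype(bool)


def lens_top_k(attr_orig, attr_pert, k=1000, w=1):
    """w-LENS scores between an original and a perturbed attribution map.

    Returns (|S_k ∩ N_w(T_k)|/k, |N_w(S_k) ∩ T_k|/k, |S_k ∩ T_k|/k); the last value
    is the plain top-k intersection, which both LENS scores reduce to at w = 0.
    """
    S = topk_mask(attr_orig, k)  # top-k of the original attribution
    T = topk_mask(attr_pert, k)  # top-k of the perturbed attribution
    lens_prec = (S & neighborhood(T, w)).sum() / k
    lens_recall = (neighborhood(S, w) & T).sum() / k
    intersection = (S & T).sum() / k
    return lens_prec, lens_recall, intersection
```

For instance, calling lens_top_k(ig_original, ig_perturbed, k=1000, w=3) on hypothetical IG maps computed before and after an attack would give 3-LENS scores of the kind reported alongside top-k intersection in the figures.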
We similarly define Locality-sENSitive Spearman's ρ (LENS-Spearman) and Locality-sENSitive Kendall's τ (LENS-Kendall) metrics as rank correlation coefficients for the smoothed ranking orders according to ã_ij(x) and ã_ij(x + Att(x)), respectively (the w-smoothed attribution ã is defined below). These can be used to compare two different attributions for the same image, the same attribution method on two different images, or even two different attributions on two different images, as long as the attribution vectors lie in the same space, e.g., images of the same dimensions where attributions assign importance values to pixels. Figure 3 provides the visualization of the explanation map of a sample from the Flower dataset with the top-1000 pixels, followed by the corresponding maps with 1-LENS@k and 2-LENS@k.

Figure 3: A sample image from the Flower dataset before (top) and after (bottom) the top-k attributional attack of (Ghorbani, Abid, and Zou 2019) on a ResNet model for the Integrated Gradients (IG) attribution method. From left to right: the image, its top-k pixels as per IG, and the union of the 3 × 3-pixel neighborhoods and 5 × 5-pixel neighborhoods of the top-k pixels, respectively, for k = 1000. Quantitatively, top-k intersection: 0.14, 1-LENS-recall@k: 0.25, 1-LENS-pre@k: 0.37, 2-LENS-recall@k: 0.40, 2-LENS-pre@k: 0.62.

We show that the proposed locality-sensitive variants of the robustness metrics also possess some theoretically interesting properties. Let a_1 and a_2 be two attribution vectors for two images, and let S_k and T_k be the sets of top-k pixels in these images according to a_1 and a_2, respectively. We define a locality-sensitive top-k distance between two attribution vectors a_1 and a_2 as

d_k^(w)(a_1, a_2) := prec_k^(w)(a_1, a_2) + recall_k^(w)(a_1, a_2),

where prec_k^(w)(a_1, a_2) := |S_k \ N_w(T_k)| / k and recall_k^(w)(a_1, a_2) := |T_k \ N_w(S_k)| / k, similar to precision and recall used in the ranking literature, with the key difference being the inclusion of neighborhood items based on locality.

Below we state a monotonicity property of d_k^(w)(a_1, a_2) and upper bound it in terms of the symmetric set difference of top-k attributions.

Proposition 1. For any w_1 ≤ w_2, we have d_k^(w_2)(a_1, a_2) ≤ d_k^(w_1)(a_1, a_2) ≤ |S_k △ T_k| / k, where △ denotes the symmetric set difference, i.e., A △ B = (A \ B) ∪ (B \ A).

Combining d_k^(w)(a_1, a_2) across different values of k and w, we can define a distance

d(a_1, a_2) = Σ_{k=1}^∞ Σ_{w=0}^∞ α_k β_w d_k^(w)(a_1, a_2),

where α_k and β_w are non-negative weights, monotonically decreasing in k and w, respectively, such that Σ_k α_k < ∞ and Σ_w β_w < ∞. We show that the distance defined above is upper-bounded by a metric similar to those proposed in (Fagin, Kumar, and Sivakumar 2003) based on the symmetric set difference of top-k ranks to compare two rankings.

Proposition 2. d(a_1, a_2) defined above is upper-bounded by u(a_1, a_2) given by

u(a_1, a_2) = Σ_{k=1}^∞ Σ_{w=0}^∞ α_k β_w |S_k △ T_k| / k,

and u(a_1, a_2) defines a bounded metric on the space of attribution vectors.

Note that top-k intersection, Spearman's ρ and Kendall's τ do not take the attribution values a_ij(x) into account but only the rank order of pixels according to these values. We also define a locality-sensitive w-smoothed attribution as follows:

ã_ij^(w)(x) = (1 / (2w + 1)^2) Σ_{(p,q) ∈ N_w(i,j), 1 ≤ p,q ≤ n} a_pq(x).
We show that the w-smoothed attribution leads to a contraction in the ℓ2 norm commonly used in the theoretical analysis of simple gradients as attributions.

Proposition 3. For any inputs x, y and any w ≥ 0, ∥ã^(w)(x) − ã^(w)(y)∥_2 ≤ ∥a(x) − a(y)∥_2.

Thus, any theoretical bounds on the attributional robustness of simple gradients in the ℓ2 norm proved in previous works continue to hold for locality-sensitive w-smoothed gradients. For example, (Wang et al. 2020) show the following Hessian-based bound on simple gradients. For an input x and a classifier or model defined by f, let ∇_x(f) and ∇_y(f) be the simple gradients w.r.t. the inputs at x and y, respectively. They bound the ℓ2 distance between the simple gradients of nearby points ∥x − y∥_2 ≤ δ as ∥∇_x(f) − ∇_y(f)∥_2 ≲ δ λ_max(H_x(f)), where H_x(f) is the Hessian of f at x and λ_max(·) denotes its largest eigenvalue.

... 1 − prec_k^(w)(a_1, a_2) and 1 − recall_k^(w)(a_1, a_2), respectively, where a_1 is the attribution of the original image and a_2 is the attribution of the perturbed image. For rank correlation ...

Table 2: Average top-k intersection, 3-LENS-prec@k and 3-LENS-prec@k-div for the random sign perturbation attack applied to different attribution methods on ImageNet, for naturally and adversarially (PGD) trained ResNet50 models.

Figure 8: For the Flower dataset, average top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k measured between IG(original image) and IG(perturbed image) for models that are naturally trained, PGD-trained and IG-SUM-NORM trained. The perturbation used is the top-t attack of (Ghorbani, Abid, and Zou 2019). Note that top-k is equivalent to 0-LENS-prec@k and 0-LENS-recall@k.

Figure 8 shows that PGD-trained and IG-SUM-NORM trained models have more robust Integrated Gradients (IG) in comparison to their naturally trained counterparts, and this holds for the previously used measures of attributional robustness (e.g., top-k intersection) as well as the new locality-sensitive measures we propose (e.g., 1-LENS-prec@k, 1-LENS-recall@k) across all datasets in the Chen et al. (2019) experiments (refer to Appendix E.2 and E.3). The top-k attack of Ghorbani, Abid, and Zou (2019) is not a threat ...

6 Conclusion and Future Work

We show that the fragility of attributions is an effect of using fragile robustness metrics such as top-k intersection that only look at the rank order of attributions and fail to capture the locality of pixel positions with high attributions. We highlight the need for locality-sensitive metrics for attributional robustness and propose natural locality-sensitive extensions of existing metrics. We also introduce a method of picking diverse top-k pixels that can be naturally extended with locality to obtain improved measures of attributional robustness. Theoretical understanding of locality-sensitive metrics of attributional robustness, constructing stronger attributional attacks for these metrics, and using them to build attributionally robust models are important future directions.
References

Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I. J.; Hardt, M.; and Kim, B. 2018. Sanity Checks for Saliency Maps. In Bengio, S.; Wallach, H. M.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 9525–9536.

Agarwal, C.; Johnson, N.; Pawelczyk, M.; Krishna, S.; Saxena, E.; Zitnik, M.; and Lakkaraju, H. 2022. Rethinking Stability for Attribution-based Explanations. arXiv preprint arXiv:2203.06877.

Alvarez-Melis, D.; and Jaakkola, T. S. 2018. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049.

Arun, N.; Gaw, N.; Singh, P.; Chang, K.; Aggarwal, M.; Chen, B.; et al. 2020. Assessing the (Un)Trustworthiness of saliency maps for localizing abnormalities in medical imaging. arXiv preprint.

Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller, K.-R.; and Samek, W. 2015. On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLoS ONE, 10(7): e0130140.

Bhatt, U.; Weller, A.; and Moura, J. M. 2020. Evaluating and aggregating feature-based model explanations. arXiv preprint arXiv:2005.00631.

Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P.; Giacinto, G.; and Roli, F. 2013. Evasion Attacks against Machine Learning at Test Time. Lecture Notes in Computer Science, 387–402.

Boggust, A.; Suresh, H.; Strobelt, H.; Guttag, J. V.; and Satyanarayan, A. 2022. Beyond Faithfulness: A Framework to Characterize and Compare Saliency Methods. CoRR, abs/2206.02958.

Carvalho, D. V.; Pereira, E. M.; and Cardoso, J. S. 2019. Machine learning interpretability: A survey on methods and metrics. Electronics, 8(8): 832.

Chalasani, P.; Chen, J.; Chowdhury, A. R.; Wu, X.; and Jha, S. 2020. Concise Explanations of Neural Networks using Adversarial Training. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, 1383–1391. PMLR.

Chattopadhyay, A.; Sarkar, A.; Howlader, P.; and Balasubramanian, V. N. 2018. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. 839–847.

Chen, J.; Wu, X.; Rastogi, V.; Liang, Y.; and Jha, S. 2019. Robust Attribution Regularization.

Fagin, R.; Kumar, R.; and Sivakumar, D. 2003. Comparing Top k Lists. SIAM Journal on Discrete Mathematics, 17(1): 134–160.

Fan, F.; Xiong, J.; Li, M.; and Wang, G. 2021. On Interpretability of Artificial Neural Networks: A Survey. arXiv:2001.02522 [cs, stat].

Fel, T.; Vigouroux, D.; Cadène, R.; and Serre, T. 2022. How good is your explanation? Algorithmic stability measures to assess the quality of explanations for deep neural networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 720–730.

Gade, K.; Geyik, S. C.; Kenthapadi, K.; Mithal, V.; and Taly, A. 2020. Explainable AI in industry: practical challenges and lessons learned: implications tutorial. In Hildebrandt, M.; Castillo, C.; Celis, L. E.; Ruggieri, S.; Taylor, L.; and Zanfir-Fortuna, G., eds., FAT* '20: Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, January 27-30, 2020, 699. ACM.

Ghorbani, A.; Abid, A.; and Zou, J. Y. 2019. Interpretation of Neural Networks Is Fragile.

Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explaining and Harnessing Adversarial Examples.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385.

Iandola, F. N.; Moskewicz, M. W.; Ashraf, K.; Han, S.; Dally, W. J.; and Keutzer, K. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360.

Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F.; et al. 2018. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, 2668–2677. PMLR.

Korte, B.; and Lovász, L. 1981. Mathematical structures underlying greedy algorithms. In Gécseg, F., ed., Fundamentals of Computation Theory, 205–209. Springer Berlin Heidelberg. ISBN 978-3-540-38765-7.

Lakkaraju, H.; Arsov, N.; and Bastani, O. 2020. Robust and Stable Black Box Explanations. In International Conference on Machine Learning, 5628–5638.

Lecue, F.; Guidotti, R.; Minervini, P.; and Giannotti, F. 2021. 2021 Explainable AI Tutorial. https://xaitutorial2021.github.io/. Visited on 14-09-2021.

Lipton, Z. C. 2018. The mythos of model interpretability. Commun. ACM, 61(10): 36–43.

Lundberg, S. M.; and Lee, S. 2017. A Unified Approach to Interpreting Model Predictions.

Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks.

Molnar, C. 2019. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.

Nourelahi, M.; Kotthoff, L.; Chen, P.; and Nguyen, A. 2022. How explainable are adversarially-robust CNNs? arXiv:2205.13042.

Oh, C.; and Jeong, J. 2020. VODCA: Verification of Diagnosis Using CAM-Based Approach for Explainable Process Monitoring. Sensors, 20(23): 6858.

Oviedo, F.; Ferres, J. L.; Buonassisi, T.; and Butler, K. T. 2022. Interpretable and Explainable Machine Learning for Materials Science and Chemistry. Accounts of Materials Research, 3(6): 597–607.

Ozdag, M. 2018. Adversarial Attacks and Defenses Against Deep Neural Networks: A Survey. Procedia Computer Science, 140: 152–161.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier.

Samek, W.; Montavon, G.; Vedaldi, A.; Hansen, L. K.; and Müller, K., eds. 2019. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, volume 11700 of Lecture Notes in Computer Science. Springer. ISBN 978-3-030-28953-9.

Sarkar, A.; Sarkar, A.; and Balasubramanian, V. N. 2021. Enhanced Regularizers for Attributional Robustness.

Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. 618–626.

Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learning Important Features Through Propagating Activation Differences.

Shrikumar, A.; Greenside, P.; Shcherbina, A.; and Kundaje, A. 2016. Not Just a Black Box: Learning Important Features Through Propagating Activation Differences. arXiv:1605.01713.

Silva, S. H.; and Najafirad, P. 2020. Opportunities and Challenges in Deep Learning Adversarial Robustness: A Survey. CoRR, abs/2007.00753.

Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.

Singh, M.; Kumari, N.; Mangla, P.; Sinha, A.; Balasubramanian, V. N.; and Krishnamurthy, B. 2020. Attributional Robustness Training Using Input-Gradient Spatial Alignment.

Slack, D.; Hilgard, A.; Lakkaraju, H.; and Singh, S. 2021a. Counterfactual Explanations Can Be Manipulated. In Advances in Neural Information Processing Systems.

Slack, D.; Hilgard, A.; Singh, S.; and Lakkaraju, H. 2021b. Reliable Post hoc Explanations: Modeling Uncertainty in Explainability. In Advances in Neural Information Processing Systems, 9391–9404.

Slack, D.; Hilgard, S.; Jia, E.; Singh, S.; and Lakkaraju, H. 2020. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods.

Smilkov, D.; Thorat, N.; Kim, B.; Viégas, F. B.; and Wattenberg, M. 2017. SmoothGrad: removing noise by adding noise. arXiv:1706.03825.

Springenberg, J. T.; Dosovitskiy, A.; Brox, T.; and Riedmiller, M. A. 2015. Striving for Simplicity: The All Convolutional Net.

Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic Attribution for Deep Networks. In Precup, D.; and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, 3319–3328. PMLR.

Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2014. Intriguing properties of neural networks.

Tang, S.; Ghorbani, A.; Yamashita, R.; Rehman, S.; Dunnmon, J. A.; Zou, J. Y.; and Rubin, D. L. 2021. Data Valuation for Medical Imaging Using Shapley Value: Application on A Large-scale Chest X-ray Dataset. Scientific Reports (Nature Publisher Group).

Tomsett, R.; Harborne, D.; Chakraborty, S.; Gurram, P.; and Preece, A. D. 2020. Sanity Checks for Saliency Metrics. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 6021–6029. AAAI Press.

Wang, F.; and Kong, A. W. 2022. Exploiting the Relationship Between Kendall's Rank Correlation and Cosine Similarity for Attribution Protection. CoRR, abs/2205.07279.

Wang, Z.; Wang, H.; Ramkumar, S.; Mardziel, P.; Fredrikson, M.; and Datta, A. 2020. Smoothed Geometry for Robust Attribution.

Yap, M.; Johnston, R. L.; Foley, H.; MacDonald, S.; Kondrashova, O.; Tran, K. A.; Nones, K.; Koufariotis, L. T.; Bean, C.; Pearson, J. V.; Trzaskowski, M.; and Waddell, N. 2021. Verifying explainability of a deep learning tissue classifier trained on RNA-seq data. Scientific Reports (Nature Publisher Group).

Yeh, C.-K.; Hsieh, C.-Y.; Suggala, A.; Inouye, D. I.; and Ravikumar, P. K. 2019. On the (in)fidelity and sensitivity of explanations. Advances in Neural Information Processing Systems, 32.

Zeiler, M. D.; and Fergus, R. 2014. Visualizing and Understanding Convolutional Networks. In Proceedings of The European Conference on Computer Vision (ECCV).

Zeiler, M. D.; Krishnan, D.; Taylor, G. W.; and Fergus, R. 2010. Deconvolutional networks. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, 2528–2535. IEEE Computer Society.

Zhang, Y.; Tiňo, P.; Leonardis, A.; and Tang, K. 2020. A Survey on Neural Network Interpretability. arXiv:2012.14261 [cs].

Zhou, B.; Khosla, A.; Lapedriza, À.; Oliva, A.; and Torralba, A. 2016. Learning Deep Features for Discriminative Localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2921–2929. IEEE Computer Society.
A Supplementary: Rethinking Robustness of Model Attributions

The Appendix contains proofs, additional experiments to show that the trends hold across different datasets, and other ablation studies which could not be included in the main paper due to space constraints.
Nw (i, j)’s 20 40 60
Eval no. of top-k
80 100 20 40 60
Eval no. of top-k
80 100 20 40 60
Eval no. of top-k
80 100 0 200 400 600
Eval no. of top-k
800 1000
0.85 0.75
0.8
0.7 0.7 0.25 0.85 0.80
1-LENS-prec@k 1-LENS-prec@k 1-LENS-prec@k 0.85 1-LENS-prec@k 0.70
0.7 0.6 0.80 top-k top-k 0.75
top-k top-k 0.6 0.20
0.80
0.65
0.5 0.75 0.70
0.6 0.5 0.15 0.60
0.4 0.75 0.65
1-LENS-prec@k 1-LENS-prec@k 0.70 1-LENS-prec@k 0.55 1-LENS-prec@k
0.5 0.4 0.10 0.60
0.3
top-k top-k 0.65 0.70 top-k 0.50
top-k
0.3 0.55
0.2 0.05 20 40 60 80 100 20 40 60 80 100 20 40 60 80 100 0 200 400 600 800 1000
20 40 60 80 100 20 40 60 80 100 20 40 60 80 100 0 200 400 600 800 1000 Eval no. of top-k Eval no. of top-k Eval no. of top-k Eval no. of top-k
Eval no. of top-k Eval no. of top-k Eval no. of top-k Eval no. of top-k
(a) MNIST: top- (b) F-MNIST: (c) GTSRB: top- (d) Flower: top- (a) MNIST IG : (b) F-MNIST IG (c) GTSRB IG : (d) Flower IG :
k top-t k k top-k : top-k top-k top-k
1.0 1.00
1.00 0.975
0.95
0.9 0.95 0.95
0.95 0.9 0.95 0.950
0.90
Average top-k Intersection
(e) MNIST: ran- (f) F-MNIST: (g) GTSRB: ran- (h) Flower: ran- (e) MNIST IG : (f) F-MNIST IG (g) GTSRB IG : (h) Flower IG :
dom random dom dom random : random top-k random
Figure 10: Attributional robustness of IG on naturally Figure 12: Attributional robustness of IG on IG-SUM-
trained models measured as average top-k intersection NORM trained models measured as average top-k intersec-
and 1-LENS-prec@k between IG(original image) and tion and 1-LENS-prec@k between IG(original image) and
IG(perturbed image). Perturbations are obtained by the top- IG(perturbed image). Perturbations are obtained by the top-
k attack (Ghorbani, Abid, and Zou 2019) and random per- k attack (Ghorbani, Abid, and Zou 2019) and random per-
turbation. The plots show how the above measures change turbation. The plots show how the above measures change
with varying k across different datasets. with varying k across different datasets.
Figure 13: Attributional robustness of IG on naturally trained models, measured as average Spearman's ρ, 1-LENS-Spearman, Kendall's τ and 1-LENS-Kendall between IG(original image) and IG(perturbed image). The perturbations are obtained by the top-t attack of Ghorbani, Abid, and Zou (2019).

... is the top-k attack of Ghorbani, Abid, and Zou (2019). Shown for (a) MNIST, (b) Fashion MNIST, (c) GTSRB and (d) Flower datasets.

Figure 15: Average top-k intersection between IG(original image) and IG(perturbed image) on naturally trained models, where the perturbation is obtained by incorporating the 1-LENS-prec@k objective in the Ghorbani, Abid, and Zou (2019) attack.

E.2 Experiments with Integrated Gradients

Below we present additional experimental results for Integrated Gradients (IG).

Figure 16: Average top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k measured between IG(original image) and IG(perturbed image) for models that are naturally trained, PGD-trained and IG-SUM-NORM trained. The perturbation used is the top-k attack of (Ghorbani, Abid, and Zou 2019). Shown for (a) MNIST, (b) Fashion MNIST, (c) GTSRB and (d) Flower datasets.
(a) IG: top-k (b) IG: random (c) IG: center of mass

... the IG of the original images and the IG of their perturbations obtained by the random sign attack (Ghorbani, Abid, and Zou 2019) across different datasets.

Figure 20: Attributional robustness of Simple Gradients on naturally, PGD and IG-SUM-NORM trained models, measured as top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k between the Simple Gradient of the original images and the Simple Gradient of their perturbations obtained by the top-k attack (Ghorbani, Abid, and Zou 2019) across different datasets.

Figure 21: Attributional robustness of Simple Gradients on naturally, PGD and IG-SUM-NORM trained models, measured as top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k between the Simple Gradient of the original images and the Simple Gradient of their perturbations obtained by the random sign attack (Ghorbani, Abid, and Zou 2019) across different datasets.

Figure 22: Attributional robustness of Simple Gradients on naturally, PGD and IG-SUM-NORM trained models, measured as top-k intersection, 1-LENS-prec@k and 1-LENS-recall@k between the Simple Gradient of the original images and the Simple Gradient of their perturbations obtained by the mass center attack (Ghorbani, Abid, and Zou 2019) across different datasets.

(a) SG: top-k (b) SG: random (c) SG: center of mass

Figure 23: Attributional robustness of Simple Gradients on naturally, PGD and IG-SUM-NORM trained models, measured as top-k intersection and w-LENS-prec@k between the IG of the original images and the IG of their perturbations. Perturbations are obtained by the top-k attack and the mass center attack (Ghorbani, Abid, and Zou 2019) as well as a random perturbation. The plots show the effect of varying w on the Flower dataset.
E.4 Experiments with Other Explanation Methods

In this section we present results for explanation methods other than the IG and SG shown in previous sections, on naturally and PGD trained models, mainly using the random sign perturbation.

Figure 24: Similar to Figure 2, from top to bottom we plot average top-k intersection (currently used metric, top), 3-LENS-recall@k and 3-LENS-recall@k-div (proposed metrics, middle and bottom respectively) against attributional attack perturbations for four attribution methods of a SqueezeNet model (as used by Ghorbani, Abid, and Zou (2019)) on ImageNet: (left) Simple Gradients (SG), (center left) Integrated Gradients (IG), (center right) DeepLift, (right) Deep Taylor Decomposition. We use k = 1000 with an ℓ∞-norm attack and three attack variants proposed by Ghorbani, Abid, and Zou (2019). Evidently, the proposed metrics show more robustness under the same attacks.

Figure 25: Attributional robustness of Simple Gradients (SG) on naturally and PGD trained models, measured as top-k intersection, w-LENS-prec@k and w-LENS-recall@k between the Simple Gradients (SG) of the original images and of their perturbations obtained by the random sign perturbation across different datasets ((a) MNIST, (b) Fashion MNIST, (c) ImageNet).

Figure 26: Attributional robustness of Image × Gradients on naturally and PGD trained models, measured as top-k intersection, w-LENS-prec@k and w-LENS-recall@k between the Image × Gradients of the original images and of their perturbations obtained by the random sign perturbation across different datasets ((a) MNIST, (b) Fashion MNIST, (c) ImageNet).
(a) Simple Gradient (b) Image × Gradient

(a) MNIST (b) Fashion MNIST (c) ImageNet

Figure 28: Attributional robustness of DeepLIFT (Shrikumar, Greenside, and Kundaje 2017) on naturally and PGD ...

(a) MNIST (b) Fashion MNIST (c) ImageNet

... measured as top-k intersection, w-LENS-prec@k and w-LENS-recall@k between the GradSHAP of the original images and of their perturbations obtained by the random sign perturbation across different datasets.

(e) GradSHAP [Lundberg 2017] (f) IG [Sundararajan 2017]

Figure 31: Attributional robustness of explanation methods on naturally and PGD trained models, measured as top-k intersection and w-LENS-prec@k between the explanation map of the original images and their perturbations. Perturbations are obtained by the random sign perturbation. The ...
(e) Original IG (f) top-100 pixels (g) Perturbed IG (h) top-100 pixels

Figure 32: Sample Integrated Gradients (IG) map using the Flower dataset, where the top-100 pixels are highlighted before and after the image is perturbed with the top-k attack of Ghorbani, Abid, and Zou (2019).

Figure 33: Example based on locality. Sample Integrated Gradients (IG) map using the Flower dataset with top-k highlighted, followed by w-LENS@k maps: (row 1) original IG and (row 2) perturbed IG with the top-k attack (Ghorbani, Abid, and Zou 2019).

(a) Original IG (b) top-1000-div pixels (c) 1-LENS-recall@k-div (d) 2-LENS-recall@k-div

Figure 34: Example based on diversity. Sample Integrated Gradients (IG) map using the Flower dataset with top-k-div highlighted, followed by w-LENS@k-div maps: (row 1) original IG and (row 2) perturbed IG with the top-k attack (Ghorbani, Abid, and Zou 2019).

(a) Original IG (b) top-1000 diverse pixels with 3×3 window (c) top-1000 diverse pixels with 5×5 window
(d) Perturbed IG (e) top-1000 diverse pixels with 3×3 window (f) top-1000 diverse pixels with 5×5 window

Figure 35: Example based on diversity with different window sizes. Sample Integrated Gradients (IG) map using the Flower dataset with top-k-div highlighted with (column 2) a 3×3 window and (column 3) a 5×5 window. Column 1 shows (top) the map of the unperturbed image and (bottom) the map of the image perturbed with the top-k attack (Ghorbani, Abid, and Zou 2019).
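The top-k-div maps above use a diverse selection of high-attribution pixels; the precise selection rule is specified in the main paper and is not reproduced in this extract. The following sketch shows one plausible greedy realization, under the assumption that diversity means repeatedly picking the highest-attribution pixel whose (2w + 1) × (2w + 1) window does not already contain a selected pixel; the function name and this rule are assumptions for illustration only.

```python
import numpy as np


def topk_div(attr, k=1000, w=3):
    """Greedy diverse top-k selection (one plausible reading of "top-k-div").

    Repeatedly pick the highest-attribution pixel that is not within the
    (2w+1) x (2w+1) window of any previously selected pixel, until k pixels
    are selected or no candidates remain.
    """
    H, W = attr.shape
    # Pixel coordinates sorted by decreasing attribution value.
    order = np.stack(np.unravel_index(np.argsort(attr, axis=None)[::-1], (H, W)), axis=1)
    blocked = np.zeros((H, W), dtype=bool)
    picked = []
    for i, j in order:
        if blocked[i, j]:
            continue
        picked.append((int(i), int(j)))
        if len(picked) == k:
            break
        # Block the window around the newly selected pixel.
        blocked[max(0, i - w):i + w + 1, max(0, j - w):j + w + 1] = True
    return picked
```

Under this reading, w = 3 gives the 7×7 windows quoted for the ImageNet example in Figure 37, and w = 0 recovers the plain top-k selection.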
(a) Original Image (b) top-1000 of Original IG (c) 3-LENS-recall@1000 (d) 3-LENS-recall@1000 0/1
(e) Perturbed Image (f) top-1000 of Perturbed IG (g) 3-LENS-prec@1000 (h) 3-LENS-prec@1000 0/1

Figure 36: Example based on locality (LENS) with a sample Integrated Gradients (IG) map from the ImageNet dataset, (a) being the original image and (b) the perturbed image. (c) and (d) show top-k highlighted, and (e) and (f) are the corresponding maps with LENS. Maps (g) and (h) correspond to (e) and (f), respectively, with non-zero value pixels shown as white. top-k: 0.108, 3-LENS-recall@k: 0.254, 3-LENS-prec@k: 0.433, top-k-div: 0.090, 3-LENS-recall@k-div: 0.758, 3-LENS-prec@k-div: 0.807.

(a) Original IG (b) top-1000-div of Original IG (c) 3-LENS-recall@1000-div (d) 3-LENS-recall@1000-div 0/1
(e) Perturbed IG (f) top-1000-div of Perturbed IG (g) 3-LENS-prec@1000-div (h) 3-LENS-prec@1000-div 0/1

Figure 37: Example based on diversity with a 7×7 window size, with a sample Integrated Gradients (IG) map from the ImageNet dataset, (a) being the original IG of the unperturbed image and (b) the perturbed IG obtained with the top-k attack (Ghorbani, Abid, and Zou 2019). (c) and (d) show top-k-div highlighted, and (e) and (f) are the corresponding maps with LENS. Maps (g) and (h) correspond to (e) and (f), respectively, with non-zero value pixels shown as white. top-k: 0.108, 3-LENS-recall@k: 0.254, 3-LENS-prec@k: 0.433, top-k-div: 0.090, 3-LENS-recall@k-div: 0.758, 3-LENS-prec@k-div: 0.807.
G Additional results for PGD-trained and IG-SUM-NORM trained models

Figure 11 and Figure 12 show the impact of k in top-k for the adversarially (PGD) trained and attributionally (IG-SUM-NORM) trained networks, respectively. An important point to notice is that even with a small number of features, LENS is able to cross 70-80%, which supports the observation of sparsity and stability of attributions achieved by adversarially (PGD) trained models made by Chalasani et al. (2020). Similarly, the experiments with different values of w for w-LENS-top-k in Figure 19 clearly indicate that, owing to these stability properties, LENS crosses 80% intersection already at small window sizes, supporting that our metric captures local stability well.

While the above considers only the top-k version of LENS, Figure 14 further extends the observation to LENS-Spearman and LENS-Kendall, and shows that with a smoothing of 3 × 3 the maps from PGD-trained and IG-SUM-NORM trained models have a higher top-k intersection, above 70%, in comparison to naturally trained models across all datasets used in our experiments. This further strengthens the conclusion from previous papers that IG on PGD-trained and IG-SUM-NORM trained models gives better attributions.

Appendix E and H provide results on PGD-trained and IG-SUM-NORM trained models along with naturally trained models for a compact presentation of results.
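The numbers above are averages over a dataset at a given window size. A straightforward way to compute such curves, shown as a sketch under the same assumptions as the earlier code (lens_fn stands for any implementation of the LENS-top-k scores, e.g., the lens_top_k sketch in Section 3.1), is the following; w = 0 recovers the plain top-k intersection, as noted in the Figure 8 caption.

```python
import numpy as np


def lens_curve_over_w(attr_pairs, lens_fn, k=100, w_values=(0, 1, 2, 3)):
    """Average w-LENS-prec@k and w-LENS-recall@k over a dataset for several w.

    attr_pairs: iterable of (original, perturbed) attribution maps, one pair per image.
    lens_fn:    any function lens_fn(a1, a2, k=..., w=...) returning (prec, recall, ...)
                scores, e.g. the lens_top_k sketch from Section 3.1.
    Returns {w: (average prec, average recall)}.
    """
    curves = {}
    for w in w_values:
        scores = np.array([lens_fn(a1, a2, k=k, w=w)[:2] for a1, a2 in attr_pairs])
        curves[w] = tuple(scores.mean(axis=0))
    return curves
```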
Table 4: Top-k intersection and top-k-div results for MNIST and Fashion MNIST with a LeNet-based model using Integrated Gradients (IG) (Sundararajan, Taly, and Yan 2017). The columns first contain locality results: top-k intersection, 1-LENS-recall@k, 1-LENS-prec@k, followed by diversity results: top-k-div, 1-LENS-recall@k-div, 1-LENS-prec@k-div. Models trained naturally, with PGD and with IG-SUM-NORM are used. Results include the top-k and center of mass attacks of Ghorbani, Abid, and Zou (2019) as well as random sign perturbation.
Dataset Train Type Attack Type Attribution Method top-k intersection 2-LENS-recall@k 2-LENS-prec@k top-k-div 2-LENS-recall@k-div 2-LENS-prec@k-div
Flower Nat top-k IG 0.1789 0.4091 0.5128 0.2355 0.9482 0.9560
Nat mass center IG 0.4248 0.7196 0.6664 0.2645 0.9360 0.9420
Nat random IG 0.8206 0.9709 0.9747 0.4613 0.9741 0.9778
Flower PGD top-k IG 0.5941 0.8444 0.9165 0.4078 0.9733 0.9737
PGD mass center IG 0.6983 0.9223 0.9303 0.4247 0.9656 0.9655
PGD random IG 0.9033 0.9929 0.9934 0.5948 0.9847 0.9886
Flower IG-SUM-NORM top-k IG 0.6179 0.8720 0.9299 0.3875 0.9747 0.9757
IG-SUM-NORM mass center IG 0.7053 0.9303 0.9435 0.4031 0.9667 0.9672
IG-SUM-NORM random IG 0.8874 0.9917 0.9927 0.5707 0.9841 0.9869
Table 5: Top-k intersection and top-k-div results for Flower with a ResNet-based model using Integrated Gradients (IG) (Sundararajan, Taly, and Yan 2017). The columns first contain locality results: top-k intersection, 2-LENS-recall@k, 2-LENS-prec@k, followed by diversity results: top-k-div, 2-LENS-recall@k-div, 2-LENS-prec@k-div. Models trained naturally, with PGD and with IG-SUM-NORM are used. Results include the top-k and center of mass attacks of Ghorbani, Abid, and Zou (2019) as well as random sign perturbation.
Dataset Train Type Attack Type Attribution Method top-k intersection 3-LENS-recall@k 3-LENS-prec@k top-k-div 3-LENS-recall@k-div 3-LENS-prec@k-div
ImageNet Nat top-k IG 0.0544 0.2822 0.3189 0.0684 0.7303 0.7458
Nat mass center IG 0.0869 0.3994 0.2082 0.0914 0.7221 0.7114
Nat random IG 0.3133 0.8463 0.8460 0.2157 0.8713 0.8689
Table 6: Top-k intersection and top-k-div results for ImageNet with a SqueezeNet model using Integrated Gradients (IG) (Sundararajan, Taly, and Yan 2017). The columns first contain locality results: top-k intersection, 3-LENS-recall@k, 3-LENS-prec@k, followed by diversity results: top-k-div, 3-LENS-recall@k-div, 3-LENS-prec@k-div. Naturally trained models are used. Results include the top-k and center of mass attacks of Ghorbani, Abid, and Zou (2019) as well as random sign perturbation.
Dataset Train Type Attack Type Attribution method top-k 1-LENS-recall@k 1-LENS-prec@k top-k-div 1-LENS-recall@k-div 1-LENS-prec@k-div
MNIST Nat random Simple Gradient 0.6548 0.8872 0.9355 0.2431 0.8724 0.8942
Nat random Image × Gradient 0.6887 0.8590 0.9791 0.1640 0.6725 0.7735
Nat random LRP [Bach 2015] 0.6887 0.8590 0.9791 0.1640 0.6725 0.7733
Nat random DeepLIFT [Shrikumar 2017] 0.6896 0.8602 0.9792 0.1642 0.6725 0.7724
Nat random GradSHAP [Lundberg 2017] 0.6950 0.8629 0.9792 0.1646 0.6716 0.7707
Nat random IG [Sundararajan 2017] 0.6978 0.8636 0.9795 0.1654 0.6717 0.7705
MNIST PGD random Simple Gradient 0.4456 0.7544 0.9712 0.1798 0.8034 0.8719
PGD random Image × Gradient 0.5355 0.8102 0.9853 0.1620 0.6788 0.8037
PGD random LRP [Bach 2015] 0.6887 0.8590 0.9791 0.2786 0.8669 0.9557
PGD random DeepLIFT [Shrikumar 2017] 0.5387 0.8111 0.9855 0.1626 0.6795 0.8056
PGD random GradSHAP [Lundberg 2017] 0.6831 0.9164 0.9835 0.1611 0.6684 0.7619
PGD random IG [Sundararajan 2017] 0.6729 0.9359 0.9847 0.1671 0.6641 0.7573
Fashion-MNIST Nat random Simple Gradient 0.5216 0.8421 0.8614 0.2691 0.8718 0.8820
Nat random Image × Gradient 0.5840 0.9160 0.9317 0.2570 0.8357 0.8719
Nat random LRP [Bach 2015] 0.5840 0.9160 0.9317 0.2570 0.8357 0.8719
Nat random DeepLIFT [Shrikumar 2017] 0.5821 0.9165 0.9311 0.2560 0.8370 0.8716
Nat random GradSHAP [Lundberg 2017] 0.6075 0.9208 0.9377 0.2641 0.8391 0.8717
Nat random IG [Sundararajan 2017] 0.6279 0.9281 0.9443 0.2736 0.8387 0.8713
Fashion-MNIST PGD random Simple Gradient 0.6036 0.8374 0.9362 0.2974 0.8202 0.8696
PGD random Image × Gradient 0.6561 0.9236 0.9648 0.3118 0.8472 0.8820
PGD random LRP [Bach 2015] 0.6561 0.9236 0.9648 0.3118 0.8472 0.8820
PGD random DeepLIFT [Shrikumar 2017] 0.6628 0.9255 0.9662 0.3124 0.8496 0.8838
PGD random GradSHAP [Lundberg 2017] 0.6678 0.9428 0.9625 0.3023 0.8573 0.8723
PGD random IG [Sundararajan 2017] 0.7103 0.9638 0.9809 0.3242 0.8583 0.8751
Table 7: Top-k intersection and top-k-div results for MNIST and Fashion MNIST with a LeNet-based model trained naturally and adversarially (PGD), using different explanation methods with random sign perturbation. The columns first contain locality results: top-k intersection, 1-LENS-recall@k, 1-LENS-prec@k, followed by diversity results: top-k-div, 3-LENS-recall@k-div, 3-LENS-prec@k-div.
Dataset Train Type Attack Type Attribution method top-k 3-LENS-recall@k 3-LENS-prec@k top-k-div 3-LENS-recall@k-div 3-LENS-prec@k-div
ImageNet Nat random Simple Gradient 0.3825 0.7875 0.7949 0.3096 0.8290 0.8115
Nat random Image × Gradient 0.3316 0.7765 0.8134 0.2905 0.8655 0.8516
Nat random LRP [Bach 2015] 0.1027 0.2487 0.2649 0.0970 0.7518 0.7371
Nat random DeepLIFT [Shrikumar 2017] 0.2907 0.7641 0.7637 0.2547 0.8504 0.8578
Nat random GradSHAP [Lundberg 2017] 0.2290 0.6513 0.6778 0.1885 0.8099 0.7979
Nat random IG [Sundararajan 2017] 0.2638 0.7148 0.7064 0.2366 0.8380 0.8296
ImageNet PGD random Simple Gradient 0.1725 0.7245 0.7306 0.1410 0.8004 0.8004
PGD random Image × Gradient 0.1714 0.7269 0.8043 0.1332 0.8552 0.8687
PGD random LRP [Bach 2015] 0.2374 0.4147 0.4350 0.1240 0.8161 0.8210
PGD random DeepLIFT [Shrikumar 2017] 0.5572 0.9746 0.9808 0.2924 0.8977 0.9078
PGD random GradSHAP [Lundberg 2017] 0.1714 0.7270 0.8044 0.1336 0.8552 0.8688
PGD random IG [Sundararajan 2017] 0.1947 0.7335 0.8036 0.1524 0.8584 0.8696
Table 8: Top-k intersection and top-k-div results for ImageNet with naturally and adversarially (PGD) trained ResNet50 models using different explanation methods with random sign perturbation. The columns first contain locality results: top-k intersection, 1-LENS-recall@k, 1-LENS-prec@k, followed by diversity results: top-k-div, 3-LENS-recall@k-div, 3-LENS-prec@k-div.
I Details of Survey Conducted to Study Humans' Perception of Robustness

This appendix gives a detailed description of the survey conducted to study humans' interpretation of attribution maps.

Survey Format: Each question consisted of an unperturbed image from the Flower dataset and a pair of explanation/attribution maps. The pair can be any combination of the original attribution map and an attribution map obtained with random perturbation or with the Ghorbani, Abid, and Zou (2019) attack. We used Integrated Gradients (IG) (Sundararajan, Taly, and Yan 2017) to obtain the attribution maps. Perturbed maps were obtained using the Ghorbani, Abid, and Zou (2019) attack and random noise with an appropriate ϵ-budget. The questions were presented in random order. At no point in the survey did we reveal the type of map or the perturbation added to obtain the maps. This ensured the user was not biased by this extra information while answering the survey. A sample question presented to the user is given below:

Here is an image of Tigerlily with two attribution maps.

Do both these attribution maps explain the image well?
(1) Yes, both are similar and explain the image well.
(2) Yes, both explain the image well, but are dissimilar.
(3) Only A explains the image well, B is different.

Figure 38: Sample question from the survey using the Lily Valley image with the original IG map and the IG map under the Ghorbani, Abid, and Zou (2019) attack using ϵ = 8/255.

... despite the top-k intersection being less than 30% between the maps, the fraction of users who fall into the "Agree with 3-LENS-prec@k metric" category is large, indicating that the current top-k based comparison is weak.

In another sample from the survey (Figure 38), surprisingly, more than 50% of users, falling in the "Agree with 3-LENS-prec@k metric" category, preferred the perturbed map over the original map. The top-k and 3-LENS-prec@k values were 36% and 88%, respectively.

We also included a few questions in the survey to study the effectiveness of random perturbation as an attack, as observed by Ghorbani, Abid, and Zou (2019) [Figure 3] with the top-k metric. We used ϵ = 8/255 for the random perturbation. The results for these questions were nearly unanimous: users' responses overwhelmingly (above 90%) fell into the "Agree with 3-LENS-prec@k metric" category. This strongly indicates that random perturbation, considered as an attack under current metrics, gives a misleading picture of attributional robustness. Refer to the last entries in Table 9.

Table 9:
Attack Type | Agree with 3-LENS-prec@k metric (%) | Agree with top-k metric (%) | top-1000 intersection | 3-LENS-prec@1000
top-k  | 70.37 | 29.63 | 0.343  | 0.928
top-k  | 81.48 | 18.52 | 0.0805 | 0.357
random | 93.55 | 6.45  | 0.521  | 0.7965