Rethinking Robustness of Model Attributions: Sandesh Kamath, Sankalp Mittal, Amit Deshpande, Vineeth N Balasubramanian

Rethinking Robustness of Model Attributions
Sandesh Kamath1 , Sankalp Mittal1 , Amit Deshpande2 , Vineeth N Balasubramanian1

1
Indian Institute of Technology, Hyderabad
2
Microsoft Research, Bengaluru
sandesh.kamath@gmail.com,
arXiv:2312.10534v1 [cs.LG] 16 Dec 2023
Abstract numbers of attribution methods has led to a concerted fo-

cus on studying the robustness of attributions to input per-
For machine learning models to be reliable and trustwor- turbations to handle potential security hazards (Chen et al.
thy, their decisions must be interpretable. As these models
find increasing use in safety-critical applications, it is impor-
2019; Sarkar, Sarkar, and Balasubramanian 2021; Wang and
tant that not just the model predictions but also their expla- Kong 2022; Agarwal et al. 2022). One could view these ef-
nations (as feature attributions) be robust to small human- forts as akin to adversarial robustness that focuses on de-
imperceptible input perturbations. Recent works have shown fending against attacks on model predictions, whereas attri-
that many attribution methods are fragile and have proposed butional robustness focuses on defending against attacks on
improvements in either these methods or the model training. model explanations. For example, an explanation for a pre-
We observe two main causes for fragile attributions: first, the dicted credit card failure cannot change significantly for a
existing metrics of robustness (e.g., top-k intersection) over- small human-imperceptible change in input features, or the
penalize even reasonable local shifts in attribution, thereby saliency maps explaining the COVID risk prediction from
making random perturbations to appear as a strong attack, a chest X-ray should not change significantly with a minor
and second, the attribution can be concentrated in a small re-
gion even when there are multiple important parts in an im-
human-imperceptible change in the image.
age. To rectify this, we propose simple ways to strengthen DNN-based models are known to have a vulnerability
existing metrics and attribution methods that incorporate lo- to imperceptible adversarial perturbations (Biggio et al.
cality of pixels in robustness metrics and diversity of pixel 2013; Szegedy et al. 2014; Goodfellow, Shlens, and Szegedy
locations in attributions. Towards the role of model training 2015), which make them misclassify input images. Adver-
in attributional robustness, we empirically observe that ad- sarial training (Madry et al. 2018) is widely understood to
versarially trained models have more robust attributions on provide a reasonable degree of robustness to such perturba-
smaller datasets, however, this advantage disappears in larger
tion attacks. While adversarial robustness has received sig-
datasets. Code is made available1 .
nificant attention over the last few years (Ozdag 2018; Silva
and Najafirad 2020), the need for stable and robust attribu-
1 Introduction tions, corresponding explanation methods and their aware-
The explosive increase in the use of deep neural network ness are still in their early stages at this time (Ghorbani,
(DNN)-based models for applications across domains has Abid, and Zou 2019; Chen et al. 2019; Slack et al. 2020;
resulted in a very strong need to find ways to interpret the de- Sarkar, Sarkar, and Balasubramanian 2021; Lakkaraju, Ar-
cisions made by these models (Gade et al. 2020; Tang et al. sov, and Bastani 2020; Slack et al. 2021a,b). In an early
2021; Yap et al. 2021; Oviedo et al. 2022; Oh and Jeong effort, (Ghorbani, Abid, and Zou 2019) provided a method
2020). Interpretability is an important aspect of responsi- to construct a small imperceptible perturbation which when
ble and trustworthy AI, and model explanation methods added to an input x results in a change in attribution map
(also known as attribution methods) are an important aspect of the original map to that of the perturbed image. This is
of the community’s efforts towards explaining and debug- measured through top-k intersection, Spearman’s rank-order
ging real-world AI/ML systems. Attribution methods (Zeiler correlation or Kendall’s rank-order correlation between the
et al. 2010; Simonyan, Vedaldi, and Zisserman 2014; Bach two attribution maps (of original and perturbed images). See
et al. 2015; Selvaraju et al. 2017; Chattopadhyay et al. 2018; Figure 1 for an example. Defenses proposed against such at-
Sundararajan, Taly, and Yan 2017; Shrikumar et al. 2016; tributional attacks (Chen et al. 2019; Singh et al. 2020; Wang
Smilkov et al. 2017; Lundberg and Lee 2017) attempt to et al. 2020; Sarkar, Sarkar, and Balasubramanian 2021) also
explain the decisions made by DNN models through input- leverage the same metrics to evaluate the robustness of attri-
output attributions or saliency maps. (Lipton 2018; Samek bution methods.
et al. 2019; Fan et al. 2021; Zhang et al. 2020) present While these efforts have showcased the need and impor-
detailed surveys on these methods. Recently, the growing tance of studying the robustness of attribution methods, we
note in this work that the metrics used, and hence the meth-
1
https://github.com/ksandeshk/LENS ods, can be highly sensitive to minor local changes in attribu-
with their rank order. Besides avoiding overpenalizing at-
tribution methods for minor local drifts, we show that our
proposed LENS variants are well-motivated by metrics de-
fined on the space of attributions.
• We subsequently introduce a second measure based on di-
versity that enriches model attributions by preventing the
localized grouping of top model attributions. LENS can be
naturally applied to this measure, thereby giving a method
to incorporate both diversity and locality in measuring at-
tributional robustness.
• Our comprehensive empirical results on benchmark
datasets and models used in existing work clearly support
our aforementioned observations, as well as the need to
rethink the evaluation of the robustness of model attribu-
tions using locality and diversity.
• Finally, we also show that existing methods for robust at-
tributions implicitly support such a locality-sensitive met-
ric for evaluating progress in the field.
2 Background and Related Work

We herein discuss background literature from three different
Figure 1: Sample images from Flower dataset with Inte- perspectives that may be related to our work: model explana-
grated Gradients (IG) before and after attributional attack. tion/attribution methods, efforts on attributional robustness
The attack used here is the top-k attributional attack of Ghor- (both attacks and defenses), and other recent related work.
bani, Abid, and Zou (2019) on a ResNet model. Robustness Attribution Methods. Existing efforts on explainability in
of attribution measured by top-k intersection is small, and DNN models can be broadly categorized as: local and global
ranges from 0.04 (first image) to 0.45 (third image) as it pe- methods, model-agnostic and model-specific methods, or as
nalizes for both local changes in attribution and concentra- post-hoc and ante-hoc (intrinsically interpretable) methods
tion of top pixels in a small region. Visually, we can observe (Molnar 2019; Lecue et al. 2021). Most existing methods in
that such overpenalization leads to a wrong sense of robust- use today – including methods to visualize weights and neu-
ness as the changes are within the object of importance. rons (Simonyan, Vedaldi, and Zisserman 2014; Zeiler and
Fergus 2014), guided backpropagation (Springenberg et al.
2015), CAM (Zhou et al. 2016), GradCAM (Selvaraju et al.
tions (see Fig 1 row 2). We, in fact, show (in Appendix B.1) 2017), Grad-CAM++ (Chattopadhyay et al. 2018), LIME
that under existing metrics to evaluate robustness of attribu- (Ribeiro, Singh, and Guestrin 2016), DeepLIFT (Shriku-
tions, a random perturbation can be as strong an attributional mar et al. 2016; Shrikumar, Greenside, and Kundaje 2017),
attack as existing benchmark methods. This may not be a LRP (Bach et al. 2015), Integrated Gradients (Sundararajan,
true indicator of the robustness of a model’s attributions, and Taly, and Yan 2017), SmoothGrad (Smilkov et al. 2017)),
can mislead further research efforts in the community. We DeepSHAP (Lundberg and Lee 2017) and TCAV (Kim et al.
hence focus our efforts in this work on rethinking metrics 2018) – are post-hoc methods, which are used on top of a
and methods to study the robustness of model attributions pre-trained DNN model to explain its predictions. We fo-
(in particular, we study image-based attribution methods to cus on such post-hoc attribution methods in this work. For
have a focused discussion and analysis). Beyond highlight- a more detailed survey of explainability methods for DNN
ing this important issue, we propose locality-sensitive im- models, please see (Lecue et al. 2021; Molnar 2019; Samek
provements of the above metrics that incorporate the locality et al. 2019).
of attributions along with their rank order. We show that such Robustness of Attributions. The growing numbers of attri-
a locality-sensitive distance is upper-bounded by a metric bution methods proposed has also led to efforts on identify-
based on symmetric set difference. We also introduce a new ing the desirable characteristics of such methods (Alvarez-
measure top-k-div that incorporates diversity of a model’s Melis and Jaakkola 2018; Adebayo et al. 2018; Yeh et al.
attributions. Our key contributions are summarized below: 2019; Chalasani et al. 2020; Tomsett et al. 2020; Bog-
gust et al. 2022; Agarwal et al. 2022). A key desired trait
• Firstly, we observe that existing robustness metrics for that has been highlighted by many of these efforts is ro-
model attributions overpenalize minor drifts in attribution, bustness or stability of attributions, i.e., the explanation
leading to a false sense of fragility. should not vary significantly within a small local neigh-
• In order to address this issue, we propose Locality- borhood of the input (Alvarez-Melis and Jaakkola 2018;
sENSitive (LENS) improvements of existing met- Chalasani et al. 2020). Ghorbani, Abid, and Zou (2019)
rics, namely, LENS-top-k, LENS-Spearman and LENS- showed that well-known methods such as gradient-based at-
Kendall, that incorporate the locality of attributions along tributions, DeepLIFT (Shrikumar, Greenside, and Kundaje
Top-1000 Intersection
2017) and Integrated Gradients (IG) (Sundararajan, Taly, 1.0 1.0
0.9 0.9
and Yan 2017) are vulnerable to such input perturbations, 0.8 0.8
0.7 0.7
and also provided an algorithm to construct a small imper- 0.6 0.6
0.5 0.5
ceptible perturbation which when added to the input results 0.4 0.4
in changes in the attribution. Slack et al. (2020) later showed 0.3 0.3
0.2 0.2
that methods like LIME (Ribeiro, Singh, and Guestrin 2016) 0.1 0.1
0.0 0.0
and DeepSHAP (Lundberg and Lee 2017) are also vulnera- 0 1 2 4 8 0 1 2 4 8
ble to such manipulations. The identification of such vulner-
1.0 1.0
ability and potential for attributional attacks has since led 0.9 0.9
3-LENS-recall@k
to multiple research efforts to make a model’s attributions 0.8 0.8
0.7 0.7
robust. Chen et al. (2019) proposed a regularization-based 0.6 0.6
0.5 0.5
approach, where an explicit regularizer term is added to the 0.4 0.4
0.3 0.3
loss function to maintain the model gradient across input 0.2 0.2
0.1 0.1
(IG, in particular) while training the DNN model. This was 0.0 0.0
subsequently extended by (Sarkar, Sarkar, and Balasubra- 0 1 2 4 8 0 1 2 4 8
manian 2021; Singh et al. 2020; Wang et al. 2020), all of 1.0 1.0
3-LENS-recall@k-div
whom provide different training strategies and regularizers 0.9 0.9
0.8 0.8
to improve attributional robustness of models. Each of these 0.7 0.7
0.6 0.6
methods including Ghorbani, Abid, and Zou (2019) mea- 0.5 0.5
0.4 0.4
sures change in attribution before and after input perturba- 0.3 0.3
0.2 0.2
tion using the same metrics: top-k intersection, and/or rank 0.1 0.1
0.0 0.0
correlations like Spearman’s ρ and Kendall’ τ . Such metrics
0 1 2 4 8 0 1 2 4 8
have recently, in fact, further been used to understand issues
surrounding attributional robustness (Wang and Kong 2022).
Other efforts that quantify stability of attributions in tabular Figure 2: From top to bottom, we plot average top-k in-
data also use Euclidean distance (or its variants) between tersection (currently used metric), 3-LENS-recall@k and
the original and perturbed attribution maps (Alvarez-Melis 3-LENS-recall@k-div (proposed metrics) against the ℓ∞ -
and Jaakkola 2018; Yeh et al. 2019; Agarwal et al. 2022). norm of attributional attack perturbations for Simple Gra-
Each of these metrics look for dimension-wise correlation dients (SG) (left) and Integrated Gradients (IG) (right) of a
or pixel-level matching between attribution maps before and SqueezeNet model on Imagenet. We use k = 1000 and three
after perturbation, and thus penalize even a minor change in attributional attack variants proposed by Ghorbani, Abid,
attribution (say, even by one pixel coordinate location). This and Zou (2019). Evidently, the proposed metrics show more
results in a false sense of fragility, and could even be mis- robustness under the same attacks.
leading. In this work, we highlight the need to revisit such
metrics, and propose variants based on locality and diversity
that can be easily integrated into existing metrics. 3 Locality-sENSitive Metrics (LENS) for
Attributional Robustness
As a motivating example, Figure 2 presents the results ob-
tained using (Ghorbani, Abid, and Zou 2019) with Sim-
ple Gradients (SG) and Integrated Gradients (IG) of an NN
model trained on ImageNet. The top row, which reports the
Other Related Work. In other related efforts that have currently followed top-k intersection measure of attribution
studied similar properties of attribution-based explanations, robustness, shows a significant drop in robustness perfor-
(Carvalho, Pereira, and Cardoso 2019; Bhatt, Weller, and mance even for the random sign attack (green line). The
Moura 2020) stated that stable explanations should not vary subsequent rows, which report our metrics for the same ex-
too much between similar input samples, unless the model’s periments, show significant improvements in robustness –
prediction changes drastically. The abovementioned attribu- especially when combining the notions of locality and di-
tional attacks and defense methods (Ghorbani, Abid, and versity. Observations made on current metrics could lead to
Zou 2019; Sarkar, Sarkar, and Balasubramanian 2021; Singh a false sense of fragility, which overpenalizes even an attri-
et al. 2020; Wang et al. 2020) maintain this property, since bution shift by 1-2 pixels. A detailed description of our ex-
they focus on input perturbations that change the attribu- perimental setup for these results is available in Appendix C.
tion without changing the model prediction itself. Similarly, Motivated by these observations, we explore improved mea-
Arun et al. (2020) and Fel et al. (2022) introduced the no- sures for attributional robustness that maintain the overall
tions of repeatability/reproducibility and generalizability re- requirements of robustness, but do not overpenalize minor
spectively, both of which focus on the desired property that deviations.
a trustworthy explanation must point to similar evidence
across similar input images. In this work, we provide a prac- 3.1 Defining LENS Metrics for Attributions
tical metric to study this notion of similarity by considering To begin with, we propose an extension of existing simi-
locality-sensitive metrics. larity measures to incorporate the locality of pixel attribu-
interesting properties. Let a1 and a2 be two attribution
vectors for two images, and let Sk and Tk be the set
of top k pixels in these images according to a1 and
a2 , respectively. We define a locality-sensitive top-k
distance between two attribution vectors a1 and a2 as
(w) def (w) (w)
dk (a1 , a2 ) = preck (a1 , a2 ) + recallk (a1 , a2 ),
(w) def |Sk \Nw (Tk )|
where preck (a1 , a2 ) = k and
(w) def |Tk \Nw (Sk )|
recallk (a1 , a2 ) = k ,
similar to precision
and recall used in ranking literature, with the key difference
being the inclusion of neighborhood items based on locality.
(w)
Figure 3: A sample image from Flower dataset before (top) Below we state a monotonicity property of dk (a1 , a2 ) and
and after (bottom) the top-k attributional attack of (Ghor- upper bound it in terms of the symmetric set difference of
bani, Abid, and Zou 2019) on a ResNet model for Integrated top-k attributions.
Gradients (IG) attribution method. From left to right: the im- (w )
age, its top-k pixels as per IG, the union of the 3 × 3-pixel Proposition 1. For any w1 ≤ w2 , we have dk 2 (a1 , a2 ) ≤
(w )
neighborhoods and 5 × 5-pixel neighborhoods of the top-k dk 1 (a1 , a2 ) ≤ |Sk △Tk | /k, where △ denotes the symmet-
pixels, respectively, for k = 1000. Quantitatively, top-k in- ric set difference, i.e., A△B = (A \ B) ∪ (B \ A).
tersection: 0.14, 1-LENS-recall@k: 0.25, 1-LENS-pre@k: (w)
Combining dk (a1 , a2 ) across different values of k and
0.37, 2-LENS-recall@k: 0.40, 2-LENS-pre@k: 0.62.
w, we can define a distance
∞ ∞
(w)
X X
d(a1 , a2 ) = αk βw dk (a1 , a2 ),
tions in images to derive more practical and useful measures
k=1 w=0
of attributional robustness. Let aij (x) denote the attribution
value or importance assigned to the (i, j)-th pixel in an in- where αk and βw be non-negative weights, monotonically
P
put image x, and let Sk (x) denote the set of k pixel po- decreasing
P in k and w, respectively, such that k αk < ∞
sitions with the largest attribution values. Let Nw (i, j) = and w βw < ∞. We show that the distance defined above
{(p, q) : i − w ≤ p ≤ i + w, j − w ≤ q ≤ j + w} be the is upper-bounded by a metric similar to those proposed in
neighboring pixel positions within a (2w + 1) × (2w + 1) (Fagin, Kumar, and Sivakumar 2003) based on symmetric
window around the (i, j)-th pixel. BySa slight abuse of no- set difference of top-k ranks to compare two rankings.
tation, we use Nw (Sk (x)) to denote (i,j)∈Sk (x) Nw (i, j), Proposition 2. d(a1 , a2 ) defined above is upper-bounded
that is, the set of all pixel positions that lie in the union of by u(a1 , a2 ) given by
(2w + 1) × (2w + 1) windows around the top-k pixels. ∞ ∞
For a given attributional perturbation Att(·), let Tk =
X X |Sk △Tk |
u(a1 , a2 ) = αk βw ,
Sk (x + Att(x)) denote the top-k pixels in attribution val- w=0
k
k=1
ues after applying the attributional perturbation Att(x).
The currently used top-k intersection metric is then com- and u(a1 , a2 ) defines a bounded metric on the space of at-
puted as: |Sk (x) ∩ Tk (x)| /k. To address the abovemen- tribution vectors.
tioned issues, we instead propose Locality-sENSitive top- Note that top-k intersection, Spearman’s ρ and Kendall’s
k metrics (LENS-top-k) as |Nw (Sk (x)) ∩ Tk (x)| /k and τ do not take the attribution values aij (x)’s into account
|Sk (x) ∩ Nw (Tk (x))| /k, which are also closer to more but only the rank order of pixels according to these values.
widely used metrics such as precision and recall in We also define a locality-sensitive w-smoothed attribution
ranking methods. We simiarly define Locality-sENSitive as follows.
Spearman’s ρ (LENS-Spearman) and Locality-sENSitive
(w) 1 X
Kendall’s τ (LENS-Kendall) metrics as rank correlation ãij (x) = apq (x)
(2w + 1) 2
coefficients for the smoothed ranking orders according to (p,q)∈Nw (i,j),
ãij (x)’s and ãij (x + Att(x))’s, respectively. These can be 1≤p,q≤n
used to compare two different attributions for the same im-
We show that the w-smoothed attribution leads to a contrac-
age, the same attribution method on two different images, or
tion in the ℓ2 norm commonly used in theoretical analysis of
even two different attributions on two different images, as
simple gradients as attributions.
long as the attribution vectors lie in the same space, e.g.,
images of the same dimensions where attributions assign Proposition 3. For any inputs x, y and any w ≥ 0,
importance values to pixels. Figure 3 provides the visual- ã(w) (x) − ã(w) (y) 2 ≤ ∥a(x) − a(y)∥2 .
ization of the explanation map of a sample from the Flower Thus, any theoretical bounds on the attributional robust-
dataset with the top-1000 pixels followed by the correspond- ness of simple gradients in ℓ2 norm proved in previous
ing maps with 1-LENS@k and 2-LENS@k. works continue to hold for locality-sensitive w-smoothed
We show that the proposed locality-sensitive variants gradients. For example, (Wang et al. 2020) show the fol-
of the robustness metrics also possess some theoretically lowing Hessian-based bound on simple gradients. For an
1.0
input x and a classifier or model defined by f , let ∇x (f )
and ∇y (f ) be the simple gradients w.r.t. the inputs at x 0.8
and y. Theorem 3 in (Wang et al. 2020) upper bounds the
w-LENS@k
0.6
ℓ2 distance between the simple gradients of nearby points
∥x − y∥2 ≤ δ as ∥∇x (f ) − ∇y (f )∥2 ≲ δ λmax (Hx (f )), 0.4
where Hx (f ) is the Hessian of f w.r.t. the input at x and top-k

λmax (Hx (f )) is its maximum eigenvalue. By Proposition 3 0.2 1-LENS-recall@k
1-LENS-prec@k
above, the same continues to hold for w-smoothed gradients, 0.0
MNIST Fashion-MNIST GTSRB Flower
i.e., ∇ ˜ (w) ˜ (w)
x (f ) − ∇y (f ) ≲ δ λmax (Hx (f )). The proofs
2
of all the propositions above are included in Appendix D. Figure 4: Attributional robustness of IG on naturally trained
models measured as average top-k intersection, 1-LENS-
3.2 Relevance to Attributional Robustness prec@k and 1-LENS-recall@k between IG(original image)
and IG(perturbed image) obtained by the top-t attack (Ghor-
The top-k intersection is a measure of similarity instead bani, Abid, and Zou 2019) across different datasets.
of distance. Therefore, in our experiments for attribu-
tional robustness, we use locality-sensitive similarity mea- 1.0
top-k
1.0
top-k-div
sures w-LENS-prec@k and w-LENS-recall@k to denote 0.8
1-LENS-prec@k
0.8
1-LENS-prec@k-div
2-LENS-prec@k 2-LENS-prec@k-div
w-LENS-prec@k-div
(w) (w)
w-LENS-prec@k
1 − preck (a1 , a2 ) and 1 − recallk (a1 , a2 ), respectively, 3-LENS-prec@k 3-LENS-prec@k-div
0.6 0.6
where a1 is the attribution of the original image and a2 is
0.4
the attribution of the perturbed image. For rank correlation 0.4
coefficients such as Kendall’s τ and Spearman’s ρ, we com- 0.2 0.2
pute w-LENS-Kendall and w-LENS-Spearman as the same 0.0 0.0

Kendall’s τ and Spearman’s ρ but computed on the locality- top-k random center of mass top-k random center of mass
sensitive w-smoothed attribution map ã(w) instead of the
Figure 5: Effect of increasing w on average w-LENS-
original attribution map a. We also study how these simi-
prec@k and w-LENS-prec@k-div in comparison with top-
larity measures and their resulting attributional robustness
k intersection for IG map on ImageNet using a SqueezeNet
measures change as we vary w. In this section, we mea-
model, when attacked with three attributional attacks (viz.,
sure the attributional robustness of Integrated Gradients (IG)
top-k, random sign perturbation and mass center) of Ghor-
on naturally trained models as top-k intersection, w-LENS-
bani, Abid, and Zou (2019).
prec@k and w-LENS-recall@k between the IG of the origi-
nal images and the IG of their perturbations obtained by vari-
ous attacks. The attacks we consider are the top-t attack and
the mass-center attack of Ghorbani, Abid, and Zou (2019) w-LENS-prec@k for varying w. In Figure 5(left) w-
as well as random perturbation. All perturbations have ℓ∞ LENS-prec@k increases as we increase w to consider larger
norm bounded by δ = 0.3 for MNIST, δ = 0.1 for Fashion neighborhoods around the pixels with top attribution values.
MNIST, and δ = 8/255 for GTSRB and Flower datasets. This holds for multiple perturbations, namely, top-t attack
The values of t used to construct top-t attacks of Ghor- and mass-center attack by Ghorbani, Abid, and Zou (2019)
bani, Abid, and Zou (2019) are t = 200 on MNIST, t = 100 as well as a random perturbation. Notice that the top-t attack
on Fashion MNIST and GTSRB, t = 1000 on Flower. In of Ghorbani, Abid, and Zou (2019) is constructed specifi-
the robustness evaluations for a fixed k, we use k = 100 on cally for the top-t intersection objective, and perhaps as a re-
MNIST, Fashion MNIST, GTSRB, and k = 1000 on Flower. sult, shows larger change when we increase local-sensitivity
by increasing w in the robustness measure.
Comparison of top-k intersection, 1-LENS-prec@k and Due to space constraint and purposes of coherence, we
1-LENS-recall@k. Figure 4 shows that top-k intersection present few results with IG here; we present similar results
penalizes IG even for small, local changes. 1-LENS-prec@k on other explanation methods in the Appendix E. Refer to
and 1-LENS-recall@k values are always higher in compari- Appendix E.2 for similar plots with random sign pertur-
son across all datasets in our experiments. Moreover, on both bation and mass center attack of Ghorbani, Abid, and Zou
MNIST and Fashion MNIST, 1-LENS-prec@k is roughly (2019). Appendix E.3 contains additional results with sim-
2x higher (above 90%) compared to top-k intersection (near ilar conclusions when Simple Gradients are used instead of
40%). In other words, an attack may appear stronger under a Integrated Gradients (IG) for obtaining the attributions.
weaker measure of attributional robustness, if it ignores lo- As a natural follow-up question we present in Appendix
cality. This increase clearly shows that the top-k attack of E.1 results obtained by modifying the similarity objective
Ghorbani, Abid, and Zou (2019) appears to be weaker on of top-k attack of Ghorbani, Abid, and Zou (2019) with 1-
these datasets as the proportional increase by using locality LENS-prec@k with the assumption to obtain a stronger at-
indicates that the attack is only creating a local change than tack. But surprisingly, we notice that it leads to a worse at-
previously thought. We can see that for MNIST, Fashion- tributional attack, if we measure its effectiveness using the
MNIST and GTSRB for < 20% of the samples, the top-k at- top-k intersections and 1-LENS-prec@k. In other words, at-
tack was able to make changes larger than what 1-LENS@k tributional attacks against locality-sensitive measures of at-
could measure. tributional robustness are non-trivial and may require funda-
mentally different ideas. ple, when an image has multiple important parts, concen-
tration of top attribution pixels in a small region increases
3.3 Alignment of attributional robustness metrics vulnerability to attributional attacks. To alleviate this vul-
to human perception nerability, we propose post-processing any given attribution
We conducted a survey with human participants, where we method to output top-k diverse pixels instead of just the top-
presented images from the Flower dataset and a pair of at- k pixels with the highest attribution scores. We use a natu-
tribution maps—an attribution map of the original image ral notion of w-diversity based on pixel neighborhoods, so
alongside an attribution map of their random perturbation that these diverse pixels can be picked by a simple greedy
or attributional attacked version Ghorbani, Abid, and Zou algorithm. Starting with S ← ∅, repeat for k steps: Pick
(2019), in a random order and without revealing this in- the pixel of highest attribution score or importance outside
formation to the participants. The survey participants were S, add it to S and disallow the (2w + 1) × (2w + 1)-
asked whether the two maps were relatable to the image and pixel neighborhood around it for future selection. The set
if one of them was different than the other. In Table 1 we of k diverse pixels picked as above contains no two pixels
summarize the results obtained from the survey. We sim- within (2w + 1) × (2w + 1)-pixel neighborhood of each
plify the choices presented to the user into 2 final categories other, and moreover, has the highest total importance (as
- (1) Agree with w-LENS-prec@k (2) Agree with top-k met- the sum of pixel-wise attribution scores) among all such sets
ric Category (1) includes all results where the user found the of k pixels. The sets of k pixels where no two pixels lie in
maps the same, relatable to the image but dissimilar or the (2w + 1) × (2w + 1)-pixel neighborhood of each other form
perturbed map was preferred over the original map. Cate- a matroid, where the optimality of greedy algorithm is well-
gory (2) was the case where the user preferred the original known; see Korte and Lovász (1981).
map over the perturbed map. Refer to Appendix I for more Once we have the top-k diverse pixels as described above,
details. we can extend our locality-sensitive robustness metrics from
Agree with 3-LENS-prec@k metric(%) Agree with top-k metric(%) the previous section to w-LENS-prec@k-div and w-LENS-
70.37 29.63 recall@k-div, defined analogously using the union of (2w +
1) × (2w + 1)-pixel neighborhoods of top-k diverse pixels.
Table 1: Survey results based on humans ability to relate the In other words, define S̃k (x) as the top-k diverse pixels for
explanation map to the original image with or without noise image x and T̃k = S̃k (x + Att(x)), and use S˜k and T̃k to
using the Flower dataset. replace Sk and Tk used in Subsection 3.1.
For k = 1000, Figure 6 shows a sample image from
Flower dataset before and after the top-k attributional attack
of Ghorbani, Abid, and Zou (2019). Figure 6 visually shows
the top-k diverse pixels in the Integrated Gradients (IG) and
the union of their (2w + 1) × (2w + 1)-pixel neighborhoods,
for w = {1, 2}, for this image before and after the attri-
butional attack. The reader may be required to zoom in to
see the top-k diverse pixels. See Appendix F for more ex-
amples. Note that 0-LENS-prec@k and 0-LENS-recall@k
are both the same and equivalent to top-k intersection. How-
ever, a combined effect of locality and diversity can show a
drastic leap from top-k intersection value 0.14 to 2-LENS-
recall@k-div value 0.95 (see Fig.3 and Fig.6). Fig. 5(right)
Figure 6: A sample image from Flower dataset before (top) shows the effect of increasing w on the w-LENS-prec@k-
and after (bottom) the top-k attributional attack of Ghorbani, div metric on ImageNet.
Abid, and Zou (2019) on a ResNet model. For both, we
show from left to right: the image, its top-k diverse pixels 5 A Stronger Model for Attributional
as per IG, the union of 3 × 3-pixel neighborhoods and 5 × 5- Robustness
pixel neighborhoods of the top-k diverse pixels, respectively,
for k = 1000. Quantitatively, improved overlap is captured A common approach to get robust attributions is to keep the
by top-k-div intersection: 0.22, 1-LENS-recall@k-div: 0.87, attribution method unchanged but train the models differ-
1-LENS-pre@k-div: 0.86, 2-LENS-recall@k-div: 0.95, 2- ently in a way that the resulting attributions are more robust
LENS-pre@k-div: 0.93. Zoom in required to see the diverse to small perturbations of inputs. Chen et al. (2019) proposed
pixels. the first defense against the attributional attack of Ghorbani,
Abid, and Zou (2019). Wang et al. (2020) also find that IG-
NORM based training of Chen et al. (2019) gives models
that exhibit attributional robustness against the top-k attack
4 Diverse Attribution for Robustness of Ghorbani, Abid, and Zou (2019) along with adversarially
Column 1 of Figure 6 shows a typical image from Flower trained models. Figure 7 shows a sample image from the
dataset whose top-1000 pixels according to IG are concen- Flower dataset, where the Integrated Gradients (IG) of the
trated in a small region. As seen in this illustrative exam- original image and its perturbation by the top-k attack are vi-
to IG if we simply measure its effectiveness using 1-LENS-
prec@k (Appendix E.2, E.3 for MNIST, Fashion MNIST
and GTSRB). The above observation about robustness of
Integrated Gradients (IG) for PGD-trained and IG-SUM-
NORM trained models holds even when we use 1-LENS-
Spearman and 1-LENS-Kendall measures to quantify the at-
tributional robustness to the top-k attack of Ghorbani, Abid,
and Zou (2019), and it holds across the datasets used by
Chen et al. (2019) in their study; see Appendix E.
Chalasani et al. (2020) show theoretically that ℓ∞ -
adversarial training (PGD-training) leads to stable Integrated
Gradients (IG) under ℓ1 norm. They also show empiri-
cally that PGD-training leads to sparse attributions (IG &
DeepSHAP) when sparseness in measured indirectly as the
change in Gini index. Our empirical results extend their the-
Figure 7: From left to right: a sample image from Flower oretical observation about stability of IG for PGD-trained
dataset and Integrated Gradients (IG) before and after models, as we measure local stability in terms of both the
the top-k attributional attack of Ghorbani, Abid, and Zou top attribution values and their positions in the image.
(2019). The top row uses PGD-trained model whereas the Table 2 obtains the top-k intersection, 3-LENS-recall@k,
bottom row uses IG-SUM-NORM-trained model. and 3-LENS-recall@k-div of different attribution meth-
ods on ImageNet for naturally trained and PGD-trained
ResNet50 models. We observe that for random sign attack
sually similar for models that are either adversarially trained the improvement obtained on top-k intersection is reduced
(trained using Projected Gradient Descent or PGD-trained, for a large dataset like ImageNet. Still our conclusions about
as proposed by (Madry et al. 2018)) or IG-SUM-NORM locality and diversity in attribution robustness in compari-
trained as in Chen et al. (2019). In other words, these dif- son with the top-k intersection baseline holds as we observe
ferently trained model guard the sample image against the improvements in using diversity and locality. More results
attributional top-k attack. Recent work by Nourelahi et al. about incorporating diversity in the attribution and the re-
(2022) has empirically studied the effectiveness of adversar- sulting robustness metrics are available in Appendix H.
ially (PGD) trained models in obtaining better attributions, Training Attribution method top-k 3-LENS-recall@k 3-LENS-recall@k-div
e.g., Figure 7(center) shows sharper attributions to features Natural Simple Gradient 0.3825 0.7875 0.8290
Natural Image × Gradient 0.3316 0.7765 0.8655
highlighting the ground-truth class. Natural LRP [Bach 2015] 0.1027 0.2487 0.7518
Natural DeepLIFT [Shrikumar 2017] 0.2907 0.7641 0.8504
Natural GradSHAP [Lundberg 2017] 0.2290 0.6513 0.8099
1.0 Natural IG [Sundararajan 2017] 0.2638 0.7148 0.8380
PGD Simple Gradient 0.1725 0.7245 0.8004
0.8 PGD Image × Gradient 0.1714 0.7269 0.8552
PGD LRP [Bach 2015] 0.2374 0.4147 0.8161
0.6
PGD DeepLIFT [Shrikumar 2017] 0.5572 0.9746 0.8977
w-LENS@k
PGD GradSHAP [Lundberg 2017] 0.1714 0.7270 0.8552

PGD IG [Sundararajan 2017] 0.1947 0.7335 0.8584
0.4
0.2
top-k Table 2: Average top-k intersection, 3-LENS-prec@k and 3-
1-LENS-recall@k
0.0
1-LENS-prec@k LENS-prec@k-div for random sign perturbation attack ap-
Natural PGD IG-SUM-NORM plied to different attribution methods on ImageNet for natu-
rally and adversarially(PGD)-trained ResNet50 models.
Figure 8: For Flower dataset, average top-k intersec-
tion, 1-LENS-prec@k, 1-LENS-recall@k measured be-
tween IG(original image) and IG(perturbed image) for mod-
els that are naturally trained, PGD-trained and IG-SUM- 6 Conclusion and Future Work
NORM trained. The perturbation used is the top-t attack of We show that the fragility of attributions is an effect of us-
(Ghorbani, Abid, and Zou 2019). Note top-k is equivalent to ing fragile robustness metrics such as top-k intersection that
0-LENS-prec@k, 0-LENS-recall@k. only look at the rank order of attributions and fail to cap-
ture the locality of pixel positions with high attributions. We
Figure 8 shows that PGD-trained and IG-SUM-NORM highlight the need for locality-sensitive metrics for attribu-
trained models have more robust Integrated Gradients (IG) tional robustness and propose natural locality-sensitive ex-
in comparison to their naturally trained counterparts, and tensions of existing metrics. We introduce another method of
this holds for the previously used measures of attribu- picking diverse top-k pixels that can be naturally extended
tional robustness (e.g., top-k intersection) as well as the with locality to obtain improved measure of attributional
new locality-sensitive measures we propose (e.g., 1-LENS- robustness. Theoretical understanding of locality-sensitive
prec@k, 1-LENS-recall@k) across all datasets in Chen et al. metrics of attributional robustness, constructing stronger at-
(2019) experiments (Refer Appendix E.2 and E.3). The top- tributional attacks for these metrics, and using them to build
k attack of Ghorbani, Abid, and Zou (2019) is not a threat attributionally robust models are important future directions.
References Fel, T.; Vigouroux, D.; Cadène, R.; and Serre, T. 2022. How
Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I. J.; Hardt, good is your explanation? algorithmic stability measures to
M.; and Kim, B. 2018. Sanity Checks for Saliency Maps. In assess the quality of explanations for deep neural networks.
Bengio, S.; Wallach, H. M.; Larochelle, H.; Grauman, K.; In Proceedings of the IEEE/CVF Winter Conference on Ap-
Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neu- plications of Computer Vision, 720–730.
ral Information Processing Systems 31: Annual Conference Gade, K.; Geyik, S. C.; Kenthapadi, K.; Mithal, V.; and Taly,
on Neural Information Processing Systems 2018, NeurIPS A. 2020. Explainable AI in industry: practical challenges
2018, December 3-8, 2018, Montréal, Canada, 9525–9536. and lessons learned: implications tutorial. In Hildebrandt,
Agarwal, C.; Johnson, N.; Pawelczyk, M.; Krishna, S.; Sax- M.; Castillo, C.; Celis, L. E.; Ruggieri, S.; Taylor, L.; and
ena, E.; Zitnik, M.; and Lakkaraju, H. 2022. Rethinking Zanfir-Fortuna, G., eds., FAT* ’20: Conference on Fairness,
Stability for Attribution-based Explanations. arXiv preprint Accountability, and Transparency, Barcelona, Spain, Jan-
arXiv:2203.06877. uary 27-30, 2020, 699. ACM.
Ghorbani, A.; Abid, A.; and Zou, J. Y. 2019. Interpretation
Alvarez-Melis, D.; and Jaakkola, T. S. 2018. On the
of Neural Networks Is Fragile.
robustness of interpretability methods. arXiv preprint
arXiv:1806.08049. Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explain-
ing and Harnessing Adversarial Examples.
Arun, N.; Gaw, N.; Singh, P.; Chang, K.; Aggarwal, M.;
Chen, B.; et al. 2020. Assessing the (Un) Trustworthiness of He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep Residual
saliency maps for localizing abnormalities in medical imag- Learning for Image Recognition. CoRR, abs/1512.03385.
ing. arXiv. Preprint]. Iandola, F. N.; Moskewicz, M. W.; Ashraf, K.; Han, S.;
Dally, W. J.; and Keutzer, K. 2016. SqueezeNet: AlexNet-
Bach, S.; Binder, A.; Montavon, G.; Klauschen, F.; Müller,
level accuracy with 50x fewer parameters and <1MB model
K.-R.; and Samek, W. 2015. On Pixel-Wise Explanations for
size. CoRR, abs/1602.07360.
Non-Linear Classifier Decisions by Layer-Wise Relevance
Propagation. In:PloS One 10.7 (2015), e0130140. Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.;
Viegas, F.; et al. 2018. Interpretability beyond feature at-
Bhatt, U.; Weller, A.; and Moura, J. M. 2020. Evaluating tribution: Quantitative testing with concept activation vec-
and aggregating feature-based model explanations. arXiv tors (tcav). In International conference on machine learning,
preprint arXiv:2005.00631. 2668–2677. PMLR.
Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Korte, B.; and Lovász, L. 1981. Mathematical structures
Laskov, P.; Giacinto, G.; and Roli, F. 2013. Evasion Attacks underlying greedy algorithms. In Gécseg, F., ed., Funda-
against Machine Learning at Test Time. Lecture Notes in mentals of Computation Theory, 205–209. Springer Berlin
Computer Science, 387–402. Heidelberg. ISBN 978-3-540-38765-7.
Boggust, A.; Suresh, H.; Strobelt, H.; Guttag, J. V.; and Lakkaraju, H.; Arsov, N.; and Bastani, O. 2020. Robust and
Satyanarayan, A. 2022. Beyond Faithfulness: A Framework Stable Black Box Explanations. In International Conference
to Characterize and Compare Saliency Methods. CoRR, on Machine Learning, 5628–5638.
abs/2206.02958. Lecue, F.; Guidotti, R.; Minervini, P.; and Giannotti, F.
Carvalho, D. V.; Pereira, E. M.; and Cardoso, J. S. 2019. 2021. 2021 Explainable AI Tutorial. https://xaitutorial2021.
Machine learning interpretability: A survey on methods and github.io/. Visited on 14-09-2021.
metrics. Electronics, 8(8): 832. Lipton, Z. C. 2018. The mythos of model interpretability.
Chalasani, P.; Chen, J.; Chowdhury, A. R.; Wu, X.; and Jha, Commun. ACM, 61(10): 36–43.
S. 2020. Concise Explanations of Neural Networks using Lundberg, S. M.; and Lee, S. 2017. A Unified Approach to
Adversarial Training. In Proceedings of the 37th Interna- Interpreting Model Predictions.
tional Conference on Machine Learning, ICML 2020, 13-18 Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and
July 2020, Virtual Event, volume 119 of Proceedings of Ma- Vladu, A. 2018. Towards Deep Learning Models Resistant
chine Learning Research, 1383–1391. PMLR. to Adversarial Attacks.
Chattopadhyay, A.; Sarkar, A.; Howlader, P.; and Balasubra- Molnar, C. 2019. Interpretable Machine Learning. https:
manian, V. N. 2018. Grad-CAM++: Generalized Gradient- //christophm.github.io/interpretable-ml-book/.
Based Visual Explanations for Deep Convolutional Net-
Nourelahi, M.; Kotthoff, L.; Chen, P.; and Nguyen, A.
works. 839–847.
2022. How explainable are adversarially-robust CNNs?
Chen, J.; Wu, X.; Rastogi, V.; Liang, Y.; and Jha, S. 2019. arXiv:2205.13042.
Robust Attribution Regularization. Oh, C.; and Jeong, J. 2020. VODCA: Verification of Diag-
Fagin, R.; Kumar, R.; and Sivakumar, D. 2003. Comparing nosis Using CAM-Based Approach for Explainable Process
Top k Lists. SIAM Journal on Discrete Mathematics, 17(1): Monitoring. Sensors, 20(23): 6858.
134–160. Oviedo, F.; Ferres, J. L.; Buonassisi, T.; and Butler, K. T.
Fan, F.; Xiong, J.; Li, M.; and Wang, G. 2021. On 2022. Interpretable and Explainable Machine Learning for
Interpretability of Artificial Neural Networks: A Survey. Materials Science and Chemistry. Accounts of Materials Re-
arXiv:2001.02522 [cs, stat]. search, 3(6): 597–607.
Ozdag, M. 2018. Adversarial Attacks and Defenses Against Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan,
Deep Neural Networks: A Survey. Procedia Computer Sci- D.; Goodfellow, I. J.; and Fergus, R. 2014. Intriguing prop-
ence, 140: 152–161. erties of neural networks.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. ”Why Tang, S.; Ghorbani, A.; Yamashita, R.; Rehman, S.; Dunn-
Should I Trust You?”: Explaining the Predictions of Any mon, J. A.; Zou, J. Y.; and Rubin, D. L. 2021. Data Valuation
Classifier. for Medical Imaging Using Shapley Value: Application on A
Large-scale Chest X-ray Dataset. Scientific Reports(Nature
Samek, W.; Montavon, G.; Vedaldi, A.; Hansen, L. K.; and
Publisher Group).
Müller, K., eds. 2019. Explainable AI: Interpreting, Ex-
plaining and Visualizing Deep Learning, volume 11700 of Tomsett, R.; Harborne, D.; Chakraborty, S.; Gurram, P.; and
Lecture Notes in Computer Science. Springer. ISBN 978-3- Preece, A. D. 2020. Sanity Checks for Saliency Metrics.
030-28953-9. In The Thirty-Fourth AAAI Conference on Artificial Intelli-
gence, AAAI 2020, The Thirty-Second Innovative Applica-
Sarkar, A.; Sarkar, A.; and Balasubramanian, V. N. 2021. tions of Artificial Intelligence Conference, IAAI 2020, The
Enhanced Regularizers for Attributional Robustness. Tenth AAAI Symposium on Educational Advances in Artifi-
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; cial Intelligence, EAAI 2020, New York, NY, USA, February
Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual Explana- 7-12, 2020, 6021–6029. AAAI Press.
tions from Deep Networks via Gradient-Based Localization. Wang, F.; and Kong, A. W. 2022. Exploiting the Relation-
618–626. ship Between Kendall’s Rank Correlation and Cosine Simi-
Shrikumar, A.; Greenside, P.; and Kundaje, A. 2017. Learn- larity for Attribution Protection. CoRR, abs/2205.07279.
ing Important Features Through Propagating Activation Dif- Wang, Z.; Wang, H.; Ramkumar, S.; Mardziel, P.; Fredrik-
ferences. son, M.; and Datta, A. 2020. Smoothed Geometry for Robust
Shrikumar, A.; Greenside, P.; Shcherbina, A.; and Kundaje, Attribution.
A. 2016. Not Just a Black Box: Learning Important Features Yap, M.; Johnston, R. L.; Foley, H.; MacDonald, S.; Kon-
Through Propagating Activation Differences. 1605.01713. drashova, O.; Tran, K. A.; Nones, K.; Koufariotis, L. T.;
Silva, S. H.; and Najafirad, P. 2020. Opportunities and Chal- Bean, C.; Pearson, J. V.; Trzaskowski, M.; and Waddell, N.
lenges in Deep Learning Adversarial Robustness: A Survey. 2021. Verifying explainability of a deep learning tissue clas-
CoRR, abs/2007.00753. sifier trained on RNA-seq data. Scientific Reports(Nature
Publisher Group).
Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Deep
Inside Convolutional Networks: Visualising Image Classifi- Yeh, C.-K.; Hsieh, C.-Y.; Suggala, A.; Inouye, D. I.; and
cation Models and Saliency Maps. Ravikumar, P. K. 2019. On the (in) fidelity and sensitivity of
explanations. Advances in Neural Information Processing
Singh, M.; Kumari, N.; Mangla, P.; Sinha, A.; Balasubrama- Systems, 32.
nian, V. N.; and Krishnamurthy, B. 2020. Attributional Ro-
Zeiler, M. D.; and Fergus, R. 2014. Visualizing and Under-
bustness Training Using Input-Gradient Spatial Alignment.
standing Convolutional Networks. In Proceedings of The
Slack, D.; Hilgard, A.; Lakkaraju, H.; and Singh, S. 2021a. European Conference on Computer Vision (ECCV).
Counterfactual Explanations Can Be Manipulated. In Ad- Zeiler, M. D.; Krishnan, D.; Taylor, G. W.; and Fergus, R.
vances in Neural Information Processing Systems. 2010. Deconvolutional networks. In The Twenty-Third IEEE
Slack, D.; Hilgard, A.; Singh, S.; and Lakkaraju, H. 2021b. Conference on Computer Vision and Pattern Recognition,
Reliable Post hoc Explanations: Modeling Uncertainty in CVPR 2010, San Francisco, CA, USA, 13-18 June 2010,
Explainability. In Advances in Neural Information Process- 2528–2535. IEEE Computer Society.
ing Systems, 9391–9404. Zhang, Y.; Tiňo, P.; Leonardis, A.; and Tang, K.
Slack, D.; Hilgard, S.; Jia, E.; Singh, S.; and Lakkaraju, H. 2020. A Survey on Neural Network Interpretability.
2020. Fooling LIME and SHAP: Adversarial Attacks on arXiv:2012.14261 [cs].
Post hoc Explanation Methods. Zhou, B.; Khosla, A.; Lapedriza, À.; Oliva, A.; and Torralba,
Smilkov, D.; Thorat, N.; Kim, B.; Viégas, F. B.; and Wat- A. 2016. Learning Deep Features for Discriminative Local-
tenberg, M. 2017. SmoothGrad: removing noise by adding ization. In 2016 IEEE Conference on Computer Vision and
noise. 1706.03825. Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June
27-30, 2016, 2921–2929. IEEE Computer Society.
Springenberg, J. T.; Dosovitskiy, A.; Brox, T.; and Ried-
miller, M. A. 2015. Striving for Simplicity: The All Con-
volutional Net.
Sundararajan, M.; Taly, A.; and Yan, Q. 2017. Axiomatic
Attribution for Deep Networks. In Precup, D.; and Teh,
Y. W., eds., Proceedings of the 34th International Confer-
ence on Machine Learning, ICML 2017, Sydney, NSW, Aus-
tralia, 6-11 August 2017, volume 70 of Proceedings of Ma-
chine Learning Research, 3319–3328. PMLR.
A Supplementary : Rethinking Robustness
of Model Attributions
The Appendix contains proofs, additional experiments to
show that the trends hold across different datasets and other
ablation studies which could not be included in the main pa-
per due to space constraints.
B Attributional Robustness Metrics are

(a) Original image (b) IG(original image)
Weak
In this section, we empirically show that the existing met-
rics for attributional robustness are weak and inadequate as
they allow even a small random perturbation to appear like
a decent attributional attack.
B.1 Random Vectors are Attributional Attacks

under Existing Metrics
Random vectors of a small ℓ∞ norm are often used as base- (c) IG after random (d) IG after universal
lines of input perturbations (both in adversarial robustness
(Silva and Najafirad 2020) and attributional robustness lit- Figure 9: Sample image from MNIST on LeNet based model
erature (Ghorbani, Abid, and Zou 2019)), since it is known shows that the Integrated Gradients (IG) after a universal
that predictions of neural network models are known to be random perturbation are more dissimilar than IG after a sim-
resilient to random perturbations of inputs. Previous work ple, independent random perturbation for each input. All
by Ghorbani, Abid, and Zou (2019) has shown random per- perturbations have random ±1 coordinates, scaled down to
turbations to be a reasonable baseline to compare against have ℓ∞ norm ϵ = 0.3. (c) has a top-k intersection of 0.68,
their attributional attack. Extending it further, we show that while (d) has a top-k intersection of 0.62. With our locality-
a single input-agnostic random perturbation happens to be sensitive metric, (c) has 1-LENS@k of 0.99 and (d) has 1-
an effective universal attributional attack if we measure at- LENS@k of 1.0.
tributional robustness using a weak metric based on top-k
intersection. In other words, considering even a random per-
turbation happens to be a good attributional attack under tional attack than using an independent random perturbation
such metrics, we show that existing metrics for attributional for each input.
robustness such as top-k intersection are extremely fragile,
i.e., they would unfairly deem many attribution methods as Dataset Perturbation top-k intersection Spearman’s ρ Kendall’s τ
MNIST random 0.7500 0.5347 0.4337
fragile. universal random 0.5855 0.4831 0.4063
Integrated Gradients (IG) is a well-known attribution Fashion MNIST random 0.5385 0.6791 0.5152
universal random 0.5280 0.7154 0.5688
method based on well-defined axiomatic foundations (Sun- GTSRB random 0.8216 0.9433 0.8136
dararajan, Taly, and Yan 2017), which is commonly used in universal random 0.9293 0.9887 0.9243
Flower random 0.8202 0.9562 0.8340
attributional robustness literature (Chen et al. 2019; Sarkar, universal random 0.9344 0.9908 0.9321
Sarkar, and Balasubramanian 2021). We take a naturally
trained CNN model on MNIST and perturb the images using Table 3: Attributional robustness of IG on naturally trained
a random perturbation (an independent random perturbation models measured using average top-k intersection, Spear-
per input image) as well as a single, input-agnostic or uni- man’s ρ and Kendall’s τ between IG(original image) and
versal random perturbation for all images. Figure 9 shows IG(perturbed image). k = 100 for MNIST, Fashion MNIST,
a sample image from the MNIST dataset and the visual dif- GTSRB and k = 1000 for Flower.
ference between the IG of the original image, the IG after
adding a random perturbation, and the IG after adding a uni-
versal random perturbation. The IG after the universal ran-
dom attack (Figure 9d) is visually more dissimilar to the IG C Details of Experimental Setup
of the original image (Figure 9b) than the IG of a simple ran- The detailed description of the setup used in our experi-
dom perturbation (Figure 9c). (Note that top-k intersection ments.
between Figure 9b and 9c is only 0.62, although the two look Datasets: We use the standard benchmark train-test split
similar. As stated in the caption, a locality-sensitive metric of all the datasets used in this work, that is publicly available.
shows them to be closer in attribution however.) MNIST dataset consists of 70, 000 images of 28 × 28 size,
Similarly, Table 3 shows that under existing metrics to divided into 10 classes: 60, 000 used for training and 10, 000
quantify attributional robustness of IG on a naturally trained for testing. Fashion MNIST dataset consists of 70, 000 im-
CNN model, even a single, input-agnostic or universal ran- ages of 28 × 28 size, divided into 10 classes: 60, 000 used
dom perturbation can sometimes be a more effective attribu- for training and 10, 000 for testing. GTSRB dataset con-
sists of 51, 739 images of 32 × 32 size, divided into 43 and Zou (2019) depending on the dataset. We used Pytorch5 -
classes: 34, 699 used for training, 4, 410 for validation and Captum6 to obtain the various explanation maps with the
12, 630 for testing. Flower dataset consist of 1, 360 images random sign perturbation experiments.
of 128 × 128 size, divided into 17 classes: 1, 224 used for
training and 136 for testing. GTSRB and Flower datasets D Proofs from Section 3.1
were preprocessed exactly as given in (Chen et al. 2019)[Ap- We restate and prove Proposition 1 below.
pendix C] for consistency of results. ImageNet dataset con-
(w )
sists of images of 227 × 227 size, divided into 1000 classes. Proposition 4. For any w1 ≤ w2 , we have dk 2 (a1 , a2 ) ≤
50, 000 for validation were used to obtain samples for test- (w )
dk 1 (a1 , a2 ) ≤ |Sk △Tk | /k, where △ denotes the symmet-
ing. ric set difference, i.e., A△B = (A \ B) ∪ (B \ A).
Architectures: For MNIST, Fashion MNIST, GTSRB
and Flower datasets we use the exact architectures as Proof. The inequalities follows immediately using S ⊆
used by Chen et al. (2019), ImageNet dataset we use Nw1 (S) ⊆ Nw2 (S), for any S, and hence, |S \ Nw (T )| ≤
SqueezeNet(Iandola et al. 2016) as given by Ghorbani, |S \ T |, for any S, T and w.
Abid, and Zou (2019) and ResNet50(He et al. 2015).
Attribution robustness metrics: We use the same compar- We restate and prove Proposition 2 below.
ison metrics as used by Ghorbani, Abid, and Zou (2019) Proposition 5. d(a1 , a2 ) defined above is upper bounded by
and Chen et al. (2019) like top-k pixels intersection, Spear- u(a1 , a2 ) given by
man’s ρ and Kendall’s τ rank correlation to compare attri- ∞ ∞
bution maps of the original and perturbed images. The k X X |Sk △Tk |
u(a1 , a2 ) = αk βw ,
value for top-k attack along with settings like step size, num- k
k=1 w=0
ber of steps and number of times attack is to be applied is
as used by Chen et al. (2019) for the attack construction : and u(a1 , a2 ) defines a bounded metric on the space of at-
MNIST(200,0.01,100,3), Fashion MNIST(100,0.01,100,3), tribution vectors.
GTSRB(100,1.0,50,3), Flower(1000,1.0,100,3) and Ima-
geNet(1000,1.0,100,3) (Ghorbani, Abid, and Zou 2019). Proof. Proof follows from Proposition 1 and using the
Sample sizes for attribution robustness evaluations: IG fact that symmetric set difference satisfies triangle inequal-
based experiments For MNIST, Fashion MNIST and Flower ity.
with fixed top-k attack similar to Chen et al. (2019) the com- We restate and prove Proposition 3 below.
plete test set were used to obtain the results. For GTSRB
a random sample of size 1000 was used for all the experi- Proposition 6. For any inputs x, y and any w ≥ 0,
ments. Simple gradient based experiments For MNIST and ã(w) (x) − ã(w) (y) 2 ≤ ∥a(x) − a(y)∥2 .
Fashion MNIST a random sample of 2500/1000 from the 2
test set. For GTSRB, a random sample of size 1000 and Proof. ã(w) (x) − ã(w) (y) 2
the complete test set for the Flower dataset. For ImageNet, X (w) 2
(w)
we used the samples provide by Ghorbani, Abid, and Zou = ãij (x) − ãij (y)
(2019). We used around 500 random samples for the ran- 1≤i,j≤n
dom sign perturbation results using Pytorch/Captum.  2
Adversarial training: We use the standard setup as
used by (Chen et al. 2019). We perform PGD based
X 1  X 
= 4
 (apq (x) − apq (y))
adversarial training with the provided ϵ budget using (2w + 1)  
1≤i,j≤n (p,q)∈Nw (i,j),
the following settings (number of steps, step size) for 1≤p,q≤n
PGD : MNIST (40,0.01), Fashion MNIST(20,0.01), GT- X (2w + 1)2 X 2
SRB(7,2/255), Flower(7,2/255). For ImageNet, we used the ≤ (apq (x) − apq (y))
PGD based pre-trained ResNet50 model with ℓ∞ -norm of (2w + 1)4
1≤i,j≤n (p,q)∈Nw (i,j),
ϵ = 8/255 provided in the robustness package2 . 1≤p,q≤n
Training for Attributional Robustness: We use the IG- by Cauchy-Schwarz inequality
SUM-NORM objective function for all the datasets study
based on (Chen et al. 2019) based training. With the exact
setting as given in paper with code 3 .
Hardware Configuration: We used a server with 4 Nvidia 1 X X 2
= (apq (x) − apq (y))
GeForce GTX 1080i GPU and a server with 8 Nvidia Tesla (2w + 1)2
1≤i,j≤n (p,q)∈Nw (i,j),
V100 GPU to run the experiments in the paper. 1≤p,q≤n
Explanation methods: The Tensorflow 4 code for IG, SG, (2w + 1)2 X 2
DeepLIFT we used (Chen et al. 2019) and Ghorbani, Abid, ≤ (apq (x) − apq (y))
(2w + 1)2
1≤p,q≤n
2
https://github.com/MadryLab/robustness/
3 5
https://github.com/jfc43/robust-attribution-regularization https://github.com/pytorch
4 6
https://www.tensorflow.org https://github.com/pytorch/captum
1.0
0.975 0.95 0.8
0.9 0.950
Average top-k Intersection

0.90 0.7
0.925
0.8 1-LENS-prec@k 0.900 0.85
top-k 0.6
because each (p, q) appears in at most (2w+1)2 possibles

0.7 0.875 0.80
0.850 0.5
1-LENS-prec@k 0.75 1-LENS-prec@k 1-LENS-prec@k
0.6 0.825
top-k 0.70 top-k 0.4
top-k
0.800
Nw (i, j)’s 20 40 60
Eval no. of top-k
80 100 20 40 60
Eval no. of top-k
80 100 20 40 60
Eval no. of top-k
80 100 0 200 400 600
Eval no. of top-k
800 1000
(a) MNIST IG : (b) F-MNIST IG (c) GTSRB IG : (d) Flower IG :

2 top-k : top-k top-k top-k
= ∥a(x) − a(y)∥2 . 1.0
0.95 0.95
0.9 0.95

0.90 0.90
0.8 0.90
1-LENS-prec@k 0.85
0.7 0.85 top-k 0.85
0.80
0.6
0.80 0.80
1-LENS-prec@k 0.75 1-LENS-prec@k 1-LENS-prec@k
0.5
top-k 0.75 0.70 top-k 0.75
top-k
E Additional results for Section 3 20 40 60

Eval no. of top-k
80 100 20 40 60
Eval no. of top-k
80 100 20 40 60
Eval no. of top-k
80 100 0 200 400 600
Eval no. of top-k
800 1000
(e) MNIST IG : (f) F-MNIST IG (g) GTSRB IG : (h) Flower IG :

E.1 Additional results for Section 3.2 random : random random random
The effect of varying k. Figure 10 shows a large disparity
between top-k intersection and 1-LENS-prec@k even when Figure 11: Attributional robustness of IG on adversari-
k is large. Figure 10 shows that top-k intersection can be ally(PGD) trained models measured as average top-k in-
very low even when the IG of the original and the IG of the tersection and 1-LENS-prec@k between IG(original image)
perturbed images are locally very similar, as indicated by and IG(perturbed image). Perturbations are obtained by the
high 1-LENS-prec@k. Our observation holds for the per- top-k attack (Ghorbani, Abid, and Zou 2019) and ran-
turbations obtained by the top-k attack (Ghorbani, Abid, dom perturbation. The plots show how the above measures
and Zou 2019) as well as a random perturbation across all change with varying k across different datasets.
datasets in our experiments. Figure 11 and 12 show the re-
sults on PGD and IG-SUM-NORM trained networks.
1.00
0.35 0.95 0.85
0.9 0.9 0.95 0.95
0.9 0.90 0.80

0.8 0.8 0.30 0.90 0.90
0.85 0.75
0.8
0.7 0.7 0.25 0.85 0.80
1-LENS-prec@k 1-LENS-prec@k 1-LENS-prec@k 0.85 1-LENS-prec@k 0.70
0.7 0.6 0.80 top-k top-k 0.75
top-k top-k 0.6 0.20
0.80
0.65
0.5 0.75 0.70
0.6 0.5 0.15 0.60
0.4 0.75 0.65
1-LENS-prec@k 1-LENS-prec@k 0.70 1-LENS-prec@k 0.55 1-LENS-prec@k
0.5 0.4 0.10 0.60
0.3
top-k top-k 0.65 0.70 top-k 0.50
top-k
0.3 0.55
0.2 0.05 20 40 60 80 100 20 40 60 80 100 20 40 60 80 100 0 200 400 600 800 1000
20 40 60 80 100 20 40 60 80 100 20 40 60 80 100 0 200 400 600 800 1000 Eval no. of top-k Eval no. of top-k Eval no. of top-k Eval no. of top-k
Eval no. of top-k Eval no. of top-k Eval no. of top-k Eval no. of top-k
(a) MNIST: top- (b) F-MNIST: (c) GTSRB: top- (d) Flower: top- (a) MNIST IG : (b) F-MNIST IG (c) GTSRB IG : (d) Flower IG :
k top-t k k top-k : top-k top-k top-k
1.0 1.00
1.00 0.975
0.95
0.9 0.95 0.95
0.95 0.9 0.95 0.950
0.90

0.90 0.8 0.90 0.90 0.925

0.85 0.8
0.90 0.900
0.85
0.7 1-LENS-prec@k 0.85
0.80 0.7
0.85
0.875
0.80 top-k 0.80 0.85 0.80 0.850
0.6 0.75 0.6
0.75 0.825
0.70 1-LENS-prec@k 0.5 0.75
1-LENS-prec@k 0.70 1-LENS-prec@k 0.5 1-LENS-prec@k 0.80 1-LENS-prec@k 0.75 1-LENS-prec@k 0.800
1-LENS-prec@k
top-k 0.70 top-k 0.65 top-k top-k top-k 0.70
top-k top-k
0.65 0.4 0.75 0.775
20 40 60 80 100 20 40 60 80 100 20 40 60 80 100 0 200 400 600 800 1000 20 40 60 80 100 20 40 60 80 100 20 40 60 80 100 0 200 400 600 800 1000
Eval no. of top-k Eval no. of top-k Eval no. of top-k Eval no. of top-k Eval no. of top-k Eval no. of top-k Eval no. of top-k Eval no. of top-k
(e) MNIST: ran- (f) F-MNIST: (g) GTSRB: ran- (h) Flower: ran- (e) MNIST IG : (f) F-MNIST IG (g) GTSRB IG : (h) Flower IG :
dom random dom dom random : random top-k random
Figure 10: Attributional robustness of IG on naturally Figure 12: Attributional robustness of IG on IG-SUM-
trained models measured as average top-k intersection NORM trained models measured as average top-k intersec-
and 1-LENS-prec@k between IG(original image) and tion and 1-LENS-prec@k between IG(original image) and
IG(perturbed image). Perturbations are obtained by the top- IG(perturbed image). Perturbations are obtained by the top-
k attack (Ghorbani, Abid, and Zou 2019) and random per- k attack (Ghorbani, Abid, and Zou 2019) and random per-
turbation. The plots show how the above measures change turbation. The plots show how the above measures change
with varying k across different datasets. with varying k across different datasets.
Comparison of Spearman’s ρ and Kendall’s τ with 1-

LENS-Spearman and 1-LENS-Kendall. Figure 13 com-
pares Spearman’s ρ and Kendall’s τ with 1-LENS-Spearman Zou (2019) seem weaker under locality-senstitive robust-
and 1-LENS-Kendall measures for attributional robustness. ness measures only because the attack was specifically con-
We observe that 1-smoothing of attribution maps increases structed for a corresponding top-k intersection objective.
the corresponding Kendall’s τ and Spearman’s ρ measures Since the construction of the attack in Ghorbani, Abid, and
of attributional robustness, and this observation holds across Zou (2019) is modifiable for any similarity objective, we use
all datasets in our experiments. As a result, we believe that 1-LENS-prec@k to construct a new attributional attack for
1-LENS-Spearman and 1-LENS-Kendall result in better or 1-LENS-prec@k objective based on the k × k neighborhood
tighter attributional robustness measures than Spearman’s ρ of pixels. Surprisingly, we notice that it leads to a worse at-
and Kendall’s τ . Additional results in Figure 14. tributional attack, if we measure its effectiveness using the
top-k intersection; see Figure 15. In other words, attribu-
Modifying the attack of Ghorbani, Abid, and Zou (2019) tional attacks against locality-sensitive measures of attribu-
for 1-LENS-prec@k objective A natural question is tional robustness are non-trivial and may require fundamen-
whether the original top-k attack of Ghorbani, Abid, and tally different ideas.
1.0
Spearman's
1-LENS-Spearman
Kendall's
Average top-k Intersection 0.8 1-LENS-Kendall
0.6 1.0
top-k
0.8 1-LENS-prec@k
0.4
w-LENS@k
0.6
0.2
0.4
0.0
MNIST Fashion MNIST GTSRB Flower 0.2
Dataset
0.0
MNIST Fashion-MNIST Flower
Figure 13: Attributional robustness of IG on naturally
trained models measured as average Spearman’s ρ, 1- Figure 15: Average top-k intersection between IG(original
LENS-Spearman, Kendall τ and 1-LENS-Kendall between image) and IG(perturbed image) on naturally trained mod-
IG(original image) and IG(perturbed image). The perturba- els where the perturbation is obtained by incorporating 1-
tions are obtained by the top-t attack of Ghorbani, Abid, and LENS-prec@k objective in the Ghorbani, Abid, and Zou
Zou (2019). (2019) attack.
1.0 1.0
Spearman's
0.8 1-LENS-Spearman 0.8
Kendall's
0.6 1-LENS-Kendall 0.6
0.4 0.4 Spearman's

1-LENS-Spearman
0.2 0.2 Kendall's
1-LENS-Kendall
0.0 0.0
Natural PGD IG-SUM-NORM Natural PGD IG-SUM-NORM
Type of Training Type of Training
(a) MNIST (b) Fashion MNIST

1.0 1.0
1.0 1.0
0.8 0.8
0.8 0.8

0.6 0.6
0.6 0.6
0.4 Spearman's 0.4 Spearman's 0.4 0.4
1-LENS-Spearman 1-LENS-Spearman top-k top-k
0.2 Kendall's 0.2 Kendall's 0.2 1-LENS-recall@k 0.2 1-LENS-recall@k
1-LENS-Kendall 1-LENS-Kendall
0.0 0.0 1-LENS-prec@k 1-LENS-prec@k
Natural PGD IG-SUM-NORM Natural PGD IG-SUM-NORM 0.0 0.0
Type of Training Type of Training Natural PGD IG-SUM-NORM Natural PGD IG-SUM-NORM
(c) GTSRB (d) Flower (a) MNIST (b) Fashion MNIST

Figure 14: Average Kendall’s τ , Spearman’s ρ, 1-LENS- 1.0 1.0
Kendall and 1-LENS-Spearman used to measure the attri- 0.8 0.8

Average topK Intersection
butional robustness of IG on natrually trained, PGD-trained 0.6 0.6
and IG-SUM-NORM trained models. The perturbation used 0.4 0.4
is the top-k attack of Ghorbani, Abid, and Zou (2019). top-k top-k
0.2 1-LENS-recall@k 0.2 1-LENS-recall@k
Shown for (a) MNIST, (b) Fashion MNIST, (c) GTSRB and 1-LENS-prec@k 1-LENS-prec@k
0.0 0.0
(d) Flower datasets. Type of Training Type of Training
(c) GTSRB (d) Flower
E.2 Experiments with Integrated Gradients Figure 16: Average top-k intersection, 1-LENS-prec@k, 1-
Below we present additional experimental results for Inte- LENS-recall@k measured between IG(original image) and
grated Gradients (IG). IG(perturbed image) for models that are naturally trained,
PGD-trained and IG-SUM-NORM trained. The perturbation
used is the top-k attack of (Ghorbani, Abid, and Zou 2019).
Shown for (a) MNIST, (b) Fashion MNIST, (c) GTSRB and
(d) Flower datasets.
1.0 1.0
0.8 0.8

0.6 0.6
0.4 0.4
top-k top-k
1-LENS-prec@k 1-LENS-prec@k
0.0 0.0

1.0 1.0
0.8 0.8
0.6 0.6
0.4 0.4
top-k top-k
1-LENS-prec@k 1-LENS-prec@k 1.0 1.0
0.0 0.0
Type of Training Type of Training 0.8 0.8

0.6 0.6
0.4 0.4
top-k old metric
Figure 17: Attributional robustness of IG on naturally, PGD 0.2 1-LENS-prec@k
2-LENS-prec@k 0.2 1-LENS-prec@k
2-LENS-prec@k
and IG-SUM-NORM trained models measured as top-k in- 0.0
Natural PGD IG-SUM-NORM
0.0
tersection, 1-LENS-prec@k and 1-LENS-recall@k between Type of Attack Type of Attack
the IG of the original images and the IG of their perturba- (a) IG : top-k (b) IG : random
tions obtained by the random sign attack (Ghorbani, Abid,
1.0
and Zou 2019) across different datasets.
0.8
0.6
0.4
top-k
1-LENS-prec@k
0.2 2-LENS-prec@k
3-LENS-prec@k
0.0
Type of Attack
1.0 1.0
(c) IG : center of mass
0.8 0.8
0.6 0.6 Figure 19: Attributional robustness of IG on naturally, PGD

0.4 0.4 and IG-SUM-NORM trained models measured as top-k in-
top-k top-k
0.2 1-LENS-recall@k 0.2 1-LENS-recall@k tersection and w-LENS-prec@k between the IG of the orig-
0.0
0.0
inal images and the IG of their perturbations. Perturbations
Type of Training Type of Training are obtained by the top-k attack and the mass center attack
(a) MNIST (b) Fashion MNIST (Ghorbani, Abid, and Zou 2019) as well as a random per-
turbation. The plots show the effect of varying w on Flower
1.0 1.0
dataset.
0.8 0.8
0.6 0.6
0.4 0.4
top-k top-k
0.0 0.0
Figure 18: Attributional robustness of IG on naturally, PGD

and IG-SUM-NORM trained models measured as top-k in-
tersection, 1-LENS-prec@k and 1-LENS-recall@k between
the IG of the original images and the IG of their perturba-
tions obtained by the mass center attack (Ghorbani, Abid,
and Zou 2019) across different datasets.
E.3 Experiments with Simple Gradients 1.0 1.0
We observe that our conclusions about Integrated Gradients 0.8 0.8

(IG) continue to hold qualitatively, even if we replace IG 0.6 0.6
with Simple Gradients as our attribution method. 0.4 0.4

top-k top-k
1.0 1.0 0.0 0.0
0.8 0.8

0.6 0.6
0.4 0.4 1.0
top-k top-k
0.2 1-LENS-recall@k 0.2 1-LENS-recall@k 0.8

0.0 0.0 0.6
0.4
(a) MNIST (b) Fashion MNIST top-k

0.2 1-LENS-recall@k
1-LENS-prec@k
1.0 1.0 0.0
top-k Natural PGD IG-SUM-NORM
1-LENS-recall@k Type of Training
1-LENS-prec@k
0.8 0.8
0.6 0.6
(c) Flower
0.4 0.4
top-k Figure 22: Attributional robustness of Simple Gradients on
0.2 0.2 1-LENS-recall@k
1-LENS-prec@k naturally, PGD and IG-SUM-NORM trained models mea-
0.0 0.0
Natural PGD
Type of Training
IG-SUM-NORM Natural PGD
Type of Training
IG-SUM-NORM sured as top-k intersection, 1-LENS-prec@k and 1-LENS-
recall@k between the Simple Gradient of the original im-
(c) GTSRB (d) Flower ages and the Simple Gradient of their perturbations obtained
by the mass center attack (Ghorbani, Abid, and Zou 2019)
Figure 20: Attributional robustness of Simple Gradients on across different datasets.
naturally, PGD and IG-SUM-NORM trained models mea-
sured as top-k intersection, 1-LENS-prec@k and 1-LENS-
ages and the Simple Gradient of their perturbations obtained 1.0 top-k
1-LENS-prec@k
1.0
2-LENS-prec@k
by the top-k attack (Ghorbani, Abid, and Zou 2019) across 0.8 3-LENS-prec@k 0.8

top-k
1-LENS-prec@k
different datasets. 0.6 0.6
2-LENS-prec@k
3-LENS-prec@k
0.4 0.4
0.2 0.2
1.0 1.0 0.0 0.0

0.8 0.8
0.6 0.6
(a) SG : top-k (b) SG : random
0.4 0.4 1.0 top-k
1-LENS-prec@k
top-k top-k 2-LENS-prec@k
0.2 0.2 1-LENS-recall@k 0.8 3-LENS-prec@k
1-LENS-recall@k
0.0 0.0 0.6
0.4
(a) MNIST (b) Fashion MNIST 0.2
1.0 1.0 0.0

Type of Training
0.8 0.8
0.6 0.6
(c) SG : center of mass
0.4 0.4
top-k top-k Figure 23: Attributional robustness of Simple Gradients on
0.2 1-LENS-recall@k 0.2 1-LENS-recall@k naturally, PGD and IG-SUM-NORM trained models mea-
0.0
Natural PGD
Type of Training
IG-SUM-NORM
0.0
Natural PGD
Type of Training
IG-SUM-NORM sured as top-k intersection and w-LENS-prec@k between
the IG of the original images and the IG of their perturba-
(c) GTSRB (d) Flower tions. Perturbations are obtained by the top-k attack and the
mass center attack (Ghorbani, Abid, and Zou 2019) as well
Figure 21: Attributional robustness of Simple Gradients on as a random perturbation. The plots show the effect of vary-
naturally, PGD and IG-SUM-NORM trained models mea- ing w on Flower dataset.
sured as top-k intersection, 1-LENS-prec@k and 1-LENS-
ages and the Simple Gradient of their perturbations obtained
by the random sign attack (Ghorbani, Abid, and Zou 2019)
across different datasets.
E.4 Experiments with Other Explanation
Methods
In this section we present all the results with other explana-
tion methods other than IG and SG shown in previous sec-
tions with naturally and PGD trained models using mainly
random sign perturbation.
Top-1000 Intersection
1.0 1.0 1.0 1.0
0.9 0.9 0.9 0.9
0.8 0.8 0.8 0.8
0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
0.5 0.5 0.5 0.5
0.4 0.4 0.4 0.4
0.3 0.3 0.3 0.3
0.2 0.2 0.2 0.2
0.1 0.1 0.1 0.1
0.0 0.0 0.0 0.0
0 1 2 4 8 0 1 2 4 8 0 1 2 4 8 0 1 2 4 8
1.0 1.0 1.0 1.0

0.9 0.9 0.9 0.9
3-LENS-recall@k
0.8 0.8 0.8 0.8

0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
0.5 0.5 0.5 0.5
0.4 0.4 0.4 0.4
0.3 0.3 0.3 0.3
0.2 0.2 0.2 0.2
0.1 0.1 0.1 0.1
0.0 0.0 0.0 0.0
0 1 2 4 8 0 1 2 4 8 0 1 2 4 8 0 1 2 4 8
1.0 1.0 1.0 1.0

3-LENS-recall@k-div
0.9 0.9 0.9 0.9

0.8 0.8 0.8 0.8
0.7 0.7 0.7 0.7
0.6 0.6 0.6 0.6
0.5 0.5 0.5 0.5
0.4 0.4 0.4 0.4
0.3 0.3 0.3 0.3
0.2 0.2 0.2 0.2
0.1 0.1 0.1 0.1
0.0 0.0 0.0 0.0
0 1 2 4 8 0 1 2 4 8 0 1 2 4 8 0 1 2 4 8
Figure 24: Similar to Figure 2, From top to bottom, we plot average top-k intersection (currently used metric, top), 3-LENS-recall@k
and 3-LENS-recall@k-div (proposed metrics, middle and bottom respectively) against attributional attack perturbations for four attribution
methods of a SqueezeNet model (as used by Ghorbani, Abid, and Zou (2019)) on Imagenet: (left) Simple Gradients (SG), (center left)
Integrated Gradients (IG), (center right) DeepLift, (right) Deep Taylor Decomposition. We use k = 1000 with an ℓ∞ -norm attack and three
attack variants proposed by Ghorbani, Abid, and Zou (2019). Evidently, the proposed metrics show more robustness under the same attacks.
1.0 1.0 1.0 1.0 1.0 1.0

top-k top-k top-k top-k top-k top-k
1-LENS-recall@k 1-LENS-recall@k 3-LENS-recall@k 1-LENS-recall@k 1-LENS-recall@k 3-LENS-recall@k
0.8
1-LENS-prec@k 0.8
1-LENS-prec@k 0.8
3-LENS-prec@k 0.8
1-LENS-prec@k 0.8
1-LENS-prec@k 0.8
3-LENS-prec@k
w-LENS@k
w-LENS@k
w-LENS@k
w-LENS@k
w-LENS@k
w-LENS@k
0.6 0.6 0.6 0.6 0.6 0.6
0.4 0.4 0.4 0.4 0.4 0.4
0.2 0.2 0.2 0.2 0.2 0.2
0.0 0.0 0.0 0.0 0.0 0.0

Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div
(a) MNIST (b) Fashion MNIST (c) ImageNet (a) MNIST (b) Fashion MNIST (c) ImageNet
Figure 25: Attributional robustness of Simple Gradients(SG) Figure 26: Attributional robustness of Image × Gradients
on naturally and PGD trained models measured as top-k on naturally and PGD trained models measured as top-k
intersection, w-LENS-prec@k and w-LENS-recall@k be- intersection, w-LENS-prec@k and w-LENS-recall@k be-
tween the Simple Gradients(SG) of the original images and tween the Image × Gradients of the original images and of
of their perturbations obtained by the random sign perturba- their perturbations obtained by the random sign perturbation
tion across different datasets. across different datasets.
1.0 1.0 1.0
top-k top-k top-k
1-LENS-recall@k 1-LENS-recall@k 3-LENS-recall@k
0.8
1-LENS-prec@k 0.8
1-LENS-prec@k 0.8
3-LENS-prec@k
w-LENS@k
w-LENS@k
w-LENS@k
0.6 0.6 0.6
0.4 0.4 0.4
0.2 0.2 0.2
0.0 0.0 0.0

Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div
(a) MNIST (b) Fashion MNIST (c) ImageNet
Figure 27: Attributional robustness of LRP (Bach et al.

2015) on naturally and PGD trained models measured
as top-k intersection, w-LENS-prec@k and w-LENS-
1.0 1.0
recall@k between the LRP of the original images and of top-k top-k
their perturbations obtained by the random sign perturbation 1-LENS-prec@k 1-LENS-prec@k
0.8 0.8
across different datasets. 3-LENS-prec@k 3-LENS-prec@k
w-LENS-prec@k
w-LENS-prec@k
0.6 0.6
1.0 1.0 1.0

top-k top-k top-k
1-LENS-recall@k 1-LENS-recall@k 3-LENS-recall@k 0.4 0.4
0.8
1-LENS-prec@k 0.8
1-LENS-prec@k 0.8
3-LENS-prec@k
w-LENS@k
w-LENS@k
w-LENS@k
0.6 0.6 0.6

0.2 0.2
0.4 0.4 0.4
0.0 0.0
0.2 0.2 0.2
Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div
0.0 0.0 0.0
(a) Simple Gradient (b) Image × Gradient
1.0 1.0
(a) MNIST (b) Fashion MNIST (c) ImageNet top-k top-k
0.8 0.8
Figure 28: Attributional robustness of DeepLIFT (Shriku- 3-LENS-prec@k 3-LENS-prec@k
w-LENS-prec@k
w-LENS-prec@k
mar, Greenside, and Kundaje 2017) on naturally and PGD 0.6 0.6
trained models measured as top-w intersection, w-LENS- 0.4 0.4

prec@k and w-LENS-recall@k between the DeepLIFT of
the original images and of their perturbations obtained by 0.2 0.2
the random sign perturbation across different datasets. 0.0 0.0

Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div
1.0 1.0 1.0

top-k
1-LENS-recall@k
top-k
1-LENS-recall@k
top-k
3-LENS-recall@k
(c) LRP [Bach 2015] (d) DeepLIFT [Shrikumar 2017]
0.8
1-LENS-prec@k 0.8
1-LENS-prec@k 0.8
3-LENS-prec@k 1.0 1.0
top-k top-k
w-LENS@k
w-LENS@k
w-LENS@k
0.6 0.6 0.6

0.8 0.8
0.4 0.4 0.4 2-LENS-prec@k 2-LENS-prec@k
w-LENS-prec@k
w-LENS-prec@k
0.2 0.2 0.2
0.6 0.6
0.0 0.0 0.0
0.4 0.4
0.2 0.2
Figure 29: Attributional robustness of GradSHAP (Lund-

0.0 0.0
berg and Lee 2017) on naturally and PGD trained models Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div Natural/top-k Natural/top-k-div PGD/top-k PGD/top-k-div
measured as top-k intersection, w-LENS-prec@k and w- (e) GradSHAP [Lundberg 2017] (f) IG [Sundararajan 2017]
LENS-recall@k between the GradSHAP of the original im-
ages and of their perturbations obtained by the random sign Figure 31: Attributional robustness of explanation methods
perturbation across different datasets. on naturally and PGD trained models measured as top-k
intersection and w-LENS-prec@k between the explanation
1.0
top-k
1-LENS-recall@k
1.0
top-k
1-LENS-recall@k
1.0
top-k
3-LENS-recall@k
map of the original images and their perturbations. Pertur-
0.8
1-LENS-prec@k 0.8
1-LENS-prec@k 0.8
3-LENS-prec@k
bations are obtained by the random sign perturbation. The
w-LENS@k
w-LENS@k
w-LENS@k
0.6 0.6 0.6
0.4 0.4 0.4

plots show the effect of varying w on ImageNet dataset with
0.2 0.2 0.2
naturally and PGD trained ResNet50 model.
0.0 0.0 0.0
Figure 30: Attributional robustness of Integrated Gradients

(Sundararajan, Taly, and Yan 2017) on naturally and PGD
trained models measured as top-k intersection, w-LENS-
prec@k and w-LENS-recall@k between the Integrated
Gradients(IG) of the original images and of their pertur-
bations obtained by the random sign perturbation across dif-
ferent datasets.
F The fragility of top-k intersection
Figure 32 highlights the top-100 pixels in the unperturbed
and perturbed maps of a sample image. Figure 33, 34 show
the top-1000 pixels and top-1000 diverse pixels along with
w-LENS-recall@k and w-LENS-prec@k on them. Figure
35 visualize the top-1000 diverse pixels obtained with differ-
ent window sizes eg. 3 and 5, respectively. Figure 36 elabo-
rates the use of LENS(locality) and Figure 37 shows LENS-
div(diversity) for the same example from ImageNet. Zoom
in required to observe the finer details.
(a) Original IG (b) top-100 pixels (c) Perturbed IG (d) top-100 pixels
(e) Original IG (f) top-100 pixels (g) Perturbed IG (h) top-100 pixels
Figure 32: Sample Integrated Gradients(IG) map using Flower dataset where top-100 is highlighted before and after image is
perturbed with top-k attack of Ghorbani, Abid, and Zou (2019).
(a) Original IG (b) top-1000 pixels (c) 1-LENS-recall@k (d) 2-LENS-recall@k
(e) Perturbed IG (f) top-1000 pixels (g) 1-LENS-prec@k (h) 2-LENS-prec@k
Figure 33: Example based on locality. Sample Integrated Gradients(IG) map using Flower dataset with top-k highlighted fol-
lowed by w-LENS@k maps (row 1) Original IG and (row 2) Perturbed IG with top-k attack (Ghorbani, Abid, and Zou 2019).
(a) Original IG (b) top-1000-div pixels (c) 1-LENS-recall@k-div (d) 2-LENS-recall@k-div
(e) Perturbed IG (f) top-1000-div pixels (g) 1-LENS-prec@k-div (h) 2-LENS-prec@k-div
Figure 34: Example based on diversity. Sample Integrated Gradients(IG) map using Flower dataset with top-k-div highlighted
followed by w-LENS@k-div maps (row 1) Original IG and (row 2) Perturbed IG with top-k attack (Ghorbani, Abid, and Zou
2019).
(a) Original IG (b) top-1000 diverse pixels with 3×3 window (c) top-1000 diverse pixels with 5×5 window
(d) Perturbed IG (e) top-1000 diverse pixels with 3×3 window (f) top-1000 diverse pixels with 5×5 window
Figure 35: Example based on diversity with different window sizes. Sample Integrated Gradients(IG) map using Flower dataset
with top-k-div highlighted with (column 2) 3×3 window (column 3) 5×5 window. (column 1) (top) map of unperturbed image
(bottom) map of perturbed image with top-k attack (Ghorbani, Abid, and Zou 2019).
(a) Original Image (b) top-1000 of Original IG (c) 3-LENS-recall@1000 (d) 3-LENS-recall@1000 0/1
(e) Perturbed Image (f) top-1000 of Perturbed IG (g) 3-LENS-prec@1000 (h) 3-LENS-prec@1000 0/1
Figure 36: Example based on locality(LENS) with sample Integrated Gradients(IG) map from ImageNet dataset (a) the original
image and (b) being the perturbed image. (c) and (d) show top-k highlighted and (e) and (f) are the corresponding maps with
LENS. Maps (g), (h) are the maps corresponding to (e), (f), respectively with non-zero value pixels shown as white. top-k:0.108,
3-LENS-recall@k:0.254, 3-LENS-prec@k:0.433, top-k-div:0.090, 3-LENS-recall@k-div:0.758, 3-LENS-prec@k-div:0.807
(a) Original IG (b) top-1000-div of Original IG (c) 3-LENS-recall@1000-div (d) 3-LENS-recall@1000-div 0/1
(e) Perturbed IG (f) top-1000-div of Perturbed IG (g) 3-LENS-prec@1000-div (h) 3-LENS-prec@1000-div 0/1
Figure 37: Example based on diversity with 7×7 window size with sample Integrated Gradients(IG) map from ImageNet
dataset (a) the original IG of unperturbed image and (b) perturbed IG obtained with top-k attack(Ghorbani, Abid, and Zou
2019). (c) and (d) show top-k-div highlighted and (e) and (f) are the corresponding maps with LENS. Maps (g), (h) are the
maps corresponding to (e), (f), respectively with non-zero value pixels shown as white. top-k:0.108, 3-LENS-recall@k:0.254,
3-LENS-prec@k:0.433, top-k-div:0.090, 3-LENS-recall@k-div:0.758, 3-LENS-prec@k-div:0.807
G Additional results for PGD-trained and
IG-SUM-NORM trained models
Figure 11 and Figure 12 shows the impact of k in top-k
for adversarially(PGD) trained and attributional(IG-SUM-
NORM) trained network, respectively. But an important
point to be noticed is that even with small number of fea-
tures LENS is able to cross 70-80% which supports the ob-
servation of sparsity and stability of attributions achieved
by adversarially(PGD) trained models by Chalasani et al.
(2020). Similarly, the experiments with different w value for
w-LENS-top-k in Figure 19 clearly incidates that due to the
stability properties at lower window sizes LENS is able to
cross the 80% intersection quickly. Supporting that our met-
ric nicely captures local stability very well.
While above we observed only the top-k version of
LENS. Figure 14 further extends the observation to LENS-
Spearman and LENS-Kendall who to show that with LENS
with a smoothing of 3 × 3 the maps from PGD-trained and
IG-SUM-NORM trained models have a higher top-k inter-
section above 70% in comparison to natural trained model
across all datasets used in our experiments, which further
strengthen the conclusions from previous papers that IG on
PGD-trained and IG-SUM-NORM trained models give bet-
ter attributions.
Appendix E and H provide results on PGD-trained
and IG-SUM-NORM trained models along with naturally
trained models for compact presentation of results.
H Additional results with top-k-div

Table 4 provides detailed LENS and diversity with LENS
results on MNIST, Fashion-MNIST, Table 5 on Flower with
natural, PGD and IG-SUM-NORM trained networks and Ta-
ble 6 on naturally trained network for ImageNet. All results
with top-k, center of mass attack proposed by Ghorbani,
Abid, and Zou (2019) and random sign perturbation using
Integrated Gradients(IG).
Table 7 show the LENS and diversity results with random
sign perturbation on MNIST, Fashion MNIST and Table 8
on ImageNet with different explanation methods like Sim-
ple Gradients(SG), Image×Gradient, DeepLIFT(Shrikumar
et al. 2016; Shrikumar, Greenside, and Kundaje 2017), LRP
(Bach et al. 2015), GradShap(Lundberg and Lee 2017)
and Integrated Gradients(IG) (Sundararajan, Taly, and Yan
2017).
Dataset Train Type Attack Type Attribution Method top-k intersection 1-LENS-recall@k 1-LENS-prec@k top-k-div 1-LENS-recall@k-div 1-LENS-prec@k-div
MNIST Nat top-k IG 0.4635 0.6209 0.9427 0.1680 0.6592 0.8126
Nat mass center IG 0.5779 0.7146 0.9572 0.1708 0.6183 0.7191
Nat random IG 0.7384 0.8790 0.9808 0.2046 0.6774 0.8140
MNIST PGD top-k IG 0.5378 0.6127 0.9690 0.2300 0.6567 0.8114
PGD mass center IG 0.6077 0.7133 0.9700 0.1907 0.5894 0.6801
PGD random IG 0.6867 0.8220 0.9835 0.1751 0.6726 0.7995
MNIST IG-SUM-NORM top-k IG 0.6406 0.7817 0.9736 0.2327 0.6701 0.7995
IG-SUM-NORM mass center IG 0.7075 0.8387 0.9695 0.2002 0.6293 0.7294
IG-SUM-NORM random IG 0.7746 0.9129 0.9846 0.1783 0.6788 0.8142
Fashion-MNIST Nat top-k IG 0.3841 0.9118 0.9361 0.2132 0.8404 0.8857
Nat random IG 0.5378 0.9377 0.9500 0.2716 0.8337 0.8825
Fashion-MNIST PGD top-k IG 0.7440 0.9571 0.9696 0.3770 0.8486 0.8608
PGD random IG 0.8816 0.9863 0.9892 0.3806 0.8451 0.8695
Fashion-MNIST IG-SUM-NORM top-k IG 0.6953 0.9608 0.9697 0.3671 0.8437 0.8655
Table 4: Table with top-k intersection and top-k-div results for MNIST and Fashion MNIST with LeNet based model using
Integrated Gradients(IG) (Sundararajan, Taly, and Yan 2017). The columns first contain locality results: top-k intersection, 1-
LENS-recall@k, 1-LENS-prec@k followed by diversity results: top-k-div, 1-LENS-recall@k-div, 1-LENS-prec@k-div. Mod-
els trained naturally, PGD and IG-SUM-NORM are used. Results include top-k, center of mass attack of Ghorbani, Abid, and
Zou (2019) as well as random sign perturbation.
Flower Nat top-k IG 0.1789 0.4091 0.5128 0.2355 0.9482 0.9560
Nat random IG 0.8206 0.9709 0.9747 0.4613 0.9741 0.9778
Flower PGD top-k IG 0.5941 0.8444 0.9165 0.4078 0.9733 0.9737
PGD random IG 0.9033 0.9929 0.9934 0.5948 0.9847 0.9886
Flower IG-SUM-NORM top-k IG 0.6179 0.8720 0.9299 0.3875 0.9747 0.9757
Table 5: Table with top-k intersection and top-k-div results for Flower with ResNet based model using Integrated Gradients(IG)
(Sundararajan, Taly, and Yan 2017). The columns first contain locality results: top-k intersection, 2-LENS-recall@k, 2-LENS-
prec@k followed by diversity results: top-k-div, 2-LENS-recall@k-div, 2-LENS-prec@k-div. Models trained naturally, PGD
and IG-SUM-NORM are used. Results include top-k, center of mass attack of Ghorbani, Abid, and Zou (2019) as well as
random sign perturbation.
ImageNet Nat top-k IG 0.0544 0.2822 0.3189 0.0684 0.7303 0.7458
Nat random IG 0.3133 0.8463 0.8460 0.2157 0.8713 0.8689
Table 6: Table with top-k intersection and top-k-div results for ImageNet with SqueezeNet model using Integrated Gradi-
ents(IG) (Sundararajan, Taly, and Yan 2017). The columns first contain locality results: top-k intersection, 3-LENS-recall@k, 3-
LENS-prec@k followed by diversity results: top-k-div, 3-LENS-recall@k-div, 3-LENS-prec@k-div. Models naturally trained
are used. Results include top-k, center of mass attack of Ghorbani, Abid, and Zou (2019) as well as random sign perturbation.
Dataset Train Type Attack Type Attribution method top-k 1-LENS-recall@k 1-LENS-prec@k top-k-div 1-LENS-recall@k-div 1-LENS-prec@k-div
MNIST Nat random Simple Gradient 0.6548 0.8872 0.9355 0.2431 0.8724 0.8942
Nat random Image × Gradient 0.6887 0.8590 0.9791 0.1640 0.6725 0.7735
Nat random LRP [Bach 2015] 0.6887 0.8590 0.9791 0.1640 0.6725 0.7733
Nat random DeepLIFT [Shrikumar 2017] 0.6896 0.8602 0.9792 0.1642 0.6725 0.7724
Nat random GradSHAP [Lundberg 2017] 0.6950 0.8629 0.9792 0.1646 0.6716 0.7707
Nat random IG [Sundararajan 2017] 0.6978 0.8636 0.9795 0.1654 0.6717 0.7705
MNIST PGD random Simple Gradient 0.4456 0.7544 0.9712 0.1798 0.8034 0.8719
PGD random Image × Gradient 0.5355 0.8102 0.9853 0.1620 0.6788 0.8037
PGD random LRP [Bach 2015] 0.6887 0.8590 0.9791 0.2786 0.8669 0.9557
PGD random DeepLIFT [Shrikumar 2017] 0.5387 0.8111 0.9855 0.1626 0.6795 0.8056
PGD random GradSHAP [Lundberg 2017] 0.6831 0.9164 0.9835 0.1611 0.6684 0.7619
PGD random IG [Sundararajan 2017] 0.6729 0.9359 0.9847 0.1671 0.6641 0.7573
Fashion-MNIST Nat random Simple Gradient 0.5216 0.8421 0.8614 0.2691 0.8718 0.8820
Fashion-MNIST PGD random Simple Gradient 0.6036 0.8374 0.9362 0.2974 0.8202 0.8696
Table 7: Table with top-k intersection and top-k-div results for MNIST and Fashion MNIST with LeNet based model trained
naturally and adversarially(PGD), using different explanation methods with random sign perturbation. The columns first con-
tain locality results: top-k intersection, 1-LENS-recall@k, 1-LENS-prec@k followed by diversity results: top-k-div, 3-LENS-
recall@k-div, 3-LENS-prec@k-div.
Dataset Train Type Attack Type Attribution method top-k 3-LENS-recall@k 3-LENS-prec@k top-k-div 3-LENS-recall@k-div 3-LENS-prec@k-div
ImageNet Nat random Simple Gradient 0.3825 0.7875 0.7949 0.3096 0.8290 0.8115
ImageNet PGD random Simple Gradient 0.1725 0.7245 0.7306 0.1410 0.8004 0.8004
Table 8: Table with top-k intersection and top-k-div results for ImageNet with naturally and adversarially(PGD) trained
ResNet50 model using different explanation methods with random sign perturbation. The columns first contain locality re-
sults: top-k intersection, 1-LENS-recall@k, 1-LENS-prec@k followed by diversity results: top-k-div, 3-LENS-recall@k-div,
3-LENS-prec@k-div.
I Details of Survey Conducted to Study despite the top-k intersection being less the 30% between the
Human’s Perception of Robustness maps, users who fall into the Agree with 3-LENS-prec@k
metric category is large indicating that current top-k based
Detailed description of the survey conducted to study hu- comparison is weak.
man’s interpretation of attribution maps.
Survey Format: Each question consisted of an unper-
turbed image from the Flower dataset and a pair of expla-
nation/attribution maps. The pair can be any combination of
original and attribution map obtained with random pertur-
bation or with Ghorbani, Abid, and Zou (2019) attack. We
used Integrated Gradients (IG) (Sundararajan, Taly, and Yan
2017) to obtain the attribution maps. Perturbed maps were
obtained using the Ghorbani, Abid, and Zou (2019) attack Figure 38: Sample question from survey using Lily Valley
and random noise with appropriate ϵ-budget. The questions image with original IG map and IG map with Ghorbani,
were presented at random. At no point in the survey we re- Abid, and Zou (2019) attack using ϵ = 8/255.
vealed the type of map or perturbation added to obtain the
maps. This ensured the user was not biased by this extra in- In another sample from the survey (Figure 38), surpris-
formation while answering the survey. A sample question ingly more the 50% who fall in the category Agree with
presented to the user is as given below : 3-LENS-prec@k metric preferred the perturbed map over
Here is an image of Tigerlily with two attribution maps. the original map. top-k and 3-LENS-prec@k values were
36% and 88%, respectively.
We did have few questions in the survey to study the ef-
fectiveness of random perturbation as an attack as observed
by Ghorbani, Abid, and Zou (2019)[Figure 3] with top-k
metric. We used ϵ = 8/255 for the random perturbation.
The results of the survey were very unanimous with users
responses overwhelmingly(above 90%) fell into the Agree
with 3-LENS-prec@k metric. This strongly indicates that
random perturbation considered as an attack under current
Do both these attribution maps explain the image well? metrics gives a false sense of attribution robustness. Refer to
(1) Yes, both are similar and explain the image well. the last entries in Table 9.
(2) Yes, both explain the image well, but are dissimilar. Attack Type Agree with 3-LENS-prec@k metric(%) Agree with top-k metric(%) top-1000 intersection 3-LENS-prec@1000
top-k 70.37 29.63 0.343 0.928
(3) Only A explains the image well, B is different. top-k
random
81.48
93.55
18.52
6.45
0.0805
0.357
0.521
0.7965
(4) Only B explains the image well, A is different.

(5) No, Both the maps do not explain the image well. Table 9: Survey results based on humans ability to relate
Interpretation of Options: Options (1), (2) the user is able the explanation map to the original image with or without
to relate both the maps to equally represent the image. Op- noise using the Flower dataset using Integrated Gradients
tion (3) the user finds map A to represent the image over (IG) (Sundararajan, Taly, and Yan 2017) as the explanation
map B. Option (4) the user finds map B relates to the image method.
more closer than map B. Option (5) the user can not relate
the maps to the image.
Table 1 in the main paper, we simplify the options by
forming 2 categories - (1) Agree with 3-LENS-prec@k
metric : combines Option (1), (2) and (4). These options
show that the top-k metric does not match human visual
perception when the user either is agnostic to noise in the
map or finds the perturbed map more relatable to the image.
(2) Agree with top-k metric : Only Option (3). It shows that
top-k metric is sufficient to measure the differences between
the attribution maps and closely captures human visual per-
ception.
Summary of Survey Results: In the above sample question,
map A is the original IG map of the image and map B the
IG map of Ghorbani, Abid, and Zou (2019) attacked image
with ϵ = 8/255. Interestingly the users who chose between
(1) to (4), more than 65% users chose option (1), (2) or (4)
(Agree with 3-LENS-prec@k metric), while remaining pre-
ferred option (3) (Agree with top-k metric). Showing that

Rethinking Robustness of Model Attributions: Sandesh Kamath, Sankalp Mittal, Amit Deshpande, Vineeth N Balasubramanian

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Rethinking Robustness of Model Attributions: Sandesh Kamath, Sankalp Mittal, Amit Deshpande, Vineeth N Balasubramanian

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Rethinking Robustness of Model Attributions: Sandesh Kamath, Sankalp Mittal, Amit Deshpande, Vineeth N Balasubramanian

Uploaded by

Copyright:

Available Formats

Rethinking Robustness of Model Attributions

Sandesh Kamath1 , Sankalp Mittal1 , Amit Deshpande2 , Vineeth N Balasubramanian1

Abstract numbers of attribution methods has led to a concerted fo-

2 Background and Related Work

and y. Theorem 3 in (Wang et al. 2020) upper bounds the

where Hx (f ) is the Hessian of f w.r.t. the input at x and top-k

coefficients such as Kendall’s τ and Spearman’s ρ, we com- 0.2 0.2

pute w-LENS-Kendall and w-LENS-Spearman as the same 0.0 0.0

PGD GradSHAP [Lundberg 2017] 0.1714 0.7270 0.8552

B Attributional Robustness Metrics are

B.1 Random Vectors are Attributional Attacks

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

because each (p, q) appears in at most (2w+1)2 possibles

(a) MNIST IG : (b) F-MNIST IG (c) GTSRB IG : (d) Flower IG :

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

E Additional results for Section 3 20 40 60

(e) MNIST IG : (f) F-MNIST IG (g) GTSRB IG : (h) Flower IG :

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

0.90 0.8 0.90 0.90 0.925

Comparison of Spearman’s ρ and Kendall’s τ with 1-

Average top-k Intersection

0.4 0.4 Spearman's

(a) MNIST (b) Fashion MNIST

Average top-k Intersection

Average top-k Intersection

(c) GTSRB (d) Flower (a) MNIST (b) Fashion MNIST

Kendall and 1-LENS-Spearman used to measure the attri- 0.8 0.8

Average topK Intersection

butional robustness of IG on natrually trained, PGD-trained 0.6 0.6

and IG-SUM-NORM trained models. The perturbation used 0.4 0.4

(c) GTSRB (d) Flower

Average top-k Intersection

(a) MNIST (b) Fashion MNIST

Average topK Intersection

Average top-k Intersection

Average top-k Intersection

Average top-k Intersection

0.6 0.6 Figure 19: Attributional robustness of IG on naturally, PGD

Average topK Intersection

(c) GTSRB (d) Flower