License: CC BY 4.0
arXiv:2311.13594v2 [cs.LG] 18 Jan 2024

Labeling Neural Representations with Inverse Recognition

Kirill Bykov
UMI Lab
ATB Potsdam
Potsdam, Germany
kbykov@atb-potsdam.de
Laura Kopf
UMI Lab
ATB Potsdam
Potsdam, Germany
lkopf@atb-potsdam.de
Shinichi Nakajima
Machine Learning Group
TU Berlin
Berlin, Germany
nakajima@tu-berlin.de
Marius Kloft
Machine Learning Group
RPTU Kaiserslautern-Landau
Kaiserslautern, Germany
kloft@cs.uni-kl.de
Marina M.-C. Höhne
UMI Lab
ATB Potsdam
University of Potsdam, Germany
mhoehne@atb-potsdam.de
Corresponding author.
Abstract

Deep Neural Networks (DNNs) demonstrate remarkable capabilities in learning complex hierarchical data representations, but the nature of these representations remains largely unknown. Existing global explainability methods, such as Network Dissection, face limitations such as reliance on segmentation masks, lack of statistical significance testing, and high computational demands. We propose Inverse Recognition (INVERT), a scalable approach for connecting learned representations with human-understandable concepts by leveraging their capacity to discriminate between these concepts. In contrast to prior work, INVERT is capable of handling diverse types of neurons, exhibits less computational complexity, and does not rely on the availability of segmentation masks. Moreover, INVERT provides an interpretable metric assessing the alignment between the representation and its corresponding explanation and delivering a measure of statistical significance. We demonstrate the applicability of INVERT in various scenarios, including the identification of representations affected by spurious correlations, and the interpretation of the hierarchical structure of decision-making within the models.

1 Introduction

Deep Neural Networks (DNNs) have demonstrated exceptional performance across a broad spectrum of domains due to their ability to learn complex, high-dimensional representations from vast volumes of data [1]. Nevertheless, despite these impressive accomplishments, our comprehension of the concepts encoded within these representations remains limited. The "black-box" nature of representations, combined with the known susceptibility of networks to learn spurious correlations [2, 3, 4], biases [5] and harmful stereotypes [6] poses significant risks for the application of DNN systems, particularly in safety-critical domains [7].

To tackle the inherent opacity of DNNs, the field of Explainable AI (XAI) has emerged [8, 9, 10]. Global explanation methods aim to explain the concepts and abstractions learned within DNN representations. This is often achieved by establishing associations between neurons and human-understandable concepts [11, 12, 13, 14], or by visualizing the stimuli responsible for provoking high neural activation levels [15, 16, 17, 18]. Such methods have proven capable of detecting malicious behavior and identifying the specific neurons responsible [19, 20].

In this work, we introduce the Inverse Recognition (INVERT) method for labeling neural representations within DNNs (the code can be accessed at https://github.com/lapalap/invert). Given a specific neuron, INVERT provides an explanation of the function of the neuron in the form of a composition of concepts, selected based on the ability of the neuron to detect data points within the compositional class. Unlike previous methods, the proposed approach does not rely on segmentation masks and only necessitates labeled data, is not constrained by the specific type of neurons, and demands fewer computational resources. Furthermore, INVERT offers a statistical significance test to confirm that the provided explanation is not merely a random occurrence. We evaluate the performance of the proposed approach across various datasets and models, and illustrate its practical use through multiple examples.

2 Related work

Post-hoc interpretability, a subfield within Explainable AI, focuses on explaining the decision-making strategies of Deep Neural Networks (DNNs) without interfering with the original training process [21, 22]. Within the realm of post-hoc methods, a fundamental categorization arises concerning the scope of explanations they provide. Local explanation methods aim to explain the decision-making process for individual data points, often presented in the form of attribution maps [23, 24, 25]. On the other hand, global explanation methods aim to explain the prediction strategy learned by the machine across the population and investigate the purpose of its individual components [26, 27].

Inspired by principles from neuroscience [28, 29, 30], global explainability directs attention towards the in-depth examination of individual model components and their functional purpose [31]. Often, global explainability is referred to as mechanistic interpretability, particularly in the context of Natural Language Processing (NLP) [32, 33, 34, 35]. A global approach to interpretability allows for the exploration of concepts learned by the model [36, 37, 38, 39] and the explanation of circuits, i.e., computational subgraphs within the model that learn the transformation of various features [40, 41]. Various methods have been proposed to interpret the learned features, including Activation-Maximization (AM) methods [15]. These methods aim to explain what individual neurons or groups of neurons have learned by visualizing inputs that elicit strong activation responses. Such input signals can either be found in an existing dataset [16] or generated synthetically [42, 17, 18]. AM methods have demonstrated their utility in detecting undesired concepts learned by the model [19, 43, 20]. However, these methods require substantial user input to identify the concepts embodied in the Activation-Maximization signals. Recent research has also demonstrated that such explanations can be manipulated while maintaining the behavior of the original model [44, 45, 46].

Another group of global explainability methods aims to explain the abstraction learned by a neuron within the model by associating it with human-understandable concepts. The Network Dissection (NetDissect) method [11, 47] was developed to provide explanations by linking neurons to concepts, based on the overlap between the activation maps of neurons and concept segmentation masks, quantified using the Intersection over Union (IoU) metric. To address the limitation that neurons could only be explained with a single concept, the subsequent Compositional Explanations of Neurons (CompExp) method was introduced, enabling the labeling of neurons with compositional concepts [12]. Despite their utility, these methods have limitations: they are primarily applicable to convolutional neurons and necessitate a dataset with segmentation masks, which significantly restricts their scalability (a more comprehensive discussion of these methods can be found in Appendix A.2). Other notable methods include CLIP-Dissect [13], MILAN [48], and FALCON [49]. However, these methods utilize an additional model to produce explanations, thereby introducing a new source of potential unexplainability stemming from the explainer model.

3 INVERT: Interpreting Neural Representations with Inverse Recognition

In the following, we introduce a method called Inverse Recognition (INVERT). This method aims to explain the abstractions learned by a neural representation by identifying which compositional concept the representation is most effective at detecting in a binary classification scenario. Unlike the general objective of Supervised Learning (SL) [50], which is to learn representations that can detect given concepts, the central idea behind INVERT is to learn the compositional concept that best explains a given representation.

Let $\mathbb{D}\subset\mathbb{R}^{m}$, where $m\in\mathbb{N}$ is the number of dimensions of the data, be the input (data) space. We use the term neural representations to refer to a sub-function of a network that represents the computational graph from the input of the model to the scalar output (activation) of a specific neuron, or any combination of neurons, that results in a scalar function.

Definition 1 (Neural representation).

A neural representation $f\in\mathbb{F}$ is defined as a real-valued function $f:\mathbb{D}\rightarrow\mathbb{R}$, which maps the data domain $\mathbb{D}$ to the real numbers $\mathbb{R}$. Here, $\mathbb{F}$ represents the space of real-valued functions on $\mathbb{D}$.

Frequently, in DNNs, particular neurons, like convolutional neurons, produce multidimensional outputs. Depending on the specific needs of the application, these multidimensional outputs can either be interpreted as a set of individual scalar representations or be aggregated into a single scalar output, e.g., with pooling operations such as average- or max-pooling. Unless stated otherwise, we utilize average-pooling as the standard aggregation measure.
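As a minimal illustration of this aggregation step, the sketch below reduces the spatial activation map of one neuron to a single scalar by average- or max-pooling. It uses plain Python lists; the function names `avg_pool` and `max_pool` and the toy activation map are illustrative assumptions and are not taken from the released code.

```python
from typing import List

ActivationMap = List[List[float]]  # one neuron's H x W output for one input


def avg_pool(act: ActivationMap) -> float:
    """Average all spatial positions into a single scalar activation."""
    values = [v for row in act for v in row]
    return sum(values) / len(values)


def max_pool(act: ActivationMap) -> float:
    """Take the largest spatial activation as the scalar output."""
    return max(v for row in act for v in row)


# Toy 2 x 2 activation map of a hypothetical convolutional neuron.
toy_map = [[0.0, 1.0],
           [2.0, 5.0]]
scalar_avg = avg_pool(toy_map)  # (0 + 1 + 2 + 5) / 4
scalar_max = max_pool(toy_map)
```

Either scalar can then play the role of $f(x)$ in Definition 1.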

We define a concept as a mapping that represents the human process of attributing characteristics to data.

Definition 2 (Concepts).

A concept $c\in\mathbb{C}$ is defined as a binary function $c:\mathbb{D}\longrightarrow\{0,1\}$, which maps the data domain $\mathbb{D}$ to the set of binary numbers. A value of $1$ indicates the presence of the concept in the input, and $0$ indicates its absence. Here, $\mathbb{C}$ corresponds to the space of all concepts that could be defined on $\mathbb{D}$.

In practice, given the dataset $\mathcal{D}\subset\mathbb{D}$, concepts are usually defined by labels, which reflect the judgments made by human experts. We define $C=\{c_{1},\dots,c_{d}\}\subset\mathbb{C}$ as a set of $d\in\mathbb{N}$ atomic concepts that are induced by the labels of the dataset (also referred to as primitive concepts or primitives). Within the context of this work, we permit concepts to be non-disjoint, signifying that each data point may have multiple concepts attributed to it. Additionally, we define a vector $\mathcal{C}=\left[c_{1},\dots,c_{d}\right]\in\mathbb{C}^{d}$.

A key step for explaining the abstractions learned by neural representations relies on the choice of the similarity measure between the concept and the representation. INVERT evaluates the relationship between representation and concepts by employing the non-parametric Area Under the Receiver Operating Characteristic (AUC) metric, measuring the representation’s ability to distinguish between the presence and absence of a concept.

Definition 3 (AUC similarity).

Let $f\in\mathbb{F}$ be a neural representation, $\mathcal{D}\subset\mathbb{D}$ a dataset, and $c\in\mathbb{C}$ a concept. We define a similarity measure $d:\mathbb{F}\times\mathbb{C}\longrightarrow[0,1]$ as

$$d(f,c)=\frac{\sum_{\{x \mid x\in\mathcal{D},\, c(x)=0\}}\sum_{\{y \mid y\in\mathcal{D},\, c(y)=1\}}\mathbf{1}\left[f(x)<f(y)\right]}{\left|\{x \mid x\in\mathcal{D},\, c(x)=0\}\right|\cdot\left|\{y \mid y\in\mathcal{D},\, c(y)=1\}\right|}, \tag{1}$$

where $\mathbf{1}\left[f(x)<f(y)\right]$ is an indicator function that yields 1 if $f(x)<f(y)$ and 0 otherwise.

AUC provides an interpretable measure to assess the ability of the representation to systematically output higher activations for the datapoints where the concept is present. An AUC of $1$ denotes a perfect classifier, while an AUC of $0.5$ suggests that the classifier’s performance is no better than random chance.
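The AUC similarity of Eq. (1) can be sketched directly as a count over (negative, positive) pairs. The function name `auc_similarity` and the toy data are assumptions made for illustration. The double loop is quadratic in the dataset size and is meant to mirror the formula, not to be efficient. Because Eq. (1) uses a strict inequality, tied activations contribute 0 here, which may differ from library implementations that count ties as one half.

```python
from typing import Sequence


def auc_similarity(activations: Sequence[float], labels: Sequence[int]) -> float:
    """Eq. (1): fraction of (negative, positive) pairs with f(x) < f(y)."""
    neg = [a for a, c in zip(activations, labels) if c == 0]
    pos = [a for a, c in zip(activations, labels) if c == 1]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative data point")
    wins = sum(1 for x in neg for y in pos if x < y)
    return wins / (len(neg) * len(pos))


# Toy data: activations of one representation on six data points.
acts = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]
concept = [0, 0, 1, 1, 1, 0]
score = auc_similarity(acts, concept)  # 8 of the 9 pairs are ordered correctly
```

In this toy case only the pair (0.4, 0.35) is ordered incorrectly, so the similarity is 8/9.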

Given that various concepts have different numbers of data points associated with them, for a concept $c\in\mathbb{C}$ we can compute the concept fraction, corresponding to the ratio of data points that are positively labeled by the concept:

$$T(c)=\frac{\left|\{x \mid x\in\mathcal{D},\, c(x)=1\}\right|}{\left|\{x \mid x\in\mathcal{D}\}\right|}. \tag{2}$$
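A short sketch of Eq. (2), assuming the concept is given as a list of 0/1 labels over the dataset (the name `concept_fraction` is hypothetical):

```python
from typing import Sequence


def concept_fraction(labels: Sequence[int]) -> float:
    """Eq. (2): share of data points labeled positive by the concept."""
    return sum(labels) / len(labels)


# A concept that is present in 50 out of 50,000 data points.
rare_concept = [1] * 50 + [0] * 49950
fraction = concept_fraction(rare_concept)
```

For such a concept the fraction is 0.001, i.e., 0.1% of the data points are positive.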

3.1 Finding Optimal Compositional Explanations

Given a representation $f\in\mathbb{F}$, INVERT’s objective is to identify the concept that maximizes the AUC similarity with the representation or, in other words, to find the concept that the representation detects best. Because representations can detect features shared across various concepts, explaining a representation with a single atomic concept from $C$ may not provide a comprehensive explanation. To surmount this challenge, we adopt the existing compositional concepts approach [12] and augment the set of atomic concepts $C$ by introducing new generic concepts as logical combinations of existing ones. These logical forms involve the composition of AND, OR, and NOT operators, and they are based on the atomic concepts from $C$.

Definition 4 (Compositional concept).

Given a vector of atomic concepts $\mathcal{C}$, a compositional concept $\varphi$ is a higher-order interpretable function that maps $\mathcal{C}$ to a new, compositional concept:

$$\varphi:\mathbb{C}^{d}\longrightarrow\mathbb{C}. \tag{3}$$

For example, let $C=\{c_{1},c_{2}\}$ be a set of atomic concepts with corresponding vector $\mathcal{C}$. Let $c_{1}$ be a concept for “dog”, and $c_{2}$ a concept for “llama”. Then $\varphi(\mathcal{C})=c_{1}\text{ OR }c_{2}=\text{“dog” OR “llama”}$ is a compositional concept with the length $L=2$. The concept $\varphi(\mathcal{C})$ is a concept in itself (i.e., $\varphi(\mathcal{C})\in\mathbb{C}$) and corresponds to a concept that is positive for all images of dogs or llamas in the dataset.
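On a finite dataset, a compositional concept reduces to element-wise logic on binary label vectors. The sketch below illustrates this for the "dog" OR "llama" example; the helper names `c_and`, `c_or`, `c_not` and the five-image toy dataset are assumptions made for illustration.

```python
from typing import List

Labels = List[int]  # one 0/1 entry per data point


def c_and(a: Labels, b: Labels) -> Labels:
    """Positive only where both concepts are present."""
    return [x & y for x, y in zip(a, b)]


def c_or(a: Labels, b: Labels) -> Labels:
    """Positive where at least one concept is present."""
    return [x | y for x, y in zip(a, b)]


def c_not(a: Labels) -> Labels:
    """Positive exactly where the concept is absent."""
    return [1 - x for x in a]


# Toy dataset of five images with two atomic concepts.
dog = [1, 0, 0, 1, 0]
llama = [0, 1, 0, 0, 0]

dog_or_llama = c_or(dog, llama)                 # length L = 2
dog_and_not_llama = c_and(dog, c_not(llama))    # also length L = 2
```

Because the result is again a 0/1 label vector, it can be scored with the same AUC similarity as any atomic concept.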

Evaluating the performance of all conceivable logical forms across all of the $d$ concepts from $C$ is generally computationally infeasible. Consequently, the set of potential compositional concepts $\Phi_{L}$ is restricted to forms of predetermined length $L\in\mathbb{N}$, where $L$ is a parameter of the method. The objective of INVERT, in this context, can be reformulated as:

$$\varphi^{*}=\operatorname*{arg\,max}_{\varphi\in\Phi_{L}} d\left(f,\varphi(\mathcal{C})\right). \tag{4}$$

Figure 1: Demonstration of the INVERT method ($B=1$, $\alpha=0.35\%$) for the neuron $f_{33}$ from ResNet18, AvgPool layer (Neuron 33), using the ImageNet 2012 validation dataset. The resulting explanations can be observed in the bottom part of the figure, where three steps of the iterative process are demonstrated from $L=1$ to $L=3$. It can be observed that INVERT explanations align with the neuron’s high-activating images, illustrated in the top right figure.

To determine the optimal compositional concept that maximizes AUC, we employ an approach similar to that used in [12], utilizing Beam-Search optimization. Parameters of the proposed method include the predetermined length $L\in\mathbb{N}$ and the beam size $B\in\mathbb{N}$. Additionally, during the search process, explanations can be constrained to satisfy $T(\varphi(\mathcal{C}))\in[\alpha,\beta]$, where $0\leq\alpha<\beta\leq 0.5$. In Section 4.1, we further demonstrate that imposing such a constraint on the concept fraction can make the resulting explanations more comprehensive. We refer to the setting $\alpha=0$, $\beta=0.5$ as the standard approach. In our experiments, unless otherwise specified, the parameter $\beta$ is set to $0.5$. Additional details and a description of the algorithm can be found in Appendix A.3.
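The search can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of Appendix A.3: it treats every atomic concept and its negation as a literal, extends each formula in the beam by one literal with AND or OR, keeps earlier beam entries as candidates (so it returns the best formula with at most $L$ literals), and discards any formula whose concept fraction falls outside $[\alpha,\beta]$. The names `auc` and `beam_search` and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Labels = List[int]


def auc(acts: List[float], labels: Labels) -> float:
    """AUC similarity of Eq. (1); assumes both classes are non-empty."""
    neg = [a for a, c in zip(acts, labels) if c == 0]
    pos = [a for a, c in zip(acts, labels) if c == 1]
    wins = sum(1 for x in neg for y in pos if x < y)
    return wins / (len(neg) * len(pos))


def beam_search(acts: List[float], atoms: Dict[str, Labels],
                L: int = 2, B: int = 2,
                alpha: float = 0.0, beta: float = 0.5) -> Tuple[str, float]:
    """Return (formula, AUC) of the best formula with at most L literals."""
    n = len(acts)

    def feasible(lab: Labels) -> bool:
        k = sum(lab)  # number of positively labeled points
        return 0 < k < n and alpha <= k / n <= beta

    # Literals: each atomic concept and its negation (NOT).
    literals: Dict[str, Labels] = {}
    for name, lab in atoms.items():
        literals[name] = lab
        literals[f"NOT {name}"] = [1 - v for v in lab]

    # Length 1: score all feasible literals and keep the B best.
    scored = [(auc(acts, lab), name, lab)
              for name, lab in literals.items() if feasible(lab)]
    beam = sorted(scored, key=lambda c: c[0], reverse=True)[:B]

    # Lengths 2..L: extend every formula in the beam by one literal.
    for _step in range(L - 1):
        pool = {name: (s, name, lab) for s, name, lab in beam}
        for _score, name, lab in beam:
            for lit_name, lit in literals.items():
                for op in ("AND", "OR"):
                    if op == "AND":
                        new_lab = [a & b for a, b in zip(lab, lit)]
                    else:
                        new_lab = [a | b for a, b in zip(lab, lit)]
                    new_name = f"({name} {op} {lit_name})"
                    if feasible(new_lab):
                        pool[new_name] = (auc(acts, new_lab), new_name, new_lab)
        beam = sorted(pool.values(), key=lambda c: c[0], reverse=True)[:B]

    if not beam:
        raise ValueError("no formula satisfies the concept-fraction constraint")
    best_auc, best_name, _labels = beam[0]
    return best_name, best_auc


# Toy data: eight images, three atomic concepts.
toy_acts = [0.9, 0.6, 0.8, 0.1, 0.2, 0.3, 0.15, 0.25]
toy_atoms = {
    "dog":   [1, 1, 0, 0, 0, 0, 0, 0],
    "llama": [0, 0, 1, 0, 0, 0, 0, 0],
    "cat":   [0, 0, 0, 1, 1, 0, 0, 0],
}
formula_1, auc_1 = beam_search(toy_acts, toy_atoms, L=1)
formula_2, auc_2 = beam_search(toy_acts, toy_atoms, L=2)
```

In this toy example the best single concept is "dog", while allowing a second literal yields "(dog OR llama)", which separates the three highest activations from the rest.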

Figure 1 illustrates the INVERT pipeline for explaining a neuron from the ResNet18 Average Pooling layer [51]. For this, we employed the validation set of ImageNet2012 [52] as the dataset $\mathcal{D}_{I}$ in the INVERT process. This subset contains 50,000 images from 1,000 distinct, non-overlapping classes, each represented by 50 images. Notably, since ImageNet classes are intrinsically linked to WordNet [53], we extracted an additional 473 hypernyms, or higher-level categories, and assigned labels for these overarching classes. In Figure 1 and subsequent figures, we use beige color to represent individual ImageNet classes and orange color to represent hypernyms. In the density plot graphs, the orange density illustrates the distribution of activations of data points that belong to the explanation concept, while blue represents the distribution of activations of data points corresponding to the negation of the explanation.
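How hypernym labels can be derived from class labels may be sketched as below. The taxonomy here is a hypothetical toy tree with a single parent per node; the actual WordNet hierarchy is richer (a node can have several hypernyms), and the names `PARENT`, `ancestors`, and `hypernym_labels` are assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical child -> parent edges standing in for a WordNet-style taxonomy.
PARENT: Dict[str, str] = {
    "beagle": "dog",
    "husky": "dog",
    "dog": "animal",
    "tabby": "cat",
    "cat": "animal",
}


def ancestors(cls: str, parent: Dict[str, str]) -> Set[str]:
    """Collect all higher-level categories above a class."""
    found: Set[str] = set()
    while cls in parent:
        cls = parent[cls]
        found.add(cls)
    return found


def hypernym_labels(image_classes: List[str],
                    parent: Dict[str, str]) -> Dict[str, List[int]]:
    """Label an image positive for every hypernym of its class."""
    n = len(image_classes)
    labels: Dict[str, List[int]] = {}
    for i, cls in enumerate(image_classes):
        for hypernym in ancestors(cls, parent):
            labels.setdefault(hypernym, [0] * n)[i] = 1
    return labels


toy_labels = hypernym_labels(["beagle", "tabby", "husky"], PARENT)
```

Each resulting label vector is an additional, broader concept that can be used alongside the original class labels.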

3.2 Statistical significance

IoU-based explanations, such as those provided by the Network Dissection method [11], often report small positive IoU scores for the resulting explanations. This raises concerns about the potential randomness of the explanation. The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [54] and can be interpreted as a measure based on pairwise comparisons between classifications of the two classes. Essentially, it estimates the probability that the classifier will rank a randomly chosen positive example higher than a negative example [55].

Given the concept $c\in\mathbb{C}$, this connection to the Mann–Whitney U test allows us to test if the distribution of the representation’s activations on the data points where concept $c$ is positive significantly differs from the distribution of activations on points where the concept is negative. We can then report the corresponding $p$-value (against a double- or one-sided alternative), which helps avoid misinterpretations due to randomness, thereby improving the reliability of the explanation process, as shown in Figure 2. In all subsequent figures, the explanations provided by INVERT achieve statistical significance (against double-sided alternative) with a standard significance level ($0.05$).
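The link between AUC and the Mann–Whitney U statistic can be sketched with the standard library alone. The function below computes $U$ as the number of correctly ordered (negative, positive) pairs, so that $\text{AUC}=U/(n_{0}n_{1})$, and derives a double-sided $p$-value from the large-sample normal approximation. Omitting tie and continuity corrections is a simplification of this sketch; in practice a library routine such as `scipy.stats.mannwhitneyu` is likely preferable. The name `mann_whitney_p` and the toy data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_p(activations: Sequence[float],
                   labels: Sequence[int]) -> Tuple[float, float]:
    """Return (AUC, double-sided p-value) via the normal approximation."""
    neg = [a for a, c in zip(activations, labels) if c == 0]
    pos = [a for a, c in zip(activations, labels) if c == 1]
    n0, n1 = len(neg), len(pos)
    # U statistic: number of pairs where the positive point ranks higher.
    u = sum(1 for x in neg for y in pos if x < y)
    auc = u / (n0 * n1)
    # Under the null hypothesis, U has mean n0*n1/2 and this variance.
    mean = n0 * n1 / 2
    std = math.sqrt(n0 * n1 * (n0 + n1 + 1) / 12)
    z = (u - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two tails of the normal
    return auc, p_value


toy_acts = [float(i) for i in range(40)]

# Well-separated case: the 20 highest activations are the positives.
auc_sep, p_sep = mann_whitney_p(toy_acts, [0] * 20 + [1] * 20)

# Interleaved case: positives and negatives alternate.
auc_mix, p_mix = mann_whitney_p(toy_acts, [i % 2 for i in range(40)])
```

The separated case gives an AUC of 1 with a very small $p$-value, whereas the interleaved case gives an AUC close to 0.5 and a $p$-value far above the 0.05 level.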


Figure 2: The figure illustrates the contrast between a poor explanation (on the left) and INVERT explanations with $L=1$ and varying parameter $\alpha$, for neuron 592 in the ViT B 16 feature-extractor layer. The INVERT explanations were computed over the ImageNet 2012 validation set. The figure demonstrates that as the parameter $\alpha$ increases, the concept fraction $T$ also increases, indicating that more data points belong to the positive class. Furthermore, this figure showcases the proposed method’s ability to evaluate the statistical significance of the result. The poor explanation fails the statistical significance test (double-sided alternative) with a p-value of 0.35, while all explanations provided by INVERT exhibit $p<0.005$.

4 Analysis

In this section, we provide additional analysis of the proposed method, including the effect of constraining the concept fraction of explanations and a comparison of INVERT with prior methods.

4.1 Simplicity-Precision tradeoff

The INVERT method is designed to identify the compositional concept that has the highest AUC similarity to a given representation. However, the standard approach neglects to account for the class imbalance between datapoints that belong and do not belong to a particular concept, often leading to precise but narrowly applicable explanations due to the small concept fraction. To mitigate this issue, we can modify the INVERT process to work exclusively with compositional concepts where the fraction equals or exceeds a specific threshold, represented as $\alpha$.

Figure 3: Three different INVERT explanations, computed by adjusting the parameter $\alpha$ for Neuron 88 in the ResNet18 AvgPool layer. Higher values of this parameter lead to broader explanations, albeit at the cost of precision, thus resulting in a lower AUC. The visualization of the WordNet taxonomy for the hypernyms is provided in the Appendix 3.

For this experiment, the INVERT method was utilized on the feature extractors of four different models trained on ImageNet. These models include ResNet18 [51], GoogleNet [56], EfficientNet B0 [57], and ViT B 16 [58]. In this experiment, we examined 50 randomly chosen neurons from the feature-extractor layer of each model. We utilized the ImageNet 2012 validation dataset $\mathcal{D}_{I}$, which was outlined in the previous section, to generate INVERT explanations with $B=3$, varying the explanation length $L$ between $1$ and $5$ and the parameter $\alpha$, which constrains the concept fraction, over $\alpha\in\{0,0.002,0.005,0.01\}$.

The experiment’s results are depicted in Figure 4. For all models, we can see an effect that we call the simplicity-precision tradeoff: the explanations with the highest AUC typically involve just one individual class with a low concept fraction, achieved in an unrestricted mode with parameter $\alpha$ set to $0$. By constraining the concept fraction with $\alpha$ and increasing the explanation length $L$, we can improve AUC scores while still maintaining the desired concept fraction. Still, this indicates that more generalized, broader explanations come at the cost of a loss in precision in terms of the AUC measure. Figure 3 demonstrates how the change of parameter $\alpha$ affects the resulting explanation.


Figure 4: Impact of the parameter $\alpha$ and formula length $L$ on the resulting explanations. The first row of the figure shows the average AUC of optimal explanations for 50 randomly sampled neurons from the feature-extractor part of each one of the four ImageNet pre-trained models, conditioned by different values of parameter $\alpha$ in different colors. These graphs indicate that neurons generally tend to achieve the highest AUC for one individual class with $L=1$ and $\alpha=0$. The second row presents the distribution of AUC scores alongside the distribution of concept fractions $T$ for the INVERT explanations of length $L=5$, for each model. Here, we can observe a clear trade-off between the precision of the explanation in terms of AUC measure and concept size $T$.

4.2 Evaluating the Accuracy of Explanations

While it is generally challenging to obtain ground-truth explanations for the latent representations in Deep Neural Networks (DNNs), in Supervised Learning, the concepts of the output neurons are defined by the specific task. In the subsequent experiment, we compared the performance of INVERT and Network Dissection in accurately explaining neurons when the ground truth is known.

For this experiment, we employed 5 different models: 2 segmentation models and 3 classification models. For image segmentation, we employed the MaskRCNN ResNet50 FPN model [59], pre-trained on the MS COCO dataset [60] and evaluated on a subset of 24,237 images of MS COCO train2017, containing 80 distinct classes, and the FCN ResNet50 model [61], pre-trained on MS COCO and evaluated on a subset of MS COCO val2017, limited to the 20 categories found in the Pascal VOC dataset [62]. For classification, we employed ImageNet pre-trained ResNet18 [51], DenseNet161 [63], and GoogleNet [64], each with 1,000 output neurons, where each neuron corresponds to an individual class in the ImageNet dataset.

Figure 5: Comparison of the computational cost (in hours) of INVERT and the Compositional Explanations of Neurons method (CompExp) for varying formula lengths.

The outputs from the segmentation models were converted into pixel-wise confidence scores. These scores were arranged in the format $[N_{B},N_{c},H,W]$, where $N_{B}$ represents the number of images in a batch, and $N_{c}$ signifies the number of classes. Each value indicates the likelihood of a specific pixel belonging to a particular class. To aggregate multidimensional activations, the INVERT method used a max-pool operation.

All the classification models that were used had 1,000 one-dimensional output neurons. The evaluation process for both explanation methods was carried out using a subset of 20,000 images from the ImageNet-2012 validation dataset. For the Network Dissection method, which necessitates segmentation masks, these masks were generated from the bounding boxes included in the dataset. Both Network Dissection and INVERT methods were implemented using standard parameters.
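As a rough illustration of the mask preparation needed for an IoU-based method, the sketch below rasterizes a bounding box into a binary mask and computes the Intersection over Union of two binary masks. The box convention `(x_min, y_min, x_max, y_max)` with exclusive upper bounds, the function names, and the toy masks are assumptions of this sketch and are not taken from the Network Dissection implementation.

```python
from typing import List, Tuple

Mask = List[List[int]]  # H rows, W columns of 0/1 values


def bbox_to_mask(height: int, width: int,
                 box: Tuple[int, int, int, int]) -> Mask:
    """Fill the rectangle [x_min, x_max) x [y_min, y_max) with ones."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0
             for x in range(width)]
            for y in range(height)]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over Union of two binary masks of equal shape."""
    inter = sum(p & q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    union = sum(p | q for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return inter / union if union else 0.0


# A 3 x 4 image with one bounding box covering a 2 x 2 region.
concept_mask = bbox_to_mask(3, 4, (1, 0, 3, 2))

# Hypothetical thresholded activation map of a neuron on the same image.
activation_mask = [[0, 1, 1, 1],
                   [0, 0, 0, 0],
                   [0, 0, 0, 0]]
overlap = iou(concept_mask, activation_mask)  # 2 shared pixels / 5 in the union
```

By contrast, the label-based AUC similarity needs only one binary label per image and no mask of this kind.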

Table 1 presents the outcomes of the evaluation process. It is noteworthy that INVERT exhibits superior or equivalent performance to Network Dissection across all tasks. Importantly, INVERT can accurately identify concepts in image segmentation networks using only the labels of images, in comparison to the Network Dissection method that uses segmentation masks.

Table 1: A comparison of explanation accuracy between NetDissect and INVERT. The accuracy is computed by matching identified classes with the ground truth labels.

Model                 | Dataset  | NetDissect accuracy | INVERT accuracy
MaskRCNN ResNet50 FPN | MS COCO  | 95.06%              | 98.77%
FCN ResNet50          | MS COCO  | 95.24%              | 95.24%
ResNet18              | ImageNet | 19.2%               | 73.2%
GoogleNet             | ImageNet | 19.7%               | 82.2%
DenseNet161           | ImageNet | 19.1%               | 86.9%
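The accuracy reported in Table 1 amounts to the share of output neurons whose explanation matches the class the neuron was trained to detect. A minimal sketch, with a hypothetical function name and made-up labels:

```python
from typing import List


def explanation_accuracy(explanations: List[str],
                         ground_truth: List[str]) -> float:
    """Fraction of neurons whose explanation equals the ground-truth class."""
    if len(explanations) != len(ground_truth):
        raise ValueError("one explanation per neuron is required")
    matches = sum(e == g for e, g in zip(explanations, ground_truth))
    return matches / len(ground_truth)


# Toy example with four output neurons; the third explanation is wrong.
predicted = ["cat", "dog", "fox", "box"]
expected = ["cat", "dog", "wolf", "box"]
accuracy = explanation_accuracy(predicted, expected)
```

In this toy case three of the four neurons are labeled correctly, giving an accuracy of 0.75.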

Computational cost comparison

Methods such as Network Dissection and Compositional Explanations of Neurons (CompExp) have been observed to exhibit computational challenges, mainly due to the operations on high-dimensional masks. While CompExp and INVERT share a beam-search optimization mechanism, the proposed approach requires fewer computational resources since logical operations are performed on binary labels instead of masks. Figure 5 showcases the running time of applying INVERT and Compositional Explanations for explaining 2048 neurons in layer 4 of the FCOS-ResNet50-FPN model [65] pre-trained on the MS COCO dataset [60] on a single Tesla V100S-PCIE-32GB GPU. The time comparison across varying formula lengths demonstrates that INVERT is more computationally efficient, which leads to reduced running time and computational costs.

5 Applications

In this section, we outline some specific uses of INVERT, including auditing models for spurious correlations, explaining circuits within the models, and manually creating circuits with desired characteristics.

5.1 Finding Spurious Correlations by Integrating New Concepts

Figure 6: Difference of INVERT ($L=1$, $\alpha=0$) explanations of Neuron 154 in the Average Pooling layer of the ImageNet-trained ResNet18 model before (top) and after (bottom) integration of new concepts to the dataset.

Due to the widespread use of Deep Neural Networks across various domains, it is crucial to investigate whether these models display spurious correlations, backdoors, or base their decisions on undesired concepts. Using the known spurious dependency of ImageNet-trained models on watermarks written in Chinese [19, 66, 67], we illustrate that INVERT provides a straightforward method to test existing hypotheses regarding the model’s dependency on specific features and allows for the identification of the particular neurons accountable for undesirable behavior.

To illustrate this, we augmented the ImageNet dataset $\mathcal{D}_{I}$ with an additional dataset, $\mathcal{D}_{T}$, comprising 100 images. This new dataset contains 50 images for each of two distinct concepts: Chinese textual watermarks and Latin textual watermarks (see Appendix A.4). We created examples of these classes by randomly selecting images from the ImageNet dataset and overlaying them with randomly generated textual watermarks. Figure 6 depicts the change in the explanation process conducted on the original dataset and its expanded version. Since the original dataset did not include the concept of watermarked images, the label “African chameleon” was attributed to the representation. However, after augmenting the dataset with two new classes, the explanation shifted to the “Chinese text” concept, with the AUC measure increasing to 0.99. This demonstrates the capability of INVERT to pinpoint sources of spurious behavior within the latent representations of the neural network.

5.2 Explaining Circuits

INVERT can be employed to explain circuits, i.e., computational subgraphs that demonstrate the information flow within the model [41]. The analysis of circuits enables us to understand complex global decision-making strategies by examining how features transform from one layer to another. Furthermore, this approach can be employed for glocal explanations [68]: a local explanation of a particular data point can be decomposed into local explanations for individual neurons in the preceding layers, which are in turn explained by INVERT.


Figure 7: The figure illustrates the “carton” circuit within the ResNet18 model. The left part of the figure showcases the three most significant neurons (in terms of the weight of linear connection) and their corresponding INVERT explanation linked to the class logit “carton”. The right part of the figure demonstrates how the local explanation from the class logit can be decomposed into individual explanations of individual neurons from the preceding layer.

To illustrate this, we computed INVERT explanations ($L=3, \alpha=0.002$) for all neurons in the average pooling layer of ResNet18, based on the augmented dataset from the preceding section. In ResNet18, the neurons in the Average Pooling layer have a linear connection to the output class logits. Figure 7 (left) illustrates the circuit of the three most significant neurons (based on the weight of the linear connection) linked to the “carton” output logit. It can be observed that this class depends on Neuron 296, a “box” detector, and Neuron 154, which identifies the “Chinese text” concept. Furthermore, the right side of Figure 7 depicts the decomposition of local explanations: given an image of a carton box, we can dissect the Grad-CAM [69] local explanation of the “carton” class logit into a composition of local explanations from individual neurons. Neuron 296 assigns relevance to the box, while Neuron 154 assigns relevance solely to the watermark present in the image. More illustrations of different circuits can be found in Appendix A.8.
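Because the class logits are linear in the average-pooling activations, both the selection of the most significant neurons and a scalar analogue of the decomposition can be sketched directly. The snippet below is an illustration under assumptions: the weights, bias, and activations are random placeholders rather than real ResNet18 parameters, `class_idx` is an arbitrary index, and the per-neuron contributions $w_i a_i$ are a simplified stand-in for the spatial Grad-CAM maps in Figure 7.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_neurons = 1000, 512  # ResNet18: 512 pooled features, 1000 logits

# Random placeholders for the final linear layer and one image's activations.
W = rng.normal(size=(n_classes, n_neurons))
b = rng.normal(size=n_classes)
a = rng.random(n_neurons)  # non-negative, as after ReLU and average pooling

def top_neurons(W, class_idx, k=3):
    """Indices of the k neurons with the largest weight to the class logit."""
    return np.argsort(W[class_idx])[::-1][:k]

class_idx = 4  # hypothetical index of the class of interest
top3 = top_neurons(W, class_idx)

# The logit decomposes exactly into per-neuron contributions plus the bias.
contributions = W[class_idx] * a
logit = contributions.sum() + b[class_idx]

for i in top3:
    print(f"neuron {i}: weight={W[class_idx, i]:.3f}, contribution={contributions[i]:.3f}")
```

Each selected neuron can then be paired with its INVERT explanation to read the circuit as a weighted combination of labeled concepts.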

5.3 Handcrafting Circuits

In this section, we demonstrate that knowledge of which concepts individual neurons detect can be used to combine those neurons into manually designed circuits that detect novel concepts. Just as compositional concepts are formed using logical operators, we employed fuzzy logic operators between neurons to construct meaningful handcrafted circuits with desired properties.

In contrast to conventional logic, fuzzy logic operators allow for the degree of membership to vary from 0 to 1 [70]. For this experiment, we employed the Gödel norm, which demonstrated the best performance among the fuzzy logic operators considered (see Appendix A.5 for details). For two functions $f, g: \mathbb{D} \longrightarrow [0,1]$, the Gödel AND (T-norm) operator is defined as $\min(f,g)$ and the OR (T-conorm) is defined as $\max(f,g)$. Negation is performed by the $1-f$ operation.
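As a small illustration of these definitions, the three operators can be written as element-wise array operations. The function names below are our own and do not come from the paper's code.

```python
import numpy as np

def godel_and(*fs):
    """Gödel T-norm: element-wise minimum of membership values in [0, 1]."""
    return np.min(np.stack([np.asarray(f, dtype=float) for f in fs]), axis=0)

def godel_or(*fs):
    """Gödel T-conorm: element-wise maximum of membership values in [0, 1]."""
    return np.max(np.stack([np.asarray(f, dtype=float) for f in fs]), axis=0)

def fuzzy_not(f):
    """Negation: 1 - f."""
    return 1.0 - np.asarray(f, dtype=float)

f = np.array([0.2, 0.9])
g = np.array([0.7, 0.4])
print(godel_and(f, g))  # element-wise min
print(godel_or(f, g))   # element-wise max
print(fuzzy_not(f))     # 1 - f
```

On crisp inputs (values exactly 0 or 1) these operators reduce to the usual Boolean AND, OR, and NOT.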

We utilized the ImageNet-trained ViT L 16 model [58], specifically 1024 representations from the feature-extractor layer. The output of each of these representations was mapped to the range $[0,1]$ by first normalizing the output based on its mean and standard deviation across the ImageNet 2012 validation dataset, and then applying the Sigmoid transformation. In this experiment, for each of the 1473 ImageNet atomic concepts (which include 1000 classes and 473 hypernyms), we identified the neuron from the feature-extractor layer that showed the highest AUC similarity. For instance, for the concept “boat”, Neuron 61 exhibited the highest AUC similarity (denoted as $f_{\text{boat}}$), for the concept “house”, Neuron 899 showed the highest AUC similarity (denoted as $f_{\text{house}}$), and for the concept “lakeside”, Neuron 575 showed the highest AUC similarity (denoted as $f_{\text{lakeside}}$).
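The mapping to $[0,1]$ can be sketched as a per-neuron standardization followed by a sigmoid. This is a minimal illustration: the function name, the small `eps` constant guarding against zero variance, and the toy input are our assumptions.

```python
import numpy as np

def to_unit_interval(acts, mean=None, std=None, eps=1e-12):
    """Standardize each neuron (column) and squash with a sigmoid.

    acts: array of shape (n_samples, n_neurons). If `mean` and `std` are
    not given, they are estimated from `acts` itself; in the setting of the
    text they would come from a reference dataset.
    """
    acts = np.asarray(acts, dtype=float)
    mean = acts.mean(axis=0) if mean is None else mean
    std = acts.std(axis=0) if std is None else std
    z = (acts - mean) / (std + eps)
    return 1.0 / (1.0 + np.exp(-z))

# Toy activations for two neurons with very different scales.
acts = np.array([[0.0, 100.0],
                 [2.0, 300.0],
                 [4.0, 500.0]])
unit = to_unit_interval(acts)
print(unit)
```

Because both steps are strictly increasing in the raw activation, the transformation does not change the AUC of any individual neuron; it only places different neurons on a common scale so that min and max across neurons are meaningful.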

Further, we manually constructed six different compositional formulas using concepts from ImageNet that were designed to resemble different concepts from the Places365 [71] dataset. For example, for the “boathouse” class from Places365, we assumed that images from this class would likely include “boat”, “house”, and water, represented by the concept “lakeside”. As such, we constructed the compositional formula “boat” AND “house” AND “lakeside” using concepts from the ImageNet dataset. Finally, using the neurons that detect these concepts (e.g., $f_{\text{boat}}, f_{\text{house}}, f_{\text{lakeside}}$), we manually constructed the circuits using Gödel fuzzy logic operators. That is, for the “boathouse” example, the final circuit was formed as $g(x)=\min(f_{\text{boat}}(x), f_{\text{house}}(x), f_{\text{lakeside}}(x))$ using the Gödel AND operator. The performance of the resulting circuits was evaluated on the Places365 dataset in terms of AUC similarity with the concept. In essence, by labeling representations using the ImageNet dataset and manually building a circuit guided by intuition, we evaluated how well this newly created function detects a class in a binary classification task on a different dataset.
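The construction and evaluation of such a circuit can be sketched on synthetic data. Everything below is an assumption made for illustration: the three "detectors" are noisy simulated neurons rather than ViT representations, the target label is defined as the conjunction of the three concepts, and `auc` is a hypothetical helper for the AUROC.

```python
import numpy as np

def auc(scores, labels):
    """AUROC of `scores` as a detector for binary `labels` (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 2000
concepts = ("boat", "house", "lakeside")

# Synthetic ground truth: each concept is present in about half the images.
present = {c: rng.random(n) < 0.5 for c in concepts}
# Toy target: the scene class holds when all three concepts are present.
target = present["boat"] & present["house"] & present["lakeside"]

# Noisy simulated detectors with outputs in [0, 1].
f = {c: sigmoid(rng.normal(np.where(p, 2.0, -2.0), 1.0)) for c, p in present.items()}

# Handcrafted circuit: Gödel AND (element-wise minimum) of the detectors.
circuit = np.min(np.stack([f[c] for c in concepts]), axis=0)

auc_single = {c: auc(f[c], target) for c in concepts}
auc_circuit = auc(circuit, target)
for c in concepts:
    print(f"AUC of f_{c} alone: {auc_single[c]:.3f}")
print(f"AUC of min-circuit:  {auc_circuit:.3f}")
```

In this toy setting the conjunction outperforms every individual detector, since a single detector cannot distinguish the target from images that contain only its own concept.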


Figure 8: The figure presents four distinct handcrafted circuits, created from the latent representations from the ImageNet-trained ViT L 16 feature-extractor layer to detect classes from the Places365 dataset. For each neuron, or combination of neurons, we provide the Area Under the Receiver Operating Characteristic (AUROC) score for the Places365 concept in a binary classification task, distinguishing between the presence and absence of this concept.

Figure 8 illustrates the “boathouse” example and three other handcrafted circuits derived from ViT representations (the other two circuits can be found in Appendix 16). We found that after performing this manipulation, the AUC performance in detecting the Places365 class improved compared to the performance of each individual neuron. This example shows that by understanding the abstractions behind previously opaque latent representations, we can potentially construct meaningful circuits and utilize the symbolic properties of latent representations. In Appendix A.6, we further demonstrate that when labels of the target dataset overlap or are similar to the dataset used for explanation, fine-tuning of the model can be achieved by simply employing representations with explanations matching the target labels.

6 Discussion and Conclusion

In our work, we introduced the Inverse Recognition (INVERT) method, a novel approach for interpreting latent representations in Deep Neural Networks. INVERT efficiently links neurons with compositional concepts using an interpretable similarity metric and offers a statistical significance test to gauge the confidence of the resulting explanation. We demonstrated the wide-ranging utility of our method, including its capability for model auditing to identify spurious correlations, explaining circuits within models, and revealing symbolic-like properties in connectionist representations.

While INVERT mitigates the need for image segmentation masks, it still relies on a labeled dataset for explanations. In future research, we plan to address this dependency. Additionally, we will explore different similarity measures between neurons and explanations, and investigate new ways to compose human-understandable concepts.

The widespread use of Deep Neural Networks across various fields underscores the importance of developing reliable and transparent intelligent systems. We believe that INVERT will contribute to advancements in Explainable AI, promoting more understandable AI systems.

Acknowledgements

This work was partly funded by the German Ministry for Education and Research (BMBF) through the project Explaining 4.0 (ref. 01IS200551). Shinichi Nakajima was supported by the German Ministry for Education and Research (BMBF) as BIFOLD - Berlin Institute for the Foundations of Learning and Data under the grant BIFOLD23B. Marius Kloft acknowledges support by the Carl-Zeiss Foundation, the DFG awards KL 2698/2-1, KL 2698/5-1, KL 2698/6-1, and KL 2698/7-1, and the BMBF awards 03|B0770E and 01|S21010C.

References

  • [1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • [2] Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Unmasking clever hans predictors and assessing what machines really learn. Nature communications, 10:1096, 2019.
  • [3] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
  • [4] Kirill Bykov, Laura Kopf, and Marina M.-C. Höhne. Finding spurious correlations with function-semantic contrast analysis. In Luca Longo, editor, Explainable Artificial Intelligence, pages 549–572, Cham, 2023. Springer Nature Switzerland.
  • [5] Laleh Seyyed-Kalantari, Haoran Zhang, Matthew B. A. McDermott, Irene Y. Chen, and Marzyeh Ghassemi. Underdiagnosis Bias of Artificial Intelligence Algorithms Applied to Chest Radiographs in Under-Served Patient Populations. Nature Medicine, 27(12):2176–2182, December 2021.
  • [6] Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Zou, and Aylin Caliskan. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale, November 2022.
  • [7] Oliver Willers, Sebastian Sudholt, Shervin Raafatnia, and Stephanie Abrecht. Safety concerns and mitigation approaches regarding the use of deep learning in safety-critical perception tasks. In Computer Safety, Reliability, and Security. SAFECOMP 2020 Workshops: DECSoS 2020, DepDevOps 2020, USDAI 2020, and WAISE 2020, Lisbon, Portugal, September 15, 2020, Proceedings 39, pages 336–350. Springer, 2020.
  • [8] Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller. Explainable AI: interpreting, explaining and visualizing deep learning, volume 11700. Springer Nature, 2019.
  • [9] Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA), pages 80–89. IEEE, 2018.
  • [10] Feiyu Xu, Hans Uszkoreit, Yangzhou Du, Wei Fan, Dongyan Zhao, and Jun Zhu. Explainable ai: A brief survey on history, research areas, approaches and challenges. In Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9–14, 2019, Proceedings, Part II 8, pages 563–574. Springer, 2019.
  • [11] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6541–6549, 2017.
  • [12] Jesse Mu and Jacob Andreas. Compositional explanations of neurons. Advances in Neural Information Processing Systems, 33:17153–17163, 2020.
  • [13] Tuomas Oikarinen and Tsui-Wei Weng. Clip-dissect: Automatic description of neuron representations in deep vision networks. arXiv preprint arXiv:2204.10965, 2022.
  • [14] Adriano Lucieri, Muhammad Naseer Bajwa, Stephan Alexander Braun, Muhammad Imran Malik, Andreas Dengel, and Sheraz Ahmed. Exaid: A multimodal explanation framework for computer-aided diagnosis of skin lesions. Computer Methods and Programs in Biomedicine, 215:106620, 2022.
  • [15] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
  • [16] Judy Borowski, Roland Simon Zimmermann, Judith Schepers, Robert Geirhos, Thomas SA Wallis, Matthias Bethge, and Wieland Brendel. Natural images are more informative for interpreting cnn activations than state-of-the-art synthetic feature visualizations. In NeurIPS 2020 Workshop SVRHM, 2020.
  • [17] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.
  • [18] Thomas FEL, Thibaut Boissin, Victor Boutin, Agustin Martin Picard, Paul Novello, Julien Colin, Drew Linsley, Tom ROUSSEAU, Remi Cadene, Lore Goetschalckx, Laurent Gardes, and Thomas Serre. Unlocking feature visualization for deep network with MAgnitude constrained optimization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [19] Kirill Bykov, Mayukh Deb, Dennis Grinwald, Klaus Robert Muller, and Marina MC Höhne. DORA: Exploring outlier representations in deep neural networks. Transactions on Machine Learning Research, 2023.
  • [20] Gabriel Goh, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. Multimodal neurons in artificial neural networks. Distill, 6(3):e30, 2021.
  • [21] Sajid Ali, Tamer Abuhmed, Shaker El-Sappagh, Khan Muhammad, Jose M Alonso-Moral, Roberto Confalonieri, Riccardo Guidotti, Javier Del Ser, Natalia Díaz-Rodríguez, and Francisco Herrera. Explainable artificial intelligence (xai): What we know and what is left to attain trustworthy artificial intelligence. Information Fusion, 99:101805, 2023.
  • [22] Daniel Vale, Ali El-Sharif, and Muhammed Ali. Explainable artificial intelligence (xai) post-hoc explainability methods: Risks and limitations in non-discrimination law. AI and Ethics, 2(4):815–826, 2022.
  • [23] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
  • [24] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
  • [25] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
  • [26] Rabia Saleem, Bo Yuan, Fatih Kurugollu, Ashiq Anjum, and Lu Liu. Explaining deep neural networks: A survey on the global interpretation methods. Neurocomputing, 2022.
  • [27] Vijay Arya, Rachel KE Bellamy, Pin-Yu Chen, Amit Dhurandhar, Michael Hind, Samuel C Hoffman, Stephanie Houde, Q Vera Liao, Ronny Luss, Aleksandra Mojsilović, et al. One explanation does not fit all: A toolkit and taxonomy of ai explainability techniques. arXiv preprint arXiv:1909.03012, 2019.
  • [28] David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in the cat’s striate cortex. The Journal of physiology, 148(3):574, 1959.
  • [29] Vernon B Mountcastle. Modality and topographic properties of single neurons of cat’s somatic sensory cortex. Journal of neurophysiology, 20(4):408–434, 1957.
  • [30] John O’Keefe and Jonathan Dostrovsky. The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat. Brain research, 1971.
  • [31] Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.
  • [32] Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.
  • [33] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.
  • [34] N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, Y Bai, A Chen, T Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
  • [35] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html (Date accessed: 14.05.2023), 2023.
  • [36] Johanna Vielhaben, Stefan Bluecher, and Nils Strodthoff. Multi-dimensional concept discovery (mcd): A unifying framework with completeness guarantees. Transactions on Machine Learning Research, 2023.
  • [37] Laura O’Mahony, Vincent Andrearczyk, Henning Müller, and Mara Graziani. Disentangling neuron representations with concept vectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3769–3774, 2023.
  • [38] Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. Towards automatic concept-based explanations. Advances in neural information processing systems, 32, 2019.
  • [39] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pages 2668–2677. PMLR, 2018.
  • [40] Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022.
  • [41] Nick Cammarata, Shan Carter, Gabriel Goh, Chris Olah, Michael Petrov, Ludwig Schubert, Chelsea Voss, Ben Egan, and Swee Kiat Lim. Thread: Circuits. Distill, 2020. https://distill.pub/2020/circuits.
  • [42] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Advances in neural information processing systems, 29, 2016.
  • [43] Stephen Casper, Yuxiao Li, Jiawei Li, Tong Bu, Kevin Zhang, Kaivalya Hariharan, and Dylan Hadfield-Menell. Red teaming deep neural networks with feature synthesis tools. arXiv preprint arXiv:2302.10894, 2023.
  • [44] Dilyara Bareeva, Marina M. C. Höhne, Alexander Warnecke, Lukas Pirch, Klaus-Robert Müller, Konrad Rieck, and Kirill Bykov. Manipulating feature visualizations with gradient slingshots, 2024.
  • [45] Robert Geirhos, Roland S Zimmermann, Blair Bilodeau, Wieland Brendel, and Been Kim. Don’t trust your eyes: on the (un) reliability of feature visualizations. arXiv preprint arXiv:2306.04719, 2023.
  • [46] Jonathan Marty, Eugene Belilovsky, and Michael Eickenberg. Adversarial attacks on feature visualization methods. In NeurIPS ML Safety Workshop, 2022.
  • [47] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B Tenenbaum, William T Freeman, and Antonio Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597, 2018.
  • [48] Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2022.
  • [49] Neha Kalibhat, Shweta Bhardwaj, C Bayan Bruss, Hamed Firooz, Maziar Sanjabi, and Soheil Feizi. Identifying interpretable subspaces in image representations. 2023.
  • [50] Pádraig Cunningham, Matthieu Cord, and Sarah Jane Delany. Supervised learning. In Machine learning techniques for multimedia, pages 21–49. Springer, 2008.
  • [51] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [52] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [53] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
  • [54] James A Hanley and Barbara J McNeil. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology, 143(1):29–36, 1982.
  • [55] Corinna Cortes and Mehryar Mohri. Confidence intervals for the area under the roc curve. Advances in neural information processing systems, 17, 2004.
  • [56] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [57] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
  • [58] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [59] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [60] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
  • [61] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [62] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
  • [63] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [64] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [65] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [66] Kirill Bykov, Klaus-Robert Müller, and Marina M-C Höhne. Mark my words: Dangers of watermarked images in imagenet. arXiv preprint arXiv:2303.05498, 2023.
  • [67] Zhiheng Li, Ivan Evtimov, Albert Gordo, Caner Hazirbas, Tal Hassner, Cristian Canton Ferrer, Chenliang Xu, and Mark Ibrahim. A whac-a-mole dilemma: Shortcuts come in multiples where mitigating one amplifies others. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20071–20082, 2023.
  • [68] Reduan Achtibat, Maximilian Dreyer, Ilona Eisenbraun, Sebastian Bosse, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. From attribution maps to human-understandable explanations through concept relevance propagation. Nature Machine Intelligence, 5(9):1006–1019, 2023.
  • [69] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [70] Emile van Krieken, Erman Acar, and Frank van Harmelen. Analyzing differentiable fuzzy logic operators. Artificial Intelligence, 302:103602, 2022.
  • [71] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [72] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
  • [73] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale celebfaces attributes (celeba) dataset. Retrieved August, 15(2018):11, 2018.
  • [74] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
  • [75] Fei-Fei Li, Marco Andreeto, Marc’Aurelio Ranzato, and Pietro Perona. Caltech 101, apr 2022.
  • [76] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Appendix: Labeling Neural Representations with Inverse Recognition

Appendix A Appendix

A.1 Broader Impact

Our proposed INVERT method contributes to enhancing the transparency and safety of Deep Neural Networks. By providing human-understandable explanations for neurons in black-box models, our approach offers valuable insights into their internal operations. Moreover, our method is able to identify potentially spurious representations. An important advantage of our method is its notable reduction in computational cost compared to previous approaches. This reduction not only improves efficiency but also reduces the environmental impact associated with heavy GPU usage.

It is important to note that we cannot make definitive claims regarding specific groups of people benefiting from or being disadvantaged by our method. The general applicability and potential implications of our approach should be explored further and with caution.

A.2 Prior work

Let us consider a function $g: \mathbb{D} \rightarrow \mathbb{R}^{k \times k}$ that represents a convolutional neuron within a model and produces activation maps of dimensions $k \times k$, along with a concept $c \in \mathbb{C}$. Both the Network Dissection [11] and Compositional Explanations of Neurons [12] methods make use of the Intersection over Union (IoU) similarity metric to measure the degree of correlation between a function and a concept. A prerequisite for these methodologies is the availability of segmentation masks of concepts, meaning that for every concept $c \in \mathbb{C}$ there exists a corresponding function $M_c: \mathbb{D} \rightarrow \{0,1\}^{h \times w}$, which generates a binary mask for the specific concept, of the same size as the original input.

To evaluate the similarity between function $g$ and concept $c$, the multi-dimensional outputs from $g$ are subjected to thresholding based on neuron-specific percentiles (i.e., values above chosen percentiles are converted to 1 and the remaining to 0), and upscaled to match the dimensions of the original image. We can define the resulting function that produces binary masks of the same size as the input as $G: \mathbb{D} \rightarrow \{0,1\}^{h \times w}$. The final similarity (IoU) score between $g$ and $c$ can be computed as the Intersection over Union score between concept masks $M_c$ and function $G$:

$$d_{\text{IoU}}(g,c)=\frac{\sum_{x\in\mathcal{D}}\mathbf{1}\left(M_{c}(x)\cap G(x)\right)}{\sum_{x\in\mathcal{D}}\mathbf{1}\left(M_{c}(x)\cup G(x)\right)}. \tag{5}$$
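For concreteness, the IoU score in Eq. (5) can be sketched as follows. This is an illustrative simplification rather than the reference implementation of [11] or [12]: the function names are hypothetical, upscaling uses nearest-neighbor repetition (which requires $h$ and $w$ to be multiples of $k$), and the threshold is passed in explicitly instead of being derived from a neuron-specific percentile.

```python
import numpy as np

def binarize_and_upsample(act_maps, threshold, out_hw):
    """Threshold (n, k, k) activation maps and upscale them to (n, h, w).

    Uses nearest-neighbor upsampling for simplicity (an assumption).
    """
    act_maps = np.asarray(act_maps, dtype=float)
    _, k, _ = act_maps.shape
    h, w = out_hw
    if h % k or w % k:
        raise ValueError("h and w must be multiples of k in this sketch")
    binary = act_maps > threshold
    return binary.repeat(h // k, axis=1).repeat(w // k, axis=2)

def iou(concept_masks, neuron_masks):
    """Dataset-level IoU: pixel counts are summed over all images first."""
    concept_masks = np.asarray(concept_masks, dtype=bool)
    neuron_masks = np.asarray(neuron_masks, dtype=bool)
    inter = np.logical_and(concept_masks, neuron_masks).sum()
    union = np.logical_or(concept_masks, neuron_masks).sum()
    return inter / union if union > 0 else 0.0

# One image, a 2x2 activation map, and a 4x4 concept mask.
act = np.array([[[0.0, 1.0],
                 [0.0, 0.0]]])
G = binarize_and_upsample(act, threshold=0.5, out_hw=(4, 4))
M = np.zeros((1, 4, 4), dtype=bool)
M[0, 0, :] = True  # concept occupies the top row
score = iou(M, G)
print(score)
```

In practice the threshold would be chosen per neuron, for example with `np.percentile` over all of that neuron's activations in the dataset.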

In Section 4.2, the Compositional Explanations of Neurons method was applied using a $7 \times 7$ input map for each feature. In contrast, INVERT reduces each map to a scalar value by averaging over the input map.

A.3 INVERT algorithm

Given a neural representation $f: \mathbb{D} \longrightarrow \mathbb{R}$, a dataset $\mathcal{D} \subset \mathbb{D}$, a set of atomic concepts $C \in \mathbb{C}$, and a vector $\mathcal{C} \in \mathbb{C}^{d}$, the INVERT approach seeks to identify a compositional concept $\varphi^{*}$, formed as a logical operation on the concepts, that optimizes the AUC similarity $d(f, \varphi^{*}(\mathcal{C}))$. For this purpose, we utilized an optimization process similar to that of the CompExpl methodology [12], employing beam search to find the optimal compositional concept.

This method requires the configuration of certain parameters, namely the predetermined formula length $L \in \mathbb{N}$, the beam size $B \in \mathbb{N}$, and additionally, the parameters $\alpha, \beta$. Beam search iteratively combines concepts, starting with the atomic concepts (primitives) from $C$. At every iteration of the process, the top $B$ best-performing compositional concepts are selected, and all feasible formulas are computed with primitives (i.e., atomic concepts). Subsequently, only the top $B$ best-performing concepts are selected, and the process continues until the formula reaches the predetermined length.

In detail, we first define a set of primitives $\bar{\Phi}$: a set of compositional concepts that correspond to the concepts in $C$ and their negations. The set $\bar{\Phi}$ comprises $2k$ compositional concepts, where $k$ is the number of atomic concepts in $C$, with each corresponding to either a base concept or its negation. Next, all $2k$ concepts are evaluated in terms of AUC similarity with the given function, and the top $B$ best-performing compositional concepts that satisfy $\alpha \leq T(\varphi(\mathcal{C})) \leq \beta$ are selected, forming the set $\Phi^{*}$ with $|\Phi^{*}| = B$, referred to as the beam. These are the top $B$ best-performing compositional concepts of length $1$ that satisfy the requisite condition on their positive fraction in the dataset. Subsequently, the following operations are performed iteratively until the predetermined formula length $L$ is reached:

  1. Each of the $B$ compositional concepts in the beam $\Phi^{*}$ is combined with all primitives (concepts from $\bar{\Phi}$) using either the AND or the OR operation, thereby increasing the formula length by 1 and resulting in a total of $4Bk$ new formulas.

  2. All newly generated formulas are evaluated based on their similarity to the representation, and the beam $\Phi^{*}$ is updated to include the top $B$ performing formulas that satisfy the condition $\alpha\leq T(\varphi(\mathcal{C}))\leq\beta$.

Upon reaching the predetermined formula length $L$, the beam search procedure concludes by identifying the compositional concept $\varphi^{*}$ with the highest observed AUC.
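The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation: concepts are represented as boolean label lists over the dataset, $T$ is taken to be the fraction of positive samples, the AUC is computed by exhaustive pairwise comparison, and the names `auc` and `beam_search` are hypothetical.

```python
from itertools import product


def auc(scores, labels):
    """AUC of `scores` as a detector of boolean `labels` (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return 0.5  # assumption: an undefined AUC is treated as chance level
    wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def beam_search(f_out, concepts, L=2, B=3, alpha=0.0, beta=1.0):
    """Return (formula, AUC) of the best formula found, or None.

    f_out    -- activations of the representation on the dataset
    concepts -- dict: atomic concept name -> list of bools over the dataset
    """
    n = len(f_out)
    # Primitives: every atomic concept and its negation (2k in total).
    primitives = {}
    for name, lab in concepts.items():
        primitives[name] = list(lab)
        primitives["NOT " + name] = [not y for y in lab]

    def top_b(candidates):
        # Keep formulas whose positive fraction lies in [alpha, beta],
        # then return the B formulas with the highest AUC.
        scored = [(auc(f_out, lab), formula, lab)
                  for formula, lab in candidates.items()
                  if alpha <= sum(lab) / n <= beta]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:B]

    beam = top_b(primitives)
    for _ in range(L - 1):
        candidates = {}
        for _score, formula, lab in beam:
            for p_name, p_lab in primitives.items():
                # Two operators times 2k primitives per beam entry: 4Bk formulas.
                candidates[f"({formula}) AND {p_name}"] = [a and b for a, b in zip(lab, p_lab)]
                candidates[f"({formula}) OR {p_name}"] = [a or b for a, b in zip(lab, p_lab)]
        new_beam = top_b(candidates)
        if not new_beam:
            break  # no candidate satisfies the constraint; keep the old beam
        beam = new_beam
    if not beam:
        return None
    best_auc, best_formula, _labels = beam[0]
    return best_formula, best_auc
```

On a toy dataset where two concepts each cover part of the highly activating samples, the search is expected to prefer their disjunction over either concept alone, since the disjunction separates high from low activations better.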

A.4 Integrating Datasets from Different Sources

Let $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ be two separate datasets. Each of these datasets is linked to its own set of concepts, denoted $C_{1}$ and $C_{2}$ respectively. By merging these datasets, we can form a consolidated dataset $\tilde{\mathcal{D}}=\mathcal{D}_{1}\cup\mathcal{D}_{2}$. This unified dataset encompasses a combined set of concepts, denoted $\tilde{C}=C_{1}\cup C_{2}$.

The key requirement for this integration is mutual definition: the concepts in $C_{1}$ should be defined within the dataset $\mathcal{D}_{2}$, and conversely, the concepts in $C_{2}$ should be defined within the dataset $\mathcal{D}_{1}$. While this does necessitate supplementary labeling, it becomes straightforward when it is evident that the concepts from both datasets do not overlap semantically. For instance, flower concepts from Oxford Flowers102 [72] and faces from CelebA [73] can be effortlessly combined. This is accomplished by designating the output of concepts within the non-native dataset as negative.
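As a minimal sketch of this labeling rule, the snippet below merges two labeled datasets under the assumption that their concept sets are disjoint. The function name `merge_datasets`, the sample format, and the concept names used in the usage note are hypothetical and serve only as illustration.

```python
def merge_datasets(samples_1, concepts_1, samples_2, concepts_2):
    """Merge two labeled datasets over the union of their concept sets.

    Each sample is a pair (item, positives), where `positives` is the set of
    concepts from the sample's native dataset that apply to the item.
    """
    overlap = set(concepts_1) & set(concepts_2)
    if overlap:
        # Assumption: the shortcut is only valid for disjoint concept sets.
        raise ValueError(f"concept sets overlap: {sorted(overlap)}")
    all_concepts = list(concepts_1) + list(concepts_2)
    merged = []
    for item, positives in list(samples_1) + list(samples_2):
        # Concepts from the non-native dataset never occur in `positives`,
        # so they are labeled negative automatically.
        labels = {c: (c in positives) for c in all_concepts}
        merged.append((item, labels))
    return all_concepts, merged
```

For example, merging a flower sample labeled `{"rose"}` with a face sample labeled `{"smiling"}` yields three-way label dictionaries in which the flower sample is negative for `"smiling"` and the face sample is negative for every flower concept.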

A.5 Comparing Fuzzy Logic operators

Fuzzy logic operators [70] serve as essential instruments within the domain of fuzzy logic, a mathematical construct designed for modeling and handling data that is imprecise or vague. This contrasts with conventional logic where an element strictly either belongs to a set or not; fuzzy logic allows for the degree of membership to vary from 0 to 1, thereby allowing for partial membership.

In this experiment, our objective was to compare different fuzzy logic operators and examine their behavior with respect to the proposed AUC metric. To this end, we employed four distinct pre-trained deep learning image classification models: AlexNet [74], DenseNet161 [63], EfficientNet B4 [57], and ViT 16 L [58]. For each model, we focused on the 1000 neural representations in the output logit (pre-SoftMax) layer, for which the “ground-truth” concept is known: the corresponding ImageNet class. To test the fuzzy logic operators, we mapped the output of each representation to the interval $[0,1]$ by normalizing it with its mean and standard deviation across the ImageNet dataset and then applying a Sigmoid transformation. We tested four different fuzzy logic operators, specifically Gödel, Product, Łukasiewicz, and Yager with parameter $p=2$, as listed in Table 2.
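The mapping to $[0,1]$ can be sketched as follows. This is an illustrative assumption about the details: the population standard deviation is used, a constant representation is mapped to 0.5, and the function name `to_unit_interval` is hypothetical.

```python
import math


def to_unit_interval(outputs):
    """Z-score the outputs of one representation, then apply a sigmoid."""
    n = len(outputs)
    mean = sum(outputs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in outputs) / n)
    if std == 0.0:
        std = 1.0  # assumption: constant outputs all map to 0.5
    return [1.0 / (1.0 + math.exp(-(x - mean) / std)) for x in outputs]
```

Because both steps are strictly increasing, the ordering of the activations is preserved, so the AUC of a single representation is unaffected; the transformation matters only once several representations are combined with fuzzy operators.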

For performance evaluation, we generated random compositional concepts of a given length and computed the AUC similarity between these concepts and the functions obtained by applying the fuzzy logic operators to the corresponding representations. For instance, given the random compositional concept $\varphi=c_{i}\text{ OR }c_{j}$, we derive a compositional representation with each of the four examined methods (e.g., the Gödel operator produces the function $h_{G}=\max(f_{i},f_{j})$). These compositional representations are then evaluated in terms of AUC similarity with the compositional concept, i.e., $d(h_{G},\varphi)$.

We conducted the evaluation in two modes, that is, assessing the performance of the OR (T-conorm) operator and the performance of the AND (T-norm) operator. For each mode, we assembled 1000 random compositional concepts by sampling $L$ random concepts without replacement and calculated the AUC between the compositional concepts and the corresponding functions. Note that for the second mode, AND (T-norm), random compositional concepts were assembled using the AND NOT operation, given the mutual exclusivity of ImageNet labels.



Figure 9: Average AUC similarity between random compositional OR concepts and corresponding compositional representations employing various Fuzzy logic operators (Higher is better) evaluated across four distinct models.


Figure 10: Average AUC similarity between random compositional AND NOT concepts and corresponding compositional representations employing various Fuzzy logic operators (Higher is better) evaluated across four distinct models.

Figures 9 and 10 depict the mean AUC similarity between random compositional concepts of varying lengths and the corresponding compositional representations, which were assembled using four distinct fuzzy logic operators. From these figures, it becomes evident that Gödel fuzzy logic operators demonstrate the most significant robustness to the length of the formula, consistently attaining superior AUC in contrast to other operators. Consequently, we can infer that Gödel’s operator emerges as the optimal choice for implementing fuzzy logic operations on neural representations.

Table 2: List of different fuzzy operators
| Operator | NOT$(a)$ | AND$(a,b)$ (T-norm) | OR$(a,b)$ (T-conorm) |
|---|---|---|---|
| Gödel | $1-a$ | $\min(a,b)$ | $\max(a,b)$ |
| Product | | $a\cdot b$ | $a+b-a\cdot b$ |
| Łukasiewicz | | $\max(a+b-1,0)$ | $\min(a+b,1)$ |
| Yager, $p=2$ | | $\max(1-((1-a)^{2}+(1-b)^{2})^{\frac{1}{2}},0)$ | $\min((a^{2}+b^{2})^{\frac{1}{2}},1)$ |
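The operators in Table 2 translate directly into code. The sketch below is illustrative only: the dictionary names `T_NORMS` and `T_CONORMS` are hypothetical, inputs are assumed to lie in $[0,1]$, and applying the negation $1-a$ to every operator family is an assumption, since the table lists it only once.

```python
import math


def fuzzy_not(a):
    """Negation 1 - a (assumed here to be shared by all four families)."""
    return 1.0 - a


# AND operators (T-norms) for inputs a, b in [0, 1].
T_NORMS = {
    "godel": lambda a, b: min(a, b),
    "product": lambda a, b: a * b,
    "lukasiewicz": lambda a, b: max(a + b - 1.0, 0.0),
    "yager_p2": lambda a, b: max(1.0 - math.sqrt((1.0 - a) ** 2 + (1.0 - b) ** 2), 0.0),
}

# OR operators (T-conorms) for inputs a, b in [0, 1].
T_CONORMS = {
    "godel": lambda a, b: max(a, b),
    "product": lambda a, b: a + b - a * b,
    "lukasiewicz": lambda a, b: min(a + b, 1.0),
    "yager_p2": lambda a, b: min(math.sqrt(a ** 2 + b ** 2), 1.0),
}
```

All four families agree with Boolean logic on the crisp values 0 and 1 and differ only in between. For instance, for $a=0.6$ and $b=0.7$ the AND values are 0.6 (Gödel), 0.42 (Product), 0.3 (Łukasiewicz), and 0.5 (Yager), up to floating-point rounding.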

A.6 Finetuning without training

In this section, we investigate whether it is feasible to perform model fine-tuning without having access to the target dataset, relying solely on the explanations of the latent representations and target class descriptions. In simple terms, the idea is to directly use a latent representation from an ImageNet-trained model as a classifier for a class in another dataset whose meaning is similar to the explanation of the representation.

For this purpose, we utilized four different ImageNet deep learning image classification models, specifically AlexNet [74], DenseNet161 [63], EfficientNet B4 [57], and ViT 16 L [58], all of which were pre-trained on the ImageNet dataset. The feature-extractor layer that precedes the final output logit layer was used in all these models for our experiments. We computed the AUC similarity scores for all representations in each of the feature extractors in relation to ImageNet concepts.

The target dataset employed for this study was the Caltech101 dataset [75], which comprises 101 image classification categories. Specifically, we utilized a subset of this dataset that includes 46 classes, each of which has an exact or very similar equivalent in ImageNet classes.

We aimed to create a model for classifying Caltech classes by selecting suitable latent representations from the feature extractor layers of ImageNet models, and directly linking them to Caltech class logits. For each model, we chose a representation with the highest AUC similarity to the ImageNet concept closest to the Caltech concepts. This resulted in a subset of 46 neurons per model, each neuron having the highest AUC for an ImageNet concept similar to a Caltech concept. These neurons were normalized by the ImageNet validation dataset’s mean and standard deviation, and a Sigmoid activation function was applied to constrain outputs between 0 and 1. Neuron selection was solely based on ImageNet explanations, with no Caltech101 data utilized.

We also hypothesized that individual signals from feature extractor layers could be further enhanced by applying a continuous AND operation with other neurons that share a high AUC towards the concept. Table 3 presents the results of this procedure in terms of the accuracy achieved on the target dataset. For this task, the random accuracy stands at approximately 2%, while the conventional fine-tuning approach, which freezes the feature extractor layer and trains a linear classification layer atop the feature extractors, achieves an accuracy of up to 98.29% (last row of the table). Remarkably, by simply linking the representation with the highest AUC towards the ImageNet concept from the latent layer to the Caltech101 output class logit using our approach ($L=1$), we were able to attain a substantial non-random accuracy, peaking at 69.50% in the case of DenseNet161. Furthermore, by selecting the top $L$ neurons that have the highest AUC towards ImageNet concepts and employing the Gödel AND operator between representations, we observed that this typically improved the results, with the only exception being the AlexNet model, where this strategy slightly reduced the accuracy.
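The selection and prediction steps described above can be sketched as follows. This is a simplified illustration, not the code used for the experiments: the AUC table, the class-to-concept mapping, and the per-neuron normalization statistics are assumed to be precomputed, and the names `select_neurons` and `predict` are hypothetical.

```python
import math


def sigmoid_z(x, mean, std):
    """Normalize an activation with (mean, std), then squash it into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(x - mean) / std))


def select_neurons(auc_table, class_to_concept, L=1):
    """For each target class, pick the L neurons with the highest AUC
    towards the matching source concept.

    auc_table[neuron][concept] holds the AUC similarity of that pair.
    """
    selected = {}
    for cls, concept in class_to_concept.items():
        ranked = sorted(auc_table, key=lambda n: auc_table[n][concept], reverse=True)
        selected[cls] = ranked[:L]
    return selected


def predict(activations, selected, stats):
    """Classify one image from its raw neuron activations.

    The class score is the Goedel AND (minimum) over the selected neurons;
    the prediction is the class with the highest score.
    """
    scores = {
        cls: min(sigmoid_z(activations[n], *stats[n]) for n in neurons)
        for cls, neurons in selected.items()
    }
    return max(scores, key=scores.get)
```

With $L=1$ the minimum runs over a single neuron, so the class logit is simply the normalized activation of the best-matching representation. With larger $L$ a class receives a high score only if all of its selected neurons fire.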

Table 3: A comparison of the accuracy achieved by the proposed finetuning method, which includes finetuning with a single representation (L=1), and multiple representations (L=2,5,10) combined with a fuzzy AND operator, against traditional and random finetuning baselines.
| Method | AlexNet | DenseNet161 | EfficientNet B4 | ViT L 16 |
|---|---|---|---|---|
| Random | 2.21% | 2.21% | 2.21% | 2.21% |
| L = 1 | 43.91% | 62.50% | 39.96% | 47.12% |
| L = 2 | 42.95% | 70.51% | 69.23% | 60.79% |
| L = 5 | 40.17% | 75.00% | 80.88% | 78.31% |
| L = 10 | 30.88% | 69.12% | 86.11% | 79.49% |
| Finetuned | 91.67% | 97.76% | 94.76% | 98.29% |

A.7 Comparison between IoU and AUC metrics

In our supplementary experiments comparing different models, we further investigate the correlation between AUC and IoU. Table 4 reports the performance of our method INVERT (AUC) in comparison to NetDissect and Compexp (IoU) on different models and layers for varying formula lengths (N). Our analysis employs ResNet18 and DenseNet161 PyTorch models trained on the Places365 dataset [71], accessible through the Compositional Explanations of Neurons implementation (https://github.com/jayelm/compexp/tree/master). Following their approach, we apply the methods to the ADE20k subset of the Broden dataset with formula lengths of 1 to 3. The IoU and AUC scores are summarized as the average and standard deviation across all neurons in each selected model layer. From these results, we can observe that the optimal explanations from AUC-based (INVERT) and IoU-based (NetDissect, CompExpl) methods do not necessarily maximize each other's objective functions.

The results in Table 5 reveal a correlation between IoU and AUC scores in non-zero IoU cases across multiple models and layers. The metrics differ in their applications and are only weakly aligned. The correlation scores represent the average and standard deviation of the Pearson and Spearman correlation statistics. For each neuron, correlations were calculated between the IoU and AUC scores of the available concepts. The “Normal” scenario corresponds to the standard case, whereas the “Log” case refers to a logarithmic transformation applied to the IoU values, with an additional epsilon value of 1e-4. Correlations were computed exclusively for concepts that showed non-zero IoU scores. We can observe that for non-zero IoU scores there exists a small positive correlation between IoU and AUC scores.
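The per-neuron correlation computation can be sketched as below. This is an illustrative sketch under assumptions: the natural logarithm is used for the “Log” case, ties receive average ranks in the Spearman statistic, at least two non-zero IoU scores with distinct values are available, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(v):
    """1-based ranks of the values in `v`; ties receive the average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def iou_auc_correlations(iou, auc, eps=1e-4):
    """Correlations for one neuron, restricted to concepts with IoU > 0."""
    pairs = [(i, a) for i, a in zip(iou, auc) if i > 0]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    log_xs = [math.log(i + eps) for i in xs]
    return {
        "pearson": pearson(xs, ys),
        "spearman": spearman(xs, ys),
        "pearson_log": pearson(log_xs, ys),
        "spearman_log": spearman(log_xs, ys),
    }
```

Since the logarithm is strictly increasing, it leaves the ranks of the IoU values unchanged, so the Spearman statistic is the same in the “Normal” and “Log” cases. This matches the identical Spearman columns in Table 5; only the Pearson statistic changes under the transformation.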

To further comprehend the correlation of these metrics we investigate the case where AUC and IoU perform differently. In Figure 11, we present a case where explanations yielding 0 IoU scores are better aligned with the explanation goal. We provide evidence of IoU-based explanations resulting in low neuron activation, while INVERT achieves notable activation even when IoU scores are 0.

Table 4: Comparison of IoU and AUC on different models and layers for varying formula lengths (N). All models are trained on the Places365 dataset, and the explanations were constructed based on the ADE20k subset of the Broden dataset. The table presents the average and standard deviation of the IoU and AUC scores across all neurons in the selected model layer.
N = 1

| Model - Layer | INVERT IoU | INVERT AUC | NetDissect IoU | NetDissect AUC |
|---|---|---|---|---|
| ResNet18 - Layer 4 | 0.0062 ± 0.0123 | 0.8959 ± 0.0691 | 0.0581 ± 0.0318 | 0.8367 ± 0.1155 |
| ResNet18 - Layer 3 | 0.0007 ± 0.0022 | 0.8834 ± 0.0780 | 0.0121 ± 0.0012 | 0.5549 ± 0.1660 |
| DenseNet161 - Features | 0.0016 ± 0.0066 | 0.8928 ± 0.0733 | 0.0364 ± 0.0279 | 0.7448 ± 0.1547 |
| DenseNet161 - Dense Block 4 | 0.0007 ± 0.0017 | 0.9014 ± 0.0655 | 0.0150 ± 0.0034 | 0.6877 ± 0.1582 |

N = 2

| Model - Layer | INVERT IoU | INVERT AUC | CompExp IoU | CompExp AUC |
|---|---|---|---|---|
| ResNet18 - Layer 4 | 0.0021 ± 0.0062 | 0.9972 ± 0.0037 | 0.0756 ± 0.0369 | 0.8310 ± 0.1016 |
| ResNet18 - Layer 3 | 0.0023 ± 0.0042 | 0.9955 ± 0.0077 | 0.0185 ± 0.0014 | 0.5726 ± 0.1332 |
| DenseNet161 - Features | 0.0042 ± 0.0124 | 0.9958 ± 0.0056 | 0.0455 ± 0.0313 | 0.7248 ± 0.1424 |
| DenseNet161 - Dense Block 4 | 0.0029 ± 0.0059 | 0.9961 ± 0.0058 | 0.0222 ± 0.0040 | 0.6930 ± 0.1127 |

N = 3

| Model - Layer | INVERT IoU | INVERT AUC | CompExp IoU | CompExp AUC |
|---|---|---|---|---|
| ResNet18 - Layer 4 | 0.0026 ± 0.0079 | 0.9977 ± 0.0030 | 0.0849 ± 0.0391 | 0.8184 ± 0.0995 |
| ResNet18 - Layer 3 | 0.0021 ± 0.0038 | 0.9966 ± 0.0057 | 0.0235 ± 0.0016 | 0.5714 ± 0.1084 |
| DenseNet161 - Features | 0.0035 ± 0.0104 | 0.9967 ± 0.0046 | 0.0497 ± 0.0330 | 0.7132 ± 0.1356 |
| DenseNet161 - Dense Block 4 | 0.0026 ± 0.0054 | 0.9969 ± 0.0048 | 0.0361 ± 0.0231 | 0.6846 ± 0.1045 |

Table 5: Correlation between IoU and AUC based on the score for each class per neuron. The models were pre-trained using the Places365 dataset and their performance was assessed on the ADE20k subset of the Broden dataset. The table presents the average and standard deviation of the Pearson and Spearman correlation statistics.
| Model - Layer | Pearson (Normal) | Spearman (Normal) | Pearson (Log(+eps)) | Spearman (Log(+eps)) |
|---|---|---|---|---|
| ResNet18 - Layer 4 | 0.3429 ± 0.0682 | 0.3623 ± 0.0945 | 0.4116 ± 0.0885 | 0.3623 ± 0.0945 |
| ResNet18 - Layer 3 | 0.2377 ± 0.0911 | 0.2738 ± 0.1121 | 0.3009 ± 0.1180 | 0.2738 ± 0.1121 |
| DenseNet161 - Features | 0.2681 ± 0.0869 | 0.2787 ± 0.1041 | 0.3156 ± 0.1050 | 0.2787 ± 0.1041 |
| DenseNet161 - Dense Block 4 | 0.2143 ± 0.1039 | 0.2691 ± 0.1626 | 0.2878 ± 0.1573 | 0.2691 ± 0.1626 |


Figure 11: Comparison between INVERT and NetDissect. The figure displays three distributions of activations: one for all datapoints in green, one for datapoints corresponding to the IoU-based explanation in orange, and one for the AUC-based explanation in blue. These distributions pertain to the average activation across activation maps of Neuron 205 in ResNet18, layer 3, pre-trained on the Places365 dataset. The activations were collected across the ADE20k subset of the Broden dataset. The class labeled as “car” resulted from IoU optimization, while the class labeled as “ocean” resulted from AUC optimization. Notably, even though the “ocean” class has an IoU score of 0, it comprises some of the most activating images for the neuron, as evidenced by the top 9 most activated images.


Figure 12: In (a) we compare the distribution of AUC and IoU across all concepts of the ADE20k atomic concepts from the Broden dataset for Neuron 269 from layer 4 of ResNet18 trained on the Places365 dataset, where (b) shows the top 4 activating images of the ADE20k dataset. (c) shows the distribution of maximized IoU, maximized AUC, and random IoU scores for layer 4 of ResNet18 trained on the Places365 dataset with a formula length of 1.

Figure 12 (a) shows a qualitative example of AUC and IoU scores across all concepts of Neuron 269 from layer 4 of ResNet18 trained on the Places365 dataset [71]. Each data point corresponds to one concept among the 1105 ADE20k atomic concepts sourced from the Broden dataset [76, 11]. This example illustrates the dependence between AUC and IoU: high IoU scores are correlated with high AUC scores. In Figure 12 (b) we showcase the top 4 most activating images for Neuron 269 from the ADE20k dataset to align them with the highest scoring concepts in Figure 12 (a). Comparing the set of images with the concepts exhibiting the highest AUC scores (e.g., “throne room”, “apse-indoor”, “fur”), we observe a strong visual alignment. However, when examining the concepts with high IoU scores (e.g., “nursery”, “cradle”, “attic”), we find a relatively low degree of visual similarity. These results demonstrate the limitations of the IoU measure for evaluating explanations.

Furthermore, we conducted a quantitative evaluation shown in Figure 12 (c), specifically focusing on layer 4 of the ResNet18 trained on the Places365 dataset. We compared the distribution of IoU scores of explanations obtained by maximizing IoU and AUC respectively. Additionally, we examined the mean values of these distributions, which included random IoU scores as baseline reference. Our findings reveal that maximizing IoU leads to a relatively sparse distribution of IoU scores while maximizing AUC results in a more densely concentrated accumulation of predominantly low IoU scores. As anticipated, the performance of random IoU scores was notably poor. We can observe that maximizing AUC also indirectly maximizes IoU.

Table 6 serves as a sanity check that compares both metrics on the true explanation and on a random explanation. The FasterRCNN ResNet50 FPN model was pretrained on the MS COCO dataset for object detection, while the UPerNet BEiT-B model was pretrained on ADE20k for semantic segmentation. The former model’s evaluation was conducted on a subset of MS COCO containing 20,000 images, while the latter was assessed on the ADE20k subset of the Broden dataset. The output layers of both models were utilized to access the “ground truth” label for each neuron. In the table, the “True” column represents the IoU/AUC scores of the explanation that aligns with the ground-truth neuron label. On the other hand, the “Random” column corresponds to the scores of randomly chosen explanation-concept pairs that differ from the “ground truth”.

Table 6: Metric comparison for true label and random explanation evaluation. FasterRCNN ResNet50 FPN (pre-trained on MS COCO for object detection). UPerNet BEiT-B (pre-trained on ADE20k for semantic segmentation).
| Metric | FasterRCNN ResNet50 FPN (True) | FasterRCNN ResNet50 FPN (Random) | UPerNet BEiT-B (True) | UPerNet BEiT-B (Random) |
|---|---|---|---|---|
| IoU | 0.8355 ± 0.0466 | 0.0077 ± 0.0046 | 0.8553 ± 0.0913 | 0.0007 ± 0.0007 |
| AUC | 0.9556 ± 0.0371 | 0.5005 ± 0.0253 | 0.8738 ± 0.0929 | 0.5001 ± 0.0164 |

A.8 Figures



Figure 13: The figure displays the WordNet taxonomy, which was used to gather the hierarchical structure of the labels for Figure 3.


Figure 14: The figure illustrates the “safe” circuit within the ResNet18 model. The top part of the figure showcases the three most significant neurons (in terms of the weight of linear connection) and their corresponding INVERT explanation linked to the class logit “safe”. The bottom part of the figure demonstrates how the local explanation from the class logit can be decomposed into individual explanations of individual neurons from the preceding layer. This allows for a more detailed understanding of how each neuron contributes to the final classification.


Figure 15: The figure illustrates the “monitor” circuit within the ResNet18 model. The top part of the figure showcases the four most significant neurons (in terms of the weight of linear connection) and their corresponding INVERT explanation linked to the class logit “monitor”. The bottom part of the figure demonstrates how the local explanation from the class logit can be decomposed into individual explanations of individual neurons from the preceding layer.


Figure 16: The figure presents two distinct handcrafted circuits. For each neuron, or combination of neurons, we report the Area Under the Receiver Operating Characteristic (AUROC) score. This score represents the AUC classification performance towards classifying specific concepts from the Places365 dataset.