Making deep neural networks right for the right scientific reasons by interacting with their explanations

Patrick Schramowski${}^{*}$, Wolfgang Stammer${}^{*}$, Stefano Teso, Anna Brugger, Franziska Herbert, Xiaoting Shao, Hans-Georg Luigs, Anne-Katrin Mahlein & Kristian Kersting

Abstract

Deep neural networks have shown excellent performance in many real-world applications. Unfortunately, they may show “Clever Hans”-like behavior, making use of confounding factors within datasets, to achieve high performance. In this work, we introduce the novel learning setting of “explanatory interactive learning” (XIL) and illustrate its benefits on a plant phenotyping research task. XIL adds the scientist into the training loop such that she interactively revises the original model by providing feedback on its explanations. Our experimental results demonstrate that XIL can help avoid Clever Hans moments in machine learning and encourages (or discourages, if appropriate) trust in the underlying model.

Imagine a plant phenotyping team attempting to characterize crop resistance to plant pathogens. The plant physiologist records a large amount of hyperspectral imaging data. Impressed by the results of deep learning in other scientific areas, she wants to establish similar results for phenotyping. Consequently, she asks a machine learning expert to apply deep learning to analyze the data. Luckily, the resulting predictive accuracy is very high. The plant physiologist, however, remains skeptical. The results are “too good to be true”. Checking the decision process of the deep model using explainable artificial intelligence (AI), the machine learning expert is flabbergasted to find that the learned deep model uses clues within the data that do not relate to the biological problem at hand, so-called confounding factors. The physiologist loses trust in AI and turns away from it, proclaiming it to be useless.

${}^{\dagger}$ Preprint. Work in progress.
${}^{*}$ Equal contribution.

This example encapsulates an important issue of current explainable AI [1, 2]. Indeed, the seminal paper of Lapuschkin et al. [3] helps in “unmasking Clever Hans predictors and assessing what machines really learn”. However, rather than proclaiming, as the plant physiologist might, that the machines have learned the right predictions for the wrong reasons and therefore cannot be trusted, we here showcase that interactions between the learning system and the human user can correct the model so that it makes the right predictions for the right reasons [4]. This may also increase the trust in machine learning models. Indeed, trust lies at the foundation of major theories of interpersonal relationships in psychology [5, 6], and we argue that interaction and understandability are central to trust in learning machines. Surprisingly, the link between interacting, explaining and building trust has been largely ignored by the machine learning literature. Existing approaches focus on passive learning only and do not consider the interaction between the user and the learner [7, 8, 9], whereas interactive learning frameworks such as active [10] and coactive learning [11] do not consider the issue of trust. In active learning, for instance, the model presents unlabeled instances to a user and in exchange obtains their labels. This is completely opaque: the user is oblivious to the model’s beliefs and reasons for predictions and to how they change over time, and cannot see the consequences of her instructions. In coactive learning, the user sees and corrects the system’s prediction, if necessary, but the predictions are not explained to her. So, why should users trust models learned interactively?

Furthermore, although an increasing amount of research investigates methods for explaining machine learning models, even here the notion of interaction has been largely ignored. Reconsider the study by Lapuschkin et al. [3]. They showed that one can find “Clever Hans”-like behavior in popular computer vision models that base their decisions on confounding factors. Based on these findings, the authors urged caution regarding the interest in such models, but they did not offer a solution for correcting their behavior. Particularly in real-world applications, where monitoring for every possible confounding factor or acquiring a new dataset because of existing confounders is time and resource consuming, it is essential to move beyond revealing the (wrong) reasons and take a step towards correcting the reasons underlying a model’s decisions.

Figure 1: Explanatory Interactive Learning (XIL): Human users revise learning machines towards trustworthy decision strategies. The cartoons sketch the main underlying idea for each case (row). (a-left) Data samples, expert classifications (checks and Xs with colors indicating the class) and explanations (overlaid with an edge-filtered original image for better interpretability) that an expert expects of an ML model. Yellow corresponds to relevant regions, blue to irrelevant regions for a classification. Not even an expert can be certain about potential samples from an early disease stage and what a valid explanation should be. (a-middle) Illustration of hyperspectral data consisting of spatial and spectral dimensions. The planes on the top and left sides of the cube correspond to slices taken from the center of the cube but placed on the edges for visualization. (a-right) The characteristic reflectance of healthy tissue vs. disease spots. The vertical red, green and blue lines depict the three wavelengths of the RGB dataset. (b,c) Classifications of a deep neural network (b) and its explanations (c). The learned model clearly uses confounding factors, identified as the embedding agar solution, to explain its decision. (d) The human user provides feedback on the reasons. In turn, the machine gets new information and can continue learning. The human-revised deep network yields classifications matching biologically plausible strategies. (All shown RGB images correspond to real RGB images, while the edge overlays resulted from pseudo-RGB images generated from the original hyperspectral dataset, cf. Methods RGB/HS classification.)

Doing so is exactly the main technical contribution of the present study. We introduce the novel learning setting of “explanatory interactive learning” (XIL) and illustrate its benefits in an important scientific endeavor, namely, plant phenotyping. Starting from a learning system that does not deliver biologically plausible explanations for a relevant, real-world task in plant phenotyping, we add the scientist into the training loop, who interactively revises the original model via its explanations so that it produces trustworthy decisions without a major drop in performance. Specifically, XIL takes the form illustrated in Fig. 1. In each step, the learner explains its interactive query to the domain expert, and she responds by correcting the explanations, if necessary, to provide feedback. This allows the user not only to check whether the model is right or wrong on the chosen instance but also whether the answer is right (or wrong) for the wrong reasons, e.g., when there are ambiguities in the data such as confounders [4]. By witnessing the evolution of the explanations, similar to a teacher supervising the progress of a student, the human user can see whether the model eventually “gets it”. The user may even correct the explanation presented to guide the learner. This correction step is crucial for more directly affecting the learner’s beliefs and is integral to modulating trust [6, 12].

Specifically, we make the following contributions: (i) Introduction of XIL with counterexamples (CE) to revise “Clever Hans” behavior in a model-agnostic fashion. (ii) Adaptation of the “right for the right reasons” (RRR) loss to latent layers of deep neural networks. (iii) Showcasing XIL on the computer vision benchmark datasets PASCAL VOC 2007 [13] and MSCOCO 2014 [14]. (iv) Evaluation of XIL on a highly relevant dataset for plant phenotyping, demonstrating its potential as an enabler of scientific discovery. (v) Gathering of the plant phenotyping dataset and the creation of a version with confounders. (vi) A user study on trust development within XIL [15].

A preliminary version of this manuscript has been published as a conference paper [16]. The present paper significantly extends the conference version by (ii-v). Moreover, the ad-hoc XIL user study presented in [16] was completely re-designed, newly conducted, and now includes a thorough statistical analysis (vi). To encourage further research, we provide the created plant phenotyping dataset.

We proceed as follows. We start by formally introducing Explanatory Interactive Machine Learning (XIL) and instantiate it in the caipi method [16] as well as the rrr method [4]. After introducing XIL, we discuss quantitative results on test datasets, before providing details on how domain experts can revise learning machines and in turn enable the machines to correct their abilities to solve the scientific real-world task of plant disease prediction. Finally, we demonstrate the importance of explaining decisions for building trustful machines via a user study. Our contributions thus address a main part of building trustworthy AI methods by providing an end-to-end, interactive method to evaluate and revise machine learning models. This provides an important add-on to Rudin’s [17] message “Stop explaining black-box machine learning models for high stakes decisions and use interpretable models instead”, namely “Start interacting with explanations of machine learning models to avoid ‘Clever Hans’-like behavior.”

Explanatory Interactive Machine Learning (XIL)

In XIL, a learner can interactively query the user (or some other information source) to obtain the desired outputs of the data points. The interaction takes the following form. At each step, the learner considers a data point (labeled or unlabeled), predicts a label, and provides explanations of its prediction. The user responds by correcting the learner if necessary, providing slightly improved, but not necessarily optimal, feedback to the learner.

Let us now instantiate this schema as explanatory active learning, which combines active learning with local explainers (cf. Methods). Other interactive learning settings can be made explanatory too, including coactive learning [11], active imitation learning [18], and mixed-initiative interactive learning [19], but this is beyond the scope of this paper.

Explanatory Active Learning.

In Explanatory Active Learning, we require black-box access to an active learner and an explainer. We assume that the active learner provides a procedure $\textsc{SelectQuery}(f,\mathcal{U})$ for selecting an informative instance $x\in\mathcal{U}$ based on the current model $f$, and a procedure $\textsc{Fit}(\mathcal{L})$ for fitting a new model (or updating the current model) on the examples in $\mathcal{L}$. The explainer is assumed to provide a procedure $\textsc{Explain}(f,x,\hat{y})$ for explaining a particular prediction $\hat{y}=f(x)$. The framework is intended to work for any reasonable learner and explainer.

When using lime for computing an interpretable model locally around the queries to visualize explanations for current predictions, this results in caipi as summarized in Alg. 1.

1: $f\leftarrow\textsc{Fit}(\mathcal{L})$
2: repeat
3:   $x\leftarrow\textsc{SelectQuery}(f,\mathcal{U})$
4:   $\hat{y}\leftarrow f(x)$
5:   $\hat{z}\leftarrow\textsc{Explain}(f,x,\hat{y})$
6:   Present $x$, $\hat{y}$, and $\hat{z}$ to the user
7:   Obtain $\bar{y}$ and explanation correction $\mathcal{C}$
8:   if caipi:
9:     $\{(\bar{x}_{i},\bar{y})\}_{i=1}^{c}\leftarrow\textsc{ToCounterExamples}(\mathcal{C})$
10:     $\mathcal{L}\leftarrow\mathcal{L}\cup\{(x,\bar{y})\}\cup\{(\bar{x}_{i},\bar{y})\}_{i=1}^{c}$
11:   else if rrr:
12:     $\{(x,\bar{y},A)\}\leftarrow\textsc{ToBinaryCorrectionMask}(\mathcal{C})$
13:     $\mathcal{L}\leftarrow\mathcal{L}\cup\{(x,\bar{y},A)\}$
14:   $\mathcal{U}\leftarrow\mathcal{U}\setminus\{x\}$
15:   $f\leftarrow\textsc{Fit}(\mathcal{L})$
16: until budget $T$ is exhausted or $f$ is good enough
17: return $f$

Algorithm 1: caipi takes as input a set of labeled examples $\mathcal{L}$, a set of unlabeled instances $\mathcal{U}$, and an iteration budget $T$.

At each iteration $t=1,\ldots,T$ an instance $x\in\mathcal{U}$ is chosen using the query selection strategy implemented by the SelectQuery procedure. Then its label $\hat{y}$ is predicted using the current model $f$ , and Explain is used to produce an explanation $\hat{z}$ of the prediction. The triple $(x,\hat{y},\hat{z})$ is presented to the user as a (visual) artifact. The user checks the prediction and the explanation for correctness and provides the required feedback. Upon receiving the feedback, the system updates $\mathcal{U}$ and $\mathcal{L}$ accordingly and re-fits the model. The loop terminates when the iteration budget $T$ is reached or the model is good enough.
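The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the counterexample branch of Alg. 1 only, not the implementation used in the experiments. The procedures `fit`, `select_query`, `explain`, `ask_user` and `to_counterexamples` are hypothetical stand-ins, and the toy data (a binary signal component plus a perfectly correlated decoy component) is assumed purely for illustration.

```python
def xil_loop(labeled, unlabeled, fit, select_query, explain, ask_user,
             to_counterexamples, budget):
    """Counterexample branch of Alg. 1 with caller-supplied procedures."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = fit(labeled)
    for _ in range(budget):
        if not unlabeled:
            break
        x = select_query(model, unlabeled)
        y_hat = model(x)
        z_hat = explain(model, x, y_hat)
        y_bar, correction = ask_user(x, y_hat, z_hat)
        labeled.append((x, y_bar))
        labeled.extend(to_counterexamples(x, y_bar, correction))
        unlabeled.remove(x)
        model = fit(labeled)
    return model


# Hypothetical toy instantiation: x = (signal, decoy), true label = signal.
def fit(labeled):
    """Pick the single component that agrees most often with the labels.
    Ties go to the higher index, so the decoy wins on confounded data."""
    def agreement(j):
        return sum(x[j] == y for x, y in labeled)
    best = max(range(len(labeled[0][0])), key=lambda j: (agreement(j), j))

    def model(x):
        return x[best]
    model.used_feature = best
    return model


def select_query(model, unlabeled):
    return unlabeled[0]  # stand-in for an informativeness-based strategy


def explain(model, x, y_hat):
    return {model.used_feature}  # components the model relied on


def ask_user(x, y_hat, z_hat):
    # Simulated expert: knows the label and that only component 0 is relevant.
    return x[0], {j for j in z_hat if j != 0}


def to_counterexamples(x, y_bar, correction):
    # One counterexample per flagged component: flip it, keep the label.
    return [(x[:j] + (1 - x[j],) + x[j + 1:], y_bar) for j in sorted(correction)]


LABELED = [((0, 0), 0), ((1, 1), 1)]  # decoy perfectly correlated with label
UNLABELED = [(1, 1), (0, 0)]

initial = fit(LABELED)
revised = xil_loop(LABELED, UNLABELED, fit, select_query, explain, ask_user,
                   to_counterexamples, budget=2)
print(initial.used_feature, revised.used_feature)
```

In this toy setting the initial model relies on the decoy component, and a single corrected query is enough for the refitted model to switch to the signal component.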

During interactions between the system and the user, three cases can occur: (1) Right for the right reasons: The prediction and the explanation are both correct. No feedback is requested. (2) Wrong for the wrong reasons: The prediction is wrong. As in active learning, we ask the user to provide the correct label. While the explanation may provide some signal as to why the prediction was wrong, we currently do not require the user to act on it—this is an interesting avenue for future work—but treat the explanation to be simply wrong. (3) Right for the wrong reasons: The prediction is correct but the explanation is wrong—the main target of XIL.

Model-agnostic XIL using counterexamples (CE).

The “right for the wrong reasons” case is novel in active learning, and we propose explanation corrections to deal with it. They can assume different meanings depending on whether the focus is on component relevance, polarity, or relative importance (ranking), among others. In our experiments we ask the annotator to indicate the components that have been wrongly identified by the explanation as relevant, that is,

$$\mathcal{C}=\{j:|w_{j}|>0\land\text{the user believes the $j$th component to be irrelevant}\}\ .$$

Given the correction $\mathcal{C}$ , we are faced with the problem of explaining it back to the learner. We propose a simple strategy to achieve this. This strategy is embodied by ToCounterExamples. It converts $\mathcal{C}$ to a set of counterexamples that teach the learner not to depend on the irrelevant components. In particular, for every $j\in\mathcal{C}$ we generate $c$ examples $(\bar{x}_{1},\bar{y}_{1}),\ldots,(\bar{x}_{c},\bar{y}_{c})$ , where $c$ is an application-specific constant. Here, the labels $\bar{y}_{i}$ are identical to the prediction $\hat{y}$ . The instances $\bar{x}_{i}$ , $i=1,\ldots,c$ are also identical to the query $x$ , except that the $j$ th component (i.e. $\psi_{j}(x)$ ) has been either randomized, changed to an alternative value, or substituted with the value of the $j$ th component appearing in other training examples of the same class. This counterexample strategy (ce) produces $c\cdot|\mathcal{C}|$ counterexamples, which are added to $\mathcal{L}$ , as summarized in Alg. 1. Importantly, this method is model-agnostic and can be used also when applying a non-differentiable model.
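As an illustration of the counterexample strategy, the following sketch implements one of the three options named above: substituting the flagged component with the value it takes in other examples of the same class. The function signature, the `same_class_pool` argument and the example values are assumptions made for this sketch.

```python
import random


def to_counterexamples(x, y_hat, correction, c, same_class_pool, rng):
    """For every j in `correction`, emit c copies of x whose j-th component
    is replaced by the j-th component of a randomly drawn same-class example.
    All copies keep the label y_hat."""
    counterexamples = []
    for j in sorted(correction):
        for _ in range(c):
            donor = rng.choice(same_class_pool)
            x_bar = list(x)
            x_bar[j] = donor[j]
            counterexamples.append((tuple(x_bar), y_hat))
    return counterexamples


# Hypothetical query with three components; the user flags components 1 and 2.
x = (0.9, 0.1, 0.5)
pool = [(0.8, 0.7, 0.2), (0.95, 0.3, 0.9)]
ces = to_counterexamples(x, y_hat=1, correction={1, 2}, c=3,
                         same_class_pool=pool, rng=random.Random(0))
print(len(ces))  # c * |C| counterexamples
```

Randomizing the component or setting it to a fixed alternative value would only change the line that assigns `x_bar[j]`.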

XIL using gradients.

If the model is differentiable, the learner can also be regularized to be right for the right reasons using the “Right for the Right Reasons” loss (rrr) introduced by Ross et al. [4]. Here one adds a penalty to gradients that lie outside of a binary mask that indicates which features of the input are relevant. We modified the original loss function to:

$$L(\theta,X,y,A)=\underbrace{\sum_{n=1}^{N}\sum_{k=1}^{K}-c_{k}y_{nk}\log(\hat{y}_{nk})}_{\text{Right answers}}+\underbrace{\lambda_{1}\sum_{n=1}^{N}\sum_{d=1}^{D}\left(A_{nd}\frac{\partial}{\partial h_{nd}}\sum_{k=1}^{K}c_{k}\log(\hat{y}_{nk})\right)^{2}}_{\text{Right reasons}}+\underbrace{\lambda_{2}\sum_{i}\theta^{2}_{i}}_{\text{Weight regularization}}\ ,\qquad(1)$$

where $\theta$ describes the parameters of the network, $X$ the input, $y$ the ground truth and $A$ the binary mask used in the regularization term that discourages the gradient from being large in regions marked by $A$. Instead of regularizing the gradients with respect to $X$, as originally described in [4], we regularize the gradients with respect to the final convolutional layer $h$, corresponding to Gradient-weighted Class Activation Maps (grad-Cam) ([20], cf. Methods). Further, $c_{k}$ is a rescaling weight given to each class of the unbalanced dataset and $\hat{y}$ corresponds to the network prediction. The objective function is split into three terms. The first and the last are the familiar cross-entropy and weight ($\theta$) regularization terms. The second term is the new regularization term. The $\lambda$ values are used to weight the different regularizations. Ross et al. [4] state that the regularization parameter $\lambda_{1}$ should be set such that the “right answers” and “right reasons” terms have similar orders of magnitude.
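To make the three terms of Eq. 1 concrete, here is a small numerical sketch. It rests on several assumptions that are not part of the method itself: the part of the network above the latent layer $h$ is replaced by a hypothetical linear softmax head with weights `W` (so $\theta$ is just `W`), and the gradient with respect to $h$ is approximated by central finite differences rather than automatic differentiation.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(h, W):
    """Hypothetical linear head on latent features h: y_hat = softmax(W h)."""
    return softmax([sum(w * v for w, v in zip(row, h)) for row in W])


def weighted_log_likelihood(h, W, c):
    """sum_k c_k * log(y_hat_k), the quantity differentiated in the second term."""
    return sum(c_k * math.log(p) for c_k, p in zip(c, predict(h, W)))


def rrr_terms(H, Y, A, W, c, eps=1e-5):
    """Return the three unweighted terms of Eq. 1 for a batch."""
    right_answers = right_reasons = 0.0
    for h, y, a in zip(H, Y, A):
        probs = predict(h, W)
        right_answers += sum(-c_k * y_k * math.log(p)
                             for c_k, y_k, p in zip(c, y, probs))
        for d, a_d in enumerate(a):
            if a_d == 0:
                continue  # component not marked by the mask: no penalty
            h_plus = h[:d] + [h[d] + eps] + h[d + 1:]
            h_minus = h[:d] + [h[d] - eps] + h[d + 1:]
            grad_d = (weighted_log_likelihood(h_plus, W, c)
                      - weighted_log_likelihood(h_minus, W, c)) / (2 * eps)
            right_reasons += (a_d * grad_d) ** 2
    weight_reg = sum(w * w for row in W for w in row)
    return right_answers, right_reasons, weight_reg


def rrr_loss(H, Y, A, W, c, lam1, lam2):
    answers, reasons, reg = rrr_terms(H, Y, A, W, c)
    return answers + lam1 * reasons + lam2 * reg


# One sample, two latent components, two classes (all values hypothetical).
W = [[1.0, 2.0], [0.0, -1.0]]
H = [[0.5, 0.5]]
Y = [[1, 0]]
c = [1.0, 1.0]
loss_unmasked = rrr_loss(H, Y, [[0, 0]], W, c, lam1=1.0, lam2=0.0)
loss_masked = rrr_loss(H, Y, [[0, 1]], W, c, lam1=1.0, lam2=0.0)
print(loss_unmasked, loss_masked)
```

With an all-zero mask the loss reduces to the weighted cross-entropy plus weight regularization. Marking the second latent component in $A$ adds a penalty because the prediction depends on it.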

rrr can easily be incorporated into XIL (see again Alg. 1), and, as demonstrated by Selvaraju et al.’s hint approach [21], Eq. 1 can be extended if the user is confident about what a (visual) explanation should look like. See Methods for an extended discussion of related work.

Demonstrating XIL on Computer Vision datasets.

We begin by considering simulated users—as it is common for active learning—to evaluate the contribution of explanation feedback. Indeed, counterexample strategies (e.g. caipi) can trivially accommodate more advanced models than the one employed here. We simulate a human annotator that provides correct labels. Explanation corrections are also assumed to be correct and complete (i.e. they identify all false-positive components), for simplicity.

(a) Fashion-MNIST (Toy) Dataset
	no corr.	ce ($c=1$)	ce ($c=3$)	ce ($c=5$)	rrr (IG)
Train	97%	93%	92%	92%	89%
Test	48%	82%	85%	85%	85%

(b) Scientific Dataset
	no corr.	rrr (grad-Cam)
RGB	89%	87%*
HS	99%	95%

(c) HS Scientific Dataset, non-confounded test set
per-channel average	no corr.	rrr (grad-Cam)
non-tissue	81%	87%
full image	50%	82%

Table 1: Explanatory feedback can boost trust and performance. Highest performances are bold. (a) Accuracy on the Fashion-MNIST dataset of an MLP without corrections (no corr.), with our counterexample strategy (ce) using varying $c$ (middle), and rrr with input gradient (IG) constraints [4]. (b) The average model balanced accuracy of applying rrr with grad-Cam over five cross-validation runs. With “*” we denote situations where decisions made based on the background could not be fully removed. (c) The average model balanced accuracy over five cross-validation runs on a non-confounded test set of the hyperspectral (HS) scientific data. The confounding background features were set to either the per-channel average of the non-tissue regions or the full image of the training samples. The accuracies are reported for HS-CNN.

Specifically, we applied our data augmentation strategy to a decoy variant of Fashion-MNIST [16], based on [22] (cf. Methods). The average test set accuracy of a multilayer perceptron (with the same hyperparameters as in [4]) is reported in Tab. 1(a) for three correction strategies: no corrections, our ce, and the input-gradient constraints (rrr). The models’ explanations for ce are computed with lime. Additionally, for every training image, we added $c=1,3,5$ counterexamples where the decoy pixels are randomized. When no corrections are given, the accuracy on the test set is $48\%$: the confounders completely fool the network, cf. Tab. 1(a). Providing even a single counterexample increases the accuracy to $82\%$, i.e., the effect of confounders drops drastically. With more counterexamples, the accuracy of ce is similar to that of rrr. Both methods yield clear improvements, thus showing that explanatory interactive learning (XIL) is an effective means for correcting “Clever Hans” moments in machine learning and may even improve predictive performance and beliefs.

Furthermore, we conducted experiments on the PASCAL VOC 2007 [13] dataset. We focused on a five-class subset (cf. Methods) and revised the model using XIL with the rrr loss. Fig. 2 presents some example images and their explanations with and without user feedback, i.e. default (test accuracy: $78\%$) and XIL trained (test accuracy: $73\%$). One can see that, without user feedback, the classifier has learned the confounding factor for horse images (the source tag in the bottom left corner). After retraining the classifier using user feedback on the location of the source tag, we can see that the model no longer focuses on the confounder, demonstrating the benefit and effectiveness of XIL also in this setting. Similar benefits can be observed on MSCOCO using hint-like extensions. They may help to more quickly align human and gradient-based network explanations, as shown in the supplement.

Deep plant phenotyping: High predictive performance.

Next, we showcase the extent, importance, and usability of XIL. To this end, we performed classification and revised the learned models on a real-world, scientific dataset. This dataset consists of RGB and hyperspectral (HS) images (cf. Methods) of leaf tissue from inoculated (Cercospora beticola) and healthy sugar beet plants. Notably, there is a strong variability in disease severity over all samples, with some samples clearly showing the characteristics of Cercospora Leaf Spot (CLS) (two rightmost samples in Fig. 1) while others do not (second to the left sample in Fig. 1) and appear to the human eye indistinguishable, at least in RGB, from healthy leaves (top sample in Fig. 1). Roughly $50\%$ of inoculated tissue samples showed visible CLS.

We performed classification using convolutional neural networks (CNNs) on the RGB and HS datasets (cf. Methods). The task was to classify the leaf samples into one of the two classes: healthy or diseased. The corresponding average balanced accuracies determined over 5 cross-validation runs are shown in the left column (no corr.) of Tab. 1(b). They show high accuracies of 88% on the RGB dataset and nearly perfect performance of 99% on the HS dataset. It seems the HS data contains more information relevant to the classification task.
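For readers unfamiliar with the metric, the sketch below computes balanced accuracy under its usual definition, the unweighted mean of per-class recall. The labels are invented for illustration, and the exact evaluation code used for Tab. 1 is not reproduced here.

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(y_pred[i] == cls for i in indices)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical unbalanced test fold: 8 healthy samples, 2 diseased samples.
y_true = ["healthy"] * 8 + ["diseased"] * 2
y_pred = ["healthy"] * 10  # a classifier that always predicts the majority class

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

On unbalanced data a majority-class predictor can reach high plain accuracy, while its balanced accuracy stays at chance level for two classes.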

Be careful! The deep network might be right for the wrong reasons.

The nearly perfect predictive performance is rather suspicious since plant phenotyping is a difficult task. Therefore, we wanted to know the reasons for the predictions and visualized the explanations of the networks using grad-Cams. Specifically, we applied a spectral clustering and t-SNE [23] analysis, similar to [3], on the resulting explanations. Fig. 4(a) shows the strategies of the CNN trained on the HS data for data points belonging to the test set only. Fig. S.1(a) shows the strategies of the CNN trained with RGB data. One can see that the HS-CNN has altogether two prediction strategies, one for each predicted class label (cf. supplement for more details). In the case of control samples, the HS-CNN focuses on large areas of the tissue. For inoculated samples, however, even if CLS are visible, the network focuses on the nutritional solution (agar) to classify these as inoculated. Moreover, when analyzing the reflectance of the agar across different stages of disease development, we could indeed identify differences between the control and inoculated nutritional solution. This can be seen in the left panel of Fig. 3. Given the much smaller data dimensionality of the RGB images compared to the HS data, it seems likely that the RGB-CNN would have more difficulty focusing only on the agar as a classification feature, thus explaining the different classification strategies of the HS-CNN and RGB-CNN as well as the reduced classification performance of the RGB-CNN compared to the HS-CNN.

In any case, both CNNs showed high to very high performance by largely using confounding factors within the dataset. The trained neural networks used strategies that a biologist would consider cheating rather than valid problem-solving behavior. The accuracies may not correspond to the true performance when measured in an environment outside of the lab setting, possibly even leading to dangerous consequences if left unaddressed.

Revising the model to be right for the right reasons.

It is too simple to say that we cannot trust these models and even question whether machines are truly “intelligent”. We now show that with the human in the loop revising the machine, as in the XIL setting, the models can recover from the observed “Clever Hans”-like strategies towards trustworthy ones.

To this end, we let a plant physiologist revise the machine by constraining the machine’s explanations to match her domain knowledge. Since the models used are differentiable, we focused on using rrr rather than the CE strategy, though both would be valid here within the XIL framework. Specifically, we simulated the interaction between the domain expert and the ML models. After a model was trained without any interactions, the plant physiologist analyzed the provided predictions and corresponding explanations. She decided that it is always a wrong reason to focus on the background, and consequently her annotations corresponded to binary masks of the whole tissue (cf. Methods).

As before, we analyzed the decision strategies of the rrr trained model using t-SNE and spectral clustering. The results are summarized in Fig. 4(b) for the HS-CNN and Fig. S.1(b) for the RGB-CNN. As one can see, after training the HS-CNN with rrr, the model focuses on image regions lying only on the tissue, regardless of the underlying class. The strategies of control samples correspond to nearly full activation of the whole tissue, whereas for inoculated samples the identified relevant image regions are often very specific spots. In particular, the model now focuses on the CLS, which were previously essentially ignored. Fig. 1(d) shows in more detail several examples of the observed strategies used by the corrected HS-CNN in comparison to the observed “Clever Hans” strategies of the unrevised machine. Although the model’s performance slightly decreased, cf. Tab. 1(b), it is still able to classify samples without visible symptoms. Even after exploring different hyper-parameters for rrr, we were not able to force the RGB-CNN to fully ignore the background, as illustrated in Fig. S.1(b). As shown in Fig. 3(a), the HS-CNN has much more information at hand to focus on the confounding factors in the first place. However, after revision with rrr, it is easier for the HS-CNN than for the RGB-CNN to make accurate predictions based on the reflectance of the tissue (Fig. 3(b)). In particular, the HS-CNN mainly uses a spectral range for prediction that lies beyond the RGB range. This explains the difficulty of correcting the RGB-CNN.

We now focus on evaluating the default and revised models on a non-confounded test dataset to investigate the generalization improvement of training with XIL. Since no non-confounded test set exists for the scientific dataset, we performed the simple trick of replacing the confounding features of all test samples with other values (cf. Methods). The results are summarized in Tab. 1(c), reporting the average test accuracy over five cross-validation runs. One can see that the accuracy of the revised model is indeed higher than that of the default model for both modifications. These results further indicate the generalization improvements due to XIL. Further experiments applying prior knowledge can be found in the supplement.
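The replacement trick can be sketched as follows. The exact procedure is given in the Methods. This sketch assumes, based on the caption of Tab. 1(c), that the background (non-tissue) pixels of each test image are overwritten with a per-channel average computed from the training samples, either over their non-tissue regions or over their full images. Images are nested lists indexed as `[row][column][channel]`, and all names and values are hypothetical.

```python
def per_channel_mean(images, tissue_masks, non_tissue_only):
    """Per-channel average over training images, optionally restricted
    to non-tissue pixels."""
    n_channels = len(images[0][0][0])
    sums = [0.0] * n_channels
    count = 0
    for image, mask in zip(images, tissue_masks):
        for row, mask_row in zip(image, mask):
            for pixel, is_tissue in zip(row, mask_row):
                if non_tissue_only and is_tissue:
                    continue
                for ch, value in enumerate(pixel):
                    sums[ch] += value
                count += 1
    return [s / count for s in sums]


def replace_background(image, tissue_mask, fill):
    """Keep tissue pixels, overwrite every non-tissue pixel with `fill`."""
    return [[list(pixel) if is_tissue else list(fill)
             for pixel, is_tissue in zip(row, mask_row)]
            for row, mask_row in zip(image, tissue_mask)]


# Tiny hypothetical example: 1x2 images with two channels.
train_images = [[[[1.0, 2.0], [3.0, 4.0]]]]
train_masks = [[[True, False]]]
test_image = [[[5.0, 5.0], [9.0, 9.0]]]
test_mask = [[True, False]]

fill_non_tissue = per_channel_mean(train_images, train_masks, non_tissue_only=True)
fill_full_image = per_channel_mean(train_images, train_masks, non_tissue_only=False)
deconfounded = replace_background(test_image, test_mask, fill_non_tissue)
print(fill_non_tissue, fill_full_image, deconfounded)
```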

Trust development during XIL.

After demonstrating that explanations and especially XIL are necessary to reveal and correct so-called “Clever Hans” behavior of ML models, we finally investigate how explanations influence the trust of users in the learning process. To this end, we designed a questionnaire about a machine that learns a simple concept by querying labels (but not explanation corrections) from an annotator. The online questionnaire was administered to 106 participants of varying ages and backgrounds.

Specifically, we designed a toy binary classification problem of $3\times 3$ black-and-white images, inspired by the color dataset used in [4]. The subjects were told that an image is positive if the two top corners are white and negative otherwise. They were shown 30 images together with the classification of an AI model and a knowledgeable annotator. The learning of the model was simulated by increasing the model’s classification accuracy from $50\%$ via $70\%$ to $100\%$, with a change after every 10 images.
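A setup of this kind can be sketched as follows. The encoding (1 for white, 0 for black), the random image generation, and the way wrong predictions are placed within each block of 10 images are assumptions of this sketch and do not reproduce the actual study material.

```python
import random


def is_positive(image):
    """Ground-truth rule: both top corners are white (encoded as 1)."""
    return image[0][0] == 1 and image[0][2] == 1


def simulate_predictions(images, accuracies=(0.5, 0.7, 1.0), block=10, seed=0):
    """Simulate a learner whose accuracy is fixed within each block of images."""
    rng = random.Random(seed)
    predictions = []
    for b, accuracy in enumerate(accuracies):
        chunk = images[b * block:(b + 1) * block]
        n_correct = round(accuracy * len(chunk))
        flags = [True] * n_correct + [False] * (len(chunk) - n_correct)
        rng.shuffle(flags)  # position of the mistakes inside the block
        for image, correct in zip(chunk, flags):
            truth = is_positive(image)
            predictions.append(truth if correct else not truth)
    return predictions


rng = random.Random(1)
images = [[[rng.randint(0, 1) for _ in range(3)] for _ in range(3)]
          for _ in range(30)]
predictions = simulate_predictions(images)

block_accuracy = [
    sum(predictions[i] == is_positive(images[i])
        for i in range(start, start + 10)) / 10
    for start in (0, 10, 20)
]
print(block_accuracy)
```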

Each participant was randomly assigned to one of three experimental conditions with varying feedback from the model. In test condition 1 (TC1), the participant received feedback for each image in the form of the model’s prediction and the label provided by a knowledgeable annotator. No explanations were shown. Test conditions 2 and 3 (TC2, TC3) were identical to TC1, meaning that at every stage the same example, prediction and feedback label were shown, but now explanations were also provided. The explanations highlighted the two most relevant pixels in the form of red dots. In TC2, the explanations converged to the correct rule (they highlight the two top corners) from the 6th image onwards. In TC3, the explanations converged to an incorrect rule (an image was classified as positive if the two top right pixels were white) from the 12th image onwards. To assess the participant’s trust in the model’s skills we used the Trust in Automation Questionnaire (TiA) [24]. After each learning process stage, the subjects were asked to rate (Q1) “I trust that the AI has learned the correct rule for classifying such images.” Lastly, having seen all images, subjects were asked to answer the full TiA.

Fig. 5 summarizes the results, where (a) shows the total TiA score over TC1-TC3 and (b-d) the Q1 results for each test condition over the different stages of the learning process. They confirm previous findings: without explanations, people trust highly accurate machines, but the trust drops when wrong behavior is witnessed [6]. Users expect machines and their explanations to be correct. Indeed, explanations may increase the trust in earlier iterations at lower predictive performances, if they are correct. But people do not forgive wrong explanations if the predictions are correct. Thus, users really care about the “right for the wrong reasons” case.

Taking all our empirical results together, people care about “Clever Hans”-like moments in machine learning, XIL can eliminate them, and XIL may even improve the model’s predictive performance.

Conclusion

In recent years, AI methods, especially machine learning with various directions and algorithms [25, 26], have become more and more successful in a wide range of areas like computer vision, natural language processing, and robotics, among others. Consider e.g. AlphaZero surpassing human-level performance in playing chess and Go. During its self-play training process, AlphaZero discovered a remarkable level of Go knowledge. This included not only fundamental elements of human Go knowledge, but also non-standard strategies beyond the scope of traditional human Go knowledge [27], exemplifying the potential of these methods to discover strategies previously unknown even to experts of the domain. However, studies from various applications such as [28, 29, 30, 3] have revealed that learning machines can also result in “Clever Hans”-like moments, i.e., human-undesired strategies where the machine exploits artifacts in the dataset.

To “un-Hans” machines, we introduced the novel learning setting of “explanatory interactive learning” (XIL) and illustrated its benefits. XIL adds the scientist into the training loop. She interactively revises the original model by providing feedback on its explanations, which is used to automatically augment the training data with counterexamples or to modify the model using rrr. Our experimental results demonstrate that users care strongly about “Clever Hans”-like moments in machine learning and that XIL can indeed help avoid them.

There are several possible avenues for future work to overcome the current limitations of XIL. Acquiring annotations, especially of explanations, can be time-consuming. The number of interactions required in order to reach an acceptable state is an open issue [16]. Hence, one should work on optimal query strategies for XIL that aim at minimizing the interaction efforts. Adapting regret bounds from coactive learning [11] might be an interesting alternative. Moreover, the data at hand may not always allow XIL to fully alleviate wrong reasons without decreasing the network’s predictive performance. One should develop ways for keeping the drop as small as possible. Furthermore, XIL relies on two assumptions, namely, (a) faithful explanations can be computed, and (b) the user feedback is faithful, too. Assumption (a) is still subject to very active research, particularly for deep learning methods [31] (see the supplement). One should improve the quality and robustness of XAI methods and also explore XIL for interpretable models [32]. If the user is rather confident about the right reasons, learning-to-explain methods such as hint provide an interesting avenue for future work. Our initial results, see the supplement, are encouraging. However, even scientific experts do not always know the reasons for predictions. Therefore, one should strive to better understand the effects of wrong feedback and even adversarial attacks [33] on XIL. Additionally, one should turn other interactive learning settings such as coactive [11], active imitation [18], mixed-initiative interactive [19] and guided probabilistic learning [34] into explanatory ones. Lastly, because it is not yet clear what makes explanations good for humans [35], one should extend explanatory interactions towards using alternative explanations, multiple modalities and counterfactuals [36, 37].
In any case, interacting with explanations of machine learning models is an enabler for scientific discoveries for humans and machines in cooperation.

Methods

Active learning.

The active learning paradigm targets scenarios where obtaining supervision has a non-negligible cost. Here we cover the basics of pool-based active learning, and refer the reader to two excellent surveys [38, 39] for more details. Let $\mathcal{X}$ be the space of instances and $\mathcal{Y}$ be the set of labels (e.g. $\mathcal{Y}=\{\pm 1\}$ ). Initially, the learner has access to a small set of labeled examples $\mathcal{L}\subseteq\mathcal{X}\times\mathcal{Y}$ and a large pool of unlabeled instances $\mathcal{U}\subseteq\mathcal{X}$ . The learner is allowed to query the label of unlabeled instances (by paying a certain cost) to a user functioning as an annotator, often a human expert. Once acquired, the labeled examples are added to $\mathcal{L}$ and used to update the model. The overall goal is to maximize the model quality while keeping the number of queries or the total cost at a minimum. To this end, the query instances are chosen to be as informative as possible, typically by maximizing some informativeness criterion, such as the expected model improvement [40] or practical approximations thereof. By carefully selecting the instances to be labeled, active learning can enjoy much better sample complexity than passive learning [41, 42]. Prototypical active learners include max-margin [43] and Bayesian approaches [44]; recently, deep variants have been proposed [45]. However, active (showing query data points) and even coactive learning (showing additionally the prediction of the query data point) do not establish trust: informative selection strategies just pick instances where the model is uncertain and likely wrong. There is a trade-off between query informativeness and user “satisfaction”, as noticed and explored in [46]. To properly modulate trust into the model, we argue it is essential to present explanations, e.g., visual ones as shown in Fig. 6.
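As an illustration of the pool-based loop described above, the following minimal sketch uses uncertainty sampling as the informativeness criterion. The one-dimensional threshold model, the `oracle` function standing in for the annotator, and all other names are hypothetical choices made for this example; they are not the models or query strategies used in our experiments.

```python
import math

def predict_proba(threshold, x, sharpness=4.0):
    """Toy probabilistic classifier: P(y = +1 | x) for a 1-D threshold model."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))

def fit_threshold(labeled):
    """Place the threshold midway between the largest negative and the smallest positive."""
    negatives = [x for x, y in labeled if y == -1]
    positives = [x for x, y in labeled if y == +1]
    return (max(negatives) + min(positives)) / 2.0

def most_uncertain(threshold, pool):
    """Uncertainty sampling: pick the instance whose predicted probability is closest to 0.5."""
    return min(pool, key=lambda x: abs(predict_proba(threshold, x) - 0.5))

def active_learning(labeled, pool, oracle, budget):
    """Query `budget` labels from the pool, refitting the model before each query."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)
        query = most_uncertain(threshold, pool)
        pool.remove(query)
        labeled.append((query, oracle(query)))  # the annotator pays the labeling cost
    return fit_threshold(labeled), labeled

def oracle(x):
    """Hypothetical annotator with a hidden decision boundary at 0.62."""
    return +1 if x >= 0.62 else -1

initial = [(0.0, -1), (1.0, +1)]          # small labeled set L
pool = [i / 20 for i in range(1, 20)]     # unlabeled pool U
threshold, labeled = active_learning(initial, pool, oracle, budget=5)
```

With five queries the learned threshold ends up close to the hidden boundary, although the annotator never sees why a particular instance was queried, which is the opacity discussed above.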

Local explainers.

There are two main strategies for explaining machine learning models. Global approaches aim to explain the model by converting it as a whole to a more interpretable format [7, 47]. Local explainers instead focus on the arguably more approachable task of explaining individual predictions [9]. While explanatory interactive learning can accommodate any local explainer, in our implementations we used either lime [8] or grad-Cam [20], both described next.

The idea of lime (Local Interpretable Model-agnostic Explanations) is simple: even though a classifier may rely on many uninterpretable features, its decision surface around any given instance can be locally approximated by a simple, interpretable local model. In lime, the local model is defined in terms of simple features encoding the presence or absence of basic components, such as words in a document or objects in a picture. While not all problems admit explanations in terms of elementary components, many of them do [8]; in this case, lime assumes these to be provided in advance. An explanation can be readily extracted from such a model by reading off the contributions of the various components to the target prediction and translating them into an interpretable visual artifact. For instance, in document classification one may highlight the words that support (or contradict) the predicted class.

grad-Cams are a generalization of Class Activation Maps, introduced by [48], and take advantage of two facts: firstly, deeper layers of a CNN capture higher-level visual constructs and, secondly, convolutional features retain spatial information. As such, the last convolutional layer represents a trade-off between high-level visual representation and spatial information. Specifically, a grad-Cam is computed by forward passing an image through the network and then backpropagating a one-hot encoded vector that specifies the class label of interest up to the last convolutional layer. The resulting gradients of each channel are global average pooled, multiplied with the corresponding feature maps, summed and finally passed through a ReLU activation function. In this way, the final feature maps of the convolutional feature extractor are weighted by the importance of these features. The resulting two-dimensional heatmap can finally be interpolated to the original input size for visualization. In case a 3D convolutional network is used to classify hyperspectral data, the resulting heatmap is three-dimensional, also showing activations along the spectral dimension of the data, cf. Fig. 6.
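The pooling, weighting and ReLU steps described above can be sketched as follows. The sketch assumes that the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been obtained (in practice via the automatic differentiation of a deep learning framework); the function name `grad_cam` and the toy values are illustrative only.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one H x W heatmap.

    feature_maps, gradients: lists of K channels, each a nested list of floats.
    """
    heatmap = None
    for fmap, grad in zip(feature_maps, gradients):
        # Global average pooling of the channel's gradients gives its importance weight.
        n_values = sum(len(row) for row in grad)
        weight = sum(sum(row) for row in grad) / n_values
        weighted = [[weight * value for value in row] for row in fmap]
        if heatmap is None:
            heatmap = weighted
        else:
            heatmap = [[a + b for a, b in zip(row_a, row_b)]
                       for row_a, row_b in zip(heatmap, weighted)]
    # ReLU keeps only the locations with a positive influence on the class of interest.
    return [[max(0.0, value) for value in row] for row in heatmap]

# Two toy 2 x 2 channels: the first supports the class, the second speaks against it.
feature_maps = [[[1.0, 0.0], [0.0, 2.0]],
                [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(feature_maps, gradients)
```

Interpolating the heatmap to the input resolution is omitted here; for 3D convolutions the same computation applies with one additional (spectral) axis.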

Explanatory Interactive Learning with counterexamples.

Why is this data augmentation a sensible idea? To see this, consider the case of linear max-margin classifiers. Let $f(x)=\langle\boldsymbol{w},\boldsymbol{\phi}(x)\rangle+b$ be a linear classifier over two features, $\phi_{1}$ and $\phi_{2}$, of which only the first is relevant. Fig. 7 shows that $f(x)$ (red line) uses $\phi_{2}$ to correctly classify a positive example $x_{i}$. In order to obtain a better model (e.g. the green line), the simplest solution would be to enforce an orthogonality constraint $\langle\boldsymbol{w},(0,1)^{\top}\rangle=0$ during learning. Counterexamples follow the same principle. In the separable case, the counterexamples $\{\bar{x}_{i\ell}\}_{\ell=1}^{c}$ amount to additional max-margin constraints [49] of the form $y_{i}\langle\boldsymbol{w},\boldsymbol{\phi}(\bar{x}_{i\ell})\rangle\geq 1$. The only ones that influence the model are those on the margin, for which strict equality holds. For all pairs of such counterexamples $\ell,\ell^{\prime}$ it holds that $\langle\boldsymbol{w},\boldsymbol{\phi}(\bar{x}_{i\ell})\rangle=\langle\boldsymbol{w},\boldsymbol{\phi}(\bar{x}_{i\ell^{\prime}})\rangle$, or equivalently $\langle\boldsymbol{w},\boldsymbol{\delta}_{i\ell}-\boldsymbol{\delta}_{i\ell^{\prime}}\rangle=0$, where $\boldsymbol{\delta}_{i\ell}=\boldsymbol{\phi}(\bar{x}_{i\ell})-\boldsymbol{\phi}(x_{i})$. In other words, the counterexamples encourage orthogonality between $\boldsymbol{w}$ and the correction vectors $\boldsymbol{\delta}_{i\ell}-\boldsymbol{\delta}_{i\ell^{\prime}}$, thus approximating the orthogonality constraint above.

Most importantly, this data augmentation procedure is model-agnostic, although alternatives indeed exist: (manually) adding a discovered data artifact to samples of other classes [50], contrastive examples [51], feature ranking [52] for SVMs and constraints on the input gradients for differentiable models [4].

We note that due to sampling, lime may output different explanations for the same prediction. To reduce the variance of the experiments with ce of Tab. 1, we ran it $10$ times and retained the $k$ components identified most often as relevant by lime.
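A minimal sketch of this variance-reduction step is given below. The `explain` callable stands in for a call to a sampling-based explainer such as lime and is assumed to return the components it deems relevant for one run; the deterministic `toy_explainer` and the component names are purely illustrative.

```python
from collections import Counter
import itertools

def stable_components(explain, instance, k, runs=10):
    """Run the explainer several times and keep the k components reported most often."""
    counts = Counter()
    for _ in range(runs):
        # Count each component at most once per run.
        counts.update(set(explain(instance)))
    # Sort by decreasing frequency; break ties by name so the result is reproducible.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [component for component, _ in ranked[:k]]

# Hypothetical explainer whose output varies from run to run.
_outputs = itertools.cycle([["corner", "sleeve"],
                            ["corner", "collar"],
                            ["corner", "sleeve"]])

def toy_explainer(instance):
    return next(_outputs)

top = stable_components(toy_explainer, instance=None, k=2)
```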

fashion-MNIST dataset.

The fashion-MNIST dataset, a fashion product recognition dataset, includes 70,000 images over 10 classes. All images were corrupted by introducing confounders, that is, $4\times 4$ patches of pixels in randomly chosen corners whose shade is a function of the label in the training set and random in the test set (see [4] for details).
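The corruption can be sketched as follows. The concrete mapping from label to shade (`255 - 25 * label`) is an assumption made for this example in the spirit of [4]; the function name, the nested-list image representation and the seed are likewise illustrative.

```python
import random

def add_confounder(image, label, train, rng, patch=4):
    """Return a copy of a grayscale image (H x W nested list, values 0..255)
    with a patch x patch confounder placed in a randomly chosen corner."""
    height, width = len(image), len(image[0])
    if train:
        shade = 255 - 25 * label      # assumed label-dependent shade (training set)
    else:
        shade = rng.randrange(256)    # random shade, uninformative (test set)
    top = rng.choice([0, height - patch])
    left = rng.choice([0, width - patch])
    confounded = [row[:] for row in image]  # leave the original image untouched
    for r in range(top, top + patch):
        for c in range(left, left + patch):
            confounded[r][c] = shade
    return confounded

rng = random.Random(0)
clean = [[0] * 28 for _ in range(28)]
confounded = add_confounder(clean, label=3, train=True, rng=rng)
```

A classifier trained on such data can reach high training accuracy by reading the patch alone, which is exactly the kind of shortcut that fails once the shade is randomized at test time.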

PASCAL VOC 2007 dataset.

We used a subset of the PASCAL VOC 2007 dataset in our experiment. This subset includes 1470 training and 782 test images over 5 classes (horse, cat, bird, bus, dog). Only samples from the horse class contain confounding features, i.e. watermark text. We rescaled all images to $224\times 224\times 3$ to use the VGG-16 network [53] as classifier, initialized it with ImageNet-pre-trained weights, and used the ADAM optimizer [54]. We trained a default model without user feedback and a model with user feedback for 2k epochs. The explanation method was instantiated with input gradients (IG).

Sample collection.

To demonstrate the significance of XIL, we applied it to deep plant phenotyping and plant disease detection, a growing and relevant field of research [55, 56, 57, 58, 59, 60]. To this end, we recorded a scientific, real-world dataset: a plant phenotyping dataset consisting of RGB and hyperspectral (HS) images of healthy and diseased sugar beet leaves. Then, we applied convolutional neural networks to classify the plants’ leaves into the categories control (healthy) and inoculated (diseased) and investigated the underlying reasons for the network’s predictions. As a model disease, Cercospora leaf spot (CLS) was used. This is caused by Cercospora beticola and is the most destructive leaf disease of sugar beet with worldwide economic importance.

The dataset used in this study corresponds to HS and RGB images of leaf discs of sugar beet cv. Isabella (KWS, Einbeck, Germany) inoculated with Cercospora beticola. Sugar beet seeds were pre-grown in small pots and pricked out after the primary leaves were fully developed. The seedlings were then transferred into plastic pots (diameter of 17 cm) on commercial substrate (Topfsubstrat 1.5, Balster Erdenwerk GmbH, Sinntal-Altengronau, Germany) under greenhouse conditions and watered as necessary. After reaching growth stage 16 according to the BBCH scale [61], the plants were inoculated with C. beticola conidia, which were collected from infested sugar beet leaves after incubation in a moist chamber for 48 hours. A spore suspension of $5\times 10^{5}$ was sprayed onto leaves before the plants were transferred into plastic bags to achieve 100% RH for 48 hours. For image acquisition, leaf discs were stamped out with a cork borer of 2 cm diameter and placed on 10 g/l phytoagar (Duchefa Biochemie B.V., Haarlem, Netherlands), containing 0.34 mM benzimidazole, 10 g sucrose and 3 mg kinetin. To observe different symptom classes, sugar beet leaves of 9, 14 and 19 days after inoculation (dai) were used, since the first symptoms appeared 9 dai. As a control group, 18 leaf discs of untreated sugar beet plants were measured as well, and five technical replications with 6 discs each were used for each symptom group.

Each sample, both control and inoculated, was measured daily over five consecutive days such that a sample from 9 dai reappears four further times in the dataset as 10 to 13 dai. A few samples were discarded due to technical issues. The percentage of healthy leaves to unhealthy leaves was approximately $26\%$ to $74\%$ , respectively. For image acquisition leaf discs on agar were placed on a linear stage at a distance of 53 cm to a Hyperspec VNIR E-series imaging sensor (Headwall Photonics, Bolton, MA, USA) in the range of 380 nm to 1010 nm. The VNIR sensor has a spectral resolution of 2-3 nm and a pixel pitch of 6.5 $\mu$ m. The sensor was surrounded by eight lamps (Ushio Halogen Lamp J12V-150WA/80 (Marunouchi, Chiyoda-ku, Tokyo, Japan)) and the distance between lamps and leaves was 60 cm with a vertical orientation of 45°. Exposure times of 44 ms were used for the VNIR sensor.

The dataset consists of 2410 samples with 504 samples labeled as control and 1906 labeled as inoculated. Control samples were not re-used as inoculated samples. The collected hyperspectral raw data size was around 4 TB. After preprocessing the data by cutting out the leaf discs into hyperspectral cubes, the data size is 140 GB. Since there is a lot of redundancy in the wavelength resolution, we further sub-sampled the depth of the data cubes, resulting in a final data size of 32 GB.

Data preparation.

As mentioned above, each sample was imaged over five consecutive days such that each sample, though slightly differing from day to day, is represented up to 5 times within the full dataset. In this way, a sample from 9 dai would occur for 4 further days (10-13 dai). To prevent the models from memorizing the structure of the individual leaf samples and correlating this to the corresponding labels, all images of a given sample, across all days, were assigned exclusively to either the training or the validation dataset.
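Such a sample-wise split can be sketched as follows. The record layout `(sample_id, day, data)`, the function name and the choice to apply the split fraction to samples rather than to individual images are assumptions made for illustration.

```python
import random

def split_by_sample(records, train_fraction=0.75, seed=0):
    """Split records so that all days of one leaf disc land on the same side.

    records: list of (sample_id, day, data) tuples.
    """
    sample_ids = sorted({sample_id for sample_id, _, _ in records})
    rng = random.Random(seed)
    rng.shuffle(sample_ids)
    n_train = round(train_fraction * len(sample_ids))
    train_ids = set(sample_ids[:n_train])
    train = [r for r in records if r[0] in train_ids]
    held_out = [r for r in records if r[0] not in train_ids]
    return train, held_out

# Eight hypothetical leaf discs, each imaged on five consecutive days.
records = [(f"disc{i}", day, None) for i in range(8) for day in range(5)]
train, held_out = split_by_sample(records)
```

Splitting at the level of images instead would let near-duplicates of the same disc appear on both sides and would inflate the validation accuracy.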

Removing confounders for the scientific dataset.

It is essential to maintain the underlying assumption that the training and test data are drawn from the same distribution. If this is not the case, changes in accuracy might be due to artifacts of different data, rather than deficits of the model [62]. We applied two variations to the test samples of the HS dataset to remove the confounders: we set the background (everything but the plant tissue) (1) to the per-channel average of the non-tissue regions or (2) the per-channel average of the full images of the training data. We then evaluated the default and rrr revised CNNs on this modified test dataset. We focused here only on the HS data and model, due to the limitations of the RGB model’s performance.
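The two background replacements can be sketched as follows. The sketch assumes that both averages are computed from the training images and that a boolean tissue mask is available for every image; the nested-list layout `image[row][column][channel]` and all names are illustrative.

```python
def channel_mean(images, masks, background_only):
    """Per-channel mean over a set of images.

    images: list of H x W x C nested lists; masks: H x W booleans (True = tissue).
    If background_only is True, tissue pixels are skipped (variation 1);
    otherwise all pixels contribute (variation 2).
    """
    n_channels = len(images[0][0][0])
    totals, count = [0.0] * n_channels, 0
    for image, mask in zip(images, masks):
        for image_row, mask_row in zip(image, mask):
            for pixel, is_tissue in zip(image_row, mask_row):
                if background_only and is_tissue:
                    continue
                for channel, value in enumerate(pixel):
                    totals[channel] += value
                count += 1
    return [total / count for total in totals]

def replace_background(image, mask, fill):
    """Keep tissue pixels and overwrite every background pixel with `fill`."""
    return [[list(pixel) if is_tissue else list(fill)
             for pixel, is_tissue in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)]

# One toy 2 x 2 image with two channels; only the top-left pixel is tissue.
image = [[[1.0, 2.0], [3.0, 4.0]],
         [[5.0, 6.0], [7.0, 8.0]]]
mask = [[True, False],
        [False, False]]
fill_background = channel_mean([image], [mask], background_only=True)   # variation (1)
fill_full = channel_mean([image], [mask], background_only=False)        # variation (2)
cleaned = replace_background(image, mask, fill_background)
```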

RGB/HS classification.

The RGB images used for training the classifiers were generated from the hyperspectral data, by slicing the data at the corresponding RGB channels that were provided by the camera system (cf. Fig. 1 (A-Right)). Before training the RGB classifiers, the data was standard scaled following $z=(x-u)/s$ , where $u$ is the mean and $s$ the standard deviation of the training samples.
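For concreteness, the scaling can be sketched as below. The statistics are estimated on the training samples only and then reused for the test samples; the use of the population standard deviation and the flat list of values are simplifying assumptions of this example.

```python
import statistics

def fit_scaler(train_values):
    """Estimate mean u and standard deviation s from the training samples only."""
    u = statistics.fmean(train_values)
    s = statistics.pstdev(train_values)  # population standard deviation (assumption)
    return u, s

def standard_scale(values, u, s):
    """Apply z = (x - u) / s with previously fitted statistics."""
    return [(x - u) / s for x in values]

train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
u, s = fit_scaler(train_values)
scaled_test = standard_scale([5.0, 9.0], u, s)
```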

To train a classifier on the RGB images of sugar beet leaves we used a VGG-16 [53] network pre-trained on ImageNet [63] to finetune the network parameters using the RGB plant images. For training a batch size of 32, a learning rate of 1e-4 and a step learning rate scheduler set to reduce the learning rate at epochs 5 and 15 by a factor of $0.1$ were used. Furthermore, the ADAM optimizer was used with L2 regularization 1e-5. Five separate cross-validation folds were trained until convergence, using a data split of $0.75$ for training and $0.25$ for testing. Convergence was reached after 30 epochs.
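The step schedule described above corresponds to the following rule. The sketch assumes zero-based epoch counting and that each reduction takes effect from the milestone epoch onwards, as multi-step schedulers in common deep learning frameworks do; the function name is illustrative.

```python
import math

def step_learning_rate(epoch, base_lr=1e-4, milestones=(5, 15), gamma=0.1):
    """Learning rate at a given epoch: multiply by gamma at every milestone passed."""
    drops = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** drops

# Learning rates over the 30 training epochs reported above.
schedule = [step_learning_rate(epoch) for epoch in range(30)]
```

Under these assumptions the rate is 1e-4 for epochs 0 to 4, 1e-5 for epochs 5 to 14 and 1e-6 afterwards.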

To classify the HS data we trained a convolutional neural network (CNN) architecture with batch normalization using 3D convolution filters, rather than standard 2D filters, learning features not only along the image dimensions but also over the spectral dimensions. The network is built from four residual blocks, each containing one to three convolutional layers. The last two layers are fully connected layers with a final softmax activation function. The other layers use ReLU activations. During training we used dropout to prevent overfitting. The network’s parameters were trained with a stochastic gradient descent optimizer with momentum using a batch size of 10 HS images, a learning rate of 1e-4 and an L2 regularization of 1e-5.

Five separate cross-validation folds were trained until convergence, using a data split of $0.75$ for training and $0.25$ for testing. Convergence was reached after 100 epochs.

Analyzing classification strategies of the model.

Based on the results of [31], in which the authors performed sanity checks over a variety of saliency methods, we chose to investigate our model’s explanations using Gradient-weighted Class Activation Mapping (grad-Cam) [20].

To analyze the resulting strategies produced by the layer-wise relevance propagation method (LRP), the authors of [3] resort to spectral clustering on the resulting heatmaps in a pipeline they termed ‘SpRAy’. This clustering served to obtain an overview of the range of the model’s decision strategies. We apply SpRAy in a similar way; however, rather than using the raw grad-Cam heatmaps, we first perform a discrete Fourier transformation on them to better differentiate the strategies which we had previously identified from single samples. In detail, the pipeline is as follows:

• Perform a discrete Fourier transform on downsized grad-Cam heatmaps.
• Using the Euclidean distance for the RGB data and the Cityblock distance for the HS data, compute a k-nearest neighbor graph of the Fourier transformed heatmaps, represented as an adjacency matrix, $C$.
• Compute the affinity matrix as suggested in [64] as $A=\max(C,C^{T})$.
• Perform an eigengap analysis [64] to estimate the number of clusters, k, within the dataset.
• Perform spectral clustering on the affinity matrix, given k from the previous step.
• Perform a t-SNE analysis [23] on the similarity matrix, estimated from the affinity matrix as in [3] as $S=\frac{1}{A+\epsilon}$, whereby $\epsilon\in[0,1]$; here we used $\epsilon=0.05$.
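The graph construction steps of this pipeline (k-nearest neighbor graph, symmetrized affinity matrix and the matrix $S$) can be sketched as follows. The vectors stand in for flattened, Fourier transformed heatmaps; the Fourier transform itself, the eigengap analysis, the spectral clustering and t-SNE are omitted and would normally be delegated to a numerical library. All function names and the toy data are illustrative.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cityblock(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_graph(vectors, k, distance):
    """Directed k-nearest neighbor graph as a 0/1 adjacency matrix C."""
    n = len(vectors)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other points by distance; ties are broken by index.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (distance(vectors[i], vectors[j]), j))
        for j in others[:k]:
            C[i][j] = 1
    return C

def symmetrize(C):
    """Affinity matrix A = max(C, C^T): keep an edge if either endpoint selected it."""
    n = len(C)
    return [[max(C[i][j], C[j][i]) for j in range(n)] for i in range(n)]

def similarity(A, eps=0.05):
    """Element-wise S = 1 / (A + eps), as passed on to t-SNE."""
    return [[1.0 / (a + eps) for a in row] for row in A]

vectors = [[0.0], [1.0], [3.0]]       # toy stand-ins for transformed heatmaps
C = knn_graph(vectors, k=1, distance=cityblock)
A = symmetrize(C)
S = similarity(A)
```

In the toy example the third point selects the second as its nearest neighbor but not vice versa, so the edge is present in $A$ although it is missing in one direction of $C$.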

Applying XIL to CNNs for scientific dataset.

We produced the matrix $A$ (Eq. 1) corresponding to full tissue masks for each sample. Specifically, for each sample, we created a binary mask having values of zero within the tissue and values of one everywhere else, i.e. the background. In this way during training the gradients everywhere but on the tissue are to be minimized.

The network models were retrained from the same initial values as in the default training mode (using only the cross-entropy loss), however, now using rrr. To choose the optimal $\lambda_{1}$ value, the resulting explanations were visually assessed. The five cross-validation folds of HS-CNN were thus trained until convergence, which took between 200 and 280 epochs, using $\lambda_{1}=20$, with all other hyperparameters as in the default training mode. For training the RGB-CNN with rrr the learning rate was reduced to a constant learning rate of 5e-5. Although we applied a range of $\lambda_{1}$ values from $0.1$ to $1000$ to the RGB-CNN, no satisfactory convergence state could be reached in which the regularized model showed acceptable explanations for each cross-validation run. The accuracies in Tab. 1 and the strategies presented in Fig. 4(b) and Fig. S.1(b) correspond to grad-Cams of training the five cross-validation folds with $\lambda_{1}=1$ for up to 200 epochs.
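To make the role of the mask and of $\lambda_{1}$ concrete, the following sketch evaluates a masked penalty of the kind used by rrr [4] for a single sample. It assumes that the explanation values (e.g. gradients with respect to the input) have already been computed, and it omits the weight regularization term; the function names and toy numbers are illustrative and do not reproduce our training code.

```python
import math

def rrr_penalty(mask, gradients, lambda_1):
    """lambda_1 * sum over all positions of (mask * gradient)^2.

    mask: 1 where the model should NOT look (background), 0 on the tissue.
    """
    return lambda_1 * sum((a * g) ** 2
                          for mask_row, grad_row in zip(mask, gradients)
                          for a, g in zip(mask_row, grad_row))

def rrr_loss(cross_entropy, mask, gradients, lambda_1):
    """Right-answer term plus right-reason term."""
    return cross_entropy + rrr_penalty(mask, gradients, lambda_1)

# Toy 3 x 3 sample: only the centre pixel is tissue.
mask = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
gradients = [[0.1, 0.0, 0.0],
             [0.0, 2.0, 0.0],
             [0.0, 0.0, -0.2]]
penalty = rrr_penalty(mask, gradients, lambda_1=20)
loss = rrr_loss(0.3, mask, gradients, lambda_1=20)
```

The large gradient on the tissue pixel is not penalized, whereas the two small background gradients are; increasing $\lambda_{1}$ shifts the balance further towards suppressing the background.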

Extended related work.

Using XIL with ce or rrr, users either introduce counterexamples into the dataset and thus teach the learner not to depend on the irrelevant components or directly penalize the learner as soon as it uses irrelevant components, respectively. One important advantage of XIL is that the user does not have to be certain about the right reasons and instead can explore the learned reasons of the machine, in contrast to other procedures such as preprocessing the training set.

Recently, Selvaraju et al. [21] presented a framework (hint) similar to rrr but instead of penalizing the wrong reasons it advises the network to use a specific visual area (right reasons). As ce and rrr, the hint method could be embedded within the introduced XIL framework in case the users are certain about the right reasons. However, in many scientific applications such as the presented plant phenotyping dataset users are uncertain about what a valid explanation should be. In this case, removing wrong reasons might be preferable to applying right reasons.

The possibility of bi-directional exchange between user and model due to interaction [65] also distinguishes XIL from approaches for feature selection such as feature masking and approaches that embed prior knowledge into the training process, e.g. [66]. Lastly, interactions also allow the user to provide incomplete explanations: the user needs to revise incorrect aspects of a model’s explanation only when this is actually required.

Finally, we present the XIL framework here for visual tasks and visual explanations only. With our definition of XIL, it is also applicable to other data domains like natural language processing, see e.g. [4, 16]. However, we experienced that explanations, i.e. right and wrong reasons, are more difficult to define for this modality. In future work one should generally address the best ways to present explanations, even in multi-modal scenarios.

Details on participant recruitment and study procedure.

The presented study is part of an extensive thesis work [15]. It was conducted as an online survey, the link to which was distributed via the social network Facebook and the forum of the student body of the department of computer science at TU Darmstadt. Due to the distribution on these channels, a wide range of people of different ages and backgrounds was reached. Each participant completed only one of the three test conditions, with 33 participants in TC1, 36 participants in TC2 and 37 participants in TC3, totaling 106 participants overall.

The wording of the original TiA was modified by replacing “system” with “artificial intelligence (AI)”. The response format to each question was a 5-point rating scale from strongly disagree to strongly agree.

Statistical analysis of the user study.

Samples with missing values were removed from the analysis and for all tests a significance level with alpha being 5% was used.

For all tests with the same sample/samples, the alpha level was corrected via the Bonferroni-Holm method. The corrected alpha level will be stated for every analysis. For testing the hypotheses one multi-factorial analysis of variances (MANOVA) and several one-factorial ANOVAs were conducted. The ANOVA, as well as the MANOVA, requires normal distribution of data, independence of data as well as homogeneity of the variances. To test the latter a Levene-Test was conducted before every ANOVA and the MANOVA. Normal distribution was presumed due to the sample sizes and as the samples were drawn randomly the independence of data was also presumed. A significant result of an ANOVA / MANOVA means that at least two of the groups differ significantly with respect to the dependent variable, but it is not stated which groups differ. Therefore, if the carried out analyses of variances were significant, post hoc tests were carried out to investigate which groups differed exactly. Post hoc tests were selected in this study as the hypotheses did not point out which groups should differ, which is why every possible comparison had to be considered. For post hoc testing, the Tukey-HSD-Test and the Pairwise-Test were performed.
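The Bonferroni-Holm step-down procedure can be sketched as follows. The function name and the example p-values are illustrative and are not taken from the study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return, for each hypothesis, whether it is rejected under Bonferroni-Holm.

    The i-th smallest p-value (i = 0, 1, ...) is compared against alpha / (m - i);
    testing stops at the first p-value that exceeds its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            rejected[index] = True
        else:
            break  # all remaining (larger) p-values are retained as well
    return rejected

# Three hypothetical tests on the same sample.
decisions = holm_bonferroni([0.001, 0.04, 0.03])
```

For a family of $m$ tests the strictest threshold is $\alpha/m$, e.g. $.05/2=.025$, $.05/3\approx.0167$ and $.05/4=.0125$. In the example, the second smallest p-value ($.03$) exceeds its threshold of $.025$, so only the first hypothesis is rejected.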

The TiA score of subjects being familiar with AI over the whole sample (all test conditions combined) was higher ( $M=2.82$ , $SD=.64$ ) than the TiA score of subjects being unfamiliar with AI ( $M=2.51$ , $SD=.59$ ). As the conducted Levene-Test ( $F(5,99)=1.8$ , $p=.12$ , $\alpha=.05$ ) was not significant, the homogeneity of variance assumption held. Therefore, the MANOVA was conducted with a significant result for the independent variable test condition ( $F(2,99)=10.10$ , $p<.001$ , $\alpha=.025$ ). The MANOVA was significant for the independent variable familiarity with AI ( $F(1,99)=7.12$ , $p=.009$ , $\alpha=.025$ ). It was not significant for the interaction of the two independent variables ( $F(2,99)=.28$ , $p=.75$ , $\alpha=.025$ ).

For Fig. 5(a) in order to determine which test conditions differed significantly in their TiA score, a pairwise test was conducted as a post hoc test. The pairwise test showed significant differences between TC1 and TC3 ( $p=.0016$ , $\alpha=.05$ ) as well as between TC2 and TC3 ( $p=.0003$ , $\alpha=.05$ ).

For Fig. 5(b) the conducted Levene-Test was not significant ( $F(2,96)=.59$ , $p=.56$ , $\alpha=.05$ ). Therefore, an ANOVA was conducted afterwards and showed a significant result ( $F(2,96)=33.83$ , $p<.001$ , $\alpha=.0125$ ). Trust in the correct rule learning by the AI was significantly different between the blocks. The conducted Tukey-HSD test found a significant difference in trust into the correct rule learning only between stage 1 and 3 ( $p<.001$ , $\alpha=.05$ ) and between stage 2 and 3 ( $p<.001$ , $\alpha=.05$ ).

For Fig. 5(c) the Levene-Test was not significant ( $F(2,104)=.28$ , $p=.75$ , $\alpha=.05$ ). The ANOVA was significant ( $F(2,104)=23.19$ , $p<.001$ , $\alpha=.0167$ ). Therefore, a Tukey-HSD test was performed to investigate which blocks differed significantly. The test found only stage 1 and 3 ( $p<.001$ , $\alpha=.05$ ) and stage 2 and 3 ( $p<.001$ , $\alpha=.05$ ) to differ significantly with respect to trust in correct rule learning by the AI.

For Fig. 5(d) the conducted Levene-Test was not significant ( $F(2,105)=1.32$ , $p=.27$ , $\alpha=.05$ ). The subsequently conducted ANOVA was also not significant ( $F(2,105)=1.62$ , $p=.20$ , $\alpha=.05$ ). Therefore, there was no significant difference in trust in correct rule learning by the AI in TC3 and no post hoc test was performed.

Data availability

The ML benchmark Fashion-MNIST is available at https://github.com/zalandoresearch/fashion-mnist. The PASCAL VOC2007 dataset is available at http://host.robots.ox.ac.uk/pascal/VOC/voc2007/. The RGB and hyperspectral data that support the findings of this study are available at https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2278.4 and in the code repository https://codeocean.com/capsule/4559958/tree. The user study is available at https://github.com/ml-research/xil/tree/master/Trust_Study.

Code availability

The code and a fully runnable capsule to reproduce the figures and results of this article, including pre-trained models, can be found at https://codeocean.com/capsule/4559958/tree.

Statement of ethical compliance

The authors confirm that they have complied with all relevant ethical regulations, according to the Ethics Commission of the TU Darmstadt (https://www.intern.tu-darmstadt.de/gremien/ethikkommisson/auftrag/auftrag.en.jsp). Informed consent was obtained from each participant prior to commencing the user study.

Acknowledgments

ST and KK thank Antonio Vergari, Andrea Passerini, Samuel Kolb, Jessa Bekker, Xiaoting Shao, and Paolo Morettin for very useful feedback on the conference version of this article. Furthermore, the authors are thankful to Frank Jäkel for support and supervision on the user study, to Cigdem Turan for providing the figure sketches, and to Ulrike Steiner and Stefan Paulus for very useful feedback. PS, AKM, AB and KK acknowledge the support by BMEL funds of the German Federal Ministry of Food and Agriculture (BMEL) based on a decision of the Parliament of the Federal Republic of Germany via the Federal Office for Agriculture and Food (BLE) under the innovation support program, project “DePhenSe” (FKZ 2818204715). WS and KK were also supported by BMEL/BLE funds under the innovation support program, project “AuDiSens” (FKZ 28151NA187). ST acknowledges the support by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. [694980] “SYNTH: Synthesising Inductive Data Models”. XS and KK also acknowledge the support by the German Science Foundation project “CAML” (KE1686/3-1) as part of the SPP 1999 (RATIO). AKM was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2070 – 390732324.

Conflict of interest statement

The authors declare the following competing interests: HGL is employed by LemnaTec GmbH.

Author information

Affiliations

Technical University of Darmstadt, Computer Science Department, Artificial Intelligence and Machine Learning Lab, Darmstadt, Germany
Patrick Schramowski, Wolfgang Stammer, Franziska Herbert, Xiaoting Shao

Technical University of Darmstadt, Computer Science Department and Centre for Cognitive Science, Darmstadt, Germany
Kristian Kersting

University of Trento, Department of Information Engineering and Computer Science, Trento, Italy
Stefano Teso

University of Bonn, Institute of Crop Science and Resource Conservation (INRES) – Plant Diseases and Plant Protection, Bonn, Germany
Anna Brugger

Institute of Sugar Beet Research, Goettingen, Germany
Anne-Katrin Mahlein

LemnaTec GmbH, Aachen, Germany
Hans-Georg Luigs

Author Contributions

PS and WS contributed equally to the work. PS, WS, ST and KK designed the study. ST and KK designed and published (AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society 2019) the preliminary version of this manuscript. PS, WS, XS, ST, and KK developed extensions of the basic XIL methods. PS, WS, AB, AKM, and KK interpreted the data and drafted the manuscript. AB and PS designed the phenotyping dataset. AB and HGL carried out the measurements for the phenotyping dataset. PS, WS and AB did the biological analysis. FH performed and analyzed the user study. AKM and KK directed the research and gave initial input. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Patrick Schramowski and Wolfgang Stammer.

References

[1] Guidotti, R. et al. A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51, 1–42 (2018).
[2] Gilpin, L. H. et al. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE International Conference on data science and advanced analytics (DSAA), 80–89 (2018).
[3] Lapuschkin, S. et al. Unmasking clever hans predictors and assessing what machines really learn. Nature communications 10, 1096 (2019).
[4] Ross, A. S., Hughes, M. C. & Doshi-Velez, F. Right for the right reasons: Training differentiable models by constraining their explanations. In Proceedings of International Joint Conference on Artificial Intelligence, 2662–2670 (2017).
[5] Simpson, J. A. Psychological foundations of trust. Current directions in psychological science 16, 264–268 (2007).
[6] Hoffman, R. R., Johnson, M., Bradshaw, J. M. & Underbrink, A. Trust in automation. IEEE Intelligent Systems 28, 84–88 (2013).
[7] Buciluǎ, C., Caruana, R. & Niculescu-Mizil, A. Model Compression. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, 535–541 (2006).
[8] Ribeiro, M. T., Singh, S. & Guestrin, C. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining, 1135–1144 (ACM, 2016).
[9] Lundberg, S. & Lee, S. An unexpected unity among methods for interpreting model predictions. CoRR abs/1611.07478 (2016). URL http://arxiv.org/abs/1611.07478.
[10] Settles, B. Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1467–1478 (Association for Computational Linguistics, 2011).
[11] Shivaswamy, P. & Joachims, T. Coactive learning. Journal of Artificial Intelligence Research 53, 1–40 (2015).
[12] Kulesza, T. et al. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of International Conference on Intelligent User Interfaces, 126–137 (2015).
[13] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. & Zisserman, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[14] Lin, T. et al. Microsoft COCO: common objects in context. In Proceedings of European Conference on Computer Vision, 740–755 (2014).
[15] Herbert, F. P., Kersting, K. & Jäkel, F. Why Should I Trust in AI? Master’s thesis, Technical University Darmstadt (2019).
[16] Teso, S. & Kersting, K. Explanatory interactive machine learning. In Proceedings of AAAI/ACM Conference on AI, Ethics, and Society (AAAI, 2019).
[17] Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 206–215 (2019).
[18] Judah, K. et al. Active imitation learning via reduction to iid active learning. In AAAI Fall Symposium Series (2012).
[19] Cakmak, M. et al. Mixed-initiative active learning. ICML 2011 Workshop on Combining Learning Strategies to Reduce Label Cost (2011).
[20] Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626 (2017).
[21] Selvaraju, R. R. et al. Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE International Conference on Computer Vision, 2591–2600 (2019).
[22] Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
[23] Maaten, L. v. d. & Hinton, G. Visualizing data using t-SNE. Journal of machine learning research 9, 2579–2605 (2008).
[24] Körber, M. Theoretical considerations and development of a questionnaire to measure trust in automation. In Congress of the International Ergonomics Association, 13–30 (Springer, 2018).
[25] Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349, 255–260 (2015).
[26] Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature 521, 452–459 (2015).
[27] Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
[28] Zech, J. R. et al. Confounding variables can degrade generalization performance of radiological deep learning models. CoRR abs/1807.00431 (2018). URL http://arxiv.org/abs/1807.00431.
[29] Badgeley, M. A. et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. npj Digital Medicine 2, 31 (2019).
[30] Chaibub Neto, E. et al. A permutation approach to assess confounding in machine learning applications for digital health. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 54–64 (ACM, 2019).
[31] Adebayo, J. et al. Sanity checks for saliency maps. In Proceedings of Advances in Neural Information Processing Systems, 9505–9515 (2018).
[32] Chen, C. et al. This looks like that: Deep learning for interpretable image recognition. In Proceedings of Advances in Neural Information Processing Systems, 8928–8939 (2019).
[33] Dombrowski, A. et al. Explanations can be manipulated and geometry is to blame. In Proceedings of Advances in Neural Information Processing Systems, 13567–13578 (2019).
[34] Odom, P. & Natarajan, S. Human-guided learning for probabilistic logic models. Frontiers in Robotics and AI 5, 56 (2018).
[35] Narayanan, M. et al. How do humans understand explanations from machine learning systems? an evaluation of the human-interpretability of explanation. CoRR abs/1802.00682 (2018). URL http://arxiv.org/abs/1802.00682.
[36] Kanehira, A. & Harada, T. Learning to explain with complemental examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8603–8611 (2019).
[37] Huk Park, D. et al. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8779–8788 (2018).
[38] Settles, B. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6, 1–114 (2012).
[39] Hanneke, S. et al. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning 7, 131–309 (2014).
[40] Roy, N. et al. Toward optimal active learning through Monte Carlo estimation of error reduction. In Proceedings of International Conference on Machine learning, 441–448 (2001).
[41] Castro, R. M. et al. Upper and lower error bounds for active learning. In Proceedings of Conference on Communication, Control and Computing, 2.1, 1 (2006).
[42] Balcan, M.-F. et al. The true sample complexity of active learning. Machine learning 80, 111–139 (2010).
[43] Tong, S. & Koller, D. Support vector machine active learning with applications to text classification. Journal of machine learning research 2, 45–66 (2001).
[44] Krause, A. et al. Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. In Proceedings of International Conference on Machine learning, 449–456 (ACM, 2007).
[45] Gal, Y. et al. Deep Bayesian active learning with image data. In Proceedings of International Conference on Machine learning, 1183–1192 (2017).
[46] Schnabel, T. et al. Short-term satisfaction and long-term coverage: Understanding how users tolerate algorithmic exploration. In Proceedings of ACM International Conference on Web Search and Data Mining, 513–521 (ACM, 2018).
[47] Bastani, O., Kim, C. & Bastani, H. Interpreting blackbox models via model extraction. CoRR abs/1705.08504 (2017). URL http://arxiv.org/abs/1705.08504.
[48] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2921–2929 (2016).
[49] Cortes, C. et al. Support-vector networks. Machine learning 20, 273–297 (1995).
[50] Anders, C. J. et al. Analyzing ImageNet with spectral relevance analysis: Towards ImageNet un-Hans’ed. arXiv preprint arXiv:1912.11425 (2019).
[51] Zaidan, O. et al. Using “annotator rationales” to improve machine learning for text categorization. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 260–267 (2007).
[52] Small, K. et al. The constrained weight space SVM: learning with ranked features. In Proceedings of International Conference on Machine learning, 865–872 (Omnipress, 2011).
[53] Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of International Conference on Learning Representations (2015).
[54] Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In Proceedings of International Conference on Learning Representations (2015).
[55] Lau, E. High-throughput phenotyping of rice growth traits. Nature Reviews Genetics 15, 778–778 (2014).
[56] de Souza, N. High-throughput phenotyping. Nature Methods 36–36 (2009).
[57] Tardieu, F., Cabrera-Bosquet, L., Pridmore, T. & Bennett, M. Plant Phenomics, From Sensors to Knowledge. Current Biology 27, R770–R783 (2017).
[58] Pound, M. P. et al. Deep machine learning provides state-of-the-art performance in image-based plant phenotyping. Gigascience 6, gix083 (2017).
[59] Mochida, K. et al. Computer vision-based phenotyping for improvement of plant productivity: a machine learning perspective. GigaScience 8, giy153 (2018).
[60] Mahlein, A.-K. et al. Quantitative and qualitative phenotyping of disease resistance of crops by hyperspectral sensors: seamless interlocking of phytopathology, sensors, and machine learning is needed! Current opinion in Plant Biology 50, 156–162 (2019).
[61] Meier, U. et al. Phenological growth stages of sugar beet (Beta vulgaris L. ssp.) codification and description according to the general BBCH scale (with figures). Nachrichtenblatt des Deutschen Pflanzenschutzdienstes 45, 37–41 (1993).
[62] Hooker, S., Erhan, D., Kindermans, P. & Kim, B. A benchmark for interpretability methods in deep neural networks. In Proceedings of Advances in Neural Information Processing Systems, 9734–9745 (2019). URL http://papers.nips.cc/paper/9167-a-benchmark-for-interpretability-methods-in-deep-neural-networks.
[63] Deng, J. et al. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2009).
[64] Von Luxburg, U. A tutorial on spectral clustering. Statistics and computing 17, 395–416 (2007).
[65] Abdel-Karim, B. M., Pfeuffer, N., Rohde, G. & Hinz, O. How and what can humans learn from being in the loop? KI-Künstliche Intelligenz 1–9 (2020).
[66] Erion, G. G., Janizek, J. D., Sturmfels, P., Lundberg, S. & Lee, S. Learning explainable models using attribution priors. CoRR abs/1906.10670 (2019). URL http://arxiv.org/abs/1906.10670.