
Post-Hoc Explanations Fail to Achieve their Purpose in Adversarial Contexts

Sebastian Bordt, Department of Computer Science, University of Tübingen, Germany, sebastian.bordt@uni-tuebingen.de
Michèle Finck, Law Faculty, University of Tübingen, Germany, michele.finck@uni-tuebingen.de
Eric Raidl, Ethics and Philosophy Lab, University of Tübingen, Germany, eric.raidl@uni-tuebingen.de
Ulrike von Luxburg, Department of Computer Science, University of Tübingen, Germany, ulrike.luxburg@uni-tuebingen.de

Existing and planned legislation stipulates various obligations to provide information about machine learning algorithms and their functioning, often interpreted as obligations to “explain”. Many researchers suggest using post-hoc explanation algorithms for this purpose. In this paper, we combine legal, philosophical and technical arguments to show that post-hoc explanation algorithms are unsuitable to achieve the law's objectives. Indeed, most situations where explanations are requested are adversarial, meaning that the explanation provider and receiver have opposing interests and incentives, so that the provider might manipulate the explanation for her own ends. We show that this fundamental conflict cannot be resolved because of the high degree of ambiguity of post-hoc explanations in realistic application scenarios. As a consequence, post-hoc explanation algorithms are unsuitable to achieve the transparency objectives inherent to the legal norms. Instead, there is a need to more explicitly discuss the objectives underlying “explainability” obligations as these can often be better achieved through other mechanisms. There is an urgent need for a more open and honest discussion regarding the potential and limitations of post-hoc explanations in adversarial contexts, in particular in light of the current negotiations of the European Union's draft Artificial Intelligence Act.

Keywords: Explainability, Transparency, Regulation, Artificial Intelligence Act, GDPR, Counterfactual Explanations, SHAP, LIME

ACM Reference Format:
Sebastian Bordt, Michèle Finck, Eric Raidl, and Ulrike von Luxburg. 2022. Post-Hoc Explanations Fail to Achieve their Purpose in Adversarial Contexts. In 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22), June 21–24, 2022, Seoul, Republic of Korea. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3531146.3533153

1 INTRODUCTION

Explainability is one of the concepts dominating debates about the ethics and regulation of machine learning algorithms. Intuitively, requests for explainability are reactions to the prevalent unease about machine learning algorithms, including concerns regarding discrimination, biases, manipulation, and data protection. The fact that machine learning systems are often “black boxes” is considered a major hurdle towards their implementation, supervision and control, and explainability is often praised as a remedy against such risks. Existing legislation such as the EU General Data Protection Regulation (‘GDPR’) has sometimes been interpreted as containing a “right to explanation”. The draft Artificial Intelligence Act, a piece of proposed EU legislation, alludes to explainability but does not, in its current form, make clear whether and when exactly explainability is legally required. On the technical side, explainability has evolved into its own field of research [33]. The current machine learning literature distinguishes two different approaches to explainability. One approach is to build machine learning models that are constrained to be “inherently interpretable” [42]. The other approach is to use any machine learning model, even a “black-box”, and then employ any of an increasing number of methods to “explain” the behavior of the black-box after the decision has been made (“post-hoc”). Because there exists no general way to summarize the entire behavior of a black-box model, these explanations are usually local, meaning that they only describe the behavior of the function for a single prediction or decision. The natural advantage of local post-hoc explanation methods, such as feature highlighting methods [30, 41] and counterfactual explanations [60], is that they place no constraints on model complexity and do not require model disclosure [7]. This has led a number of researchers to suggest that these methods might be able to comply with existing legal requirements [7, 60].

In this paper, we put forward an important distinction that has not yet been extensively discussed in the literature on explainable AI: whether the explanation's context is adversarial or cooperative. By “cooperative contexts” we broadly summarize situations where all involved parties have aligned interests. This includes model development and debugging, scientific discovery, and, to a degree, areas such as medical diagnosis. In a cooperative context, the explanation provider and the explanation receiver share the same interest: to identify the most suitable and insightful explanation algorithm for the given problem. In “adversarial contexts”, in contrast, parties have opposing interests. This is the case, for example, when a bank denies a customer a loan and the customer wants to contest the decision because she believes it was discriminatory. Since the explanation provider anticipates that the recipient might use the provided explanations to challenge the functioning of the system, the provider has no incentive to give “true” insights into the functioning of the system, but rather an incentive to render its internal functioning incontestable. Indeed, it has been pointed out repeatedly that post-hoc explanation algorithms can be manipulated or gamed [5, 47, 48]. Many machine learning papers on explanation algorithms implicitly consider cooperative contexts, where explanations are used to improve machine learning algorithms and can help developers understand the biases of complex systems, or where they are used in an explorative spirit towards new scientific discoveries [63]. In contrast, the legal discussion focuses predominantly on adversarial scenarios. Here, explainability is portrayed as a mechanism to add transparency, fairness and accountability to AI, and post-hoc explanations are often seen as a technical tool to achieve these goals.

Combining insights from computer science, philosophy and law, we offer a critical multidisciplinary perspective on the usage of post-hoc explanations to achieve transparency and accountability obligations in adversarial contexts. We highlight the blurry legal landscape around explainability as well as the philosophical and technical limitations of post-hoc explanations. In Section 2 we introduce different scenarios – cooperative and adversarial – under which an external examiner might audit a black-box and its generated explanations. We focus on adversarial scenarios – where the explanation provider has interests opposed to those of the explanation receiver – and on local post-hoc explanations – which explain a single decision for one particular person. In Section 3 we argue that existing and planned legislation, specifically the GDPR and the EU Artificial Intelligence Act, can either be read as portraying explainability as one possible mechanism to achieve more transparency or as presenting it as a free-standing objective. We also highlight the current lack of legal certainty as to how existing legal norms around explainability ought to be interpreted and implemented. These issues have been a source of confusion and uncertainty. This is why we propose to capture the role of explainability through a discussion of its motivations: Explanations are thought to build trust and to enable actions such as debugging, contesting, and recourse. In Section 4 we show from a philosophical and technical perspective that the goals associated with explainability are unlikely to be achieved by post-hoc explanations. The reason is that the truth assumptions under which explanations are expected to fulfill their legal goal are lacking in the adversarial context. On the contrary, due to the inherent geometric ambiguity of local post-hoc explanations, the explanation provider has a multitude of options to influence explanations in a subtle, undetectable way and to pick those that suit her goals. In Section 5 we show that testing explanations is also problematic. While at best we can test for internal consistency of the explanation with the decision, in more typical cases the explanations become redundant, and we would do better to test decisions and predictions directly. In Section 6 we conclude and argue that there needs to be a deeper and more honest debate about what the underlying objectives of explainability obligations are. We also argue that one needs to be honest about the fact that using a black-box entails considerable discretion: Neither post-hoc explanation methods nor regulation can completely compel the deployer of a black-box to align his interests with the public good. As such, if one is absolutely unwilling to award any discretion to the deployer of the black-box, the only solution is to forbid its deployment and favor inherently interpretable or otherwise constrained machine learning methods. The question of under which circumstances the deployment of a black-box might still be admissible depends on our ability to examine and audit the black-box. How exactly this might be done is still an area for future research. We hope that our paper contributes to an open discussion regarding the (lack of) potential of post-hoc explanations in the context of the ongoing negotiations of the Artificial Intelligence Act.

2 EXPLANATIONS IN COOPERATIVE AND ADVERSARIAL CONTEXTS

In this work we broadly distinguish between “cooperative” and “adversarial” explanation contexts. In a cooperative context, all parties involved in the process of building the system, providing explanations and using the system share the same goal: to create a system that is as good and supportive as possible. Prototypical examples are model debugging and scientific research. Likewise, a company building a medical decision support system, say for skin cancer detection, will closely collaborate with the doctors who use it [53]. The company's goal would be to provide explanations that are as helpful as possible. The situation is very different in adversarial contexts, where parties do not share the same goal, such as in the oft-repeated example of a denial of a loan application. Here, the applicant and the bank have opposing interests and incentives. Accordingly, should the bank be mandated to provide the applicant with an explanation, this explanation will be shaped by the bank's incentives and existing power asymmetries. For reasons that we outline below, the distinction between cooperative and adversarial contexts is crucial. In particular, we argue that local post-hoc explanations, which have a variety of use cases in the cooperative scenario, are pointless or even harmful in adversarial contexts.

2.1 Parties involved in the adversarial explanation process

We consider adversarial explanation contexts where an AI decision system is used to make decisions about individuals. Prominent examples are university admissions, job and loan applications, or bail and sentencing decisions. Under existing and planned legislation, such as the EU Artificial Intelligence Act, the creator of the system ought to provide information about how the system comes to its decisions (see Section 3 below for a detailed discussion of the legal background). The creator of the system is the entity that has built the machine learning system and uses it to support decision making.1 The creator could be a private company (such as a bank) or a public entity (such as a university). The decision subject is the person about whom the automated system makes a decision: the person who applies for a loan, or the person who applies for university admission. After the decision has been communicated, the explanation recipient asks for an explanation, which is communicated by the explanation provider. The explanation recipient could be the decision subject herself, or an external examiner who is supposed to investigate the decisions or explanations on behalf of the decision subject or to defend her interests. The explanation provider is typically the creator of the system.2

2.2 Machine learning problem: Supervised learning, tabular data, point-wise post-hoc explanations

In our technical discussion, we assume that the inputs $x \in \mathbb{R}^d$ of a decision algorithm are given in tabular form. Each dimension of the input encodes a different property of a person, for example age, income, etc. Typically, the number of dimensions d is large: persons are described by dozens or hundreds of features. A machine learning algorithm is used to learn a decision function $f: \mathbb{R}^d \rightarrow \mathbb{R}$. The resulting decision y = f(x) for input x could be a binary decision (“receives the loan” or “does not receive the loan”) or a numeric risk score on which such a decision is based, as in the often discussed COMPAS algorithm to predict recidivism risk. We focus on supervised machine learning, where f is learned based on training data consisting of pairs $(x_1, y_1), \ldots, (x_n, y_n)$, with $x_i$ the training points and $y_i$ the training labels. An explanation algorithm E is an algorithm operating on a decision function with the purpose of explaining it. We focus on local post-hoc explanation algorithms: The explanation algorithm E gets queried with a data point x and the corresponding decision y, and produces an explanation E(x, y). Internally, the algorithm has access to the decision function f, and in some cases also to the training data. The explanation E(x, y) is supposed to explain why the decision function f came to decide y for x. The explanation can be in linguistic form. For example, “The low income of Mr. Smith was relevant for the refusal of the loan” or “Mr. Smith would have received the loan had his income been 10.000 Euros higher”.
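To make this setup concrete, the following minimal Python sketch trains a decision function on synthetic tabular data and queries it for a single decision; the dataset, the feature semantics and the choice of a gradient boosted tree are illustrative assumptions, not the setup used in our experiments.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic tabular data: each row describes one person by d = 12 features
# (think age, income, ...); y holds the binary training labels.
X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

# The learned "black-box" decision function f.
f = GradientBoostingClassifier().fit(X, y)

x = X[0]                                              # the decision subject's features
decision = f.predict(x.reshape(1, -1))[0]             # binary decision y = f(x)
risk_score = f.predict_proba(x.reshape(1, -1))[0, 1]  # or a numeric risk score
print(decision, round(risk_score, 3))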

2.3 Explanation algorithms that fall into this framework

In this paper we consider local post-hoc explanation algorithms such as LIME, SHAP, and DiCE [30, 34, 41]. The explanations generated by these algorithms do not provide a global or holistic view of the decision function f but merely try to explain individual decisions y = f(x). The often-cited advantage of these algorithms is that they work, at least in principle, for any decision function [7, 41]. Different algorithms take different approaches as to what constitutes an explanation: LIME and SHAP provide feature attributions that aim to quantify the influence of the different input features on the particular decision. Feature attributions correspond to the linguistic form “The low income of Mr. Smith was relevant for the refusal of the loan”. Another approach is to provide counterfactual explanations [60]. These explanations are based on searching for a sufficiently close or the closest alternative point x′ to the actual input point x that yields a decision y′ = f(x′) that differs from the original decision y = f(x). Comparing the two points, we can identify factors that were relevant to the decision [24]. The resulting counterfactual explanations have the linguistic form “Mr. Smith would have received the loan had his income been 10.000 Euros higher”.
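The following sketch illustrates the two linguistic forms in code, continuing the synthetic setup from Section 2.2. It assumes the shap package for the feature attribution; for the counterfactual, a naive brute-force search over single-feature changes stands in for a full counterfactual explainer such as DiCE.

import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)
f = GradientBoostingClassifier().fit(X, y)
x = X[0]

# Feature attribution ("the low income of Mr. Smith was relevant ..."):
phi = shap.TreeExplainer(f).shap_values(x.reshape(1, -1))
print("SHAP attributions:", np.round(np.asarray(phi).reshape(-1), 3))

# Counterfactual-style explanation ("... had his income been 10.000 Euros higher"):
# brute-force search over changes of one feature at a time.
y0 = f.predict(x.reshape(1, -1))[0]
counterfactuals = []
for j in range(x.shape[0]):
    for v in np.linspace(-3, 3, 61):
        x_cf = x.copy()
        x_cf[j] = v
        if f.predict(x_cf.reshape(1, -1))[0] != y0:
            counterfactuals.append((j, round(float(v), 2)))
            break
print("single-feature counterfactuals (feature index, new value):", counterfactuals)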

3 LEGAL FRAMEWORK: EXPLAINABILITY IN EU LAW

This paper argues that post-hoc explanation algorithms are unsuitable in adversarial contexts. Before we elaborate on this from a philosophical and technical perspective (Section 4), it is important to understand the related legal framework. We focus on European Union law, as the EU has often been a first mover in regulating data and its analysis, and its legislation will likely inspire other jurisdictions over time (for a broader view, see [21]). Our analysis focuses on the draft Artificial Intelligence Act (AIA), a piece of proposed legislation that would be the first to specifically target AI and that could serve as a global blueprint for its regulation. In its current form, the AIA creates different legal obligations for different AI applications on the basis of their perceived risks, and it would apply to AI systems in general (Section 3.1). We also consider the General Data Protection Regulation (GDPR), which applies to the processing of personal data (Section 3.2). It will be seen that whereas EU law contains various obligations to provide information about a machine learning algorithm and its functioning, it remains unclear how these legal norms should be implemented from a technical perspective, and whether explainability should be understood as a free-standing legal obligation or rather as one of various mechanisms to achieve algorithmic transparency (Section 3.3). To better understand these norms, we also review their underlying rationales and objectives from a philosophical and legal perspective (Section 3.4).

3.1 The draft Artificial Intelligence Act (AIA)

The current draft of the AIA defines AI systems as “software (...) that can, for a given set of human-defined objectives, generate outputs such as content, predictions, recommendations, or decisions influencing the environments they interact with”. Generally, the AIA regulates AI on the basis of its perceived risk by introducing four different categories of AI. Most relevant to our discussion are the categories of systems that are high-risk and of systems that are not high-risk (the remaining two categories are practices that are subject to qualified prohibitions, and a residual category of AI systems that includes law enforcement software, emotion recognition systems, biometric categorisation systems and deep fakes) [54]. The higher the risk, the heavier the regulatory obligations that apply, including with regard to transparency and interpretability.

There are two categories of high-risk AI systems. First, AI systems that relate to products already subject to supranational harmonisation, namely AI systems intended to be used as a safety component of a product, AI systems which are themselves products covered by Union harmonising legislation, or AI systems which are required to undergo third-party conformity assessments. Second, a list of systems that are currently considered to carry a high risk, such as biometric identification systems, systems for the management and operation of critical infrastructure, systems used in education and employment, and some law enforcement systems, among others (see further Art 3(1) of the draft AIA). Article 13 governs explainability for high-risk AI systems, which have to be “designed and developed in such a way to ensure that their operation is sufficiently transparent to enable users to interpret the system's output and use it appropriately”. Furthermore, users (the entity deploying the AI) need to have access to instructions for use in an appropriate digital format that contain information about the characteristics, capabilities and limitations of performance, including information about the level of accuracy, robustness and cybersecurity, risks to health, safety or fundamental rights, specifications for the input data, the expected lifetime of the AI system and necessary maintenance measures. Finally, human oversight must be ensured. These measures are designed to minimize risks to health, safety or fundamental rights. Human oversight shall either be (i) identified and built into the system by the provider before it is placed on the market or put into service, or (ii) identified by the provider before the system is placed on the market or put into service but only implemented by the user.

In its current version, the AIA would thus require that high-risk AI systems are sufficiently transparent to enable the interpretation of the system's output. Is this an explainability obligation? Recital 47 sheds some light on how to interpret these notions. It specifies that high-risk AI systems should be transparent to a “certain degree” to “address the opacity that may make certain AI systems incomprehensible to or too complex for natural persons”. To this end, users “should be able to interpret the system output and use it appropriately” through the provision of “relevant documentation and instructions of use”. This does not read like an obligation to make systems explainable in the sense that the way in which data has been processed must be entirely traceable. Rather, the AIA would require that an “interpretation” of the output be facilitated through sufficient transparency. Importantly, this does not necessarily seem to imply that an absolute truth must be identified post-hoc (see Sections 4.1 and 4.2 below), but rather that the overall functioning of the system, and how it comes to an output, must be made understandable. The draft AIA leaves open what transparency and interpretability imply from a technical perspective. They certainly include the elements listed in Article 16, such as technical documentation, keeping logs and quality management systems. Article 13, however, leaves open whether there are additional requirements and what, exactly, interpretability demands in technical terms. If input data ought to be entirely traceable, “black-box” systems cannot be used in high-risk applications. This highlights that it is important to think about the objectives of transparency and explainability. If these can be achieved through alternative means, excluding black-box systems such as deep neural networks from high-risk scenarios (such as healthcare, since devices falling under the Medical Devices Regulation qualify as high-risk) might unduly hinder innovation in important domains.

Article 52 AIA creates some general transparency obligations for AI systems that are not high-risk. These are general disclosure obligations: (i) users must be informed that they are interacting with an AI system unless this is obvious from the context; (ii) users of an emotion recognition system or biometric categorisation system must inform the natural persons exposed to it; and (iii) deep fakes must be disclosed as such. Some exceptions apply where the AI is used in the context of law enforcement. These are thus obligations of transparency that require disclosure that AI is used, as opposed to how it is used.

To summarize, the draft AIA would thus not, in its current form, create a general explainability obligation for machine learning systems. Such an obligation clearly is not foreseen in relation to AI systems that are not qualified as high-risk. Arguably, there is also no explainability obligation in relation to high-risk AI systems. Rather, what is required is transparency of the system's functioning and output generation. This transparency must make these elements interpretable but not necessarily amount to the provision of an explanation as it is commonly understood in computer science.

3.2 The General Data Protection Regulation (GDPR)

The GDPR creates some general transparency requirements that form part of the data controller's (the entity that determines the purposes and means of processing) general informational obligations vis-à-vis the data subject (the natural person that personal data relates to). In addition, it also contains a specific regime for “solely automated data processing”. In contrast to the draft AIA, which creates vague obligations resting on the user, the GDPR creates specific rights for the individual subjected to such decisions.

Article 13 requires that data controllers provide specific information to data subjects at the time of collection where personal data is collected from them, such as whether “automated decision-making” is used, and, if so, provide “meaningful information about the logic involved,3 as well as the significance and the envisaged consequences of such processing”. Article 14(1)(h) creates the same obligation in cases where data is not directly collected from the data subject. Pursuant to Recital 62, this information does not have to be provided where it is redundant, or where compliance proves impossible or involves a disproportionate effort. The same wording can also be found in Article 15, which deals with the data subject's right to access data. Whereas Articles 13 and 14 relate to the pre-processing stage, data subjects can exercise their rights under Article 15 at any time, including after processing has taken place. This raises the question of whether – despite the identical wording of these provisions – Article 15 may substantively require something different when referring to the “logic” of the automated decision-making process.

There is no general right to an explanation under the GDPR. Some explainability requirements may, however, arise in respect of machine learning algorithms that produce legal effects or similarly significantly affect a data subject. Article 22 creates a qualified prohibition of “solely automated data processing”, including profiling. This implies that such techniques can only be used in some circumstances, namely (i) where necessary to enter into or perform a contract between the data subject and controller, (ii) where authorized by law, or (iii) where the data subject has provided explicit consent. In these circumstances automated processing can take place, but the data subject has the right to human intervention, to express her point of view, and to contest the decision. Recital 71 mentions an additional element, namely that the data subject has the right “to obtain an explanation” after human review of the decision “and to challenge this decision”.4 Recitals, however, do not have the same legally binding force as the text of the GDPR itself.

Over the past years there has been a vivid academic debate around whether the reference to “an explanation” in Recital 71 amounts to a “right to an explanation” that data subjects can exercise vis-à-vis controllers [14, 32, 45, 59]. The Article 29 Working Party's guidance suggests that Article 22, read in conjunction with Recital 71, should be understood to require that controllers (i) tell data subjects that they are engaging in automated decision making, (ii) deliver meaningful information about the logic involved, and (iii) explain the processing's significance and envisaged consequences. The information provided should include details about the categories of data; why data is seen as pertinent; how profiles are built; why the profile is relevant for the decision-making process; and how it is used to reach a decision about the data subject. The last three criteria appear to apply to profiling only [36]. Information with respect to the “logic” means “simple ways to tell the data subject about the rationale behind, or the criteria relied on in reaching the decision”. What is required is “not necessarily a complex explanation of the algorithms used or disclosure of the full algorithm”. Nonetheless, the information transmitted to the data subject should be sufficiently comprehensive to “understand the reasons for the decision”. In short, a complex explanation of the algorithms used or disclosure of the full algorithm is not necessarily required; rather, the controller ought to find simple ways to convey the rationale behind, or the criteria relied on in reaching, the decision. Unfortunately, this guidance leaves a lot of room for doubt regarding what exactly is required of controllers. In any event, the GDPR does not create a general right to an explanation but applies only to automated decision-making that legally affects the data subject or has similarly significant effects on them.

3.3 Explainability as a sub-component of transparency

While there is a persistent myth that EU law requires that all decisions based on AI be “explainable”, our analysis has painted a more nuanced picture. First, there is no overarching explainability norm that would apply to any usage of AI. To what degree secondary law requires explanations has not been authoritatively settled. Ultimately, the Court of Justice of the European Union will need to settle this question in respect of the GDPR. Concerning the draft AIA, however, legislators should clarify in the final text whether explainability is a free-standing legal obligation in respect of high-risk AI systems or whether it should rather be understood as a sub-component of transparency. As shown above, it is indeed possible to read references to explainability as elements of the broader transparency obligation. Article 13 AIA is explicitly about transparency, but the requirement that this transparency must allow users to “interpret the system's output” has been understood as an explainability obligation by some. Further iterations should clarify the link between transparency and explainability to enhance legal certainty. An analysis of the history behind the AIA confirms the lack of precision of the AIA itself. The EU High Level Expert Group on AI's report on the one hand portrayed explainability as a component of transparency. On the other hand, it repeatedly referred to another concept, “explicability”, which was introduced as an ethical principle and as the “procedural dimension” of fairness. In contrast, the AIA White Paper made no reference to explainability other than to mention that symbolic reasoning could help make deep neural networks more explainable. This part of the AIA's legislative history underlines the lack of consensus about what exactly explainability is. Similarly, the GDPR can also be read as referring to explainability as a sub-component of transparency. Articles 12-15 derive from the core data protection principle of transparency in Article 5(1)(a); likewise, Article 22, read in conjunction with the relevant recitals, can be understood as imposing a more general transparency obligation rather than an explainability obligation.

This, of course, raises the question of what transparency means and what it should enable. There is broad consensus that the GDPR requires that decisions reached through automated decision making be justifiable. Indeed, Hildebrandt has highlighted that data protection requires “the justification of such decision-making rather than an explanation in the sense of its heuristics” (p. 113 in [18]). Kaminski and Urban deem that justification should enable “understanding, revealing and making challengeable the normative grounds of a decision” (p. 1980 in [21]). Wachter, Mittelstadt and Russell have argued that explainability is ultimately designed to help the data subject understand, contest and alter decisions, and that this could also be achieved by counterfactual explanations [60]. If explainability is merely one means of achieving transparency, there needs to be a more thorough discussion of what alternative means of achieving transparency exist, particularly in situations where explainability stricto sensu proves impossible. Considering the lack of consensus as to how the legislative texts of the AIA and the GDPR ought to be interpreted and applied in practice, it is helpful to consider their underlying objectives.

3.4 Rationale and objectives of explainability norms in an adversarial setting

The vague formulation of explainability rights, coupled with uncertainty regarding their function, makes it legitimate to ask whether explanations serve any meaningful purpose. Indeed, as Edwards and Veale [14] have argued, “the search for a legally enforceable right to an explanation may be at best distracting and at worst nurture a new kind of transparency fallacy”. This is essentially a warning that if explainability obligations become a mere box-ticking exercise, they might give a misleading appearance of compliance rather than being of any real value to the decision subject. In addition, explainability rights in the GDPR inevitably suffer from the general problem of weak GDPR enforcement.

In order to better understand the above-examined norms, we propose to consider their underlying objectives. Before discussing legislative history, let us recapitulate what philosophers have identified as the main objectives of algorithmic explanations.5 One major motivation for the explainability of AI systems is the hope that it may foster trust in these systems [10, 26, 35, 57]. This has been called the “Explainability-Trust” hypothesis [22].6 The hypothesis is controversial, and it is not exactly clear how explanations would induce trust. The underlying rationale seems to rest on an analogy with human interactions. Consider decisions made by human experts. When a decision does not satisfy us, we are drawn to ask for an additional explanation. Given such an explanation, we may check whether it conforms to our expectations about good decision making. If so, this may be a ground for further trusting the decision maker. This is not a one-shot process, but an evolving, long-term interaction. We tend to trust a person who has repeatedly proved able to predict correctly, make good decisions, or provide well-informed explanations. The trust-raising potential of an explanation, however, requires that we can submit explanations and decisions to tests, possibly by delegating this task to other experts. The trust-raising potential of a single explanation thus presupposes that the explanation provider stays in the information exchange in the long run: only then does she have an incentive to provide a correct explanation, since an incorrect one would lead to a loss of trust in the long run but not in a one-shot exchange. If an algorithm rather than a human expert makes a decision, we might have similar expectations. We would like to engage in a similar information exchange with an algorithm as we engage in with humans. The demand for an explanation is then a demand for a piece of communicative interaction. The hope that this builds trust stems from the intuition that the interaction with the algorithm is similar to the interaction among humans, as depicted above. This assumption may however fail, either because the algorithmic explanations cannot be submitted to sensible tests or because the exchange is one-shot rather than long-run. In the first case, explanations lose their trust-raising potential. In the second, the explanation provider may lack the incentive to tell the truth.

A second, implicit motivation for explainability stems from the idea that information provided by explanations can be used to perform actions, and may in fact be needed for such actions. In the adversarial setting, a data subject might want to use an explanation to contest a decision [7, 60], by claiming or arguing that the decision is not right, not good, or not fair. The data subject might also want to use the explanation for recourse, in order to do better next time [4, 55, 60] (see also [2, 29]).7 But such explanations are only of value when true or correct. A false explanation will not help in doing better next time, and may even be devised so as to render a decision incontestable.

The two motivations from philosophy — building trust and enabling recipients to act — can also be found as objectives in the legal texts. The EU High Level Expert Group on AI described explainability as one tool to achieve trust in AI systems [35].8 The AIA provides that explainability norms are designed to allow users to fully understand the capacities and limitations of high-risk systems, leading again to trust. Partly related to trust, one can understand explainability as a tool for risk management, in line with the AIA's overall risk-based approach. Indeed, for high-risk AI systems, transparency must be ensured by monitoring the system's operation and detecting signs of anomalies, dysfunctions and unexpected performance, in order to counteract automation bias or to potentially intervene in the system (the idea of a “stop button”). The European Commission White Paper also emphasized the risk-based approach and stressed the potential scale of AI systems [11]: a hidden bias or an incorrect assumption in an AI system that decides, say, on tens of thousands of university admissions will have a large systemic effect. This differentiates large-scale AI systems from human decision-making. In philosophy, explanations are considered a tool for future actions. Similarly, the legal discussion portrays explainability as an enabling right. The High-Level Expert Group on AI has drawn attention to the fact that for decisions to be contestable, they must be traceable. Also outside the AIA and the GDPR, explainability serves a related purpose. In consumer protection law, explainability is linked to the unequal power dynamics between the business and the consumer. In public administration, it has been argued that being subjected to an intransparent black-box decision would undermine human dignity and is to be avoided because, unlike in the private sector, individuals cannot vote with their feet and go elsewhere.

Overall, the motivations for explanations seem to presuppose that such explanations are true or correct. Only then does a single explanation raise trust, and only then can an explanation be used to perform the intended actions, such as contesting or recourse. We will, however, see in the next section that this truth-presupposition for explanations fails in adversarial scenarios of algorithmic post-hoc explanations.

4 THE PROBLEMS WITH POST-HOC EXPLANATIONS IN ADVERSARIAL CONTEXTS

Figure 1: Different explanation algorithms lead to different explanations. Depicted are the feature attribution explanations of four different explanation algorithms: Exact SHAP for trees [31], LIME [41], DiCE [24], and Interventional SHAP [20]. All four explanation algorithms attempt to explain the prediction for the same individual with the same decision function (a gradient boosted tree) on the same dataset (Adult-Income). The idea of feature attribution explanations is to determine how much each dimension of the input contributed towards the decision. The figures depict these attributions by drawing a bar for each of the 12 input dimensions. The larger the bar, the higher the influence of the corresponding feature. Some methods distinguish between positive and negative attributions. In the depicted example, the first bar in Panel (a) is relatively large, which indicates that the SHAP algorithm determined that the value of the first feature contributed strongly to the prediction. The DiCE algorithm in Panel (c), in contrast, determined that the value of feature 9 contributed most strongly to the prediction. More figures showing results for other data points can be found in the supplement.

We now discuss the problems with post-hoc explanations in adversarial scenarios. What can we expect from an algorithmic explanation in these contexts? We roughly know what to expect from human explanations. For example, witnesses giving evidence in court are expected to tell the truth. Can we expect something similar of an algorithmic explanation? If the algorithm decided, for example, to reject a loan application, can we expect to discover the true reason why it decided to do so? The answer is that we cannot, for two reasons. First, the algorithm's view of the world is coarse-grained and incomplete, and this significantly restricts the vocabulary available for potential explanations (Section 4.1). Second, even within the limited picture of the world that the algorithm has access to (the “algorithm's own world”), uniquely preferred or “ground truth” explanations do not exist (Section 4.2). This ties in directly with the computer science perspective on why post-hoc explanations should not be used in adversarial contexts: the task of providing post-hoc explanations is underdetermined. The objective of the adversarial explanation provider is to deploy a classifier that has high accuracy and to generate post-hoc explanations that cannot be contested by the data subject or an examiner. We argue that due to the high degree of ambiguity inherent to algorithmic explanations, the adversary has sufficient degrees of freedom to devise incontestable explanations – even without explicitly optimizing against a particular explanation method [46, 47]. We identify four key quantities that allow the adversary to influence the resulting explanations: the choice of the explanation algorithm (Section 4.3) and its particular parameters (Section 4.4); the exact shape of the high-dimensional decision boundary (Section 4.5); and, when applicable, the choice of the reference dataset (Section 4.6). This section contains a number of figures and simulation results. Additional figures can be found in the supplement. The code to replicate the results in this paper is available at https://github.com/tml-tuebingen/facct-post-hoc.

4.1 The algorithm's view of the world is coarse-grained and incomplete - this limits potential explanations

Learning and explanation algorithms only have access to a coarse-grained description of the real world. Their vocabulary is restricted to certain features and possible relations between them. The “experience” of such algorithms, given by the finite training data, is formulated in this restricted vocabulary and provides only a small window onto the world. Overall, the algorithm's representation of the real world is coarse-grained and incomplete.9 The learning algorithm just sees features and training labels. The explanation algorithm additionally sees the learning algorithm's association between input and output. This is what we call “the world of the explanation algorithm”, and this is all it can exploit. As a consequence, all the explanation algorithm can talk about are geometric properties in the world of the algorithm: distances of points to the decision surface, proximity between points, their true or predicted labels, the gradient of the decision function at a point, the necessary change of a feature to change the decision, etc. Although a true explanation for a decision might exist in the real world, it might not be represented in the data or other aspects of the algorithm's world, which could thus not provide any such explanation. This is the case even in a cooperative setting. Consider the example of a medical diagnosis of a disease for which a true (say, causal) explanation exists in the real world. If the learning algorithm was trained on feature-based data such as age, blood pressure, etc., the explanation algorithm could suggest that age was the cause. However, in reality the cause for the disease may not be age, but rather a smoking habit that was not represented in the data. So even if a true explanation exists (say, a cause), it may be neither identifiable nor expressible by the explanation algorithm.

4.2 Even within the algorithm's own world, a unique preferred reason does not exist

Even within the limited world that the explanation algorithm has access to, a “true internal reason” why the learned decision function comes to a certain decision does generally not exist. This is particularly the case for complicated black-box functions. Even machine learning experts digging into the learning algorithm or the properties of the function could not reveal a unique true reason. All we can do is provide vague approximations of how the algorithm arrives at its decision, by summarizing which features contributed how much to the decision (the approach of LIME and SHAP), or whether a change in some features would alter the decision (the approach of counterfactual explanations). For example, in the case of a loan rejection, we might want to know whether it was our low income or rather our postal code that determined the decision, and whether we could change something about the decision if in the future we had a higher income or moved to another area. However, these explanation attempts are all subject to choices. A mathematically unique way to determine how much each feature of a complicated black-box function contributed to the decision does not exist. Consequently, all feature attribution methods rely on particular assumptions and mechanisms in order to construct explanations: LIME, for example, looks at the gradient of the decision function at the point to be explained [15, 41]. SHAP compares the point with other datapoints from a reference population [16, 30]. Yet another approach would be to re-train the classifier on subsets of features, or to use counterfactual feature importance, where one looks at the distance to the decision surface in various directions. All these mechanisms and choices seem plausible but, as we will see in the next section, they all deliver different explanations.

4.3 Different explanation algorithms lead to different explanations

Different explanation algorithms lead to different explanations [25]. This is true even if the algorithms have access to exactly the same information (the geometry of the data, the learned decision function, etc.). In an adversarial context, this is problematic because it means that the creator of the system can modify the explanations by choosing a particular explanation algorithm. In practice, different explanation algorithms lead to different explanations even on the simplest machine learning problems. In high dimensions, that is, in real-world problems, two different explanation algorithms can produce entirely different explanations. This is illustrated in Figure 1. The figure depicts the feature attribution explanations that four different explanation algorithms determined for the same individual. From the difference between the four panels in Figure 1 it is quite clear that different explanation algorithms can lead to markedly different explanations, even if they all attempt to explain the same decision for the same individual.10 Details on the machine learning problem, dataset and explanation algorithms can be found in the supplement.
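The following hedged sketch reproduces the flavor of Figure 1 on synthetic data rather than on Adult-Income: it computes SHAP and LIME attributions for the same point and the same model and compares the top-ranked features (the shap and lime packages and their documented tabular interfaces are assumed).

import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)
f = GradientBoostingClassifier().fit(X, y)
x = X[0]

# SHAP attributions for the single point x.
shap_phi = np.asarray(shap.TreeExplainer(f).shap_values(x.reshape(1, -1))).reshape(-1)[:12]

# LIME attributions for the same point (local linear surrogate around x).
lime_exp = LimeTabularExplainer(X, mode="classification", random_state=0)
lime_map = lime_exp.explain_instance(x, f.predict_proba, num_features=12).as_map()[1]
lime_phi = np.zeros(12)
for feature_index, weight in lime_map:
    lime_phi[feature_index] = weight

# The two attribution vectors, and even the rankings of the most influential
# features, typically disagree.
print("SHAP top features:", np.argsort(-np.abs(shap_phi))[:3])
print("LIME top features:", np.argsort(-np.abs(lime_phi))[:3])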

Figure 2: For any given datapoint, different explanation algorithms might lead to very similar or completely different explanations. In many cases, however, there are both similarities and dissimilarities. The figure depicts the SHAP and LIME feature attributions for a datapoint in the folktables ACSIncome prediction task [13]: are these attributions similar or different? More figures showing results for other data points can be found in the supplement.

That different explanation algorithms lead to different explanations is also true for counterfactual explanation methods [34, 60]. Indeed, there is a variety of ways in which the optimization problem can be set up, which in turn leads to different explanations. Moreover, even a single counterfactual explanation method can produce a large number of counterfactual explanations. In a cooperative context, being able to generate many different counterfactual explanations for the same individual can be beneficial [34]. In an adversarial context this is problematic because there is no principled way to choose among different counterfactual explanations, and the adversary is again awarded considerable discretion to determine explanations. In realistic, high-dimensional applications, the number of potential counterfactual explanations can quickly become very large. Let us illustrate this point on the German Credit Dataset, a 20-dimensional dataset with features on credit history and personal characteristics, where the task is to predict credit risk in binary form. How many different counterfactual explanations exist for a single individual? With a common black-box decision function, more than 100 different counterfactual explanations exist for each individual.
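A hedged sketch of generating many diverse counterfactuals with the dice_ml package (its documented sklearn backend is assumed); the pandas DataFrame with made-up credit-style columns is a synthetic stand-in for the German Credit Dataset, not our actual experiment.

import dice_ml
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for a credit dataset with a binary risk label.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)),
                  columns=["duration", "amount", "age", "income", "rate"])
df["risk"] = (df["amount"] - 0.5 * df["income"] + rng.normal(size=1000) > 0).astype(int)

model = GradientBoostingClassifier().fit(df.drop(columns="risk"), df["risk"])

data = dice_ml.Data(dataframe=df, continuous_features=list(df.columns[:-1]),
                    outcome_name="risk")
wrapped = dice_ml.Model(model=model, backend="sklearn")
explainer = dice_ml.Dice(data, wrapped, method="random")

# Ask for ten diverse counterfactuals for one individual; in realistic,
# high-dimensional problems, many more exist.
query = df.drop(columns="risk").iloc[[0]]
cfs = explainer.generate_counterfactuals(query, total_CFs=10, desired_class="opposite")
cfs.visualize_as_dataframe(show_only_changes=True)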

At its core, the fundamental difficulty of explainable machine learning is thus the same as in other fields of unsupervised learning: the lack of a ground-truth explanation impedes the development of an algorithmic framework to automatically evaluate explanations. Every explanation algorithm needs to make assumptions about which properties of the decision function it seeks to highlight. As a result, it is possible to develop sanity checks for explanation algorithms and exclude unreasonable approaches [3, 9], but not to discern whether one of two post-hoc explanations is “more correct”, which would be equivalent to discussing whether one of two different clusterings is “more correct” [58].

4.4 The explanation provider can choose between a large number of possible explanation algorithms and parametrizations

Figure 3: Explanations depend on the exact shape of the classifier's high-dimensional decision boundary. Panels (a) and (b): On the diabetes dataset, linear regression and a random forest agree on 94% of their predictions. Shown are the SHAP explanations for a data point where the predictions of both methods agree. As we can see, the explanations differ. Panels (c) and (d): the dependence on the decision boundary is subtle. It can even be hard to tell from the explanations whether the classifier has been trained at all. On the Wisconsin Breast Cancer dataset, the SHAP explanations of a classifier trained to achieve an accuracy of 96% are hard to distinguish from those of the same classifier trained on random labels. More figures showing results for other data points can be found in the supplement.

Even for a single explanation algorithm, there can be many different parameter choices that all lead to different explanations. LIME explanations, for example, depend on the bandwidth and the number of perturbations [15, 27, 46]. The uniqueness properties of Shapley values notwithstanding, there is a multiplicity of ways in which Shapley values can be operationalized to generate explanations [51]. Counterfactual explanation algorithms depend on the underlying metric chosen to represent closeness (e.g. Euclidean distance vs. L1-norm)11 as well as additional hyperparameters that trade off closeness against prediction, and, at least in principle, any number of additional penalty terms [34]. In certain cases, it might be possible to come up with good default parameter choices. For example, recent work has demonstrated how to choose the bandwidth parameter of LIME in a principled way or how to quantify uncertainty in the resulting explanations [27, 46, 64]. It is also possible to exclude explanation algorithms and parametrizations that are completely unreasonable, for example because they are not sensitive to the decision function [3, 9]. This nevertheless leaves an ever-increasing number of plausible explanation algorithms and corresponding parametrizations. Quite generally, different explanation algorithms vary along many different dimensions, and there is an ever-increasing number of suggestions as to how black-box functions might be explained. This can be seen, for example, in the recent work of Covert et al. [12], who summarize 25 existing methods in a unified framework. As already discussed above, there are no fundamental reasons that would impede us from using any particular method.12
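As a small illustration of this parameter dependence, the sketch below (synthetic data; the lime package's kernel_width argument is assumed) computes LIME attributions for the same point and the same model with a narrow and a wide kernel and compares the top-ranked features.

import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)
f = GradientBoostingClassifier().fit(X, y)
x = X[0]

def lime_attributions(kernel_width):
    # LIME attributions for x under a given locality bandwidth.
    explainer = LimeTabularExplainer(X, mode="classification",
                                     kernel_width=kernel_width, random_state=0)
    exp = explainer.explain_instance(x, f.predict_proba, num_features=12)
    phi = np.zeros(12)
    for feature_index, weight in exp.as_map()[1]:
        phi[feature_index] = weight
    return phi

narrow, wide = lime_attributions(0.5), lime_attributions(5.0)
print("top features (narrow kernel):", np.argsort(-np.abs(narrow))[:3])
print("top features (wide kernel):  ", np.argsort(-np.abs(wide))[:3])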

4.5 Explanations depend on the exact shape of the high-dimensional decision boundary

Figure 4: A simple toy example of how the choice of the explanation's reference dataset can influence the resulting explanations. The dataset in Panel (a) consists of two different population groups. The blue and orange colors depict the binary label that the classifier is supposed to predict at each data point (to get an intuition, you might think of the groups as “male” and “female”, and the label as “is awarded the credit” or “is not awarded the credit”). Panels (b) and (c) depict the interventional SHAP feature attributions [20] for the same data point in Group 1. In Panel (b), the explanation's reference dataset consists of the observations of Group 1 only. In Panel (c), the reference dataset is the entire dataset. The example shows that changing the reference dataset can shift the feature attribution almost entirely from one feature to another.

Even if we fix a particular explanation method and its parameters, the generated explanations still depend on the exact shape of the learned decision boundary. In high dimensions, there are often many different black-box functions that solve a particular classification problem to a desired accuracy, that is, they represent the data sufficiently well. However, these functions often lead to different explanations. To a certain extent, we may say that the exact shape of the learned decision boundary is arbitrary, but since the explanations depend on it, the explanations turn out to be arbitrary as well. One of the reasons for the sensitivity of the explanation to the function's shape is that many explanation methods evaluate the function f at datapoints that are outside the data distribution or at points that are unlike most points from the data distribution. In the adversarial scenario, this is problematic because the adversary can freely modify the values of the function f outside the data distribution without changing the classification behavior. Recent work has demonstrated that this property can be used to explicitly manipulate and attack explanation methods [47, 48]. But even without explicit attacks, there are many different choices, in particular hyperparameter and architecture choices, that influence the shape of the decision boundary and thus the resulting explanations. For an external examiner, this presents a challenging problem: while certain explicit attacks on explanation methods could in principle be detected through code review (see also Section 5.2), it is far less clear how one would argue about choosing one classifier over another, or about any particular choice of hyperparameters. This problem is illustrated in Figure 3. Here, we solved the same machine learning problem both with linear regression and with a random forest. The two methods have comparable performance on the test set, where 94% of their predictions agree. Nevertheless, the explanations obtained for the two decision functions can be quite different – even for points that receive the same prediction.
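The following sketch mimics the experiment behind Figure 3 in simplified form (synthetic data instead of the diabetes dataset, logistic regression instead of linear regression, and shap's model-agnostic KernelExplainer): two classifiers that agree on a point can nevertheless receive quite different explanations.

import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)
linear = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(random_state=0).fit(X, y)

# Pick a point where both classifiers agree on the prediction.
agree = linear.predict(X) == forest.predict(X)
x = X[np.where(agree)[0][0]]

background = X[:100]  # background sample for the model-agnostic explainer
phi_linear = shap.KernelExplainer(lambda z: linear.predict_proba(z)[:, 1],
                                  background).shap_values(x)
phi_forest = shap.KernelExplainer(lambda z: forest.predict_proba(z)[:, 1],
                                  background).shap_values(x)

print("agreeing prediction:", linear.predict(x.reshape(1, -1))[0])
print("top features (logistic regression):", np.argsort(-np.abs(phi_linear))[:3])
print("top features (random forest):      ", np.argsort(-np.abs(phi_forest))[:3])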

Turning to counterfactual explanations, it is well known that these depend on the exact shape of the decision boundary. Let us give an example, again using the German Credit Dataset. Consider two different decision functions, a gradient boosted tree and logistic regression. If we generate a number of diverse counterfactual explanations [34] for a typical individual with respect to one decision function, are these also counterfactual explanations with respect to the other decision function (at least as long as both functions arrive at the same decision)? In this simple experiment, fewer than 50% of the counterfactual explanations that work for the gradient boosted tree also work for logistic regression. As discussed above, the fact that the explanations depend on the exact shape of the decision boundary is problematic because it allows the creator of the system to influence the resulting explanations. The particular choice of the decision function can even determine whether certain types of counterfactual explanations exist at all. Let us give an example on the Wisconsin Breast Cancer Dataset. To demonstrate the dependence on the decision boundary, we again consider two different decision functions, linear regression and a random forest. For linear regression, there exists a large number of counterfactual explanations that modify only a single variable. For the random forest, it is impossible to find any such counterfactual explanation. This is despite the fact that both classifiers exhibit similarly low test error.

4.6 It is unclear how to choose the reference dataset that many explanations depend on

In recent years, there has been an increased focus on the composition of datasets, for example on the representation of different sociodemographic groups in machine learning datasets [6, 37]. In many real-world problems such as credit lending, the criteria for choosing an appropriate dataset are not clear. In both cooperative and adversarial contexts, the creator of the system has to make numerous choices, many of which can have significant effects on both the shape of the learned decision boundary and the generated explanations. For example, Anders et al. [5] have shown that gradient-based explanations can be manipulated by adding additional variables to the dataset. In this section, we highlight the additional influence that the dataset can have on algorithmic explanations, even when keeping the learned decision boundary constant. Indeed, while some explanation algorithms such as LIME only rely on the learned decision boundary, other methods such as SHAP and some counterfactual explanation methods make additional use of the data in order to generate explanations. The relevant dataset could be the training data, but it could also be a different dataset. We refer to it as the reference dataset. While the usage of such a dataset to generate explanations can be seen as a remedy to the vagaries of high dimensions, or as a possibility to generate counterfactual explanations that look like they come from the data, this approach is problematic as long as the adversary determines the composition of the dataset. The reason is that whether certain datapoints are included in the dataset or not can determine whether an explanation algorithm provides one or another explanation. Figure 4 illustrates this with a simple example: By deciding between two different reference datasets, one can effectively decide whether one or another feature was relevant to the decision.
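The sketch below reconstructs the spirit of Figure 4 on a synthetic two-group toy dataset (not the paper's exact construction): the same model and the same data point receive markedly different interventional SHAP attributions depending on whether the reference dataset contains only Group 1 or the whole population (shap's TreeExplainer with feature_perturbation="interventional" is assumed).

import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Group 1: feature 0 varies, feature 1 is constant; Group 2: the reverse.
group1 = np.column_stack([rng.normal(size=500), np.zeros(500)])
group2 = np.column_stack([np.zeros(500), rng.normal(size=500)])
X = np.vstack([group1, group2])
y = (X[:, 0] + X[:, 1] > 0).astype(int)
f = GradientBoostingClassifier().fit(X, y)

x = group1[0]  # a data point from Group 1

def interventional_shap(reference):
    # Interventional SHAP attributions of x relative to the given reference data.
    explainer = shap.TreeExplainer(f, data=reference,
                                   feature_perturbation="interventional")
    return np.asarray(explainer.shap_values(x.reshape(1, -1))).reshape(-1)

print("reference = Group 1 only:", np.round(interventional_shap(group1[:100]), 3))
print("reference = full dataset:", np.round(interventional_shap(X[::10]), 3))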

4.7 Bottom line: Post-hoc explanations are highly problematic in an adversarial context

It is extremely important to understand that an explanation algorithm is based on many human choices that are shaped by human objectives and preferences. While many choices are plausible, there is no objective reason to prefer one algorithm over another, or one explanation over another. Apart from the explanation algorithm and its particular parameters, explanations are influenced by human choices such as the selection of the classifier and the composition of the dataset. In adversarial contexts, this implies that the adversary can choose, among many different plausible explanations, one that suits their incentives. This complicated situation makes it particularly difficult for external observers, including judges and regulatory bodies, to determine whether an explanation is acceptable. Explanation algorithms appear to provide objective explanations, yet as explained above this is not the case (compare Section 4.2).

5 ONCE AN EXAMINER IS ALLOWED TO ASSESS THE PROVIDED POST-HOC EXPLANATIONS, SHE'D BETTER INVESTIGATE THE DECISION FUNCTION DIRECTLY

So far we have discussed explainability obligations in European Union law and their motivation (Section 3), and pointed out theoretical (Sections 4.1-4.2) and practical (Sections 4.3-4.6) shortcomings of post-hoc explanations. In this section, we add yet another component to our argument. In an adversarial setting, it is not only the AI decision system itself but also the corresponding explanation algorithm which might need to be examined by a third party. Even if the examiner only attempts to assess the most basic consistency properties of the provided explanations, that is, to check whether the explanations relate to the AI decision system at all, this necessarily requires that the examiner is able to query the AI system. But then, the explanations become entirely redundant: Rather than relying on explanations to enable risk management, provide trust, or detect bias and discrimination (compare Section 3.4), the examiner could directly query the AI system for problematic decision behavior. Because the creator of the system and the examiner have competing interests, it is important to distinguish degrees of transparent interaction between the two. Naturally, the examiner would like to have access to as much information as possible, whereas the adversary creator wants to disclose as little information as possible. We distinguish between a minimal and a fully transparent scenario of information disclosure (Sections 5.1-5.2).

5.1 Minimalist scenario where decision function and explanation algorithm can be queried

To determine whether the adversary's explanations actually correspond to the used decision function f instead of being arbitrary justifications not related to the decision process, the examiner needs to be able to query the decision function and the generated explanations.13 This includes a fair amount of related knowledge, such as which variables are input to the algorithm, but excludes explicit access to the decision function, explanation algorithm, source code and training dataset. A related but slightly more limited version of this scenario arises when individuals jointly collect the decisions and explanations from the creator of the system. In this minimalist scenario, the examiner can validate the internal consistency of the provided explanations. Researchers have proposed a number of criteria that the examiner can test for, such as faithfulness to the model, robustness to local perturbations, as well as necessity and sufficiency notions for individual feature attributions [3, 24, 56]. The examiner might also want to test whether the provided explanations have been manipulated [48]. More importantly, however, even just with the ability to query the decision function, the examiner can ignore the explanations and directly investigate the decision function for problematic properties. For example, the examiner could conduct a systematic evaluation of, say, fairness metrics such as equal opportunity and demographic parity, based on an independent reference dataset of her choice (see [6] for these and other notions of fairness and discrimination). Indeed, because the adversary designing the explanation algorithm has no interest in choosing explanations that highlight any discriminatory behavior of the decision algorithm, the examiner is well-advised to simply ignore the explanations and test the decision algorithm directly. Although such tests might be similar to certain explanation algorithms, what is important is that the examiner (as opposed to the creator) designs and implements them. Note that we are not saying that the minimalist scenario actually allows the examiner to assess all legally relevant properties of the decision function. What exactly can be assessed with querying access is a question that still requires more research. Our point is that once we have querying access, the explanations are useless.
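
As a minimal sketch of such a direct test, the following code computes demographic parity and equal opportunity gaps from query access alone. The query interface query_decision, the examiner's reference data and the sensitive group attribute are hypothetical placeholders; a real audit would involve far more careful measurement.

    # Illustrative sketch of the minimalist scenario: from query access alone, the examiner
    # computes fairness metrics on a reference dataset of her own choice, ignoring the
    # provided explanations. `query_decision`, the reference data and the group attribute
    # are hypothetical placeholders.
    import numpy as np

    def audit(query_decision, X_ref, y_ref, group):
        # Demographic parity and equal opportunity gaps between group == 1 and group == 0.
        decisions = np.array([query_decision(x) for x in X_ref])
        gaps = {}
        gaps["demographic parity"] = decisions[group == 1].mean() - decisions[group == 0].mean()
        positives = y_ref == 1  # equal opportunity compares true positive rates
        gaps["equal opportunity"] = (decisions[positives & (group == 1)].mean()
                                     - decisions[positives & (group == 0)].mean())
        return gaps

    # Toy usage with a stand-in for the adversary's black-box decision function:
    rng = np.random.default_rng(0)
    X_ref = rng.normal(size=(1000, 5))
    group = (rng.random(1000) < 0.5).astype(int)
    y_ref = (X_ref[:, 0] > 0).astype(int)
    print(audit(lambda x: int(x[0] + 0.3 * x[1] > 0), X_ref, y_ref, group))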

5.2 Fully transparent scenario where algorithms’ source code and training data are disclosed

At the opposite end from the minimalist scenario lies the fully transparent scenario, where the examiner is allowed to investigate the decision function, source code and training data. An examiner could then scrutinize whether the explanation algorithms have been implemented according to the state of the art with sensible parameter choices. This directly rules out the possibility for the creator of the system to manipulate explanations. Are post-hoc explanations useful in the transparent scenario, perhaps because the examiner now has the tools to verify whether the adversary has chosen the “correct” explanations? As we have already discussed above, the problem is that there is no notion of a “correct” explanation (Sections 4.2 and 4.3). Thus, except for notions of internal consistency [3, 24], there is, in general, nothing the examiner can say about the explanations. Another issue, already observed in Sections 4.5 and 4.6, concerns hyperparameter choices and decisions regarding the composition of the dataset. For these decisions, it is highly non-trivial to come up with uniquely reasonable defaults: If the adversary has found a particular neural network architecture with hyperparameters that generalize well on the adversary's own dataset, how exactly could the examiner argue that this is inappropriate? Nevertheless, all of these choices can influence the resulting explanations, even if we fix a particular explanation algorithm. Of course, the examiner could scrutinize the source code, re-train the system with different parameters, perform tests on the data, and generate alternative explanations. Some have argued that this might be sufficient in order to assess a variety of legal requirements [23]. While we think that more research is needed on what can realistically be achieved in the fully transparent scenario, it is quite clear that the examiner can, at least in principle, perform a variety of powerful tests (whether this is achievable in practice, given the limited resources of an examiner, is a different story). At any rate, just as in the minimalist scenario, the examiner is well-advised to examine and test the system on her own, and to ignore the explanations provided by the adversary creator.
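
As one example of such a re-training test, the following sketch retrains a pipeline under several plausible hyperparameter settings and random seeds and checks how stable a simple attribution for one and the same individual is. Everything here (the scikit-learn dataset, the random forest family, the mean-substitution attribution) is an illustrative assumption; the sketch merely indicates the kind of experiment an examiner could run in the fully transparent scenario.

    # Illustrative sketch for the fully transparent scenario: retrain under several plausible
    # hyperparameter choices and seeds, and check how stable the explanations for one
    # individual are. Dataset, model family and attribution are illustrative assumptions.
    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    x, mean = X[0], X.mean(axis=0)

    def top_features(f, x, background_mean, k=3):
        # The k features with the largest drop in predicted probability under mean substitution.
        base = f(x.reshape(1, -1))[0, 1]
        attrs = np.zeros(len(x))
        for j in range(len(x)):
            x_pert = x.copy()
            x_pert[j] = background_mean[j]
            attrs[j] = base - f(x_pert.reshape(1, -1))[0, 1]
        return tuple(np.argsort(-np.abs(attrs))[:k])

    results = set()
    for max_depth in [3, 5, None]:
        for seed in range(3):
            model = RandomForestClassifier(max_depth=max_depth, random_state=seed).fit(X, y)
            results.add(top_features(model.predict_proba, x, mean))
    print("distinct top-3 feature sets across the retrained models:", len(results))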

6 DISCUSSION

Explainability is often praised as a tool to mitigate some of the risks of black-box AI systems. Our paper demonstrates that in adversarial contexts, post-hoc explanations are of very limited use. From a technical and philosophical point of view, these explanations can never reveal the “unique, true reason” why an algorithm came to a certain decision. In complicated black-box models, such a true reason simply does not exist. We moreover demonstrated that post-hoc explanations of standard decision algorithms on simple datasets possess a high degree of ambiguity that cannot be resolved in principle. For these reasons, post-hoc explanations of black-box systems are, to a certain degree, incontestable. In the best case, post-hoc explanation algorithms can point out some of the factors that contributed to a decision — these algorithms are therefore useful for model debugging, scientific discovery and practical applications where all parties share a common goal. In adversarial contexts, in contrast, we demonstrated that local post-hoc explanations are either trivial or harmful. In the worst case, the explanations may lead us to falsely believe that a “justified” or “objective” decision has been made even when this is not the case.

We have also seen that it remains unclear how the explainability expectations in the GDPR or the AIA ought to be interpreted. The GDPR does not give rise to a general explainability obligation, and the draft AI Act currently would only require some degree of explainability in relation to high-risk applications of AI. We call on legislators to formulate the related provisions with more specificity in order to create legal certainty in this respect. If the final version of the AIA requires a strong version of explainability for high-risk AI systems, black-boxes simply cannot be used: they cannot be explained directly, and the only indirect means of explaining them — local post-hoc explanations — are unsuitable. In this case, one would have to resort to simple, inherently interpretable machine learning models rather than black-box models (compare [42]), although this may impede innovation. We would expect that these algorithms and their explanations are more robust and less susceptible to manipulation, such that large parts of our criticism would not apply to inherently interpretable models. However, future research needs to clarify whether this is the case, because we are not aware of any research that investigates inherently interpretable machine learning in an adversarial setting. If, on the other hand, explainability in the final version of the AIA is to be understood as one of several means to achieve more transparency in machine learning, methods other than post-hoc explanations might be more suitable to achieve the desired transparency goals. For example, as far as testing for biases and discrimination is concerned, it is unlikely that the creator of the system will choose to generate explanations that can be used to uncover hidden biases. But there is a much more direct route to assessing discrimination than implicitly through explanations. Indeed, external examiners could directly test the system for discriminatory properties [23]. As such, the external examination of black-boxes may be a more suitable means of enabling more accountable AI systems.

The current draft of the AIA already requires documentation regarding the functioning of AI systems. However, one has to be aware of the many opportunities for manipulation that lie in the development process of AI systems itself, through the choice of training data, features, algorithms, parameters, and so on. Even in the fully transparent scenario, where the entire development pipeline including the source code is open [23], considerable leeway for manipulation remains. In order to address this, an external examiner would need considerable manpower and resources. Even when training data and source code can in principle be examined, and algorithms re-applied or even retrained, actually doing so for a system that has been developed by a large team might be very difficult, if not impossible. More research is needed to understand exactly which legal objectives can be satisfied by such extended documentation of AI systems, or whether the documentation would again merely serve to provide an appearance of objectivity without any real value.

Overall, we believe that the question of testing and certifying machine learning systems in an adversarial scenario is a research direction that is still heavily under-explored. There is no single way to achieve all the desired transparency and control goals for such AI systems. Even complete transparency, with open code and open data, might not achieve all the desired goals. For this reason, it is important to investigate in more detail which objectives can be achieved by which means, and which goals might not be achievable at all. Only then can we engage in a meaningful debate about the responsible use of AI systems in social contexts.

Finally, we recall that our criticism of explainability, in particular local post-hoc explanations, concerns adversarial scenarios. In cooperative scenarios, many interesting discoveries might be made with the help of explainable machine learning.

7 FUNDING DISCLOSURE

This work has been partially supported by the German Research Foundation through the Cluster of Excellence “Machine Learning – New Perspectives for Science” (EXC 2064/1 number 390727645), the Baden-Württemberg Foundation (program “Verantwortliche Künstliche Intelligenz”), the BMBF Tübingen AI Center (FKZ: 01IS18039A), the International Max Planck Research School for Intelligent Systems (IMPRS-IS) and the Carl Zeiss Foundation. The authors declare no additional sources of funding and no financial interests.

REFERENCES

  • P. Achinstein. 1983. The Nature of Explanation. Oxford University Press, New York.
  • A. Adadi and M. Berrada. 2018. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 6 (2018), 52138–52160.
  • J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. 2018. Sanity checks for saliency maps. In Neural Information Processing Systems (NeurIPS).
  • A. Karimi, G. Barthe, B. Schölkopf, and I. Valera. 2021. A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. arXiv:2010.04050
  • C. Anders, P. Pasliev, A. K. Dombrowski, K. R. Müller, and P. Kessel. 2020. Fairwashing explanations with off-manifold detergent. In International Conference on Machine Learning (ICML).
  • S. Barocas, M. Hardt, and A. Narayanan. 2019. Fairness and Machine Learning. fairmlbook.org. http://www.fairmlbook.org.
  • S. Barocas, A. Selbst, and M. Raghavan. 2020. The hidden assumptions behind counterfactual explanations and principal reasons. In ACM Conference on Fairness, Accountability, and Transparency.
  • R. B. Braithwaite. 1953. Scientific Explanation: A Study of the Function of Theory, Probability and Law in Science. Cambridge University Press, Cambridge.
  • O. Camburu, E. Giunchiglia, J. Foerster, T. Lukasiewicz, and P. Blunsom. 2019. Can I trust the explainer? Verifying post-hoc explanatory methods. arXiv:1910.02065 (2019).
  • L. Chazette, W. Brunotte, and T. Speith. 2021. Exploring explainability: A definition, a model, and a knowledge catalogue. In IEEE 29th International Requirements Engineering Conference (RE).
  • European Commission. 2020. White Paper on Artificial Intelligence-A European approach to excellence and trust. Com (2020) 65 Final (2020).
  • I. Covert, S. Lundberg, and S.I. Lee. 2021. Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research (JMLR) 22, 209 (2021), 1–90.
  • F. Ding, M. Hardt, J. Miller, and L. Schmidt. 2021. Retiring Adult: New Datasets for Fair Machine Learning. In Neural Information Processing Systems (NeurIPS).
  • L. Edwards and M. Veale. 2017. Slave to the algorithm: Why a right to an explanation is probably not the remedy you are looking for. Duke Law and Technology Review 16 (2017).
  • D. Garreau and U. von Luxburg. 2020. Explaining the Explainer: A First Theoretical Analysis of LIME. In Conference on Artificial Intelligence and Statistics (AISTATS).
  • S. Ghalebikesabi, L. Ter-Minassian, K. Diaz-Ordaz, and C. C. Holmes. 2021. On locality of local explanation models. In Advances in Neural Information Processing Systems (NeurIPS).
  • C. Hempel. 1965. Aspects of Scientific Explanation and Other Essays in the Philosophy of Science. Free Press, New York.
  • M. Hildebrandt. 2019. Privacy as protection of the incomputable self: From agnostic to agonistic machine learning. Theoretical Inquiries in Law 20, 1 (2019), 83–121.
  • A. Z. Jacobs and H. Wallach. 2021. Measurement and fairness. In ACM conference on Fairness, Accountability, and Transparency.
  • D. Janzing, L. Minorics, and P. Blöbaum. 2020. Feature relevance quantification in explainable AI: A causal problem. In International Conference on Artificial Intelligence and Statistics (AISTATS).
  • M. Kaminski and J. Urban. 2021. The Right to Contest AI. Columbia Law Review (2021).
  • L. Kästner, M. Langer, V. Lazar, A. Schomäcker, T. Speith, and S. Sterz. 2021. On the Relation of Trust and Explainability: Why to Engineer for Trustworthiness. In IEEE 29th International Requirements Engineering Conference Workshops (REW).
  • J. Kleinberg, J. Ludwig, S. Mullainathan, and C. Sunstein. 2018. Discrimination in the Age of Algorithms. Journal of Legal Analysis 10 (2018), 113–174.
  • R. Kommiya Mothilal, D. Mahajan, C. Tan, and A. Sharma. 2021. Towards unifying feature attribution and counterfactual explanations: Different means to the same end. In AAAI/ACM Conference on AI, Ethics, and Society.
  • S. Krishna, T. Han, A. Gu, J. Pombra, S. Jabbari, S. Wu, and H. Lakkaraju. 2022. The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective. arXiv preprint arXiv:2202.01602 (2022).
  • M. Langer, D. Oster, T. Speith, H. Hermanns, L. Kästner, E. Schmidt, A. Sesing, and K. Baum. 2021. What do we want from Explainable Artificial Intelligence (XAI)? – A stakeholder perspective on XAI and a conceptual model guiding interdisciplinary XAI research. Artificial Intelligence 296 (2021).
  • E. Lee, D. Braines, M. Stiffler, A. Hudler, and D. Harborne. 2019. Developing the sensitivity of LIME for better machine learning explanation. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications.
  • D. Lewis. 1973. Counterfactuals. Blackwell.
  • Q. V. Liao and K. R. Varshney. 2021. Human-Centered Explainable AI (XAI): From Algorithms to User Experiences. arXiv preprint arXiv:2110.10790 (2021).
  • S. Lundberg and S. Lee. 2017. A unified approach to interpreting model predictions. In Neural Information Processing Systems (NeurIPS).
  • S. M. Lundberg, G. Erion, H. Chen, A. DeGrave, J. M. Prutkin, B. Nair, R. Katz, J. Himmelfarb, N. Bansal, and S. I. Lee. 2020. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence 2, 1 (2020), 56–67.
  • G. Malgieri and G. Comandé. 2017. Why a Right to Legibility of Automated Decision-Making Exists in the General Data Protection Regulation. International Data Privacy Law 7, 4 (11 2017), 243–265.
  • C. Molnar. 2020. Interpretable machine learning. Lulu.com.
  • R. Mothilal, A. Sharma, and C. Tan. 2020. Explaining machine learning classifiers through diverse counterfactual explanations. In ACM Conference on Fairness, Accountability, and Transparency.
  • High-Level Expert Group on AI. 2019. Ethics Guidelines for Trustworthy AI.
  • Working Party. 2016. Guidelines on Automated individual decision-making and Profiling for the purposes of Regulation 2016/679.
  • A. Paullada, I. Raji, E. Bender, E. Denton, and A. Hanna. 2021. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns 2, 11 (2021).
  • J. Pearl. 2000. Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge.
  • K. Popper. 1959. The Logic of Scientific Discovery. Hutchinson, London.
  • A. Reutlinger and J. Saatsi. 2018. Explanation Beyond Causation; Philosophical Perspectives on Non-Causal Explanations. Oxford University Press, Oxford.
  • M. T. Ribeiro, S. Singh, and C. Guestrin. 2016. Why should I trust you? Explaining the predictions of any classifier. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • C. Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
  • W. Salmon. 1971. Statistical Explanation and Statistical Relevance. University of Pittsburgh Press, Pittsburgh, PA.
  • W. Salmon. 1989. Four Decades of Scientific Explanation. In Scientific Explanation, Kitcher and Salmon (Eds.). Minnesota Studies in the Philosophy of Science, Vol. 13. University of Minnesota Press, 3–219.
  • A. Selbst and J. Powles. 2018. Meaningful Information and the Right to Explanation. In ACM Conference on Fairness, Accountability, and Transparency.
  • D. Slack, A. Hilgard, S. Singh, and H. Lakkaraju. 2021. Reliable post hoc explanations: Modeling uncertainty in explainability. In Neural Information Processing Systems (NeurIPS).
  • D. Slack, S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju. 2020. Fooling lime and shap: Adversarial attacks on post hoc explanation methods. In AAAI/ACM Conference on AI, Ethics, and Society.
  • D. Slack, S. Hilgard, H. Lakkaraju, and S. Singh. 2021. Counterfactual Explanations Can Be Manipulated. arXiv:2106.02666 (2021).
  • P. Spirtes, C. Glymour, and R. Scheines. 1993. Causation, Prediction, and Search. Springer, Berlin.
  • W. Spohn. 1980. Stochastic independence, causal independence, and shieldability. Journal of Philosophical Logic 9 (1980), 73–99.
  • M. Sundararajan and A. Najmi. 2020. The many Shapley values for model explanation. In International Conference on Machine Learning (ICML).
  • R. Tomsett, D. Braines, D. Harborne, A. Preece, and S. Chakraborty. 2018. Interpretable to Whom? A Role-based Model for Analyzing Interpretable Machine Learning Systems. In ICML Workshop on Human Interpretability in Machine Learning.
  • P. Tschandl, C. Rinner, Z. Apalla, G. Argenziano, N. Codella, A. Halpern, M. Janda, A. Lallas, C. Longo, J. Malvehy, J. Paoli, S. Puig, C. Rosendahl, H. Soyer, I. Zalaudek, and H. Kittler. 2020. Human–computer collaboration for skin cancer recognition. Nature Medicine 26, 8 (2020), 1229–1234.
  • M. Veale and F. Zuiderveen Borgesius. 2021. Demystifying the Draft EU Artificial Intelligence Act—Analysing the good, the bad, and the unclear elements of the proposed approach. Computer Law Review International 22, 4 (2021), 97–112.
  • S. Venkatasubramanian and M. Alfano. 2020. The Philosophical Basis of Algorithmic Recourse. In ACM Conference on Fairness, Accountability, and Transparency.
  • G. Vilone and L. Longo. 2021. Notions of explainability and evaluation approaches for explainable artificial intelligence. Information Fusion 76 (2021), 89–106.
  • W. J. von Eschenbach. 2021. Transparency and the Black Box Problem: Why We Do Not Trust AI. Philos. Technol. 34 (2021), 1607–1622.
  • U. von Luxburg, R. Williamson, and I. Guyon. 2012. Clustering: Science or Art? JMLR Workshop and Conference Proceedings (Workshop on Unsupervised Learning and Transfer Learning) (2012), 65–79.
  • S. Wachter, B. Mittelstadt, and L. Floridi. 2017. Why a Right to Explanation of Automated Decision-Making Does Not Exist in the General Data Protection Regulation. International Data Privacy Law 7, 2 (06 2017), 76–99.
  • S. Wachter, B. Mittelstadt, and C. Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech. 31 (2017), 841.
  • J. Woodward. 2003. Making Things Happen: A Theory of Causal Explanation. Oxford University Press.
  • J. Woodward and L. Ross. 2003. Scientific Explanation. The Stanford Encyclopedia of Philosophy (Summer Edition 2021) (2003). https://plato.stanford.edu/archives/sum2021/entries/scientific-explanation/
  • C. Zednik and H. Boelsen. forthcoming. Scientific Exploration and Explainable Artificial Intelligence. Minds and Machines (forthcoming).
  • Y. Zhang, K. Song, Y. Sun, S. Tan, and M. Udell. 2019. Why Should You Trust My Explanation? Understanding Uncertainty in LIME Explanations. arXiv preprint arXiv:1904.12991 (2019).

FOOTNOTES

1The creator is mainly the developer. But since the developer develops the system for a user, their interests typically align. Hence we do not distinguish developer and user, and use the term “creator” instead.

2Similar distinctions were introduced by [52].

3The exact interpretation of “logic” in the GDPR is not settled but likely does not refer to understandings of this term in philosophy or computer science.

4Children should not be subject to automated decision-making.

5Questions regarding “Explanations” have been discussed since the beginning of philosophy, with a strong revival in the philosophy of science of the last century, treating scientific explanations [1, 8, 17, 39, 43], causal explanations [28, 38, 49, 50, 61], and non-causal explanations [40]. We refer the interested reader to [44, 62] and restrict our discussion to the context of machine learning.

6For further references, see §2 therein.

7Other actions belong more properly to the collaborative setting, such as debugging, improving, correcting, learning, understanding and testing.

8With the consequence that explainability would also play a role in stimulating the adoption of AI and the competitiveness of the internal market.

9Similar issues were discussed in [7, 19].

10The reader who is acquainted with the internal mechanics of the depicted explanation methods might feel that a direct comparison between the different methods is unwarranted, because different methods measure different aspects of the underlying decision function [9]. Note, however, that this is exactly the point that we want to make by explicitly contrasting the different attributions.

11This originates in the philosophical account: counterfactuals depend on the way one measures proximity between facts and alternative counter-facts [28].

12The distinction between two “different” explanation algorithms and different parameter choices for the “same” explanation algorithm is of course a matter of perspective: We might consider the question of distributional versus interventional Shapley values as a question of how to use “the” SHAP method [20], but we might as well perceive it as a discussion as to which of two different methods to use.

13This means that for any possible datapoint (or individual) x, the examiner is allowed to ask the adversary: “For this hypothetical datapoint x, what would be the decision y = f(x), and what would be the corresponding explanation E(x, y)?” The adversary would then privately compute both quantities and make them available to the examiner, but not tell the examiner how the computation was performed.
