1 Introduction
Artificial intelligence (AI) solutions have become pervasive across many fields over the last decades, thanks in part to the adoption of deep learning algorithms. In particular, deep learning has shown remarkable success in supervised learning tasks, where the goal is to learn patterns in a labeled training data set and use them to accurately predict labels on unseen data [71]. Deep learning algorithms rely on neural networks, which allow for efficient processing of large amounts of unstructured data. However, they also rely on a large number of parameters, which makes their decision-making process difficult to understand. Such models are often referred to as black boxes due to the lack of transparency in their inner workings.
Reinforcement learning (RL) [85] is a sub-field of AI that focuses on developing intelligent agents for sequential decision-making tasks. RL employs a trial-and-error learning approach in which an agent learns a task from scratch through interactions with its environment. The agent can observe the environment, perform actions that alter its state, and receive rewards from the environment, which guide it towards optimal behavior. The goal of RL is to obtain an optimal policy \(\pi\), which maps the agent’s states to optimal actions. This bears some similarity to supervised learning, where the goal is to classify an instance into the correct class according to the input features. However, while supervised learning algorithms rely on labeled training instances to learn patterns in the data, RL agents approach the task without prior knowledge and learn it through interactions with the environment.
Deep reinforcement learning (DRL) algorithms [5], which employ a neural network to represent an agent’s policy, are currently the most popular approach for learning RL policies [5]. DRL algorithms have shown remarkable success in navigating sequential decision-making problems in games, autonomous driving, healthcare, and robotics [17, 46, 48, 64]. Although they can process large amounts of unstructured, high-dimensional data, their reliance on neural networks makes an agent’s decisions difficult to explain.
Depending on the task and the user, AI systems require explainability for a variety of reasons (Figure 1). From the perspective of the developer, explainability is necessary to verify the system’s behavior before deployment. Understanding how the input features influence the decision of the AI system is necessary to avoid deploying models that rely on spurious correlations [8, 27] and to ensure robustness to adversarial attacks [86]. From the perspective of fairness, understanding the decision-making process of an AI system is necessary to prevent automated discrimination. A lack of transparency can cause AI models to inadvertently adopt historical biases ingrained in the training data and use them in their decision logic [39]. To prevent discrimination, users of autonomous decision-making systems are now legally entitled to an explanation under the General Data Protection Regulation (GDPR) in the EU [31]. From the perspective of expert and non-expert users of the system, explainability is necessary to ensure trust. For experts who use AI systems as an aid in their everyday tasks, trust is a crucial component of successful collaboration. For example, a medical doctor using an AI system for diagnostics needs to understand it to trust its decisions and use them for this high-risk task [49]. Similarly, for non-expert users, trust is needed to encourage interaction with the system. If an AI system is used to make potentially life-altering decisions for the user, they need to understand how the system operates to maintain their confidence and trust in it.
The field of explainable AI (XAI) explores methods for interpreting decisions of black-box systems in areas such as machine learning, reinforcement learning, and explainable planning [2, 16, 26, 32, 50, 60, 74, 76, 78, 83, 90, 91, 97]. In recent years, the focus of XAI has mostly been on explaining decisions of supervised learning models [10]. Specifically, the majority of XAI methods have focused on explaining the decisions of neural networks, due to the emergence of deep learning as the state-of-the-art approach to many supervised learning tasks [1, 13, 99]. In contrast, explainable RL (XRL) is a fairly novel field that has not yet received an equal amount of attention. Most often, existing XRL methods focus on explaining DRL algorithms, which rely on neural networks to represent the agent’s policy, due to their prevalence and success [95]. However, as RL algorithms are becoming more prominent and are being considered for use in real-life tasks, there is a growing need to understand their decisions [24, 73]. For example, RL algorithms are being developed for different tasks in healthcare, such as dynamic treatment design [53, 59, 101]. Without rigorous verification and understanding of such systems, medical experts will be reluctant to collaborate with and rely on them [75]. Similarly, RL algorithms have been explored for enabling autonomous driving [4]. To understand and prevent mistakes such as the 2018 Uber accident [47], in which a self-driving car failed to stop for a pedestrian, the underlying decision-making systems have to be scrutable. Specific to the RL framework, explainability is also necessary to correct and prevent “reward hacking” – a phenomenon where an RL agent learns to trick a potentially misspecified reward function, such as a vacuum cleaner ejecting collected dust to increase its cleaning time [3, 68].
In this work, we explore counterfactual explanations in supervised and reinforcement learning. Counterfactual explanations answer the question: “Given that the black-box model made decision y for input x, how can x be changed for the model to output the alternative decision \(y^{\prime }\)?” [93]. Counterfactual explanations offer actionable advice to users of black-box systems by generating counterfactual instances – instances as similar as possible to the original instance being explained but producing a desired outcome. If the user is not satisfied with the decision of a black-box system, a counterfactual explanation offers them a recipe for altering their input features to obtain a different output. For example, if a user is denied a loan by an AI system, they might want to know how they can change their application so that it is accepted in the future. Counterfactual explanations are targeted at non-expert users, as they often deal in high-level terms and offer actionable advice. They are also selective, aiming to change as few features as possible to achieve the desired output. As explanations that can suggest potentially life-altering actions to users, counterfactuals carry great responsibility. A useful counterfactual explanation can help a user achieve a desired outcome and increase their trust and confidence in the system. However, an ill-defined counterfactual that proposes unrealistic changes to the input features or does not deliver the desired outcome can waste the user’s time and effort and erode their trust in the system. For this reason, careful selection of counterfactual explanations is essential for maintaining user trust and encouraging collaboration with the system.
Although they have been explored in supervised learning [14, 19, 40, 58, 72, 96], counterfactual explanations are rarely applied to RL tasks [67]. In supervised learning, methods for generating counterfactual explanations often follow a similar pattern. First, a loss function is defined that takes into account different properties of counterfactual instances, such as the prediction of the desired class or similarity to the original instance. The loss function is then optimized over the training data to find the most suitable counterfactual instance. While the exact design of the loss function and the optimization algorithm vary between approaches, the high-level procedure often takes the same form, as illustrated in the sketch below.
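As an illustration of this general pattern, the following minimal sketch selects a counterfactual from a pool of candidate instances (e.g., the training set) by minimizing a loss that combines validity and similarity. The function names, the toy threshold model, and the loss weights are our own illustrations; the methods reviewed later use richer losses and optimizers.

```python
import numpy as np

def counterfactual_search(x, y_target, model_predict, candidates, lam=1.0):
    """Pick the candidate that is predicted as y_target and is closest to x.

    A generic loss of the form
        L(x') = 1[model_predict(x') != y_target] * penalty + lam * ||x' - x||_1
    is minimized over a pool of candidate instances (here, the training set).
    """
    best, best_loss = None, np.inf
    for c in candidates:
        validity_penalty = 0.0 if model_predict(c) == y_target else 1e6
        loss = validity_penalty + lam * np.abs(c - x).sum()
        if loss < best_loss:
            best, best_loss = c, loss
    return best

# toy usage: a threshold "model" over two features
model = lambda z: int(z[0] + z[1] > 1.0)
X_train = np.array([[0.2, 0.3], [0.9, 0.4], [0.6, 0.7]])
x = np.array([0.2, 0.3])                             # currently classified as 0
print(counterfactual_search(x, 1, model, X_train))   # closest instance classified as 1
```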
In this work, we challenge the definition of counterfactuals inherited from supervised learning when explaining RL agents. We examine the similarities and differences between supervised learning and RL from the perspective of counterfactual explanations and argue that the definition of counterfactual explanations cannot be directly translated from supervised learning to RL. Even though the two learning paradigms share similarities, we demonstrate that the sequential nature of RL tasks, as well as the agent’s goals, plans, and motivations, make the two settings substantially different from the perspective of counterfactual explanations. We start by reviewing the existing state-of-the-art methods for generating counterfactual explanations in supervised learning. We then identify the main differences between supervised and reinforcement learning from the perspective of counterfactual explanations and redefine them for RL use. Finally, we identify research questions that need to be answered before counterfactual explanation methods can be applied to RL and propose potential solutions.
Previous surveys of XRL recognize counterfactual explanations as an important method but do not offer an in-depth review of methods for generating this type of explanation [36, 73, 98]. Previous surveys of counterfactual explanations, in turn, focus only on methods for explaining supervised learning models and offer a theoretical background and review of state-of-the-art approaches [81, 82, 92]. Similarly, Guidotti [33] reviews counterfactuals for supervised learning and offers a demonstration and comparison of different approaches. In this work, on the other hand, we focus specifically on counterfactual explanations from the perspective of RL. Additionally, while previous work has explored the differences between supervised and reinforcement learning for causal explanations [20], we build on it to redefine counterfactual explanations for RL use and to explore the challenges of directly applying supervised learning methods for generating counterfactual explanations in RL.
The rest of the work is organized as follows. Section 2 provides a taxonomy and a short overview of methods for explaining the behavior of RL agents. In Section 3 we identify the key similarities and differences between supervised and reinforcement learning from the perspective of explainability. Properties of counterfactual explanations are explored in Section 4. Section 5 offers a review of the state-of-the-art methods for generating counterfactual explanations in supervised and reinforcement learning. Finally, Section 6 focuses on redefining counterfactual explanations for RL and identifying challenges and open questions in this field.
3 Explainability in Supervised vs. Reinforcement Learning: Similarities and Differences
Supervised and reinforcement learning are often used to solve different types of tasks. For example, supervised learning is used for learning patterns in large amounts of labeled data, while RL focuses on sequential tasks and learns them from scratch through interactions with the environment. From the perspective of explainability, however, the two approaches share notable similarities. Most importantly, the lack of transparency in supervised and reinforcement learning stems from the same source – the underlying black-box model. In supervised learning, a black-box model is used to map input instances to their labels. Similarly, in RL tasks, the model maps the agent’s state to an optimal action. Most explainability methods focus on deep learning and DRL scenarios, where the underlying black-box model is a neural network [95]. The high-level setup of the two paradigms is in fact identical – a black-box model takes in an often multi-dimensional input consisting of features and produces a single output. For this reason, a model-agnostic approach for explaining a model’s prediction in a supervised learning task can easily be applied to an RL task to explain the choice of action in a specific state. For example, saliency maps have been successfully repurposed for RL [38], despite being developed with supervised learning in mind [80]. Other local explanation methods such as LIME [76] and SHAP [60] have also been used to interpret the effect of individual features on the action chosen in a state [23, 100].
However, despite their similarities, the supervised learning and RL frameworks differ in a few notable ways and thus often require different explanation techniques. In Figure 3 we illustrate the main differences between supervised learning and RL from the perspective of explainability:
(1) State independence: one of the main assumptions of supervised learning algorithms is the independence of instances in the training data. RL, however, focuses on sequential tasks where states are temporally connected. There is an inherent causal structure between the visited states and chosen actions during the agent’s execution. A certain state is only visited because previous states and actions led to it [20]. These causal links can be important components for explaining the outcomes of the RL agent’s behavior. Figure 3(a) illustrates using previously visited states to explain the current decision. The agent justifies its decision to choose the action right by remembering that it had previously picked up the blue key and should navigate towards the blue goal.
(2) Sequential execution: supervised learning approaches are limited to one-step prediction tasks, while the RL framework specializes in sequential problems. While methods for explaining supervised learning models only need to rely on model parameters and input features to explain a prediction, the causes of an RL agent’s decisions can lie in its past or future [20]. For this reason, explanations of decisions in RL cannot be confined to the current time step but may need to include temporally distant causes. Figure 3(b) shows how temporally distant states can motivate an agent’s decision.
(3) Explanation components: in supervised learning, a model’s decision is explained as a function of input features and model parameters. In RL, however, an agent’s decision-making process can be more complex and can include plans, goals, or unexpected events. To fully understand the behavior of RL agents, explanations often cannot be limited to state features alone, but should also include other possible causes of decisions. For example, in Figure 3(c) an agent can explain its decision by comparing two different objectives it balances between. Even though the agent is choosing the fastest way to the goal, stepping on the ice can lead to a large penalty. To understand why an agent would choose a potentially dangerous action, it is necessary to know that it prefers speed over safety.
(4) Training data set: while a necessary component of supervised learning approaches, the training data set does not have a natural equivalent in the RL framework. For this reason, explainability methods for supervised learning that rely on training data sets (e.g., LIME, counterfactuals) might have to be adjusted to be applicable to RL tasks.
Supervised learning explainability approaches are limited to one-step prediction tasks, and as such have limited applications in RL. While they can be used to explain a decision based on the current state features, an RL agent’s decisions are often influenced by a wider range of causes and require richer, temporally extended explanations. For that reason, there is a need for RL-specific explainability techniques that account for the particularities of the framework.
4 Counterfactual Explanations
At the moment, a substantial amount of research in XAI is focused on developing explanations for experts familiar with AI systems. From the perspective of non-expert users, however, such explanations might be too detailed and difficult to understand. End-users are more likely to be interested in more abstract explanations of the system [21]. For example, consider the 2018 incident in which an Uber self-driving vehicle killed a pedestrian after failing to avoid them while they were crossing the road [47]. In this situation, developers and experts might be interested in uncovering which parameters or input features contributed to the fatal failure, in order to be able to repair it. A non-expert user, on the other hand, is more likely to be interested in high-level questions, such as “Did the car not recognize the person in front of it?” or “Did the car believe it had the right of way?” [21]. Although research on low-level, expert-oriented explanations has gained momentum, user-friendly explanation methods are less explored. However, developing user-friendly explanations is necessary in order to build trust and encourage user interaction with black-box systems.
In this work, we explore one of the most notable examples of user-friendly explanations – counterfactual explanations. Counterfactuals are local explanations that attempt to discover the causes of an event by answering the question: “Given that input x has been classified as y, what is the smallest change to x that would cause it to be classified as an alternative class \(y^{\prime }\)?”. For example, counterfactuals can be applied to answer questions such as “Since my application was rejected, what are some examples of successful loan applications similar to mine?” or “How can I change my application for it to be accepted in the future?”. The explanation is usually given in the form of a counterfactual instance – an instance \(x^{\prime }\) as similar as possible to the original x, but producing an alternative outcome \(y^{\prime }\). Formally, according to Wachter et al. [96], counterfactual explanations can be defined as statements:
Score y was returned because variables V had values \((v_1, v_2, \ldots)\) associated with them. If V instead had values \((v_1^{\prime }, v_2^{\prime }, \ldots)\), and all other variables had remained constant, score \(y^{\prime }\) would have been returned.
The statement consists of two main factors: the variables that the user can alter, and the score. In supervised learning, the variables represent the input features that the black-box model uses to make a prediction, for example, pixels in image-based tasks. The prediction of the black-box model represents the score. The alternative outcome \(y^{\prime }\) can either be specified or omitted, in which case the counterfactual can be an instance producing any outcome other than y. Current methods either directly search for a counterfactual instance \(x^{\prime }\) similar to x, or for a small perturbation \(\delta\) that can be added to x to achieve the desired outcome, in which case \(x^{\prime } = x + \delta\). A summary of the terminology used throughout the work is given in Table 2.
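To make the perturbation view concrete, the sketch below performs a gradient search for \(\delta\) under a simple, fully known logistic scorer. This is only an illustration of the general idea behind such objectives, not a reproduction of any particular method; the model, hyperparameters, and function names are our own assumptions.

```python
import numpy as np

def perturbation_counterfactual(x, y_target, w, b, lam=0.1, lr=0.1, steps=200):
    """Gradient search for delta such that x' = x + delta is classified as y_target.

    Assumes a differentiable scorer p(y=1|x) = sigmoid(w.x + b); the loss trades
    off the target-class log-loss against the L2 size of the perturbation:
        L(delta) = -log p(y_target | x + delta) + lam * ||delta||^2
    """
    delta = np.zeros_like(x)
    for _ in range(steps):
        z = w @ (x + delta) + b
        p = 1.0 / (1.0 + np.exp(-z))            # probability of class 1
        grad_pred = (p - y_target) * w          # gradient of the log-loss w.r.t. delta
        grad = grad_pred + 2.0 * lam * delta    # add gradient of the distance term
        delta -= lr * grad
    return x + delta

w, b = np.array([1.0, 2.0]), -1.5               # toy linear classifier
x = np.array([0.2, 0.3])                        # currently predicted as class 0
print(perturbation_counterfactual(x, 1, w, b))  # perturbed instance pushed towards class 1
```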
The following properties make counterfactuals user-friendly explanations:
(1) Actionable: counterfactual explanations give the user actionable advice by identifying which parts of the input should be changed for the model’s output to change. Users can follow this advice to alter their input and obtain the desired outcome.
(2) Contrastive: counterfactual explanations compare real and imagined worlds. Contrastive explanations have been shown to be the preferred way for humans to understand the causes of events [63].
(3) Selective: counterfactual explanations are often optimized to change as few features as possible, making it easier for users to implement the changes. This also corresponds to the human preference for shorter rather than longer explanations [65].
(4) Causal: counterfactuals correspond to the third tier of Pearl’s Ladder of Causation [69] (Table 3). The first rung of the hierarchy corresponds to associative reasoning and answers the question “If I observe X, what is the probability of Y occurring?”. This rung also aligns with statistical reasoning, as the probability in question can be estimated from data. The second rung requires interventions to answer the question “How would Y change if I changed X?”. Finally, the third rung addresses the question “Was it X that caused Y?”. Answering this question requires imagining alternative worlds where X did not happen and estimating whether Y would occur in such circumstances.
(5) Inherent to human reasoning: humans rely on generating counterfactuals in their everyday lives. By imagining alternative worlds, humans learn about cause-effect relationships between events. For example, by considering the counterfactual claim “Had the car driven slower, it would have avoided the accident”, the relationship between the car’s speed and the accident can be deduced. Additionally, humans use counterfactuals to assign blame or fault for a negative event [11].
As user-friendly explanations, counterfactuals are important for ensuring the trust and collaboration of non-expert users of the system. Since they offer actionable advice that may be used to change a user’s real-life circumstances, counterfactual explanations can strongly influence user trust. While useful counterfactuals can help the user better navigate the AI system, a counterfactual that proposes unrealistic changes or does not lead to the desired outcome can be costly and frustrating, and further erode the user’s trust in the system. For this reason, it is necessary to provide a way to ensure the usefulness and evaluate the quality of counterfactual explanations.
4.1 Properties of Counterfactual Explanations
Due to their potential influence on user trust, there is a need to develop metrics for evaluating the usefulness of counterfactual explanations. Additionally, counterfactual explanations can suffer from the Rashomon effect – it is possible to generate a large number of suitable counterfactuals for a specific instance [65]. In this case, further analysis and comparison are necessary to select the most useful ones. For this reason, multiple criteria for assessing the quality of generated counterfactual explanations have been proposed [92]; a minimal sketch of how several of these criteria can be computed follows the list:
(1) Validity: the counterfactual must be assigned by the model to a different class from that of the original instance. If a specific counterfactual class \(y^{\prime }\) is provided, then validity is satisfied if the counterfactual is classified as \(y^{\prime }\).
(2) Proximity: the counterfactual \(x^{\prime }\) should be as similar as possible to the original instance x.
(3) Actionability: the counterfactual explanation should provide users with meaningful insight into how they can change their features in order to achieve the desired outcome. This means that suggesting that a user change sensitive features (e.g., race) or features that are immutable (e.g., age, country of origin) should be avoided, as it is not helpful and may be offensive.
(4) Sparsity: ideally, the counterfactual should require only a few features of the original instance to be changed. This corresponds to the notion of a selective explanation, which humans find easier to understand [65].
(5) Data manifold closeness: counterfactual instances rely on modifying existing data points, which can lead to the generation of out-of-distribution samples. To be feasible for users, counterfactuals need to be realistic.
(6) Causality: while modifying the original sample, counterfactuals should abide by the causal relationships between features. For illustration, consider an example by Mahajan et al. [62] of a counterfactual which suggests that the user should increase their education to a Master’s degree, without adjusting their age. Even though such a counterfactual instance would be actionable and fall within the data manifold (as there probably are people of the same age as the user but with a Master’s education), the proposed changes would not be feasible for the current user.
(7) Recourse: ideally, the user should be offered a set of discrete actions that can transform the original instance into the counterfactual. This property is closely related to actionability and causality – the offered sequence of actions should not change immutable features and should consider causal relationships between the variables, to account for how changing one feature can influence the others. The field of recourse started out separate from counterfactual explanations, but as explanation methods have become more elaborate and user-centered, this line has been blurred [92].
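As a minimal sketch of how some of the criteria above can be scored in practice, the function below checks validity, proximity, sparsity, and a simple form of actionability for a candidate counterfactual. The remaining properties require task-specific models and are omitted; all names and the toy model are our own illustrations.

```python
import numpy as np

def evaluate_counterfactual(x, x_cf, y_target, model_predict,
                            immutable_mask=None, eps=1e-8):
    """Score a candidate counterfactual against a subset of the properties above.

    Returns validity, proximity (L2 distance), sparsity (number of changed
    features) and actionability (no immutable feature was modified).
    """
    changed = np.abs(x_cf - x) > eps
    report = {
        "validity": model_predict(x_cf) == y_target,
        "proximity": float(np.linalg.norm(x_cf - x)),
        "sparsity": int(changed.sum()),
    }
    if immutable_mask is not None:
        report["actionable"] = not np.any(changed & immutable_mask)
    return report

# toy usage with a threshold "model" and one immutable feature
model = lambda z: int(z[0] + z[1] > 1.0)
x, x_cf = np.array([0.2, 0.3]), np.array([0.9, 0.3])
print(evaluate_counterfactual(x, x_cf, 1, model,
                              immutable_mask=np.array([False, True])))
```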
6 Counterfactual Explanations in Reinforcement Learning
In the previous section, we provided an overview of the state-of-the-art methods for generating counterfactual explanations in supervised learning tasks. In RL, however, counterfactual explanations have seldom been applied. In this section, we explore how counterfactual explanations differ between supervised and reinforcement learning tasks and identify the main challenges that prevent the direct transfer of methods from supervised learning to RL. Additionally, we redefine counterfactual explanations and counterfactual properties for RL. Finally, we identify the main research directions for implementing counterfactual explanations in RL.
6.1 Existing Methods for Counterfactual Explanations in RL
Although the potential of counterfactual explanations for explaining RL agents has been recognized [20], there are currently few methods enabling counterfactual generation in RL. Olson et al. [67] proposed the first method for generating counterfactuals for RL agents. In their work, counterfactuals are generated using a generative deep-learning architecture. The authors explain the decisions of RL agents in Atari games and search for counterfactuals in the latent space of the policy network. The approach assumes that the policy is represented by a deep neural network and starts by dividing the network into two parts. The first part A contains all but the last hidden layer of the network and serves to map an input state s into a latent representation z: \(A(s) = z\). The second part \(\pi\) consists only of the last fully connected layer followed by a softmax and is used to provide action probabilities based on the latent state. The policy network can thus be viewed as the composition \(\pi (A(s))\). To generate counterfactual states, the authors propose a deep generative model consisting of an encoder (E), a discriminator (D), a generator (G), and a Wasserstein encoder (\(E_w, D_w\)). Encoder E and generator G work together as an encoder-decoder pair, with the encoder mapping the input state into a low-dimensional latent representation and the generator reconstructing the original state from the encoding. Discriminator D learns to predict the probability distribution over actions \(\pi (z)\) given a latent state representation \(z = E(s)\). Counterfactuals are searched for in the latent space, which may contain holes, resulting in unrealistic instances. For this reason, the authors include a Wasserstein encoder \((E_w, D_w)\) that maps the latent state z to an even lower-dimensional representation \(z_w\), to obtain a more compact representation and ensure more realistic counterfactuals. The generative architecture is trained on a set of the agent’s transitions in the environment \(\mathcal {X} = \lbrace (s_1, a_1),\dots ,(s_n, a_n)\rbrace\). Finally, to generate a counterfactual state \(s^{\prime }\) of state s, the authors first locate the latent representation \(z_w^*\) of \(s^{\prime }\) by solving an optimization problem over the latent space in which \(a^{\prime }\) is the counterfactual action; the objective can be simplified to the form given in Equation (5). The process of finding the counterfactual instance then consists of encoding the original state into the Wasserstein encoding \(z_w = E_w(A(s))\) and optimizing Equation (5) to find \(z_w^*\). The latent counterfactual instance \(z_w^*\) can then be decoded to obtain \(\pi (D_w(z_w^*))\), which is finally passed to the generator to obtain the counterfactual instance \(s^{\prime }\). This approach can generate realistic image-based counterfactual instances that can be used to explain the behavior of Atari agents. However, the approach is model-specific and requires access to the RL model’s internal parameters.
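The sketch below captures the high-level idea of such a latent-space search: encode the state, locally optimize the latent point so that the policy assigns more probability to the counterfactual action while staying close to the original encoding, and decode the result. The objective and the hill-climbing optimizer are our simplified stand-ins for the actual objective (Equation (5)); the encode, decode, and policy callables are placeholders, not the authors’ trained networks.

```python
import numpy as np

def latent_counterfactual(s, a_cf, encode, decode, policy_probs,
                          step=0.05, iters=500, rng=None):
    """Hill-climbing sketch of a latent-space counterfactual search.

    `encode` stands for the composition E_w(A(s)) from the text, `decode` for the
    generative path back to a state, and `policy_probs` for the agent's action
    distribution as a function of the latent point. The objective (maximize the
    probability of the counterfactual action a_cf while staying close to the
    original encoding) is an assumed simplification of the method's objective.
    """
    rng = rng or np.random.default_rng(0)
    z0 = np.asarray(encode(s), dtype=float)
    z = z0.copy()

    def score(zz):
        return policy_probs(zz)[a_cf] - 0.1 * np.linalg.norm(zz - z0)

    for _ in range(iters):
        cand = z + step * rng.standard_normal(z.shape)
        if score(cand) > score(z):
            z = cand
    return decode(z)   # counterfactual state s'

# toy stand-ins just to exercise the function
encode = lambda s: s[:2]
decode = lambda z: np.concatenate([z, np.zeros(2)])
policy_probs = lambda z: np.exp(z) / np.exp(z).sum()
print(latent_counterfactual(np.array([0.1, 0.2, 0.0, 0.0]), 0,
                            encode, decode, policy_probs))
```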
In contrast, Huber et al. [37] propose GANterfactual-RL, a model-agnostic generative approach based on the StarGAN [15] architecture for generating realistic counterfactual explanations. They approach the counterfactual search problem as a domain translation task, where each domain is associated with the states in which the agent chooses a specific action. A suitable counterfactual can be found by translating the original state into a similar state from the target domain. In this way, obtaining a counterfactual involves changing only the features that are relevant to the agent’s decision, while maintaining the others. The approach consists of training two neural networks – a discriminator D and a generator G. The generator G is trained to translate the input state x into an output state y, conditioned on the target domain c. The role of the discriminator is to distinguish between real states and fake ones produced by G.
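Once such a translator is trained, generating a counterfactual reduces to a single forward pass conditioned on the target action, optionally followed by a validity check against the agent’s policy. The sketch below illustrates this usage; the generator, policy, and their toy stand-ins are our own placeholders, not the GANterfactual-RL implementation.

```python
import numpy as np

def domain_translation_counterfactual(state, target_action, generator, policy):
    """Domain-translation view of counterfactual generation (sketch).

    `generator(state, domain)` is assumed to be an already trained StarGAN-style
    translator whose domains correspond to the agent's actions; the counterfactual
    is the translation of the state into the domain of the target action.
    """
    counterfactual = generator(state, target_action)
    is_valid = policy(counterfactual) == target_action
    return counterfactual, is_valid

# toy stand-ins: a "translator" that nudges the first feature, and a threshold policy
generator = lambda s, c: np.r_[s[0] + (1 if c == 1 else -1), s[1:]]
policy = lambda s: int(s[0] > 0.5)
print(domain_translation_counterfactual(np.array([0.2, 0.7]), 1, generator, policy))
```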
The current approaches for generating counterfactuals in RL do not deviate from similar approaches in supervised learning and do not account for the additional complexities of the RL framework presented in Section 3. For example, the counterfactuals generated by Huber et al. [37] and Olson et al. [67] rely only on state features to explain a decision and do not include other explanation components, such as goals or objectives, that can influence the agent’s behavior. Similarly, these approaches do not address the issue of temporality and can generate counterfactuals that are close to the original in terms of features but might be far apart in execution.
While current methods in supervised learning and RL generate counterfactuals suitable for one-step prediction tasks, they do not account for the explanatory requirements of RL tasks. In the following sections, we redefine counterfactual explanations to fit within the RL framework and explore their desired properties. We also identify the main research directions that need to be addressed before counterfactuals can be successfully used to explain the decisions of RL agents.
6.2 Counterfactual Explanations Redefined
In RL, current counterfactual explanations focus on interpreting a single decision in a specific state [37, 67]. The variables are simply state features, and the score is the agent’s choice of action in that state. This corresponds to the definition of counterfactuals used in supervised learning, where input features are replaced by state features and the model’s prediction of a class label is replaced by the choice of action. However, unlike supervised learning models, RL agents deal with sequential tasks where states are temporally connected. This means that an agent’s action in a single state cannot be fully explained by taking into account only the current state features. Additionally, an agent’s behavior cannot be fully explained by interpreting single actions, as individual actions are often part of a larger plan or policy. To be able to generate counterfactual explanations that can fully encompass the complexities of RL tasks, we need to redefine the notions of variable and score from the RL perspective. Examples of possible counterfactual explanations in RL after redefining the terms variable and score are presented in Table 5.
6.2.1 Variables.
To redefine variables from the perspective of RL tasks, we explore the different possible causes of decisions for RL agents. According to the causal attribution model presented by Dazeley et al. [20], an agent’s decisions can be directly or indirectly affected by five factors – perception, goals, disposition, events, and expectations. While Dazeley et al. [20] also provide a causal structure connecting the different factors, in this work we focus only on their individual effects on the agent’s decision.
Current counterfactual explanations use an agent’s perception as the variable and only consider state features as the possible causes of an outcome. Counterfactual questions can then be phrased as “Given that the agent chose action a in a state with state features \(f_1, \ldots , f_k\), how can the features be changed for the agent to choose an alternative action \(a^{\prime }\)?” This type of explanation is most suitable for single-goal, single-objective, deterministic tasks, where each action depends only on the state the agent is currently perceiving. In such tasks, it can be assumed that the user knows the ultimate goal of the agent’s behavior and that the agent does not need to balance different objectives. In such environments, the agent’s actions have deterministic consequences and unexpected events cannot influence its behavior. This corresponds to simple environments, such as deterministic games, where the agent’s goal is to win by optimizing a single objective function. In more complex, real-world environments, however, relying only on the current state to make a decision is unlikely to suffice, and additional factors might need to be taken into account.
An agent’s current goal can also influence an outcome. Counterfactual explanations can then be used to address the question “Given that the agent chose action a in state s while following goal G, for what alternative goal \(G^{\prime }\) would the agent have chosen action \(a^{\prime }\)?”. This question is especially useful in multi-goal or hierarchical RL tasks, where the agent has a set of goals that need to be fulfilled and can switch between them. In such environments, knowing which goal the agent is pursuing in a state is necessary to fully understand its behavior. For example, consider an autonomous taxi vehicle. The task of the taxi agent can be split into two subgoals – picking up passengers and delivering them to their destinations. A passenger picked up by such a taxi might be confused if the agent takes a longer route to their destination. The passenger might lose trust in the agent and even suspect that it is lost or trying to trick them. However, if the agent is allowed to explain its behavior with the counterfactual statement “I am following the goal of picking up a different passenger. Had I followed the goal of delivering the current passenger, I would have taken the shorter route to the destination.”, the user can be reassured that the agent is indeed following the best course of action.
Similarly, in multi-objective environments, an agent’s preference for a specific objective can guide its behavior. This corresponds to the effect of the agent’s disposition on its decisions, as described by Dazeley et al. [20]. The counterfactual question can then be posed as: “Given that the agent chose action a in state s while preferring objective O, for what alternative objective preference \(O^{\prime }\) would the agent have chosen action \(a^{\prime }\)?”. An agent’s internal preference for one objective over another is likely to influence its actions, and this information is necessary to fully understand the agent’s behavior. Many real-life tasks fit this description. For example, the behavior of an autonomous driving agent is at each time step guided by different objectives such as speed, safety, fuel consumption, and user comfort. Different agents can have different preferences with regard to these objectives. If an agent does not slow down in a critical situation and, as a consequence, ends up colliding with the vehicle in front of it, a counterfactual explanation could identify that, had the agent preferred safety over speed in the critical state, it would have chosen to slow down and the accident would have been avoided. This type of explanation can also be useful for developers when designing multi-objective reward functions for complex tasks. Deciding on the correct weights for different objectives is a notoriously challenging task and requires substantial time and effort [56]. Understanding how different objective preferences influence an agent’s decisions, especially in critical states, can be useful for adjusting the appropriate weights.
Unexpected outside events can alter an agent’s behavior and as such should also be featured in counterfactual explanations. The counterfactual question can then be defined as “Given that the agent chose action a in a state under an unexpected event E, for which event \(E^{\prime }\) would the agent have chosen action \(a^{\prime }\)?”. For example, consider an autonomous driving agent that had to change its trajectory due to an unexpected blockage on the road. The user might be confused as to why the agent decided on the longer path, and a counterfactual explanation could restore their trust in the system by asserting that “Had the shorter path not been blocked, I would have taken it.”. This corresponds to the field of safe RL, which argues that, to be safely deployed, agents must be able to avoid danger and correctly respond to unexpected events [29].
Finally, expectations can also influence an agent’s behavior. Expectation refers to the set of social and cultural constraints that are imposed on the agent and can affect its decisions [20]. These can be anything from social conventions and laws to ethical rules. Counterfactual explanations could focus on how changing expectations affects an agent’s behavior by posing the question “Had expectations for the agent been different, would it still make the same decisions?”. Including expectations in explanations can help frame an agent’s behavior in its broader social and cultural context. Users who are not familiar with the expectations the agent is following may find it strange if the agent diverges from its primary goals due to an expectation. Explanations that identify social expectations as the cause of unexpected behavior can help users understand and regain trust in the model.
While decisions of a supervised learning model depend only on the input features, the RL agent’s behavior is often motivated by a wider range of causes, such as goals, objectives, and expectations. For that reason, the definition of a variable needs to be extended to include all factors influencing an agent’s decisions.
6.2.2 Score.
In supervised learning, the only reasonable outcome of the model’s behavior for a specific input instance is its prediction. In RL tasks, however, there are multiple possible scores that could describe an agent’s behavior in a specific state. Current methods for counterfactual explanations in RL use the action chosen in the state as the score. However, each of the agent’s decisions often does not exist on its own, but as part of a larger plan or policy. Relying only on the current action might not be enough to capture the desired change in behavior, especially in complex, continuous environments. In game environments, such as chess or StarCraft, the user might be more interested in knowing why the agent chose one specific plan or sequence of steps instead of another. Similarly, in an autonomous driving task with continuous control, the user is unlikely to ask the agent why it changed the steering angle by a small amount, but might be interested in knowing why it chose a specific route over another. Interpreting a single action in a complex or continuous environment might be too low-level for non-expert users and better suited for experts or developers. Non-expert users tend to be more comfortable with high-level explanations corresponding to plans or policies. To be user-friendly, counterfactual explanations in RL have to be extended to cover the more complex behavior that is inherent to RL agents.
While supervised learning models focus on one-step prediction tasks, RL agents learn sequential behavior that consists of plans and policies. For this reason, counterfactual explanations need to account for the broader scope of RL behavior.
6.3 Counterfactual Properties Redefined
Counterfactual properties are metrics for deciding on the most suitable counterfactual to be presented to the user. It is therefore important that counterfactual properties capture the necessary desiderata for choosing useful counterfactuals. Providing a user with unattainable or ineffective counterfactuals can cause frustration and erode their trust in the system. So far, counterfactual properties have been defined with supervised learning tasks in mind, where only input features are changed in order to obtain a desired prediction. Since counterfactual explanations in RL are not limited to state features and one-step decisions, counterfactual properties need to be redefined for use in RL tasks:
(1) Validity is measured as the extent to which the desired score is achieved in the counterfactual instance. In RL, the score can refer to the agent’s current action, plan, or policy. Depending on the chosen definition of the score, validity describes whether the agent in the counterfactual instance chose the desired action or followed the desired plan or policy.
(2) Proximity is estimated as the difference between the original and counterfactual instances. In supervised learning, similarity is calculated as a function of input features. In RL, however, proximity can involve not only state features but also other explanation factors, such as goals or objectives, which need to be included in the calculation. Additionally, while in supervised learning data samples are independent of each other, states in RL tasks are temporally connected. Offering the user a counterfactual instance that is similar to the original in terms of state features but temporally distant might be useless. For this reason, temporality must also be considered when calculating the proximity of two states in RL.
(3) Actionability specifies the need to treat certain features (e.g., race, country of origin) as immutable. In supervised learning, developers are required to define which features should not be changed. In RL, however, certain immutable features can also be imposed by the environment itself. For example, although goal coordinates exist in the feature space, changing them might be impossible under the rules of the environment. Suggesting to the user that they should obtain a counterfactual where the goal is moved would not be actionable. For this reason, the definition of actionability in RL needs to be expanded to restrict changes not only to developer-defined features but also to those that are rendered immutable by the rules of the environment.
(4) Sparsity requires that as few features as possible are changed in order to obtain the counterfactual instance. In current work, it is often assumed that there is no difference in the cost of changing individual features. In RL, however, counterfactual instances can be obtained not only by changing state features but also goals or objectives. Changing a single state feature or changing the agent’s goal may pose different levels of difficulty for the user, and a sparsity metric should reflect this potential cost inequality in RL tasks.
(5) Data manifold closeness refers to the requirement that counterfactual instances should be realistic, so that users are able to achieve them. The main challenge in estimating data manifold closeness lies in defining what is meant by a realistic instance. In supervised learning, the training data set is used to represent the space of realistic instances, and the data manifold closeness of an instance can be estimated with respect to this space. In RL, however, a training data set is not readily available. For this reason, a more precise definition of realistic instances and accompanying metrics for evaluating data manifold closeness are needed in RL.
(6) Causality refers to the notion that counterfactual instances should obey the causal constraints between features. In supervised learning, generating causally correct counterfactual instances is only possible if the causal relationships between input features are known. In RL, however, causal rules are ingrained in the environment. Relying on the actions provided by the environment can ensure that features are not changed independently, but in accordance with the causal rules of the environment.
(7) Recourse is a sequence of actions the user should perform to transform the original instance into the counterfactual. In supervised learning, the user is in theory allowed to perform whichever actions they desire, provided that they conform to the causality and actionability constraints. In certain RL tasks, however, the user’s actions might be limited to those available in the environment. In game environments, for example, the user can only change the game state by performing actions that obey the rules of the game. For illustration, consider StarCraft, a two-player strategy game where the goal is to build an army and defeat the opponent. If the user wants to know why the RL agent did not choose an attack action in a specific state, the system can answer by providing recourse: “In order for attack to be the best decision, you first need to perform action build_army.” To follow the provided advice, the user must rely only on actions available in the environment.
Counterfactual properties guide the search for the counterfactual instance and as such directly influence which explanations will be presented to the user. In RL, counterfactual properties need to be extended from their supervised learning counterparts to encapsulate the complexities of RL tasks. The introduction of different types of variables and scores to counterfactual explanations in RL affects all counterfactual properties. Additionally, the temporal nature of RL needs to be considered when evaluating proximity, one of the properties that all previous works optimize. On the other hand, while supervised learning lacks causal information about input features, this knowledge is often ingrained within the RL environment and can potentially be used to generate causally correct counterfactuals and recourse.
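To illustrate how such redefined properties could be operationalized, the toy proximity measure below combines state-feature distance with the cost of changing the agent’s goal and with the temporal distance between the two instances. The component costs, the dictionary representation, and the function itself are illustrative assumptions rather than established metrics.

```python
import numpy as np

def rl_proximity(orig, cf, costs=None):
    """Toy proximity for RL counterfactuals combining several components.

    `orig` and `cf` are dicts with 'state' (feature vector), 'goal' (a label) and
    't' (time step along the execution). The per-component costs are illustrative:
    changing the goal is priced higher than changing state features, and temporal
    distance contributes as well.
    """
    costs = costs or {"state": 1.0, "goal": 5.0, "temporal": 0.5}
    d_state = np.linalg.norm(cf["state"] - orig["state"])   # feature distance
    d_goal = float(cf["goal"] != orig["goal"])              # goal changed or not
    d_time = abs(cf["t"] - orig["t"])                       # steps apart in execution
    return (costs["state"] * d_state
            + costs["goal"] * d_goal
            + costs["temporal"] * d_time)

orig = {"state": np.array([0.1, 0.4]), "goal": "A", "t": 3}
cf = {"state": np.array([0.1, 0.9]), "goal": "B", "t": 7}
print(rl_proximity(orig, cf))
```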
6.4 Challenges and Open Questions
So far, we have shown how counterfactual explanations differ between supervised and reinforcement learning and have redefined them for RL tasks. The question remains, however: how can suitable counterfactuals be generated for RL tasks? Given the large amount of research on counterfactual explanations in supervised learning and the similarities between the two learning paradigms, in this section we explore whether this substantial body of work can be applied to RL. We identify the main challenges and open research directions that should be tackled before counterfactual explanations can be generated for RL tasks.
6.4.1 Counterfactual Search Space.
In supervised learning, the search for a counterfactual instance is often performed over the training data set, an essential component of every supervised learning task. Additionally, the training data set is used to represent the space of realistic instances when measuring the data manifold closeness of counterfactual instances. Intuitively, if an instance is consistent with the data distribution defined by the training data set, it is considered realistic. In RL, however, the agent learns the task from scratch and no training data set is available. For that reason, the search space corresponding to realistic instances needs to be redefined for RL tasks.
The most straightforward way to approximate a training data set in RL is to obtain a data set of states by unrolling the agent’s policy. Existing loss-based approaches can then be optimized over such a data set to find the most useful counterfactual instance. Data manifold closeness of counterfactual instances could then be estimated in relation to this data set, and realistic instances would correspond to those states that the agent is likely to visit by following its policy. This approach makes sense for counterfactual instances that can be obtained by changing only the state features; the counterfactual instance should then be reachable by the same policy the agent is employing. However, if the search also involves changing the agent’s goals or dispositions, the counterfactual instance is not likely to be found within the agent’s current policy. For example, consider a multi-goal agent with two subgoals A and B, explaining why decision \(a^{\prime }\) was not taken instead of a in state s. A counterfactual instance might explain the decision by claiming that, had the agent been following subgoal B instead of A, action \(a^{\prime }\) would have been executed instead of action a. However, this counterfactual instance is unlikely to occur under the agent’s policy \(\pi\), which is already following goal A in state s. For this reason, further research is needed to understand how the search space for counterfactual instances should be defined.
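A minimal sketch of this idea is shown below: the agent’s policy is unrolled for several episodes, and the visited state-action pairs serve as the candidate pool (and empirical data manifold) for counterfactual search. The function assumes a Gym-style environment interface and is our own illustration, not an established procedure.

```python
def collect_policy_states(env, policy, episodes=10, max_steps=200):
    """Approximate a 'training set' for counterfactual search by unrolling the policy.

    Assumes a Gym-style environment (reset() -> (obs, info), step(a) ->
    (obs, reward, terminated, truncated, info)). The returned (state, action)
    pairs define the space of instances the agent actually visits under its
    current policy.
    """
    dataset = []
    for _ in range(episodes):
        obs, _ = env.reset()
        for _ in range(max_steps):
            action = policy(obs)
            dataset.append((obs, action))
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                break
    return dataset
```

The resulting data set could then be plugged into a loss-based search such as the one sketched in Section 5, with the caveat discussed above that goal- or disposition-changing counterfactuals may lie outside it.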
6.4.2 Categorical Variables.
Most of the current state-of-the-art methods for generating counterfactual explanations in supervised learning do not support categorical input features. This problem naturally extends to RL, where state features can often be categorical. In RL, however, the problem is further exacerbated by the use of alternative explanation components such as goals, dispositions, or events. These components often represent high-level concepts and as such are often discrete. Understanding how discrete variables can be integrated with existing counterfactual methods is necessary for their use in RL tasks.
6.4.3 Temporality.
One of the main purposes of counterfactual explanations is to show the user how the input features can be changed in the most suitable way to alter the output of a black-box model. Intuitively, this means that the counterfactual needs to be reachable from the original instance. In supervised learning, many counterfactual properties such as proximity, sparsity, and data manifold closeness are used to guide the search toward easily obtainable counterfactuals. The notion of reachability, however, is not contained within the supervised learning task – all samples are considered independent, and whether one sample can be obtained from another often depends on a human-defined interpretation of terms such as immutability, causality, or sparsity. In most supervised learning works, proximity is used as a proxy for reachability, and instances with similar features are considered easily reachable. In RL, however, due to its sequential nature, states are temporally connected. Reachability in RL tasks is ingrained in the environment and depends on how temporally distant two states are in terms of execution.
This means that two states with almost identical state feature values might be distant from each other in terms of execution. For illustration, consider the chess example presented in Figure 5. The generated counterfactual for a chess position can satisfy the validity, sparsity, and data manifold closeness constraints and be very similar to the original instance in terms of state features. However, if it is not reachable from the original instance under the game rules, recourse cannot be provided based on such a counterfactual. In other words, the counterfactual is not useful to the user if they cannot reach it by following the game rules. For this reason, temporal similarity needs to be considered along with feature similarity when calculating proximity. Further research is necessary to define metrics and integrate temporal similarity into the search for counterfactuals in RL.
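As a simple illustration of temporal reachability, the sketch below estimates the number of environment steps separating two states by breadth-first search over a hypothetical one-step transition function. In realistic, high-dimensional environments this exact search quickly becomes infeasible and the distance would have to be approximated; the interface and toy example are our own assumptions.

```python
from collections import deque

def temporal_distance(start, goal, neighbors, max_depth=50):
    """Breadth-first estimate of how many environment steps separate two states.

    `neighbors(state)` is assumed to return the states reachable in one step
    under the rules of the environment (e.g., legal moves in a board game).
    Returns the number of steps, or None if `goal` is not reachable within
    `max_depth` steps; feature-space proximity alone would miss this.
    """
    frontier = deque([(start, 0)])
    visited = {start}
    while frontier:
        state, depth = frontier.popleft()
        if state == goal:
            return depth
        if depth == max_depth:
            continue
        for nxt in neighbors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, depth + 1))
    return None

# toy usage on a 1-D chain where each step moves the state by +/-1
neighbors = lambda s: [s - 1, s + 1]
print(temporal_distance(0, 5, neighbors))   # 5
```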
6.4.4 Stochasticity.
Most methods for providing recourse in supervised learning assume that model decisions and feature changes are deterministic and that there exists one true sequence of actions for obtaining the counterfactual instance. In many RL tasks, however, the environment can be stochastic and influence the agent’s behavior. Events outside of the agent’s control can change the environment, and the agent’s actions can have stochastic consequences. This is also a characteristic of real-life problems, where decision models can change over time and performing an action might not have a deterministic outcome. For example, the threshold for a loan can change over time, and even if the user follows the recourse offered a year ago, their application might still get rejected. This inconsistency between advice and results can be frustrating for the user and undermine their trust in the system. For that reason, counterfactual explanations in RL should account for the stochastic nature of the environment when offering recourse to users. For example, the counterfactual search could incorporate a certainty constraint, ensuring that the user is presented not only with the fastest but also the most reliable path towards their desired outcome.
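One possible way to account for stochasticity is to estimate how reliably a proposed recourse plan reaches the desired outcome, for example by Monte Carlo simulation. The sketch below illustrates this; the simulator interface and the toy environment are our own assumptions rather than part of any existing method.

```python
import random

def recourse_success_probability(env_factory, start_state, actions,
                                 is_desired, n_rollouts=1000):
    """Monte Carlo estimate of how likely a recourse plan is to reach the goal.

    In a stochastic environment the same action sequence can lead to different
    outcomes, so we repeatedly replay the suggested actions and measure how often
    the desired outcome is reached. `env_factory(start_state)` is assumed to
    return a fresh simulator exposing step(action) -> state.
    """
    successes = 0
    for _ in range(n_rollouts):
        env = env_factory(start_state)
        state = start_state
        for a in actions:
            state = env.step(a)
        successes += is_desired(state)
    return successes / n_rollouts

# toy usage: each "step" succeeds with probability 0.9
class NoisyChain:
    def __init__(self, state): self.state = state
    def step(self, a):
        if random.random() < 0.9:
            self.state += a
        return self.state

print(recourse_success_probability(NoisyChain, 0, [1, 1, 1], lambda s: s >= 3))
```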
6.4.5 Alternative Explanation Components.
Current methods for counterfactual explanations in RL rely only on state features as causes of a decision. However, in RL, various other components such as goals, objectives, or outside events can contribute to the agent’s behavior and should be captured in counterfactual explanations. Especially in the fields of multi-goal and multi-objective RL, diverse counterfactual explanations that capture the wide range of possible causes are necessary in order to fully comprehend the agent’s decisions. Further research in the integration of different causes into counterfactual explanations is needed.
Developing metrics for evaluating counterfactual properties in RL is another important research direction. So far, counterfactual properties have been defined and evaluated with supervised learning in mind, and research is required into how the redefined counterfactual properties for RL tasks can be evaluated. For example, calculating proximity for RL states might require considering temporal similarity, but it is not clear how to efficiently estimate how easily one state can be reached from another in RL. A brute-force approach would involve a shortest-path search over the graph of all possible transitions in the environment; in high-dimensional environments, however, this approach would be prohibitively expensive, and distances between states would need to be estimated. Similarly, research is needed into estimating data manifold closeness for counterfactual instances in RL. While a training data set can be used in supervised learning to represent the space of realistic instances, in RL tasks there is no natural equivalent of the training data set, so an alternative metric for evaluating data manifold closeness is necessary. For the purposes of evaluating sparsity and proximity, developing metrics that integrate state features with RL-specific explanation components, such as goals and objectives, is another research direction necessary for implementing counterfactual explanations in RL. Existing research on reward function similarity [30] could be a starting point for developing metrics that compare the goals and objectives of RL agents. Additionally, the cost associated with changing individual goals and objectives when generating recourse might depend on the user. This opens up the possibility of a human-in-the-loop approach [6] where user feedback could be integrated into the counterfactual generation process.
6.4.6 Counterfactual and Prefactual Statements.
Previous research in psychology distinguishes between prefactual and counterfactual explanations, two different types of statements that can be used to describe the causes of an event [12, 18, 25]. For example, for a user wondering why their loan application has been rejected, a counterfactual explanation could state: “Had you had a higher income, your loan would have been approved”, while a prefactual explanation would advise: “If you obtain a higher income in the future, your next loan application will be approved” [18]. Previous research in psychology has found that the two explanation types involve different psychological processes [12, 25].
In supervised learning, where models deal with one-step predictions and there is no notion of time, counterfactual and prefactual statements are used interchangeably. In contrast, RL is designed for sequential tasks, where the distinction between prefactual and counterfactual statements is more meaningful. For example, while counterfactual explanations could help locate mistakes in the agent’s previous path, prefactual explanations could help guide the agent from an undesirable state to a more favorable one. In this sense, counterfactual explanations could be related to the field of credit assignment [35]. Prefactual explanations, on the other hand, could be used in the field of safe RL to help agents avoid dangerous states and find a way to safety [29].
While the difference between prefactual and counterfactual statements in supervised learning is purely semantic, in RL different technical approaches would need to be developed to obtain the two types of explanations. Current methods for counterfactual generation in RL [37, 67] do not distinguish between prefactual and counterfactual statements, and further research is needed to utilize these two perspectives of counterfactual explanations.
6.4.7 Evaluation.
Evaluating explanations is challenging, as their perceived quality is subjective and depends on the user. Counterfactual explanations are usually evaluated based on their counterfactual properties. Few works build on this and evaluate the real-life applicability of counterfactuals in a user study [67]. In supervised learning, counterfactual explanations are evaluated against data sets such as German Credit [19, 66], Breast Cancer [14, 58], or MNIST [52, 58, 77, 79]. For RL tasks, however, there is no established benchmark environment for the evaluation of counterfactual explanations; current work evaluates counterfactuals in Atari games [67]. Establishing benchmark environments is necessary for evaluating and comparing counterfactual methods in RL. Additionally, RL benchmark environments need to be extended to include multi-goal, multi-objective, and stochastic tasks.
6.5 Conclusion
While successfully applied to a variety of tasks, RL algorithms suffer from a lack of transparency due to their reliance on neural networks. User-friendly explanations are necessary to ensure trust and encourage collaboration between non-expert users and the black-box system. In this work, we explored counterfactuals – user-friendly, actionable explanations for interpreting black-box systems. Counterfactuals represent a powerful explanation method that can help users understand and better collaborate with a black-box system. However, counterfactuals that propose unrealistic changes or do not deliver the desired output can further damage user trust and hinder engagement with the system. For this reason, only well-defined and useful counterfactual explanations should be presented to the user.
Although they are well researched in supervised learning, counterfactual explanations are still underrepresented in RL tasks. In this work, we explored how counterfactual explanations can be redefined for RL tasks. First, we offered an overview of current state-of-the-art methods for generating counterfactual explanations in supervised learning. We then identified the main differences between supervised learning and RL from the perspective of counterfactual explanations and provided a definition better suited to RL tasks. Specifically, we recognized that the definitions of score and variable are not straightforward in RL and can encompass different concepts. Finally, we proposed the main research directions necessary for the successful implementation of counterfactual explanations in RL. In particular, we identified temporality and stochasticity as important RL-specific concepts that affect counterfactual generation. Additionally, we drew attention to universal issues with counterfactual generation in both supervised learning and RL, such as the handling of categorical features and evaluation.