1 Introduction
Artificial intelligence (AI) solutions have become pervasive across many fields over the last decades, thanks in part to the adoption of deep learning algorithms. In particular, deep learning has shown remarkable success in supervised learning tasks, where the goal is to learn patterns in a labeled training data set and use them to accurately predict labels on unseen data [71]. Deep learning algorithms rely on neural networks, which allow for efficient processing of large amounts of unstructured data. However, they also rely on a large number of parameters, which makes their decision-making process difficult to understand. Such models are often referred to as black boxes due to the lack of transparency in their inner workings.
Reinforcement learning (RL) [85] is a sub-field of AI that focuses on developing intelligent agents for sequential decision-making tasks. RL employs a trial-and-error learning approach in which an agent learns a task from scratch through interactions with its environment. The agent can observe the environment, perform actions that alter its state, and receive rewards from the environment, which guide it towards optimal behavior. The goal of RL is to obtain an optimal policy \(\pi\), which maps the agent’s states to optimal actions. This bears some similarity to supervised learning, where the goal is to classify an instance into the correct class according to the input features. However, while supervised learning algorithms rely on labeled training instances to learn patterns in the data, RL agents approach the task without prior knowledge and learn it through interactions with the environment.
Deep reinforcement learning (DRL) algorithms [5], which employ a neural network to represent an agent’s policy, are currently the most popular approach for learning RL policies [5]. DRL algorithms have shown remarkable success in navigating sequential decision-making problems in games, autonomous driving, healthcare, and robotics [17, 46, 48, 64]. Although they can process large amounts of unstructured, high-dimensional data, their reliance on neural networks makes an agent’s decisions difficult to explain.
Depending on the task and the user, AI systems require explainability for a variety of reasons (Figure 1). From the perspective of the developer, explainability is necessary to verify the system’s behavior before deployment. Understanding how the input features influence the decision of the AI system is necessary to avoid deploying models that rely on spurious correlations [8, 27] and to ensure robustness to adversarial attacks [86]. From the perspective of fairness, understanding the decision-making process of an AI system is necessary to prevent automated discrimination. A lack of transparency can cause AI models to inadvertently adopt historical biases ingrained in the training data and use them in their decision logic [39]. To prevent discrimination, users of autonomous decision-making systems are now legally entitled to an explanation under the General Data Protection Regulation (GDPR) in the EU [31]. From the perspective of expert and non-expert users of the system, explainability is necessary to ensure trust. For experts who use AI systems as an aid in their everyday tasks, trust is a crucial component of successful collaboration. For example, a medical doctor using an AI system for diagnostics needs to understand it to trust its decisions and use them for this high-risk task [49]. Similarly, for non-expert users, trust is needed to encourage interaction with the system. If an AI system is used to make potentially life-altering decisions for the user, they need to understand how the system operates to maintain their confidence and trust in it.
The field of explainable AI (XAI) explores methods for interpreting decisions of black-box systems in areas such as machine learning, reinforcement learning, and explainable planning [2, 16, 26, 32, 50, 60, 74, 76, 78, 83, 90, 91, 97]. In recent years, the focus of XAI has mostly been on explaining decisions of supervised learning models [10]. Specifically, the majority of XAI methods have focused on explaining the decisions of neural networks, due to the emergence of deep learning as the state-of-the-art approach to many supervised learning tasks [1, 13, 99]. In contrast, explainable RL (XRL) is a fairly novel field that has not yet received an equal amount of attention. Most often, existing XRL methods focus on explaining DRL algorithms, which rely on neural networks to represent the agent’s policy, due to their prevalence and success [95]. However, as RL algorithms are becoming more prominent and are being considered for use in real-life tasks, there is a growing need to understand their decisions [24, 73]. For example, RL algorithms are being developed for different tasks in healthcare, such as dynamic treatment design [53, 59, 101]. Without rigorous verification and understanding of such systems, medical experts will be reluctant to collaborate with and rely on them [75]. Similarly, RL algorithms have been explored for enabling autonomous driving [4]. To understand and prevent mistakes such as the 2018 Uber accident [47], in which a self-driving car failed to stop for a pedestrian, the underlying decision-making systems have to be scrutable. Specific to the RL framework, explainability is also necessary to correct and prevent “reward hacking” – a phenomenon where an RL agent learns to trick a potentially misspecified reward function, such as a vacuum cleaner ejecting collected dust to increase its cleaning time [3, 68].
In this work, we explore counterfactual explanations in supervised and reinforcement learning. Counterfactual explanations answer the question: “Given that the black-box model made decision y for input x, how can x be changed for the model to output the alternative decision \(y^{\prime }\)?” [93]. Counterfactual explanations offer actionable advice to users of black-box systems by generating counterfactual instances – instances as similar as possible to the original instance being explained but producing a desired outcome. If the user is not satisfied with the decision of a black-box system, a counterfactual explanation offers them a recipe for altering their input features to obtain a different output. For example, if a user is denied a loan by an AI system, they might want to know how they can change their application so that it is accepted in the future. Counterfactual explanations are targeted at non-expert users, as they often deal in high-level terms and offer actionable advice. They are also selective, aiming to change as few features as possible to achieve the desired output. As explanations that can suggest potentially life-altering actions to users, counterfactuals carry great responsibility. A useful counterfactual explanation can help a user achieve a desired outcome and increase their trust and confidence in the system. However, an ill-defined counterfactual that proposes unrealistic changes to the input features or does not deliver the desired outcome can waste the user’s time and effort and erode their trust in the system. For this reason, careful selection of counterfactual explanations is essential for maintaining user trust and encouraging collaboration with the system.
Although they have been explored in supervised learning [14, 19, 40, 58, 72, 96], counterfactual explanations are rarely applied to RL tasks [67]. In supervised learning, methods for generating counterfactual explanations often follow a similar pattern. First, a loss function is defined that takes into account different properties of counterfactual instances, such as the prediction of the desired class or similarity to the original instance. The loss function is then optimized over the training data to find the most suitable counterfactual instance. While the exact design of the loss function and the optimization algorithm vary between approaches, the high-level procedure often takes the same form, as illustrated in the sketch below.
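As an illustration of this general pattern, the following minimal sketch selects a counterfactual from a pool of candidate instances (e.g., the training set) by minimizing a loss that combines validity and similarity. The function names, the toy threshold model, and the loss weights are our own illustrations; the methods reviewed later use richer losses and optimizers.

```python
import numpy as np

def counterfactual_search(x, y_target, model_predict, candidates, lam=1.0):
    """Pick the candidate that is predicted as y_target and is closest to x.

    A generic loss of the form
        L(x') = 1[model_predict(x') != y_target] * penalty + lam * ||x' - x||_1
    is minimized over a pool of candidate instances (here, the training set).
    """
    best, best_loss = None, np.inf
    for c in candidates:
        validity_penalty = 0.0 if model_predict(c) == y_target else 1e6
        loss = validity_penalty + lam * np.abs(c - x).sum()
        if loss < best_loss:
            best, best_loss = c, loss
    return best

# toy usage: a threshold "model" over two features
model = lambda z: int(z[0] + z[1] > 1.0)
X_train = np.array([[0.2, 0.3], [0.9, 0.4], [0.6, 0.7]])
x = np.array([0.2, 0.3])                             # currently classified as 0
print(counterfactual_search(x, 1, model, X_train))   # closest instance classified as 1
```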
In this work, we challenge the definition of counterfactuals inherited from supervised learning when explaining RL agents. We examine the similarities and differences between supervised learning and RL from the perspective of counterfactual explanations and argue that the definition of counterfactual explanations cannot be directly translated from supervised learning to RL. Even though the two learning paradigms share similarities, we demonstrate that the sequential nature of RL tasks, as well as the agent’s goals, plans, and motivations, make the two settings substantially different from the perspective of counterfactual explanations. We start by reviewing the existing state-of-the-art methods for generating counterfactual explanations in supervised learning. We then identify the main differences between supervised and reinforcement learning from the perspective of counterfactual explanations and redefine them for RL use. Finally, we identify research questions that need to be answered before counterfactual explanation methods can be applied to RL and propose potential solutions.
Previous surveys of XRL recognize counterfactual explanations as an important method but do not offer an in-depth review of methods for generating this type of explanation [36, 73, 98]. Previous surveys of counterfactual explanations, in turn, focus only on methods for explaining supervised learning models and offer a theoretical background and review of state-of-the-art approaches [81, 82, 92]. Similarly, Guidotti [33] reviews counterfactuals for supervised learning and offers a demonstration and comparison of different approaches. In this work, on the other hand, we focus specifically on counterfactual explanations from the perspective of RL. Additionally, while previous work has explored the differences between supervised and reinforcement learning for causal explanations [20], we build on it to redefine counterfactual explanations for RL use and to explore the challenges of directly applying supervised learning methods for generating counterfactual explanations in RL.
The rest of the work is organized as follows. Section 2 provides a taxonomy and a short overview of methods for explaining the behavior of RL agents. In Section 3 we identify the key similarities and differences between supervised and reinforcement learning from the perspective of explainability. Properties of counterfactual explanations are explored in Section 4. Section 5 offers a review of the state-of-the-art methods for generating counterfactual explanations in supervised and reinforcement learning. Finally, Section 6 focuses on redefining counterfactual explanations for RL and identifying challenges and open questions in this field.
3 Explainability in Supervised vs. Reinforcement Learning: Similarities and Differences
Supervised and reinforcement learning are often used to solve different types of tasks. For example, supervised learning is used for learning patterns in large amounts of labeled data, while RL focuses on sequential tasks and learns them from scratch through interactions with the environment. From the perspective of explainability, however, the two approaches share notable similarities. Most importantly, the lack of transparency in supervised and reinforcement learning stems from the same source – the underlying black-box model. In supervised learning, a black-box model is used to map input instances to their labels. Similarly, in RL tasks, the model maps the agent’s state to an optimal action. Most explainability methods focus on deep learning and DRL scenarios, where the underlying black-box model is a neural network [95]. The high-level setup of the two paradigms is in fact identical – a black-box model takes in an often multi-dimensional input consisting of features and produces a single output. For this reason, a model-agnostic approach for explaining a model’s prediction in a supervised learning task can easily be applied to an RL task to explain the choice of action in a specific state. For example, saliency maps have been successfully repurposed for RL [38], despite being developed with supervised learning in mind [80]. Other local explanation methods such as LIME [76] and SHAP [60] have also been used to interpret the effect of individual features on the action chosen in a state [23, 100].
However, despite their similarities, the supervised learning and RL frameworks differ in a few notable ways and thus often require different explanation techniques. In Figure 3 we illustrate the main differences between supervised learning and RL from the perspective of explainability:
(1) State independence: one of the main assumptions of supervised learning algorithms is the independence of instances in the training data. RL, however, focuses on sequential tasks where states are temporally connected. There is an inherent causal structure between the visited states and chosen actions during the agent’s execution. A certain state is only visited because previous states and actions led to it [20]. These causal links can be important components for explaining the outcomes of the RL agent’s behavior. Figure 3(a) illustrates using previously visited states to explain the current decision. The agent justifies its decision to choose the action right by remembering that it had previously picked up the blue key and should navigate towards the blue goal.
(2) Sequential execution: supervised learning approaches are limited to one-step prediction tasks, while the RL framework specializes in sequential problems. While methods for explaining supervised learning models only need to rely on model parameters and input features to explain a prediction, the causes of an RL agent’s decisions can lie in its past or future [20]. For this reason, explanations of decisions in RL cannot be confined to the current time step but may need to include temporally distant causes. Figure 3(b) shows how temporally distant states can motivate an agent’s decision.
(3) Explanation components: in supervised learning, a model’s decision is explained as a function of input features and model parameters. In RL, however, an agent’s decision-making process can be more complex and can include plans, goals, or unexpected events. To fully understand the behavior of RL agents, explanations often cannot be limited to state features alone, but should also include other possible causes of decisions. For example, in Figure 3(c) an agent can explain its decision by comparing two different objectives it balances between. Even though the agent is choosing the fastest way to the goal, stepping on the ice can lead to a large penalty. To understand why an agent would choose a potentially dangerous action, it is necessary to know that it prefers speed over safety.
(4) Training data set: while a necessary component of supervised learning approaches, the training data set does not have a natural equivalent in the RL framework. For this reason, explainability methods for supervised learning that rely on training data sets (e.g., LIME, counterfactuals) might have to be adjusted to be applicable to RL tasks.
Supervised learning explainability approaches are limited to one-step prediction tasks, and as such have limited applications in RL. While they can be used to explain a decision based on the current state features, an RL agent’s decisions are often influenced by a wider range of causes and require richer, temporally extended explanations. For that reason, there is a need for RL-specific explainability techniques that account for the particularities of the framework.
4 Counterfactual Explanations
At the moment, a substantial amount of research in XAI is focused on developing explanations for experts familiar with AI systems. From the perspective of non-expert users, however, such explanations might be too detailed and difficult to understand. End-users are more likely to be interested in more abstract explanations of the system [21]. For example, consider the 2018 incident in which an Uber self-driving vehicle killed a pedestrian after failing to avoid them while they were crossing the road [47]. In this situation, developers and experts might be interested in uncovering which parameters or input features contributed to the fatal failure, in order to be able to repair it. A non-expert user, on the other hand, is more likely to be interested in high-level questions, such as “Did the car not recognize the person in front of it?” or “Did the car believe it had the right of way?” [21]. Although research on low-level, expert-oriented explanations has gained momentum, user-friendly explanation methods are less explored. However, developing user-friendly explanations is necessary in order to build trust and encourage user interaction with black-box systems.
In this work, we explore one of the most notable examples of user-friendly explanations – counterfactual explanations. Counterfactuals are local explanations that attempt to discover the causes of an event by answering the question: “Given that input x has been classified as y, what is the smallest change to x that would cause it to be classified as an alternative class \(y^{\prime }\)?”. For example, counterfactuals can be applied to answer questions such as “Since my application was rejected, what are some examples of successful loan applications similar to mine?” or “How can I change my application for it to be accepted in the future?”. The explanation is usually given in the form of a counterfactual instance – an instance \(x^{\prime }\) as similar as possible to the original x, but producing an alternative outcome \(y^{\prime }\). Formally, according to Wachter et al. [96], counterfactual explanations can be defined as statements:
Score y was returned because variables V had values \((v_1, v_2, \ldots)\) associated with them. If V instead had values \((v_1^{\prime }, v_2^{\prime }, \ldots)\), and all other variables had remained constant, score \(y^{\prime }\) would have been returned.
The statement consists of two main factors: the variables that the user can alter, and the score. In supervised learning, the variables represent the input features that the black-box model uses to make a prediction, for example, pixels in image-based tasks. The prediction of the black-box model represents the score. The alternative outcome \(y^{\prime }\) can either be specified or omitted, in which case the counterfactual can be an instance producing any outcome other than y. Current methods either directly search for a counterfactual instance \(x^{\prime }\) similar to x, or for a small perturbation \(\delta\) that can be added to x to achieve the desired outcome, in which case \(x^{\prime } = x + \delta\). A summary of the terminology used throughout the work is given in Table 2.
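To make the perturbation view concrete, the sketch below performs a gradient search for \(\delta\) under a simple, fully known logistic scorer. This is only an illustration of the general idea behind such objectives, not a reproduction of any particular method; the model, hyperparameters, and function names are our own assumptions.

```python
import numpy as np

def perturbation_counterfactual(x, y_target, w, b, lam=0.1, lr=0.1, steps=200):
    """Gradient search for delta such that x' = x + delta is classified as y_target.

    Assumes a differentiable scorer p(y=1|x) = sigmoid(w.x + b); the loss trades
    off the target-class log-loss against the L2 size of the perturbation:
        L(delta) = -log p(y_target | x + delta) + lam * ||delta||^2
    """
    delta = np.zeros_like(x)
    for _ in range(steps):
        z = w @ (x + delta) + b
        p = 1.0 / (1.0 + np.exp(-z))            # probability of class 1
        grad_pred = (p - y_target) * w          # gradient of the log-loss w.r.t. delta
        grad = grad_pred + 2.0 * lam * delta    # add gradient of the distance term
        delta -= lr * grad
    return x + delta

w, b = np.array([1.0, 2.0]), -1.5               # toy linear classifier
x = np.array([0.2, 0.3])                        # currently predicted as class 0
print(perturbation_counterfactual(x, 1, w, b))  # perturbed instance pushed towards class 1
```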
The following properties make counterfactuals user-friendly explanations:
(1) Actionable: counterfactual explanations give the user actionable advice by identifying which parts of the input should be changed for the model’s output to change. Users can follow this advice to alter their input and obtain the desired outcome.
(2) Contrastive: counterfactual explanations compare real and imagined worlds. Contrastive explanations have been shown to be the preferred way for humans to understand the causes of events [63].
(3) Selective: counterfactual explanations are often optimized to change as few features as possible, making it easier for users to implement the changes. This also corresponds to the human preference for shorter rather than longer explanations [65].
(4) Causal: counterfactuals correspond to the third tier of Pearl’s Ladder of Causation [69] (Table 3). The first rung of the hierarchy corresponds to associative reasoning and answers the question “If I observe X, what is the probability of Y occurring?”. This rung also aligns with statistical reasoning, as the probability in question can be estimated from data. The second rung requires interventions to answer the question “How would Y change if I changed X?”. Finally, the third rung addresses the question “Was it X that caused Y?”. Answering this question requires imagining alternative worlds where X did not happen and estimating whether Y would occur in such circumstances.
(5) Inherent to human reasoning: humans rely on generating counterfactuals in their everyday lives. By imagining alternative worlds, humans learn about cause-effect relationships between events. For example, by considering the counterfactual claim “Had the car driven slower, it would have avoided the accident”, the relationship between the car’s speed and the accident can be deduced. Additionally, humans use counterfactuals to assign blame or fault for a negative event [11].
As user-friendly explanations, counterfactuals are important for ensuring the trust and collaboration of non-expert users of the system. Since they offer actionable advice that may be used to change a user’s real-life circumstances, counterfactual explanations can strongly influence user trust. While useful counterfactuals can help the user better navigate the AI system, a counterfactual that proposes unrealistic changes or does not lead to the desired outcome can be costly and frustrating, and further erode the user’s trust in the system. For this reason, it is necessary to provide a way to ensure the usefulness and evaluate the quality of counterfactual explanations.
4.1 Properties of Counterfactual Explanations
Due to their potential influence on user trust, there is a need to develop metrics for evaluating the usefulness of counterfactual explanations. Additionally, counterfactual explanations can suffer from the Rashomon effect – it is possible to generate a large number of suitable counterfactuals for a specific instance [65]. In this case, further analysis and comparison are necessary to select the most useful ones. For this reason, multiple criteria for assessing the quality of generated counterfactual explanations have been proposed [92]; a minimal sketch of how several of these criteria can be computed follows the list:
(1) Validity: the counterfactual must be assigned by the model to a different class from that of the original instance. If a specific counterfactual class \(y^{\prime }\) is provided, then validity is satisfied if the counterfactual is classified as \(y^{\prime }\).
(2) Proximity: the counterfactual \(x^{\prime }\) should be as similar as possible to the original instance x.
(3) Actionability: the counterfactual explanation should provide users with meaningful insight into how they can change their features in order to achieve the desired outcome. This means that suggesting that a user change sensitive features (e.g., race) or features that are immutable (e.g., age, country of origin) should be avoided, as it is not helpful and may be offensive.
(4) Sparsity: ideally, the counterfactual should require only a few features of the original instance to be changed. This corresponds to the notion of a selective explanation, which humans find easier to understand [65].
(5) Data manifold closeness: counterfactual instances rely on modifying existing data points, which can lead to the generation of out-of-distribution samples. To be feasible for users, counterfactuals need to be realistic.
(6) Causality: while modifying the original sample, counterfactuals should abide by the causal relationships between features. For illustration, consider an example by Mahajan et al. [62] of a counterfactual which suggests that the user should increase their education to a Master’s degree, without adjusting their age. Even though such a counterfactual instance would be actionable and fall within the data manifold (as there probably are people of the same age as the user but with a Master’s education), the proposed changes would not be feasible for the current user.
(7) Recourse: ideally, the user should be offered a set of discrete actions that can transform the original instance into the counterfactual. This property is closely related to actionability and causality – the offered sequence of actions should not change immutable features and should consider causal relationships between the variables, to account for how changing one feature can influence the others. The field of recourse started out separate from counterfactual explanations, but as explanation methods have become more elaborate and user-centered, this line has been blurred [92].
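As a minimal sketch of how some of the criteria above can be scored in practice, the function below checks validity, proximity, sparsity, and a simple form of actionability for a candidate counterfactual. The remaining properties require task-specific models and are omitted; all names and the toy model are our own illustrations.

```python
import numpy as np

def evaluate_counterfactual(x, x_cf, y_target, model_predict,
                            immutable_mask=None, eps=1e-8):
    """Score a candidate counterfactual against a subset of the properties above.

    Returns validity, proximity (L2 distance), sparsity (number of changed
    features) and actionability (no immutable feature was modified).
    """
    changed = np.abs(x_cf - x) > eps
    report = {
        "validity": model_predict(x_cf) == y_target,
        "proximity": float(np.linalg.norm(x_cf - x)),
        "sparsity": int(changed.sum()),
    }
    if immutable_mask is not None:
        report["actionable"] = not np.any(changed & immutable_mask)
    return report

# toy usage with a threshold "model" and one immutable feature
model = lambda z: int(z[0] + z[1] > 1.0)
x, x_cf = np.array([0.2, 0.3]), np.array([0.9, 0.3])
print(evaluate_counterfactual(x, x_cf, 1, model,
                              immutable_mask=np.array([False, True])))
```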
6 Counterfactual Explanations in Reinforcement Learning
In the previous section, we provided an overview of the state-of-the-art methods for generating counterfactual explanations in supervised learning tasks. In RL, however, counterfactual explanations have seldom been applied. In this section, we explore how counterfactual explanations differ between supervised and reinforcement learning tasks and identify the main challenges that prevent the direct transfer of methods from supervised learning to RL. Additionally, we redefine counterfactual explanations and counterfactual properties for RL. Finally, we identify the main research directions for implementing counterfactual explanations in RL.
6.1 Existing Methods for Counterfactual Explanations in RL
Although the potential of counterfactual explanations for explaining RL agents has been recognized [20], there are currently few methods enabling counterfactual generation in RL. Olson et al. [67] proposed the first method for generating counterfactuals for RL agents. In their work, counterfactuals are generated using a generative deep-learning architecture. The authors explain the decisions of RL agents in Atari games and search for counterfactuals in the latent space of the policy network. The approach assumes that the policy is represented by a deep neural network and starts by dividing the network into two parts. The first part A contains all but the last hidden layer of the network and serves to map an input state s into a latent representation z: \(A(s) = z\). The second part \(\pi\) consists only of the last fully connected layer followed by a softmax and is used to provide action probabilities based on the latent state. The policy network can thus be viewed as the composition \(\pi (A(s))\). To generate counterfactual states, the authors propose a deep generative model consisting of an encoder (E), a discriminator (D), a generator (G), and a Wasserstein encoder (\(E_w, D_w\)). Encoder E and generator G work together as an encoder-decoder pair, with the encoder mapping the input state into a low-dimensional latent representation and the generator reconstructing the original state from the encoding. Discriminator D learns to predict the probability distribution over actions \(\pi (z)\) given a latent state representation \(z = E(s)\). Counterfactuals are searched for in the latent space, which may contain holes, resulting in unrealistic instances. For this reason, the authors include a Wasserstein encoder \((E_w, D_w)\) that maps the latent state z to an even lower-dimensional representation \(z_w\), to obtain a more compact representation and ensure more realistic counterfactuals. The generative architecture is trained on a set of the agent’s transitions in the environment \(\mathcal {X} = \lbrace (s_1, a_1),\dots ,(s_n, a_n)\rbrace\). Finally, to generate a counterfactual state \(s^{\prime }\) of state s, the authors first locate the latent representation \(z_w^*\) of \(s^{\prime }\) by solving an optimization problem over the latent space in which \(a^{\prime }\) is the counterfactual action; the objective can be simplified to the form given in Equation (5). The process of finding the counterfactual instance then consists of encoding the original state into the Wasserstein encoding \(z_w = E_w(A(s))\) and optimizing Equation (5) to find \(z_w^*\). The latent counterfactual instance \(z_w^*\) can then be decoded to obtain \(\pi (D_w(z_w^*))\), which is finally passed to the generator to obtain the counterfactual instance \(s^{\prime }\). This approach can generate realistic image-based counterfactual instances that can be used to explain the behavior of Atari agents. However, the approach is model-specific and requires access to the RL model’s internal parameters.
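The sketch below captures the high-level idea of such a latent-space search: encode the state, locally optimize the latent point so that the policy assigns more probability to the counterfactual action while staying close to the original encoding, and decode the result. The objective and the hill-climbing optimizer are our simplified stand-ins for the actual objective (Equation (5)); the encode, decode, and policy callables are placeholders, not the authors’ trained networks.

```python
import numpy as np

def latent_counterfactual(s, a_cf, encode, decode, policy_probs,
                          step=0.05, iters=500, rng=None):
    """Hill-climbing sketch of a latent-space counterfactual search.

    `encode` stands for the composition E_w(A(s)) from the text, `decode` for the
    generative path back to a state, and `policy_probs` for the agent's action
    distribution as a function of the latent point. The objective (maximize the
    probability of the counterfactual action a_cf while staying close to the
    original encoding) is an assumed simplification of the method's objective.
    """
    rng = rng or np.random.default_rng(0)
    z0 = np.asarray(encode(s), dtype=float)
    z = z0.copy()

    def score(zz):
        return policy_probs(zz)[a_cf] - 0.1 * np.linalg.norm(zz - z0)

    for _ in range(iters):
        cand = z + step * rng.standard_normal(z.shape)
        if score(cand) > score(z):
            z = cand
    return decode(z)   # counterfactual state s'

# toy stand-ins just to exercise the function
encode = lambda s: s[:2]
decode = lambda z: np.concatenate([z, np.zeros(2)])
policy_probs = lambda z: np.exp(z) / np.exp(z).sum()
print(latent_counterfactual(np.array([0.1, 0.2, 0.0, 0.0]), 0,
                            encode, decode, policy_probs))
```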
In contrast, Huber et al. [37] propose GANterfactual-RL, a model-agnostic generative approach based on the StarGAN [15] architecture for generating realistic counterfactual explanations. They approach the counterfactual search problem as a domain translation task, where each domain is associated with the states in which the agent chooses a specific action. A suitable counterfactual can be found by translating the original state into a similar state from the target domain. In this way, obtaining a counterfactual involves changing only the features that are relevant to the agent’s decision, while maintaining the others. The approach consists of training two neural networks – a discriminator D and a generator G. The generator G is trained to translate the input state x into an output state y, conditioned on the target domain c. The role of the discriminator is to distinguish between real states and fake ones produced by G.
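Once such a translator is trained, generating a counterfactual reduces to a single forward pass conditioned on the target action, optionally followed by a validity check against the agent’s policy. The sketch below illustrates this usage; the generator, policy, and their toy stand-ins are our own placeholders, not the GANterfactual-RL implementation.

```python
import numpy as np

def domain_translation_counterfactual(state, target_action, generator, policy):
    """Domain-translation view of counterfactual generation (sketch).

    `generator(state, domain)` is assumed to be an already trained StarGAN-style
    translator whose domains correspond to the agent's actions; the counterfactual
    is the translation of the state into the domain of the target action.
    """
    counterfactual = generator(state, target_action)
    is_valid = policy(counterfactual) == target_action
    return counterfactual, is_valid

# toy stand-ins: a "translator" that nudges the first feature, and a threshold policy
generator = lambda s, c: np.r_[s[0] + (1 if c == 1 else -1), s[1:]]
policy = lambda s: int(s[0] > 0.5)
print(domain_translation_counterfactual(np.array([0.2, 0.7]), 1, generator, policy))
```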
The current approaches for generating counterfactuals in RL do not deviate from similar approaches in supervised learning and do not account for the additional complexities of the RL framework presented in Section 3. For example, the counterfactuals generated by Huber et al. [37] and Olson et al. [67] rely only on state features to explain a decision and do not include other explanation components, such as goals or objectives, that can influence the agent’s behavior. Similarly, these approaches do not address the issue of temporality and can generate counterfactuals that are close to the original in terms of features but might be far apart in execution.
While current methods in supervised learning and RL generate counterfactuals suitable for one-step prediction tasks, they do not account for the explanatory requirements of RL tasks. In the following sections, we redefine counterfactual explanations to fit within the RL framework and explore their desired properties. We also identify the main research directions that need to be addressed before counterfactuals can be successfully used to explain the decisions of RL agents.
6.2 Counterfactual Explanations Redefined
In RL, current counterfactual explanations focus on interpreting a single decision in a specific state [37, 67]. The variables are simply state features, and the score is the agent’s choice of action in that state. This corresponds to the definition of counterfactuals used in supervised learning, where input features are replaced by state features and the model’s prediction of a class label is replaced by the choice of action. However, unlike supervised learning models, RL agents deal with sequential tasks where states are temporally connected. This means that an agent’s action in a single state cannot be fully explained by taking into account only the current state features. Additionally, an agent’s behavior cannot be fully explained by interpreting single actions, as individual actions are often part of a larger plan or policy. To be able to generate counterfactual explanations that can fully encompass the complexities of RL tasks, we need to redefine the notions of variable and score from the RL perspective. Examples of possible counterfactual explanations in RL after redefining the terms variable and score are presented in Table 5.
6.2.1 Variables.
To redefine variables from the perspective of RL tasks, we explore the different possible causes of decisions for RL agents. According to the causal attribution model presented by Dazeley et al. [20], an agent’s decisions can be directly or indirectly affected by five factors – perception, goals, disposition, events, and expectations. While Dazeley et al. [20] also provide a causal structure connecting the different factors, in this work we focus only on their individual effects on the agent’s decision.
Current counterfactual explanations use an agent’s perception as the variable and only consider state features as the possible causes of an outcome. Counterfactual questions can then be phrased as “Given that the agent chose action a in a state with state features \(f_1, \ldots , f_k\), how can the features be changed for the agent to choose an alternative action \(a^{\prime }\)?” This type of explanation is most suitable for single-goal, single-objective, deterministic tasks, where each action depends only on the state the agent is currently perceiving. In such tasks, it can be assumed that the user knows the ultimate goal of the agent’s behavior and that the agent does not need to balance different objectives. In such environments, the agent’s actions have deterministic consequences and unexpected events cannot influence its behavior. This corresponds to simple environments, such as deterministic games, where the agent’s goal is to win by optimizing a single objective function. In more complex, real-world environments, however, relying only on the current state to make a decision is unlikely to suffice, and additional factors might need to be taken into account.
An agent’s current goal can also influence an outcome. Counterfactual explanations can then be used to address the question “Given that the agent chose action a in state s while following goal G, for what alternative goal \(G^{\prime }\) would the agent have chosen action \(a^{\prime }\)?”. This question is especially useful in multi-goal or hierarchical RL tasks, where the agent has a set of goals that need to be fulfilled and can switch between them. In such environments, knowing which goal the agent is pursuing in a state is necessary to fully understand its behavior. For example, consider an autonomous taxi vehicle. The task of the taxi agent can be split into two subgoals – picking up passengers and delivering them to their destinations. A passenger picked up by such a taxi might be confused if the agent takes a longer route to their destination. The passenger might lose trust in the agent and even suspect that it is lost or trying to trick them. However, if the agent is allowed to explain its behavior with the counterfactual statement “I am following the goal of picking up a different passenger. Had I followed the goal of delivering the current passenger, I would have taken the shorter route to the destination.”, the user can be reassured that the agent is indeed following the best course of action.
Similarly, in multi-objective environments, an agent’s preference for a specific objective can guide its behavior. This corresponds to the effect of the agent’s disposition on its decisions, as described by Dazeley et al. [20]. The counterfactual question can then be posed as: “Given that the agent chose action a in state s while preferring objective O, for what alternative objective preference \(O^{\prime }\) would the agent have chosen action \(a^{\prime }\)?”. An agent’s internal preference for one objective over another is likely to influence its actions, and this information is necessary to fully understand the agent’s behavior. Many real-life tasks fit this description. For example, the behavior of an autonomous driving agent is at each time step guided by different objectives such as speed, safety, fuel consumption, and user comfort. Different agents can have different preferences with regard to these objectives. If an agent does not slow down in a critical situation and, as a consequence, ends up colliding with the vehicle in front of it, a counterfactual explanation could identify that, had the agent preferred safety over speed in the critical state, it would have chosen to slow down and the accident would have been avoided. This type of explanation can also be useful for developers when designing multi-objective reward functions for complex tasks. Deciding on the correct weights for different objectives is a notoriously challenging task and requires substantial time and effort [56]. Understanding how different objective preferences influence an agent’s decisions, especially in critical states, can be useful for adjusting the appropriate weights.
Unexpected outside events can alter an agent’s behavior and as such should also be featured in counterfactual explanations. The counterfactual question can then be defined as “Given that the agent chose action a in a state under an unexpected event E, for which event \(E^{\prime }\) would the agent have chosen action \(a^{\prime }\)?”. For example, consider an autonomous driving agent that had to change its trajectory due to an unexpected blockage on the road. The user might be confused as to why the agent decided on the longer path, and a counterfactual explanation could restore their trust in the system by asserting that “Had the shorter path not been blocked, I would have taken it.”. This corresponds to the field of safe RL, which argues that, to be safely deployed, agents must be able to avoid danger and correctly respond to unexpected events [29].
Finally, expectations can also influence an agent’s behavior. Expectation refers to the set of social and cultural constraints that are imposed on the agent and can affect its decisions [20]. These can be anything from social conventions and laws to ethical rules. Counterfactual explanations could focus on how changing expectations affects an agent’s behavior by posing the question “Had expectations for the agent been different, would it still make the same decisions?”. Including expectations in explanations can help frame an agent’s behavior in its broader social and cultural context. Users who are not familiar with the expectations the agent is following may find it strange if the agent diverges from its primary goals due to an expectation. Explanations that identify social expectations as the cause of unexpected behavior can help users understand and regain trust in the model.
While decisions of a supervised learning model depend only on the input features, the RL agent’s behavior is often motivated by a wider range of causes, such as goals, objectives, and expectations. For that reason, the definition of a variable needs to be extended to include all factors influencing an agent’s decisions.
6.2.2 Score.
In supervised learning, the only reasonable outcome of the model’s behavior for a specific input instance is its prediction. In RL tasks, however, there are multiple possible scores that could describe an agent’s behavior in a specific state. Current methods for counterfactual explanations in RL use the action chosen in the state as the score. However, each of the agent’s decisions often does not exist on its own, but as part of a larger plan or policy. Relying only on the current action might not be enough to capture the desired change in behavior, especially in complex, continuous environments. In game environments, such as chess or StarCraft, the user might be more interested in knowing why the agent chose one specific plan or sequence of steps instead of another. Similarly, in an autonomous driving task with continuous control, the user is unlikely to ask the agent why it changed the steering angle by a small amount, but might be interested in knowing why it chose a specific route over another. Interpreting a single action in a complex or continuous environment might be too low-level for non-expert users and better suited for experts or developers. Non-expert users tend to be more comfortable with high-level explanations corresponding to plans or policies. To be user-friendly, counterfactual explanations in RL have to be extended to cover the more complex behavior that is inherent to RL agents.
While supervised learning models focus on one-step prediction tasks, RL agents learn sequential behavior that consists of plans and policies. For this reason, counterfactual explanations need to account for the broader scope of RL behavior.
6.3 Counterfactual Properties Redefined
Counterfactual properties are metrics for deciding on the most suitable counterfactual to be presented to the user. It is therefore important that counterfactual properties capture the necessary desiderata for choosing useful counterfactuals. Providing a user with unattainable or ineffective counterfactuals can cause frustration and erode their trust in the system. So far, counterfactual properties have been defined with supervised learning tasks in mind, where only input features are changed in order to obtain a desired prediction. Since counterfactual explanations in RL are not limited to state features and one-step decisions, counterfactual properties need to be redefined for use in RL tasks:
(1) Validity is measured as the extent to which the desired score is achieved in the counterfactual instance. In RL, the score can refer to the agent’s current action, plan, or policy. Depending on the chosen definition of the score, validity describes whether the agent in the counterfactual instance chose the desired action or followed the desired plan or policy.
(2) Proximity is estimated as the difference between the original and counterfactual instances. In supervised learning, similarity is calculated as a function of input features. In RL, however, proximity can involve not only state features but also other explanation factors, such as goals or objectives, which need to be included in the calculation. Additionally, while in supervised learning data samples are independent of each other, states in RL tasks are temporally connected. Offering the user a counterfactual instance that is similar to the original in terms of state features but temporally distant might be useless. For this reason, temporality must also be considered when calculating the proximity of two states in RL.
(3) Actionability specifies the need to treat certain features (e.g., race, country of origin) as immutable. In supervised learning, developers are required to define which features should not be changed. In RL, however, certain immutable features can also be imposed by the environment itself. For example, although goal coordinates exist in the feature space, changing them might be impossible under the rules of the environment. Suggesting to the user that they should obtain a counterfactual where the goal is moved would not be actionable. For this reason, the definition of actionability in RL needs to be expanded to restrict changes not only to developer-defined features but also to those that are rendered immutable by the rules of the environment.
(4) Sparsity requires that as few features as possible are changed in order to obtain the counterfactual instance. In current work, it is often assumed that there is no difference in the cost of changing individual features. In RL, however, counterfactual instances can be obtained not only by changing state features but also goals or objectives. Changing a single state feature or changing the agent’s goal may pose different levels of difficulty for the user, and a sparsity metric should reflect this potential cost inequality in RL tasks.
(5) Data manifold closeness refers to the requirement that counterfactual instances should be realistic, so that users are able to achieve them. The main challenge in estimating data manifold closeness lies in defining what is meant by a realistic instance. In supervised learning, the training data set is used to represent the space of realistic instances, and the data manifold closeness of an instance can be estimated with respect to this space. In RL, however, a training data set is not readily available. For this reason, a more precise definition of realistic instances and accompanying metrics for evaluating data manifold closeness are needed in RL.
(6) Causality refers to the notion that counterfactual instances should obey the causal constraints between features. In supervised learning, generating causally correct counterfactual instances is only possible if the causal relationships between input features are known. In RL, however, causal rules are ingrained in the environment. Relying on the actions provided by the environment can ensure that features are not changed independently, but in accordance with the causal rules of the environment.
(7) Recourse is a sequence of actions the user should perform to transform the original instance into the counterfactual. In supervised learning, the user is in theory allowed to perform whichever actions they desire, provided that they conform to the causality and actionability constraints. In certain RL tasks, however, the user’s actions might be limited to those available in the environment. In game environments, for example, the user can only change the game state by performing actions that obey the rules of the game. For illustration, consider StarCraft, a two-player strategy game where the goal is to build an army and defeat the opponent. If the user wants to know why the RL agent did not choose an attack action in a specific state, the system can answer by providing recourse: “In order for attack to be the best decision, you first need to perform action build_army.” To follow the provided advice, the user must rely only on actions available in the environment.
Counterfactual properties guide the search for the counterfactual instance and as such directly influence which explanations will be presented to the user. In RL, counterfactual properties need to be extended from their supervised learning counterparts to encapsulate the complexities of RL tasks. The introduction of different types of variables and scores to counterfactual explanations in RL affects all counterfactual properties. Additionally, the temporal nature of RL needs to be considered when evaluating proximity, one of the properties that all previous works optimize. On the other hand, while supervised learning lacks causal information about input features, this knowledge is often ingrained within the RL environment and can potentially be used to generate causally correct counterfactuals and recourse.
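To illustrate how such redefined properties could be operationalized, the toy proximity measure below combines state-feature distance with the cost of changing the agent’s goal and with the temporal distance between the two instances. The component costs, the dictionary representation, and the function itself are illustrative assumptions rather than established metrics.

```python
import numpy as np

def rl_proximity(orig, cf, costs=None):
    """Toy proximity for RL counterfactuals combining several components.

    `orig` and `cf` are dicts with 'state' (feature vector), 'goal' (a label) and
    't' (time step along the execution). The per-component costs are illustrative:
    changing the goal is priced higher than changing state features, and temporal
    distance contributes as well.
    """
    costs = costs or {"state": 1.0, "goal": 5.0, "temporal": 0.5}
    d_state = np.linalg.norm(cf["state"] - orig["state"])   # feature distance
    d_goal = float(cf["goal"] != orig["goal"])              # goal changed or not
    d_time = abs(cf["t"] - orig["t"])                       # steps apart in execution
    return (costs["state"] * d_state
            + costs["goal"] * d_goal
            + costs["temporal"] * d_time)

orig = {"state": np.array([0.1, 0.4]), "goal": "A", "t": 3}
cf = {"state": np.array([0.1, 0.9]), "goal": "B", "t": 7}
print(rl_proximity(orig, cf))
```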
6.4 Challenges and Open Questions
So far, we have shown how counterfactual explanations differ between supervised and reinforcement learning and have redefined them for RL tasks. The question remains, however: how can suitable counterfactuals be generated for RL tasks? Given the large amount of research on counterfactual explanations in supervised learning and the similarities between the two learning paradigms, in this section we explore whether this substantial body of work can be applied to RL. We identify the main challenges and open research directions that should be tackled before counterfactual explanations can be generated for RL tasks.
6.4.1 Counterfactual Search Space.
In supervised learning, the search for a counterfactual instance is often performed over the training data set, an essential component of every supervised learning task. Additionally, the training data set is used to represent the space of realistic instances when measuring the data manifold closeness of counterfactual instances. Intuitively, if an instance is consistent with the data distribution defined by the training data set, it is considered realistic. In RL, however, the agent learns the task from scratch and no training data set is available. For that reason, the search space corresponding to realistic instances needs to be redefined for RL tasks.
The most straightforward way to approximate a training data set in RL is to obtain a data set of states by unrolling the agent’s policy. Existing loss-based approaches can then be optimized over such a data set to find the most useful counterfactual instance. Data manifold closeness of counterfactual instances could then be estimated in relation to this data set, and realistic instances would correspond to those states that the agent is likely to visit by following its policy. This approach makes sense for counterfactual instances that can be obtained by changing only the state features; the counterfactual instance should then be reachable by the same policy the agent is employing. However, if the search also involves changing the agent’s goals or dispositions, the counterfactual instance is not likely to be found within the agent’s current policy. For example, consider a multi-goal agent with two subgoals A and B, explaining why decision \(a^{\prime }\) was not taken instead of a in state s. A counterfactual instance might explain the decision by claiming that, had the agent been following subgoal B instead of A, action \(a^{\prime }\) would have been executed instead of action a. However, this counterfactual instance is unlikely to occur under the agent’s policy \(\pi\), which is already following goal A in state s. For this reason, further research is needed to understand how the search space for counterfactual instances should be defined.
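A minimal sketch of this idea is shown below: the agent’s policy is unrolled for several episodes, and the visited state-action pairs serve as the candidate pool (and empirical data manifold) for counterfactual search. The function assumes a Gym-style environment interface and is our own illustration, not an established procedure.

```python
def collect_policy_states(env, policy, episodes=10, max_steps=200):
    """Approximate a 'training set' for counterfactual search by unrolling the policy.

    Assumes a Gym-style environment (reset() -> (obs, info), step(a) ->
    (obs, reward, terminated, truncated, info)). The returned (state, action)
    pairs define the space of instances the agent actually visits under its
    current policy.
    """
    dataset = []
    for _ in range(episodes):
        obs, _ = env.reset()
        for _ in range(max_steps):
            action = policy(obs)
            dataset.append((obs, action))
            obs, reward, terminated, truncated, _ = env.step(action)
            if terminated or truncated:
                break
    return dataset
```

The resulting data set could then be plugged into a loss-based search such as the one sketched in Section 5, with the caveat discussed above that goal- or disposition-changing counterfactuals may lie outside it.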
6.4.2 Categorical Variables.
Most of the current state-of-the-art methods for generating counterfactual explanations in supervised learning do not support categorical input features. This problem naturally extends to RL, where state features can often be categorical. In RL, however, the problem is further exacerbated by the use of alternative explanation components such as goals, dispositions, or events. These components often represent high-level concepts and as such are often discrete. Understanding how discrete variables can be integrated with existing counterfactual methods is necessary for their use in RL tasks.
6.4.3 Temporality.
One of the main purposes of counterfactual explanations is to show the user how the input features can be changed in the most suitable way to alter the output of a black-box model. Intuitively, this means that the counterfactual needs to be reachable from the original instance. In supervised learning, many counterfactual properties such as proximity, sparsity, and data manifold closeness are used to guide the search toward easily obtainable counterfactuals. The notion of reachability, however, is not contained within the supervised learning task – all samples are considered independent, and whether one sample can be obtained from another often depends on a human-defined interpretation of terms such as immutability, causality, or sparsity. In most supervised learning works, proximity is used as a proxy for reachability, and instances with similar features are considered easily reachable. In RL, however, due to its sequential nature, states are temporally connected. Reachability in RL tasks is ingrained in the environment and depends on how temporally distant two states are in terms of execution.
This means that two states with almost identical state feature values might be distant from each other in terms of execution. For illustration, consider the chess example presented in Figure 5. The generated counterfactual for a chess position can satisfy the validity, sparsity, and data manifold closeness constraints and be very similar to the original instance in terms of state features. However, if it is not reachable from the original instance under the game rules, recourse cannot be provided based on such a counterfactual. In other words, the counterfactual is not useful to the user if they cannot reach it by following the game rules. For this reason, temporal similarity needs to be considered along with feature similarity when calculating proximity. Further research is necessary to define metrics and integrate temporal similarity into the search for counterfactuals in RL.
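As a simple illustration of temporal reachability, the sketch below estimates the number of environment steps separating two states by breadth-first search over a hypothetical one-step transition function. In realistic, high-dimensional environments this exact search quickly becomes infeasible and the distance would have to be approximated; the interface and toy example are our own assumptions.

```python
from collections import deque

def temporal_distance(start, goal, neighbors, max_depth=50):
    """Breadth-first estimate of how many environment steps separate two states.

    `neighbors(state)` is assumed to return the states reachable in one step
    under the rules of the environment (e.g., legal moves in a board game).
    Returns the number of steps, or None if `goal` is not reachable within
    `max_depth` steps; feature-space proximity alone would miss this.
    """
    frontier = deque([(start, 0)])
    visited = {start}
    while frontier:
        state, depth = frontier.popleft()
        if state == goal:
            return depth
        if depth == max_depth:
            continue
        for nxt in neighbors(state):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, depth + 1))
    return None

# toy usage on a 1-D chain where each step moves the state by +/-1
neighbors = lambda s: [s - 1, s + 1]
print(temporal_distance(0, 5, neighbors))   # 5
```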
6.4.4 Stochasticity.
Most methods for providing recourse in supervised learning assume that model decisions and feature changes are deterministic and that there exists one true sequence of actions for obtaining the counterfactual instance. In many RL tasks, however, the environment can be stochastic and influence the agent’s behavior. Events outside of the agent’s control can change the environment, and the agent’s actions can have stochastic consequences. This is also a characteristic of real-life problems, where decision models can change over time and performing an action might not have a deterministic outcome. For example, the threshold for a loan can change over time, and even if the user follows the recourse offered a year ago, their application might still get rejected. This inconsistency between advice and results can be frustrating for the user and undermine their trust in the system. For that reason, counterfactual explanations in RL should account for the stochastic nature of the environment when offering recourse to users. For example, the counterfactual search could incorporate a certainty constraint, ensuring that the user is presented not only with the fastest but also the most reliable path towards their desired outcome.
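One possible way to account for stochasticity is to estimate how reliably a proposed recourse plan reaches the desired outcome, for example by Monte Carlo simulation. The sketch below illustrates this; the simulator interface and the toy environment are our own assumptions rather than part of any existing method.

```python
import random

def recourse_success_probability(env_factory, start_state, actions,
                                 is_desired, n_rollouts=1000):
    """Monte Carlo estimate of how likely a recourse plan is to reach the goal.

    In a stochastic environment the same action sequence can lead to different
    outcomes, so we repeatedly replay the suggested actions and measure how often
    the desired outcome is reached. `env_factory(start_state)` is assumed to
    return a fresh simulator exposing step(action) -> state.
    """
    successes = 0
    for _ in range(n_rollouts):
        env = env_factory(start_state)
        state = start_state
        for a in actions:
            state = env.step(a)
        successes += is_desired(state)
    return successes / n_rollouts

# toy usage: each "step" succeeds with probability 0.9
class NoisyChain:
    def __init__(self, state): self.state = state
    def step(self, a):
        if random.random() < 0.9:
            self.state += a
        return self.state

print(recourse_success_probability(NoisyChain, 0, [1, 1, 1], lambda s: s >= 3))
```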
6.4.5 Alternative Explanation Components.
Current methods for counterfactual explanations in RL rely only on state features as causes of a decision. However, in RL, various other components such as goals, objectives, or outside events can contribute to the agent’s behavior and should be captured in counterfactual explanations. Especially in the fields of multi-goal and multi-objective RL, diverse counterfactual explanations that capture the wide range of possible causes are necessary in order to fully comprehend the agent’s decisions. Further research in the integration of different causes into counterfactual explanations is needed.
Developing metrics for evaluating counterfactual properties in RL is another important research direction. So far, counterfactual properties have been defined and evaluated with supervised learning in mind, and research is required into how the redefined counterfactual properties for RL tasks can be evaluated. For example, calculating proximity for RL states might require considering temporal similarity, but it is not clear how to efficiently estimate how easily one state can be reached from another in RL. A brute-force approach would involve a shortest-path search over the graph of all possible transitions in the environment; in high-dimensional environments, however, this approach would be prohibitively expensive, and distances between states would need to be estimated. Similarly, research is needed into estimating data manifold closeness for counterfactual instances in RL. While a training data set can be used in supervised learning to represent the space of realistic instances, in RL tasks there is no natural equivalent of the training data set, so an alternative metric for evaluating data manifold closeness is necessary. For the purposes of evaluating sparsity and proximity, developing metrics that integrate state features with RL-specific explanation components, such as goals and objectives, is another research direction necessary for implementing counterfactual explanations in RL. Existing research on reward function similarity [30] could be a starting point for developing metrics that compare the goals and objectives of RL agents. Additionally, the cost associated with changing individual goals and objectives when generating recourse might depend on the user. This opens up the possibility of a human-in-the-loop approach [6] where user feedback could be integrated into the counterfactual generation process.
6.4.6 Counterfactual and Prefactual Statements.
Previous research in psychology distinguishes between prefactual and counterfactual explanations, two different types of statements that can be used to describe the causes of an event [12, 18, 25]. For example, for a user wondering why their loan application has been rejected, a counterfactual explanation could state: “Had you had a higher income, your loan would have been approved”, while a prefactual explanation would advise: “If you obtain a higher income in the future, your next loan application will be approved” [18]. Previous research in psychology has found that the two explanation types involve different psychological processes [12, 25].
In supervised learning, where models deal with one-step predictions and there is no notion of time, counterfactual and prefactual statements are used interchangeably. In contrast, RL is designed for sequential tasks, where the distinction between prefactual and counterfactual statements is more meaningful. For example, while counterfactual explanations could help locate mistakes in the agent’s previous path, prefactual explanations could help guide the agent from an undesirable state to a more favorable one. In this sense, counterfactual explanations could be related to the field of credit assignment [35]. Prefactual explanations, on the other hand, could be used in the field of safe RL to help agents avoid dangerous states and find a way to safety [29].
While the difference between prefactual and counterfactual statements in supervised learning is purely semantic, in RL different technical approaches would need to be developed to obtain the two types of explanations. Current methods for counterfactual generation in RL [37, 67] do not distinguish between prefactual and counterfactual statements, and further research is needed to utilize these two perspectives of counterfactual explanations.
6.4.7 Evaluation.
Evaluating explanations is challenging, as their perceived quality is subjective and depends on the user. Counterfactual explanations are usually evaluated based on their counterfactual properties. Few works build on this and evaluate the real-life applicability of counterfactuals in a user study [67]. In supervised learning, counterfactual explanations are evaluated against data sets such as German Credit [19, 66], Breast Cancer [14, 58], or MNIST [52, 58, 77, 79]. For RL tasks, however, there is no established benchmark environment for the evaluation of counterfactual explanations; current work evaluates counterfactuals in Atari games [67]. Establishing benchmark environments is necessary for evaluating and comparing counterfactual methods in RL. Additionally, RL benchmark environments need to be extended to include multi-goal, multi-objective, and stochastic tasks.
6.5 Conclusion
While successfully applied to a variety of tasks, RL algorithms suffer from a lack of transparency due to their reliance on neural networks. User-friendly explanations are necessary to ensure trust and encourage collaboration between non-expert users and the black-box system. In this work, we explored counterfactuals – user-friendly, actionable explanations for interpreting black-box systems. Counterfactuals represent a powerful explanation method that can help users understand and better collaborate with a black-box system. However, counterfactuals that propose unrealistic changes or do not deliver the desired output can further damage user trust and hinder engagement with the system. For this reason, only well-defined and useful counterfactual explanations should be presented to the user.
Although they are well researched in supervised learning, counterfactual explanations are still underrepresented in RL tasks. In this work, we explored how counterfactual explanations can be redefined for RL tasks. First, we offered an overview of current state-of-the-art methods for generating counterfactual explanations in supervised learning. We then identified the main differences between supervised learning and RL from the perspective of counterfactual explanations and provided a definition better suited to RL tasks. Specifically, we recognized that the definitions of score and variable are not straightforward in RL and can encompass different concepts. Finally, we proposed the main research directions necessary for the successful implementation of counterfactual explanations in RL. In particular, we identified temporality and stochasticity as important RL-specific concepts that affect counterfactual generation. Additionally, we drew attention to universal issues with counterfactual generation in both supervised learning and RL, such as the handling of categorical features and evaluation.