Identifying the important input features that significantly impact a model’s prediction results is a straightforward method of improving a model’s local interpretability, directly linking model outputs to inputs. Important features can be, for example, words for text-based tasks or image regions for image-based tasks. This article focuses on the four main methods of extracting important features as the interpretation of a model’s outputs: rationale extraction, input perturbation, attribution methods, and attention weight extraction. We summarise the typology of feature importance methods in Figure 2 and present sample visualisations of extracted features in Figure 1.
3.1.1 Rationale Extraction.
Rationale extraction is commonly used as a local interpretation method for NLP tasks such as sentiment analysis and document classification. Rationales are short, coherent phrases from the original textual input that represent the critical textual features contributing most to the output prediction. These identified textual features serve as the local explanation, indicating the information the model primarily attends to when making its prediction for a particular textual input. Rationales that are valid as explanations should lead to the same prediction results as the original textual inputs. As this line of work developed, researchers also made extra efforts to extract coherent and consecutive rationales so that they serve as more readable and comprehensible explanations.
Rationale extraction methods can be divided into two main streams: (1) sequential selector-predictor stacked models, where a selector first selects rationales from the original textual input and then passes them to a predictor for the prediction; and (2) adversarial-based models, which involve parallel modules to calibrate the rationales extracted by the selector. In this article, we summarise several iconic and milestone works of rationale extraction for each stream.
For the selector-predictor stream, Lei et al. [92] present one of the first works on rationale extraction in NLP tasks. The selector first generates a binary vector of 0s and 1s through a Bernoulli distribution conditioned on the original textual input. This binary vector is then multiplied over the original input, where 1 indicates that an input word is selected as a rationale and 0 indicates that it is not, resulting in a sparse input representation that marks which textual tokens are selected as rationales. The predictor then makes its prediction based on this masked input. Since the selected rationales are represented with non-differentiable discrete values, the REINFORCE algorithm [182] was applied for optimization to update the binary vectors towards accurate rationale selection. Lei et al. [92] performed rationale extraction for a sentiment analysis task with training data that has no pre-annotated rationales to guide the learning process. The training loss is calculated from the difference between a ground-truth sentiment vector and a predicted sentiment vector generated from the rationales selected by the selector model. Such a selector-predictor structure is designed mainly to boost interpretability faithfulness, i.e., selecting valid rationales that lead to the same prediction as the original textual input. To increase the readability of the explanation, Lei et al. [92] used two regularizers over the loss function to force rationales to be consecutive words (readable phrases) and to limit the number of selected rationales (i.e., selected words/phrases). Bastings et al. [17] followed the same selector-predictor structure as Lei et al. [92]. The main difference is that they used the rectified Kumaraswamy distribution [90] instead of the Bernoulli distribution to generate the rationale selection vector, i.e., the binary vector of 0s and 1s masked over the textual input. The Kumaraswamy distribution allows gradient estimation for optimization, so the REINFORCE algorithm is no longer needed. To encourage short and coherent rationales for better readability and comprehensibility, Bastings et al. [17] also applied a relaxed form of \(L_0\) regularization [103] and a Lagrangian relaxation to encourage adjacent words to be selected or not selected together. Different from the above methods, where rationale extraction is wrapped in an end-to-end model and no annotated rationales are used during the training of rationale selection, Du et al. [45] use rationales annotated by external experts as guidance during the training of the rationale selector to generate local explanations (short and coherent rationales) that are consistent with these human-annotated rationales.
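To make the selector-predictor idea concrete, the following is a minimal PyTorch sketch, not any of the cited implementations: a selector produces per-token Bernoulli probabilities, a hard 0/1 mask is sampled, the predictor classifies from the masked embeddings, and a REINFORCE-style loss with sparsity and coherence regularizers updates the selector. All module choices and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SelectorPredictor(nn.Module):
    """Minimal selector-predictor sketch in the spirit of Lei et al. [92].
    Module names and hyperparameters are illustrative, not the original implementation."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Selector: per-token Bernoulli probability of being a rationale token.
        self.selector = nn.Sequential(nn.Linear(emb_dim, hidden_dim), nn.Tanh(),
                                      nn.Linear(hidden_dim, 1))
        # Predictor: classifies from the masked (rationale-only) representation.
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        emb = self.embed(token_ids)                              # (batch, seq, emb)
        probs = torch.sigmoid(self.selector(emb)).squeeze(-1)   # (batch, seq)
        mask = torch.bernoulli(probs)                            # hard 0/1 rationale mask
        _, h = self.encoder(emb * mask.unsqueeze(-1))            # non-rationale embeddings zeroed out
        logits = self.classifier(h[-1])
        return logits, probs, mask

def rationale_loss(logits, labels, probs, mask, l_sparsity=0.01, l_coherence=0.01):
    """Task loss plus a REINFORCE-style term for the non-differentiable mask,
    with sparsity and coherence regularizers as in the selector-predictor stream."""
    task_loss = nn.functional.cross_entropy(logits, labels, reduction="none")
    sparsity = mask.sum(dim=1)                                   # discourage long rationales
    coherence = (mask[:, 1:] - mask[:, :-1]).abs().sum(dim=1)    # encourage consecutive tokens
    cost = task_loss.detach() + l_sparsity * sparsity + l_coherence * coherence
    # Log-probability of the sampled mask under the selector's Bernoulli distribution.
    log_p = (mask * torch.log(probs + 1e-8)
             + (1 - mask) * torch.log(1 - probs + 1e-8)).sum(dim=1)
    return (task_loss + cost * log_p).mean()
```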
For the stream of adversarial-based models, a third module is usually added on top of the selector-predictor stack, functioning as a guide that boosts the faithfulness of rationales and improves the comprehensibility of the interpretation. For example, to boost the faithfulness of extracted rationales, Yu et al. [190] fed the target labels of sentiment analysis as additional inputs into the rationale selector to strengthen its contribution to the prediction. Additionally, to improve comprehensibility and prevent the rationale selector from selecting meaningless small snippets, this work added a third element: a complement predictor. This additional module predicts the labels for the original textual inputs based on the non-rationale words. The complement predictor and the rationale selector work much like the discriminative and generative networks in generative adversarial networks (GANs) [56]: the rationale selector aims to extract as many prediction-relevant words as possible as rationales to prevent the complement predictor from being able to predict the actual label. Similar to Yu et al. [190], Chang et al. [29] also involved a third module where the target labels of the original inputs are used as additional inputs, with the addition that these target labels can be incorrect. This work also proposed a counterfactual rationale generator that extracts rationales supporting the incorrect predictions, and a discriminator is then applied to distinguish between rationales produced by the factual and counterfactual generators. Recent work, such as Reference [150], reduces the complexity of using three modules by constructing a guider model that operates over the original textual inputs for prediction alongside the rationale selector model in the adversarial-based architecture; the final prediction vectors from the two separate models are encouraged to be close to each other, thereby achieving faithfulness of the extracted rationales. To achieve better comprehensibility, Reference [150] also proposed using a language model as a regularizer, which significantly improves the fluency of the extracted rationales by encouraging the selection of consecutive tokens.
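The adversarial objective underlying this stream can be summarised with a short sketch; the loss below is an illustrative formulation, not the exact objective of any cited work: the predictor should classify correctly from the rationale tokens, while the complement predictor, which sees only the non-rationale tokens, should fail.

```python
import torch.nn.functional as F

def adversarial_rationale_loss(pred_logits, comp_logits, labels, adv_weight=1.0):
    """Illustrative adversarial objective for the three-module stream
    (in the spirit of Yu et al. [190]); the weighting is an assumption."""
    predictor_loss = F.cross_entropy(pred_logits, labels)    # loss on rationale tokens
    complement_loss = F.cross_entropy(comp_logits, labels)   # loss on non-rationale tokens
    # The selector and predictor minimise this quantity; the complement predictor is
    # trained separately to minimise its own loss, giving the adversarial game.
    return predictor_loss - adv_weight * complement_loss
```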
In general, using rationales extracted from the original textual inputs as a model’s local interpretation focuses on the faithfulness and comprehensibility of the interpretation. Besides selecting rationales that represent the complete inputs well enough to yield the same prediction results, extracting short and consecutive sub-phrases is also a key objective of current rationale extraction works. Such fluent and consecutive sub-phrases (i.e., well-extracted rationales) make rationale extraction a friendly interpretation method that provides readable and understandable explanations to non-expert users without NLP-related knowledge.
3.1.2 Input Perturbation.
Another method for identifying important features of textual inputs is input perturbation. In this method, a word (or a few words) of the original input is modified or removed (i.e., “perturbed”), and the resulting change in performance is measured. The larger the model’s performance drop, the more critical those words are to the model, and they are therefore regarded as important features. Input perturbation is usually model-agnostic, meaning it does not alter the original model’s architecture. The main difference among the proposed input perturbation methods lies in how tokens or phrases of the original inputs are perturbed into new instances.
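As a concrete illustration, the simplest form of input perturbation is leave-one-out deletion; the sketch below assumes a black-box `predict_proba` function that maps a list of strings to class probabilities.

```python
def leave_one_out_importance(text, predict_proba, class_idx):
    """Basic perturbation sketch: delete one word at a time and measure how much
    the black-box probability for the predicted class drops. `predict_proba` is
    an assumed function mapping a list of strings to class probabilities."""
    words = text.split()
    base = predict_proba([text])[0][class_idx]
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        # Larger probability drop means the removed word was more important.
        scores.append(base - predict_proba([reduced])[0][class_idx])
    return list(zip(words, scores))
```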
Ribeiro et al. [143] proposed local interpretable model-agnostic explanations (LIME), which can be used as an interpretation method for any black-box model. The main idea of LIME is to approximate a black-box model with a transparent model using variants of the original inputs. For natural language processing tasks such as text classification, words of the original textual input are randomly selected and removed from the input, with a binary representation marking the inclusion of each word. Basaj et al. [16] applied LIME to a QA task to identify the important words in a question, where the words in the question are considered to be features, while the associated context (i.e., the text containing the answer to the given question) is held constant. The results indicate that in QA tasks, the complete question sentence plays a minor role, and a small number of question words is sufficient for correct answer prediction.
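The Python sketch below captures the core of LIME for text under the same assumed `predict_proba` black box; it omits LIME’s locality weighting and feature selection, and the surrogate choice (ridge regression) is illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_text_explanation(text, predict_proba, class_idx, n_samples=500, rng=None):
    """LIME-style sketch for text: randomly drop words, record the black-box
    probability for the target class, and fit a linear surrogate whose
    coefficients score word importance. `predict_proba` is an assumed black box."""
    rng = rng or np.random.default_rng(0)
    words = text.split()
    masks = rng.integers(0, 2, size=(n_samples, len(words)))   # 1 = keep word, 0 = drop it
    masks[0] = 1                                                # include the original input
    perturbed = [" ".join(w for w, keep in zip(words, m) if keep) for m in masks]
    probs = np.array([predict_proba([s])[0][class_idx] for s in perturbed])
    surrogate = Ridge(alpha=1.0).fit(masks, probs)              # local linear approximation
    # Words with larger absolute coefficients are more important for this prediction.
    return sorted(zip(words, surrogate.coef_), key=lambda x: -abs(x[1]))
```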
Ribeiro et al. [144] argued that the important features identified by Ribeiro et al. [143] are word-level (single-token) rather than phrase-level (consecutive-token) features. Word-level features relate to only one instance and cannot provide general explanations, which makes it difficult to extend such explanations to unseen instances. For example, in sentiment analysis, “not” in “The movie is not good” is a contributing feature for negative sentiment but is not a contributing feature for positive sentiment in “The weather is not bad.” The single token “not” is insufficient as a general explanation for unseen instances, as it leads to different meanings when combined with different words. Thus, Ribeiro et al. [144] emphasized phrase-level features for more comprehensive local interpretations and proposed a rule-based method for identifying critical features for predictions. Their algorithm iteratively selects predicates from the input as key tokens while replacing the remaining tokens with random tokens that have the same POS tags and similar word embeddings. If the probability of classifying the perturbed text into the same class as the original text is above a predefined threshold, then the selected predicates are considered the final key features that interpret the prediction result.
Similar to Ribeiro et al. [143, 144], Alvarez-Melis and Jaakkola [5] also proposed a model-agnostic interpretation method that relates inputs to outputs through perturbed inputs generated by a variational auto-encoder applied to the original input. Each perturbed input is assumed to have a meaning similar to the original input. A bipartite graph is then constructed to link these perturbed inputs and the outputs, and the graph is partitioned to highlight which input tokens are relevant to which specific output tokens.
Feng et al. [54] proposed a method to gradually remove unimportant words from the original text while maintaining the model’s performance. The remaining words are then considered the important features for prediction. The importance of each token of the textual input is measured through a gradient approximation method, which takes the dot product between a given token’s word embedding and the gradient of the output with respect to that word embedding [47]. The authors show that although the reduced inputs are nonsensical to humans, they are still sufficient for a given model to maintain a similar level of accuracy compared with the original inputs.
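A rough PyTorch sketch of this procedure follows; the gradient-times-embedding score and the greedy removal loop are illustrative simplifications, and `model` is assumed to map an embedding tensor of shape (1, seq, dim) directly to logits.

```python
import torch

def token_importance(model, embeddings, target_idx):
    """Gradient-times-embedding sketch [47]: dot product between each token's
    embedding and the gradient of the target logit with respect to that embedding."""
    embeddings = embeddings.clone().detach().requires_grad_(True)
    model(embeddings)[0, target_idx].backward()
    return (embeddings * embeddings.grad).sum(dim=-1).squeeze(0)  # one score per token

def input_reduction(model, embeddings, target_idx, max_steps=10):
    """Input-reduction sketch in the spirit of Feng et al. [54]: repeatedly drop
    the least important token while the predicted class stays unchanged."""
    original_pred = model(embeddings).argmax(dim=-1)
    kept = list(range(embeddings.size(1)))
    for _ in range(max_steps):
        scores = token_importance(model, embeddings[:, kept], target_idx)
        candidate = kept[:]
        candidate.pop(int(scores.abs().argmin()))                 # drop least important token
        if not candidate or model(embeddings[:, candidate]).argmax(dim=-1) != original_pred:
            break                                                 # stop if the prediction would change
        kept = candidate
    return kept  # indices of the remaining (important) tokens
```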
The input perturbation method seems a straightforward way of identifying significant input features by measuring the target task’s performance changes on newly perturbed instances. However, some works question the faithfulness of input perturbation. For example, Reference [154] conducted several experiments and argued that when the distributions of perturbed and original instances are dissimilar, the explanations of LIME [143] are not faithful. Another problem of most input perturbation explanations is that the identified important features are mostly independent tokens instead of coherent phrases, as argued by Ribeiro et al. [144], which limits comprehensibility. A recent track of local explanation, counterfactual explanations [31, 145, 184], is generated via input perturbation approaches: such explanations show what would happen if certain features were replaced, thereby demonstrating that those features are important for a particular model decision. Counterfactual explanations extend input perturbation beyond the simple word level and present the interpretation differently, through more straightforward counterfactual examples. Such a presentation of input perturbation interpretations gives ordinary users a more intuitive understanding.
3.1.3 Attention Weights.
Attention weights are the scores with which input representations are weighted and summed in the intermediate layers of neural networks [14]. Extracting attention weights over inputs to provide local interpretations of predictions is common among models that utilise attention mechanisms. For NLP tasks with only textual inputs, tokens with higher attention weights are considered to have more impact on the outputs and are therefore regarded as the more important features. Attention weights have been used for explainability in sentiment analysis [107, 112, 173], question answering [151, 164, 166], and neural machine translation [14, 109]. In tasks with both visual and textual inputs, such as Visual Question Answering (VQA) [25, 43, 105, 186, 191] and image captioning [7, 61, 185], attention weights are extracted from both images and questions to identify the contributing features from both modalities. In such multi-modal tasks, it is also important to boost the consistency between the attended image regions and sentence tokens for a plausible explanation. In recent years, different attention mechanisms have been proposed, including the self-attention mechanism [169] and the co-attention mechanism for multi-modal inputs [191], aiming for attention weight calculations that genuinely reflect the factors contributing to the final prediction.
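As an illustration of how attention weights are typically read out as importance scores, the sketch below uses the Hugging Face transformers API; averaging the heads of the last layer and reading the attention row of the [CLS] token are illustrative choices rather than a canonical recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch: pull per-token attention weights out of a pre-trained transformer and
# treat them as importance scores for the input tokens.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "The movie is not good"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
last_layer = outputs.attentions[-1]                  # (1, heads, seq, seq)
cls_attention = last_layer.mean(dim=1)[0, 0]         # average heads, attention from [CLS]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, weight in sorted(zip(tokens, cls_attention.tolist()), key=lambda x: -x[1]):
    print(f"{token:>10s}  {weight:.3f}")
```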
Though attention mechanisms have proved effective at improving performance on different tasks and have been used as indicators of important features to explain a model’s predictions, there has been ongoing debate about the faithfulness of attention weights as an interpretation of neural networks.
Bai et al. [15] proposed the concept of combinatorial shortcuts caused by the attention mechanism, arguing that the masks used to map the query and key matrices of self-attention [169] are biased, which leads to tokens at the same positions being attended to regardless of the actual word semantics of different inputs. Clark et al. [34] found that a large proportion of BERT’s [40] attention focuses on semantically meaningless tokens such as the special token [SEP]. Jain and Wallace [79] argued that the tokens with high attention weights are not consistent with the important tokens identified by other interpretation methods, such as gradient-based measures. Serrano and Smith [149] applied intermediate representation erasure and claimed that attention weights at best indicate the importance of intermediate components and are not faithful enough to explain the model’s decision at the level of the actual inputs.
In contrast, Wiegreffe and Pinter [181], in their work “Attention is not not explanation,” responded specifically to the arguments of Reference [79], arguing that whether attention weights are faithful explanations depends on the definition of explanation, and conducted four different experiments to demonstrate when attention can be used as an explanation. A similar view is proposed by Jacovi and Goldberg [76], who illustrate that in some cases attention maps over the input can be considered a faithful explanation, which can be verified by the erasure method [9, 54], i.e., by checking whether erasing the attended tokens from the input changes the prediction results.
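Such an erasure check can be sketched as follows; `classify` and `mask_id` are assumed placeholders for the model’s prediction function and an erased-token id.

```python
import torch

def erasure_check(classify, token_ids, attention_scores, k=3, mask_id=0):
    """Erasure-style faithfulness check sketch [9, 54]: remove the k most-attended
    tokens and see whether the prediction changes. `classify` is assumed to map
    token ids of shape (1, seq) to logits; `mask_id` marks erased tokens."""
    original = classify(token_ids).argmax(dim=-1)
    top_k = attention_scores.topk(k).indices          # indices of the most-attended tokens
    erased = token_ids.clone()
    erased[0, top_k] = mask_id
    flipped = classify(erased).argmax(dim=-1) != original
    return bool(flipped)  # True suggests the attended tokens mattered to the decision
```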
To improve the faithfulness of attention as an explanation, some recent works have proposed different methods. For example, Bai et al. [15] proposed generating unbiased mask distributions by using random masks and obtaining attention weights by training only the attention layers while fixing the other downstream parts of the model, thereby scaling the attention weights towards tokens that are truly correlated with the predicted label. Chrysostomou and Aletras [33] introduced three different task-scaling mechanisms that scale the word representations in different ways before they are passed to the attention mechanism and claimed that such scaled word representations help to produce more faithful attention-based explanations.
Overall, the dilemma of using inputs with high attention weights as the explanation for a black-box model’s decision is associated with the varying definitions and inconsistent evaluations of explanation faithfulness across different works. Jacovi and Goldberg [76] also proposed that a possible approach to solving this issue is to construct a unified evaluation of the degree of faithfulness, either at the level of a specific task or at the level of sub-spaces of the input space. Nevertheless, regardless of the debates over faithfulness, explanation by attention weights has a lower level of readability. Compared to rationale extraction works that explicitly force consecutive rationales to be extracted for better comprehensibility, current works using attention as explanation neglect this aspect of interpretability. Therefore, even in cases where the input tokens with high attention weights could serve as faithful explanations, it would be hard for non-experts to understand the explanation well from non-coherent highlighted tokens of the textual input. However, for multimodal tasks such as visual question answering, some works use attention maps over the images as the explanation [108] or as part of the explanation [183]; the attended regions are usually consecutive pixels of the images, which are more straightforward for non-expert users to understand than attention maps over pure text.
3.1.4 Attribution Methods.
Another way of detecting the input features that contribute most to a specific prediction is attribution methods, which aim to interpret prediction outputs by examining the gradients of a model. Common attribution methods include DeepLift [153], layer-wise relevance propagation (LRP) [13], deconvolutional networks [192], and guided back-propagation [157].
Extracting model gradients allows for identifying input features that contribute highly to a given prediction. However, directly extracted gradients do not satisfy two key properties: sensitivity and implementation invariance. Sensitivity requires that if two inputs differ in a single feature and lead to different predictions, then that differing feature should be noted as important to the prediction. Implementation invariance requires that the attributions for two functionally equivalent models be identical, whether or not their implementations are the same. Focusing on these properties, Sundararajan et al. [163] proposed the integrated gradients method. Integrated gradients are the accumulated gradients at all points on a straight line between the input and a baseline point (e.g., an all-zero word embedding). He et al. [65] applied this method to neural machine translation to find the contribution of each input word to each output word; here, the baseline input is a sequence of zero embeddings of the same length as the input to be translated. Mudrakarta et al. [119] applied integrated gradients to a question-answering task to identify the critical words in questions and found that only a few words in a question contribute to the model’s answer prediction.
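A minimal sketch of integrated gradients over word embeddings is shown below; `model` is assumed to map an embedding tensor of shape (1, seq, dim) to logits, and the step count is illustrative.

```python
import torch

def integrated_gradients(model, input_emb, baseline_emb, target_idx, steps=50):
    """Integrated-gradients sketch [163]: accumulate gradients along the straight
    line from a baseline (e.g., all-zero word embeddings) to the actual input."""
    total_grads = torch.zeros_like(input_emb)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = baseline_emb + alpha * (input_emb - baseline_emb)   # point on the path
        point = point.clone().detach().requires_grad_(True)
        model(point)[0, target_idx].backward()
        total_grads += point.grad
    avg_grads = total_grads / steps
    # Attribution per token: (input - baseline) * average gradient, summed over embedding dims.
    return ((input_emb - baseline_emb) * avg_grads).sum(dim=-1).squeeze(0)
```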
Besides extracting gradients, scoring input contributions based on the model’s hidden states is also used for attribution. For example, Du et al. [46] proposed a post hoc interpretation method that leaves the original training model untouched and examines the hidden states passed along by RNNs. Ding et al. [42] applied LRP [13] to neural machine translation to provide interpretations using the hidden state values of each source and target word.
Attribution methods were the early approaches that deep learning researchers used to explain neural networks by identifying input features with outstanding gradients. Most attribution methods were proposed before the mature development and extensive research on rationale extraction, attention mechanisms, and even input perturbation methods. Compared to the other input feature explanation methods, attribution methods pay little attention to the faithfulness and comprehensibility of the interpretation. Visualizing the identified input features can be as plausible to non-expert users as the other three feature importance methods, but attribution methods make no effort to form the interpretation into coherent sub-phrases for better readability and easier understanding. Thus, compared to rationale extraction, attention weight extraction, and input perturbation, using attribution methods to generate interpretations is more of a diagnostic method for deep learning experts to understand the model’s decisions and learn the model’s functionality.