Identifying the important input features that significantly impact a model’s prediction results is a straightforward method of improving a model’s local interpretability, directly linking model outputs to inputs. Important features can be, for example, words for text-based tasks or image regions for image-based tasks. This article focuses on the four main methods of extracting important features as the interpretation of a model’s outputs: rationale extraction, input perturbation, attribution methods, and attention weight extraction. We summarise the typology of feature importance methods in Figure 2 and present sample visualisations of extracted features in Figure 1.
3.1.1 Rationale Extraction.
Rationale extraction is commonly used as a local interpretation method for NLP tasks such as sentiment analysis and document classification. Rationales are short, coherent phrases from the original textual input that represent the critical textual features contributing most to the output prediction. These identified textual features serve as the local explanation, indicating the information the model primarily attends to when making its prediction for a particular textual input. Rationales that are valid as explanations should lead to the same prediction results as the original textual inputs. As this line of work developed, researchers also made extra efforts to extract coherent and consecutive rationales so that they serve as more readable and comprehensible explanations.
Rationale extraction methods can be divided into two main streams: (1) sequential selector-predictor stacked models, where a selector first selects rationales from the original textual input and then passes them to a predictor for the prediction; and (2) adversarial-based models, which involve parallel modules to calibrate the rationales extracted by the selector. In this article, we summarise several iconic and milestone works of rationale extraction for each stream.
For the selector-predictor stream, Lei et al. [92] present one of the first works on rationale extraction in NLP tasks. The selector first generates a binary vector of 0s and 1s through a Bernoulli distribution conditioned on the original textual input. This binary vector is then multiplied over the original input, where 1 indicates that an input word is selected as a rationale and 0 indicates that it is not, resulting in a sparse input representation that marks which textual tokens are selected as rationales. The predictor then makes its prediction based on this masked input. Since the selected rationales are represented with non-differentiable discrete values, the REINFORCE algorithm [182] was applied for optimization to update the binary vectors towards accurate rationale selection. Lei et al. [92] performed rationale extraction for a sentiment analysis task with training data that has no pre-annotated rationales to guide the learning process. The training loss is calculated from the difference between a ground-truth sentiment vector and a predicted sentiment vector generated from the rationales selected by the selector model. Such a selector-predictor structure is designed mainly to boost interpretability faithfulness, i.e., selecting valid rationales that lead to the same prediction as the original textual input. To increase the readability of the explanation, Lei et al. [92] used two regularizers over the loss function to force rationales to be consecutive words (readable phrases) and to limit the number of selected rationales (i.e., selected words/phrases). Bastings et al. [17] followed the same selector-predictor structure as Lei et al. [92]. The main difference is that they used the rectified Kumaraswamy distribution [90] instead of the Bernoulli distribution to generate the rationale selection vector, i.e., the binary vector of 0s and 1s masked over the textual input. The Kumaraswamy distribution allows gradient estimation for optimization, so the REINFORCE algorithm is no longer needed. To encourage short and coherent rationales for better readability and comprehensibility, Bastings et al. [17] also applied a relaxed form of \(L_0\) regularization [103] and a Lagrangian relaxation to encourage adjacent words to be selected or not selected together. Different from the above methods, where rationale extraction is wrapped in an end-to-end model and no annotated rationales are used during the training of rationale selection, Du et al. [45] use rationales annotated by external experts as guidance during the training of the rationale selector to generate local explanations (short and coherent rationales) that are consistent with these human-annotated rationales.
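To make the selector-predictor idea concrete, the following is a minimal PyTorch sketch, not any of the cited implementations: a selector produces per-token Bernoulli probabilities, a hard 0/1 mask is sampled, the predictor classifies from the masked embeddings, and a REINFORCE-style loss with sparsity and coherence regularizers updates the selector. All module choices and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SelectorPredictor(nn.Module):
    """Minimal selector-predictor sketch in the spirit of Lei et al. [92].
    Module names and hyperparameters are illustrative, not the original implementation."""

    def __init__(self, vocab_size, emb_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Selector: per-token Bernoulli probability of being a rationale token.
        self.selector = nn.Sequential(nn.Linear(emb_dim, hidden_dim), nn.Tanh(),
                                      nn.Linear(hidden_dim, 1))
        # Predictor: classifies from the masked (rationale-only) representation.
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        emb = self.embed(token_ids)                              # (batch, seq, emb)
        probs = torch.sigmoid(self.selector(emb)).squeeze(-1)   # (batch, seq)
        mask = torch.bernoulli(probs)                            # hard 0/1 rationale mask
        _, h = self.encoder(emb * mask.unsqueeze(-1))            # non-rationale embeddings zeroed out
        logits = self.classifier(h[-1])
        return logits, probs, mask

def rationale_loss(logits, labels, probs, mask, l_sparsity=0.01, l_coherence=0.01):
    """Task loss plus a REINFORCE-style term for the non-differentiable mask,
    with sparsity and coherence regularizers as in the selector-predictor stream."""
    task_loss = nn.functional.cross_entropy(logits, labels, reduction="none")
    sparsity = mask.sum(dim=1)                                   # discourage long rationales
    coherence = (mask[:, 1:] - mask[:, :-1]).abs().sum(dim=1)    # encourage consecutive tokens
    cost = task_loss.detach() + l_sparsity * sparsity + l_coherence * coherence
    # Log-probability of the sampled mask under the selector's Bernoulli distribution.
    log_p = (mask * torch.log(probs + 1e-8)
             + (1 - mask) * torch.log(1 - probs + 1e-8)).sum(dim=1)
    return (task_loss + cost * log_p).mean()
```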
For the stream of adversarial-based models, a third module is usually added on top of the selector-predictor stack, functioning as a guide that boosts the faithfulness of rationales and improves the comprehensibility of the interpretation. For example, to boost the faithfulness of extracted rationales, Yu et al. [190] fed the target labels of sentiment analysis as additional inputs into the rationale selector to strengthen its contribution to the prediction. Additionally, to improve comprehensibility and prevent the rationale selector from selecting meaningless small snippets, this work added a third element: a complement predictor. This additional module predicts the labels for the original textual inputs based on the non-rationale words. The complement predictor and the rationale selector work much like the discriminative and generative networks in generative adversarial networks (GANs) [56]: the rationale selector aims to extract as many prediction-relevant words as possible as rationales to prevent the complement predictor from being able to predict the actual label. Similar to Yu et al. [190], Chang et al. [29] also involved a third module where the target labels of the original inputs are used as additional inputs, with the addition that these target labels can be incorrect. This work also proposed a counterfactual rationale generator that extracts rationales supporting the incorrect predictions, and a discriminator is then applied to distinguish between rationales produced by the factual and counterfactual generators. Recent work, such as Reference [150], reduces the complexity of using three modules by constructing a guider model that operates over the original textual inputs for prediction alongside the rationale selector model in the adversarial-based architecture; the final prediction vectors from the two separate models are encouraged to be close to each other, thereby achieving faithfulness of the extracted rationales. To achieve better comprehensibility, Reference [150] also proposed using a language model as a regularizer, which significantly improves the fluency of the extracted rationales by encouraging the selection of consecutive tokens.
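The adversarial objective underlying this stream can be summarised with a short sketch; the loss below is an illustrative formulation, not the exact objective of any cited work: the predictor should classify correctly from the rationale tokens, while the complement predictor, which sees only the non-rationale tokens, should fail.

```python
import torch.nn.functional as F

def adversarial_rationale_loss(pred_logits, comp_logits, labels, adv_weight=1.0):
    """Illustrative adversarial objective for the three-module stream
    (in the spirit of Yu et al. [190]); the weighting is an assumption."""
    predictor_loss = F.cross_entropy(pred_logits, labels)    # loss on rationale tokens
    complement_loss = F.cross_entropy(comp_logits, labels)   # loss on non-rationale tokens
    # The selector and predictor minimise this quantity; the complement predictor is
    # trained separately to minimise its own loss, giving the adversarial game.
    return predictor_loss - adv_weight * complement_loss
```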
In general, using rationales extracted from the original textual inputs as a model’s local interpretation focuses on the faithfulness and comprehensibility of the interpretation. Besides selecting rationales that represent the complete inputs well enough to yield the same prediction results, extracting short and consecutive sub-phrases is also a key objective of current rationale extraction works. Such fluent and consecutive sub-phrases (i.e., well-extracted rationales) make rationale extraction a friendly interpretation method that provides readable and understandable explanations to non-expert users without NLP-related knowledge.
3.1.2 Input Perturbation.
Another method for identifying important features of textual inputs is input perturbation. In this method, a word (or a few words) of the original input is modified or removed (i.e., “perturbed”), and the resulting change in performance is measured. The larger the model’s performance drop, the more critical those words are to the model, and they are therefore regarded as important features. Input perturbation is usually model-agnostic, meaning it does not alter the original model’s architecture. The main difference among the proposed input perturbation methods lies in how tokens or phrases of the original inputs are perturbed into new instances.
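As a concrete illustration, the simplest form of input perturbation is leave-one-out deletion; the sketch below assumes a black-box `predict_proba` function that maps a list of strings to class probabilities.

```python
def leave_one_out_importance(text, predict_proba, class_idx):
    """Basic perturbation sketch: delete one word at a time and measure how much
    the black-box probability for the predicted class drops. `predict_proba` is
    an assumed function mapping a list of strings to class probabilities."""
    words = text.split()
    base = predict_proba([text])[0][class_idx]
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        # Larger probability drop means the removed word was more important.
        scores.append(base - predict_proba([reduced])[0][class_idx])
    return list(zip(words, scores))
```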
Ribeiro et al. [143] proposed local interpretable model-agnostic explanations (LIME), which can be used as an interpretation method for any black-box model. The main idea of LIME is to approximate a black-box model with a transparent model using variants of the original inputs. For natural language processing tasks such as text classification, words of the original textual input are randomly selected and removed from the input, with a binary representation marking the inclusion of each word. Basaj et al. [16] applied LIME to a QA task to identify the important words in a question, where the words in the question are considered to be features, while the associated context (i.e., the text containing the answer to the given question) is held constant. The results indicate that in QA tasks, the complete question sentence plays a minor role, and a small number of question words is sufficient for correct answer prediction.
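The Python sketch below captures the core of LIME for text under the same assumed `predict_proba` black box; it omits LIME’s locality weighting and feature selection, and the surrogate choice (ridge regression) is illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_text_explanation(text, predict_proba, class_idx, n_samples=500, rng=None):
    """LIME-style sketch for text: randomly drop words, record the black-box
    probability for the target class, and fit a linear surrogate whose
    coefficients score word importance. `predict_proba` is an assumed black box."""
    rng = rng or np.random.default_rng(0)
    words = text.split()
    masks = rng.integers(0, 2, size=(n_samples, len(words)))   # 1 = keep word, 0 = drop it
    masks[0] = 1                                                # include the original input
    perturbed = [" ".join(w for w, keep in zip(words, m) if keep) for m in masks]
    probs = np.array([predict_proba([s])[0][class_idx] for s in perturbed])
    surrogate = Ridge(alpha=1.0).fit(masks, probs)              # local linear approximation
    # Words with larger absolute coefficients are more important for this prediction.
    return sorted(zip(words, surrogate.coef_), key=lambda x: -abs(x[1]))
```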
Ribeiro et al. [144] argued that the important features identified by Ribeiro et al. [143] are word-level (single-token) rather than phrase-level (consecutive-token) features. Word-level features relate to only one instance and cannot provide general explanations, which makes it difficult to extend such explanations to unseen instances. For example, in sentiment analysis, “not” in “The movie is not good” is a contributing feature for negative sentiment but is not a contributing feature for positive sentiment in “The weather is not bad.” The single token “not” is insufficient as a general explanation for unseen instances, as it leads to different meanings when combined with different words. Thus, Ribeiro et al. [144] emphasized phrase-level features for more comprehensive local interpretations and proposed a rule-based method for identifying critical features for predictions. Their algorithm iteratively selects predicates from the input as key tokens while replacing the remaining tokens with random tokens that have the same POS tags and similar word embeddings. If the probability of classifying the perturbed text into the same class as the original text is above a predefined threshold, then the selected predicates are considered the final key features that interpret the prediction result.
Similar to Ribeiro et al. [143, 144], Alvarez-Melis and Jaakkola [5] also proposed a model-agnostic interpretation method that relates inputs to outputs through perturbed inputs generated by a variational auto-encoder applied to the original input. Each perturbed input is assumed to have a meaning similar to the original input. A bipartite graph is then constructed to link these perturbed inputs and the outputs, and the graph is partitioned to highlight which input tokens are relevant to which specific output tokens.
Feng et al. [54] proposed a method to gradually remove unimportant words from the original text while maintaining the model’s performance. The remaining words are then considered the important features for prediction. The importance of each token of the textual input is measured through a gradient approximation method, which takes the dot product between a given token’s word embedding and the gradient of the output with respect to that word embedding [47]. The authors show that although the reduced inputs are nonsensical to humans, they are still sufficient for a given model to maintain a similar level of accuracy compared with the original inputs.
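A rough PyTorch sketch of this procedure follows; the gradient-times-embedding score and the greedy removal loop are illustrative simplifications, and `model` is assumed to map an embedding tensor of shape (1, seq, dim) directly to logits.

```python
import torch

def token_importance(model, embeddings, target_idx):
    """Gradient-times-embedding sketch [47]: dot product between each token's
    embedding and the gradient of the target logit with respect to that embedding."""
    embeddings = embeddings.clone().detach().requires_grad_(True)
    model(embeddings)[0, target_idx].backward()
    return (embeddings * embeddings.grad).sum(dim=-1).squeeze(0)  # one score per token

def input_reduction(model, embeddings, target_idx, max_steps=10):
    """Input-reduction sketch in the spirit of Feng et al. [54]: repeatedly drop
    the least important token while the predicted class stays unchanged."""
    original_pred = model(embeddings).argmax(dim=-1)
    kept = list(range(embeddings.size(1)))
    for _ in range(max_steps):
        scores = token_importance(model, embeddings[:, kept], target_idx)
        candidate = kept[:]
        candidate.pop(int(scores.abs().argmin()))                 # drop least important token
        if not candidate or model(embeddings[:, candidate]).argmax(dim=-1) != original_pred:
            break                                                 # stop if the prediction would change
        kept = candidate
    return kept  # indices of the remaining (important) tokens
```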
The input perturbation method seems a straightforward way of identifying significant input features by measuring the target task’s performance changes on newly perturbed instances. However, some works question the faithfulness of input perturbation. For example, Reference [154] conducted several experiments and argued that when the distributions of perturbed and original instances are dissimilar, the explanations of LIME [143] are not faithful. Another problem of most input perturbation explanations is that the identified important features are mostly independent tokens instead of coherent phrases, as argued by Ribeiro et al. [144], which limits comprehensibility. A recent track of local explanation, counterfactual explanations [31, 145, 184], is generated via input perturbation approaches: such explanations show what would happen if certain features were replaced, thereby demonstrating that those features are important for a particular model decision. Counterfactual explanations extend input perturbation beyond the simple word level and present the interpretation differently, through more straightforward counterfactual examples. Such a presentation of input perturbation interpretations gives ordinary users a more intuitive understanding.
3.1.3 Attention Weights.
Attention weights are the scores with which input representations are weighted and summed in the intermediate layers of neural networks [14]. Extracting attention weights over inputs to provide local interpretations of predictions is common among models that utilise attention mechanisms. For NLP tasks with only textual inputs, tokens with higher attention weights are considered to have more impact on the outputs and are therefore regarded as the more important features. Attention weights have been used for explainability in sentiment analysis [107, 112, 173], question answering [151, 164, 166], and neural machine translation [14, 109]. In tasks with both visual and textual inputs, such as Visual Question Answering (VQA) [25, 43, 105, 186, 191] and image captioning [7, 61, 185], attention weights are extracted from both images and questions to identify the contributing features from both modalities. In such multi-modal tasks, it is also important to boost the consistency between the attended image regions and sentence tokens for a plausible explanation. In recent years, different attention mechanisms have been proposed, including the self-attention mechanism [169] and the co-attention mechanism for multi-modal inputs [191], aiming for attention weight calculations that genuinely reflect the factors contributing to the final prediction.
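As an illustration of how attention weights are typically read out as importance scores, the sketch below uses the Hugging Face transformers API; averaging the heads of the last layer and reading the attention row of the [CLS] token are illustrative choices rather than a canonical recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch: pull per-token attention weights out of a pre-trained transformer and
# treat them as importance scores for the input tokens.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "The movie is not good"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
last_layer = outputs.attentions[-1]                  # (1, heads, seq, seq)
cls_attention = last_layer.mean(dim=1)[0, 0]         # average heads, attention from [CLS]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, weight in sorted(zip(tokens, cls_attention.tolist()), key=lambda x: -x[1]):
    print(f"{token:>10s}  {weight:.3f}")
```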
Though attention mechanisms have proved effective at improving performance on different tasks and have been used as indicators of important features to explain a model’s predictions, there has been ongoing debate about the faithfulness of attention weights as an interpretation of neural networks.
Bai et al. [15] proposed the concept of combinatorial shortcuts caused by the attention mechanism, arguing that the masks used to map the query and key matrices of self-attention [169] are biased, which leads to tokens at the same positions being attended to regardless of the actual word semantics of different inputs. Clark et al. [34] found that a large proportion of BERT’s [40] attention focuses on semantically meaningless tokens such as the special token [SEP]. Jain and Wallace [79] argued that the tokens with high attention weights are not consistent with the important tokens identified by other interpretation methods, such as gradient-based measures. Serrano and Smith [149] applied intermediate representation erasure and claimed that attention weights at best indicate the importance of intermediate components and are not faithful enough to explain the model’s decision at the level of the actual inputs.
In contrast, Wiegreffe and Pinter [181], in their work “Attention is not not explanation,” responded specifically to the arguments of Reference [79], arguing that whether attention weights are faithful explanations depends on the definition of explanation, and conducted four different experiments to demonstrate when attention can be used as an explanation. A similar view is proposed by Jacovi and Goldberg [76], who illustrate that in some cases attention maps over the input can be considered a faithful explanation, which can be verified by the erasure method [9, 54], i.e., by checking whether erasing the attended tokens from the input changes the prediction results.
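Such an erasure check can be sketched as follows; `classify` and `mask_id` are assumed placeholders for the model’s prediction function and an erased-token id.

```python
import torch

def erasure_check(classify, token_ids, attention_scores, k=3, mask_id=0):
    """Erasure-style faithfulness check sketch [9, 54]: remove the k most-attended
    tokens and see whether the prediction changes. `classify` is assumed to map
    token ids of shape (1, seq) to logits; `mask_id` marks erased tokens."""
    original = classify(token_ids).argmax(dim=-1)
    top_k = attention_scores.topk(k).indices          # indices of the most-attended tokens
    erased = token_ids.clone()
    erased[0, top_k] = mask_id
    flipped = classify(erased).argmax(dim=-1) != original
    return bool(flipped)  # True suggests the attended tokens mattered to the decision
```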
To improve the faithfulness of attention as an explanation, some recent works have proposed different methods. For example, Bai et al. [15] proposed generating unbiased mask distributions by using random masks and obtaining attention weights by training only the attention layers while fixing the other downstream parts of the model, thereby scaling the attention weights towards tokens that are truly correlated with the predicted label. Chrysostomou and Aletras [33] introduced three different task-scaling mechanisms that scale the word representations in different ways before they are passed to the attention mechanism and claimed that such scaled word representations help to produce more faithful attention-based explanations.
Overall, the dilemma of using inputs with high attention weights as the explanation for a black-box model’s decision is associated with the varying definitions and inconsistent evaluations of explanation faithfulness across different works. Jacovi and Goldberg [76] also proposed that a possible approach to solving this issue is to construct a unified evaluation of the degree of faithfulness, either at the level of a specific task or at the level of sub-spaces of the input space. Nevertheless, regardless of the debates over faithfulness, explanation by attention weights has a lower level of readability. Compared to rationale extraction works that explicitly force consecutive rationales to be extracted for better comprehensibility, current works using attention as explanation neglect this aspect of interpretability. Therefore, even in cases where the input tokens with high attention weights could serve as faithful explanations, it would be hard for non-experts to understand the explanation well from non-coherent highlighted tokens of the textual input. However, for multimodal tasks such as visual question answering, some works use attention maps over the images as the explanation [108] or as part of the explanation [183]; the attended regions are usually consecutive pixels of the images, which are more straightforward for non-expert users to understand than attention maps over pure text.
3.1.4 Attribution Methods.
Another way of detecting the input features that contribute most to a specific prediction is attribution methods, which aim to interpret prediction outputs by examining the gradients of a model. Common attribution methods include DeepLift [153], layer-wise relevance propagation (LRP) [13], deconvolutional networks [192], and guided back-propagation [157].
Extracting model gradients allows for identifying input features that contribute highly to a given prediction. However, directly extracted gradients do not satisfy two key properties: sensitivity and implementation invariance. Sensitivity requires that if two inputs differ in a single feature and lead to different predictions, then that differing feature should be noted as important to the prediction. Implementation invariance requires that the attributions for two functionally equivalent models be identical, whether or not their implementations are the same. Focusing on these properties, Sundararajan et al. [163] proposed the integrated gradients method. Integrated gradients are the accumulated gradients at all points on a straight line between the input and a baseline point (e.g., an all-zero word embedding). He et al. [65] applied this method to neural machine translation to find the contribution of each input word to each output word; here, the baseline input is a sequence of zero embeddings of the same length as the input to be translated. Mudrakarta et al. [119] applied integrated gradients to a question-answering task to identify the critical words in questions and found that only a few words in a question contribute to the model’s answer prediction.
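A minimal sketch of integrated gradients over word embeddings is shown below; `model` is assumed to map an embedding tensor of shape (1, seq, dim) to logits, and the step count is illustrative.

```python
import torch

def integrated_gradients(model, input_emb, baseline_emb, target_idx, steps=50):
    """Integrated-gradients sketch [163]: accumulate gradients along the straight
    line from a baseline (e.g., all-zero word embeddings) to the actual input."""
    total_grads = torch.zeros_like(input_emb)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = baseline_emb + alpha * (input_emb - baseline_emb)   # point on the path
        point = point.clone().detach().requires_grad_(True)
        model(point)[0, target_idx].backward()
        total_grads += point.grad
    avg_grads = total_grads / steps
    # Attribution per token: (input - baseline) * average gradient, summed over embedding dims.
    return ((input_emb - baseline_emb) * avg_grads).sum(dim=-1).squeeze(0)
```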
Besides extracting gradients, scoring input contributions based on the model’s hidden states is also used for attribution. For example, Du et al. [46] proposed a post hoc interpretation method that leaves the original training model untouched and examines the hidden states passed along by RNNs. Ding et al. [42] applied LRP [13] to neural machine translation to provide interpretations using the hidden state values of each source and target word.
Attribution methods were the early approaches that deep learning researchers used to explain neural networks by identifying input features with outstanding gradients. Most attribution methods were proposed before the mature development and extensive research on rationale extraction, attention mechanisms, and even input perturbation methods. Compared to the other input feature explanation methods, attribution methods pay little attention to the faithfulness and comprehensibility of the interpretation. Visualizing the identified input features can be as plausible to non-expert users as the other three feature importance methods, but attribution methods make no effort to form the interpretation into coherent sub-phrases for better readability and easier understanding. Thus, compared to rationale extraction, attention weight extraction, and input perturbation, using attribution methods to generate interpretations is more of a diagnostic method for deep learning experts to understand the model’s decisions and learn the model’s functionality.