Does Faithfulness Conflict with Plausibility? An Empirical Study in Explainable AI across NLP Tasks

Xiaolei Lu
xiaoleilu2-c@my.cityu.edu.hk
&Jianghong Ma
majianghong@hit.edu.cn

Abstract

Explainability algorithms aimed at interpreting decision-making AI systems usually consider balancing two critical dimensions: 1) faithfulness, where explanations accurately reflect the model’s inference process. 2) plausibility, where explanations are consistent with domain experts. However, the question arises: do faithfulness and plausibility inherently conflict? In this study, through a comprehensive quantitative comparison between the explanations from the selected explainability methods and expert-level interpretations across three NLP tasks: sentiment analysis, intent detection, and topic labeling, we demonstrate that traditional perturbation-based methods Shapley value and LIME could attain greater faithfulness and plausibility. Our findings suggest that rather than optimizing for one dimension at the expense of the other, we could seek to optimize explainability algorithms with dual objectives to achieve high levels of accuracy and user accessibility in their explanations.

Keywords Explainability $\cdot$ Faithfulness $\cdot$ Plausibility

1 Introduction

Deep Neural Networks (DNNs) have demonstrated impressive results in many domains including Natural Language Processing (NLP), Computer Vision (CV) and speech processing [1, 2]. These deep neural models operate like black-box models by applying multiple layers of non-linear transformation on the vector representations of input data, which fails to provide insights to understand the inference process.

Explainability algorithms including attention-based, gradient-based and perturbation-based feature attribution methods have been extensively studied to explore the internal mechanisms of black-box deep models [3, 4, 5], which improves transparency in AI, particularly in sensitive applications like clinical decision-making systems. A good explanation in such contexts should consider two critical dimensions: 1) Faithfulness. The explanation could accurately attribute the model’s decision to the specific features. 2) Plausibility. The explanation is logically sound and understandable to the domain experts.

As plausibility focuses on the human’s perception of the explanation, more faithful explanations that accurately convey the reasoning of complex models (e.g. deep neural networks) may be implausible to a domain expert, and vice versa. Explainability research commonly recognizes a trade-off between faithfulness and plausibility, suggesting that enhancing one may compromise the other [6, 7]. However, few studies explicitly address the conflicts between these dimensions during evaluation, which requires further empirical investigations.

In this work, through comprehensive quantitative analysis we evaluate the explanations from the selected explainability methods and expert-level interpretations across NLP tasks. Our contributions are summarized as follows:

•

We utilize GPT-4, which demonstrates its expert role in our consistency verification, to construct professional explanations across our targeted datasets, serving as benchmarks for plausibility evaluation.
•

We thoroughly evaluate the faithfulness and plausibility of explanations from GPT-4 and the selected explainability methods. Our findings suggest the possibility of optimizing explainability algorithms to simultaneously achieve high performance in both faithfulness and plausibility.

2 Related Work

Existing explainability methods for interpreting model training and inference process could be categorized into two types: instance attribution measures how a training point influence the prediction of a given instance while feature attribution quantifies the contribution of each feature (or feature interaction) to the model’s output on a specific instance. For example, Influence Function [8] attends to the final iteration of the training and computes the influence score of a training instance on the prediction loss of a test sample. Shapley value [5] that is derived from cooperative game theory treats each feature as a player and computes the marginal contribution of each feature toward the model’s output. Integrated Gradients [4] measures feature importance by computing the path integral of the gradients respect to each dimension of input.

Faithfulness and plausibility are the primary criteria to evaluate explainability methods. Previous works [9, 10] collected ground truth explanations (or rationales) from crowdworkers with human-to-human agreement. However, plausible explanations may not be faithful to reflect the model’s inference process. Proxy model [6] was proposed to use the trained model’s predictions as training labels to balance faithfulness and plausibility. Since obtaining professional human explanations is challenging, few studies explicitly address the conflicts between faithfulness and plausibility.

3 Experimental Setup

3.1 Tasks, Datasets and Models

We conduct experiments on various NLP tasks including sentiment analysis, intent detection and topic labeling. The employed datasets are SST-2 [11], SNIPS [12] and 20Newsgroups¹¹1http://qwone.com/ jason/20Newsgroups/, and study the performance on BERT-base [1] and RoBERTa-base [13] models. Appendix A provides the configurations of finetuning pretrained BERT-base and RoBERTa-base models in these downstream tasks. To enable user-friendly human evaluation, we select the explained set where the sequence remains unchanged after tokenization. Appendix B summarizes the details of the datasets.

3.2 Explanation Methods

We study model explainability from three groups of explanation methods: attention-based, gradient-based and perturbation-based attributions. The employed attribution methods are described as follows: Inherent Attention Explanation (RawAtt)[3]: directly measure the feature importance with attention weights. Attention Rollout (AttRll) [14]: aggregate attention across all heads and layers to measure how much each input feature attends to every other feature across the entire depth of the model. Input $\odot$ Gradients (InputG) [15]: measure the change of the model’s output with respect to a small change in the input feature. Integrated Gradients (IG) [4]: accumulate the gradients along the path from a given baseline to the input. Shapley value (SV) [5]: average marginal contribution of the feature being explained toward the model’s output over all possible permutations. LIME [16]: generate explanation by learning an inherently interpretable model locally on the instance being explained.

Refer to caption — (a) Faithfulness evaluation performance over BERT on SST-2.

3.3 Evaluation matrices

Faithfulness: three faithful evaluation metrics are employed and we choose padding replacement operation²²2Deletion operation produces the similar results over BERT architecture. RoBERTa’s training process is more robust to variations in input and padding and deletion operations generate the same results over RoBERTa architecture.. The employed matrices include Log-odds (LOR) [15]: average the difference of negative logarithmic probabilities on the predicted class over the test data before and after replacing the top $k$ influential words from the text sequence. Sufficiency (SF) [17]: measure whether important features identified by the explanation method are adequate to remain confidence on the original predictions. Comprehensiveness (CM) [17]: evaluate if the features assigned lower weights are unnecessary for the predictions.

Plausibility: evaluate the similarity between the feature importance ranking generated by the explanation methods and GPT-4. Given the $i_{th}$ input sequence with size $n$ , let $H_{i}$ denotes the human explanation toward feature importance ranking and $E_{i}$ is provided by the explainability method. The employed matrices include Rank Correlation (RC): measure the similarity between two ranks. Spearman’s Rank Correlation Coefficient is employed to compute RC. Overlap Rate (OR): measure the overlap of the top $k$ influential elements between $E_{i,k}$ and $H_{i,k}$ .

3.4 Human explanation

There are some research reports [18, 19] showing that large language models (LLMs), such as GPT-3.5 and GPT-4, can provide high-quality annotations like an excellent crowdsourced annotator does. We first randomly select 83 explained instances for BERT and 74 for RoBERTa on SST-2, 86 explained instances for both BERT and RoBERTa on SNIPS. By comparing the explanations generated by GPT-4 (details of the prompt for generating explanation is given in Appendix C) and an NLP researcher, Rank Correlation scores are 0.71 and 0.77 for BERT and RoBERTa, respectively, on SST-2, and 0.86 and 0.83, respectively, on SNIPS. These results demonstrate the quality of GPT-4 in the expert role. Therefore we use GPT-4 to provide explanations toward the model’s output for more instances (we will provide the explained sets with corresponding GPT-4 explanations including our consistency verification in the public version).

Method		SV	LIME	IG	InputG	RawAtt	AttRll	GPT-4
BERT	LOR	-5.9748	-3.5052	-0.9578	-1.1743	-2.2261	-0.6265	-3.2694
	CM	0.8874	0.6880	0.2156	0.2677	0.4352	0.1330	0.5848
	SF	-0.0572	0.1360	0.6132	0.5600	0.4189	0.6815	0.3071
RoBERTa	LOR	-5.4660	-3.1295	-1.2463	-1.2423	-0.9516	-0.5808	-3.4327
	CM	0.8392	0.5721	0.2238	0.2315	0.2021	0.1217	0.5748
	SF	-0.0868	0.2523	0.5217	0.5160	0.6284	0.6451	0.3367

Table 1: Faithfulness evaluation performances on 20Newsgroup over BERT and RoBERTa architectures.

4 Results

Fig.1 and Table 1 demonstrate the faithfulness evaluation performance ³³3For 20Newsgroup we use GPT-4 to provide the most positive influential features and the remaining treated as unimportant features.. We can observe that in SST-2 and 20Newsgroup over BERT and RoBERTa architectures SV outperforms the other baselines, and LIME and GPT-4 obtain comparative performance which is second only to that of SV. In SNIPS, both LIME and SV achieve similar outcomes, whereas GPT-4’s performance is moderate.

Generally, SV, LIME and GPT-4 outperform the selected gradient-based and attention-based methods in these datasets. First, attention-based methods assume higher attention weights correlate with higher importance while these weights may also contain additional information that could be utilized by the downstream models [20]. Compared with the perturbation-based attribution that is less reliant on the model architecture, gradient-based methods might not accurately measure how input features affect the output of the complex non-linear model. Furthermore, the plausible explanations provided by the expert (e.g. GPT-4) could be more faithful than some explainability algorithms.

Fig.2 presents the rank coefficient between the explainability methods and GPT-4 on SST-2 and SNIPS over BERT and RoBERTa architectures. Overall the explanations toward feature importance ranking between the explainability methods and GPT-4 are weakly correlated. We further examine the overlap rate of the explanations between these methods. Appendix E Fig.3 shows the overlap rate of the feature importance ranking ( $k=4$ ) between the explainability methods and GPT-4. SV and LIME achieve over $60\%$ OR with GPT-4 in identifying the most positive influential features in the SST-2 and SNIPS datasets, and these two methods also obtain higher OR values in 20Newsgroup. Despite the weak correlation when considering the full importance ranking, significant overlap exists in identifying the most critical features between the chosen methods and GPT-4. We also present the overlap rate with different $k$ in Appendix E.

5 Conclusion

In this work by conducting experiments on three NLP tasks and constructing expert-level human explanations with GPT-4, we quantitatively analyze the explanations from the selected explainability methods and human-generated interpretations toward NLP deep models. The results show that SV, LIME, and GPT-4 outperform traditional gradient-based and attention-based methods across various datasets. Our findings suggest that plausibility and faithfulness can be complementary. The explainability method could achieve a high overlap rate in identifying influential features and also tend to provide explanations that are plausible to human interpreters, which implies that the explainability algorithms can be optimized toward the dual objective of faithfulness and plausibility.

6 Limitations

This empirical study focuses on a selected set of explainability methods across three specific NLP tasks: sentiment analysis, intent detection, and topic labeling. While the results provide insights into the relationship between faithfulness and plausibility, it also limits the generalizability of our findings. Future research could include more tasks and models. Furthermore, our findings suggest that explainability algorithms could be optimized to achieve both faithfulness and plausibility. How to optimize these algorithms for multiple objectives simultaneously require further investigation.

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[3] Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. arXiv preprint arXiv:1908.04626, 2019.
[4] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International conference on machine learning, pages 3319–3328. PMLR, 2017.
[5] Lloyd S Shapley et al. A value for n-person games. 1953.
[6] Zach Wood-Doughty, Isabel Cachola, and Mark Dredze. Faithful and plausible explanations of medical code predictions. arXiv preprint arXiv:2104.07894, 2021.
[7] Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685, 2020.
[8] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pages 1885–1894. PMLR, 2017.
[9] Julia El Zini, Mohamad Mansour, Basel Mousi, and Mariette Awad. On the evaluation of the plausibility and faithfulness of sentiment analysis explanations. In IFIP International Conference on Artificial Intelligence Applications and Innovations, pages 338–349. Springer, 2022.
[10] Tasuku Sato, Hiroaki Funayama, Kazuaki Hanawa, and Kentaro Inui. Plausibility and faithfulness of feature attribution-based explanations in automated short answer scoring. In International Conference on Artificial Intelligence in Education, pages 231–242. Springer, 2022.
[11] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.
[12] Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190, 2018.
[13] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[14] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. arXiv preprint arXiv:2005.00928, 2020.
[15] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International conference on machine learning, pages 3145–3153. PMLR, 2017.
[16] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. " why should i trust you?" explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016.
[17] Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C Wallace. Eraser: A benchmark to evaluate rationalized nlp models. arXiv preprint arXiv:1911.03429, 2019.
[18] Petter Törnberg. Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning. arXiv preprint arXiv:2304.06588, 2023.
[19] Yunhe Feng, Sreecharan Vanam, Manasa Cherukupally, Weijian Zheng, Meikang Qiu, and Haihua Chen. Investigating code generation performance of chatgpt with crowdsourcing social data. In Proceedings of the 47th IEEE Computer Software and Applications Conference, pages 1–10, 2023.
[20] Bing Bai, Jian Liang, Guanhua Zhang, Hao Li, Kun Bai, and Fei Wang. Why attentions may not be interpretable? In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 25–34, 2021.

Appendix A Configurations for finetuning deep models

We use AdamW optimizer with weight decay $0.001$ and start with learning rate of 2e-5 to tune pretrained BERT-base-uncased and RoBERTa-base models. For the setting of epochs and batch size, SST-2 at 10 epochs with a batch size of 32, SNIPS at 10/64, and NG at 20/64, ensuring good model performance for each task. The corresponding performance is reported in Table 2.

Models	SST-2	SNIPS	20Newsgroup
BERT	90.49	97.71	74.48
RoBERTa	94.56	97.85	73.37

Table 2: Task performance (%) with finetuning BERT and RoBERTa.

Appendix B Details of the tasks and datasets

Table 3 summarizes the selected datasets with corresponding explained sets.

Datasets	Train set	Test set	Label set	BERT		RoBERTa
Datasets	Train set	Test set	Label set	Explained set	Avg_len	Explained set	Avg_len
SST-2	6899	1819	2	152	7.39	164	8.88
SNIPS	13082	700	7	188	7.35	194	7.50
20Newsgroup	10663	7019	20	89	23.66	78	29.03

Table 3: Summary of the selected datasets, where Avg_len denotes the average length.

Appendix C Prompt for GPT-4 to generate explanations

In this section, we demonstrate how to use GPT-4 to generate explanations via Table 4, 5 and 6. “Input” and “Output” refer to the prompt provided to GPT-4 and the generated explanations, respectively. It could be treated as zero-shot evaluation. We maintained the output integrity without alternations, while occasionally adjusting the requirements to ensure a complete ranking. For instance, when dealing with repeated strings, each instance was assigned an individual rank. Due to the lengthy context in the 20Newsgroup dataset, we only use GPT-4 to provide the most positive influential features toward the model’s output.

Component	Description
Input	The task is described as follows: given a text sequence of a movie review with the sentiment classification label (positive or negative), there are a few requirements: 1. Transform this long string sequence into a list of strings. 2. Measure the contributions of each string in the list toward the sentiment label based on your understanding. Then rank all strings (ensuring that no strings are excluded) including the repeated strings (each occurrence should have its own rank) in this list based on their contributions. 3. The ranking should follow an order from the most positive to neutral to the most negative. Place the strings with the highest positive contribution at the top and the strings with the most negative contribution at the bottom. 4. Output all ranked strings ensuring that no strings are excluded.
Example Input	Sequence: something the true film buff will enjoy Label: positive
Example Output	Ranked strings: [‘enjoy’, ‘true’, ‘something’, ‘film’, ‘buff’, ‘will’, ‘the’]

Table 4: Illustration of how to use GPT-4 to generate explanations for SST-2 explained set.

Component	Description
Input	The task is described as follows: given the utterance with the corresponding predicted intent, by treating yourself as an human, please rank all words based on their influence toward this predicted intent. 1. Transform this long string sequence into a list of strings. 2. Treat yourself as a human and rank each individual string (ensuring that no strings are excluded), including the repeated strings, in this list based on their contributions toward the predicted intent. 3. The ranking should follow an order from the most positive to neutral to the most negative. Place the strings with the highest positive contribution at the top and the strings with the most negative contribution at the bottom. 4. Output all ranked strings ensuring that no strings are excluded.
Example Input	Sequence: find an album called just call me stupid Label: search creative work
Example Output	Ranked strings: [‘find’, ‘album’, ‘called’, ‘just’, ‘call’, ‘stupid’, ‘me’, ‘an’]

Table 5: Illustration of how to use GPT-4 to generate explanations for SNIPS explained set.

Component	Description
Input	The task is described as follows: given a news with the corresponding topic, please provide the evaluation for the topic labeling of this news. There are a few specified requirements: 1. Transform this long string sequence to a list of strings. 2. Treat yourself as a human and find out the most positive influential strings toward the predicted topic. 3. Output the ranked list.
Example Input	Sequence: does anyone know where i can get some voice synthesis chips i am looking for something like the ones that do the time and date stamp on answering machines Topic: discussions about electronics
Example Output	Ranked strings: [‘voice’, ‘synthesis’, ‘chips’, ‘time’, ‘date’, ‘stamp’]

Table 6: Illustration of how to use GPT-4 to generate explanations for 20Newsgroup explained set.

Appendix D Evaluation matrices

Log-odds (LOR) [15]: average the difference of negative logarithmic probabilities on the predicted class over the test data before and after replacing the top $k$ influential words from the text sequence.

\mathrm{LOR}(k)=\frac{1}{N}\sum_{i=1}^{N}\log\frac{f(\bm{x^{\prime}}_{i})}{f(% \bm{x}_{i})},

(1)

where $\bm{x^{\prime}}_{i}$ is obtained by replacing the $k$ top-scored words from $\bm{x}_{i}$ . The lower LOR, the more faithful feature importance ranking.

Sufficiency (SF) [17]: measure whether important features identified by the explanation method are adequate to remain confidence on the original predictions.

\mathrm{SF}(k)=\frac{1}{N}\sum_{i=1}^{N}f(\bm{x}_{i})-f(\bm{x}_{i,k}),

(2)

where $\bm{x}_{i,k}$ is obtained by replacing non-top $k$ influential elements in $\bm{x}_{i}$ . The lower SF, the more faithful feature importance ranking.

Comprehensiveness (CM) [17]: evaluate if the features assigned lower weights are unnecessary for the predictions.

\mathrm{CM}(k)=\frac{1}{N}\sum_{i=1}^{N}f(\bm{x}_{i})-f(\bm{x}_{i}\backslash% \bm{x}_{i,k}),

(3)

where $\bm{x}_{i}\backslash\bm{x}_{i,k}$ is obtained by replacing top $k$ influential elements in $\bm{x}_{i}$ . The higher CM, the more faithful feature importance ranking.

Given the $i_{th}$ input sequence with size $n$ , let $H_{i}$ denotes the human explanation toward feature importance ranking and $E_{i}$ is provided by the explanation method.

Rank Correlation (RC): measure the similarity between two ranks. Spearman’s Rank Correlation Coefficient is employed to compute RC as

\mathrm{RC}=\frac{1}{N}\sum_{i=1}^{N}1-\frac{6\sum(E_{i}-H_{i})^{2}}{n_{i}(n_{% i}^{2}-1)},

(4)

where RC value ranges from $[-1,1]$ .

Overlap Rate (OR): measure the overlap of the top $k$ influential elements between $E_{i,k}$ and $H_{i,k}$ as

OR(k)=\frac{1}{N}\sum\frac{1}{k}\left|E_{i,k}\cap H_{i,k}\right|.

(5)

Appendix E Overlap rate evaluation

Fig.4, 5, 6 and 7 demonstrate the overlap rate with different $k$ over BERT and RoBERTa architectures. By varying $k$ SV and LIME always have significant overlap with GPT-4.