Tintarev and Masthoff [
76] propose seven design goals for explainable recommendation, namely
effectiveness,
efficiency,
persuasiveness,
satisfaction,
scrutability,
transparency, and
trust. Balog and Radlinski [
7] and Nunes and Jannach [
55] point out that most existing works focus on only a single goal, and only a few consider multiple goals [
16,
21]. Balog and Radlinski [
7] conducted a user study to measure the correlations between the aforementioned goals, and the results, not surprisingly, show that these goals are interrelated. However, we notice that some goals are more important than others in certain application scenarios/domains, e.g.,
persuasiveness in generating post hoc explanations in recipe/music recommendation [
46,
87]. More importantly, while various models have been proposed for explainable recommendation, there is no universal, model-independent metric that systematically evaluates the quality of generated explanations in terms of faithfulness and scrutability. Peake and Wang [
58] propose to mine association rules to explain the recommendations given by matrix factorization, but their evaluation is limited to the explanations’ fidelity, i.e., whether the association rules recommend items similar to those of the matrix factorization model. Some works [
3,
23,
96] use user/case studies to show that the generated explanations are superior to those from the baseline models in terms of persuasiveness. For example, Ai and Narayanan [
3] conduct a crowdsourcing study to compare model-intrinsic and model-agnostic explanations in terms of
informativeness,
usefulness, and
satisfaction, where usefulness has the same definition as persuasiveness. Similarly, Ghazimatin et al. [
23] show that the proposed explanation framework outperforms other explanation types in terms of usefulness. A recent line of works [
23,
38,
46,
72] takes a counterfactual perspective. Ghazimatin et al. [
23] propose to use a subset of a user’s interaction history as the counterfactual explanation and conduct user studies to show that the generated explanations are of higher quality. Tan et al. [
72] propose to evaluate the explanations using
necessity (whether the condition is necessary) and
sufficiency (whether the condition is sufficient). However, their methods are bound to the specific recommendation algorithm, and the authors do not illustrate how these explanation frameworks can be adapted to other explanation models. Moreover, although the counterfactual models can deliver human-interpretable explanations [
23,
65,
72], the evaluation of faithfulness has been overlooked in these works. We also include two counterfactual baselines and show that an explanation being counterfactually valid does not guarantee that it reflects the actual reasoning process of the recommendation model. In this work, we propose an evaluation pipeline for model-agnostic explanations from the perspectives of faithfulness and scrutability. Our work differs from existing works in three respects: (1) we focus on evaluating generated explanations in terms of faithfulness and scrutability, which have been largely overlooked by previous works in explainable recommendation; (2) our evaluation pipeline is not limited to a certain explanation style, e.g., aspects [
72] or association rules [
58]; and (3) with proper modification, our model-agnostic explanation framework can be applied to other genres of recommendation models (detailed in Section
6.3).
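To make the necessity and sufficiency criteria discussed above concrete, the two checks can be sketched model-agnostically as follows. This is a minimal sketch under our own assumptions: the `score` interface (a function mapping a user history to per-item scores), the helper names, and the top-k formulation are illustrative, not the implementation of Tan et al. [72].

```python
from typing import Callable, Dict, List, Set

# Hypothetical interface: a recommender exposes a scoring function that maps
# a user's history (a set of item/aspect identifiers) to candidate scores.
Scorer = Callable[[Set[str]], Dict[str, float]]

def top_k(score: Scorer, history: Set[str], k: int) -> List[str]:
    """Rank candidate items by the model's score for the given history."""
    scores = score(history)
    return sorted(scores, key=scores.get, reverse=True)[:k]

def necessity(score: Scorer, history: Set[str], explanation: Set[str],
              target: str, k: int = 10) -> bool:
    """Necessary: removing the explanation items from the history
    should push the target item out of the top-k."""
    return target not in top_k(score, history - explanation, k)

def sufficiency(score: Scorer, history: Set[str], explanation: Set[str],
                target: str, k: int = 10) -> bool:
    """Sufficient: the explanation items alone should keep the
    target item in the top-k."""
    return target in top_k(score, explanation, k)
```

Because the checks only query the scoring function, they apply to any recommender that can be re-scored on a perturbed history, which is the sense in which such counterfactual evaluation is model-agnostic.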