6.2 Explanation Faithfulness (RQ1, RQ2)
Figure 5 shows how the fairness (i.e., Head-tailed Rate@K) and recommendation (i.e., NDCG@K) performance of our CFairER and the baselines change with erasure, where the x-axis shows the erasure iteration and the y-axis shows the corresponding fairness and recommendation performance at \(K=\lbrace 5,20\rbrace\). Each data point in Figure 5 is generated by cumulatively erasing a batch of attributes. The erased attributes are selected from, at most, the top 10 (i.e., \(E=10\)) attribute sets of the explanation lists provided by each method. As the PopUser and PopItem baselines exhibit very similar trends, we do not present both in Figure 5. In addition, we plot the relationship between fairness and recommendation performance at each erasure iteration in Figure 6, showcasing the fairness-accuracy trade-off of our CFairER and the baselines. Table 2 presents the final recommendation and fairness performance of all methods after erasing \(E = \lbrace 5, 10, 20\rbrace\) attributes from the explanations. Note that in Figure 5, Figure 6, and Table 2, larger NDCG@K and Hit Ratio@K values indicate better recommendation performance, while smaller Head-tailed Rate@K and Gini@K values indicate better fairness. Analyzing Figure 5, Figure 6, and Table 2, we have the following findings.
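To make the evaluation metrics concrete, the sketch below implements common formulations of NDCG@K, Head-tailed Rate@K, and a Gini coefficient over item exposure. These are standard definitions assumed for illustration; the paper's exact metric definitions (e.g., how head items are chosen) may differ.

```python
import numpy as np

def ndcg_at_k(ranked_relevance, k):
    """NDCG@K for one user; `ranked_relevance` is 0/1 relevance in rank order."""
    rel = np.asarray(ranked_relevance[:k], dtype=float)
    if rel.sum() == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    idcg = float((np.sort(rel)[::-1] * discounts).sum())
    return dcg / idcg

def head_tailed_rate_at_k(rec_lists, head_items, k):
    """Fraction of top-K recommendation slots occupied by head (popular)
    items; lower is fairer (assumed definition)."""
    hits = total = 0
    for recs in rec_lists:
        topk = recs[:k]
        hits += sum(1 for i in topk if i in head_items)
        total += len(topk)
    return hits / total

def gini_at_k(rec_lists, n_items, k):
    """Gini coefficient of item exposure across top-K lists; 0 = even exposure."""
    exposure = np.zeros(n_items)
    for recs in rec_lists:
        for i in recs[:k]:
            exposure[i] += 1
    x = np.sort(exposure)           # ascending, for the Lorenz-curve formula
    cum = np.cumsum(x)
    if cum[-1] == 0:
        return 0.0
    n = x.size
    return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)
```

With these helpers, each erasure iteration simply re-scores the recommendation lists to produce one (fairness, accuracy) point.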
From Table 2, we observe that our CFairER achieves the best recommendation and fairness performance among all methods after erasing attributes from its explanations. For instance, CFairER beats the strongest baseline CEF by 25.9%, 24.4%, 8.3%, and 36.0% on NDCG@40, Hit Ratio@40, Head-tailed Rate@40, and Gini@40 with erasure length \(E=20\) on Yelp. This indicates that, compared with all baseline methods, the explanations generated by CFairER are more faithful in explaining unfair factors while not harming recommendation accuracy. Unlike the heuristic approaches (i.e., RDExp, PopUser, and PopItem) that rely on random attribute selection, CFairER incorporates a more principled attribute selection mechanism for fairness explanation generation. In particular, CFairER prioritizes relevant attributes that change model fairness by using a counterfactual reward (cf. Equation (9)), incorporating the two criteria of Rationality and Proximity to ensure the quality of the generated counterfactual explanations. FairKGAT mitigates the unfairness in explanation diversity caused by differing user activeness, but ignores the impact of item exposure imbalance on explanation fairness. Our CFairER mitigates the unfairness of item exposure to promote the fair allocation of user-preferred but less exposed items, thus achieving better recommendation and fairness performance than FairKGAT. Regarding CEF, although it generates counterfactual explanations as our CFairER does, it conducts feature-level optimizations and does not apply to discrete attributes such as gender and age. Our CFairER uses an off-policy learning agent to directly visit attributes in a given HIN, adapting to both discrete and continuous attributes when finding counterfactual explanations. As a result, our CFairER outperforms CEF due to its generalizability to discrete attributes. Another interesting finding is that PopUser and PopItem perform even worse than RDExp (i.e., randomly selecting attributes) on LastFM. Recommendation models largely recommend items that have popular attributes favored by users. Though intuitive, recommending items with popular attributes deprives less-noticeable items of exposure, causing serious model unfairness and degraded recommendation performance. This further highlights the importance of mitigating item exposure unfairness in recommendations.
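The Rationality and Proximity criteria behind the counterfactual reward can be sketched as follows. This is an illustrative stand-in, not the paper's actual Equation (9): it rewards the drop in an unfairness measure after an intervention (Rationality) and penalizes the size of the counterfactual change to keep explanations minimal (Proximity). The trade-off weight `lam` is a hypothetical parameter.

```python
def counterfactual_reward(fairness_before, fairness_after, n_erased, lam=0.1):
    """Illustrative counterfactual reward (assumed form, not the paper's
    exact Eq. (9)): larger when erasing attributes improves fairness
    (Rationality) with as few erased attributes as possible (Proximity)."""
    rationality = fairness_before - fairness_after  # > 0 if unfairness dropped
    proximity = -lam * n_erased                     # prefer minimal explanations
    return rationality + proximity
```

Under this form, an attribute set that achieves the same fairness gain with fewer erasures receives a strictly higher reward, which is the behavior the two criteria are meant to enforce.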
From Figure 5, the fairness of all models consistently improves as attributes are erased from explanations, shown by the decreasing trend of the Head-tailed Rate@K values. Figure 5 also shows a decreasing trend of the NDCG@K values, i.e., decreased recommendation performance. The improved fairness of all methods is reasonable: erasing attributes, even those selected randomly from the attribute sets, can remove unfair factors and thus narrow the representation gap between popular and long-tailed items. This is consistent with the finding in CEF [19]. Unfortunately, we also observe degraded recommendation performance for all models in Figure 5, as likewise evidenced by CEF [19]. For example, in Figure 5, the NDCG@5 of CEF drops from approximately 1.17 to 0.60 on LastFM between erasure iterations 0 and 50. This is due to the well-known fairness-accuracy trade-off, in which the fairness constraint is achieved at the cost of recommendation performance. Facing this issue, the baselines suffer from large declines in recommendation performance. On the contrary, our CFairER still enjoys favorable recommendation performance and outperforms all baselines. Moreover, the decline rates of our CFairER are much slower than those of the baselines on both datasets in Figure 5. We hence conclude that the attribute-level explanations provided by our CFairER achieve a much better fairness-accuracy trade-off than the other methods. This is because our CFairER finds minimal but vital attributes as explanations for model fairness: the attributes produced by CFairER are fairness-related factors rather than ones that affect recommendation accuracy. As a result, our CFairER alleviates item exposure unfairness while maintaining stable recommendation performance. Another finding is that our CFairER may not outperform FairKGAT and PopUser in fairness evaluations when the number of erasure iterations is insufficient. In Figure 5(a), the HT@5 of FairKGAT and PopUser degrades more quickly than that of CFairER at the beginning of the erasure iterations. This can be attributed to the fact that FairKGAT and PopUser construct explanations of a fixed length (i.e., \(N=20\)), which may absorb more attributes into each explanation than CFairER's adaptive explanation length, which is kept minimal for each explanation. Consequently, CFairER may not have as many opportunities to hit the attributes that explain model unfairness as FairKGAT and PopUser, resulting in a slower decrease of HT@5 during the initial erasure iterations. However, as more erasure iterations are performed, CFairER exhibits stable performance and eventually surpasses FairKGAT and PopUser in fairness evaluations. This indicates that CFairER consistently discovers suitable explanations that align with model fairness during the learning process. The explanations generated by CFairER prioritize simple yet essential attributes, in contrast to the complex combinations of attributes used by FairKGAT and PopUser.
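The cumulative erasure protocol underlying Figure 5 can be sketched as a simple loop. The `model.erase(attrs)` and `eval_fn(model)` interfaces below are assumptions for illustration, not the paper's actual API: each iteration erases one more batch of attributes drawn from a method's explanation lists and re-evaluates fairness and accuracy to produce one data point.

```python
def erasure_curve(model, explanations, eval_fn, batch_size=5, iterations=10):
    """Cumulatively erase batches of explanation attributes and record the
    (fairness, accuracy) pair after each batch (one point per iteration).
    `model.erase` and `eval_fn` are hypothetical interfaces."""
    curve = []
    erased = set()
    pool = [a for expl in explanations for a in expl]  # flatten explanation lists
    for _ in range(iterations):
        batch = [a for a in pool if a not in erased][:batch_size]
        if not batch:                 # explanation attributes exhausted
            break
        erased.update(batch)
        model.erase(batch)            # remove the attributes from the model
        curve.append(eval_fn(model))  # e.g., (HT@K, NDCG@K) after this batch
    return curve
```

Plotting the first component of each pair against the iteration index reproduces a fairness curve of the kind shown in Figure 5.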
Figure 6 provides deeper insight into the fairness-accuracy trade-off, specifically the relationship between the fairness and recommendation performance metrics, i.e., Head-tailed Rate@K and NDCG@K. Analyzing Figure 6, we find that our CFairER achieves the best fairness-accuracy trade-off among all methods on the Douban Movie and LastFM datasets. This is evident from the blue curves positioned to the left-hand side of the diagonals in Figure 6(c)(d)(e)(f). Moreover, we observe that RDExp performs the poorest on the Douban Movie dataset, while PopUser exhibits the worst performance on the LastFM dataset. In our experiments, the trade-off is caused by the disagreement between the goals of item exposure fairness and user preference. While we aim to align fair allocations with item exposures, the recommendation model primarily focuses on selecting items similar to those in users' historical interactions. We thus conclude that the attributes found by our CFairER are not necessarily similar to the attributes of historical items. Instead, they are sensitive attributes that cause the recommendations to favor historically popular items. Another finding is that our CFairER initially does not outperform the other baselines during the early erasure process on the Yelp dataset, as depicted in Figure 6(a)(b). We attribute this sub-optimal performance on Yelp to the extremely sparse nature of the dataset, i.e., a density of only 0.086%. Compared with Douban Movie (0.63% density) and LastFM (0.28% density), Yelp records a larger number of users that have few interactions with items. While the baseline methods may rely on popular attributes to predict the preferences of those users, CFairER takes a different approach: it aims to identify sensitive attributes that are not necessarily the most popular ones. Consequently, at the beginning of the erasure process, CFairER may struggle to adapt to the data sparsity of the Yelp dataset, resulting in sub-optimal performance. However, as more erasure iterations are performed, CFairER surpasses the baseline models and achieves the best fairness-accuracy trade-off. This demonstrates the effectiveness of CFairER in progressively refining its attribute selection and, with more iterations, identifying the attributes that better explain item exposure fairness and align with user preferences.
6.4 Parameter Analysis (RQ4)
We analyze how the erasure length E (cf. Section 6.1.3) and the candidate size n (as in Equation (7)) impact the performance of CFairER. The erasure length E is the number of erased attributes selected from each explanation, which determines the erasure size for evaluation. The candidate size n is the number of candidate actions selected by our attentive action pruning. We present the evaluation results of CFairER under different E and n on Yelp and LastFM in Figure 7. Since the results on Douban Movie lead to similar conclusions, we do not present them here.
The performance of CFairER first decreases from \(E=5\) and then becomes stable after \(E=10\). The decreased performance is due to the increasing erasure of attributes identified by our generated explanations, which indicates that CFairER finds valid attribute-level explanations that impact fair recommendations. After the lowest point, the performance degrades only slightly and then stabilizes. This is reasonable: the number of attributes provided in the datasets is limited, so increasing the erasure length mainly re-selects attributes that overlap with previous erasures.
Varying the candidate size n over \(n=\lbrace 10, 20, 30, 40, 50, 60\rbrace\) in Figure 7(c)(d), we observe that the performance of CFairER first improves drastically as the candidate size increases on both datasets. The performance of our CFairER peaks at \(n=40\) and \(n=30\) on Yelp and LastFM, respectively. After the peaks, model performance degrades as the candidate size increases further. We attribute the poorer performance of CFairER before the peaks to the limited candidate pool, i.e., insufficient attributes limit the exploration ability of CFairER to find appropriate candidates as fairness explanations. Meanwhile, a too-large candidate pool (e.g., \(n=60\)) offers more chances for the agent to select inadequate attributes as explanations. Based on these two findings, we believe it is necessary for our CFairER to perform the attentive action search, so as to select high-quality attributes as candidates based on their contributions to the current state.
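The attentive action pruning can be sketched as scoring each candidate attribute against the current state and keeping the top-n. The dot-product scoring and softmax below are a stand-in for the paper's attention mechanism (Equation (7) is not reproduced here); the embedding shapes are assumptions.

```python
import numpy as np

def attentive_action_pruning(state, attribute_embs, n=30):
    """Keep the n candidate attributes with the highest attention weight
    w.r.t. the current state. Dot-product attention is an assumed stand-in
    for the paper's attentive action search."""
    scores = attribute_embs @ state           # (num_attrs,) relevance scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over all attributes
    top = np.argsort(-weights)[:n]            # indices of the top-n actions
    return top, weights[top]
```

A moderate n keeps exploration tractable while retaining enough high-quality candidates, which matches the peak behavior observed at \(n=40\) and \(n=30\).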