1 Introduction
Nowadays, the amount of information available on the Internet has far exceeded individuals’ information needs and processing capacity, which is known as information overload [
40]. As a tool to alleviate information overload, recommender systems are widely used in people’s daily lives (e.g., news recommendation, career recommendation, and even medical recommendation) and play a crucial role. Utility (e.g., click-through rate and dwell time) has been the most important metric for recommender systems. However, considering only utility may lead to problems such as the Matthew effect [
111] and filter bubble [
70]. Hence, additional perspectives on recommender system performance have been proposed, such as diversity, efficiency, and privacy. Fairness is one of these critical issues. Recommender systems serve a resource-allocation role in society by allocating information to users and exposure to items, and whether this allocation is fair affects both personal experience and social good [
73].
Fairness problems have received increasing attention from academia, industry, and society. Unfairness exists in different recommendation scenarios and concerns various resources for both users and items. For users, there are significant differences in recommendation accuracy between users of different ages and genders in movie and music recommendations, with female users and older users receiving worse recommendation results [
22]. In addition to accuracy, existing studies have also found considerable differences in other recommendation measurements such as diversity and novelty [
91]. For items, existing research has found that minority items may receive worse ranking performance and fewer exposure opportunities [
6,
116]. Besides, in premium business scenarios, paid items may receive worse service from the platform than non-paid items [
56]. Moreover, there are potential unfairness issues with various recommendation methods. Both traditional recommendation methods [
22] and deep learning models [
39] can suffer from unfairness.
Mitigating these unfairness phenomena is of great importance for recommender systems, for the following reasons: (1) From an
ethical perspective, as early as ancient Greece, fairness was listed by Aristotle as one of the crucial virtues to make people live well [
3]. Fairness is an important virtue, and a fundamental requirement for a just society [
74]. (2) From a
legal perspective, anti-discrimination laws [
38] require that employment, admissions, housing, and public services do not discriminate against different groups of people based on gender, age, race, and so on. For example, minority-owned companies should be recommended at a similar rate to white-owned companies in a job recommendation scenario [
59]. (3) From a
user perspective, a fair recommender system facilitates the exposure of diverse information in the recommendations, including some niche information, which may help break the information cocoon, alleviate societal polarization, broaden users’ horizons, and enhance the value of recommendations. (4) From an
item perspective, a fair recommender system can allocate more exposure to long-tail items, alleviating the Matthew effect [
54]. It may also motivate the providers of niche items and thus improve the diversity and creativity of items. (5) From a
system perspective, a fair recommender system is conducive to its long-term interest. For example, an unfair recommender system may recommend popular content for users with niche interests, resulting in a bad experience. Similarly, it may also provide little exposure for niche providers. The lack of positive feedback may lead to a tendency for niche groups to leave the platform, which will reduce the diversity of content and users on the platform in the long run and affect the platform’s growth [
65]. Therefore, addressing unfairness is a critical issue for recommender systems.
A concept closely related to fairness is bias, which has also attracted extensive attention in recent years. Some biases in recommender systems can lead to unfairness problems, such as popularity bias [
113] and mainstream bias [
53]. There are also some biases that have little to do with fairness, such as position bias. Generally speaking, fairness reflects normative ideas about how a recommender system should behave, while bias is more concerned with statistical issues, such as the difference between what the model learns and the real world.
Although fairness has been studied in computer science for decades [
27], and there is a lot of related work in machine learning [
100,
101], fairness in recommendation has its unique problems. First, recommender systems are two-sided platforms that serve both users and items, where two-sided fairness needs to be guaranteed. Second, fairness in recommendation is dynamic in nature, as there exists a feedback loop between users and the system. Third, on most platforms, recommendations need to be personalized to the unique needs of each user, so fairness in recommendation should also take personalization into account. Furthermore, apart from accuracy, fairness needs to be jointly considered with other measurements in recommendation, such as diversity, explainability, and novelty. Therefore, current fairness work in machine learning, which mainly focuses on classification, can hardly be leveraged in recommender systems directly.
For the above reasons, fairness in recommendation has become an important topic in the research community, and the attention it attracts keeps increasing; the trends are shown in Figure
1. As shown in Table
1, more than 60 fairness-related papers about recommendation have been published in top IR-related conferences and journals (e.g., TOIS, SIGIR, WWW, and KDD) in the past five years. In the table, studies on fairness are summarized with their different definitions, targets, subjects, granularity, and optimization objects (details on these definitions are given in Table
3 and Section
3). From the table, we can identify the focus of current studies. For example, consistent fairness (CO) is the most common definition of fairness, and current studies mainly focus on the group level. These trends are further discussed in the corresponding sections below.
Research on fairness in recommendation is blossoming. However, due to the various scenarios, diverse stakeholders, and different measurements, research on fairness in recommendation is scattered. To fill this gap, this survey systematically reviews existing research on fairness in recommendation from several perspectives. The corresponding summary and discussion can guide and inspire future work. In summary, the contributions of this survey are as follows:
•
We summarize existing definitions of fairness in recommendation and provide several views for classifying fairness issues in recommendation.
•
We introduce some widely used measurements for fairness in recommendation and review fairness-related recommendation datasets in previous studies.
•
We review current methods for fair recommendations and provide an elaborate taxonomy of methods.
•
We outline several promising future research directions from the perspective of definition, evaluation, algorithm design, and explanation.
Several surveys are related to the topic of this survey. As far as we know, Castillo [13] was the first to briefly review fairness and transparency in information retrieval. However, it only covers related work before 2018, and fairness in recommendation has developed greatly in recent years. References [
14,
63] concentrate on fairness in machine learning, but fairness in recommendation is not covered, especially its unique characteristics. Chen et al. [
16] recently reviewed bias in recommender systems and introduced fairness issues, but fairness is not their main focus, and fairness measurements and datasets are not covered. To the best of our knowledge, there is no survey dedicated to systematically reviewing and detailing fairness in recommendation from a complete view.
This survey is structured as follows: In Section
2, we introduce existing definitions of fairness in recommendation and discuss some related concepts. In Section
3, we present several perspectives to classify fairness issues in recommendation. In Section
4, we introduce representative measurements of fairness in recommendation. In Section
5, we provide a taxonomy of methods to address unfairness in recommendation. In Section
6, we introduce fairness-related datasets in recommender systems. In Section
7, we present possible future research directions. We conclude this survey in Section
8.
4 Measurements of Unfairness in Recommendation
4.1 Overview of Fairness Metrics
We introduce some widely used metrics for fairness in recommendation, as shown in Table
5. Since there are different fairness definitions, the measurements of unfairness are not the same. Moreover, as the characteristics of fairness issues mentioned in Section
3 also affect the design and choice of fairness metrics, different metrics have different scopes of application, which are also marked in Table
5.
As demonstrated in Table
5, most fairness metrics are proposed for outcome fairness, as it is the focus of most work, and more metrics target consistent fairness and calibrated fairness than the other definitions. Thus, we mainly present the corresponding metrics for these two fairness definitions in Sections
4.2 and
4.3, respectively, and show all the others in Section
4.4.
When selecting fairness metrics based on definitions, it is important to note that different metrics do not have the same scope of application. For consistent fairness, Absolute Difference, Variance, and the Gini coefficient are commonly used measurements at the two-group, multi-group, and individual levels, respectively. These three metrics have a wide range of applicability to different subjects, granularity, and optimization objects. For calibrated fairness, KL-divergence and the L1-norm are common measurements for multi-group and individual fairness. These two metrics also have broad applicability. Since group-level calibrated fairness studies usually involve many groups, there are no metrics specifically designed for the two-group situation. These common metrics are generic and can be used for both users and items but are relatively coarse-grained. They have two main drawbacks:
(1)
These common metrics typically describe groups using a first-order moment such as the average, ignoring higher-order information;
(2)
These metrics do not consider the characteristics of user fairness and item fairness.
To address the first point, some researchers [
90,
114] use statistical tests such as
KS statistic or
ANOVA, which consider the whole population distribution. For the second point, for users, some researchers [104] consider user fairness on each item and then aggregate the results; for items, some researchers [31, 102] consider unfairness at different positions and then aggregate them. Although more limited in application, these metrics can be more appropriate for specific fairness issues. Specific details of these metrics are described below.
Since the metrics for different fairness definitions are not the same, we next present the corresponding metrics based on the fairness definitions. The meanings of the commonly used symbols are shown in Table
6.
4.2 Metrics for Consistent Fairness (CO)
As mentioned in Section
2, current work on consistent fairness in recommendation requires that all individuals or groups should be treated similarly. Therefore, the corresponding measurements mainly measure the inconsistency of the utility distribution. Most metrics apply to both user fairness and item fairness. They consider the utility of each individual or group as a number and then measure the inconsistency of these numbers. Since there are many metrics for consistent fairness and early studies concentrate on situations with only two groups, we present these metrics in the order of metrics for two groups, multiple groups, and individuals.
Absolute Difference. Absolute Difference
(AD) is the absolute difference of the utility between the protected group
\(G_0\) and the unprotected group
\(G_1\). For users, the group utility
\(f(G)\) is often defined as the average predicted rating [
114] or the average recommendation performance in the group
G [
28,
54]. For items, the group utility
\(f(G)\) can be defined as the total exposure in the recommendation lists for the group
G [
93]. The lower the value, the fairer the recommendations.
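In the notation above, a minimal formulation (the exact form may vary slightly across papers) is
\[ \mathrm{AD} = \bigl|\, f(G_0) - f(G_1) \,\bigr|. \]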
KS statistic. Kolmogorov-Smirnov statistic is a nonparametric test used to determine the equality of two distributions. It measures the area difference between two empirical cumulative distributions of the utilities for groups. The utilities are often defined as the predicted ratings in the group [
43,
114]. Compared to
AD, which uses only the average utility, the KS statistic can capture higher-order inconsistency. The lower the value, the fairer the recommendations.
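A plausible reconstruction of the interval-based computation, consistent with the symbol definitions below (reading \(\mathcal{G}(R_0,i)\) as a per-interval count that is accumulated into an empirical distribution), is
\[ \mathrm{KS} = \sum_{i=1}^{T} l \cdot \left| \frac{\sum_{j=1}^{i} \mathcal{G}(R_0,j)}{|R_0|} - \frac{\sum_{j=1}^{i} \mathcal{G}(R_1,j)}{|R_1|} \right|. \]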
Here, T is the number of intervals in the empirical cumulative distribution, l is the size of each interval, \(\mathcal {G}(R_0,i)\) is the number of utilities of the group \(G_0\) that are inside the ith interval.
rND, rKL, and rRD. rND, rKL, and rRD measure item exposure fairness for a ranking
\(\tau\) [
102]. Unlike previous metrics, these metrics take the exposure position into account, calculating a normalized discounted cumulative unfairness similar to NDCG. Experiments show that rKL is smoother and more robust than rRD, and that rRD has a limited application scope. The lower the value, the fairer the recommendations are for these metrics:
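As an example of this family, rND can be reconstructed as follows (a hedged sketch based on the symbol definitions below; rKL and rRD replace the absolute-difference term with a KL-divergence term and a ratio-difference term, respectively, and \(N\) denotes the total length of the ranking, a symbol not named in the original text):
\[ \mathrm{rND}(\tau) = \frac{1}{Z} \sum_{i \in \{10, 20, \dots\}} \frac{1}{\log_2 i} \left| \frac{|S^{+}_{1...i}|}{i} - \frac{|S^{+}|}{N} \right|. \]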
Here, the normalizer Z is the highest possible value of the corresponding measurement, \(|S^+_{1...i}|\) is the number of protected group members in the top-i of the ranking \(\tau\), and \(|S^+|\) is the number of protected group members in the whole ranking.
Pairwise Ranking Accuracy Gap. Pairwise Ranking Accuracy Gap (PRAG) measures item unfairness in a pairwise manner [
6,
93]. Unlike previous metrics focusing on exposure or click-through rate, PRAG measures the unfairness of pairwise ranking accuracy, and it is calculated on data from randomized experiments. The lower the value, the fairer the recommendations.
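One plausible reconstruction, consistent with the symbol definitions below, is
\[ \mathrm{PairAcc}(I_1 \succ I_2 \mid q) = \frac{\sum_{x_i \in I_1,\, x_j \in I_2} \mathbb{1}[y_i > y_j]\; \mathbb{1}[f(x_i) > f(x_j)]}{\sum_{x_i \in I_1,\, x_j \in I_2} \mathbb{1}[y_i > y_j]}, \]
\[ \mathrm{PRAG} = \bigl|\, \mathrm{PairAcc}(I_1 \succ I_2 \mid q) - \mathrm{PairAcc}(I_2 \succ I_1 \mid q) \,\bigr|. \]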
Here,
PairAcc represents the ranking accuracy for a pair of items
\(x_i, x_j\) from different groups
\(I_1,I_2\).
\(f(x_i)\) and
\(f(x_j)\) are the predicted scores for the recommendation query
q.
\(y_i\) and
\(y_j\) are the true feedback, which are collected through randomized experiments.
Value Unfairness and its variants. Value unfairness is proposed to measure inconsistency in signed prediction error between two user groups [
104]. There are three variants of Value unfairness. Absolute Unfairness measures the inconsistency of absolute prediction error, while Underestimation Unfairness and Overestimation Unfairness measure inconsistency in how much the predictions underestimate and overestimate the true ratings, respectively. The lower the value, the fairer the recommendations.
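Value unfairness can be written as follows (a reconstruction following [104], where \(n\) is the number of items; the variants replace the signed errors with absolute, under-, or over-estimation terms):
\[ U_{val} = \frac{1}{n} \sum_{i=1}^{n} \Bigl| \bigl( E_0[\hat{r}]_i - E_0[r]_i \bigr) - \bigl( E_1[\hat{r}]_i - E_1[r]_i \bigr) \Bigr|. \]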
Here,
\(E_0[\hat{r}]_i\) is the average predicted score for the
ith item from group 0, and
\(E_0[r]_i\) is the average rating for the
ith item from group 0.
The above metrics are only applicable to measuring inconsistency between two groups. In the following, we present metrics that measure unfairness for three or more groups. It is worth noting that, since individual fairness can be considered a special case of group fairness (i.e., each individual belongs to a unique group), the group fairness metrics below can theoretically also be applied to individual fairness. However, in practice, the common metrics for individual and group fairness are different.
Variance. Variance is a commonly used metric for dispersion, which is applied at both the group level [
73,
97] and the individual level [
73,
97,
99]. The utility can be the rating prediction error [
73], the predicted recommendation satisfaction for a single user [
97,
99], and the average exposure for an item group [
97]. The lower the value, the fairer the recommendations.
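With \(f(v_i)\) denoting the utility of the ith individual or group among \(n\), the metric takes the standard form
\[ \mathrm{Var} = \frac{1}{n} \sum_{i=1}^{n} \bigl( f(v_i) - \bar{f} \bigr)^2, \qquad \bar{f} = \frac{1}{n} \sum_{i=1}^{n} f(v_i). \]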
Min-Max Difference. Min-Max Difference (MMD) is the difference between the maximum and the minimum of all allocated utilities. This metric is used to measure the inconsistency of the average exposure for multiple item groups [
36], and the disagreement for users in group recommendation at the individual level [
86]. The lower the value, the fairer the recommendations.
F-statistic of ANOVA. The one-way analysis of variance (ANOVA) is used to determine whether there are statistically significant differences between the mean values of three or more independent groups. Its F-statistic can be considered a fairness measurement. The utility can be the rating prediction error for a single rating [
90]. The lower the value, the fairer the recommendations.
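A standard form of the one-way ANOVA F-statistic, written with the symbols explained below (\(K\) is the number of groups, \(n_i\) the size of group \(v_i\), and \(N\) the total number of individuals; these counts are not named in the original text), is
\[ F = \frac{\sum_{i=1}^{K} n_i \, (\overline{v}_i - \overline{v})^2 \,/\, (K-1)}{\sum_{i=1}^{K} \sum_{ind_j \in v_i} \bigl( f(ind_j) - \overline{v}_i \bigr)^2 \,/\, (N-K)}. \]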
Here,
\(f(ind_j)\) is the utility of an individual belonging to
\(v_i\),
\(\overline{v}_i\) is the mean utility of group
\(v_i\),
\(\overline{v}\) is the mean utility of all individuals.
In the following, we present some metrics commonly used for individual fairness. Note that in addition to the metrics below, Variance above is also often used to measure individual fairness.
Gini coefficient. Gini coefficient is widely used in sociology and economics to measure the degree of social unfairness [
28,
30,
52,
60,
61]. To our knowledge, it is also the most commonly used metric for consistent individual fairness. The utility can be the predicted relevance for a user [
28,
52] or the exposure for an item [
30,
60,
61]. The lower the value, the fairer the recommendations.
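A standard formulation over \(n\) individuals (papers may use the equivalent form based on sorted utilities) is
\[ \mathrm{Gini} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \bigl| f(v_i) - f(v_j) \bigr|}{2 n \sum_{j=1}^{n} f(v_j)}. \]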
Jain’s index. Jain’s index [
41] is commonly used to measure unfairness in network engineering. Some studies use it to measure the inconsistency of predicted user satisfaction in group recommendations [
99] and the inconsistency of item exposure [
112]. The higher the value, the fairer the recommendations.
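Jain's index over \(n\) individuals takes the standard form
\[ J = \frac{\bigl( \sum_{i=1}^{n} f(v_i) \bigr)^2}{n \sum_{i=1}^{n} f(v_i)^2}, \]
which equals 1 when all utilities are equal and \(1/n\) in the most unequal case.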
Entropy. Entropy is often used to measure the uncertainty of a system. In recommendation, it is used to measure the inconsistency of item exposure [
60,
61,
71]. The higher the value (i.e., the more uniform the exposure distribution), the fairer the recommendations.
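With the utilities normalized into a distribution \(p(v_i) = f(v_i) / \sum_{j} f(v_j)\), the metric is the standard Shannon entropy
\[ H = -\sum_{i=1}^{n} p(v_i) \log p(v_i), \]
which is maximized when the exposure is uniformly allocated.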
Min-Max Ratio. Min-Max Ratio is the ratio of the minimum to the maximum of all allocated utility. Some studies [
45,
99] use it to measure the inconsistency of the predicted user satisfaction in group recommendation. The higher the value, the fairer the recommendations.
Least Misery. Least Misery is the minimum of all allocated utility. It is also a commonly used fairness metric in group recommendation [
45,
75,
99]. The higher the value, the fairer the recommendations.
4.3 Metrics for Calibrated Fairness (CA)
Calibrated fairness requires defining the merit of an individual or group. We denote \(Merit(\cdot)\) as a merit function that measures the merit of an individual or group. We can calculate the fair distribution of the allocation based on \(Merit(\cdot)\), i.e., the proportion of the individual’s or group’s allocation to the total allocation in the fair case, i.e., \(p_f(v_i) = \frac{Merit(v_i)}{\sum _j Merit(v_j)}\). We can also calculate the proportion of the total allocation for an individual or group in the current situation, i.e., \(p(v_i) = \frac{f(v_i)}{\sum _j f(v_j)}\). Most measurements of calibrated fairness measure the difference between the distribution of utilities p and the distribution of merits \(p_f\).
Since all the group fairness metrics in calibrated fairness can be applied to multiple groups, we will present them in the order of group fairness and individual fairness.
MinSkew and
MaxSkew. The deviation (Skew) on a certain group
v can be defined as
\(\log (\frac{p_f(v)}{p(v)})\). MinSkew and MaxSkew are then defined as the minimum and maximum of Skew over all groups, respectively. Here, the utility can be the exposure of the item group, while the
\(p_f\) is a predefined distribution [
31]. For MinSkew, the higher the value, the fairer the recommendations. For MaxSkew, the lower the value, the fairer the recommendations.
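Following the Skew definition above, the two metrics can be written as
\[ \mathrm{MinSkew} = \min_{v} \log\frac{p_f(v)}{p(v)}, \qquad \mathrm{MaxSkew} = \max_{v} \log\frac{p_f(v)}{p(v)}. \]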
KL-divergence. KL-divergence measures how one probability distribution differs from another. It can be used to measure the difference between
\(p_f\) and
p. Here, the utility can be the exposure of the item group, while the
\(p_f\) can be calculated by the group’s historical exposure [
56,
84,
90]. The lower the value, the fairer the recommendations.
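A minimal formulation (the direction of the divergence may differ across papers) is
\[ D_{KL}(p_f \,\|\, p) = \sum_{v} p_f(v) \log \frac{p_f(v)}{p(v)}. \]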
NDKL. NDKL is an item unfairness measure based on KL-divergence [
31]. It computes the KL-divergence for each position and then obtains a normalized discounted cumulative value. The lower the value, the fairer the recommendations.
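A hedged reconstruction consistent with the symbol definitions below is
\[ \mathrm{NDKL}(\tau) = \frac{1}{Z} \sum_{i=1}^{|\tau|} \frac{1}{\log_2 (i+1)} D_{KL}^{i}, \qquad Z = \sum_{i=1}^{|\tau|} \frac{1}{\log_2 (i+1)}. \]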
Here, the normalizer Z is computed as the highest possible value, and
\(D_{KL}^i\) is the KL-divergence of the top-i ranking.
JS. Like KL-divergence, JS-divergence also measures how one probability distribution differs from the other. Some work [
66] uses JS-divergence as a metric instead of KL-divergence, as it is symmetrical while KL-divergence is asymmetrical. The lower the value, the fairer the recommendations.
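With \(m = \frac{1}{2}(p_f + p)\), the JS-divergence is
\[ \mathrm{JS}(p_f, p) = \frac{1}{2} D_{KL}(p_f \,\|\, m) + \frac{1}{2} D_{KL}(p \,\|\, m). \]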
Overall Disparity. Overall disparity measures the average disparity of the proportion of the utility and merit among different groups. The utility can be exposure-based or click-based [
67,
103]. The lower the value, the fairer the recommendations.
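One plausible reading of this description (an assumption on our part, as the exact averaging scheme is not specified here) is the average gap between the utility and merit proportions over all groups:
\[ \mathrm{OD} = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \bigl| p(v) - p_f(v) \bigr|. \]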
Generalized Cross-entropy. Generalized cross-entropy [
19,
56] also measures how one probability distribution is different from the other. The higher the value, the fairer the recommendations.
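Up to sign and normalization conventions, generalized cross-entropy can be written as (a reconstruction following [19, 56])
\[ \mathrm{GCE} = \frac{1}{\alpha (1-\alpha)} \Bigl[ \sum_{v} p_f(v)^{\alpha} \, p(v)^{1-\alpha} - 1 \Bigr], \]
which, for \(\alpha \in (0,1)\), attains its maximum value of 0 when \(p\) matches \(p_f\).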
Here,
\(\alpha\) is a hyperparameter.
In the following, we present calibrated fairness measures frequently used at the individual level.
L1-norm. The L1-norm of a vector is the sum of the absolute values of its entries. Some researchers [
7,
8,
48] treat the merit and utility distributions as vectors and then use the L1-norm to calculate the distance between the vectors. This metric is often used for individual-level measurement [
7,
8], and there is also work [
48] that uses it to measure group-level unfairness. The lower the value, the fairer the recommendations.
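Treating the two distributions as vectors, the metric is
\[ \| p_f - p \|_1 = \sum_{v} \bigl| p_f(v) - p(v) \bigr|. \]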
It is worth noting that some measures of calibrated fairness and consistent fairness are interconvertible. Theoretically, for a calibrated fairness measurement, if we set \(p_f\) to a uniform distribution, then it becomes a measurement for consistent fairness. Similarly, for a consistent fairness measurement that contains \(f(v)\), if we set \(f(v)\) to \(\frac{p(v)}{p_f(v)}\), then it becomes a calibrated fairness measurement.
4.4 Metrics for Other Fairness Definitions
4.4.1 Metrics for Envy-free Fairness (EF).
Envy-free fairness requires a definition of envy, which can differ across scenarios. In group recommendations, different users in the group receive the same recommendations. Serbos et al. [77] define envy as follows:
Envy-freeness (in group recommendation). Given a group G, a group recommendation package P, and a parameter \(\delta\), we say that a user \(u \in G\) is envy-free for an item \(i \in P\) if \(r_{u,i}\) is in the top-\(\delta\)% of the preferences in the set \(\lbrace r_{v,i} : v \in G\rbrace\).
This envy definition can be applied to a single item. It means that a user u feels envy about an item if at least \(\delta\)% of the users in the group like this item more than u does. It is impossible for all users in a group to be envy-free for every item in the package. In practice, m-envy-freeness is often used, which means that a user in the group is envy-free for at least m items.
A measurement for envy-free fairness can be the proportion of m-envy-free users:
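Using the symbol explained below, this proportion is simply
\[ \mathrm{EF}_m = \frac{|G_{ef}|}{|G|}, \]
where the name \(\mathrm{EF}_m\) is ours rather than a notation from the cited work.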
where
\(|G_{ef}|\) is the number of m-envy-free users. The higher the value, the fairer the recommendations.
In general recommendations, different users receive different recommendations. Patro et al. [
71] define envy-freeness as follows:
Envy-freeness (in general recommendation). Given a utility metric f and all the recommendation lists \(\mathcal {L}\), we say that a user u is envy-free toward a user v if and only if \(f(l_u,u) \ge f(l_v,u)\), and the degree of envy can be defined as \(max(f(l_v,u) - f(l_u,u),0)\). Here, \(f(l,u)\) is the sum of predicted relevance scores for user u over the recommendation list l.
This envy definition is applied to each pair of users. Unlike envy in group recommendations, this definition does not involve a third user. Moreover, it is feasible to make all users envy-free if the utility metric is properly chosen.
The average of envy among users can be a measurement of envy-free fairness:
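With the pairwise envy defined below, the measurement averages over ordered user pairs (the exact normalization is our assumption):
\[ \mathrm{Envy} = \frac{1}{|U| \, (|U|-1)} \sum_{u_i \ne u_j} envy(u_i, u_j). \]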
where
\(envy(u_i,u_j) = max(f(l_j,u_i) - f(l_i,u_i),0)\). The lower the value, the fairer the recommendations.
4.4.2 Metrics for Counterfactual Fairness (CF).
Li et al. [
55] demonstrate that counterfactual user fairness can be guaranteed when user embeddings are independent of fairness-related attributes. Therefore, they use a classifier to predict fairness-related attributes based on user embeddings and use classification measurements to measure counterfactual fairness. The classification measurements can be Precision, Recall, AUC, and F1; the closer these scores are to those of random guessing, the fairer the recommendations.
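As an illustration, a minimal sketch of such a classifier-based probe is given below (our own illustrative code, not the exact protocol of [55]; the function name and the use of logistic regression are assumptions):

# our own illustrative sketch: probe whether user embeddings leak a sensitive attribute
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def attribute_leakage_auc(user_emb: np.ndarray, sensitive_attr: np.ndarray) -> float:
    # user_emb: (n_users, dim) learned user embeddings; sensitive_attr: binary labels
    X_tr, X_te, y_tr, y_te = train_test_split(
        user_emb, sensitive_attr, test_size=0.2, random_state=0, stratify=sensitive_attr)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # AUC close to 0.5 suggests the embeddings carry little information
    # about the sensitive attribute (i.e., fairer by this measurement)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])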
4.4.3 Metrics for Rawlsian Maximin Fairness (RMF).
Rawlsian maximin fairness argues that fairness depends on the worst-off individual or group. A simple measurement is the utility of the worst case, but it is vulnerable to noise. To make the metric robust, some work [
115] uses the average utility of the bottom n% as a measurement. The higher the value, the fairer the recommendations.
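Denoting by \(V_{n\%}\) the set of individuals or groups with the lowest n% of utilities (our notation), the metric is
\[ \mathrm{RMF} = \frac{1}{|V_{n\%}|} \sum_{v \in V_{n\%}} f(v). \]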
4.4.4 Metrics for Maximin-shared Fairness (MSF).
Maximin-shared fairness requires the outcome of each individual to be more than its maximin share. A measurement for item maximin-shared fairness is the proportion of individuals satisfying this condition, where the maximin share for every item is a constant value, i.e., the average exposure [
71]. The higher the value, the fairer the recommendations.
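With \(E(i)\) denoting the exposure of item i and \(\mathcal{I}\) the item set (our notation), this proportion can be written as
\[ \mathrm{MSF} = \frac{\bigl| \{\, i \in \mathcal{I} : E(i) \ge \overline{E} \,\} \bigr|}{|\mathcal{I}|}, \qquad \overline{E} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} E(i). \]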
4.4.5 Metrics for Process Fairness (PR).
One criterion of process fairness is that the model should use fair representations. A fair representation should be independent of fairness-related attributes, so some work [
9,
96] trains a classifier to predict fairness-related attributes of users and items from their representations and then uses classification measurements (e.g., precision) to measure the fairness of the representations, similar to the counterfactual fairness measurements [
55].
8 Conclusion
Unfairness is widespread in recommender systems and has attracted increasing attention in recent years; a series of fairness definitions, measurements, and methods have been proposed. This survey systematically reviews fairness-related research in recommendation and summarizes current fairness work from multiple perspectives, including definitions, views, measurements, datasets, and methods.
For fairness definitions, previous studies mainly focus on outcome fairness, which we further classify according to different targets and concepts. We find that group fairness is the most common target, and consistent fairness and calibrated fairness are the most common concepts. As for fairness views, we present several views to classify fairness issues in recommendation, including fairness subjects, fairness granularity, and fairness optimization objects. For fairness measurements, we introduce representative measurements from existing work and summarize common metrics for different fairness definitions. As for fairness methods, we review representative studies on data-oriented methods, ranking methods, and re-ranking methods. It is common for researchers to develop ranking and re-ranking methods to achieve fair recommendations, while there are only a few studies on adjusting data to improve fairness. Additionally, we summarize fairness-related recommendation datasets appearing in previous fairness work so that researchers can find relevant datasets easily.
Furthermore, we discuss some promising future research directions for fairness in recommendation from different perspectives. In terms of definitions, which definition of fairness is most appropriate for recommender systems may be an important problem for future work. As for evaluation, we could develop effective benchmarks to compare different fairness methods fairly. In terms of algorithm design, we discuss some promising future work, including fairness methods for both users and items and fairness methods beyond accuracy. In terms of explanation, explaining why unfairness exists could be a problem worth exploring. Finally, we hope this survey may help readers better understand fairness issues in recommender systems and provide some inspiration.