
A Survey on Causal Inference for Recommendation

Huishi Luo (hsluo2000@buaa.edu.cn, 0000-0002-3553-2280), Fuzhen Zhuang (zhuangfuzhen@buaa.edu.cn, 0000-0001-9170-7009), Institute of Artificial Intelligence, Beihang University, Beijing, China, 100191; Ruobing Xie (ruobingxie@tencent.com, 0000-0003-3170-5647), WeChat Search Application Department, Tencent, Beijing, China, 100080; Hengshu Zhu (zhuhengshu@gmail.com, 0000-0003-4570-643X), the Career Science Laboratory, BOSS Zhipin, Beijing, China, 100028; Deqing Wang (dqwang@buaa.edu.cn, 0000-0001-6441-4390), SKLSDE, School of Computer Science, Beihang University, Beijing, China, 100191; Zhulin An (anzhulin@ict.ac.cn) and Yongjun Xu (xyj@ict.ac.cn), Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Abstract.

Causal inference has recently garnered significant interest among recommender system (RS) researchers due to its ability to dissect cause-and-effect relationships and its broad applicability across multiple fields. It offers a framework to model the causality in recommender systems like confounding effects and deal with counterfactual problems such as offline policy evaluation and data augmentation. Although there are already some valuable surveys on causal recommendations, they typically classify approaches based on the practical issues faced in RS, a classification that may disperse and fragment the unified causal theories. Considering RS researchers’ unfamiliarity with causality, it is necessary yet challenging to comprehensively review relevant studies from a coherent causal theoretical perspective, thereby facilitating a deeper integration of causal inference in RS. This survey provides a systematic review of up-to-date papers in this area from a causal theory standpoint and traces the evolutionary development of RS methods within the same causal strategy. Firstly, we introduce the fundamental concepts of causal inference as the basis of the following review. Subsequently, we propose a novel theory-driven taxonomy, categorizing existing methods based on the causal theory employed—namely, those based on the potential outcome framework, the structural causal model, and general counterfactuals. The review then delves into the technical details of how existing methods apply causal inference to address particular recommender issues. Finally, we highlight some promising directions for future research in this field. Representative papers and open-source resources will be progressively available at https://github.com/Chrissie-Law/Causal-Inference-for-Recommendation.

Recommender Systems, Causal Inference, Causal Learning

CCS: Information systems → Recommender systems

1. Introduction

Recommender systems (RS), working as filtering systems to present personalized information to users and alleviate information overload, have been widely deployed in various online applications, including e-commerce, social networks, and multimedia services. Recently, an emerging research direction has attracted increasing attention from RS researchers, which explores the integration of advanced machine learning with a traditional statistics field, causal inference. Causal inference (Gelman, 2011; Imbens and Rubin, 2015) analyzes the relationship between a cause and its effect (Pearl, 2009), and has a wide range of real-world applications in both academic and industrial domains, such as medicine (Kessler et al., 2019; Shalit, 2020), climate (Tan et al., 2021a), political science (Schlotter et al., 2011), and online advertising evaluation (Li et al., 2016; Fong et al., 2018). Treatment effect estimation is a fundamental problem in causal inference, often applied in policy evaluation. For example, in pharmaceutical research, where we are interested in the effect of a drug on lifespan, we need to answer a causal question involving a so-called intervention or treatment: What is the probability that a typical patient would survive for $L$ years if made to take the drug? A large-scale randomized controlled trial is the gold standard, but it can be expensive and may even raise ethical issues. Therefore, in most cases, we can only estimate the effect from non-randomized observational data, where the correlation between the drug and survival does not imply causation, because factors including age, gender, and severity of the disease may affect the outcomes.

Causality for recommendation has been widely used in uplift modeling for policy effect evaluation (Radcliffe, 2007; Gutierrez and Gérardy, 2017), but it was not until the last few years that research began to focus on applying it to model training. Common recommendation scenarios in practice, including click-through rate (CTR) prediction and post-click metric prediction, can be abstracted into causal problems, and causal inference can be applied in different stages of the entire RS project, such as preliminary data collection (Wang et al., 2021c), representation learning of user and item embeddings (Liu et al., 2021; He et al., 2022; Wang et al., 2023a), objective optimization (Mehrotra et al., 2018; McInerney et al., 2020; Bonner and Vasile, 2018), and offline and online policy evaluation (Sato et al., 2019; Saito and Joachims, 2021; Sato, 2021). Causal recommender systems can surpass traditional approaches, primarily due to two key strengths (Fig. 1):

1. Model cause and effect. The majority of current machine learning systems, including RS, operate predominantly in a statistical mode (Pearl, 2018; Xu et al., 2023c), which focuses on the correlation between variables. However, in applications, we care more about causality than correlation, and it is well known that “correlation is not causation”. For example, suppose a movie recommendation platform records that a female user has finished watching an action movie, concludes that she likes action movies, and makes many recommendations for related action movies. Nevertheless, the user may have watched the movie due to its popularity rather than her inherent preference for action movies. Therefore, the spurious correlation between user interest and movie genres learned by traditional recommender systems may lead to a degraded user experience. In contrast, causal recommendation systems can separately learn the causal effects of users’ individual interest and of conformity on the interaction outcome (i.e., watching), so that action movies will not be incorrectly recommended later. Modeling cause and effect enables causality-based recommender systems to 1) measure the causal effects on user interaction of the source variables of a wide range of biases, such as popularity (Zhang et al., 2021b; Wei et al., 2021) and exposure (Liang et al., 2016; Wang et al., 2021b), thus performing effective debiasing, which is currently the most common application of causal inference for recommendation; and 2) exert better control over RS through the decomposition and inference of the causal effects of variables, for example, leveraging the causal effect of a certain bias to improve recommendation accuracy (Zhang et al., 2021b).

2. Answer counterfactual questions. Many recommender system problems, including data augmentation, out-of-distribution (OOD) generalization, and policy evaluation, are essentially counterfactual problems, that is, they concern situations where the values of some causal variables are different from reality. 1) In terms of the data augmentation problem, as a significant complementary resource to the observed data (Wang et al., 2021c), the counterfactual data needs to answer questions such as “What would be the user’s interaction if the recommended items had been different?” or “What would be the probability of a click if an item had been recommended to a user who has not been recommended it before?”. 2) The OOD problem refers to recommendation settings that violate the Independent and Identically Distributed (IID) assumption on the interactions between training and testing periods (He et al., 2022). Traditional recommendation may learn false associations between users and items, while causal recommender systems adopt counterfactual means to find invariant variables or causal relationships in the recommendation task and reuse them to generalize when the distribution changes. For example, if a pregnant female user had purchased red high heels before pregnancy, a traditional recommendation system might continue to recommend high heels, but a causal recommender system can learn the causal relationship between high heels and pregnancy status through causal tools such as causal graphs. Therefore, when the user’s status shifts (identified from the user’s behavior, like purchasing baby products), the causal recommender system no longer recommends high-heeled shoes but retains the user’s preference for the red color and recommends red clothing instead. 
3) Uplift modeling estimates the increase, or uplift, in user interactions caused by recommendations, which is a counterfactual problem because one needs to estimate the difference between two mutually exclusive outcomes for an item (item $i$ is either recommended or not to a specific user). 4) In addition to the above issues, counterfactuals can also optimize beyond-accuracy objectives such as fairness and explainability. For example, to ensure fairness, sensitive features like sex and race can be modified or removed in the counterfactual world (Li et al., 2021a), and explainability can be attained by comparing the real world with the counterfactual world to search for user interactions that affect recommendation results (Ghazimatin et al., 2020; Tan et al., 2021b).

Figure 1. Strengths of causal inference for recommendation.

It is worth mentioning that there are some existing surveys (Wu et al., 2022; Gao et al., 2022c; Zhu et al., 2023a; Xu et al., 2023b) of causal inference for recommender systems. However, the present study distinguishes itself from these previous works for several reasons.

1. Theoretically coherent classification framework from a causal perspective. The aforementioned surveys fall short in providing a comprehensive taxonomy of causal recommender systems. Specifically, (Wu et al., 2022) only discusses recommendation methods of potential outcome framework (Rubin, 1974), and approaches investigated in (Gao et al., 2022c; Zhu et al., 2023a; Xu et al., 2023b) are mainly classified from the application perspectives, i.e., issues of recommender systems. This application-centric taxonomy, while practical, tends to obscure the underlying theoretical coherence of causal inference methods, as a single causal theory could be applied in various problems. Contrastingly, our survey adopts a more nuanced and theory-driven classification, involving the Neyman-Rubin potential outcome (PO) framework (Splawa-Neyman et al., 1990; Rubin, 1974) and the Pearl structural causal model (SCM) framework (Pearl, 1995, 1988, 2009). Within this paper, causal-based recommendation algorithms are categorized into three main types: PO-based, SCM-based, and general counterfactuals-based. Both PO-based and SCM-based methods utilize specific causal inference techniques, but the former does not explicitly employ causal structure information. On the other hand, general counterfactuals-based methods refer to those designed under the inspiration of counterfactual concepts, without using particular causal inference techniques. The classification framework is illustrated in Fig. 2. This taxonomy not only provides a more structured and holistic understanding of causal theories but also empowers researchers, particularly newcomers to the field of causal inference, to effectively grasp and apply these theories in practice.

2. Evolution of Causal Methods in Recommender Systems. We systematically delineate the developmental trajectory of the integration between prevalent causal inference theories and recommender systems, as illustrated in Figs. 7 and 16. Through this intuitive exposition, readers can readily perceive how methodologies within a specific domain have been iteratively proposed and the particular issues they address in their respective evolutions.

3. Up-to-Date Collection and Review. Given the growing popularity of this domain, our survey encompasses numerous recent publications absent in (Wu et al., 2022; Gao et al., 2022c; Zhu et al., 2023a; Xu et al., 2023b). We have collected papers related to causal inference-based recommender algorithms from esteemed conference proceedings and journals, and visualize their statistics by publication year and causal inference framework in Fig. 3.

This paper provides a comprehensive summary of the work on causal recommender systems. The rest of this survey is organized into six sections. Section 3 introduces the basic concepts within recommendation and causal inference of both frameworks, and highlights the distinctions between this survey and existing reviews in this field. Sections 4, 5, and 6 interpret causal recommendation approaches from the perspective of causal techniques: PO-based, SCM-based, and general counterfactuals-based, respectively. Future research directions are openly discussed in Section 7. The last section concludes this survey. Detailed discussions on related fields and more foundational concepts of causal inference theories are presented in the Supplemental Information. The main contributions of this survey are summarized below.

• Novel taxonomy: We separate various causal recommendation methods into three major categories based on the causal framework they adopt, which may be more instructive than existing taxonomies for the readers to integrate causal inference with recommender systems and propose new approaches in practice (see Section 3.3).

• Comprehensive review: Over one hundred and twenty causal recommendation papers from the last century to 2023 are introduced, explained, and summarized, which might give readers a comprehensive overview of causality for recommendation.

• Open discussion: Research directions for applying causality to improve recommendation methods in the academic and industrial areas are openly discussed.

2. Related fields

Some areas related to causal inference may be unfamiliar to recommender system researchers; thus, we introduce them and carefully clarify the connections and differences between them and causal inference.

Causal Discovery (Peters et al., 2017): Causal discovery is a crucial technique in causality. Causality is the science of cause and effect (Pearl and Mackenzie, 2018). In Pearl’s theory, causality contains two fundamental problems: the first is to prove that one variable is the cause of another or find the cause of a variable, and the second is to draw a conclusion of what might be the effects if changing the value of a variable. The former corresponds to causal discovery, also called causal structure learning, seeking to discover causal relations, which are stable physical mechanisms in nature and manifest themselves in determined functional relationships between variables, from the data based on some causal assumptions that are hardly testable in observational studies (Pearl, 2010). The latter corresponds to causal inference, which estimates the outcome after an intervention with a given causal relationship (usually a causal structure obtained by causal discovery or empirically based hypotheses) (Pearl and Mackenzie, 2018; Yao et al., 2021; Pearl, 2009). In other words, the former is the basis of the latter, because it is impossible to tell what variables would be affected by an intervention without a causal structure, and thus no interventions and counterfactuals can be implemented. For example, we cannot determine the effect of opening umbrellas on rain if we do not know the causal relevance between them since they always coincide. Note that most methods for causal discovery rely on the SCM framework (Naser, 2022).

Bayesian Inference: Bayesian inference is a popular approach to data analysis based on Bayes’ theorem, where all observed and unobserved parameters are given a joint probability distribution, i.e., the prior and data distributions, in order to infer the posterior distribution (van de Schoot et al., 2021). Bayesian inference is regarded as one of the key techniques and integral components in both PO and SCM: In PO, with assignment mechanisms and the definition of potential outcomes, a Bayesian model can be used to connect the treatment and potential outcomes in the real world or the counterfactual world; in SCM, Bayesian networks are widely used as causal graphs to present causal associations between variables. Nevertheless, there is a primary distinction between Bayesian inference and causal inference. Bayesian inference is causality-free statistics that focuses on associations, such as dependence, likelihood, etc., which can be formulated in terms of distribution functions. However, what is unique to causal inference is that causal concepts cannot be defined from statistical associations alone. As in the umbrella example mentioned above, it is impossible to tell from the statistics whether raining causes the behavior of opening umbrellas or vice versa. This core distinction leads to two differences between Bayesian inference and causal inference in their specific manifestations, including assumptions and notations. 1) Bayesian inference is based on associational assumptions, which, even if untested, are testable in principle (Pearl, 2010). Causal assumptions, in contrast, cannot be verified even in principle unless we proactively influence the observed data, i.e., resort to experimental control. 
In general, the sensitivity to priors in Bayesian statistics, such as the IID assumption, will decrease with increasing sample size, while sensitivity to prior causal assumptions, say, that opening umbrellas does not affect the weather, remains substantial regardless of sample size. 2) New notations are introduced to causal inference as causal expressions compared with Bayesian statistics, which are presented in detail in Section 3.

3. Foundation

In this section, background information and several important concepts of causal inference and recommender systems are introduced to facilitate readers’ understanding of the inter-study of the two research fields. The notations used in this survey are listed for convenience. At the end of this section, we set up categorizations of causal recommendations.

3.1. Causal Inference

In this part, we give a brief review of two representative frameworks of causal inference: the potential outcome (PO) framework by Rubin et al. (Splawa-Neyman et al., 1990; Rubin, 1974; Imbens and Rubin, 2015) and the structural causal model (SCM) framework by Pearl et al. (Pearl, 1995, 1988, 2009). Note that these two frameworks are logically equivalent (Pearl, 2009).

3.1.1. Potential Outcome Framework

The Potential Outcomes Framework (aka the Neyman-Rubin Causal Model) (Splawa-Neyman et al., 1990; Rubin, 1974; Imbens and Rubin, 2015) is the most widely used framework across many disciplines. With a hypothetical treatment (or manipulation, intervention), the causal effect, i.e., treatment effect, is defined as the difference between the potential outcomes under treatment and control for the same unit (Imbens and Rubin, 2015).

Definition 0 (Unit).

A unit refers to the research object in the potential outcome framework.

A unit can be a physical object, an individual, or a collection of objects or persons, such as a classroom or a market, at a particular point in time (Imbens and Rubin, 2015). In recommendation research, a user-item pair will usually be defined as a unit. It should be noted that the same physical object or person at a different time is a different unit. This is a reasonable restriction, considering that the same user may make different decisions at different times even if exposed to the same item, due to factors like preference shift, mood, occasion, and so on.

Definition 0 (Treatment).

Treatment can be defined as the action applied to a unit.

This paper focuses on binary treatment (e.g., recommend or not), the most common setting in the recommendation field. In practice, we refer to the more active treatment simply as the “treatment” $T=1$ and the other treatment as the “control” $T=0$ .

Potential Outcome. For each treatment-unit pair, the potential outcome is the outcome that would be observed if the treatment were applied to the unit, denoted as $Y(T=t)$ (omitting the unit index). For a unit, only the potential outcome corresponding to the treatment actually taken will be observed, termed the observed outcome, while the others are referred to as counterfactual outcomes. The fundamental problem of causal inference in the PO framework is that we can never obtain both observed and counterfactual outcomes for a unit: it is impossible to realize all treatments and observe the corresponding outcomes.

Treatment Effect/ Causal effect. Treatment effect is represented by the difference between the potential outcomes under treatment and control for the same unit, formulated as:

$$\textrm{TE}=Y(T=1)-Y(T=0), \tag{1}$$

where ${Y}(T=1)$ and ${Y}(T=0)$ are the potential treated and control outcomes of the unit, respectively. The treatment effect in Equation 1 is also called the Individual Treatment Effect (ITE). Furthermore, the treatment effect can be defined at the population and subpopulation levels. At the population level, the Average Treatment Effect (ATE) is the expectation of ITE over the whole population (Guo et al., 2020), denoted as:

$$\textrm{ATE}=\mathbb{E}[Y(T=1)-Y(T=0)]. \tag{2}$$

The treatment effect at the subpopulation level is often of particular interest; thus we define the Conditional Average Treatment Effect (CATE) on the units with the same features $X=x$ as:

$$\textrm{CATE}=\mathbb{E}[Y(T=1)-Y(T=0)\mid X=x]. \tag{3}$$
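The three quantities in Equations 1 to 3 can be illustrated with a minimal sketch. The toy table below is hypothetical, and it assumes that both potential outcomes of every unit are known, which is only possible in simulation and never in observed data.

```python
from statistics import mean

# Hypothetical toy data: each unit is a (user, item) pair described by a
# binary feature x and BOTH potential outcomes (known only in simulation).
# Format: (x, Y(T=0), Y(T=1))
units = [
    (0, 0, 1), (0, 0, 0), (0, 1, 1), (0, 0, 0),
    (1, 0, 1), (1, 0, 1), (1, 0, 1), (1, 1, 1),
]

def ite(unit):
    """Individual treatment effect, Equation 1: Y(T=1) - Y(T=0)."""
    _, y0, y1 = unit
    return y1 - y0

def ate(units):
    """Average treatment effect, Equation 2: mean ITE over all units."""
    return mean(ite(u) for u in units)

def cate(units, x):
    """Conditional average treatment effect, Equation 3: mean ITE where X = x."""
    return mean(ite(u) for u in units if u[0] == x)

print(ate(units), cate(units, 0), cate(units, 1))
```

In this toy population the effect of recommending is larger for units with $x=1$ than for units with $x=0$, a difference that the population-level ATE alone would hide.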

Assumptions. Despite the simple definition of the causal effect, the fundamental problem in causal inference, i.e., the missing data problem, appears to be a major obstacle to the estimation of the causal effect. Therefore, it is critical to make additional assumptions.

Assumption 1 (SUTVA).

The potential outcomes for any unit do not vary with the treatments assigned to other units. For each unit, there are no different forms or versions of each treatment level, which lead to different potential outcomes.

The stable unit treatment value assumption, or SUTVA (Imbens and Rubin, 2015) is the most fundamental assumption in causal inference, incorporating both the No Interference idea that treatments applied to one unit do not affect the outcome for another unit and the No Hidden Variations of Treatments concept that for each unit there is only a single version of each treatment level. The second assumption, ignorability or unconfoundedness (Rubin, 1990), states that treatment assignment is free from dependence on the potential outcomes.

Assumption 2 (Unconfoundedness / Ignorability).

Treatment assignment $T$ is independent of the potential outcomes given the background variables, i.e., $T\perp Y(T=0),Y(T=1)|X$ , also written as $\textrm{Pr}(T=1|X,Y(T=0),Y(T=1))=\textrm{Pr}(T=1|X)$ , where $X$ denotes the background variables.

In other words, within subpopulations defined by the values of observed background variables, or covariates, the treatment assignment is random. The ignorability assumption rules out unmeasured confounders, which causally influence both the treatment $T$ and the outcome $Y(T)$ . $\textrm{Pr}(T=1|X)$ is called the propensity score (Rosenbaum and Rubin, 1983). The last assumption is positivity, or overlap:

Assumption 3 (Positivity).

$0<\textrm{Pr}(T=t|X=x)<1,\forall t,x.$

In large data samples, positivity requires that there are both treated and control units for all values of the covariates. In contrast to the untestable ignorability assumption (Imbens and Rubin, 2015), positivity can be tested from observed data. The combination of unconfoundedness and positivity is referred to as “strong ignorability” (Rosenbaum and Rubin, 1983).
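Since positivity can be examined from observed data, a simple empirical check is sketched below. It assumes a discrete covariate and a hypothetical log of (covariate, treatment) pairs; the names `logs`, `propensity_by_stratum`, and `positivity_violations` are illustrative only. With finite samples, such a check can flag strata that contain only treated or only control units, but it cannot prove that positivity holds in the population.

```python
from collections import defaultdict

# Hypothetical logged data: (x, t) = (covariate stratum, binary treatment).
logs = [
    ("young", 1), ("young", 0), ("young", 1),
    ("old", 1), ("old", 1), ("old", 1),
]

def propensity_by_stratum(logs):
    """Empirical propensity score Pr(T=1 | X=x) for each observed x."""
    counts = defaultdict(lambda: [0, 0])  # x -> [n_treated, n_total]
    for x, t in logs:
        counts[x][0] += t
        counts[x][1] += 1
    return {x: treated / total for x, (treated, total) in counts.items()}

def positivity_violations(propensities):
    """Strata whose empirical propensity is exactly 0 or 1."""
    return [x for x, p in propensities.items() if p <= 0.0 or p >= 1.0]

scores = propensity_by_stratum(logs)
print(scores)
print(positivity_violations(scores))
```

Here the "old" stratum contains no control units, so no treatment effect can be estimated for it without further assumptions.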

3.1.2. Structural Causal Models Framework

Structural causal models (SCM) (Pearl, 1995, 1988, 2009) serve as a comprehensive causality framework, which unifies graphical models, nonparametric structural equations, and counterfactual and interventional logic. The most significant advantage of SCM is its intuitive structure of real-world causal dependencies based on graphical models as well as the wise and friendly symbiosis between counterfactual and graphical methods.

Causal Graph. A causal graph, or a causal diagram, is usually a Bayesian network, which describes the causal relations between variables by a Directed Acyclic Graph (DAG), where the nodes represent the variables and the edges record the causal relations. Causal graphs play an essential role in the SCM framework, for they provide a vivid representation of sets of variables that are relevant to each other in any given state of knowledge, and serve as a carrier of conditional independence relationships along the order of construction, through which we can confirm whether a graph satisfies the criteria under which certain causal inference methods can be applied (Pearl, 2009).

d-Separation. We first review the concept of directional separation (d-separation) as the knowledge base for conditional independence. There are three typical causal graphs of three disjoint sets of variables, shown in Fig. 4, with the help of which we can characterize any pattern of arrows in the network. In the chain (Fig. 4(a)), $B$ is the mediator that transmits the effect of $A$ to $C$ . In the fork (Fig. 4(b)), $B$ is often called a common cause or confounder of $A$ and $C$ . A confounder will make $A$ and $C$ statistically correlated even though there is no direct causal link between them, which may give rise to a so-called spurious correlation in the application. In the collider (Fig. 4(c)), though $A$ and $C$ are independent to begin with, conditioning on (i.e., knowing the value of) $B$ will make them dependent. A good example is three features of Hollywood actors: Talent $\rightarrow$ Celebrity $\leftarrow$ Beauty (Elwert and Winship, 2014). Although beauty and talent are completely unrelated to one another in the general population, an unanticipated negative correlation is found between talent and beauty if we only focus on famous actors: learning that a celebrity is unattractive increases our belief that he or she is talented (Pearl and Mackenzie, 2018). This negative correlation is sometimes called collider bias or the “explain-away” effect. In recommendation systems, a similar example can be found in analyzing popularity bias from the user’s perspective. The factors influencing user interaction can be summarized as: Conformity $\rightarrow$ Interaction $\leftarrow$ Inherent Interest (Zheng et al., 2021). When a user interacts with a popular item, it does not necessarily indicate his or her true preference for it; such interaction may be driven by the desire to conform to prevailing trends. 
Conversely, if a user engages with an unpopular item — where the influence of conformity is significantly reduced — it is more probable that the item is in close alignment with the user’s inherent interests.
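The explain-away effect in the collider Talent $\rightarrow$ Celebrity $\leftarrow$ Beauty can be reproduced with a small exact enumeration. The sketch assumes a deliberately simplistic, hypothetical mechanism: talent and beauty are independent fair coin flips, and a person becomes a celebrity if he or she is talented or beautiful.

```python
from itertools import product

# Hypothetical mechanism: talent and beauty are independent fair coins,
# and celebrity = talent OR beauty. Each row is (talent, beauty, celebrity);
# all four (talent, beauty) combinations are equally likely.
population = [(t, b, int(t or b)) for t, b in product((0, 1), repeat=2)]

def pr_talent(beauty, celebrity=None):
    """Pr(talent=1 | beauty), optionally also conditioning on celebrity."""
    rows = [r for r in population
            if r[1] == beauty and (celebrity is None or r[2] == celebrity)]
    return sum(r[0] for r in rows) / len(rows)

# Without conditioning on the collider, beauty says nothing about talent.
print(pr_talent(beauty=1), pr_talent(beauty=0))
# Among celebrities, an unattractive person must be talented.
print(pr_talent(beauty=1, celebrity=1), pr_talent(beauty=0, celebrity=1))
```

The same reading applies to Conformity $\rightarrow$ Interaction $\leftarrow$ Inherent Interest: among observed interactions, low conformity pressure makes inherent interest the more likely explanation.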

A path is a sequence of consecutive edges (of any directionality) in the graph, and a path is said to be blocked when the flow of dependency between the variables it connects is stopped. In the chain and fork, the path between $A$ and $C$ will be blocked by conditioning on $B$ , while in the collider, conditioning on $B$ will introduce a correlation between them. The formal definition of d-separation, or blocking, is as follows.

Definition 0 (d-Separation).

A path is said to be d-separated (or blocked) by conditioning on a set of nodes $\mathcal{Z}$ if and only if one of the two conditions is satisfied:

(1) The path contains a chain $A\rightarrow B\rightarrow C$ or a fork $A\leftarrow B\rightarrow C$ such that the middle node $B$ is in $\mathcal{Z}$ ;

(2) The path contains a collider such that the middle node $B$ is not in $\mathcal{Z}$ and such that no descendant of $B$ is in $\mathcal{Z}$ .
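The two conditions of the definition translate directly into a check on a single path. The following is a minimal sketch, assuming the DAG is stored as a dictionary that maps each node to the set of its children; the function names are hypothetical and not taken from any library.

```python
def descendants(graph, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def path_is_blocked(graph, path, z):
    """Return True if `path` (a list of nodes) is blocked by the set `z`.

    `graph` maps each node to the set of its children (a DAG is assumed).
    """
    z = set(z)
    for a, b, c in zip(path, path[1:], path[2:]):
        # b is a collider on this path if both neighbours point into it.
        is_collider = b in graph.get(a, ()) and b in graph.get(c, ())
        if is_collider:
            # Condition (2): neither b nor any descendant of b is in z.
            if b not in z and not (descendants(graph, b) & z):
                return True
        elif b in z:
            # Condition (1): a chain or fork whose middle node is in z.
            return True
    return False

chain = {"A": {"B"}, "B": {"C"}}
collider = {"A": {"B"}, "C": {"B"}, "B": {"D"}}
print(path_is_blocked(chain, ["A", "B", "C"], {"B"}))     # chain, B observed
print(path_is_blocked(collider, ["A", "B", "C"], set()))  # collider, nothing observed
print(path_is_blocked(collider, ["A", "B", "C"], {"D"}))  # descendant of collider observed
```

Two sets of variables are d-separated only if every path between them is blocked, so a full test would enumerate all paths and apply this check to each.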

Structural Equations. Besides causal graphs, structural equations are another representation of causal information, where the former is an abstraction of the latter. In its general form, a structural equation of a variable $Y$ is defined as:

$$Y=f_{Y}(Pa,U), \tag{4}$$

where $Pa$ (connoting parents) stands for the set of variables that directly determine the value of $Y$ and where $U$ represents exogenous variables or errors (or “disturbances”) due to omitted factors. For example, the causal graph in Fig. 5(a) is associated with the structural model as follows. In this section, uppercase letters are used to represent variables, while their corresponding lowercase counterparts denote the values of these variables. Thus, we have:

$$\begin{cases}a&=\ f_{A}(u_{A}),\\ b&=\ f_{B}(a,u_{B}),\\ c&=\ f_{C}(a,b,u_{C}),\end{cases} \tag{5}$$

where $U_{A}$ , $U_{B}$ and $U_{C}$ represent exogenous variables. A set of equations in the form of Equation 4 is called a structural model; if each variable has a distinct equation in which it appears on the left-hand side, then the model is called a structural causal model.

Intervention. The do-calculus allows researchers to carry out an intervention, interpreted as controlling the value of a variable, by purely mathematical means instead of by carrying out a physical experiment, which is one of the outstanding contributions of Pearl’s SCM framework. The do-calculus involves the do-operation, like $do(T=t)$ , which denotes the intervention of setting the variable $T$ to $t$ , realized by blocking the effect of $T$ ’s parents on $T$ and setting the value of $T$ to $t$ . For example, if we $do(B=b_{0})$ on the model in Fig. 5(a), Equations 5 will be modified as:

$$\begin{cases}a&=\ f_{A}(u_{A}),\\ b&=\ b_{0},\\ c&=\ f_{C}(a,b_{0},u_{C}),\end{cases} \tag{6}$$

the graphical description of which is shown in Fig. 5(b).
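The change from Equations 5 to Equations 6 can be mimicked in a few lines. The sketch below assumes simple linear forms for $f_{A}$ , $f_{B}$ , and $f_{C}$ ; these functional forms are hypothetical and chosen only so that the results are easy to verify by hand.

```python
def evaluate_scm(u_a, u_b, u_c, do_b=None):
    """Evaluate the structural model for given exogenous values.

    The linear forms of f_A, f_B, f_C are hypothetical. If `do_b` is given,
    the equation for b is replaced by b = do_b, which cuts the influence of
    its parent a on b, as in the intervened model.
    """
    a = u_a                                   # a = f_A(u_A)
    b = a + u_b if do_b is None else do_b     # b = f_B(a, u_B), or b = b_0
    c = 2 * a + 3 * b + u_c                   # c = f_C(a, b, u_C)
    return a, b, c

print(evaluate_scm(u_a=1, u_b=1, u_c=0))           # observational world
print(evaluate_scm(u_a=1, u_b=1, u_c=0, do_b=0))   # world under do(B=0)
```

Only the equation for $b$ is replaced; $a$ keeps its original mechanism, and $c$ still responds to both $a$ and the forced value of $b$ .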

It is crucial to note that $\textrm{Pr}(Y=y|do(T=t))$ and $\textrm{Pr}(Y=y|T=t)$ are not the same. For example, the fork structure (Fig. 4(b)) might represent the causal mechanism that connects the number of sales at a local ice cream shop on a given day ( $A$ ), that day’s temperature in the city ( $B$ ), and the number of violent crimes in the city on that day ( $C$ ) (Glymour et al., 2016). Because both ice cream sales and violent crime are more common in hot weather, a positive correlation might be found when estimating $\textrm{Pr}(C=c|A=a)$ . However, as illustrated by the manipulated graphical model of Fig. 5, an intervention removes the influence of a variable’s parents, so under $do(A=a)$ crime rates $C$ are independent of ice cream sales $A$ , which results in a different $\textrm{Pr}(C=c|do(A=a))$ from $\textrm{Pr}(C=c|A=a)$ .

Although causal graph manipulation is the most fundamental approach to calculating $\textrm{Pr}(Y=y|do(T=t))$ , it can be challenging and even impossible in reality. Fortunately, we can estimate $\textrm{Pr}(Y=y|do(T=t))$ from observed data with the following causal effect rule:

Definition 0 (The Causal Effect Rule).

Given a graph $G$ in which a set of variables $PA$ are designated as the parents of $T$ , the causal effect of $T$ on $Y$ is given by

$$\textrm{Pr}(Y=y\mid do(T=t))=\sum_{x}\textrm{Pr}(Y=y\mid T=t,PA=x)\textrm{Pr}(PA=x)=\sum_{x}\frac{\textrm{Pr}(T=t,Y=y,PA=x)}{\textrm{Pr}(T=t\mid PA=x)}, \tag{7}$$

where $x$ ranges over all the combinations of values that the variables in $PA$ can take.

The most important benefit brought by the rule is that it enables us to evaluate the do-expression purely from passive observational data (Xu et al., 2023a). The factor $\textrm{Pr}(T=t\mid PA=x)$ is the propensity score, and the last expression in Equation 7 corresponds to the inverse propensity score (Section 4.1.2) in the PO framework, which partly reflects the unity of the two frameworks.
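To make the rule concrete, the sketch below evaluates both forms of Equation 7 on a fork $A\leftarrow B\rightarrow C$ like the ice cream example, where $B$ is the only parent of the treatment $A$ . All probabilities are hypothetical values chosen for illustration, and the joint distribution is assumed to be known exactly rather than estimated from data.

```python
from itertools import product

# Hypothetical parameters for the fork A <- B -> C (all variables binary):
# B = hot day, A = high ice cream sales, C = high crime. C does not depend on A.
P_B = {1: 0.5, 0: 0.5}
P_A1_GIVEN_B = {1: 0.8, 0: 0.2}
P_C1_GIVEN_B = {1: 0.6, 0: 0.2}

def bern(p_one, value):
    """Probability of `value` under a Bernoulli with Pr(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

# Joint distribution Pr(A=a, B=b, C=c) implied by the fork.
joint = {
    (a, b, c): P_B[b] * bern(P_A1_GIVEN_B[b], a) * bern(P_C1_GIVEN_B[b], c)
    for a, b, c in product((0, 1), repeat=3)
}

def pr(**fixed):
    """Probability of the event given by keyword arguments a, b, c."""
    return sum(
        p for (a, b, c), p in joint.items()
        if all({"a": a, "b": b, "c": c}[k] == v for k, v in fixed.items())
    )

def conditional(c, a):
    """Observational quantity Pr(C=c | A=a)."""
    return pr(a=a, c=c) / pr(a=a)

def adjustment(c, a):
    """First form of Equation 7: sum_b Pr(C=c | A=a, B=b) Pr(B=b)."""
    return sum(pr(a=a, b=b, c=c) / pr(a=a, b=b) * pr(b=b) for b in (0, 1))

def inverse_propensity(c, a):
    """Second form of Equation 7: sum_b Pr(A=a, C=c, B=b) / Pr(A=a | B=b)."""
    return sum(pr(a=a, b=b, c=c) / (pr(a=a, b=b) / pr(b=b)) for b in (0, 1))

print(conditional(1, 1), conditional(1, 0))   # association: differs with A
print(adjustment(1, 1), adjustment(1, 0))     # intervention: same for both A
print(inverse_propensity(1, 1))               # agrees with the adjustment form
```

Under these assumed numbers the conditional probability of high crime changes with ice cream sales, whereas the interventional probability does not, which is exactly the gap between $\textrm{Pr}(C=c|A=a)$ and $\textrm{Pr}(C=c|do(A=a))$ .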

Counterfactuals. Counterfactuals are employed to emphasize our wish to compare two outcomes under the exact same conditions, differing only in one aspect: the antecedent, or hypothetical condition (Glymour et al., 2016). For example, in the counterfactual question “What would be the user’s interaction if the recommended items had been different?” mentioned in Section 1, we would like to compare the user’s interaction under the same conditions except for the recommended item. Counterfactuals, which describe situations that do not exist in reality, cannot be inferred by do-calculus. Fortunately, Pearl (Pearl, 2009) proposed a new set of notations: $\textrm{Pr}(Y(T=1)|T=0,Y=Y(T=0))$ denotes the probability of the outcome that would have occurred under $T=1$ , given that the observed treatment value is $T=0$ and that we observe $Y=Y(T=0)$ in the data.

3.1.3. Comparison between the two frameworks

As mentioned above, the two frameworks are logically equivalent: an assumption or a theorem in one can be translated to its counterpart in the other, and a problem solved in one framework would yield the same solution in the other (Pearl, 2009; Glymour et al., 2016). For example, $\textrm{Pr}(Y=y|do(T=t))$ in the SCM is equivalent to $\textrm{Pr}(Y(T=t)=y)$ in the PO framework, which is the regular assessment in a controlled experiment, in which the distribution of $Y$ is estimated for each level $t$ of a random variable $T$ . Causal effects that are measured between the results of the counterfactual world and the real world can be estimated conveniently in both frameworks. However, there are several important differences between PO and SCM. The most significant difference is that PO does not assume the causal relations between the variables of concern, while SCM makes assumptions about the causal mechanisms among a set of variables or searches for them based on some assumptions. In other words, any given PO model corresponds to multiple causal graphs in SCM. This is both a strength and a weakness of PO. On the one hand, causal effects can be reasoned about without knowing the causal model. On the other hand, since the mechanism is unknown, the unconfoundedness assumption requires all confounders to be observed in order to infer a correct treatment effect, which is almost impossible in practice (Aliprantis, 2015). In contrast, in SCM, causal diagrams allow us to work with causal effects by intervening on as few variables as possible or by relying on the observed variables as much as possible.

3.2. Recommender Systems

Recommender systems predict users’ preferences and proactively recommend items users might like (Ricci et al., 2015; Zhang et al., 2019) to alleviate information overload.

3.2.1. Recommendation Techniques

RSs are usually classified into the following three categories (Adomavicius and Tuzhilin, 2005; Zhang et al., 2019): content-based, collaborative filtering (CF), and hybrid. Content-based recommendation learns to recommend primarily based on comparisons across items’ and users’ auxiliary information (Zhang et al., 2019), such as items’ human-set tags, images, texts, and users’ sex. Collaborative filtering recommender systems recommend items according to user/item historical interactions, i.e., explicit (e.g., user’s previous ratings) or implicit feedback (e.g., click behavior) (Zhang et al., 2019). Hybrid approaches are those that combine collaborative filtering and content-based methods.

If we review the model structure, recommender systems can generally be divided into shallow models and neural network models. Shallow models involve methods that directly calculate the similarity of interactions and CF methods with matrix factorization (MF) (Koren et al., 2009) or factorization machine (FM) (Rendle, 2010), but suffer from insufficient learning of users’ complicated interest. Neural network models are proposed to solve this issue, with the advantage of high-order feature interactions (Guo et al., 2017). For example, Wide & Deep (Cheng et al., 2016) jointly trains linear models and deep neural networks to combine the benefits of memorization and generalization. Deep factorization machine (DeepFM) (Guo et al., 2017) combines traditional factorization machine (FM) with multi-layer perceptrons (MLP) in parallel. Graph neural networks (GNN)-based methods adopt embedding propagation to iteratively aggregate neighborhood embedding, thereby more effectively exploring structural information (Gao et al., 2022b).

3.2.2. Notation

Considering a general recommender system, we assume $\mathcal{O}$ and $\mathcal{O}^{-}$ denote the observed dataset and the unobserved dataset, respectively. Each observed sample includes the treatment $T$ , background features $X$ , and an interaction label $Y$ . Background features $X$ , a.k.a. covariates, are usually formulated as a high-dimensional sparse vector containing information such as user ID, item ID, user profile, item category, etc. The interaction label $Y$ , or the outcome, can be explicit feedback (e.g., rating) or implicit feedback (such as click and watch behavior).

In normal circumstances, researchers prefer choosing whether to recommend as the treatment. Therefore, the observed dataset can be denoted as ${\mathcal{O}}=\{(T=1,X,Y)\}|_{1}^{|\mathcal{O}|}\in\mathcal{T}\times\mathcal{X}\times\mathcal{Y}$ , where $\mathcal{T}$ is the treatment space, $\mathcal{X}$ is the feature space, and $\mathcal{Y}$ is the label space. In general, the observed dataset is obtained with the deployed recommender policy $\pi$ ; thus ${\mathcal{O}}$ will be written as ${\mathcal{O}_{\pi}}$ when the policy is of concern. Note that the settings of $T,X,Y$ vary slightly across specific works and are best understood with reference to the context.

3.3. Existing Categorizations of Causal Recommendation

There are several categorization criteria for causal recommender systems. For example, similar to (Yao et al., 2021), Wu et al. (Wu et al., 2022) divide biases in RS into three categories according to which causal assumption of the standard PO framework is violated. 1) Position bias and conformity bias can be seen as violations of the SUTVA assumption if recommender systems do not pay enough attention to the positions of items and users’ social networks. 2) Unconfoundedness and positivity are crucial assumptions for the recoverability of the target estimand. However, the former can be violated by popularity bias, and the latter can be violated by exposure bias, both of which result in the problem of missing not at random (MNAR). 3) The final category of bias violates some model-specific assumptions.

According to the survey (Gao et al., 2022c), existing work on causal recommendation can be categorized into three groups: addressing data bias, addressing data missing and noise, and achieving beyond-accuracy objectives. 1) Causal debiasing work can be further divided into several subcategories based on the specific bias, such as popularity bias, clickbait bias, and exposure bias. 2) The problem of data missing refers to the frequently discussed data sparsity issue in RS, and data noise stems from unreliable implicit signals and delayed feedback. To alleviate these issues, researchers use counterfactual techniques to augment insufficient data and adjust sample weights. 3) Finally, some causal recommender systems are designed for beyond-accuracy objectives such as explainability, diversity, and fairness.

Zhu et al. (Zhu et al., 2023a) summarize different causal inference techniques with an emphasis on debiasing, explainability promotion, and generalization improvement. Xu et al. (Xu et al., 2023b) introduce existing causal methods for explainable recommendation, fairness in recommendation, uplift-based recommendation, robust recommendation, and unbiased recommendation, respectively.

The four studies mentioned above are pioneering efforts in this field, and each has a distinct focus on the causal frameworks it discusses. However, this paper systematically classifies causal inference for RS from a new perspective, namely the causal theories employed. We argue that, while it is undeniably convenient for researchers, especially those embedded in the RS industry, to quickly reference existing causal methods based on the application issues they address, these categorizations result in a fragmented and non-systematic representation of causal theories, since a single causal theory could potentially be applied to resolve a variety of recommendation issues.

Consequently, this paper systematically classifies causal inference for RS from a new perspective of the employed causal approach. This taxonomy enables readers to grasp the progressive integration and iterative development of causal methods within RS, fostering an understanding of their advantages over previous techniques, as well as their inherent limitations. Such a comprehensive overview is instrumental for continuous research and paves the way for significant breakthroughs in the in-depth integration of causal inference and recommender systems, ensuring a more robust and holistic development in the field.

4. PO-based Methods

Many causal recommendation approaches, especially in early research, have focused on applying the potential outcome (PO) framework proposed by Donald B. Rubin (Splawa-Neyman et al., 1990; Rubin, 1974; Imbens and Rubin, 2015). These approaches primarily integrate PO-based causal inference into the optimization functions in traditional deep-learning-based methods or the reward functions in reinforcement-learning-based methods.

Fig. 2 illustrates the strategies and objectives concerning the PO framework in the context of RS, categorizing the strategies into two main types: propensity score and causal effect. The former generally leverages estimated propensity scores from causal inference methods to adjust importance weights, while the latter concentrates on the difference between potential outcomes under treatment and control (see Definition 1). Despite their different focuses, they are not entirely mutually exclusive. On the one hand, propensity scores can be utilized to adjust the weights of samples or the weights of outcomes within causal effects. On the other hand, causal effects can be estimated in a couple of ways. One approach involves directly modeling outcomes, exemplified by fitting two separate models (Radcliffe, 2007; Bonner and Vasile, 2018) to estimate $\mathbb{E}[Y(T=1)\mid X=x]$ and $\mathbb{E}[Y(T=0)\mid X=x]$ in the CATE (refer to Equation 3). Alternatively, propensity score-based methods like Inverse Propensity Scoring (IPS) or Doubly Robust (DR) can be applied to weight the potential outcome predictions.
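The two-model idea mentioned above can be sketched as follows. This is only an illustration under simplifying assumptions: covariates are a single discrete stratum label, each "model" is a per-stratum mean rather than a learned regressor, and the logged samples are made up. It also presumes unconfoundedness given the stratum, which real logs may not satisfy.

```python
from collections import defaultdict


def fit_mean_model(samples):
    """Return a lookup x -> mean outcome; a stand-in for any outcome regression model."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in samples:
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


def two_model_cate(logs):
    """logs: list of (x, t, y). Returns x -> estimated E[Y(1)|x] - E[Y(0)|x]."""
    treated_model = fit_mean_model((x, y) for x, t, y in logs if t == 1)
    control_model = fit_mean_model((x, y) for x, t, y in logs if t == 0)
    # Only strata observed under both treatment and control can be compared.
    return {x: treated_model[x] - control_model[x] for x in treated_model if x in control_model}


# Hypothetical logged samples: (user segment, recommended?, interacted?).
logs = [
    ("new_user", 1, 1), ("new_user", 1, 0), ("new_user", 0, 0), ("new_user", 0, 0),
    ("loyal_user", 1, 1), ("loyal_user", 1, 1), ("loyal_user", 0, 1), ("loyal_user", 0, 1),
]
cate = two_model_cate(logs)
print(cate)
```

In this toy log, recommendations change the behavior of new users but not of loyal users, so the estimated effect is concentrated in the first segment.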

It is essential to clarify that in this paper, models that estimate causal effects without explicitly utilizing causal structure information are classified as PO-based, whereas those explicitly incorporating causal structure information are categorized as SCM-based.

4.1. Propensity Score Strategy

Let’s consider the process by which a recommender system works. Given background variables $x\sim\textrm{Pr}(x)$ , also referred to as pre-treatment variables or covariates (Imbens and Rubin, 2015) (e.g., user and item features, time of the day, etc.), a recommender policy $\pi$ acts as a decision-making system that decides whether to take an active treatment $t\sim\pi(t\mid x)$ (e.g., recommend an item). The potential outcome $y\sim\textrm{Pr}(y\mid x,t)$ , i.e., the “reward” in the reinforcement learning context (e.g., click indicator), is then observed (Saito and Joachims, 2022). For example, in online markets, information such as the user profile, historical consumption, and products in the cart is treated as context variables $x$ , according to which the policy $\pi$ produces a list of recommended items (i.e., treatment $t$ ), and the logged reward $y$ can be the click signal, conversions, or revenue, etc. The effectiveness of the policy $\pi$ can be evaluated through its expected reward, formulated as:

(8)

R(\pi):=\iiint y\,\textrm{Pr}(y\mid x,t)\pi(t\mid x)\textrm{Pr}(x)\,dx\,dt\,dy=\mathbb{E}_{\textrm{Pr}(x)\pi(t\mid x)\textrm{Pr}(y\mid x,t)}[y].

To learn the optimal policy

(9)

\pi^{*}\in\underset{\pi\in\Pi}{\arg\max}\,R(\pi),

where $\Pi$ denotes the policy class, an online A/B test would be the best choice (Gomez-Uribe and Hunt, 2015; Kohavi et al., 2013), but it is expensive. A common substitute is offline evaluation, which calculates an estimator $\hat{R}$ for the reward of a target policy $\pi$ using logged data $\mathcal{O}_{\pi_{0}}$ collected by a logging policy $\pi_{0}$ (which is different from $\pi$ ) (Saito and Joachims, 2022). However, like many other empirical sciences, offline evaluation is challenged by the problem of missing not at random (MNAR).
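As a small illustration of Equations 8 and 9, the sketch below evaluates the expected reward of a few policies in a discrete setting and picks the best one. It assumes the mean reward of every context-treatment pair is known, which is precisely what is unavailable in practice and what A/B tests or offline estimators try to recover. The contexts, items, and numbers are hypothetical.

```python
# Assumed context distribution Pr(x) and mean rewards E[y | x, t]; illustrative only.
p_x = {"morning": 0.4, "evening": 0.6}
mean_reward = {
    ("morning", "news"): 0.30, ("morning", "music"): 0.10,
    ("evening", "news"): 0.05, ("evening", "music"): 0.25,
}


def expected_reward(policy):
    """Discrete analogue of Equation 8: sum over x and t of Pr(x) * pi(t|x) * E[y|x,t]."""
    return sum(
        p_x[x] * policy[x][t] * mean_reward[(x, t)]
        for x in p_x
        for t in policy[x]
    )


# A tiny hypothetical policy class; each policy maps a context to treatment probabilities.
policy_class = {
    "always_news": {x: {"news": 1.0, "music": 0.0} for x in p_x},
    "always_music": {x: {"news": 0.0, "music": 1.0} for x in p_x},
    "contextual": {
        "morning": {"news": 1.0, "music": 0.0},
        "evening": {"news": 0.0, "music": 1.0},
    },
}

rewards = {name: expected_reward(pi) for name, pi in policy_class.items()}
best_policy = max(rewards, key=rewards.get)  # discrete analogue of Equation 9
print(rewards, best_policy)
```

The context-aware policy attains the highest expected reward here because it matches each context with the item that has the larger assumed mean reward.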

To address this issue, early approaches tend to predict the missing data directly (Steck, 2010), but this accentuates the problem of high bias (Wang et al., 2019; Saito and Joachims, 2021). Recently, many researchers have resorted to the propensity score $e(X)$ in causality to recover the data distribution. For example, ExpoMF (Liang et al., 2016) first predicts the exposure matrix and then uses the exposures (i.e., propensity scores) to guide the model of the interaction matrix, which is inspired by the separation between propensity scores and potential outcomes in the PO framework. Similarly, Wang et al. (Wang et al., 2018) propose SERec to integrate social exposure into collaborative filtering. A notable work is that of Wang et al. (Wang et al., 2020), who aim to overcome the confounder issue with propensity scores. They regard correlations among the interacted items as indirect evidence for confounders and propose the deconfounded recommender. They first build an exposure model to estimate the propensity score, and then use this exposure model to estimate a substitute for the unobserved confounders, conditional on which the final outcome model (specifically in (Wang et al., 2020), a rating model based on matrix factorization) is trained. In addition, inspired by (Joachims et al., 2017; Fang et al., 2019), Chen et al. (Chen et al., 2021b) propose IOBM (Interactional Observation-Based Model) to estimate propensity scores in interaction settings, which learns low-dimensional embeddings as a substitute for unobservable confounders. Specifically, it learns individual embeddings to capture the potential outcome information from specific exposure events. Based on the individual embeddings, the interactional embeddings, which uncover the hidden relationships among single exposure events and utilize query context information to apply attention, are learned through a bidirectional LSTM model.
Recently, the incorporation of Contrastive Learning (CL) (Yu et al., 2023b; Zhou et al., 2021) with propensity scores has offered new avenues to address noisy data in recommendation systems. A prominent example is the CCL (Contrastive Causal Learning) framework (Zhou et al., 2023), which innovatively employs propensity score-based sampling to generate informative positive pairs for contrastive learning tasks.

Propensity-based methods can be further divided into approaches based on inverse propensity score (IPS) and approaches based on doubly robust (DR) (Fig.7). One of the greatest strengths of applying propensity-based methods in RS is that most of them are unbiased and model-agnostic, simply deployed on the objective function for policy evaluation directly or for policy learning indirectly.

4.1.1. Missing Not At Random

In this part, we introduce the phenomenon of missing not at random and its causes, explaining the challenges in recommender systems in causal language so that existing work can be better understood.

Recommendation algorithms often rely on the missing at random (MAR) (Rubin, 1976) assumption, which may lead to biased prediction and suboptimal policy when it does not hold (Little and Rubin, 2019; Marlin and Zemel, 2009). The MAR condition essentially states that the probability that a potential outcome is missing does not depend on the value of that potential outcome, and it can be easily violated in recommender systems (Marlin and Zemel, 2009). For example, on movie rating websites, movies with high ratings are less likely to be missing compared to movies with low ratings (Pradel et al., 2012). The issue of missing not at random (MNAR) has been demonstrated by Marlin and Zemel (Marlin and Zemel, 2009), and it is a phenomenon stemming from selection bias and confounding bias (Correa et al., 2019; Wu et al., 2022).
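The effect of the movie rating example can be illustrated with a short calculation. The sketch below compares the average of the observed ratings with the true average when the observation probability is independent of the rating value and when it grows with the rating. The rating distribution and observation probabilities are assumed values chosen only for illustration.

```python
# Assumed population: rating value -> share of all user-item pairs.
rating_share = {1: 0.4, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.1}

# Assumed observation probabilities per rating value.
p_obs_independent = {r: 0.1 for r in rating_share}             # does not depend on the rating
p_obs_mnar = {1: 0.02, 2: 0.04, 3: 0.06, 4: 0.20, 5: 0.40}     # high ratings are observed more often


def population_mean(share):
    """Mean rating over all user-item pairs, observed or not."""
    return sum(r * s for r, s in share.items())


def expected_observed_mean(share, p_obs):
    """Mean rating among the pairs that end up in the observed dataset."""
    observed_mass = {r: share[r] * p_obs[r] for r in share}
    normalizer = sum(observed_mass.values())
    return sum(r * m for r, m in observed_mass.items()) / normalizer


true_mean = population_mean(rating_share)
independent_mean = expected_observed_mean(rating_share, p_obs_independent)
mnar_mean = expected_observed_mean(rating_share, p_obs_mnar)
print(f"true: {true_mean:.2f}, independent: {independent_mean:.2f}, MNAR: {mnar_mean:.2f}")
```

When missingness is independent of the rating, the observed average agrees with the population average. Under the assumed MNAR mechanism it is inflated considerably, which is the kind of bias that the methods below try to correct.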

Selection bias, or sampling bias, is usually discussed in the prediction task and can be further classified into model selection bias and user self-selection bias (Wu et al., 2022). For example, a platform may systematically recommend pop music to younger users, who may be more active on the service regardless of genre preferences (McInerney et al., 2020); this is regarded as model selection bias (Yuan et al., 2019a; McInerney et al., 2020) and can be eliminated by random recommendation. User self-selection bias (Bareinboim and Pearl, 2012; Elwert and Winship, 2014), on the contrary, cannot be removed by randomizing recommendations (Correa et al., 2019). It is caused by preferential exclusion of samples from the data (Bareinboim and Pearl, 2012). A typical example is a song recommender system, in which users usually rate songs they like or dislike and seldom rate what they feel neutral about (Saito, 2020). Some of the most frequently discussed biases like popularity bias (Zhang et al., 2021b; Wei et al., 2021) and exposure bias (Liang et al., 2016; Wang et al., 2018) will lead to model selection bias, while conformity bias (Zhang et al., 2021b; Zheng et al., 2021) and clickbait bias (Wang et al., 2021b) fall under user self-selection bias as a result of user preference.

Confounding bias (Hernán et al., 2002; Pearl, 2009) arises from the confounder described in Section 3.1.2, which affects both the treatment and the outcome, as illustrated in Fig. 6(b). Alternatively, it can be identified when the probabilistic distribution representing the statistical association is not equivalent to the interventional distribution, i.e., $\textrm{Pr}(y\mid t)\neq\textrm{Pr}(y\mid do(t))$ (Guo et al., 2020). A notable example of confounding bias is that a system trained with historical user interactions may over-recommend items that the user used to like, while the user’s decision (i.e., outcome) is also affected by historical interactions (Wang et al., 2021a).

Both biases can lead to invalid estimates of causality from the data, and they are not mutually exclusive because selection bias does not explicitly involve causality. Many model selection biases, including popularity bias and exposure bias, are also confounding biases. As for user self-selection bias, the model in Fig. 6 (a) gives an illustration of its causal nature in which $S$ is a variable affected by both $T$ (treatment) and $Y$ (outcome), indicating entry into the data pool (Bareinboim and Pearl, 2012). Therefore, confounding bias is significantly different from user self-selection bias from the causal perspective. The former originates from common causes, whereas the latter originates from common outcomes (Elwert and Winship, 2014). The former stems from the systematic bias introduced during the treatment assignment, while the latter comes from the systematic bias during the collection of units into the sample (Correa et al., 2019).

4.1.2. Inverse Propensity Score

Inverse Propensity Score (IPS) (Horvitz and Thompson, 1952; Rosenbaum, 1987; Rosenbaum and Rubin, 1983; Little and Rubin, 2019), also known as inverse propensity weighting (IPW) or inverse probability of treatment weighting (IPTW), is one of the most popular counterfactual techniques and has inspired many causal inference methods in RS, especially for unbiased learning (Joachims et al., 2017). The propensity score is the probability of receiving the treatment given covariates $X$ , formulated as:

(10)

e_{\pi}(X)=\textrm{Pr}_{\pi}(T=1\mid X).

IPS assigns a weight $w$ to each sample:

(11)

w=\frac{t}{e(x)}+\frac{1-t}{1-e(x)},

which is the inverse probability of receiving the observed treatment or control. The unbiasedness of IPS can be proven (Rosenbaum, 1987). More specifically, for the reward estimation of a recommendation policy, IPS adjusts the distribution of the logged dataset to be consistent with the distribution that would arise if $\pi$ were tested online, formulated as:

(12)

\hat{R}_{\mathrm{IPS}}\left(\pi;\mathcal{O}_{\pi_{0}}\right):=\frac{1}{|\mathcal{O}_{\pi_{0}}|}\sum_{k=1}^{|\mathcal{O}_{\pi_{0}}|}\frac{e_{\pi}(X)}{e_{\pi_{0}}(X)}\cdot y_{k}=\frac{1}{|\mathcal{O}_{\pi_{0}}|}\sum_{k=1}^{|\mathcal{O}_{\pi_{0}}|}\frac{\textrm{Pr}_{\pi}(T=1\mid X)}{\textrm{Pr}_{\pi_{0}}(T=1\mid X)}\cdot y_{k},

where we assume that only positive feedback is taken into account, and $w=\frac{e_{\pi}(X)}{e_{\pi_{0}}(X)}$ is the ratio between the propensity scores under the evaluation policy and the logging policy. Note that in most applications in RS, IPS is model-agnostic, applied to the training objective function for policy evaluation directly or for policy learning indirectly.
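The reweighting in Equation 12 can be sketched on a toy log. The example assumes a single context with two candidate items, known propensities under both policies, and a log whose empirical frequencies match the logging policy exactly; the items, propensities, and click counts are made up for illustration.

```python
def naive_estimate(logs):
    """Plain average of logged rewards; it reflects the logging policy, not the target one."""
    return sum(y for _, _, y in logs) / len(logs)


def ips_estimate(logs):
    """Equation 12: average of rewards weighted by target propensity / logging propensity."""
    return sum(p_target / p_logging * y for p_target, p_logging, y in logs) / len(logs)


# Each record: (propensity under target policy, propensity under logging policy, click).
# Item 1 is shown 80 times with 8 clicks, item 2 is shown 20 times with 10 clicks.
logs = (
    [(0.2, 0.8, 1)] * 8 + [(0.2, 0.8, 0)] * 72
    + [(0.8, 0.2, 1)] * 10 + [(0.8, 0.2, 0)] * 10
)

naive = naive_estimate(logs)
ips = ips_estimate(logs)
# Reward the target policy would obtain if the per-item click rates were 0.1 and 0.5.
target_reward = 0.2 * 0.1 + 0.8 * 0.5
print(f"naive: {naive:.3f}, IPS: {ips:.3f}, target: {target_reward:.3f}")
```

The naive average only recovers the reward of the logging policy, while the IPS estimate matches the reward of the target policy on this idealized log.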

Table 1. Summary of propensity score strategies for recommendation.

Category	Model	Causal method	Backbone model	Issue of concern	Year
Approach Inspired by Propensity Score	ExpoMF (Liang et al., 2016)	Propensity score	MF	Exposure bias	2016
	SERec (Wang et al., 2018)	Propensity score	MF	Social recommendation	2018
	Dcf (Wang et al., 2020)	Propensity score	MF	Unobserved confounding bias	2020
	CNFI (Zhang et al., 2021g)	Propensity score	MF	Implicit feedback	2021
	IOBM (Chen et al., 2021b)	Propensity score	Bi-LSTM (Graves et al., 2013)	Interactional observation bias	2021
	CCL (Zhou et al., 2023)	Propensity score	(custom-designed)	Unobserved confounding bias	2023
Approach with Inverse Propensity Score (IPS)	MF-IPS (Schnabel et al., 2016)	IPS, SNIPS	MF	Selection bias	2016
	PBM (Joachims et al., 2017)	IPS	SVM-Rank (Joachims, 2002, 2006)	Position bias	2017
	PieceNCIS, PointNCIS (Gilotte et al., 2018)	CIPS, SNIPS	-	Offline A/B testing	2018
	(Mehrotra et al., 2018)	IPS	(reinforcement learning)	Fairness	2018
	Multi-IPW (Zhang et al., 2020)	IPS	Multi-task MLP	Selection bias	2019
	CPBM (Fang et al., 2019)	IPS	SVM-Rank	Selection bias	2019
	ULRMF, ULBPR (Sato et al., 2019)	IPS, SNIPS, ATE	MF	Uplift	2019
	DLCE (Sato et al., 2020)	CIPS	MF	Unobserved confounding bias	2020
	Rel-MF (Saito et al., 2020)	CIPS	MF	Unobserved confounding bias	2020
	(Christakopoulou et al., 2020)	IPS	Multi-task DNN	Observed confounding bias	2020
	RIPS (McInerney et al., 2020)	RIPS	(model-agnostic)	Slate recommendation	2020
	ACL- (Xu et al., 2020)	IPS	(adversarial learning)	Identifiability	2020
	UR-IPW (Zhang et al., 2021e)	SNIPS	Multi-task MLP	Post-click revisit effect & selection bias	2021
	(Li et al., 2021b)	IPS	(model-agnostic)	Domain bias	2021
	CBDF (Zhang et al., 2021c)	IPS	(reinforcement learning)	Delayed feedback	2021
	RD&BRD (Ding et al., 2022b)	IPS/DR/ AutoDebias (Chen et al., 2021a)	MF	Unobserved confounding bias	2022
	CET (Cai et al., 2022)	IPS	BERT	False negative	2022
	CAFL (Krauth et al., 2022)	IPS	MF	Feedback loop	2022
	RIIPS (Liu et al., 2022b)	RIIPS	Two-tower structure	Selection bias	2022
	DENC (Li et al., 2023a)	IPS	(custom-designed)	Selection bias	2023
Approach with Doubly Robust	Propensity-free DR (Yuan et al., 2019a)	DR	FFM (Yuan et al., 2019b)	Selection bias	2019
	DR-JL (Wang et al., 2019)	DR	MF	Selection bias	2019
	Multi-DR (Zhang et al., 2020)	DR	Multi-task MLP	Selection bias	2020
	MRDR-DL (Guo et al., 2021)	MRDR	MF	Selection bias	2021
	Cascade-DR (Kiyohara et al., 2022)	Cascade-DR	MF	High variance of RIPS (McInerney et al., 2020)	2022
	ASPIRE (Mondal et al., 2022)	DR, ATE	LightGBM (Ke et al., 2017)	Uplift	2022
	DRIB (Xiao and Wang, 2022)	DR	MF	Unobserved confounding bias	2022
	DR-BIAS, DR-MSE (Dai et al., 2022)	DR	FM	Selection bias	2022
	CDR (Song et al., 2023a)	DR	MF	Selection bias	2023
	CF-MTL (Li et al., 2023b)	CATE, IPS, DR	(custom-designed)	Personalized incentive policy	2023

Much IPS-based recommendation work focuses on debiasing user interaction data, mainly selection bias (Schnabel et al., 2016; Saito et al., 2020; Sato et al., 2020; Zhang et al., 2021e; Sato, 2021; Zhang et al., 2021g; Wu et al., 2021; Li et al., 2023a). For example, (Schnabel et al., 2016) is a representative work adopting IPS in recommender systems to eliminate selection bias, in which the recommendation algorithm is based on matrix factorization and propensity scores are estimated via naive Bayes or logistic regression. Similarly, Saito et al. (Saito et al., 2020) estimate the exposure propensity for each user-item pair, and Sato et al. (Sato et al., 2020) propose the DLCE (Debiased Learning for the Causal Effect) model with IPS-based estimators to evaluate unbiased ranking uplift. Unbiased IPS-based uplift is also studied in (Sato et al., 2019). In addition, (Zhang et al., 2021e) proposes UR-IPW (User Retention Modeling with Inverse Propensity Weighting) to estimate the revisit rate while accounting for the selection bias problem, and (Li et al., 2021b) adjusts domain weights based on IPS to reduce domain bias. Though IPS-based methods do not require an explicit analysis of the causal relationships between variables, some works (Christakopoulou et al., 2020; McInerney et al., 2020; Ding et al., 2022b) still discuss causal graphs as a useful guide to accurate modeling. For example, Ding et al. (Ding et al., 2022b) leverage a causal graph to explain the risk that unmeasured confounders pose to the accuracy of propensity estimation and propose RD (Robust Deconfounder) with sensitivity analysis, obtaining bounds on the propensity score to enhance the robustness of methods against unmeasured confounders. Li et al. (Li et al., 2023a) construct DENC (De-bias Network Confounding in Recommendation). This causal graph-based recommendation framework disentangles three determinants of the outcomes, namely inherent factors, the social network-based confounder, and exposure, and estimates each of them with a specific component. In addition, some works (Christakopoulou et al., 2020; Cai et al., 2022; Zhang et al., 2021e) integrate multi-task models with IPS to learn propensity scores and user interactions simultaneously.

In addition to debiasing, some IPS-based methods are dedicated to addressing other issues that abound in RS (Mehrotra et al., 2018; Zhang et al., 2021c; Krauth et al., 2022). For example, Mehrotra et al. (Mehrotra et al., 2018) propose an unbiased estimator of user satisfaction based on IPS to jointly optimize for supplier fairness and consumer relevance. Besides, the CBDF (Counterfactual Bandit with Delayed Feedback) algorithm (Zhang et al., 2021c) re-weights the observed feedback with importance sampling weights, which are determined by a survival model, to deal with delayed feedback. CAFL (causal adjustment for feedback loops) (Krauth et al., 2022) extends the IPS estimator to break feedback loops.

Despite the unbiasedness of IPS, the inaccurate estimation of the unknown propensity $e(x)$ or sample weight, which results in high variance (Gilotte et al., 2018), becomes the biggest obstacle to realizing this advantage in practice. To alleviate this problem, modified versions of IPS have been proposed to control variance and applied to RS, including Self Normalized IPS (Schnabel et al., 2016; Zhang et al., 2021e), Clipped IPS (Saito et al., 2020; Sato et al., 2020), Reward interaction IPS (McInerney et al., 2020), and Regularized per-Item IPS (Liu et al., 2022b). Self Normalized Inverse Propensity Scoring (SNIPS) (Swaminathan and Joachims, 2015) rescales the estimate of the original IPS without any additional parameters to reduce the high variance, which is:

(13)

\hat{R}_{\mathrm{SNIPS}}\left(\pi;\mathcal{O}_{\pi_{0}}\right):=\left(\sum_{k=1}^{|\mathcal{O}_{\pi_{0}}|}\frac{e_{\pi}(X)}{e_{\pi_{0}}(X)}\right)^{-1}\sum_{k=1}^{|\mathcal{O}_{\pi_{0}}|}\frac{e_{\pi}(X)}{e_{\pi_{0}}(X)}\cdot y_{k},

and is introduced to RS by works like (Schnabel et al., 2016) and (Zhang et al., 2021e) to alleviate selection bias. Clipped IPS (CIPS) (Bottou et al., 2013; Saito et al., 2020; Sato et al., 2020), or Capped IPS, tightens the bound of the sample weight by introducing a scalar hyperparameter $\lambda_{\mathrm{CIPS}}$ , formulated as:

(14)

\hat{R}_{\mathrm{CIPS}}\left(\pi;\mathcal{O}_{\pi_{0}}\right):=\frac{1}{|\mathcal{O}_{\pi_{0}}|}\sum_{k=1}^{|\mathcal{O}_{\pi_{0}}|}\min\left\{\frac{e_{\pi}(X)}{e_{\pi_{0}}(X)},\lambda_{\mathrm{CIPS}}\right\}\cdot y_{k},

which has a lower variance but gives up unbiasedness. Expanding upon the groundwork established by NCIPS (Swaminathan and Joachims, 2015), which amalgamated SNIPS and CIPS, the study by Gilotte et al. (Gilotte et al., 2018) advances PieceNCIS and PointNCIS as enhancements that utilize contextual information to refine bias modeling. McInerney et al. (McInerney et al., 2020) loosen the SUTVA assumption and propose Reward interaction IPS (RIPS) for sequential recommendations, which assumes a causal model in which users interact with a list of items from the top to the bottom. RIPS uses iterative normalization and lookback to estimate the average reward and achieves a better bias-variance trade-off than IPS. In addition to high variance, violation of the unconfoundedness assumption is another challenge of utilizing IPS in RS. That is, the treatment mechanism is not identifiable (Glymour et al., 2016; Mohan and Pearl, 2021) from observed covariates due to the existence of unobserved ones, which leads to inaccurate estimates of the propensity score and disagreement between the online and offline evaluations. To address the uncertainty brought by the identifiability issue, (Xu et al., 2020) proposes a minimax empirical risk formulation, which can be converted to an adversarial game between two recommendation models via duality arguments and relaxations.

More recently, Liu et al. (Liu et al., 2022b) propose Regularized per-item IPS (RIIPS) with an additional penalty function that constrains the difference in recommended outcomes between the deployed system and the new system so that the explosion of propensity scores can be avoided.
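For reference, the self-normalized and clipped variants in Equations 13 and 14 can be sketched on a toy log. The records, propensities, and clipping threshold below are assumed values, and the log is idealized so that its frequencies match the logging policy exactly; the sketch does not cover RIPS or RIIPS.

```python
def weights(logs):
    """Importance weight of each record: target propensity / logging propensity."""
    return [p_target / p_logging for p_target, p_logging, _ in logs]


def snips_estimate(logs):
    """Equation 13: weighted rewards divided by the sum of the weights."""
    w = weights(logs)
    return sum(wk * y for wk, (_, _, y) in zip(w, logs)) / sum(w)


def cips_estimate(logs, clip):
    """Equation 14: weights are capped at `clip` before averaging."""
    w = weights(logs)
    return sum(min(wk, clip) * y for wk, (_, _, y) in zip(w, logs)) / len(logs)


# Each record: (propensity under target policy, propensity under logging policy, click).
logs = (
    [(0.2, 0.8, 1)] * 8 + [(0.2, 0.8, 0)] * 72
    + [(0.8, 0.2, 1)] * 10 + [(0.8, 0.2, 0)] * 10
)

snips = snips_estimate(logs)
cips = cips_estimate(logs, clip=2.0)   # the threshold 2.0 is an arbitrary illustrative choice
print(f"SNIPS: {snips:.3f}, CIPS: {cips:.3f}")
```

On this log the weights sum to the sample size, so SNIPS coincides with plain IPS (0.42), whereas clipping the large weights pulls the estimate down to 0.22, illustrating the bias that CIPS accepts in exchange for lower variance.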

4.1.3. Doubly Robust

Doubly Robust (DR) (Funk et al., 2011; Dudík et al., 2014; Jiang and Li, 2016; Wang et al., 2019) is another powerful and effective causal method to account for the MNAR issue. To understand DR, let us consider the two commonly used approaches to mitigate MNAR: the direct method (DM) (Beygelzimer and Langford, 2009) and IPS (Saito et al., 2021). The former designs a model (linear regression, deep neural network, etc.) to directly learn the missing outcomes based on the observed data, which has low variance due to the advantage of supervised learning but suffers from high bias caused by unmet IID assumptions, formulated as (Saito et al., 2021):

(15)

\hat{R}_{\mathrm{DM}}\left(\pi;\mathcal{O}_{\pi_{0}},\hat{y}\left(x_{k},t\right)\right):=\frac{1}{|\mathcal{O}_{\pi_{0}}|}\sum_{k=1}^{|\mathcal{O}_{\pi_{0}}|}\textrm{Pr}_{\pi}\left(t=1\mid x_{k}\right)\hat{y}\left(x_{k},t\right),

where $\hat{y}\left(x,t\right)$ is the estimated outcome. The latter, though theoretically unbiased, often causes training losses to oscillate because the inverse of the propensity has high variance (Thomas and Brunskill, 2016). DR combines the direct method and IPS, taking advantage of both and overcoming their limitations:

(16)

\hat{R}_{\mathrm{DR}}\left(\pi;\mathcal{O}_{\pi_{0}},\hat{y}\right):=\hat{R}_{\mathrm{DM}}\left(\pi;\mathcal{O}_{\pi_{0}},\hat{y}\left(x_{k},t\right)\right)+\frac{1}{|\mathcal{O}_{\pi_{0}}|}\sum_{k=1}^{|\mathcal{O}_{\pi_{0}}|}\frac{e_{\pi}(X)}{e_{\pi_{0}}(X)}\left(y_{k}-\hat{y}\left(x_{k},t_{k}\right)\right).

DR uses the estimated outcomes to decrease the variance of IPS. It is doubly robust in that it is a consistent estimator of the policy reward if either the propensity scores or the imputed outcomes are accurate for all user-item pairs (Wang et al., 2019; Saito et al., 2021). In addition, advanced versions like Switch-DR (Wang et al., 2017) and DRos (Doubly Robust with Optimistic Shrinkage) (Su et al., 2020) have been proposed to further control the variance.
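The double robustness property can be checked numerically with a minimal sketch. It uses the common multi-action form of the estimator, in which the direct-method term sums the imputed outcome over all candidate items; this is an assumption that generalizes the single-treatment notation above. The single-context log, the propensities, and the imputed outcomes are hypothetical values.

```python
def dm_estimate(target_policy, y_hat):
    """Direct method: expected imputed reward under the target policy."""
    return sum(target_policy[a] * y_hat[a] for a in target_policy)


def ips_estimate(logs, target_policy, propensity):
    """IPS with the supplied (possibly misspecified) logging propensities."""
    return sum(target_policy[a] / propensity[a] * y for a, y in logs) / len(logs)


def dr_estimate(logs, target_policy, propensity, y_hat):
    """DM term plus an IPS-weighted correction of the imputation residuals."""
    correction = sum(
        target_policy[a] / propensity[a] * (y - y_hat[a]) for a, y in logs
    ) / len(logs)
    return dm_estimate(target_policy, y_hat) + correction


# Toy log for one context: (shown item, click). Item i1: 80 shows, 8 clicks; i2: 20 shows, 10 clicks.
logs = [("i1", 1)] * 8 + [("i1", 0)] * 72 + [("i2", 1)] * 10 + [("i2", 0)] * 10
target_policy = {"i1": 0.2, "i2": 0.8}
true_propensity = {"i1": 0.8, "i2": 0.2}     # matches the log frequencies
wrong_propensity = {"i1": 0.5, "i2": 0.5}    # deliberately misspecified
true_outcome = {"i1": 0.1, "i2": 0.5}        # matches the logged click rates
wrong_outcome = {"i1": 0.2, "i2": 0.3}       # deliberately misspecified

dm_wrong_outcome = dm_estimate(target_policy, wrong_outcome)
ips_wrong_propensity = ips_estimate(logs, target_policy, wrong_propensity)
dr_wrong_outcome = dr_estimate(logs, target_policy, true_propensity, wrong_outcome)
dr_wrong_propensity = dr_estimate(logs, target_policy, wrong_propensity, true_outcome)
print(dm_wrong_outcome, ips_wrong_propensity, dr_wrong_outcome, dr_wrong_propensity)
```

In this idealized setting DM with a wrong outcome model and IPS with wrong propensities both miss the target reward of 0.42, while DR recovers it whenever one of the two ingredients is correct.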

Based on the above advantages, DR has been increasingly widely used in RSs (Yuan et al., 2019a; Wang et al., 2019; Zhang et al., 2020; Guo et al., 2021; Kiyohara et al., 2022; Mondal et al., 2022; Xiao and Wang, 2022; Dai et al., 2022; Song et al., 2023a; Li et al., 2023b). Wang et al. (Wang et al., 2019) utilize DR for unbiased RS prediction and further propose a joint learning approach that simultaneously learns rating prediction and propensity to guarantee a low prediction inaccuracy at inference time. Yuan et al. (Yuan et al., 2019a) propose a propensity-free doubly robust method to address the issue that samples with low propensity scores are absent in the observed dataset. Zhang et al. (Zhang et al., 2020) propose Multi-DR based on a multi-task learning framework to address selection bias and data sparsity issues in CVR estimation. Guo et al. (Guo et al., 2021) propose the MRDR (more robust doubly robust) estimator to further reduce the variance caused by inaccurate imputed outcomes in DR while retaining its double robustness. In addition, Kiyohara et al. (Kiyohara et al., 2022) extend the previous RIPS to the Cascade Doubly Robust estimator, which makes the same user interaction assumption as RIPS. Xiao and Wang (Xiao and Wang, 2022) propose an information bottleneck-based approach to effectively learn the DR estimator for the estimation of recommendation uplift, with the hope of a better trade-off between the bias and variance of propensity scores. Dai et al. (Dai et al., 2022) learn the imputation model by balancing the variance and bias of the DR loss. More recently, Song et al. (Song et al., 2023a) filter imputation data through examination of their mean and variance, in order to reduce poisonous imputations that significantly deviate from the truth and impair the debiasing performance.

4.2. Causal Effect Strategy

The most critical and fundamental role of causal inference is to estimate the causal effects from observational data, which has a variety of applications in real-world recommender systems. Some works are dedicated to estimating and enhancing the treatment effect of a recommender policy on specific customer outcomes, namely uplift (Gutierrez and Gérardy, 2017). In such scenarios, the causal effect is typically implemented as either a direct or indirect optimization goal, aiming to maximize platform benefits. Additionally, treatment effects extend to other application areas in recommender systems, serving purposes beyond uplift.

It is crucial to highlight that within the PO framework, the causal relationships between variables are not the focal point while calculating causal effect, and all variables affecting potential outcomes except treatment will be treated as covariates.

Table 2. Summary of causal effect strategies for recommendation.

Category	Model	Causal method	Backbone model	Issue of concern	Year
Causal effect for Uplift	ULRMF, ULBPR (Sato et al., 2019)	IPS, SNIPS, ATE	MF	Uplift	2019
	(Goldenberg et al., 2020)	CATE	Xgboost (Chen and Guestrin, 2016)		2020
	AUUC-max (Betlei et al., 2021)	CATE	Linear /Wide & Deep		2021
	CausCF (Xie et al., 2021)	CATE	MF		2021
	ASPIRE (Mondal et al., 2022)	DR, ATE	LightGBM (Ke et al., 2017)		2022
Causal effect beyond Uplift	(Rosenfeld et al., 2017)	ITE	Linear/regularized kernel methods	Domain adaptation	2017
	CausE (Bonner and Vasile, 2018)	ITE	MF	Domain adaptation	2018
	(Mehrotra et al., 2020)	TE	Structural state-space model (Brodersen et al., 2015)	Causal effect of a new track release	2020
	CACF (Zhang et al., 2021a)	ITE	(custom-designed)	Unobserved confounding bias	2021
	MCRec (Yao et al., 2022)	CATE	DIN (Zhou et al., 2018)	Device-cloud recommendation	2022
	LRIR (Tran et al., 2022)	ITE, ATE	(custom-designed)	Disability employment	2022

4.2.1. Causal Effect for Uplift

Uplift, denoting the causal effect of recommendations, refers to the increase in user interactions purely caused by recommendations. Typical evaluations of recommender systems regard positive user interactions as a success. However, a subset of these interactions might persist even in the absence of recommendations. This assertion is substantiated by the conclusion of Sharma et al. (Sharma et al., 2015), which indicates that more than 75% of click-throughs would still occur in the absence of recommendations. For marketing campaigns where Return on Investment (ROI) is paramount, targeting ’voluntary buyers’ — individuals who would interact with or without any recommendations — is deemed unnecessary. Therefore, the industry regards uplift as a valuable metric for recommendations in expectation of higher rewards.

It is natural to introduce causality concepts such as ATE and CATE into uplift modeling, since the definition of uplift is a counterfactual problem and is consistent with the objective of causal effect estimation (Yamane et al., 2018; Zhang et al., 2021d; Gutierrez and Gérardy, 2017). Causal approaches with traditional machine learning methods for uplift estimation include the two-model approach (Radcliffe, 2007; Nassif et al., 2013), the transformed outcome (Jaskowski and Jaroszewicz, 2012), and uplift trees (Radcliffe and Surry, 2011; Rzepakowski and Jaroszewicz, 2012). Regarding recommender systems, uplift estimation with online A/B testing suffers from high expense and large fluctuations due to user self-selection bias (Sato, 2021), while uplift estimated offline is bedeviled by a wide variety of biases that could lead to MNAR data. A number of studies have been proposed to deal with these issues. Sato et al. (Sato et al., 2019) utilize SNIPS-based ATE to accomplish offline uplift-based evaluation. Goldenberg et al. (Goldenberg et al., 2020) leverage the Retrospective Estimation technique that relies solely on data with positive outcomes for CATE-based uplift modeling, which makes it especially suited for many recommendation scenarios where only the treatment outcomes are observable. Betlei et al. (Betlei et al., 2021) learn a model that directly optimizes an upper bound on AUUC (area under the uplift curve), a popular uplift metric based on the uplift curves and unified with ATE (Yamane et al., 2018). In addition, CausCF (Xie et al., 2021) extends classical MF to tensor factorization with three dimensions (user, item, and treatment effect) for better uplift performance. CF-MTL (Li et al., 2023b) accounts for whether users actively accept the treatment, leading to a more granular classification of users, and then estimates the probability for each user type within a multi-task learning framework. 
It is worth mentioning that in the uplift modeling literature (Diemert et al., 2018; Gutierrez and Gérardy, 2017; Zhang et al., 2021d), there are two closely related metrics for uplift modeling, uplift and Qini curves, the latter of which is evaluated based on the ranking of conditional treatment effect estimations.
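As an illustration of the two-model approach mentioned above, the following minimal sketch fits one response "model" on treated users and another on control users and takes their difference as the uplift. It assumes randomized treatment assignment and uses a per-segment mean as a deliberately simple stand-in for a learned model; the segment names and log counts are hypothetical.

```python
from collections import defaultdict


def fit_mean_model(rows):
    """A deliberately simple response model: mean outcome per user segment."""
    sums, counts = defaultdict(float), defaultdict(int)
    for segment, outcome in rows:
        sums[segment] += outcome
        counts[segment] += 1
    return {s: sums[s] / counts[s] for s in sums}


def two_model_uplift(logs):
    """Uplift per segment = treated response minus control response."""
    treated = fit_mean_model((s, y) for s, t, y in logs if t == 1)
    control = fit_mean_model((s, y) for s, t, y in logs if t == 0)
    return {s: treated[s] - control[s] for s in treated if s in control}


# Hypothetical logs: (segment, recommended?, interacted?).
logs = (
    [("voluntary", 1, 1)] * 8 + [("voluntary", 1, 0)] * 2
    + [("voluntary", 0, 1)] * 8 + [("voluntary", 0, 0)] * 2
    + [("persuadable", 1, 1)] * 6 + [("persuadable", 1, 0)] * 4
    + [("persuadable", 0, 1)] * 1 + [("persuadable", 0, 0)] * 9
)
uplift = two_model_uplift(logs)
```

Here the "voluntary" segment interacts at the same rate with or without the recommendation, so its estimated uplift is zero, while the "persuadable" segment shows a positive uplift. A conventional accuracy-oriented evaluation would count both segments' interactions as successes.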

4.2.2. Causal Effect beyond Uplift

Other notable recommendation works also make use of causal effects (Mehrotra et al., 2020; Zhang et al., 2021a; Rosenfeld et al., 2017; Bonner and Vasile, 2018; Yao et al., 2022; Tran et al., 2022). For example, Mehrotra et al. (Mehrotra et al., 2020) adapt a Bayesian model to infer the causal impact of new track releases, which may be an essential consideration in the design of music recommendation platforms. Zhang et al. (Zhang et al., 2021a) minimize the distance between the traditional attention weights in the recommendation method and the ITE, so that the weights reflect the true impact of the features on the interactions. Rosenfeld et al. (Rosenfeld et al., 2017) and Bonner and Vasile (Bonner and Vasile, 2018) frame causal inference as a domain adaptation problem and leverage ITE with a large sample of biased data and a small sample of unbiased data to eliminate bias problems, which are described in more detail in Section 6.1.

4.3. Why Potential Outcomes Framework?

The PO framework has maintained its popularity in the realm of recommender systems since its inception due to its close association with A/B testing. Online A/B testing evaluates the performance of two different recommender policies through randomized experiments. Specifically, a user pool on the platform is randomly divided into a treatment group and a control group, with each group being exposed to one of the policies (Gilotte et al., 2018). Upon completion of the experiment, metrics such as revenue and click-through rates are compared to determine the policy to be adopted for future use. As illustrated in Fig. 8, the efficacy of A/B testing stems from the ideal randomized controlled trial (RCT) that disables all the confounders simultaneously affecting the treatment and the outcomes, thereby leading to a pure assessment of the policy’s treatment effect on potential outcomes (Pearl, 2009). In practice, however, A/B testing often fails to achieve ideal randomization due to issues such as insufficient sample sizes leading to distributions that do not match the overall population. In such cases, methods like IPS from the PO framework can adjust sample weights, thus mitigating selection biases.

Due to the time-intensive and costly nature of online A/B testing, offline A/B testing serves as a more expedient and cost-effective approach to estimate the efficacy of recommended policies. Offline data are accumulated using the current recommendation system, referred to as the logging policy; hence, we cannot use the direct estimation of the target policy on offline data since it was not collected under the conditions of the target policy. Instead, it is necessary to re-weight the importance of samples to align with the data distribution that would be expected under the target policy. Moreover, in the context of marketing campaigns, beyond accurately estimating the effect of a policy, it is crucial to calculate each user’s uplift to precisely identify the target users. CATE-based uplift modeling is adept at distinguishing between voluntary buyers and the persuadables—who would only interact in reaction to an incentive—thus fulfilling marketing objectives.
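The re-weighting idea behind offline policy evaluation can be sketched as follows. This is a minimal illustration of IPS and its self-normalized variant (SNIPS), assuming the logging probabilities are recorded and nonzero for every logged action; the log format, the policy functions, and the toy rewards are hypothetical.

```python
def importance_weights(logs, target_prob):
    """Ratio of target-policy to logging-policy probability for each logged action."""
    return [target_prob(x, a) / p for x, a, _, p in logs]


def ips_value(logs, target_prob):
    """IPS estimate of the target policy's average reward."""
    weights = importance_weights(logs, target_prob)
    return sum(w * r for w, (_, _, r, _) in zip(weights, logs)) / len(logs)


def snips_value(logs, target_prob):
    """Self-normalized IPS: divide by the sum of weights instead of the sample size."""
    weights = importance_weights(logs, target_prob)
    return sum(w * r for w, (_, _, r, _) in zip(weights, logs)) / sum(weights)


# Hypothetical logs: (context, shown item, reward, logging probability).
# The logging policy shows items "a" and "b" uniformly at random.
logs = (
    [("user", "a", r, 0.5) for r in (1, 1, 1, 0)]
    + [("user", "b", r, 0.5) for r in (0, 0, 1, 0)]
)


def always_a(x, a):
    """Target policy that always recommends item "a"."""
    return 1.0 if a == "a" else 0.0


def uniform(x, a):
    """Target policy identical to the logging policy."""
    return 0.5


value_always_a = ips_value(logs, always_a)
value_always_a_snips = snips_value(logs, always_a)
value_logging = ips_value(logs, uniform)
```

When the target policy equals the logging policy, all weights are one and the estimate reduces to the plain average reward; for the "always a" policy, only the logs in which item "a" was shown contribute.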

The policy evaluation methods mentioned above can also be transformed into the optimization objectives (e.g., the loss function) of recommendation algorithms, thereby aligning the model’s optimization targets with evaluation goals.

Policy evaluation is fundamentally crucial for several reasons: (1) its sustained significance since the inception of recommender systems; (2) its stability amid the continual shifts in recommendation algorithm technology; and (3) its archetypal alignment with the PO framework as a typical counterfactual question, where the true outcome of a unit under an alternative treatment can never be observed. Therefore, the PO framework occupies a prominent place in recommender systems.

5. SCM-based Methods

Unlike the PO framework, the Structural Causal Model (SCM) explicitly expresses the causal relationships between variables on a causal graph, constructed from prior experience, before analyzing the causal effect. Its intuitive nature has made it popular among researchers in computer science. In this section, the corresponding strategies are classified according to their causal structures, i.e., collider, mediator, and confounder. We focus on how researchers abstract recommendation issues into causal problems with causal graphs and exploit tools from causal inference to cope with them.

5.1. Causal Recommendation with Collider Structure

As represented in Fig. 4(c), a collider is a node that receives effects from two or more other factors. Colliders exist in recommender systems. For instance, item positions in the ranking list are influenced by both user preference and item popularity.

Analyzing the dependency between variables in collider structures facilitates their use in recommender systems. Although $A$ and $B$ are independent, i.e., for all $a$ and $b$ , $\textrm{Pr}(A=a|B=b)=\textrm{Pr}(A=a)$ , conditioning on the collider node $C$ produces a dependence between the node’s parents, i.e., for some $a,b,c$ , $\textrm{Pr}(A=a|B=b,C=c)\neq\textrm{Pr}(A=a|C=c)$ . To understand the point, let us consider the most basic example where $C=A+B$ , and $A$ and $B$ are independent variables (Glymour et al., 2016). In this case, given $C=10$ , knowing $A=3$ means we can immediately calculate that $B=7$ . Thus, $A$ and $B$ are dependent, given that $C=10$ . This characteristic suggests that, in RS issues with a collider structure, knowing the common effect and one of the causes provides information about the other cause (Zhang et al., 2022).
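The collider effect can be verified by exact enumeration. The sketch below is a hypothetical variant of the $C=A+B$ example in which $A$ and $B$ are two independent fair dice (an assumption made only so that the joint distribution is finite and uniform).

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice A and B; the collider is C = A + B.
outcomes = [(a, b, a + b) for a, b in product(range(1, 7), repeat=2)]


def prob(event, given=lambda a, b, c: True):
    """Exact conditional probability Pr(event | given) under the uniform joint."""
    cond = [o for o in outcomes if given(*o)]
    return Fraction(sum(1 for o in cond if event(*o)), len(cond))


def a_is_3(a, b, c):
    return a == 3


# Marginally, B carries no information about A.
p_a = prob(a_is_3)
p_a_given_b = prob(a_is_3, lambda a, b, c: b == 4)

# Conditioning on the collider C makes A and B dependent.
p_a_given_c = prob(a_is_3, lambda a, b, c: c == 7)
p_a_given_bc = prob(a_is_3, lambda a, b, c: b == 4 and c == 7)
```

Without conditioning on $C$ , the probability of $A=3$ is unchanged by learning $B$ . Once $C=7$ is known, additionally learning $B=4$ determines $A$ exactly.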

Though collider structures permeate RSs, they are usually entangled with other causal relationships and treated as other causal structures, so little literature discusses colliders alone. A representative work is DICE (Zheng et al., 2021), which tackles the popularity issue from the user’s perspective instead of eliminating popularity bias from the item’s perspective. Zheng et al. argue that users’ interactions are driven by individual interest as well as by conformity, which is independent of user interest and describes how users tend to follow other people, and provide a causal graph as shown in Fig. 9 (a). From this point of view, DICE splits user and item embeddings into interest and conformity embeddings, respectively, and learns disentangled representations with conformity-specific and interest-specific data, driven by the colliding effect: if a user interacts with a less popular item, not conforming to the mainstream, it usually indicates that the user is highly interested in the item itself, and vice versa. Further, Ding et al. (Ding et al., 2022a) propose CIGC (Causal Incremental Graph Convolution), which includes a new operator named CED (Colliding Effect Distillation), to efficiently retrain graph convolution network (GCN) based recommender models. CED frames the whole incremental training phase as a causal graph (see Fig. 9 (b)) and creates a collider $S_{t}$ between inactive nodes $R_{In,t}$ and new data $R_{Ac,t}$ , which is represented as the pair-wise distance. Therefore, the incremental integration data $I_{t}$ can update both $R_{Ac,t}$ and $R_{In,t}$ , since conditioning on the collider $S_{t}$ opens the path $I_{t}\rightarrow R_{Ac,t}\leftrightarrow R_{In,t}$ .

5.2. Causal Recommendation with Mediator Structure

When one variable causes another, it may not do it directly but through a set of mediating variables instead. For example, an item purchased by your friends increases your purchase probability not only directly through the recommendation that integrates social network, but also indirectly through increased trust in the item.

The distinction between direct and indirect effects of the change of treatment on outcome is key to the utilization of the mediator structure, which can be done by conditioning on the mediating variable traditionally (Glymour et al., 2016). Specifically, as illustrated in Fig. 10 (a), the total effect (ToE) of $I=i$ on $Y$ is defined as:

(17)

ToE=Y(I=i,K(I=i))-Y(I=i^{*},K(I=i^{*})),

where $I=i^{*}$ refers to the counterfactual situation in which the value of $I$ differs from reality. The total effect can be further decomposed into the natural direct effect (NDE) and the total indirect effect (TIE). NDE reflects the effect of $I$ on $Y$ through the direct path, i.e., $I\rightarrow Y$ , while $K$ is set to the value it takes when $I=i^{*}$ :

(18)

NDE=Y(I=i,K(I=i^{*}))-Y(I=i^{*},K(I=i^{*})).

TIE is defined as the difference between ToE and NDE, denoted as:

(19)

TIE=ToE-NDE=Y(I=i,K(I=i))-Y(I=i,K(I=i^{*})),

which represents the effect of $I$ on $Y$ through the indirect path $I\rightarrow K\rightarrow Y$ . ToE can also be decomposed into the natural indirect effect (NIE) and the total direct effect (TDE). NIE represents the effect of $I$ on $Y$ through the mediator, i.e., $I\rightarrow K\rightarrow Y$ , while the direct effect on $I\rightarrow Y$ is blocked by setting $I$ to $i^{*}$ , denoted as:

(20)

NIE=Y(I=i^{*},K(I=i))-Y(I=i^{*},K(I=i^{*})).

In linear systems, NIE and TIE have the same value, and NDE and TDE have the same value (Glymour et al., 2016; Pearl, 2022).
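The decompositions in Eqs. (17)-(20) can be evaluated directly on a toy structural model. The sketch below is an illustration under assumed deterministic structural functions `y(i, m)` and `k(i)` with hypothetical coefficients; TDE is computed as $ToE-NIE$ , which follows from the second decomposition. It contrasts a linear outcome with one that contains a treatment-mediator interaction.

```python
def effects(y, k, i, i_star):
    """Return ToE, NDE, TIE, NIE, and TDE for structural functions y(i, m) and k(i)."""
    toe = y(i, k(i)) - y(i_star, k(i_star))
    nde = y(i, k(i_star)) - y(i_star, k(i_star))
    tie = y(i, k(i)) - y(i, k(i_star))            # ToE - NDE
    nie = y(i_star, k(i)) - y(i_star, k(i_star))
    tde = y(i, k(i)) - y(i_star, k(i))            # ToE - NIE
    return {"ToE": toe, "NDE": nde, "TIE": tie, "NIE": nie, "TDE": tde}


def mediator(i):
    """Hypothetical mediator equation K = 2 * I."""
    return 2 * i


def linear_outcome(i, m):
    """Hypothetical linear outcome Y = 3 * I + 4 * K."""
    return 3 * i + 4 * m


def interaction_outcome(i, m):
    """Same outcome with an added I * K interaction term."""
    return 3 * i + 4 * m + 5 * i * m


linear = effects(linear_outcome, mediator, i=1, i_star=0)
nonlinear = effects(interaction_outcome, mediator, i=1, i_star=0)
```

In the linear case, NIE equals TIE and NDE equals TDE, as stated above. With the interaction term, the two decompositions still sum to ToE, but NIE and TIE (and likewise NDE and TDE) no longer coincide.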

However, if there are confounders of the mediator and the outcome, as in the case of (Wei et al., 2021) shown in Fig. 10 (b), conditioning on the mediator means conditioning on a collider, and thus a spurious dependence will pass through the confounder to the outcome and distort the calculation of the indirect effect. To tackle the problem, we should intervene on the mediator, which involves counterfactuals. The controlled direct effect (CDE) of $I$ on $Y$ is defined as:

(21)

CDE=Y(do(I=i),do(K=k))-Y(do(I=i^{*}),do(K=k)).

The difference between NDE and CDE is explained in (Glymour et al., 2016).

Table 3. Summary of recommendation models with collider structure and mediator structure.

Category	Model	Causal method	Backbone model	Issue of concern	Year
Causal recommendation with collider structure	DICE (Zheng et al., 2021)	(causal view)	MF(multi-task)	Popularity bias	2021
Causal recommendation with collider structure	CIGC (Ding et al., 2022a)	Intervention on the cause factor	LightGCN (He et al., 2020)	GCN model retraining	2022
Causal recommendation with mediator structure	(Choi et al., 2011)	Mediation analysis	-	Effect of social presence	2011
	(Luo et al., 2013)	Mediation analysis	-	Effect of informational factors	2013
	CMA (Yin and Hong, 2019)	NDE, TIE	-	Effect of induced change	2019
	MACR (Wei et al., 2021)	TIE	(model-agnostic, multi-task)	Popularity bias	2021
	CIRS (Gao et al., 2022a)	Intervention on the mediator	PPO (Schulman et al., 2017)	Filter bubble (Pariser, 2011)	2022
	CCF (Xu et al., 2023a)	Intervention on the mediator, counterfactuals	NCF (He et al., 2017), GRU4Rec (Hidasi et al., 2015), etc.	Historical bias	2023

Some works are generally interested in how much of the treatment’s causal effect on variable $Y$ is direct and how much is indirect, which is usually explored with the technique of mediation analysis (Kenny, 1979; Baron and Kenny, 1986), similar to SCM but without exogenous variables and the introduction of counterfactuals. For example, in early studies, Choi et al. (Choi et al., 2011) conduct an experiment varying the level of social presence over hundreds of testers and examine the effect of social presence on users’ reuse intention and trust through mediation analysis. A similar structure is used to evaluate how electronic word-of-mouth affects user interactions in (Luo et al., 2013). Further, Yin et al. (Yin and Hong, 2019) aim to separate the direct effect of changes in user behavior in the tested product from the effect of changes in user behavior in other products, known as induced changes. For example, during the A/B test of a new version of a recommendation module, a significant lift in CTR on the recommendation list and a significant decrease in CTR on organic search results may together yield only an insignificant lift in the sitewide CVR. To this end, they use causal mediation analysis (CMA) under the potential outcome framework to estimate the causal effects of the induced changes and, with the help of a directed acyclic graph, also discuss estimation when multiple unmeasured, causally dependent mediators exist.

Some other works utilize Pearl’s counterfactual tools to cope with the mediator structure in order to improve accuracy (Wei et al., 2021; Xu et al., 2023a; Gao et al., 2022a). Wei et al. (Wei et al., 2021) explore the popularity issue with the SCM framework and formulate the causal graph shown in Fig. 10 (b), in which the probability of interaction $Y$ is influenced by three main factors: user-item matching ( $K(U,I)\rightarrow Y$ ), item popularity ( $I\rightarrow Y$ ), and user conformity ( $U\rightarrow Y$ ). The last two are usually ignored by existing models, which results in a severe Matthew effect. Following this causal graph, Wei et al. propose MACR (Model-Agnostic Counterfactual Reasoning), a multi-task framework that consists of three modules to jointly learn the effects of $U\rightarrow Y$ , $U\&I\rightarrow K\rightarrow Y$ , and $I\rightarrow Y$ , respectively, during recommender training, and estimates the TIE of $I$ on $Y$ in counterfactual inference:

(22)

\begin{aligned}
TIE&=ToE-NDE\\
&=Y(U=u,I=i,K=K(U=u,I=i))-Y(U=u,I=i,do(K=K(U=u^{*},I=i^{*})))\\
&=Y_{k}(K(U=u,I=i))Y_{u}(U=u)Y_{i}(I=i)-Y_{k}(K(U=u^{*},I=i^{*}))Y_{u}(U=u)Y_{i}(I=i)\\
&=\hat{y}_{k}\sigma\left(\hat{y}_{i}\right)\sigma\left(\hat{y}_{u}\right)-c\sigma\left(\hat{y}_{i}\right)\sigma\left(\hat{y}_{u}\right),
\end{aligned}

where $\sigma(\cdot)$ denotes the sigmoid function, and $c$ is a hyper-parameter that represents $Y_{k}(K(U=u^{*},I=i^{*}))$ , the reference situation of $Y_{k}(K(U=u,I=i))$ . With counterfactual inference, MACR can rank items without popularity bias by removing the direct effect of item properties on the ranking score.
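The last line of Eq. (22) can be sketched as a scoring function. The example below is only an illustration of that formula, not the authors’ implementation; the module outputs (`y_k`, `y_i`, `y_u`) and the values of `c` are hypothetical numbers chosen to show how the reference value can change a ranking.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tie_score(y_k, y_i, y_u, c):
    """Last line of Eq. (22): y_k*s(y_i)*s(y_u) - c*s(y_i)*s(y_u)."""
    gate = sigmoid(y_i) * sigmoid(y_u)
    return y_k * gate - c * gate


# Hypothetical module outputs for one user (y_u = 0) and two items.
popular = {"y_k": 0.5, "y_i": 2.0}   # weaker match, high popularity logit
niche = {"y_k": 0.9, "y_i": -1.0}    # stronger match, low popularity logit

# With c = 0 the score is the plain product, so popularity dominates.
popular_plain = tie_score(popular["y_k"], popular["y_i"], 0.0, c=0.0)
niche_plain = tie_score(niche["y_k"], niche["y_i"], 0.0, c=0.0)

# Subtracting the reference term (c = 0.4, an arbitrary choice) reorders the items.
popular_tie = tie_score(popular["y_k"], popular["y_i"], 0.0, c=0.4)
niche_tie = tie_score(niche["y_k"], niche["y_i"], 0.0, c=0.4)
```

With these numbers the popular item ranks first under the plain product, while the niche item with the stronger matching score ranks first once the reference term is subtracted.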

The work by Xu et al. (Xu et al., 2023a) regards the user interaction history $H$ as a mediator (Fig. 11 (a)) and proposes CCF (Causal Collaborative Filtering) to estimate $\textrm{Pr}(Y=y|U=u,do(I=i))$ , where $u,i$ is a user-item pair and $y$ is the preference score for the pair. More specifically, $H=f_{h}(U=u)$ is a database retrieval operation that returns a user’s interaction history from the observational data, $I=f_{0}(U=u,H=h)$ means the recommended item $I$ returned from the already deployed recommendation system based on the user and the user’s interaction history, and $Y=f(U=u,I=i)$ represents the estimation of unbiased user preference on the item. $\textrm{Pr}(Y=y|U=u,do(I=i))$ adopts the conditional intervention to consider both observed and unobserved (counterfactual) interaction history, as presented in Fig. 11 (b). The derivation result of $\textrm{Pr}(Y=y|U=u,do(I=i))$ is given:

(23)

\begin{aligned}
\textrm{Pr}(Y=y\mid U=u,do(I=i))&\approx\textrm{Pr}(Y=y\mid U=u,do(I=f_{0}(U=u,H=h)))\\
&=\sum_{h}\textrm{Pr}(y\mid u,h,f_{0}(u,h))\textrm{Pr}(h\mid u).
\end{aligned}

Note that if trained only with the observed history $h$ , $f(U=u,I=i)$ would naturally degenerate to the original recommendation model $f_{0}(U=u,H=h)$ . Therefore, Xu et al. adopt a heuristic-based approach to generate counterfactual histories $h^{\prime}$ .

5.3. Causal Recommendation with Confounder Structure

There is a large volume of published studies investigating the confounding structures in recommendation since a lot of data biases widespread in recommender systems are, essentially, confounding biases mentioned in Section 4.1.1. Approaches to tackle confounder structures of existing literature can be categorized into four types: with back-door adjustment, with instrumental variables, with front-door adjustment, and with deep learning based intervention.

5.3.1. The Back-door-based Approach

Before introducing the back-door adjustment approaches, let us briefly review the definitions of back-door path and back-door criterion (Imbens and Rubin, 2015).

Definition 0 (Back-door Path).

Given a pair of treatment $T$ and outcome variable $Y$ , a path connecting $T$ and $Y$ is a back-door path for $(T,Y)$ if it satisfies that

(1)

it is not a directed path (it contains an arrow pointing into $T$ ); and
(2)

it is not blocked (it has no collider).

Back-door paths help us to identify confounders, each of which is the central node of a fork on a back-door path of $(T,Y)$ . The following two examples illustrate this (Pearl and Mackenzie, 2018). In Fig. 12 (a), there is one back-door path from $T$ to $Y$ , $T\leftarrow A\rightarrow Y$ , indicating that $A$ is the confounder. To estimate the effect of $T$ on $Y$ , we should eliminate the confounding bias by either controlling for $A$ to block the back-door path or running a randomized controlled experiment. Note that $T\rightarrow B\leftarrow A\rightarrow Y$ is blocked by the collider at $B$ and, therefore, is not a back-door path. In Fig. 12 (b), we can control for $C$ to close the back-door path $T\leftarrow B\leftarrow C\rightarrow Y$ . Here we present the formal definition of the back-door criterion to deal with the confounding effects.

Definition 0 (Back-door Criterion).

Given a pair of treatment $T$ and outcome variable $Y$ , a set of variables $X$ satisfies the back-door criterion if no variable in $X$ is a descendant of $T$ and $X$ blocks all back-door paths of $(T,Y)$ .

Based on the Back-door Criterion, we can further derive the Back-door Adjustment Theorem, which adjusts fewer variables compared to the Causal Effect Rule (Definition 3.4).

Definition 0 (Back-door Adjustment).

If a set of variables $X$ satisfies the back-door criterion for $T$ and $Y$ , the causal effect of $T$ on $Y$ is identifiable and given by the formula:

(24)

\textrm{Pr}(Y=y\mid do(T=t))=\sum_{x}\textrm{Pr}(Y=y\mid T=t,X=x)\textrm{Pr}(X=x).

To see what this means in practice, let us look at a concrete example, as presented in Fig. 13. Suppose we need to evaluate the effect of a newly deployed recommendation strategy ( $T$ ) on users’ click behavior ( $Y$ ) on an online shopping platform. However, the time-varying consuming desire ( $A$ ) makes it difficult to compare the effect with that of the existing strategy. For example, users might be more willing to spend due to the proximity of holidays, resulting in a seemingly better recommendation effect of the tested policy. The consuming desire is unmeasurable and thus cannot be adjusted for directly. Instead, we could control for an observed variable, the number of recent interactions $B$ , that satisfies the back-door criterion for $(T,Y)$ . Therefore, adjusting for $B$ to block the back-door path $T\leftarrow A\rightarrow B\rightarrow Y$ will give us the true causal effect of recommendation $T$ on click $Y$ , formulated as:

(25)

\textrm{Pr}(Y=y\mid do(T=t))=\sum_{b}\textrm{Pr}(Y=y\mid T=t,B=b)\textrm{Pr}(B=b).
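The adjustment in this example can be checked on a small discrete model. The sketch below assumes a hypothetical binary SCM with the structure $A\rightarrow T$ , $A\rightarrow B\rightarrow Y$ , and $T\rightarrow Y$ , where $A$ is hidden; all conditional probabilities are made-up numbers. It compares the naive conditional probability, the back-door formula computed from the observational distribution over $(T,B,Y)$ only, and the ground truth obtained by intervening in the SCM.

```python
from itertools import product


def bern(p, v):
    """Probability that a Bernoulli(p) variable takes value v."""
    return p if v == 1 else 1.0 - p


# Hypothetical mechanisms; A is the unmeasured consuming desire.
def p_t(a):
    return 0.8 if a else 0.3


def p_b(a):
    return 0.9 if a else 0.2


def p_y(t, b):
    return 0.1 + 0.3 * t + 0.4 * b


# Observational joint over the measured variables (T, B, Y); A is summed out.
joint = {}
for a, t, b, y in product([0, 1], repeat=4):
    w = bern(0.5, a) * bern(p_t(a), t) * bern(p_b(a), b) * bern(p_y(t, b), y)
    joint[(t, b, y)] = joint.get((t, b, y), 0.0) + w


def prob(**fixed):
    """Marginal probability of the given assignments, e.g. prob(t=1, b=0)."""
    names = ("t", "b", "y")
    return sum(w for key, w in joint.items()
               if all(key[names.index(n)] == v for n, v in fixed.items()))


def naive(t):
    """Pr(Y=1 | T=t), which mixes causal and confounded association."""
    return prob(t=t, y=1) / prob(t=t)


def backdoor(t):
    """Back-door adjustment over B using observational quantities only."""
    return sum(prob(t=t, b=b, y=1) / prob(t=t, b=b) * prob(b=b) for b in (0, 1))


def truth(t):
    """Ground truth Pr(Y=1 | do(T=t)) computed from the SCM itself."""
    return sum(bern(0.5, a) * bern(p_b(a), b) * p_y(t, b)
               for a, b in product([0, 1], repeat=2))
```

In this toy model, `backdoor(t)` matches `truth(t)` for both treatment values, whereas the naive contrast `naive(1) - naive(0)` overstates the effect because users with high consuming desire are both more likely to be treated and more likely to click.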

Some literature on recommendation issues with confounder structures introduces the theory of the back-door criterion (Huang et al., 2012; Sharma et al., 2015; Tran et al., 2021; Wang et al., 2021b). Huang et al. (Huang et al., 2012) utilize the back-door criterion to verify whether word-of-mouth recommendations can influence users’ evaluation of the recommended items. Sharma et al. (Sharma et al., 2015) treat an instantaneous shock in direct traffic as an instrumental variable to answer a counterfactual question from purely observational data, namely how much interaction activity there would have been on the online shopping website if recommendations were absent, and apply the back-door criterion to block the possible unobserved confounding effect between the “exposure” $T_{i}$ and “click” $Y_{ij}$ , as shown in Fig. 14. Besides, Tran et al. (Tran et al., 2021) consider personalized job recommendation in Disability Employment Services and present a causality-based method to tackle the problem, in which the covariate set is determined by the back-door criterion.

A multitude of studies employ back-door adjustment to block the back-door path by directly intervening on the treatment variable (Wang et al., 2021a; Zhang et al., 2021b; He et al., 2023; Wang et al., 2023a; Zhan et al., 2022; Rajanala et al., 2022; Xia et al., 2023; Zhang et al., 2023a; Yu et al., 2023a; Tsoumas et al., 2023). For example, Wang et al. (Wang et al., 2021a) propose the framework named DecRS (Deconfounded Recommender System) to eliminate bias amplification through intervention on the user representation $U$ , which removes the effect of the historical user distribution over item groups $D$ on $U$ , as shown in Fig. 15 (a). Zhang et al. (Zhang et al., 2021b) propose PDA (Popularity-bias Deconfounding and Adjusting) to eliminate the effect of item popularity $P$ through intervention on the item $I$ (see Fig. 15 (b)), denoted as:

(26)

\textrm{Pr}(Y=y\mid do(U=u,I=i))=\sum_{p}\textrm{Pr}(y\mid u,i,p)\textrm{Pr}(p\mid u,i)=\sum_{p}\textrm{Pr}(y\mid u,i,p)\textrm{Pr}(p),

where $U$ denotes the user representation and $Y$ represents interactions. $\textrm{Pr}(y\mid u,i,p)$ and $\textrm{Pr}(p)$ are learned separately. It is worth mentioning that PDA can leverage popularity bias to enhance the recommendation performance by adjusting $\textrm{Pr}(p)$ in the inference stage, which can be regarded as counterfactual inference. More recently, Zhang et al. (Zhang et al., 2023a) address duration bias by identifying duration time as a confounder. Subsequently, they group data samples based on watch time feedback and craft novel duration supervision labels, thereby alleviating the confounding bias.

In the literature reviewed above, a series of works integrate SCM-based causal inference into recommender systems following a similar pattern, as shown in Fig. 16. They first analyze the causal relationships between the variables relevant to the issue of concern and formulate the causal graph accordingly. After theoretical analysis, a multi-task or separated structure is adopted to learn the causal effects of the variables on the potential outcome in the training phase. Once training has been completed, appropriate variables are selected for intervention during the inference stage, i.e., they are set to counterfactual values directly or indirectly, and the outcome is estimated based on applicable causal rules (e.g., back-door adjustment, TIE, etc.) to conduct counterfactual inference.

5.3.2. Instrumental Variable-based Approach

The instrumental variable (IV) method is a powerful approach for learning causal effects in the presence of confounders, as it can be applied even without controlling for, or collecting data on, the confounders (Pearl and Mackenzie, 2018). An instrumental variable causally influences the outcome only through the treatment (Fig. 17 (a)) and is defined as follows:

Definition 0 (Instrumental Variable).

Given an observed variable $Z$ , covariates $X$ , the treatment $T$ and the outcome $Y$ , $Z$ is a valid instrumental variable (IV) for the causal effect of $T\rightarrow Y$ if $Z$ satisfies (Angrist et al., 1996):

(1)

$Z\not\!\perp\!\!\!\!\perp T\mid X$ ; and
(2)

$Z\perp\!\!\!\!\perp Y\mid do(T),X$ .

In practice, IV is often implemented in a two-stage least squares (2SLS) procedure.
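A minimal sketch of 2SLS with a single instrument is given below. It assumes a linear model without covariates $X$ and simulates hypothetical data in which an unobserved confounder `u` affects both treatment and outcome, while the instrument `z` affects the outcome only through the treatment; the coefficients and sample size are arbitrary choices for illustration.

```python
import random


def cov(x, y):
    """Sample covariance of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def ols_slope(x, y):
    """Slope of the least-squares regression of y on x."""
    return cov(x, y) / cov(x, x)


def two_stage_least_squares(z, t, y):
    # Stage 1: regress the treatment on the instrument and keep the fitted values.
    b1 = ols_slope(z, t)
    a1 = sum(t) / len(t) - b1 * sum(z) / len(z)
    t_hat = [a1 + b1 * zi for zi in z]
    # Stage 2: regress the outcome on the fitted treatment.
    return ols_slope(t_hat, y)


rng = random.Random(0)
n = 20000
u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
z = [rng.gauss(0, 1) for _ in range(n)]  # instrument, independent of u
t = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
# The true causal effect of t on y is 2.0 in this simulation.
y = [2.0 * ti + 3.0 * ui + rng.gauss(0, 1) for ti, ui in zip(t, u)]

naive_effect = ols_slope(t, y)               # confounded by u
iv_effect = two_stage_least_squares(z, t, y)
```

In this simulation the ordinary regression of `y` on `t` is biased upward by the confounder, whereas the 2SLS estimate lies close to the simulated effect even though `u` is never used in the estimation.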

Table 4. Summary of recommendation models with confounder structure.

Model	Causal method	Backbone model	Issue of concern	Year
(Huang et al., 2012)	Back-door criterion	MF	Effect of WOM recommendation	2012
(Sharma et al., 2015)	Back-door adjustment, IV	-	Effect of recommendations	2015
(Chaney et al., 2018)	-	MF, etc.	Feedback loop bias	2018
DEMER (Shang et al., 2019)	-	(RL)	Unobserved confounding bias	2019
CPR (Yang et al., 2021)	Back-door adjustment	(model-agnostic)	Data insufficiency	2021
CauSeR (Gupta et al., 2021)	Back-door adjustment	SR-GNN (Wu et al., 2019a)	Popularity bias in SBRSs	2021
MCT (Tran et al., 2021)	Back-door criterion, CATE	(custom-designed)	Disability employment	2021
DecRS (Wang et al., 2021a)	Back-door adjustment	FM, NFM (He and Chua, 2017)	Bias amplification	2021
PDA (Zhang et al., 2021b)	Back-door adjustment	MF	Popularity bias	2021
CR (Wang et al., 2021b)	Back-door criterion, TIE	MMGCN (Wei et al., 2019) (multi-task)	Clickbait	2021
D2Q (Zhan et al., 2022)	Back-door adjustment	(custom-designed)	Duration bias	2022
DeSCoVeR (Rajanala et al., 2022)	Back-door adjustment	(custom-designed)	Venue recommendation	2022
IV4Rec (Si et al., 2022)	IV	DIN, NRHUB (Wu et al., 2019b)	Recommendation using search data	2022
HCR (Zhu et al., 2022)	Front-door adjustment	MMGCN	Unobserved confounding bias	2022
DCR (He et al., 2023)	Back-door adjustment	NFM	Observed confounding bias	2023
CaDSI (Wang et al., 2023a)	Back-door adjustment	(custom-designed)	Observed confounding bias	2023
DecUCB (Xia et al., 2023)	Back-door adjustment	(custom-designed, bandit)	Observed confounding bias	2023
iDCF (Zhang et al., 2023c)	Proxy variable	MF	Unobserved confounding bias	2023
CVRDD (Tang et al., 2023)	TIE	MLP(model-agnostic)	Duration bias	2023
DML (Zhang et al., 2023a)	Back-door adjustment	MMoE	Duration bias	2023
CGSR (Yu et al., 2023a)	Back-door adjustment	(custom-designed)	Shortcut paths in SBRSs	2023
(Tsoumas et al., 2023)	Back-door adjustment, IPS	(custom-designed, knowledge-based RS)	Digital agriculture	2023
DDCE (Wang et al., 2023b)	-	(custom-designed)	Popularity bias	2023

*Here, WOM stands for word-of-mouth, RL for reinforcement learning, and SBRS for session-based recommender system.

Though a popular tool, the instrumental variable method finds little application in recommender systems because of the difficulty of finding variables that satisfy the IV conditions. As cited above, Sharma et al. (Sharma et al., 2015) utilize an instantaneous shock in direct traffic as an instrumental variable to evaluate the recommendation effect. Si et al. (Si et al., 2022) propose a model-agnostic framework named IV4Rec that decomposes the embedding vectors into two parts, with users’ search behaviors as the instrumental variable: the causal part, which indicates a user’s personal preference for an item, and the non-causal part, which merely reflects the statistical dependencies between users and items such as exposure mechanism and display position. More specifically, it modifies the traditional IV method, using the residual of the least squares regression as the causal embedding instead of discarding it. The causal graph is illustrated in Fig. 18.

Considering the stringent conditions often associated with IVs, a recent theoretical advancement (Miao et al., 2023) has been proposed to estimate treatment effects utilizing an auxiliary variable, which requires less restrictive prerequisites compared to IVs. An example causal diagram for auxiliary variables is shown in Fig. 17(b), where $Z$ serves as a proxy variable for the unmeasurable confounder. Building on this theory, Zhang et al. (Zhang et al., 2023c) developed iDCF (identifiable deconfounder) to account for the unmeasured user’s socio-economic status $X$ by employing the user’s consumption level as a proxy variable $Z$ , a descendant of the unobserved confounder $X$ yet not directly causally associated with either treatment or outcomes. Furthermore, they leverage iVAE (Khemakhem et al., 2020) to infer the conditional distribution of the latent confounder, thus resolving the non-identification issue encountered in (Wang et al., 2020).

5.3.3. The Front-door-based Approach

The front-door adjustment (Imbens and Rubin, 2015) is another popular method for learning causal effects with unobserved confounders, in which we condition on a set of variables $K$ that satisfies the front-door criterion.

Definition 0 (Front-door Criterion).

Given a pair of treatment $T$ and outcome variable $Y$, a set of variables $K$ is said to satisfy the front-door criterion if:

(1) $K$ intercepts all directed paths from $T$ to $Y$;
(2) there is no back-door path from $T$ to $K$; and
(3) all back-door paths from $K$ to $Y$ are blocked by $T$.

A graph depicting the front-door criterion is shown in Fig. 19 (a). In practice, $K$ is usually the mediator of the causal effect $T\rightarrow Y$ . With the help of $K$ , the causal effect of $T$ on $Y$ can be calculated as follows:

Definition 0 (Front-Door Adjustment).

If $K$ satisfies the front-door criterion relative to $(T,Y)$ and $\textrm{Pr}(T,K)>0$, then the causal effect of $T$ on $Y$ is given by the formula

(27)

\textrm{Pr}(Y\mid do(T))=\sum_{K}\textrm{Pr}(Y\mid do(K))\,\textrm{Pr}(K\mid do(T))=\sum_{K}\textrm{Pr}(K\mid T)\sum_{T^{\prime}}\textrm{Pr}\left(Y\mid T^{\prime},K\right)\textrm{Pr}\left(T^{\prime}\right).
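As a sanity check of Eq. (27), the sketch below builds a small binary model with a hidden confounder $U$ (edges $U\rightarrow T$, $U\rightarrow Y$, $T\rightarrow K$, $K\rightarrow Y$), marginalizes $U$ out to obtain the observational joint over $(T,K,Y)$, and then compares the front-door estimate with the true interventional probability and with the naive conditional probability. All probabilities and function names are illustrative assumptions.

```python
from itertools import product

# Hypothetical ground-truth mechanism: U -> T, U -> Y, T -> K, K -> Y.
P_U1 = 0.4                                  # Pr(U=1), U is never observed
P_T1_GIVEN_U = {0: 0.2, 1: 0.7}             # Pr(T=1 | U)
P_K1_GIVEN_T = {0: 0.1, 1: 0.8}             # Pr(K=1 | T), K depends on T only
P_Y1_GIVEN_KU = {(0, 0): 0.1, (0, 1): 0.5,  # Pr(Y=1 | K, U)
                 (1, 0): 0.4, (1, 1): 0.9}

def bern(p_one, value):
    """Probability of `value` under a Bernoulli with Pr(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def observational_joint():
    """Pr(T, K, Y) with U summed out: everything the analyst can observe."""
    joint = {}
    for u, t, k, y in product((0, 1), repeat=4):
        p = (bern(P_U1, u) * bern(P_T1_GIVEN_U[u], t)
             * bern(P_K1_GIVEN_T[t], k) * bern(P_Y1_GIVEN_KU[(k, u)], y))
        joint[(t, k, y)] = joint.get((t, k, y), 0.0) + p
    return joint

def prob(joint, t=None, k=None, y=None):
    """Marginal probability of the specified values (None means summed out)."""
    return sum(p for (tt, kk, yy), p in joint.items()
               if t in (None, tt) and k in (None, kk) and y in (None, yy))

def front_door(joint, t, y=1):
    """Eq. (27): sum_K Pr(K|T=t) sum_T' Pr(Y=y|T',K) Pr(T')."""
    total = 0.0
    for k in (0, 1):
        p_k_given_t = prob(joint, t=t, k=k) / prob(joint, t=t)
        inner = sum(prob(joint, t=t2, k=k, y=y) / prob(joint, t=t2, k=k)
                    * prob(joint, t=t2) for t2 in (0, 1))
        total += p_k_given_t * inner
    return total

def naive_conditional(joint, t, y=1):
    """Pr(Y=y | T=t), which is biased by the hidden confounder."""
    return prob(joint, t=t, y=y) / prob(joint, t=t)

def true_do(t, y=1):
    """Pr(Y=y | do(T=t)) computed from the full mechanism, including U."""
    return sum(bern(P_U1, u) * bern(P_K1_GIVEN_T[t], k)
               * bern(P_Y1_GIVEN_KU[(k, u)], y)
               for u in (0, 1) for k in (0, 1))

joint = observational_joint()
for t in (0, 1):
    print(t, front_door(joint, t), true_do(t), naive_conditional(joint, t))
```

Because $K$ depends on $T$ alone in this toy model, the front-door estimate agrees with the interventional probability, while the naive conditional does not. If $U$ also influenced $K$, the agreement would break.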

Zhu et al. (Zhu et al., 2022) propose the HCR (Hidden Confounder Removal) framework to mitigate hidden confounding effects through front-door adjustment, in which the user and item features $U$ and $I$ are the treatments, post-click user behaviors $Y$ are the outcome of interest, and the click feedback $K$ acts as the mediator that satisfies the front-door criterion, as shown in Fig. 19 (b). However, in real-world recommendation scenarios, confounding bias also exists in the estimation of the click feedback, in which case the front-door criterion is violated and the adjustment is no longer valid. In fact, the front-door adjustment, like the IV method, finds little application in recommender systems because of the lack of eligible variables.

5.3.4. Deconfounded Recommender Algorithms

Instead of directly applying causal techniques, some studies extend conventional recommendation algorithms to deal with confounders, drawing inspiration from causal analysis. For example, (Chaney et al., 2018) modifies several traditional recommendation algorithms to explore the impact of algorithmic confounding, and finds that the data-algorithm feedback loop amplifies the homogenization of user behavior without corresponding gains in utility and also amplifies the impact of recommendation systems on item consumption.

Some works integrate reinforcement learning-based recommender systems with causal inference to tackle the confounding issue. For example, DEMER (deconfounded multi-agent environment reconstruction) (Shang et al., 2019) follows the generative adversarial training framework to model the hidden confounder, which affects both actions and rewards as an agent interacts with the environment and thus obstructs an effective reconstruction of the environment. DEMER handles it by treating the hidden confounder as a hidden policy. In (Yang et al., 2021), user representations $U$ are considered a confounder of the recommendation lists $T$ and users’ interactions $Y$ with the recommendation lists. To alleviate this confounding bias, CPR (counterfactual personalized ranking framework) builds a recommender simulator to generate new training samples based on the causal graph.

As for session-based recommender systems (SBRSs), Gupta et al. (Gupta et al., 2021) propose the CauSeR (Causal Session-based Recommendations) framework to perform deconfounded training to handle popularity bias. COCO-SBRS (Song et al., 2023b) adopts a self-supervised approach to pre-train a recommendation model to learn the causalities in SBRSs, so as to eliminate confounding bias and make accurate next-item recommendations. In terms of GNN-based recommendations, Gao et al. infer the unobserved confounders in representation learning with the CVAE model (Sohn et al., 2015) and apply them in a GNN-based strategy (Gao et al., 2021).

6. General Counterfactuals-based Methods

Some causal recommender approaches are built on the general concept of counterfactuals, that is, a world that does not exist but can be reasoned about using fundamental laws and human intuition. In this section, we introduce related strategies from the perspective of the recommender issues they try to address, including domain adaptation, data augmentation, fairness, and explanation.

6.1. Domain Adaptation

RSs are trained and evaluated offline under the supervision of previously collected data, which usually suffers from selection bias and confounding bias. This results in a gap between the training goal and the true recommendation objective and, therefore, a sub-optimal recommender algorithm. To address this issue, we would like to evaluate the trained policy on unbiased (uniform) data, which is collected under a randomized treatment policy. However, uniform data is expensive to collect and typically small in scale. To take full advantage of it, researchers train recommender systems with a small amount of unbiased data and a large amount of biased data, in the hope of learning the counterfactual distribution of the biased data, which is both a counterfactual problem and a domain adaptation problem.

(Rosenfeld et al., 2017) and (Bonner and Vasile, 2018) train recommender policies on biased and unbiased data, and add regularization terms to the loss function, inspired by the individual treatment effect, so that the distance between the parameters of the two policies is controlled. (Yuan et al., 2019a) trains an unbiased imputation model to impute the labels of all observed and unobserved events in the biased and unbiased data, and learns the final CTR model by combining the two datasets with a propensity-free doubly robust method. Further, (Liu et al., 2020) proposes KDCRec (Knowledge Distillation framework for Counterfactual Recommendation), in which a teacher network with unbiased data as input guides the biased model via four approaches.

Table 5. Summary of recommendation models with general counterfactuals.

Category	Model	Causal inference method	Backbone model	Year
Domain adaptation	(Rosenfeld et al., 2017)	ITE	Linear/regularized kernel methods	2017
	(Bonner and Vasile, 2018)	ITE	Matrix factorization	2018
	Propensity-free DR (Yuan et al., 2019a)	DR	FFM	2019
	KDCRec (Liu et al., 2020)	ITE	MF (knowledge distillation)	2020
Data augmentation	CF2 (Xiong et al., 2021)	“Minimum” counterfactuals	(custom-designed)	2021
	CASR (Wang et al., 2021c)	“Minimum” counterfactuals	NARM (Li et al., 2017), STAMP (Liu et al., 2018), SASRec (Kang and McAuley, 2018)	2021
	CauseRec (Zhang et al., 2021f)	Counterfactuals	(custom-designed, sequential recommendation)	2021
	POEM (Liu et al., 2022a)	Counterfactuals	GCN	2022
	COCO-SBRS (Song et al., 2023b)	Counterfactuals	(custom-designed, sequential recommendation)	2023
Fairness	(Li et al., 2021a)	Counterfactuals	(custom-designed)	2021
	F-UCB (Huang et al., 2022)	Counterfactuals	UCB	2022
	CLOVER (Wei and He, 2022)	Counterfactuals	MELU (Lee et al., 2019)	2022
	PSF-RS (Zhu et al., 2023b)	“Minimum” counterfactuals	(custom-designed)	2023
Explanation	PRINCE (Ghazimatin et al., 2020)	“Minimum” counterfactuals	HIN (Shi et al., 2016)	2020
	CountER (Tan et al., 2021b)	“Minimum” counterfactuals	MLP (black-box)	2021
	CounterNet (Guo et al., 2023)	“Minimum” counterfactuals	(custom-designed)	2023

6.2. Data Augmentation

Data augmentation is naturally a counterfactual problem, since it amounts to answering questions such as “what would the user’s decision be if a different item had been exposed?”. Therefore, some works integrate counterfactuals into the data augmentation procedure.

Xiong et al. (Xiong et al., 2021) generate new data samples based on users’ feature-level preferences for review-based recommendation. To generate more effective samples, they leverage the “minimum” idea in counterfactuals, learning the “minimum” change in the user’s feature-level preference that can “exactly” reverse the user’s preference ranking on a given item pair. For example, if slightly increasing the price attention of a user who has purchased an iPhone makes Xiaomi more attractive to her, this is regarded as an effective counterfactual sample. Similarly, CASR (Counterfactual Data-Augmentation Sequential Recommendation) (Wang et al., 2021c) generates the counterfactual sequence of items by “minimally” changing the user’s historical items, such that her currently interacted item can be “exactly” altered.
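The “minimum change that exactly reverses a ranking” idea can be illustrated with a deliberately simple stand-in: a linear scoring model, for which the smallest L2 perturbation of the user’s preference vector has a closed form (a projection onto a half-space). This is a sketch under that linearity assumption, not the learning procedure used in the cited works; the feature layout, the numbers, and the function names are hypothetical.

```python
def score(preference, item_features):
    """Linear score: dot product of user preference and item features."""
    return sum(p * f for p, f in zip(preference, item_features))

def minimal_preference_shift(preference, item_a, item_b, margin=1e-6):
    """Smallest L2 change `delta` to `preference` such that
    score(preference + delta, item_b) >= score(preference + delta, item_a) + margin.
    """
    diff = [b - a for a, b in zip(item_a, item_b)]
    gap = score(preference, diff)            # current score(b) - score(a)
    if gap >= margin:
        return [0.0] * len(preference)       # ranking already reversed
    norm_sq = sum(d * d for d in diff)
    if norm_sq == 0.0:
        raise ValueError("identical items cannot be re-ranked")
    step = (margin - gap) / norm_sq          # projection onto the half-space
    return [step * d for d in diff]

# Hypothetical features: [quality, affordability]; preference = attention weights.
user = [1.0, 0.2]
iphone = [0.9, 0.2]
xiaomi = [0.7, 0.8]

delta = minimal_preference_shift(user, iphone, xiaomi, margin=0.01)
counterfactual_user = [p + d for p, d in zip(user, delta)]
print("delta:", delta)
print("factual prefers iPhone:", score(user, iphone) > score(user, xiaomi))
print("counterfactual prefers Xiaomi:",
      score(counterfactual_user, xiaomi) > score(counterfactual_user, iphone))
```

In this toy setting the shift mostly raises the weight on affordability, which mirrors the price-attention example in the text. The counterfactual user together with the reversed item pair would then serve as an additional training sample.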

The CauseRec (Counterfactual User Sequence Synthesis for Sequential Recommendation) proposed by Zhang et al. (Zhang et al., 2021f) generates counterfactual data in a different way. It identifies indispensable and dispensable concepts in the historical behavior sequence. The former represent a meaningful aspect of the user’s interest, while the latter indicate noisy behaviors that are less important in representing user interest. Therefore, it is reasonable to argue that replacing indispensable concepts in the original user sequence causes the user representation to deviate from the original one, while replacing the dispensable ones still yields a similar user representation, which CauseRec realizes through contrastive learning. Liu et al. (Liu et al., 2022a) focus on the recommendation scenario where users are exposed to decision factor-based persuasion texts, i.e., persuasion factors, and generate new training samples by making simple but reasonable counterfactual assumptions about user behaviors, including:

• If a user clicks on an item without the existence of persuasion factors, the user will still be likely to click on it with a matching persuasion factor.
• If a user does not click on an item with the existence of persuasion factors, the user will not click on it when the persuasion factor does not exist.
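The two assumptions above translate directly into a rule-based augmentation step. The following is a minimal sketch, assuming a flat log of interactions and a hypothetical `matching_factor` callback that returns a persuasion factor suited to a given user and item. The record layout and all names are illustrative and are not taken from the cited work.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional

@dataclass(frozen=True)
class Interaction:
    user: str
    item: str
    persuasion: Optional[str]   # None means no persuasion factor was shown
    clicked: bool

def augment(logs: List[Interaction],
            matching_factor: Callable[[str, str], str]) -> List[Interaction]:
    """Generate counterfactual samples from the two assumptions."""
    new_samples = []
    for record in logs:
        if record.clicked and record.persuasion is None:
            # Rule 1: a click without persuasion is assumed to remain a click
            # when a matching persuasion factor is shown.
            factor = matching_factor(record.user, record.item)
            new_samples.append(replace(record, persuasion=factor))
        elif not record.clicked and record.persuasion is not None:
            # Rule 2: a non-click with persuasion is assumed to remain a
            # non-click when the persuasion factor is removed.
            new_samples.append(replace(record, persuasion=None))
    return new_samples

logs = [
    Interaction("u1", "hotel_a", None, True),
    Interaction("u2", "hotel_b", "free breakfast", False),
    Interaction("u3", "hotel_c", "near subway", True),   # no rule applies
]
augmented = augment(logs, lambda user, item: "matched factor")
for sample in augmented:
    print(sample)
```

Observed clicks with a persuasion factor and observed non-clicks without one trigger neither rule, so they produce no counterfactual sample. Rule 1 is stated as "likely" in the text, so a real pipeline might attach a soft label or weight rather than a hard click.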

In recent work, Song et al. (Song et al., 2023b) categorize the factors influencing user interactions in session-based recommender systems into two types: inner-session causes and outer-session causes, and then generate counterfactual data samples through a novel combination of original inner-session causes and outer-session causes from similar users.

6.3. Fairness and Explanation

The counterfactual technique is a natural tool for the evaluation of fairness since we can compare the outcome (ratings, recommendation lists, etc.) in the real world and in the counterfactual world in which only users’ sensitive features (e.g., gender and race) are intervened (Huang et al., 2022; Wei and He, 2022).

Definition 0 (Counterfactual Fairness).

A recommender model is counterfactually fair if, for any possible user $u$ with features $X=x$ and $Z=z$:

(28)

\textrm{Pr}(y\mid x,z)=\textrm{Pr}(y\mid x,do(z^{\prime}))

for any value $y$ and $z^{\prime}$, where $y$ is a value of the potential outcome $Y$ for user $u$, $Z$ denotes the user’s sensitive features, and $X$ denotes the features that are causally independent of $Z$.
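For a deterministic scoring model, the distributions in Eq. (28) collapse to point masses, so the condition reduces to the output being unchanged when the sensitive feature is set to any other value while the $Z$-independent features $X$ stay fixed. The check below is a minimal sketch under that determinism assumption; the two toy models, the feature values, and the function names are hypothetical.

```python
def is_counterfactually_fair(model, x, z, z_values, tol=1e-9):
    """Return True if model(x, z') equals model(x, z) for every z' in z_values.

    Assumes a deterministic model and features x that are causally
    independent of z, so intervening on z leaves x unchanged.
    """
    factual = model(x, z)
    return all(abs(model(x, z_alt) - factual) <= tol for z_alt in z_values)

def fair_model(x, z):
    # Ignores the sensitive feature entirely.
    return 0.3 * x[0] + 0.7 * x[1]

def unfair_model(x, z):
    # Adds a bonus that depends on the sensitive feature.
    return 0.3 * x[0] + 0.7 * x[1] + (0.2 if z == "female" else 0.0)

x = [0.5, 0.1]
genders = ["female", "male"]
print(is_counterfactually_fair(fair_model, x, "female", genders))
print(is_counterfactually_fair(unfair_model, x, "female", genders))
```

When some input features are themselves descendants of the sensitive feature, simply swapping $z$ is not enough, because the intervention would also change those features. That case needs a causal model of how $Z$ affects them.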

Based on counterfactual fairness, Li et al. generate sensitive feature-independent user embeddings through adversarial learning (Li et al., 2021a). They simultaneously train a predictor to learn the filtered embedding and an adversarial classifier to predict the sensitive features from the learned representation. For reinforcement learning-based recommendation, Huang et al. propose F-UCB (fair causal bandit) (Huang et al., 2022), which at each round picks an arm from a subset of arms that all satisfy the counterfactual fairness constraint, namely that users receive similar rewards regardless of their sensitive attributes. Zhu et al. (Zhu et al., 2023b) contend that directly removing or altering sensitive features will inevitably compromise the quality of recommendations, as these features can influence user interests fairly (e.g., racial influences on cultural preferences). To address this issue, their proposed PSF-RS (Path-Specific Fair Recommender System) separates the influence of sensitive features on interaction outcomes into fair and unfair paths, and addresses the path-specific bias by minimally transforming the biased factual world into a hypothetically fair one.

As for explanation, counterfactuals describe a dependency on the external facts that lead to certain outcomes, and thus allow researchers to reason about the behavior of a black-box algorithm (Wachter et al., 2017). The literature on counterfactual explanation also resorts to the “minimum” idea in counterfactuals. For example, (Ghazimatin et al., 2020) presents PRINCE (Provider-side Interpretability with Counterfactual Explanations), which searches a heterogeneous information network of users, items, and other entities for a minimal set of actions performed by the user that, if removed, would change the recommendation to a different item. To illustrate, if a user who has bought an iPhone and followed MacBook receives a recommendation for AirPods, and would not have received it had she not bought the iPhone, PRINCE regards the action “purchase of iPhone” as the explanation of the recommendation. Similarly, CountER (Counterfactual Explainable Recommendation), proposed by (Tan et al., 2021b), seeks the minimum changes to item features that exactly reverse the recommendation decision.
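The search for a minimal set of user actions whose removal changes the recommendation can be sketched with an exhaustive search over action subsets of increasing size. This brute-force procedure and the additive toy recommender are stand-ins chosen for clarity. They are not the algorithm or the graph-based recommender of the cited work, and the action names and weights are made up.

```python
from itertools import combinations

def recommend(actions, weights):
    """Toy recommender: sum each action's contribution per item, return the top item."""
    scores = {}
    for action in actions:
        for item, weight in weights[action].items():
            scores[item] = scores.get(item, 0.0) + weight
    if not scores:
        return None
    # Sorting first makes tie-breaking deterministic (alphabetical).
    return max(sorted(scores), key=lambda item: scores[item])

def minimal_explanation(actions, weights):
    """Smallest set of actions whose removal changes the top recommendation."""
    original = recommend(actions, weights)
    ordered = sorted(actions)
    for size in range(1, len(ordered) + 1):
        for removed in combinations(ordered, size):
            remaining = [a for a in ordered if a not in removed]
            if recommend(remaining, weights) != original:
                return set(removed)
    return None

# Hypothetical contribution of each user action to candidate items.
weights = {
    "bought iPhone": {"AirPods": 0.6, "Phone case": 0.5},
    "followed MacBook": {"AirPods": 0.2, "Laptop sleeve": 0.5},
    "viewed novel": {"Bookmark": 0.3},
}
actions = list(weights)

print("recommended:", recommend(actions, weights))
print("explanation:", minimal_explanation(actions, weights))
```

With these weights the top item is AirPods, and removing the iPhone purchase alone is enough to change it, which matches the example in the text. Exhaustive search is exponential in the number of actions, so it is only viable for short histories.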

The aforementioned studies are predominantly post-hoc methods tailored for proprietary machine learning models, which prevents the explanatory models from leveraging information from the predictive models. The work by Guo et al. (Guo et al., 2023) introduces CounterNet, which combines the predictive model and the counterfactual explanation generator in an end-to-end framework. Beyond the scope of recommender systems, there are additional counterfactual explanation studies that may serve as supplementary references (Wachter et al., 2017; Joshi et al., 2019; Nemirovsky et al., 2022; Pawelczyk et al., 2020).

7. Open Problems and Future Directions

The introduction of causal inference into recommender systems is still relatively recent, and there are still many promising but unexplored research directions, which we will discuss in this section.

Causal assumptions in recommendations. To extract causal knowledge from statistical data, causal inference-based methods rely on several causal assumptions. Although much of the existing work employs causal inference methods, these assumptions are often not stated explicitly and may even be violated. Existing work either ignores this issue or makes simple assumptions about data distributions. Therefore, estimating the impact of violations of causal assumptions on experimental results is crucial for bridging the gap between modern recommender system design and causal inference. For example, the positivity assumption is essential in PO-based approaches for the unbiased estimation of causal effects, but the data sparsity of recommender systems makes it difficult to satisfy. Considering that a small difference in recommendation accuracy may lead to a large swing in platform revenue, the effects on predictions of both violated causal assumptions and artificial assumptions should be investigated.

Causal Discovery in recommendations. The integration of causal discovery and causal inference is inevitable in recommender systems because, as mentioned in Section 2, the former serves as the foundation for the latter. In the absence of causal discovery, different studies often propose divergent causal graphs for identical problems, leading to a plethora of methodologies whose underlying causal graphs are never validated. Although these methods have proven effective experimentally, they might lack generality and could be difficult to extrapolate to other datasets, a limitation notably prevalent in SCM-based methods. Causal discovery substantially reduces the reliance on manually designed causal graphs, enhancing the generalizability and applicability of causal recommendation methods across diverse scenarios. Possible research directions for causal discovery include: 1) Discovering causal relationships between variables, for example, between user attributes (e.g., age, economic status, and geographic context) and interaction decisions. 2) Discovering causal relationships between users and items. For example, paper and ink cartridges are always observed together, yet neither causes the other; instead, they share a common cause, the item “printer” (Wang et al., 2022a). Accurately identifying item-level causal relationships can significantly enhance the precision of recommendations. Some exploratory work has been done in this direction (Wang et al., 2022a; He et al., 2022). One potential solution involves combining causal discovery with GNNs. The ability of GNNs to explore structural information between nodes in a graph (Gao et al., 2022b) gives them a natural advantage in identifying causal relationships between users and items. Furthermore, the causal knowledge uncovered can be incorporated into GNN-based recommendation algorithms to facilitate the learning of semantically meaningful and identifiable graph representations (Jiang et al., 2023).

Transfer learning and Out-of-distribution recommendation. Because of the data sparsity of recommender systems, a practical choice is to transfer user and item knowledge from other domains to improve prediction performance during cold start, offline evaluation, or online testing, which is a transfer learning problem. Even within the same dataset, natural shifts in user preference or artificial bias can violate the IID hypothesis, which is an out-of-distribution (OOD) recommendation issue (He et al., 2022). The core of both transfer learning and OOD recommendation is to transfer beneficial shared knowledge, such as users’ inherent and unchanging preferences. Thus, they can be formulated as invariant learning in some cases (Wang et al., 2022b). As we mentioned in Section 1, causal inference aims to discover the invariant causal relationships in data, which can be reused in new domains. From this perspective, adopting causal inference to improve robustness and generalization for cross-domain or OOD recommendation is a promising direction, and some promising attempts have recently appeared (He et al., 2022; Wang et al., 2022c).

Dynamic recommendation. Modern recommender systems usually involve feedback loops and dynamic updates. Therefore, it is crucial to incorporate loops into the causality-based methods to accurately model the dynamic and iterative data collection process for recommender systems (Xu et al., 2022). Some impressive work has also been proposed (Chaney et al., 2018; Wang et al., 2021a; Gupta et al., 2021; Krauth et al., 2022). However, uncontrolled feedback loops may give rise to issues like the Matthew effect, echo chambers (Chaney et al., 2018; Ge et al., 2020; Xu et al., 2022) and bias amplification (Chaney et al., 2018; Wang et al., 2021a). Original debiasing approaches (e.g., back-door adjustment) cannot be applied directly due to the change in the form of causal models. Therefore, deconfounding in multi-step and feedback loop-involved causal models is still an open research field.

Causality-inspired foundation models for recommendation. The emergence of Large Language Models (LLMs) has sparked extensive exploration into the development of recommendation foundation models. These models, pre-trained on diverse language or interaction data, can be adapted for a wide array of downstream recommendation tasks (Liu et al., 2023; Qiu et al., 2021; Hou et al., 2022; Kang et al., 2023; Lin et al., 2023). Integrating causality into these models is a promising direction (Petrov and Macdonald, 2023). However, it has been observed that bias in the pretraining corpus of foundation models can lead to unfairness in recommender systems from both user-side (Hua et al., 2023; Zhang et al., 2023b) and item-side (Hou et al., 2023). Current studies primarily focus on the fairness issue in specific recommendation tasks. To address this, there is a growing interest in formulating novel pretraining tasks to evaluate the causal inference capabilities of recommendation foundation models (Jin et al., 2023), aiming to mitigate bias issues at their root and enhance overall recommendation performance.

8. Conclusion

In this survey, we have summarized the mechanisms and strategies of causal inference for recommender systems from a theoretical perspective, covering PO framework-based, SCM framework-based, and general counterfactuals-based methods. The survey gives a clear description of the strengths of causal inference for recommendation and uses a unified symbol system to describe a large number of existing causal recommender approaches. We hope this survey helps researchers in the recommendation field to utilize these methods and to innovate.

Acknowledgements.

This research work is supported by the National Key Research and Development Program of China under Grant No. 2021ZD0113602, the National Natural Science Foundation of China under Grant Nos. 62176014 and 62276015, and the Fundamental Research Funds for the Central Universities.

Declaration of Interests

The authors declare no competing interests that could have appeared to influence the work reported in this paper.

References

Adomavicius and Tuzhilin (2005) Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17, 6 (2005), 734–749.
Aliprantis (2015) Dionissi Aliprantis. 2015. A distinction between causal effects in Structural and Rubin Causal Models. (2015).
Angrist et al. (1996) Joshua D Angrist, Guido W Imbens, and Donald B Rubin. 1996. Identification of causal effects using instrumental variables. Journal of the American statistical Association 91, 434 (1996), 444–455.
Bareinboim and Pearl (2012) Elias Bareinboim and Judea Pearl. 2012. Controlling selection bias in causal inference. In Artificial Intelligence and Statistics. PMLR, 100–108.
Baron and Kenny (1986) Reuben M Baron and David A Kenny. 1986. The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology 51, 6 (1986), 1173.
Betlei et al. (2021) Artem Betlei, Eustache Diemert, and Massih-Reza Amini. 2021. Uplift modeling with generalization guarantees. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 55–65.
Beygelzimer and Langford (2009) Alina Beygelzimer and John Langford. 2009. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 129–138.
Bonner and Vasile (2018) Stephen Bonner and Flavian Vasile. 2018. Causal embeddings for recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. 104–112.
Bottou et al. (2013) Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Journal of Machine Learning Research 14, 11 (2013).
Brodersen et al. (2015) Kay H Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L Scott. 2015. Inferring causal impact using Bayesian structural time-series models. The Annals of Applied Statistics (2015), 247–274.
Cai et al. (2022) Yinqiong Cai, Jiafeng Guo, Yixing Fan, Qingyao Ai, Ruqing Zhang, and Xueqi Cheng. 2022. Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 118–127.
Chaney et al. (2018) Allison JB Chaney, Brandon M Stewart, and Barbara E Engelhardt. 2018. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In Proceedings of the 12th ACM Conference on Recommender Systems. 224–232.
Chen et al. (2021a) Jiawei Chen, Hande Dong, Yang Qiu, Xiangnan He, Xin Xin, Liang Chen, Guli Lin, and Keping Yang. 2021a. Autodebias: Learning to debias for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 21–30.
Chen et al. (2021b) Mouxiang Chen, Chenghao Liu, Jianling Sun, and Steven CH Hoi. 2021b. Adapting Interactional Observation Embedding for Counterfactual Learning to Rank. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 285–294.
Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, 785–794.
Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10.
Choi et al. (2011) Jaewon Choi, Hong Joo Lee, and Yong Cheol Kim. 2011. The influence of social presence on customer intention to reuse online recommender systems: The roles of personalization and product type. International Journal of Electronic Commerce 16, 1 (2011), 129–154.
Christakopoulou et al. (2020) Konstantina Christakopoulou, Madeleine Traverse, Trevor Potter, Emma Marriott, Daniel Li, Chris Haulk, Ed H Chi, and Minmin Chen. 2020. Deconfounding user satisfaction estimation from response rate bias. In Fourteenth ACM Conference on Recommender Systems. 450–455.
Correa et al. (2019) Juan D Correa, Jin Tian, and Elias Bareinboim. 2019. Identification of causal effects in the presence of selection bias. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 2744–2751.
Dai et al. (2022) Quanyu Dai, Haoxuan Li, Peng Wu, Zhenhua Dong, Xiao-Hua Zhou, Rui Zhang, Rui Zhang, and Jie Sun. 2022. A generalized doubly robust learning framework for debiasing post-click conversion rate prediction. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 252–262.
Diemert et al. (2018) Eustache Diemert, Artem Betlei, Christophe Renaudin, and Massih-Reza Amini. 2018. A large scale benchmark for uplift modeling. In KDD.
Ding et al. (2022a) Sihao Ding, Fuli Feng, Xiangnan He, Yong Liao, Jun Shi, and Yongdong Zhang. 2022a. Causal incremental graph convolution for recommender system retraining. IEEE Transactions on Neural Networks and Learning Systems (2022).
Ding et al. (2022b) Sihao Ding, Peng Wu, Fuli Feng, Yitong Wang, Xiangnan He, Yong Liao, and Yongdong Zhang. 2022b. Addressing unmeasured confounder for recommendation with sensitivity analysis. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 305–315.
Dudík et al. (2014) Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. 2014. Doubly robust policy evaluation and optimization. Statist. Sci. 29, 4 (2014), 485–511.
Elwert and Winship (2014) Felix Elwert and Christopher Winship. 2014. Endogenous selection bias: The problem of conditioning on a collider variable. Annual review of sociology 40 (2014), 31.
Fang et al. (2019) Zhichong Fang, Aman Agarwal, and Thorsten Joachims. 2019. Intervention harvesting for context-dependent examination-bias estimation. In Proceedings of the 42nd international ACM SIGIR Conference on Research and Development in Information Retrieval. 825–834.
Fong et al. (2018) Christian Fong, Chad Hazlett, and Kosuke Imai. 2018. Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements. The Annals of Applied Statistics 12, 1 (2018), 156–177.
Funk et al. (2011) Michele Jonsson Funk, Daniel Westreich, Chris Wiesen, Til Stürmer, M Alan Brookhart, and Marie Davidian. 2011. Doubly robust estimation of causal effects. American journal of epidemiology 173, 7 (2011), 761–767.
Gao et al. (2022a) Chongming Gao, Wenqiang Lei, Jiawei Chen, Shiqi Wang, Xiangnan He, Shijun Li, Biao Li, Yuan Zhang, and Peng Jiang. 2022a. CIRS: Bursting Filter Bubbles by Counterfactual Interactive Recommender System. arXiv preprint arXiv:2204.01266 (2022).
Gao et al. (2022b) Chen Gao, Xiang Wang, Xiangnan He, and Yong Li. 2022b. Graph neural networks for recommender system. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1623–1625.
Gao et al. (2022c) Chen Gao, Yu Zheng, Wenjie Wang, Fuli Feng, Xiangnan He, and Yong Li. 2022c. Causal Inference in Recommender Systems: A Survey and Future Directions. arXiv preprint arXiv:2208.12397 (2022).
Gao et al. (2021) Junruo Gao, Mengyue Yang, Yuyang Liu, and Jun Li. 2021. Deconfounding Representation Learning Based on User Interactions in Recommendation Systems. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 588–599.
Ge et al. (2020) Yingqiang Ge, Shuya Zhao, Honglu Zhou, Changhua Pei, Fei Sun, Wenwu Ou, and Yongfeng Zhang. 2020. Understanding echo chambers in e-commerce recommender systems. In Proceedings of the 43rd international ACM SIGIR Conference on Research and Development in Information Retrieval. 2261–2270.
Gelman (2011) Andrew Gelman. 2011. Causality and statistical learning.
Ghazimatin et al. (2020) Azin Ghazimatin, Oana Balalau, Rishiraj Saha Roy, and Gerhard Weikum. 2020. PRINCE: Provider-side interpretability with counterfactual explanations in recommender systems. In Proceedings of the 13th International Conference on Web Search and Data Mining. 196–204.
Gilotte et al. (2018) Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline a/b testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 198–206.
Glymour et al. (2016) Madelyn Glymour, Judea Pearl, and Nicholas P Jewell. 2016. Causal inference in statistics: A primer. John Wiley & Sons.
Goldenberg et al. (2020) Dmitri Goldenberg, Javier Albert, Lucas Bernardi, and Pablo Estevez. 2020. Free lunch! retrospective uplift modeling for dynamic promotions recommendation within roi constraints. In Fourteenth ACM Conference on Recommender Systems. 486–491.
Gomez-Uribe and Hunt (2015) Carlos A Gomez-Uribe and Neil Hunt. 2015. The netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS) 6, 4 (2015), 1–19.
Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 6645–6649.
Guo et al. (2023) Hangzhi Guo, Thanh H Nguyen, and Amulya Yadav. 2023. CounterNet: End-to-End Training of Prediction Aware Counterfactual Explanations. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 577–589.
Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1725–1731.
Guo et al. (2020) Ruocheng Guo, Lu Cheng, Jundong Li, P Richard Hahn, and Huan Liu. 2020. A survey of learning causality with data: Problems and methods. ACM Computing Surveys (CSUR) 53, 4 (2020), 1–37.
Guo et al. (2021) Siyuan Guo, Lixin Zou, Yiding Liu, Wenwen Ye, Suqi Cheng, Shuaiqiang Wang, Hechang Chen, Dawei Yin, and Yi Chang. 2021. Enhanced doubly robust learning for debiasing post-click conversion rate estimation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 275–284.
Gupta et al. (2021) Priyanka Gupta, Ankit Sharma, Pankaj Malhotra, Lovekesh Vig, and Gautam Shroff. 2021. CauSeR: Causal Session-based Recommendations for Handling Popularity Bias. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3048–3052.
Gutierrez and Gérardy (2017) Pierre Gutierrez and Jean-Yves Gérardy. 2017. Causal inference and uplift modelling: A review of the literature. In International Conference on Predictive Applications and APIs. PMLR, 1–13.
He and Chua (2017) Xiangnan He and Tat-Seng Chua. 2017. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 355–364.
He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
He et al. (2023) Xiangnan He, Yang Zhang, Fuli Feng, Chonggang Song, Lingling Yi, Guohui Ling, and Yongdong Zhang. 2023. Addressing Confounding Feature Issue for Causal Recommendation. ACM Transactions on Information Systems (TOIS) (2023), 23 pages.
He et al. (2022) Yue He, Zimu Wang, Peng Cui, Hao Zou, Yafeng Zhang, Qiang Cui, and Yong Jiang. 2022. CausPref: Causal Preference Learning for Out-of-Distribution Recommendation. In Proceedings of the ACM Web Conference 2022. 410–421.
Hernán et al. (2002) Miguel A Hernán, Sonia Hernández-Díaz, Martha M Werler, and Allen A Mitchell. 2002. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. American Journal of Epidemiology 155, 2 (2002), 176–184.
Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
Horvitz and Thompson (1952) Daniel G Horvitz and Donovan J Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 260 (1952), 663–685.
Hou et al. (2022) Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585–593.
Hou et al. (2023) Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2023. Large language models are zero-shot rankers for recommender systems. arXiv preprint arXiv:2305.08845 (2023).
Hua et al. (2023) Wenyue Hua, Yingqiang Ge, Shuyuan Xu, Jianchao Ji, and Yongfeng Zhang. 2023. UP5: Unbiased Foundation Model for Fairness-aware Recommendation. arXiv preprint arXiv:2305.12090 (2023).
Huang et al. (2012) Junming Huang, Xue-Qi Cheng, Hua-Wei Shen, Tao Zhou, and Xiaolong Jin. 2012. Exploring social influence via posterior effect of word-of-mouth recommendations. In Proceedings of the fifth ACM International Conference on Web Search and Data Mining. 573–582.
Huang et al. (2022) Wen Huang, Lu Zhang, and Xintao Wu. 2022. Achieving Counterfactual Fairness for Causal Bandit. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 6952–6959.
Imbens and Rubin (2015) Guido W Imbens and Donald B Rubin. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
Jaskowski and Jaroszewicz (2012) Maciej Jaskowski and Szymon Jaroszewicz. 2012. Uplift modeling for clinical trial data. In ICML Workshop on Clinical Data Analysis, Vol. 46. 79–95.
Jiang and Li (2016) Nan Jiang and Lihong Li. 2016. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning. PMLR, 652–661.
Jiang et al. (2023) Wenzhao Jiang, Hao Liu, and Hui Xiong. 2023. Survey on Trustworthy Graph Neural Networks: From A Causal Perspective. arXiv preprint arXiv:2312.12477 (2023).
Jin et al. (2023) Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. 2023. Can Large Language Models Infer Causation from Correlation? arXiv preprint arXiv:2306.05836 (2023).
Joachims (2002) Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 133–142.
Joachims (2006) Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 217–226.
Joachims et al. (2017) Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased learning-to-rank with biased feedback. In Proceedings of the tenth ACM International Conference on Web Search and Data Mining. 781–789.
Joshi et al. (2019) Shalmali Joshi, Oluwasanmi Koyejo, Warut Vijitbenjaronk, Been Kim, and Joydeep Ghosh. 2019. Towards realistic individual recourse and actionable explanations in black-box decision making systems. arXiv preprint arXiv:1907.09615 (2019).
Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
Kang et al. (2023) Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, and Derek Zhiyuan Cheng. 2023. Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction. arXiv preprint arXiv:2305.06474 (2023).
Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: a highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 3149–3157.
Kenny (1979) David A Kenny. 1979. Correlation and causality. New York: Wiley (1979).
Kessler et al. (2019) Ronald C Kessler, Robert M Bossarte, Alex Luedtke, Alan M Zaslavsky, and Jose R Zubizarreta. 2019. Machine learning methods for developing precision treatment rules with observational data. Behaviour Research and Therapy 120 (2019), 103412.
Khemakhem et al. (2020) Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvarinen. 2020. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics. PMLR, 2207–2217.
Kiyohara et al. (2022) Haruka Kiyohara, Yuta Saito, Tatsuya Matsuhiro, Yusuke Narita, Nobuyuki Shimizu, and Yasuo Yamamoto. 2022. Doubly robust off-policy evaluation for ranking policies under the cascade behavior model. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 487–497.
Kohavi et al. (2013) Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1168–1176.
Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
Krauth et al. (2022) Karl Krauth, Yixin Wang, and Michael I Jordan. 2022. Breaking Feedback Loops in Recommender Systems with Causal Inference. arXiv preprint arXiv:2207.01616 (2022).
Lee et al. (2019) Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1073–1082.
Li et al. (2023b) Haoxuan Li, Chunyuan Zheng, Peng Wu, Kun Kuang, Yue Liu, and Peng Cui. 2023b. Who should be given incentives? counterfactual optimal treatment regimes learning for recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1235–1247.
Li et al. (2017) Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1419–1428.
Li et al. (2023a) Qian Li, Xiangmeng Wang, Zhichao Wang, and Guandong Xu. 2023a. Be causal: De-biasing social network confounding in recommendation. ACM Transactions on Knowledge Discovery from Data 17, 1 (2023), 1–23.
Li et al. (2016) Sheng Li, Nikos Vlassis, Jaya Kawale, and Yun Fu. 2016. Matching via dimensionality reduction for estimation of treatment effects in digital marketing campaigns. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. 3768–3774.
Li et al. (2021b) Siqing Li, Liuyi Yao, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Tonglei Guo, Bolin Ding, and Ji-Rong Wen. 2021b. Debiasing Learning based Cross-domain Recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3190–3199.
Li et al. (2021a) Yunqi Li, Hanxiong Chen, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang. 2021a. Towards personalized fairness based on causal notion. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1054–1063.
Liang et al. (2016) Dawen Liang, Laurent Charlin, James McInerney, and David M Blei. 2016. Modeling user exposure in recommendation. In Proceedings of the 25th International Conference on World Wide Web. 951–961.
Lin et al. (2023) Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, et al. 2023. How Can Recommender Systems Benefit from Large Language Models: A Survey. arXiv preprint arXiv:2306.05817 (2023).
Little and Rubin (2019) Roderick JA Little and Donald B Rubin. 2019. Statistical analysis with missing data. Vol. 793. John Wiley & Sons.
Liu et al. (2022a) Chang Liu, Chen Gao, Yuan Yuan, Chen Bai, Lingrui Luo, Xiaoyi Du, Xinlei Shi, Hengliang Luo, Depeng Jin, and Yong Li. 2022a. Modeling Persuasion Factor of User Decision for Recommendation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3366–3376.
Liu et al. (2020) Dugang Liu, Pengxiang Cheng, Zhenhua Dong, Xiuqiang He, Weike Pan, and Zhong Ming. 2020. A general knowledge distillation framework for counterfactual recommendation via uniform data. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 831–840.
Liu et al. (2021) Dugang Liu, Pengxiang Cheng, Hong Zhu, Zhenhua Dong, Xiuqiang He, Weike Pan, and Zhong Ming. 2021. Mitigating confounding bias in recommendation via information bottleneck. In Fifteenth ACM Conference on Recommender Systems. 351–360.
Liu et al. (2023) Qijiong Liu, Nuo Chen, Tetsuya Sakai, and Xiao-Ming Wu. 2023. A First Look at LLM-Powered Generative News Recommendation. arXiv preprint arXiv:2305.06566 (2023).
Liu et al. (2018) Qiao Liu, Yifu Zeng, Refuoe Mokhosi, and Haibin Zhang. 2018. STAMP: short-term attention/memory priority model for session-based recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1831–1839.
Liu et al. (2022b) Yaxu Liu, Jui-Nan Yen, Bowen Yuan, Rundong Shi, Peng Yan, and Chih-Jen Lin. 2022b. Practical Counterfactual Policy Learning for Top-K Recommendations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1141–1151.
Luo et al. (2013) Chuan Luo, Xin Robert Luo, Laurie Schatzberg, and Choon Ling Sia. 2013. Impact of informational factors on online recommendation credibility: The moderating role of source credibility. Decision Support Systems 56 (2013), 92–102.
Marlin and Zemel (2009) Benjamin M Marlin and Richard S Zemel. 2009. Collaborative prediction and ranking with non-random missing data. In Proceedings of the third ACM Conference on Recommender Systems. 5–12.
McInerney et al. (2020) James McInerney, Brian Brost, Praveen Chandar, Rishabh Mehrotra, and Benjamin Carterette. 2020. Counterfactual evaluation of slate recommendations with sequential reward interactions. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1779–1788.
Mehrotra et al. (2020) Rishabh Mehrotra, Prasanta Bhattacharya, and Mounia Lalmas. 2020. Inferring the Causal Impact of New Track Releases on Music Recommendation Platforms through Counterfactual Predictions. In Fourteenth ACM Conference on Recommender Systems. 687–691.
Mehrotra et al. (2018) Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and Fernando Diaz. 2018. Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2243–2251.
Miao et al. (2023) Wang Miao, Wenjie Hu, Elizabeth L Ogburn, and Xiao-Hua Zhou. 2023. Identifying effects of multiple treatments in the presence of unmeasured confounding. J. Amer. Statist. Assoc. 118, 543 (2023), 1953–1967.
Mohan and Pearl (2021) Karthika Mohan and Judea Pearl. 2021. Graphical models for processing missing data. J. Amer. Statist. Assoc. 116, 534 (2021), 1023–1037.
Mondal et al. (2022) Abhirup Mondal, Anirban Majumder, and Vineet Chaoji. 2022. ASPIRE: Air Shipping Recommendation for E-commerce Products via Causal Inference Framework. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3584–3592.
Naser (2022) MZ Naser. 2022. Causality, Causal Discovery, and Causal Inference in Structural Engineering. arXiv preprint arXiv:2204.01543 (2022).
Nassif et al. (2013) Houssam Nassif, Finn Kuusisto, Elizabeth S Burnside, and Jude W Shavlik. 2013. Uplift Modeling with ROC: An SRL Case Study. In ILP (late breaking papers). Citeseer, 40–45.
Nemirovsky et al. (2022) Daniel Nemirovsky, Nicolas Thiebaut, Ye Xu, and Abhishek Gupta. 2022. CounteRGAN: Generating counterfactuals for real-time recourse and interpretability using residual GANs. In Uncertainty in Artificial Intelligence. PMLR, 1488–1497.
Pariser (2011) Eli Pariser. 2011. The filter bubble: How the new personalized web is changing what we read and how we think. Penguin.
Pawelczyk et al. (2020) Martin Pawelczyk, Klaus Broelemann, and Gjergji Kasneci. 2020. Learning model-agnostic counterfactual explanations for tabular data. In Proceedings of The Web Conference 2020. 3126–3132.
Pearl (1988) Judea Pearl. 1988. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann.
Pearl (1995) Judea Pearl. 1995. Causal diagrams for empirical research. Biometrika 82, 4 (1995), 669–688.
Pearl (2009) Judea Pearl. 2009. Causality (2 ed.). Cambridge University Press. https://doi.org/10.1017/CBO9780511803161
Pearl (2010) Judea Pearl. 2010. Causal inference. Causality: Objectives and Assessment (2010), 39–58.
Pearl (2018) Judea Pearl. 2018. Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 3–3.
Pearl (2022) Judea Pearl. 2022. Direct and indirect effects. In Probabilistic and Causal Inference: The Works of Judea Pearl. 373–392.
Pearl and Mackenzie (2018) Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect (1st ed.). Basic Books, Inc., USA.
Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2017. Elements of causal inference: foundations and learning algorithms. The MIT Press.
Petrov and Macdonald (2023) Aleksandr V Petrov and Craig Macdonald. 2023. Generative Sequential Recommendation with GPTRec. arXiv preprint arXiv:2306.11114 (2023).
Pradel et al. (2012) Bruno Pradel, Nicolas Usunier, and Patrick Gallinari. 2012. Ranking with non-random missing ratings: influence of popularity and positivity on evaluation metrics. In Proceedings of the sixth ACM Conference on Recommender Systems. 147–154.
Qiu et al. (2021) Zhaopeng Qiu, Xian Wu, Jingyue Gao, and Wei Fan. 2021. U-BERT: Pre-training user representations for improved recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4320–4327.
Radcliffe (2007) Nicholas Radcliffe. 2007. Using control groups to target on predicted lift: Building and assessing uplift model. Direct Marketing Analytics Journal (2007), 14–21.
Radcliffe and Surry (2011) Nicholas J Radcliffe and Patrick D Surry. 2011. Real-world uplift modelling with significance-based uplift trees. White Paper TR-2011-1, Stochastic Solutions (2011), 1–33.
Rajanala et al. (2022) Sailaja Rajanala, Arghya Pal, Manish Singh, Raphaël C-W Phan, and KokSheik Wong. 2022. DeSCoVeR: Debiased Semantic Context Prior for Venue Recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2456–2461.
Rendle (2010) Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
Ricci et al. (2015) Francesco Ricci, Lior Rokach, and Bracha Shapira. 2015. Recommender systems: introduction and challenges. In Recommender systems handbook. Springer, 1–34.
Rosenbaum (1987) Paul R Rosenbaum. 1987. Model-based direct adjustment. Journal of the American Statistical Association 82, 398 (1987), 387–394.
Rosenbaum and Rubin (1983) Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983), 41–55.
Rosenfeld et al. (2017) Nir Rosenfeld, Yishay Mansour, and Elad Yom-Tov. 2017. Predicting counterfactuals from large historical data and small randomized trials. In Proceedings of the 26th International Conference on World Wide Web Companion. 602–609.
Rubin (1974) Donald B Rubin. 1974. Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology 66, 5 (1974), 688–701.
Rubin (1976) Donald B Rubin. 1976. Inference and missing data. Biometrika 63, 3 (1976), 581–592.
Rubin (1990) Donald B Rubin. 1990. Formal mode of statistical inference for causal effects. Journal of Statistical Planning and Inference 25, 3 (1990), 279–292.
Rzepakowski and Jaroszewicz (2012) Piotr Rzepakowski and Szymon Jaroszewicz. 2012. Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems 32 (2012), 303–327.
Saito (2020) Yuta Saito. 2020. Asymmetric tri-training for debiasing missing-not-at-random explicit feedback. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 309–318.
Saito and Joachims (2021) Yuta Saito and Thorsten Joachims. 2021. Counterfactual learning and evaluation for recommender systems: Foundations, implementations, and recent advances. In Proceedings of the 15th ACM Conference on Recommender Systems. 828–830.
Saito and Joachims (2022) Yuta Saito and Thorsten Joachims. 2022. Off-Policy Evaluation for Large Action Spaces via Embeddings. In International Conference on Machine Learning. PMLR, 19089–19122.
Saito et al. (2021) Yuta Saito, Takuma Udagawa, Haruka Kiyohara, Kazuki Mogi, Yusuke Narita, and Kei Tateno. 2021. Evaluating the Robustness of Off-Policy Evaluation. In Proceedings of the 15th ACM Conference on Recommender Systems (Amsterdam, Netherlands) (RecSys ’21). Association for Computing Machinery, New York, NY, USA, 114–123. https://doi.org/10.1145/3460231.3474245
Saito et al. (2020) Yuta Saito, Suguru Yaginuma, Yuta Nishino, Hayato Sakata, and Kazuhide Nakata. 2020. Unbiased recommender learning from missing-not-at-random implicit feedback. In Proceedings of the 13th International Conference on Web Search and Data Mining. 501–509.
Sato (2021) Masahiro Sato. 2021. Online Evaluation Methods for the Causal Effect of Recommendations. In Proceedings of the 15th ACM Conference on Recommender Systems. 96–101.
Sato et al. (2019) Masahiro Sato, Janmajay Singh, Sho Takemori, Takashi Sonoda, Qian Zhang, and Tomoko Ohkuma. 2019. Uplift-based evaluation and optimization of recommenders. In Proceedings of the 13th ACM Conference on Recommender Systems. 296–304.
Sato et al. (2020) Masahiro Sato, Sho Takemori, Janmajay Singh, and Tomoko Ohkuma. 2020. Unbiased learning for the causal effect of recommendation. In Fourteenth ACM Conference on Recommender Systems. 378–387.
Schlotter et al. (2011) Martin Schlotter, Guido Schwerdt, and Ludger Woessmann. 2011. Econometric methods for causal evaluation of education policies and practices: a non-technical guide. Education Economics 19, 2 (2011), 109–137.
Schnabel et al. (2016) Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as treatments: Debiasing learning and evaluation. In International Conference on Machine Learning. PMLR, 1670–1679.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
Shalit (2020) Uri Shalit. 2020. Can we learn individual-level treatment policies from clinical data? Biostatistics 21, 2 (2020), 359–362.
Shang et al. (2019) Wenjie Shang, Yang Yu, Qingyang Li, Zhiwei Qin, Yiping Meng, and Jieping Ye. 2019. Environment reconstruction with hidden confounders for reinforcement learning based recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 566–576.
Sharma et al. (2015) Amit Sharma, Jake M Hofman, and Duncan J Watts. 2015. Estimating the causal impact of recommendation systems from observational data. In Proceedings of the Sixteenth ACM Conference on Economics and Computation. 453–470.
Shi et al. (2016) Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and Philip S Yu. 2016. A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering 29, 1 (2016), 17–37.
Si et al. (2022) Zihua Si, Xueran Han, Xiao Zhang, Jun Xu, Yue Yin, Yang Song, and Ji-Rong Wen. 2022. A Model-Agnostic Causal Learning Framework for Recommendation using Search Data. In Proceedings of the ACM Web Conference 2022. 224–233.
Sohn et al. (2015) Kihyuk Sohn, Xinchen Yan, and Honglak Lee. 2015. Learning structured output representation using deep conditional generative models. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2. 3483–3491.
Song et al. (2023b) Wenzhuo Song, Shoujin Wang, Yan Wang, Kunpeng Liu, Xueyan Liu, and Minghao Yin. 2023b. A Counterfactual Collaborative Session-based Recommender System. In Proceedings of the ACM Web Conference 2023. 971–982.
Song et al. (2023a) Zijie Song, JiaWei Chen, Sheng Zhou, Qihao Shi, Yan Feng, Chun Chen, and Can Wang. 2023a. CDR: Conservative Doubly Robust Learning for Debiased Recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 2321–2330.
Splawa-Neyman et al. (1990) Jerzy Splawa-Neyman, Dorota M Dabrowska, and TP Speed. 1990. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statist. Sci. (1990), 465–472.
Steck (2010) Harald Steck. 2010. Training and testing of recommender systems on data missing not at random. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 713–722.
Su et al. (2020) Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. 2020. Doubly robust off-policy evaluation with shrinkage. In International Conference on Machine Learning. PMLR, 9167–9176.
Swaminathan and Joachims (2015) Adith Swaminathan and Thorsten Joachims. 2015. The self-normalized estimator for counterfactual learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2. 3231–3239.
Tan et al. (2021a) Jiping Tan, Nan Li, Xiaoxiao Wang, Gongbo Chen, Lailai Yan, Luning Wang, Yiming Zhao, Shanshan Li, and Yuming Guo. 2021a. Associations of particulate matter with dementia and mild cognitive impairment in China: A multicenter cross-sectional study. The Innovation 2, 3 (2021), 100147. https://doi.org/10.1016/j.xinn.2021.100147
Tan et al. (2021b) Juntao Tan, Shuyuan Xu, Yingqiang Ge, Yunqi Li, Xu Chen, and Yongfeng Zhang. 2021b. Counterfactual explainable recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 1784–1793.
Tang et al. (2023) Shisong Tang, Qing Li, Dingmin Wang, Ci Gao, Wentao Xiao, Dan Zhao, Yong Jiang, Qian Ma, and Aoyang Zhang. 2023. Counterfactual Video Recommendation for Duration Debiasing. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4894–4903.
Thomas and Brunskill (2016) Philip Thomas and Emma Brunskill. 2016. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning. PMLR, 2139–2148.
Tran et al. (2021) Ha Xuan Tran, Thuc Duy Le, Jiuyong Li, Lin Liu, Jixue Liu, Yanchang Zhao, and Tony Waters. 2021. Recommending the Most Effective Intervention to Improve Employment for Job Seekers with Disability. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3616–3626.
Tran et al. (2022) Ha Xuan Tran, Thuc Duy Le, Jiuyong Li, Lin Liu, Jixue Liu, Yanchang Zhao, and Tony Waters. 2022. What is the Most Effective Intervention to Increase Job Retention for this Disabled Worker? In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3981–3991.
Tsoumas et al. (2023) Ilias Tsoumas, Georgios Giannarakis, Vasileios Sitokonstantinou, Alkiviadis Koukos, Dimitra Loka, Nikolaos Bartsotas, Charalampos Kontoes, and Ioannis Athanasiadis. 2023. Evaluating digital agriculture recommendations with causal inference. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 14514–14522.
van de Schoot et al. (2021) Rens van de Schoot, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G Tadesse, Marina Vannucci, Andrew Gelman, Duco Veen, Joukje Willemsen, et al. 2021. Bayesian statistics and modelling. Nature Reviews Methods Primers 1, 1 (2021), 1–26.
Wachter et al. (2017) Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech. 31 (2017), 841.
Wang et al. (2023b) Chenyu Wang, Yawen Ye, Liyuan Ma, Dun Li, and Lei Zhuang. 2023b. Dual disentanglement of user–item interaction for recommendation with causal embedding. Information Processing & Management 60, 5 (2023), 103456.
Wang et al. (2018) Menghan Wang, Xiaolin Zheng, Yang Yang, and Kun Zhang. 2018. Collaborative filtering with social exposure: A modular approach to social recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
Wang et al. (2022c) Ruoyu Wang, Mingyang Yi, Zhitang Chen, and Shengyu Zhu. 2022c. Out-of-distribution Generalization with Causal Invariant Transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 375–385.
Wang et al. (2021a) Wenjie Wang, Fuli Feng, Xiangnan He, Xiang Wang, and Tat-Seng Chua. 2021a. Deconfounded recommendation for alleviating bias amplification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1717–1725.
Wang et al. (2021b) Wenjie Wang, Fuli Feng, Xiangnan He, Hanwang Zhang, and Tat-Seng Chua. 2021b. Clicks can be cheating: Counterfactual recommendation for mitigating clickbait issue. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1288–1297.
Wang et al. (2023a) Xiangmeng Wang, Qian Li, Dianer Yu, Peng Cui, Zhichao Wang, and Guandong Xu. 2023a. Causal Disentanglement for Semantic-Aware Intent Learning in Recommendation. IEEE Transactions on Knowledge and Data Engineering 35, 10 (2023), 9836–9849. https://doi.org/10.1109/TKDE.2022.3159802
Wang et al. (2019) Xiaojie Wang, Rui Zhang, Yu Sun, and Jianzhong Qi. 2019. Doubly robust joint learning for recommendation on data missing not at random. In International Conference on Machine Learning. PMLR, 6638–6647.
Wang et al. (2020) Yixin Wang, Dawen Liang, Laurent Charlin, and David M Blei. 2020. Causal inference for recommender systems. In Fourteenth ACM Conference on Recommender Systems. 426–431.
Wang et al. (2017) Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. 2017. Optimal and adaptive off-policy evaluation in contextual bandits. In International Conference on Machine Learning. PMLR, 3589–3597.
Wang et al. (2022a) Zhenlei Wang, Xu Chen, Rui Zhou, Quanyu Dai, Zhenhua Dong, and Ji-Rong Wen. 2022a. Sequential recommendation with causal behavior discovery. arXiv preprint arXiv:2204.00216 (2022).
Wang et al. (2022b) Zimu Wang, Yue He, Jiashuo Liu, Wenchao Zou, Philip S Yu, and Peng Cui. 2022b. Invariant Preference Learning for General Debiasing in Recommendation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1969–1978.
Wang et al. (2021c) Zhenlei Wang, Jingsen Zhang, Hongteng Xu, Xu Chen, Yongfeng Zhang, Wayne Xin Zhao, and Ji-Rong Wen. 2021c. Counterfactual data-augmented sequential recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 347–356.
Wei et al. (2021) Tianxin Wei, Fuli Feng, Jiawei Chen, Ziwei Wu, Jinfeng Yi, and Xiangnan He. 2021. Model-agnostic counterfactual reasoning for eliminating popularity bias in recommender system. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 1791–1800.
Wei and He (2022) Tianxin Wei and Jingrui He. 2022. Comprehensive fair meta-learned recommender system. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1989–1999.
Wei et al. (2019) Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM International Conference on Multimedia. 1437–1445.
Wu et al. (2019b) Chuhan Wu, Fangzhao Wu, Mingxiao An, Tao Qi, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019b. Neural news recommendation with heterogeneous user behavior. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 4874–4883.
Wu et al. (2022) Peng Wu, Haoxuan Li, Yuhao Deng, Wenjie Hu, Quanyu Dai, Zhenhua Dong, Jie Sun, Rui Zhang, and Xiao-Hua Zhou. 2022. On the Opportunity of Causal Learning in Recommendation Systems: Foundation, Estimation, Prediction and Challenges. In International Joint Conference on Artificial Intelligence.
Wu et al. (2019a) Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2019a. Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 346–353.
Wu et al. (2021) Xinwei Wu, Hechang Chen, Jiashu Zhao, Li He, Dawei Yin, and Yi Chang. 2021. Unbiased learning to rank in feeds recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 490–498.
Xia et al. (2023) Yu Xia, Junda Wu, Tong Yu, Sungchul Kim, Ryan A Rossi, and Shuai Li. 2023. User-Regulation Deconfounded Conversational Recommender System with Bandit Feedback. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2694–2704.
Xiao and Wang (2022) Teng Xiao and Suhang Wang. 2022. Towards unbiased and robust causal ranking for recommender systems. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1158–1167.
Xie et al. (2021) Xu Xie, Zhaoyang Liu, Shiwen Wu, Fei Sun, Cihang Liu, Jiawei Chen, Jinyang Gao, Bin Cui, and Bolin Ding. 2021. CausCF: Causal Collaborative Filtering for Recommendation Effect Estimation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 4253–4263.
Xiong et al. (2021) Kun Xiong, Wenwen Ye, Xu Chen, Yongfeng Zhang, Wayne Xin Zhao, Binbin Hu, Zhiqiang Zhang, and Jun Zhou. 2021. Counterfactual Review-based Recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2231–2240.
Xu et al. (2020) Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. 2020. Adversarial counterfactual learning and evaluation for recommender system. In Proceedings of the 34th International Conference on Neural Information Processing Systems. 13515–13526.
Xu et al. (2023a) Shuyuan Xu, Yingqiang Ge, Yunqi Li, Zuohui Fu, Xu Chen, and Yongfeng Zhang. 2023a. Causal collaborative filtering. (2023), 235–245.
Xu et al. (2023b) Shuyuan Xu, Jianchao Ji, Yunqi Li, Yingqiang Ge, Juntao Tan, and Yongfeng Zhang. 2023b. Causal Inference for Recommendation: Foundations, Methods and Applications. arXiv preprint arXiv:2301.04016 (2023).
Xu et al. (2022) Shuyuan Xu, Juntao Tan, Zuohui Fu, Jianchao Ji, Shelby Heinecke, and Yongfeng Zhang. 2022. Dynamic Causal Collaborative Filtering. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2301–2310.
Xu et al. (2023c) Yongjun Xu, Fei Wang, Zhulin An, Qi Wang, and Zhao Zhang. 2023c. Artificial intelligence for science—bridging data to wisdom. The Innovation 4, 6 (2023), 100525. https://doi.org/10.1016/j.xinn.2023.100525
Yamane et al. (2018) Ikko Yamane, Florian Yger, Jamal Atif, and Masashi Sugiyama. 2018. Uplift modeling from separate labels. Advances in Neural Information Processing Systems 31 (2018).
Yang et al. (2021) Mengyue Yang, Quanyu Dai, Zhenhua Dong, Xu Chen, Xiuqiang He, and Jun Wang. 2021. Top-N Recommendation with Counterfactual User Preference Simulation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2342–2351.
Yao et al. (2022) Jiangchao Yao, Feng Wang, Xichen Ding, Shaohu Chen, Bo Han, Jingren Zhou, and Hongxia Yang. 2022. Device-cloud Collaborative Recommendation via Meta Controller. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4353–4362.
Yao et al. (2021) Liuyi Yao, Zhixuan Chu, Sheng Li, Yaliang Li, Jing Gao, and Aidong Zhang. 2021. A survey on causal inference. ACM Transactions on Knowledge Discovery from Data (TKDD) 15, 5 (2021), 1–46.
Yin and Hong (2019) Xuan Yin and Liangjie Hong. 2019. The identification and estimation of direct and indirect effects in A/B tests through causal mediation analysis. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2989–2999.
Yu et al. (2023a) Dianer Yu, Qian Li, Hongzhi Yin, and Guandong Xu. 2023a. Causality-guided Graph Learning for Session-based Recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 3083–3093.
Yu et al. (2023b) Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Jundong Li, and Zi Huang. 2023b. Self-supervised learning for recommender systems: A survey. IEEE Transactions on Knowledge and Data Engineering (2023).
Yuan et al. (2019a) Bowen Yuan, Jui-Yang Hsia, Meng-Yuan Yang, Hong Zhu, Chih-Yao Chang, Zhenhua Dong, and Chih-Jen Lin. 2019a. Improving ad click prediction by considering non-displayed events. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 329–338.
Yuan et al. (2019b) Bowen Yuan, Meng-Yuan Yang, Jui-Yang Hsia, Hong Zhu, Zhirong Liu, Zhenhua Dong, and Chih-Jen Lin. 2019b. One-class field-aware factorization machines for recommender systems with implicit feedbacks. (2019).
Zhan et al. (2022) Ruohan Zhan, Changhua Pei, Qiang Su, Jianfeng Wen, Xueliang Wang, Guanyu Mu, Dong Zheng, Peng Jiang, and Kun Gai. 2022. Deconfounding Duration Bias in Watch-time Prediction for Video Recommendation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4472–4481.
Zhang et al. (2023b) Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023b. Is ChatGPT fair for recommendation? Evaluating fairness in large language model recommendation. arXiv preprint arXiv:2305.07609 (2023).
Zhang et al. (2021a) Jingsen Zhang, Xu Chen, and Wayne Xin Zhao. 2021a. Causally attentive collaborative filtering. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3622–3626.
Zhang et al. (2023c) Qing Zhang, Xiaoying Zhang, Yang Liu, Hongning Wang, Min Gao, Jiheng Zhang, and Ruocheng Guo. 2023c. Debiasing Recommendation by Learning Identifiable Latent Confounders. arXiv preprint arXiv:2302.05052 (2023).
Zhang et al. (2021f) Shengyu Zhang, Dong Yao, Zhou Zhao, Tat-Seng Chua, and Fei Wu. 2021f. CauseRec: Counterfactual user sequence synthesis for sequential recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 367–377.
Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR) 52, 1 (2019), 1–38.
Zhang et al. (2020) Wenhao Zhang, Wentian Bao, Xiao-Yang Liu, Keping Yang, Quan Lin, Hong Wen, and Ramin Ramezani. 2020. Large-scale causal approaches to debiasing post-click conversion rate estimation with multi-task learning. In Proceedings of The Web Conference 2020. 2775–2781.
Zhang et al. (2021d) Weijia Zhang, Jiuyong Li, and Lin Liu. 2021d. A unified survey of treatment effect heterogeneity modelling and uplift modelling. ACM Computing Surveys (CSUR) 54, 8 (2021), 1–36.
Zhang et al. (2021g) Weina Zhang, Xingming Zhang, and Dongpei Chen. 2021g. Causal neural fuzzy inference modeling of missing data in implicit recommendation system. Knowledge-Based Systems 222 (2021), 106678.
Zhang et al. (2021c) Xiao Zhang, Haonan Jia, Hanjing Su, Wenhan Wang, Jun Xu, and Ji-Rong Wen. 2021c. Counterfactual reward modification for streaming recommendation with delayed feedback. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 41–50.
Zhang et al. (2023a) Yang Zhang, Yimeng Bai, Jianxin Chang, Xiaoxue Zang, Song Lu, Jing Lu, Fuli Feng, Yanan Niu, and Yang Song. 2023a. Leveraging Watch-time Feedback for Short-Video Recommendations: A Causal Labeling Framework. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 4952–4959.
Zhang et al. (2021b) Yang Zhang, Fuli Feng, Xiangnan He, Tianxin Wei, Chonggang Song, Guohui Ling, and Yongdong Zhang. 2021b. Causal intervention for leveraging popularity bias in recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 11–20.
Zhang et al. (2021e) Yang Zhang, Dong Wang, Qiang Li, Yue Shen, Ziqi Liu, Xiaodong Zeng, Zhiqiang Zhang, Jinjie Gu, and Derek F Wong. 2021e. User Retention: A Causal Approach with Triple Task Modeling. In IJCAI. 3399–3405.
Zhang et al. (2022) Yang Zhang, Wenjie Wang, Peng Wu, Fuli Feng, and Xiangnan He. 2022. Causal Recommendation: Progresses and Future Directions. https://causalrec.github.io/
Zheng et al. (2021) Yu Zheng, Chen Gao, Xiang Li, Xiangnan He, Yong Li, and Depeng Jin. 2021. Disentangling user interest and conformity for recommendation with causal embedding. In Proceedings of the Web Conference 2021. 2980–2991.
Zhou et al. (2021) Chang Zhou, Jianxin Ma, Jianwei Zhang, Jingren Zhou, and Hongxia Yang. 2021. Contrastive learning for debiased candidate generation in large-scale recommender systems. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3985–3995.
Zhou et al. (2023) Guanglin Zhou, Chengkai Huang, Xiaocong Chen, Xiwei Xu, Chen Wang, Liming Zhu, and Lina Yao. 2023. Contrastive Counterfactual Learning for Causality-aware Interpretable Recommender Systems. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. 3564–3573.
Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.
Zhu et al. (2022) Xinyuan Zhu, Yang Zhang, Fuli Feng, Xun Yang, Dingxian Wang, and Xiangnan He. 2022. Mitigating Hidden Confounding Effects for Causal Recommendation. arXiv preprint arXiv:2205.07499 (2022).
Zhu et al. (2023a) Yaochen Zhu, Jing Ma, and Jundong Li. 2023a. Causal Inference in Recommender Systems: A Survey of Strategies for Bias Mitigation, Explanation, and Generalization. arXiv preprint arXiv:2301.00910 (2023).
Zhu et al. (2023b) Yaochen Zhu, Jing Ma, Liang Wu, Qi Guo, Liangjie Hong, and Jundong Li. 2023b. Path-Specific Counterfactual Fairness for Recommender Systems. arXiv preprint arXiv:2306.02615 (2023).