DCDILP: a distributed learning method for large-scale causal structure learning

Shuyu Dong LISN, INRIA, Université Paris-Saclay Michèle Sebag LISN, INRIA, Université Paris-Saclay Kento Uemura Akito Fujii Shuang Chang Yusuke Koyanagi Koji Maruhashi

(June 15, 2024)

Abstract

This paper presents a novel approach to causal discovery through a divide-and-conquer framework. By decomposing the problem into smaller subproblems defined on Markov blankets, the proposed DCDILP method first explores in parallel the local causal graphs of these subproblems. However, this local discovery phase encounters systematic challenges due to the presence of hidden confounders (variables within each Markov blanket may be influenced by external variables). Moreover, aggregating these local causal graphs into a consistent global graph defines a large-scale combinatorial optimization problem. DCDILP addresses these challenges by: i) restricting the local subgraphs to the causal links that involve the central variable of the Markov blanket; ii) formulating the reconciliation of local causal graphs as an integer linear programming problem. The merits of the approach, in terms of both causal discovery accuracy and scalability with the size of the problem, are showcased by experiments and comparisons with the state of the art.

1 Introduction

Discovering causal relations from observational data emerges as an important problem for artificial intelligence with fundamental and practical motivations (Pearl, 2000; Peters et al., 2017). One notable reason is that causal models support modes of reasoning, e.g., counterfactual reasoning and algorithmic recourse (Tsirtsis et al., 2021), that are beyond the reach of correlation-based models, as shown by Peters et al. (2016); Arjovsky et al. (2019); Sauer and Geiger (2021). However, causal structure learning from data, referred to as causal discovery, is fraught with statistical and algebraic difficulties. Statistical difficulties arise when the number $n$ of observations is limited, which hampers the estimation process. Algebraic difficulties are related to discovering the directed acyclic graph (DAG) expressing the causal relations, as learning a DAG is NP-hard (Chickering, 1996).

Related work.

In the literature of causal discovery and Bayesian network learning, there are two main categories of methods, namely constraint-based methods (Spirtes et al., 2000; Meek, 1995) and score function-based methods (Chickering, 2002a; Loh and Bühlmann, 2014). Depending on the specific method, strategies for learning large causal graphs include restricting the search space of directed graphs to a sparse graph (Ramsey et al., 2017; Loh and Bühlmann, 2014), or transforming the underlying combinatorial problem into a continuous optimization problem (Zheng et al., 2018; Aragam et al., 2019; Ng et al., 2020, 2021; Lopez et al., 2022). While these strategies have significantly reduced the complexity, their scalability is still moderate when the number of variables and/or the degree of the target causal graph are high.

To further tackle the computational challenges in learning large causal structures, a growing number of works consider breaking down the large-scale causal discovery problem into smaller ones. These include methods using Markov blanket discovery (e.g., Tsamardinos et al. (2003); Wu et al. (2020, 2022, 2023); Mokhtarian et al. (2021))—the problem of inferring a smallest subset of variables that shield a given node from the influence of all other variables, and more generally, the divide-and-conquer strategy (Gao et al., 2017; Zhang et al., 2020; Gu and Zhou, 2020). It is noticeable that most of these methods are designed in the framework of constraint-based causal learning and are therefore strictly tied to this methodology. At the same time, the fusion procedure for the conquer step of these divide-and-conquer methods is essentially rule-based, which may limit their applicability.

Contributions.

In this paper we present a new divide-and-conquer approach called DCDILP that addresses the scalability challenge inherent to causal discovery and proposes an enhanced methodology for the conquer step. DCDILP consists of three phases: (i) Phase-1: a divide phase that breaks the causal discovery problem into smaller subproblems by leveraging Markov blanket estimation for each variable. (ii) Phase-2: a causal learning phase that explores causal relations at a local level (within the individual Markov blankets). Learning causal relations locally enjoys a reduced complexity, but these subproblems may no longer be causally sufficient, as variables external to the subproblem might act as hidden confounders that have an impact on the (internal) variables. (iii) Phase-3: a conquer phase that reconciles the different causal relations identified for each subproblem into a consistent causal graph, noting that a simple concatenation of the local causal relations found in Phase-2 is unlikely to yield a satisfactory global solution.

The contributions of the proposed approach, DCDILP (Distributed Causal Discovery using Integer Linear Programming), are twofold. First, the causal discovery subproblems associated with each Markov blanket are handled independently in parallel. The causal insufficiency issue is mitigated by only retaining the causal relations involving the center variable of the Markov blanket. Second, and more importantly, we show that the reconciliation of the causal subgraphs at the subproblem level can be formulated and solved as an integer linear programming (ILP) problem. Binary ILP variables are defined to represent the causal relations (causes, effects, spouses, and v-structures); logical constraints are defined to enforce their consistency, and the optimization of the ILP variables aims to find a causal graph as close as possible to the local subgraphs, subject to being consistent.

The primary strength of the approach lies in the highly parallelizable nature of its Phase-2, which allows it to scale up to a few thousand variables. Phase-3 corresponds to one single ILP problem that can be delegated to highly efficient ILP solvers. Note that DCDILP allows for flexible choices: (i) for the Markov blanket discovery task in Phase-1; (ii) for the causal discovery subproblems in Phase-2 (GES (Chickering, 2002b) and DAGMA (Bello et al., 2022) are used in the experiments); (iii) for the ILP solver in Phase-3 (Gurobi (Gurobi Optimization, 2023) is used in the experiments).

The paper is organized as follows. After presenting the background in Section 2, we detail DCDILP in Section 3. Section 4 presents the experimental setting and the comparative experimental evidence showing the merits of the approach on small to large-scale causal discovery problems. The paper concludes with a discussion and some perspectives for further research.

2 Formal background

2.1 Definitions and notation

Definition 1.

The linear Structural Equation Model (SEM) of a multivariate random variable $\mbox{\bf X}=(X_{1},\dots,X_{d})$ is a set of $d$ equations: for all $i=1\ldots d$ ,

X_{i}=\beta_{1,i}X_{1}+\dots+\beta_{d,i}X_{d}+\epsilon_{i}

with $\epsilon_{i}$ an external random variable, independent of any $X_{j}$ for $j\neq i$ . Coefficient $\beta_{i,i}=0$ ; at most one of $\beta_{i,j}$ and $\beta_{j,i}$ is nonzero. If $\beta_{j,i}\neq 0$ , $X_{j}$ is said to be a cause, or parent, of $X_{i}$ ; $X_{i}$ is said to be an effect of $X_{j}$ .

The graph $G:=(\mbox{\bf X},E)$ with adjacency matrix $B=(\beta_{i,j})$ is a directed graph such that the edge set $E$ corresponds to the set of pairs $(i,j)$ with $\beta_{i,j}\neq 0$ . The directed graph of a linear SEM is usually required to be acyclic (DAG). $G$ is also called the causal graph or causal structure of the SEM, in the sense that any edge in $G$ , denoted as $(X_{i}\to X_{j})$ , signifies that $X_{i}$ is a cause of $X_{j}$ (and $X_{j}$ is an effect of $X_{i}$ ).

Given two directed graphs $G_{1},G_{2}$ on X, the intersection $G_{1}\cap G_{2}$ refers to the intersection of their edge sets. The number of nonzeros of matrix $B$ is denoted as $\operatorname{nnz}(B)$ .

Definition 2.

Consider a DAG $G=(\mbox{\bf X},B)$ defined on $\mbox{\bf X}=(X_{1},\dots,X_{d})$ . The Markov blanket of variable $X_{i}$ , denoted as $\mbox{\bf MB}(X_{i})$ , is the smallest set $M\subset\mbox{\bf X}$ such that

X_{i}\perp\!\!\!\!\perp_{G}\mbox{\bf X}\backslash(M\cup\{X_{i}\})\text{ given }M

where $\perp\!\!\!\!\perp_{G}$ denotes d-separation (e.g., (Peters et al., 2017, Definition 6.1)).

When the distribution of X and the DAG $G$ satisfy the Markov property, then the d-separation property above entails

X_{i}\perp\!\!\!\!\perp\mbox{\bf X}\backslash(M\cup\{X_{i}\})\text{ given }M.

The Markov blanket $\mbox{\bf MB}(X_{i})$ contains exactly the variables $X_{j}$ that are causes or effects of $X_{i}$ (i.e., $\beta_{j,i}\neq 0$ or $\beta_{i,j}\neq 0$ ) and the spouse variables $X_{k}$ (i.e., there exists a variable $X_{\ell}$ that is an effect of both $X_{i}$ and $X_{k}$ ). A triplet ( $X_{i},X_{j},X_{k}$ ) forms a v-structure if $X_{i}$ and $X_{j}$ are both causes of $X_{k}$ while $X_{i}$ and $X_{j}$ are not directly linked.
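The characterization above (parents, children and spouses) can be read directly off an adjacency matrix. The following is a minimal sketch, assuming the convention of Section 2.1 that a nonzero entry `B[j][k]` encodes the edge $X_{j}\to X_{k}$; the function name `markov_blanket` and the 4-node example graph are hypothetical and only serve as illustration.

```python
def markov_blanket(B, i):
    """Return MB(X_i) for a DAG given as a d x d adjacency matrix.

    B[j][k] != 0 encodes the edge X_j -> X_k (convention of Section 2.1).
    """
    d = len(B)
    parents = {j for j in range(d) if B[j][i] != 0}
    children = {k for k in range(d) if B[i][k] != 0}
    # Spouses: the other parents of any child of X_i.
    spouses = {j for c in children for j in range(d) if B[j][c] != 0 and j != i}
    return parents | children | spouses


# Hypothetical 4-node DAG: X0 -> X2 <- X1 (a v-structure) and X2 -> X3.
B = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
mb = [markov_blanket(B, i) for i in range(4)]
```

In this example $X_{1}$ enters $\mbox{\bf MB}(X_{0})$ only as a spouse, through the common effect $X_{2}$.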

2.2 Related work

Divide-and-conquer strategy.

The divide-and-conquer strategy is used by Gu and Zhou (2020) in their method named PEF (Partition, Estimation and Fusion), which consists of three phases: (i) partitioning the set of all variables into clusters, (ii) estimating causal structures cluster by cluster; and (iii) finally producing a fusion of all local learning results. Zhang et al. (2020) proposed a divide-and-conquer approach based on constraint-based methods. In earlier works of Gao et al. (2017), a strategy similar to divide-and-conquer is used, which consists of first Markov blanket learning for each variable and then a procedure for learning and fusion in a from-local-to-global manner.

Markov blanket.

A different but related line of work is that of recursive methods (Mokhtarian et al., 2021; Rahman et al., 2021). Mokhtarian et al. (2021) proposed a recursive variable elimination algorithm named MARVEL, which involves operations similar to the divide step based on Markov boundary discovery. To obtain Markov blanket information for MARVEL as well as for our proposed method (Phase-1), many different algorithms can be used, including Grow-Shrink (GS) (Margaritis and Thrun, 1999), IAMB (Tsamardinos et al., 2003), precision matrix-based methods (e.g., Loh and Bühlmann (2014)), and more recent methods by Wu et al. (2020, 2023) such as KMB (Kernel MB learning). The choice of which algorithm to use for computing the Markov boundaries should be made according to the data and application.

ILP methods for causal learning.

Jaakkola et al. (2010) tackle the problem of learning Bayesian network structures using linear programming relaxations. Cussens et al. (2017); Cussens (2023) introduce integer programming and linear programming methods for Bayesian network learning, shedding light on polytopes and facets regarding the search of DAGs.

3 DCDILP: a divide-and-conquer approach to causal discovery

In this section, we present the DCDILP approach by first introducing its principles based on the divide-and-conquer strategy and then introducing the integer linear programming method for the conquer step within this strategy.

3.1 Divide-and-conquer strategy

We consider the following three-phase procedure named DCDILP, as illustrated in Figure 1:

Figure 1: Illustration of the divide-and-conquer framework: (a) observational data; (b) $d$ data subsets: the $i$-th data subset only includes variable $X_{i}$ and the variables in its Markov blanket $\mbox{\bf MB}(X_{i})$ ; (c) output matrix $\widehat{B}^{(i)}$ of the $i$-th subproblem for $i=1,\dots,d$ ; (d) final solution $B$ .

• Phase-1: this phase consists of identifying the Markov blanket $\mbox{\bf MB}(X_{i})$ of each variable $X_{i}$ . As stated in Section 2.1, $\mbox{\bf MB}(X_{i})$ only includes variables that are causes, effects or spouses of $X_{i}$ . Under the assumption that $\mbox{\bf MB}(X_{i})$ is accurately identified, this subset of variables contains all variables related to $X_{i}$ . The Markov blanket-based divide scheme is thus structured in a way that favors separability.

• Phase-2: this phase tackles the local causal discovery problems (subproblems) defined on $\mbox{\bf S}_{i}:=\mbox{\bf MB}(X_{i})\cup\{X_{i}\}$ , for each $i=1\ldots d$ . This restriction makes the subproblems much smaller than the original problem, but it entails a causal insufficiency issue, as the variables in $\mbox{\bf S}_{i}$ may well be influenced by variables external to $\mbox{\bf MB}(X_{i})$ . In Phase-2, we partially mitigate this issue by retaining only the causal relations that involve $X_{i}$ ; other relations found within $\mbox{\bf S}_{i}$ are discarded. This choice is motivated by the fact that $X_{i}$ is the only variable that is certain to have all its causes and effects in $\mbox{\bf MB}(X_{i})$ (still under the assumption that $\mbox{\bf MB}(X_{i})$ is accurately identified).

• Phase-3: this phase aims to reconcile all edges found in Phase-2 and enforce their consistency. As will be shown, this task is formalized as an integer linear programming problem. The constraints formalize the logical relations among the notions of causes, effects, spouses and v-structures, and the objective function aims at finding an overall causal graph $B$ as aligned as possible with the local relations found in Phase-2.

Algorithm.

The above framework, as detailed in Algorithm 1, can be realized in different ways depending on how Phase-1 and Phase-2 are carried out. In the present work, we consider two different algorithms—GES (Chickering, 2002b) and DAGMA (Bello et al., 2022)—for learning the preliminary subgraphs $A^{(i)}$ in Phase-2 (line 3). For Phase-1, a discussion about our choices is given in Section 3.3.

Algorithm 1 (DCDILP) Distributed causal discovery using ILP

0: Input: observational data $\mathcal{X}\in\mathbb{R}^{n\times d}$
1: (Phase-1) Divide: estimate the Markov blanket $\mbox{\bf MB}(X_{i})$ for $i=1,\dots,d$
2: (Phase-2) for $i=1,\dots,d$ do in parallel
3:     $A^{(i)}\leftarrow$ causal discovery on $\mbox{\bf S}_{i}:=\mbox{\bf MB}(X_{i})\cup\{X_{i}\}$
4:     $\widehat{B}^{(i)}_{j,k}\leftarrow A^{(i)}_{j,k}$ if $j=i$ or $k=i$ , otherwise $\widehat{B}^{(i)}_{j,k}\leftarrow 0$ , for $(j,k)\in[d]\times[d]$
5: (Phase-3) Conquer: $B\leftarrow$ reconciliation from $\{\widehat{B}^{(i)},i=1\ldots d\}$ through the ILP (5)–(11)

Each subproblem in line 3 takes as input an $n\times|\mbox{\bf S}_{i}|$ submatrix of the whole data matrix $\mathcal{X}$ . Subsequently, the output of Phase-2, $\widehat{B}^{(i)}$ (line 4) consists of only the direct causes and effects of $X_{i}$ . Note that the estimation of $\widehat{B}^{(i)}$ for each $X_{i}$ is mutually independent given the Markov blankets. Therefore, their computations can be distributed to a number of different CPUs in parallel.
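The restriction step of Phase-2 (keeping only the entries of the local graph that lie in row $i$ or column $i$) can be sketched as follows. This is an illustrative sketch rather than the released implementation: the helper name `restrict_to_center`, the index list `S` and the toy local graph are assumptions, and the local learner (GES or DAGMA in the paper) is replaced by a hand-written matrix.

```python
def restrict_to_center(A_local, S, i, d):
    """Embed the local graph A_local (defined on the variables listed in S)
    into a d x d matrix, keeping only entries in row i or column i,
    i.e. the edges incident to the center variable X_i."""
    B_hat = [[0] * d for _ in range(d)]
    for a, j in enumerate(S):
        for b, k in enumerate(S):
            if (j == i or k == i) and A_local[a][b] != 0:
                B_hat[j][k] = A_local[a][b]
    return B_hat


# Hypothetical local result for the subproblem centered at X2, with d = 4.
# S lists the global indices of the variables in the subproblem.
# The local learner reports X0 -> X2, X1 -> X2 and an extra edge X0 -> X1.
S = [0, 1, 2]
A_local = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
B_hat_2 = restrict_to_center(A_local, S, i=2, d=4)
```

The edge $X_{0}\to X_{1}$ is dropped because it does not involve the center variable $X_{2}$. Since each call only needs its own submatrix, the calls for different $i$ could be mapped over a pool of workers.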

Phase-3 is concerned with building a global causal graph from all partial relations in $\widehat{B}^{(i)}$ constructed in Phase-2.

3.2 Phase-3: formulating causal graph reconciliation as an ILP problem

A naive approach is to concatenate all partial graphs $\widehat{B}^{(i)}$ found in Phase-2:

\widehat{B}=\sum_{i=1}^{d}\widehat{B}^{(i)}.	(1)

In general, however, this approach does not yield a consistent solution due to diverse conflicts among the $\widehat{B}^{(i)}$ . For example, $X_{i}$ might be considered as a cause of $X_{j}$ in $\widehat{B}^{(i)}$ but, at the same time, it might be an effect of $X_{j}$ in $\widehat{B}^{(j)}$ . In the rest of this subsection, we first analyse the properties of these conflicts and then present the proposed ILP method.
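The inconsistency of the naive merge can be seen on a two-variable toy case. The sketch below assumes binary local matrices stored as nested lists; `naive_merge` and `two_cycles` are hypothetical helper names introduced only for this illustration.

```python
def naive_merge(B_hats):
    """Entrywise sum of the local matrices, as in (1)."""
    d = len(B_hats[0])
    return [[sum(Bh[j][k] for Bh in B_hats) for k in range(d)] for j in range(d)]


def two_cycles(B):
    """Pairs (j, k) with j < k for which both directions are present in B."""
    d = len(B)
    return [(j, k) for j in range(d) for k in range(j + 1, d)
            if B[j][k] != 0 and B[k][j] != 0]


# Hypothetical d = 2 case: subproblem 0 reports X0 -> X1,
# while subproblem 1 reports X1 -> X0.
B_hat_0 = [[0, 1], [0, 0]]
B_hat_1 = [[0, 0], [1, 0]]
merged = naive_merge([B_hat_0, B_hat_1])
cycles = two_cycles(merged)
```

Here the merged matrix contains both directions between $X_{0}$ and $X_{1}$, so it is not a DAG.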

Merge conflicts after Phase-2.

In contrast to the naive merge $\widehat{B}$ defined in (1), we propose a reconciliation process that combines the edges in matrices $\widehat{B}^{(i)}$ in a selective manner. Since the global causal discovery problem is divided into separate subproblems according to the Markov blankets, the local solutions $\widehat{B}^{(i)}$ from Phase-2 generally give overlapping answers regarding the causal relations. We refer to disagreements between the overlapping answers of two graphs as merge conflicts, defined as follows.

Definition 3 (Merge conflict).

For $i=1,2$ , respectively, let $\mathcal{S}^{(i)}\subset[d]\times[d]$ be a set of edges on $\mbox{\bf X}=(X_{1},\dots,X_{d})$ , and let $B^{(i)}$ be the binary adjacency matrix of a graph with edges restricted in $\mathcal{S}^{(i)}$ . Then, $B^{(1)}$ and $B^{(2)}$ are said to constitute a merge conflict if and only if there exists $(j,k)\in\mathcal{S}^{(1)}\cap\mathcal{S}^{(2)}$ such that $B^{(1)}_{jk}\neq B^{(2)}_{jk}$ .

Proposition 4.

For $i,j\in[d]$ , let $\widehat{B}^{(i)}$ and $\widehat{B}^{(j)}$ respectively denote the $i$ -th and $j$ -th binary adjacency matrix output by Phase-2 of Algorithm 1 (lines 4–5). A merge conflict between $\widehat{B}^{(i)}$ and $\widehat{B}^{(j)}$ is one of the following three types:

1. (Type-1) One of the two adjacency matrices gives a directed link between $X_{i}$ and $X_{j}$ while the other suggests $X_{i}\perp\!\!\!\!\perp X_{j}$ .

2. (Type-2) The two adjacency matrices result in two directed links between $X_{i}$ and $X_{j}$ with opposite directions: either one matrix contains the two directions while the other gives $X_{i}\perp\!\!\!\!\perp X_{j}$ , or each one gives a directed edge opposite to the other.

3. (Type-3) One of the two adjacency matrices gives an undirected link between $X_{i}$ and $X_{j}$ while the other gives a directed link.

The proof is given in Appendix A. An illustration of the above proposition is shown in Figure 2.

[Figure 2: $\widehat{B}^{(i)}+\widehat{B}^{(j)}\rightarrow\widehat{B}$ ]

Any pair of matrices $\widehat{B}^{(i)}$ and $\widehat{B}^{(j)}$ output by Phase-2 can only have the above three types of conflicts because their edge sets intersect at most on the two directed edges between $X_{i}$ and $X_{j}$ ; many more conflicts could otherwise be observed among the unrestricted local graphs $A^{(i)}$ and $A^{(j)}$ (line 3 of Algorithm 1).
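The three conflict types of Proposition 4 can be enumerated mechanically from the two entries $(i,j)$ and $(j,i)$ of each matrix. The sketch below is one possible reading, under the assumption that an undirected link is stored as both directions being nonzero within a single matrix; the function name `conflict_type` is hypothetical.

```python
def conflict_type(B_i, B_j, i, j):
    """Classify the merge conflict between B_i and B_j on the pair (X_i, X_j).

    Returns None when the two matrices agree, else 1, 2 or 3 for the types of
    Proposition 4. Assumption: an undirected link is stored as both directions
    being nonzero within a single matrix.
    """
    def state(B):
        fwd, bwd = B[i][j] != 0, B[j][i] != 0
        if fwd and bwd:
            return "both"
        if fwd:
            return "i->j"
        if bwd:
            return "j->i"
        return "none"

    s, t = state(B_i), state(B_j)
    if s == t:
        return None
    pair = {s, t}
    if "none" in pair:
        # Directed vs. absent is Type-1; both directions vs. absent is Type-2.
        return 2 if "both" in pair else 1
    if "both" in pair:
        return 3  # undirected vs. directed
    return 2      # two directed edges with opposite directions


# Toy matrices on two variables (indices 0 and 1).
forward = [[0, 1], [0, 0]]     # X0 -> X1
backward = [[0, 0], [1, 0]]    # X1 -> X0
absent = [[0, 0], [0, 0]]      # no link
undirected = [[0, 1], [1, 0]]  # both directions
```

For instance, `conflict_type(forward, backward, 0, 1)` falls under Type-2, while `conflict_type(undirected, forward, 0, 1)` falls under Type-3.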

The ILP formulation.

Considering all conflicts among the output matrices of Phase-2, DCDILP delegates their resolution to an integer linear programming problem (Wolsey, 2020) by formulating all constraints on the sought solution as follows.

This problem involves: (i) binary variables denoted $B_{ij}$ and $S_{ij}$ for all pairs $(i,j)\in[d]\times[d]$ such that $X_{i}\in\mbox{\bf MB}(X_{j})$ ; (ii) binary variables denoted $V_{ijk}$ for all triples $(i,j,k)\in[d]\times[d]\times[d]$ such that $X_{i}$ (respectively, $X_{j}$ and $X_{k}$ ) belongs to both Markov blankets $\mbox{\bf MB}(X_{j})$ and $\mbox{\bf MB}(X_{k})$ (respectively, $\mbox{\bf MB}(X_{i})$ and $\mbox{\bf MB}(X_{k})$ ; or $\mbox{\bf MB}(X_{i})$ and $\mbox{\bf MB}(X_{j})$ ):

$\displaystyle B_{ij}=1$	$\displaystyle\text{if }X_{i}\to X_{j}$	(2)
$\displaystyle V_{ijk}=V_{jik}=1$	$\displaystyle\text{if }(X_{i},X_{j},X_{k})\text{ form a v-structure }(X_{i}\to X_{k}\leftarrow X_{j})$	(3)
$\displaystyle S_{ij}=S_{ji}=1$	$\displaystyle\text{if }X_{i}\text{ and }X_{j}\text{ are spouses, i.e., }\exists k\text{ s.t. }V_{ijk}=1$	(4)

The constraints on the above variables express the fact that the sought solution denoted $B$ must be consistent with the given Markov blankets while the objective function of the problem is defined as the similarity of $B$ with the naive concatenation of matrices $\widehat{B}^{(i)}$ , namely $\widehat{B}=\sum_{i=1}^{d}\widehat{B}^{(i)}$ :

	$\displaystyle\underset{B\in\mathbb{B}^{d\times d},\,S\in\mathbb{S}^{d\times d},\,V\in\mathbb{B}^{d\times d\times d}}{\text{maximize}}\ \langle\widehat{B},B\rangle\quad\text{subject to}$		(5)
	$\displaystyle B_{ij}=0\quad\text{if }\widehat{B}_{ij}=\widehat{B}_{ji}=0$		(6)
	$\displaystyle S_{ij}=S_{ji}=0\quad\text{if }X_{i}\notin\mbox{\bf MB}(X_{j})$		(7)
	$\displaystyle V_{ijk}=0\quad\text{if }\widehat{B}_{ik}=0\text{ or }\widehat{B}_{jk}=0$		(8)
	$\displaystyle B_{ij}+B_{ji}\leq 1,\quad B_{ij}+B_{ji}+S_{ij}\geq 1\quad\text{if }X_{i}\in\mbox{\bf MB}(X_{j})$		(9)
	$\displaystyle V_{ijk}\leq B_{ik},\quad V_{ijk}\leq B_{jk},\quad V_{ijk}\leq S_{ij}\quad\text{and}$		(10)
	$\displaystyle B_{ik}+B_{jk}\leq 1+V_{ijk},\quad S_{ij}\leq\sum_{k}V_{ijk}\quad\forall k,\ \widehat{B}_{ik}\neq 0,\ \widehat{B}_{jk}\neq 0.$		(11)

More precisely, the constraints of the above ILP are motivated by the following reasons:

• Sparsity. The constraints (6)–(8) enable us to discard all pairs in $B,S$ or triplets in $V$ that are not involved with the given Markov blanket information.

• 2-cycle exclusiveness. For any pair $(i,j),i\neq j$ and $X_{j}\in\mbox{\bf MB}(X_{i})$ (and vice-versa), the first constraint in (9), $B_{ij}+B_{ji}\leq 1$ , excludes the 2-cycle between $X_{i}$ and $X_{j}$ given that the entries of $B$ are binary variables.

• Markov blanket membership. The second constraint in (9), $B_{ij}+B_{ji}+S_{ij}\geq 1$ , dictates that, when $X_{j}\in\mbox{\bf MB}(X_{i})$ (and vice-versa), they must be either directly causally related or spouses.

• V-structures. For any $k\in[d]$ , $\widehat{B}_{ik}\neq 0$ and $\widehat{B}_{jk}\neq 0$ , there is a chance that $(X_{i},X_{j},X_{k})$ form a v-structure. Therefore $V_{ijk}$ is encoded (or created) as a binary variable. The constraints in (10) and (11) are necessary conditions for $(V_{ijk},B_{ik},B_{jk},S_{ij})$ to be consistent with the Markov blanket information.
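To make the constraints concrete, the sketch below transcribes (6)–(11) literally and maximizes $\langle\widehat{B},B\rangle$ by exhaustive enumeration on a three-variable toy instance. It is not an ILP solver and not the Gurobi-based implementation used in the paper: the search is exponential and only meant for tiny $d$. The function names, the toy matrix `B_hat` and the Markov blankets `mb` are assumptions made for illustration, and the sum in (11) is taken over all $k$ (terms forced to zero by (8) do not contribute).

```python
from itertools import product


def feasible(B, S, V, B_hat, mb, pairs, upairs, triples):
    """Check constraints (6)-(11) for one binary assignment."""
    for (i, j) in pairs:
        if B_hat[i][j] == 0 and B_hat[j][i] == 0 and B[i, j]:
            return False                                  # (6)
    for (i, j) in upairs:
        if j not in mb[i]:
            if S[i, j]:
                return False                              # (7)
        else:
            if B[i, j] + B[j, i] > 1:
                return False                              # (9) no 2-cycle
            if B[i, j] + B[j, i] + S[i, j] < 1:
                return False                              # (9) MB membership
        if S[i, j] > sum(V[t] for t in triples if t[:2] == (i, j)):
            return False                                  # (11) spouses need a v-structure
    for (i, j, k) in triples:
        if B_hat[i][k] == 0 or B_hat[j][k] == 0:
            if V[i, j, k]:
                return False                              # (8)
            continue
        if V[i, j, k] > min(B[i, k], B[j, k], S[i, j]):
            return False                                  # (10)
        if B[i, k] + B[j, k] > 1 + V[i, j, k]:
            return False                                  # (11)
    return True


def solve_by_enumeration(B_hat, mb):
    """Maximize <B_hat, B> subject to (6)-(11) by brute force (tiny d only)."""
    d = len(B_hat)
    pairs = [(i, j) for i in range(d) for j in range(d) if i != j]
    upairs = [(i, j) for i in range(d) for j in range(i + 1, d)]
    triples = [(i, j, k) for (i, j) in upairs for k in range(d) if k not in (i, j)]
    n_b, n_s = len(pairs), len(upairs)
    best_val, best_B = -1, None
    for bits in product((0, 1), repeat=n_b + n_s + len(triples)):
        B = dict(zip(pairs, bits[:n_b]))
        S = dict(zip(upairs, bits[n_b:n_b + n_s]))
        V = dict(zip(triples, bits[n_b + n_s:]))
        if feasible(B, S, V, B_hat, mb, pairs, upairs, triples):
            val = sum(B_hat[i][j] * B[i, j] for (i, j) in pairs)
            if val > best_val:
                best_val, best_B = val, B
    return best_val, best_B


# Hypothetical naive merge on 3 variables: X0 -> X2 and X1 -> X2 are each
# reported twice, X0 -> X1 twice and X1 -> X0 once (a conflict).
B_hat = [[0, 2, 2], [1, 0, 2], [0, 0, 0]]
mb = [{1, 2}, {0, 2}, {0, 1}]  # assumed Markov blankets
value, B_opt = solve_by_enumeration(B_hat, mb)
edges = sorted(e for e, b in B_opt.items() if b)
```

On this toy instance the 2-cycle between $X_{0}$ and $X_{1}$ is resolved in favor of the direction with the larger weight in $\widehat{B}$ .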

Table 1: Example of Remark 5.

Remark 5.

The constraints of DCDILP exclude spurious solutions that do not conform to the Markov blanket information (obtained from Phase-1). In the first example of Table 1 (row (a)), if $X$ and $Y$ do not belong to each other’s Markov blanket, then only the solution $B^{\prime}$ is feasible for the ILP; otherwise, only the solution $B$ is feasible for the ILP. $\square$

3.3 Discussion

Phase-1.

In the classical setting of Gaussian graphical models, the sparsity pattern of the inverse covariance matrix of X encodes conditional independence relations between the variables. Consider the covariance matrix $\Sigma=\operatorname*{cov}(\mbox{\bf X})$ . It is a well-known consequence of the Hammersley–Clifford theorem that the entries of the precision matrix $\Theta=\Sigma^{-1}$ correspond to rescaled conditional correlations (Loh and Wainwright, 2012). In DCDILP, the empirical inverse covariance estimator is used to achieve Phase-1, that is, $\mbox{\bf MB}(X_{i})\cup\{i\}$ is set as the support of $\Theta_{i,:}$ ; details are given in Section B.1.
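The link between the support of $\Theta$ and the Markov blankets can be checked at the population level. For a linear SEM with adjacency matrix $B$ and independent noise terms of unit variance (an assumption made here for simplicity), the precision matrix is $\Theta=(I-B)(I-B)^{\top}$ , so the off-diagonal support of row $i$ generically recovers the parents, children and spouses of $X_{i}$ (exact cancellations are possible for special weight values). The sketch below works with this closed form rather than with the empirical estimator of Section B.1; the function names and the example weights are hypothetical.

```python
def precision_from_sem(B):
    """Population precision matrix Theta = (I - B)(I - B)^T of a linear SEM
    with mutually independent, unit-variance noise terms.
    B[j][k] is the weight of the edge X_j -> X_k."""
    d = len(B)
    M = [[(1.0 if r == c else 0.0) - B[r][c] for c in range(d)] for r in range(d)]
    return [[sum(M[r][m] * M[c][m] for m in range(d)) for c in range(d)]
            for r in range(d)]


def mb_from_precision(Theta, i, tol=1e-12):
    """Off-diagonal support of row i of Theta (tol is an assumed threshold)."""
    return {j for j in range(len(Theta)) if j != i and abs(Theta[i][j]) > tol}


# Hypothetical weighted DAG: X0 -> X2 <- X1 and X2 -> X3.
B = [
    [0.0, 0.0, 1.5, 0.0],
    [0.0, 0.0, -0.8, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
Theta = precision_from_sem(B)
blankets = [mb_from_precision(Theta, i) for i in range(4)]
```

With finite samples, the entries of the empirical inverse covariance are never exactly zero, so some form of thresholding or regularization is needed in practice.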

Time efficiency.

Apart from Phase-1, the computational cost of DCDILP is dominated by Phase-2, which is distributed on a number of parallel workers. Note that the wall time of DCDILP depends mostly on the maximal running time among the parallel workers. Therefore, the total running time of DCDILP is in principle dominated by the running time of the chosen causal discovery routine on the largest Markov blanket during Phase-2. The computational time of the three phases is shown empirically under different settings in Section 4.

4 Experiments

We conduct experiments on data generated from linear SEMs to assess the performance of DCDILP for causal discovery. The primary goal of the experiments is to examine the learning accuracy of the proposed method and its computational efficiency in different settings.

4.1 Experimental setting

Benchmark data.

The observational data are generated from linear SEMs following the usual settings of Zheng et al. (2018), where the causal structure $B$ is drawn from random DAGs with the Erdős–Rényi (ER) model. The coefficients (edge weights) of $B$ are drawn from the uniform distribution $\text{Unif}([-2,-0.5]\cup[0.5,2])$ .
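A data generator of this kind can be sketched in a few lines. The sketch below is not the benchmark code of Zheng et al. (2018): the edge probability `p`, the standard Gaussian noise and the function names are assumptions, and the exact ER1/ER2 parametrization is not reproduced. Only the weight distribution $\text{Unif}([-2,-0.5]\cup[0.5,2])$ is taken from the text.

```python
import random


def random_er_dag(d, p, rng):
    """Weighted adjacency matrix of a random DAG: each pair of nodes is linked
    with probability p and oriented along a random ordering of the nodes.
    Weights are drawn from Unif([-2, -0.5] U [0.5, 2])."""
    order = list(range(d))
    rng.shuffle(order)
    B = [[0.0] * d for _ in range(d)]
    for a in range(d):
        for b in range(a + 1, d):
            if rng.random() < p:
                sign = rng.choice((-1.0, 1.0))
                B[order[a]][order[b]] = sign * rng.uniform(0.5, 2.0)
    return B, order


def sample_linear_sem(B, order, n, rng):
    """Draw n samples of X_k = sum_j B[j][k] X_j + eps_k with standard
    Gaussian noise, visiting the variables in topological order."""
    d = len(B)
    data = []
    for _ in range(n):
        x = [0.0] * d
        for k in order:
            x[k] = sum(B[j][k] * x[j] for j in range(d)) + rng.gauss(0.0, 1.0)
        data.append(x)
    return data


rng = random.Random(0)
B, order = random_er_dag(d=6, p=0.5, rng=rng)
X = sample_linear_sem(B, order, n=10, rng=rng)
```

Orienting every edge along a fixed node ordering guarantees acyclicity by construction.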

Baselines.

The baselines of the causal discovery benchmarks are GES (Chickering, 2002b) and DAGMA (Bello et al., 2022). More precisely, the GES implementation used in the benchmark, labeled as GES (pcalg), is from the R package pcalg (https://cran.r-project.org/web/packages/pcalg/index.html), which is a highly efficient implementation.

In these benchmarks, the proposed DCDILP method is represented by DCDILP-GES and DCDILP-DMA, which refer to the specific implementations of DCDILP using GES and DAGMA, respectively, during Phase 2 (line 3). The estimated graphs are evaluated by the usual metrics for DAG learning (SHD, TPR, FDR and FPR; details in Section C.1). The baseline algorithms are run on one CPU of Intel(R) Xeon(R) Gold 5120 14 cores @ 2.2GHz. The computations in Phase 2 of DCDILP are distributed on a maximum of 300 CPU cores.
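For reference, the four metrics can be computed from two binary adjacency matrices as sketched below. The definitions follow one common convention for DAG benchmarks (a reversed edge counts once in the SHD and as a false discovery); this is an assumption, and the exact definitions used in the paper are those of Section C.1, which may differ. The function name `dag_metrics` and the example graphs are hypothetical.

```python
def dag_metrics(B_true, B_est):
    """SHD, TPR, FDR and FPR between two DAGs given as 0/1 adjacency matrices.

    Assumed convention: a reversed edge counts once in the SHD and is treated
    as a false discovery. Both inputs are assumed to contain no 2-cycles.
    """
    d = len(B_true)
    true_e = {(j, k) for j in range(d) for k in range(d) if B_true[j][k]}
    est_e = {(j, k) for j in range(d) for k in range(d) if B_est[j][k]}
    tp = len(true_e & est_e)
    rev = sum(1 for (j, k) in est_e if (j, k) not in true_e and (k, j) in true_e)
    extra = sum(1 for (j, k) in est_e if (j, k) not in true_e and (k, j) not in true_e)
    missing = sum(1 for (j, k) in true_e if (j, k) not in est_e and (k, j) not in est_e)
    non_edges = d * (d - 1) // 2 - len(true_e)
    return {
        "SHD": extra + missing + rev,
        "TPR": tp / max(len(true_e), 1),
        "FDR": (rev + extra) / max(len(est_e), 1),
        "FPR": (rev + extra) / max(non_edges, 1),
    }


# Hypothetical example with d = 4.
# True graph: X0 -> X1, X1 -> X2, X0 -> X2.
# Estimate:   X0 -> X1 (correct), X2 -> X1 (reversed), X2 -> X3 (extra).
B_true = [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
B_est = [[0, 1, 0, 0], [0, 0, 0, 0], [0, 1, 0, 1], [0, 0, 0, 0]]
scores = dag_metrics(B_true, B_est)
```

In this example the estimate has one correct edge, one reversed edge, one extra edge and one missing edge.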

The DCDILP method is tested in two scenarios. One scenario is to assess the proposed causal learning methodology irrespective of the statistical performance of Markov blanket discovery, in which case DCDILP (MB^∗) will be used as the label, meaning that the Markov blankets are the ground truth ones. The other scenario is causal discovery, in which case DCDILP takes observational data as input and proceeds by following exactly the framework specified in Algorithm 1.

The implementation of DCDILP is made available at https://github.com/shuyu-d/dcdilp-exp.

4.2 Experimental results

DCDILP using GES in Phase 2.

In this experiment, DCDILP-GES represents the algorithm of DCDILP that uses GES during its Phase 2 (Algorithm 1, line 3), and is evaluated in causal discovery tasks in comparison with GES. The number $d$ of nodes varies in $\{$ 50, 100, 200, 400, 800, 1000, 1600 $\}$ and the number of samples of X varies as $n=50d$ . The choice of the sample sizes is discussed in Section 4.3. The computations in Phase 2 of DCDILP-GES are distributed on $N=\min(2d,300)$ CPU cores. The results are shown in Figure 3, Figure 4 and in Appendix C in the supplementary material.

Figure 3 presents the learning scores of the two algorithms in (TPR, FDR, SHD), depending on the problem dimension $d$ . The results in this figure show that DCDILP-GES outperforms GES in all three learning scores under all problem dimensions.

It is worth noting that DCDILP-GES only partially uses the subgraphs obtained by GES on the local learning subproblems of Phase 2. It is therefore surprising to some extent that DCDILP-GES achieves significant gains over GES in DAG learning accuracy. We believe that such gains are due to the following reasons: (i) the causal learning subproblems in Phase 2 of DCDILP enjoy a higher effective sample size, in the sense that the ratio $n/|\mbox{\bf S}_{i}|$ for each subproblem on $\mbox{\bf S}_{i}$ is higher than the ratio $n/d$ of the global problem; (ii) the reconciliation by the ILP method of DCDILP-GES is a selection process that removes spurious directions (in terms of DAG learning) from the subgraphs given by GES during Phase 2, while GES is a method for producing CPDAGs.

Figure 4 presents the running time (wall time) of the two algorithms depending on the problem dimension $d$ . The results in this figure show that (i) GES (pcalg) enjoys a highly competitive time efficiency on small- to mid-size problems (for $d$ ranging from $50$ to around $700$ ), which is superior or comparable to DCDILP-GES; (ii) however, the running time of DCDILP-GES grows more slowly than that of GES with increasing $d$ . The trend in this figure shows that when $d$ surpasses $800$ (on ER1 data) or $700$ (on ER2 data), DCDILP-GES starts to gain speedups over GES at an increasing rate. When $d=1600$ , for example, DCDILP-GES achieves around $4\times$ and $5\times$ speedups over GES on ER1 data and ER2 data respectively.

DCDILP using DAGMA in Phase 2.

In this experiment, DCDILP-DMA represents the algorithm of DCDILP that uses DAGMA during Phase 2 (Algorithm 1, line 3), and is evaluated in comparison with DAGMA. The number $d$ of nodes varies in $\{$ 50, 100, 200, 400, 800, 1000, 1600 $\}$ and the number of samples of X varies as $n=50d$ .

The computation in Phase 2 of DCDILP-DMA is distributed on $N=\min(2d,300)$ CPU cores. The results are shown in Figure 5 and Figure 6.

The results in Figure 5 and Figure 6 show that DCDILP-DMA achieves significant gains over DAGMA in running time, under almost all problem dimensions, at the cost of a reasonable compromise in learning accuracy. More precisely, on ER1 data, the median TPRs of both DCDILP-DMA and DAGMA are close to 100% and the median FDRs of DCDILP-DMA are under 10% under all dimensions $d$ , while DAGMA has FDRs that are close to zero (indicating almost exact recovery of the underlying DAGs). On ER2 data, DCDILP-DMA has median TPRs slightly above 90% while DAGMA stays close to 100%; and DCDILP-DMA has median FDRs between 10% and 20%, which is reasonably low despite being higher than DAGMA.

On the other hand, the time efficiency gains of DCDILP-DMA over DAGMA are even more significant than in the case with GES (pcalg). Although DCDILP-DMA uses DAGMA for the local learning subproblems in Phase 2, its overall wall time becomes $100\times$ lower than that of DAGMA when $d$ surpasses $400$ . As discussed in Section 3.3, such speedups are largely due to the parallelization of the local learning subproblems in Phase 2. Although the parallelization of subproblem tasks in the present experiment encounters congestion when $d$ grows, given the predefined limit on the maximal number of CPU cores, the speedups obtained by DCDILP are already considerable for problems with more than 1000 variables.

4.3 Discussions

Effects of the ILP in Phase 3.

In the same settings as the above experiments, we showcase the effects of the reconciliation process via the ILP-based method (formulated in Section 3.2, (5)–(11)) in comparison with the naive merge result $\widehat{B}$ (1).

Figure 7 shows their comparison in the representative case with ER2 data. The scores in this figure show that for both DCDILP-GES and DCDILP-DMA, the reconciliation process by the ILP method achieves significant gains in learning accuracy over the naive merge result $\widehat{B}$ (1) under all problem dimensions. More precisely, the median TPRs of DCDILP are slightly below those of the naive merge while staying around 90% or above; more interestingly, the median FDRs of DCDILP are significantly lower than those of the naive merge. These comparisons provide solid validation of the effects of the ILP method.

Evaluation of finite-sample cases.

In the same settings as the above experiments, we evaluate the learning accuracy of DCDILP under different sample sizes. The sample size of observational data varies between $5d$ and $50d$ for causal discovery tasks of $d=100$ variables.

The comparative results in Figure 8 give an empirical insight into the sample requirement of DCDILP-GES under the given choices for Phase 1 (MBs by inverse covariance estimation) and Phase 2 (using GES). The learning accuracy of DCDILP-GES improves with increasing sample size. For Gaussian SEMs, the accuracy attains a level comparable to the case with oracle MBs once $n$ reaches $20d$ ; and for Gumbel-noised SEMs, the accuracy of DCDILP-GES becomes comparable to DCDILP-GES (MB*) when $n$ reaches $40d$ . Future work should include more sophisticated or more sample-efficient methods for Markov blanket discovery.

5 Conclusion and perspectives

The main contribution of this paper, DCDILP, is a divide-and-conquer method for learning causal structures from data. Firstly, it takes advantage of the natural decomposition through the Markov blankets associated with each variable. Secondly, a causal learner is deployed on $\mbox{\bf MB}(X_{i})$ to learn causal relations involving $X_{i}$ . Lastly, these causal relations are reconciled through ILP. The main novelty of the approach is the reconciliation process formulated as an ILP problem.

The computational efficiency of DCDILP, empirically demonstrated on problems involving more than a thousand variables, can be explained by two factors: (i) the learning phase can be carried out in parallel over the different (overlapping) subsets of variables; (ii) the reconciliation of the causal relations found in Phase-2 can be delegated to an efficient general-purpose ILP solver.

DCDILP defines a general scheme, as each of its three phases can be implemented by different algorithmic components. In this work, a basic empirical inverse covariance estimation method is used in Phase-1, GES and DAGMA are used in Phase-2, and the Gurobi-based ILP solver is used in Phase-3. The limitation of the approach is that its accuracy depends on the quality of the Markov blankets estimated in Phase-1 and of the causal relations discovered in Phase-2.

A first research perspective is to consider the interactions between the algorithmic components and to act on how they are coupled. For instance, the over-estimation of Markov blankets (including spurious variables) in Phase-1, or a high false discovery rate in Phase-2, can be handled through relaxed formulations of the ILP problem tackled in Phase-3. A longer-term perspective is to take advantage of the multiple solutions discovered by the ILP solver in Phase-3: their intersection could be used to characterize the ‘backbone’ of the sought causal graph, and to focus Phase-2 on refining this backbone.

Appendix A Proof

Proof of Proposition 4.

For $\widehat{B}^{(i)}$ ($i\in[d]$) obtained in Phase-2 of Algorithm 1, the edge set of its graph is restricted to

$$C_{i}=\{(i,k):k\in\mbox{\bf MB}(X_{i})\}\cup\{(k,i):k\in\mbox{\bf MB}(X_{i})\}.$$

Hence the intersection of the support graphs of $\widehat{B}^{(i)}$ and $\widehat{B}^{(j)}$ for any pair $i\neq j$ is included in $C_{i}\cap C_{j}\subseteq\{(i,j),(j,i)\}$. Therefore, by Definition 3, $\widehat{B}^{(i)}$ and $\widehat{B}^{(j)}$ result in a merge conflict if and only if

$$\widehat{B}^{(i)}_{ij}\neq\widehat{B}^{(j)}_{ij}\quad\text{or}\quad\widehat{B}^{(i)}_{ji}\neq\widehat{B}^{(j)}_{ji}.$$

As a consequence, all merge conflicts can be classified into the following three types, according to the number of nonzeros in the quadruplet $Q_{ij}=\{\widehat{B}^{(i)}_{ij},\widehat{B}^{(j)}_{ij},\widehat{B}^{(i)}_{ji},\widehat{B}^{(j)}_{ji}\}$:

Table 2: Classification of all merge conflicts from a pair of local results. The symbol * indicates a nonzero number.

Type	$\hat{B}^{(i)}_{ij}$	$\hat{B}^{(i)}_{ji}$	$\hat{B}^{(j)}_{ij}$	$\hat{B}^{(j)}_{ji}$	Graphical model	Characteristics
(3) Undirected	*	0	*	*	$(i\rightarrow j)$ vs $(i\leftrightarrow j)$	$\operatorname{nnz}(Q_{ij})=3$
	0	*	*	*	$(i\leftarrow j)$ vs $(i\leftrightarrow j)$
	⋮	⋮	⋮	⋮	⋮
	*	*	0	*	$(i\leftrightarrow j)$ vs $(i\leftarrow j)$
(2) Acute	*	0	0	*	$(i\rightarrow j)$ vs $(i\leftarrow j)$	$\operatorname{nnz}(Q_{ij})=2$
	0	*	*	0	$(i\leftarrow j)$ vs $(i\rightarrow j)$
	*	*	0	0	$(i\leftrightarrow j)$ vs $(i\perp\!\!\!\!\perp j)$
	0	0	*	*	$(i\perp\!\!\!\!\perp j)$ vs $(i\leftrightarrow j)$
(1) Addition	*	0	0	0	$(i\rightarrow j)$ vs $(i\perp\!\!\!\!\perp j)$	$\operatorname{nnz}(Q_{ij})=1$
	0	*	0	0	$(i\leftarrow j)$ vs $(i\perp\!\!\!\!\perp j)$
	⋮	⋮	⋮	⋮	⋮
	0	0	0	*	$(i\perp\!\!\!\!\perp j)$ vs $(i\leftarrow j)$

$\square$
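The classification in Table 2 can be illustrated with a short sketch. The function name `classify_merge_conflict` and the choice to compare only the supports (nonzero patterns) of the local results are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def classify_merge_conflict(B_i, B_j, i, j):
    """Classify the conflict between two local results on the pair (i, j).

    B_i, B_j: (d, d) arrays, the local results centred on X_i and X_j.
    Returns None when the two results agree on both entries (i, j) and
    (j, i), otherwise the conflict type determined by nnz(Q_ij).
    Assumption: only the nonzero patterns are compared.
    """
    # Quadruplet Q_ij in the order used in the proof.
    q = np.array([B_i[i, j], B_j[i, j], B_i[j, i], B_j[j, i]])
    nz = q != 0
    # No conflict if both local results agree on (i, j) and on (j, i).
    if nz[0] == nz[1] and nz[2] == nz[3]:
        return None
    # With a conflict present, nnz(Q_ij) can only be 1, 2 or 3.
    return {1: "addition", 2: "acute", 3: "undirected"}[int(nz.sum())]
```

With supports that disagree, the cases of zero or four nonzeros cannot occur, which is why the three types cover all conflicts.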

Appendix B Algorithms

B.1 An empirical inverse covariance estimator

In the implementation of DCDILP, we use a basic empirical inverse covariance estimator, detailed in Algorithm 2, for the inference of Markov blankets.

Algorithm 2 Empirical inverse covariance estimator

Input: Data matrix $X\in\mathbb{R}^{n\times d}$, parameter $\lambda_{1}\in(0,1)$
Output: $\widehat{\Theta}_{\lambda_{1}}\in\mathbb{R}^{d\times d}$

1: Compute empirical covariance and its inverse:

$$\widehat{C}=\frac{1}{n}(X-\bar{X})^{\mathrm{T}}(X-\bar{X})\quad\text{and}\quad\widehat{\Theta}=\widehat{C}^{\dagger},\tag{12}$$

where $\widehat{C}^{\dagger}$ denotes the pseudo-inverse of $\widehat{C}$.

2: Element-wise thresholding on off-diagonal entries:

$$\operatorname{diag}(\widehat{\Theta}_{\lambda_{1}}):=\operatorname{diag}(\widehat{\Theta}),\qquad(\widehat{\Theta}_{\lambda_{1}})_{\mathrm{off}}:=\mathbb{H}(\widehat{\Theta}_{\mathrm{off}},\lambda_{1}\|\widehat{\Theta}_{\mathrm{off}}\|_{\mathrm{max}}),\tag{13}$$

where $\mathbb{H}$ is defined as

$$\mathbb{H}(y,\tau)=\begin{cases}y&\text{if }|y|\geq\tau,\\ 0&\text{otherwise.}\end{cases}$$

In the computation of (12), the pseudo-inverse coincides with the inverse of $\widehat{C}$ when $\widehat{C}$ is positive definite (e.g., when the number $n$ of samples is sufficiently large). In (13), the subscript ‘off’ indicates the following filtering operation

$$\Theta_{\mathrm{off}}=\{\Theta_{ij}:i\neq j\},$$

where the indices of the remaining (off-diagonal) entries are preserved.
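A minimal NumPy sketch of Algorithm 2 is given below for illustration. The function names `empirical_inverse_covariance` and `markov_blankets` are hypothetical, and reading the Markov blankets off the nonzero off-diagonal entries of each row is an assumption about how the thresholded estimate is used in Phase 1.

```python
import numpy as np

def empirical_inverse_covariance(X, lam1):
    """Sketch of Algorithm 2: thresholded pseudo-inverse of the empirical covariance."""
    n, d = X.shape
    Xc = X - X.mean(axis=0, keepdims=True)      # centre the data
    C_hat = Xc.T @ Xc / n                        # empirical covariance, Eq. (12)
    Theta = np.linalg.pinv(C_hat)                # pseudo-inverse, Eq. (12)
    off = ~np.eye(d, dtype=bool)                 # mask of off-diagonal entries
    tau = lam1 * np.abs(Theta[off]).max()        # threshold lam1 * ||Theta_off||_max
    Theta_lam = np.where(np.abs(Theta) >= tau, Theta, 0.0)   # operator H, Eq. (13)
    Theta_lam[np.diag_indices(d)] = np.diag(Theta)           # diagonal is kept as is
    return Theta_lam

def markov_blankets(Theta_lam):
    """Assumed reading: MB(X_i) is the off-diagonal support of row i."""
    return [[int(j) for j in np.flatnonzero(row) if j != i]
            for i, row in enumerate(Theta_lam)]
```

For a linear chain $X_0\rightarrow X_1\rightarrow X_2$ with enough samples, the entry $(0,2)$ of the precision matrix is close to zero and is removed by the thresholding, so that $X_2$ is not placed in the estimated Markov blanket of $X_0$.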

Selection of $\lambda_{1}$ for Algorithm 2.

In the implementation of Phase 1 of DCDILP, a simple grid search is included for selecting values of $\lambda_{1}$ for the empirical inverse covariance estimator (Algorithm 2). The total time for this parameter selection corresponds to the computation time of Phase 1 in the benchmark of Section 4.

Given that the sought causal structures have an average degree $1\leq\operatorname{deg}\leq 4$, the target sparsity of $\widehat{\Theta}_{\lambda_{1}}$ by Algorithm 2 is bounded by $\bar{\rho}_{\operatorname{deg}}=\max(\frac{\operatorname{deg}}{d})\approx 2.0\%$ for graphs with $d\geq 200$ nodes. This gives an approximate target percentile of around $98\%$, i.e., the top $2\%$ of edges in terms of absolute weight in $\widehat{\Theta}_{\mathrm{off}}$. In other words, the maximal value $\lambda_{1}^{\max}$ of the grid search area is set as $\lambda_{1}^{\max}:=\frac{|\widehat{\Theta}_{\mathrm{off}}(\tau_{98})|}{\|\widehat{\Theta}_{\mathrm{off}}\|_{\mathrm{max}}}$, where $\tau_{98}$ refers to the index of the $98$-th percentile in $\{|\widehat{\Theta}_{\mathrm{off}}|\}$. For the experiments with ER2 graphs in Section 4, the estimated $\lambda_{1}^{\max}$ is $6\times 10^{-1}$. Hence, the search grid of $\lambda_{1}$ is set up as $n_{I_{1}}=20$ equidistant values on $I_{1}=[10^{-2},6\times 10^{-1}]$.
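The construction of the search grid can be sketched as follows. The helper names are hypothetical, and `np.percentile` (which interpolates between order statistics) is used here as an approximation of picking the entry at the index of the 98-th percentile.

```python
import numpy as np

def target_sparsity(deg, d):
    """Upper bound deg / d on the fraction of nonzero off-diagonal entries."""
    return deg / d

def lambda1_grid(Theta, percentile=98.0, lam_min=1e-2, n_grid=20):
    """Equidistant grid on [lam_min, lam_max] for the parameter lambda_1.

    lam_max is the ratio between the `percentile`-th percentile of the
    absolute off-diagonal entries and their maximum.
    """
    d = Theta.shape[0]
    off_abs = np.abs(Theta[~np.eye(d, dtype=bool)])
    lam_max = np.percentile(off_abs, percentile) / off_abs.max()
    return np.linspace(lam_min, lam_max, n_grid)
```

For instance, `target_sparsity(4, 200)` evaluates to 0.02, which corresponds to the 2% bound and the 98-th percentile used above.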

The selection criterion, similar to GraphicalLasso, is defined as

$$C(\lambda_{1}):=\operatorname{tr}(\widehat{C}\widehat{\Theta}_{\lambda_{1}})-\log\det(\tilde{\Theta}_{\lambda_{1}}),\tag{14}$$

where $\tilde{\Theta}_{\lambda_{1}}=\widehat{\Theta}_{\lambda_{1}}+\frac{9}{10}\operatorname{diag}(\widehat{\Theta}_{\lambda_{1}})$ is used in the $\log\det$-evaluation for an enhanced positive definiteness in all cases.
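A possible implementation of the selection by criterion (14) is sketched below. The function names are hypothetical, and returning an infinite score when $\tilde{\Theta}_{\lambda_{1}}$ does not have a positive determinant is an assumption added to keep the sketch well defined; the source does not specify this case.

```python
import numpy as np

def threshold_offdiag(Theta, lam1):
    """Thresholding step (13): keep the diagonal, threshold off-diagonal entries."""
    d = Theta.shape[0]
    off = ~np.eye(d, dtype=bool)
    tau = lam1 * np.abs(Theta[off]).max()
    out = np.where(np.abs(Theta) >= tau, Theta, 0.0)
    out[np.diag_indices(d)] = np.diag(Theta)
    return out

def selection_criterion(C_hat, Theta_lam):
    """Criterion (14): tr(C_hat @ Theta_lam) - log det(Theta_tilde)."""
    Theta_tilde = Theta_lam + 0.9 * np.diag(np.diag(Theta_lam))
    sign, logdet = np.linalg.slogdet(Theta_tilde)
    if sign <= 0:            # assumption: reject candidates without positive determinant
        return np.inf
    return np.trace(C_hat @ Theta_lam) - logdet

def select_lambda1(C_hat, Theta, grid):
    """Return the grid value minimising the criterion, and all scores."""
    scores = [selection_criterion(C_hat, threshold_offdiag(Theta, lam)) for lam in grid]
    return float(grid[int(np.argmin(scores))]), scores
```

Using `np.linalg.slogdet` rather than `np.log(np.linalg.det(...))` avoids overflow or underflow of the determinant for large $d$.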

Figure 9 shows the criterion values compared to the Hamming distances with the oracle precision matrix $\Theta^{\star}:=\phi(B^{\star})$ . We observe that the selection criterion with $\operatorname*{arg\,min}_{I_{1}}C(\lambda_{1})$ gives an answer that is rather close to the optimal value in terms of distance of $\widehat{\Theta}_{\lambda_{1}}$ to the oracle precision matrix $\Theta^{\star}$ .

Appendix C Experiments

C.1 Evaluation metrics

The graph metrics used to compare edge sets are the standard ones, also used by the aforementioned baseline methods:
(1) TPR = TP/T (higher is better),
(2) FDR = (R + FP)/P (lower is better),
(3) FPR = (R + FP)/F (lower is better),
(4) SHD = E + M + R (lower is better).
More precisely, SHD is the (minimal) total number of edge additions (E), deletions (M), and reversals (R) needed to convert an estimated DAG into the true DAG. Since a pair of directed graphs are compared, a distinction between True Positives (TP) and Reversed edges (R) is needed: the former are estimated with the correct direction whereas the latter are not. Likewise, a False Positive (FP) is an estimated edge that is not in the undirected skeleton of the true graph. In addition, Positive (P) is the set of estimated edges, True (T) is the set of true edges, and False (F) is the set of non-edges in the ground truth graph; each symbol in the formulas above denotes the size of the corresponding set. Finally, (E) denotes the extra edges with respect to the true skeleton, and (M) the edges missing from it.
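These definitions can be made concrete with a short sketch. The function name `graph_metrics` is hypothetical, and counting the non-edges as $F=d(d-1)/2-T$ is an assumption following a common convention in the evaluation code of continuous-optimization baselines; the source does not state how F is counted.

```python
import numpy as np

def graph_metrics(B_est, B_true):
    """TPR, FDR, FPR and SHD between an estimated and a true directed graph."""
    est = np.asarray(B_est) != 0
    true = np.asarray(B_true) != 0
    d = true.shape[0]
    skel_true = true | true.T                    # undirected skeleton of the truth
    skel_est = est | est.T
    tp = int((est & true).sum())                 # correct direction
    rev = int((est & true.T & ~true).sum())      # reversed edges
    fp = int((est & ~skel_true).sum())           # not in the true skeleton
    P, T = int(est.sum()), int(true.sum())
    F = d * (d - 1) // 2 - T                     # assumed count of non-edges
    upper = np.triu(np.ones((d, d), dtype=bool), k=1)
    extra = int((skel_est & ~skel_true & upper).sum())     # E
    missing = int((skel_true & ~skel_est & upper).sum())   # M
    return {
        "tpr": tp / max(T, 1),
        "fdr": (rev + fp) / max(P, 1),
        "fpr": (rev + fp) / max(F, 1),
        "shd": extra + missing + rev,
    }
```

As a worked case with $d=4$, take the true graph $0\rightarrow 1\rightarrow 2$ and the estimate $\{0\rightarrow 1,\ 2\rightarrow 1,\ 0\rightarrow 2\}$: there is one true positive, one reversed edge and one false positive, which gives TPR $=1/2$, FDR $=2/3$, FPR $=2/4$ and SHD $=1+0+1=2$.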

Running time of DCDILP.

The running time benchmark of DCDILP-GES in the causal discovery tasks on ER1 and ER2 data is given in Figure 10.

The running time benchmark of DCDILP-DMA is given in Figure 11.

C.2 Effects of the reconciliation process by DCDILP (Phase-3)

The proposed ILP-based method is evaluated in comparison with the naive merge $\widehat{B}$ (1) under the same experimental settings with different sample sizes. The supplementary results are shown in Figures 12–16.

References

Aragam et al. [2019] Bryon Aragam, Arash Amini, and Qing Zhou. Globally optimal score-based learning of directed acyclic graphs in high-dimensions. Advances in Neural Information Processing Systems, 32, 2019.
Arjovsky et al. [2019] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
Bello et al. [2022] Kevin Bello, Bryon Aragam, and Pradeep Ravikumar. DAGMA: Learning dags via m-matrices and a log-determinant acyclicity characterization. Advances in Neural Information Processing Systems, 35:8226–8239, 2022.
Chickering [1996] David Maxwell Chickering. Learning bayesian networks is NP-complete. Learning from data: Artificial intelligence and statistics V, pages 121–130, 1996.
Chickering [2002a] David Maxwell Chickering. Learning equivalence classes of bayesian-network structures. The Journal of Machine Learning Research, 2:445–498, 2002a.
Chickering [2002b] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002b.
Cussens [2023] James Cussens. Branch-price-and-cut for causal discovery. In 2nd Conference on Causal Learning and Reasoning, 2023.
Cussens et al. [2017] James Cussens, Matti Järvisalo, Janne H Korhonen, and Mark Bartlett. Bayesian network structure learning with integer programming: Polytopes, facets and complexity. Journal of Artificial Intelligence Research, 58:185–229, 2017.
Gao et al. [2017] Tian Gao, Kshitij Fadnis, and Murray Campbell. Local-to-global bayesian network structure learning. In International Conference on Machine Learning, pages 1193–1202. PMLR, 2017.
Gu and Zhou [2020] Jiaying Gu and Qing Zhou. Learning big gaussian bayesian networks: Partition, estimation and fusion. Journal of machine learning research, 21(158):1–31, 2020.
Gurobi Optimization [2023] Gurobi Optimization. Gurobi Optimizer Reference Manual, 2023. URL https://www.gurobi.com.
Jaakkola et al. [2010] Tommi Jaakkola, David Sontag, Amir Globerson, and Marina Meila. Learning bayesian network structure using lp relaxations. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 358–365. JMLR Workshop and Conference Proceedings, 2010.
Loh and Bühlmann [2014] Po-Ling Loh and Peter Bühlmann. High-dimensional learning of linear causal networks via inverse covariance estimation. The Journal of Machine Learning Research, 15(1):3065–3105, 2014.
Loh and Wainwright [2012] Po-Ling Loh and Martin J Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. Advances in Neural Information Processing Systems, 25, 2012.
Lopez et al. [2022] Romain Lopez, Jan-Christian Hütter, Jonathan Pritchard, and Aviv Regev. Large-scale differentiable causal discovery of factor graphs. Advances in Neural Information Processing Systems, 35:19290–19303, 2022.
Margaritis and Thrun [1999] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. Advances in neural information processing systems, 12, 1999.
Meek [1995] Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI’95, pages 403–410. Morgan Kaufmann Publishers Inc., 1995. ISBN 1558603859.
Mokhtarian et al. [2021] Ehsan Mokhtarian, Sina Akbari, AmirEmad Ghassami, and Negar Kiyavash. A recursive Markov boundary-based approach to causal structure learning. In The KDD’21 Workshop on Causal Discovery, pages 26–54. PMLR, 2021.
Ng et al. [2020] Ignavier Ng, AmirEmad Ghassami, and Kun Zhang. On the role of sparsity and DAG constraints for learning linear DAGs. Advances in Neural Information Processing Systems, 33:17943–17954, 2020.
Ng et al. [2021] Ignavier Ng, Yujia Zheng, Jiji Zhang, and Kun Zhang. Reliable causal discovery with improved exact search and weaker assumptions. Advances in Neural Information Processing Systems, 34:20308–20320, 2021.
Pearl [2000] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
Peters et al. [2016] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947–1012, 2016.
Peters et al. [2017] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
Rahman et al. [2021] Md Musfiqur Rahman, Ayman Rasheed, Md Mosaddek Khan, Mohammad Ali Javidian, Pooyan Jamshidi, and Md Mamun-Or-Rashid. Accelerating recursive partition-based causal structure learning. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pages 1028–1036, 2021.
Ramsey et al. [2017] Joseph Ramsey, Madelyn Glymour, Ruben Sanchez-Romero, and Clark Glymour. A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. International journal of data science and analytics, 3:121–129, 2017.
Sauer and Geiger [2021] Axel Sauer and Andreas Geiger. Counterfactual generative networks. In International Conference on Learning Representations (ICLR), 2021.
Spirtes et al. [2000] Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT press, 2000.
Tsamardinos et al. [2003] Ioannis Tsamardinos, Constantin F Aliferis, Alexander R Statnikov, and Er Statnikov. Algorithms for large scale Markov blanket discovery. In FLAIRS conference, volume 2, pages 376–380. St. Augustine, FL, 2003.
Tsirtsis et al. [2021] Stratis Tsirtsis, Amir-Hossein Karimi, Ana Lucic, Manuel Gomez-Rodriguez, Isabel Valera, and Hima Lakkaraju. ICML workshop on algorithmic recourse. 2021.
Wolsey [2020] Laurence A Wolsey. Integer programming. John Wiley & Sons, 2020.
Wu et al. [2020] Xingyu Wu, Bingbing Jiang, Kui Yu, Chunyan Miao, and Huanhuan Chen. Accurate Markov boundary discovery for causal feature selection. IEEE Transactions on Cybernetics, 50(12):4983–4996, 2020. doi: 10.1109/TCYB.2019.2940509.
Wu et al. [2022] Xingyu Wu, Bingbing Jiang, Yan Zhong, and Huanhuan Chen. Multi-target Markov boundary discovery: Theory, algorithm, and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
Wu et al. [2023] Xingyu Wu, Bingbing Jiang, Tianhao Wu, and Huanhuan Chen. Practical Markov boundary learning without strong assumptions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10388–10398, 2023.
Zhang et al. [2020] Hao Zhang, Shuigeng Zhou, Chuanxu Yan, Jihong Guan, Xin Wang, Ji Zhang, and Jun Huan. Learning causal structures based on divide and conquer. IEEE Transactions on Cybernetics, 52(5):3232–3243, 2020.
Zheng et al. [2018] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, volume 31, 2018. URL https://proceedings.neurips.cc/paper/2018/file/e347c51419ffb23ca3fd5050202f9c3d-Paper.pdf.
