Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

DCDILP: a distributed learning method for large-scale causal structure learning

Shuyu Dong LISN, INRIA, Université Paris-Saclay Michèle Sebag LISN, INRIA, Université Paris-Saclay Kento Uemura Akito Fujii Shuang Chang Yusuke Koyanagi Koji Maruhashi
(June 15, 2024)
Abstract

This paper presents a novel approach to causal discovery through a divide-and-conquer framework. By decomposing the problem into smaller subproblems defined on Markov blankets, the proposed DCDILP method first explores in parallel the local causal graphs of these subproblems. However, this local discovery phase encounters systematic challenges due to the presence of hidden confounders (variables within each Markov blanket may be influenced by external variables). Moreover, aggregating these local causal graphs in a consistent global graph defines a large size combinatorial optimization problem. DCDILP addresses these challenges by: i) restricting the local subgraphs to causal links only related with the central variable of the Markov blanket; ii) formulating the reconciliation of local causal graphs as an integer linear programming method. The merits of the approach, in both terms of causal discovery accuracy and scalability in the size of the problem, are showcased by experiments and comparisons with the state of the art.

1 Introduction

Discovering causal relations from observational data emerges as an important problem for artificial intelligence with fundamental and practical motivations (Pearl, 2000; Peters et al., 2017). One notable reason is that causal models support modes of reasoning, e.g., counterfactual reasoning and algorithmic recourse (Tsirtsis et al., 2021), that are beyond the reach of correlation-based models, as shown by Peters et al. (2016); Arjovsky et al. (2019); Sauer and Geiger (2021). However, causal structure learning from data, referred to as causal discovery, is fraught with statistical and algebraic difficulties. Statistical difficulties arise when the number n𝑛nitalic_n of observations is limited, which hampers the estimation process. Algebraic difficulties are related to discovering the directed acyclic graph (DAG) expressing the causal relations, as learning a DAG is NP-hard (Chickering, 1996).

Related work.

In the literature of causal discovery and Bayesian network learning, there are two main categories of methods, namely constraint-based methods (Spirtes et al., 2000; Meek, 1995) and score function-based methods (Chickering, 2002a; Loh and Bühlmann, 2014). Depending on the specific method, strategies for learning large causal graphs include restricting the search space of directed graphs in a sparse graph (Ramsey et al., 2017; Loh and Bühlmann, 2014), or transforming the underlying combinatorial problem into a continuous optimization problem (Zheng et al., 2018; Aragam et al., 2019; Ng et al., 2020, 2021; Lopez et al., 2022). While these strategies have resulted in significant improvements in reducing the complexity, their scalability is still moderate when the number of variables and/or the degree of the target causal graph are high.

To further tackle the computational challenges in learning large causal structures, a growing number of works consider breaking down the large-scale causal discovery problem into smaller ones. These include methods using Markov blanket discovery (e.g., Tsamardinos et al. (2003); Wu et al. (2020, 2022, 2023); Mokhtarian et al. (2021))—the problem of inferring a smallest subset of variables that shield a given node from the influence of all other variables, and more generally, the divide-and-conquer strategy (Gao et al., 2017; Zhang et al., 2020; Gu and Zhou, 2020). It is noticeable that most of these methods are designed in the framework of constraint-based causal learning and are therefore strictly tied to this methodology. At the same time, the fusion procedure for the conquer step of these divide-and-conquer methods is essentially rule-based, which may limit their applicability.

Contributions.

In this paper we present a new divide-and-conquer approach called DCDILP to first address the scalability challenge inherent to causal discovery, and to propose an enhanced methodology for the conquer step. DCDILP consists of three phases: (i) Phase-1: a divide phase that breaks the causal discovery problem into smaller subproblems by leveraging Markov blanket estimation for each variable. (ii) Phase-2: a causal learning phase that explores causal relations at a local level (within the individual Markov blankets). Learning causal relations locally enjoys a reduced complexity but these subproblems may no longer be causally sufficient—as variable(s) external to the subproblem might act as hidden confounder, having an impact on the (internal) variables. (iii) Phase-3: a conquer phase that reconciles the different causal relations identified for each subproblem into a consistent causal graph, noting that a simple concatenation of the local causal relations found in Phase-2 is unlikely to yield a satisfactory global solution.

The contributions of the proposed approach, DCDILP (Distributed Causal Discovery using Integer Linear Programming), are twofold. First, the causal discovery subproblems associated with each Markov blanket are handled independently in parallel. The causal insufficiency issue is mitigated by only retaining the causal relations involving the center variable of the Markov Blanket. Second, and more importantly, we show that the reconciliation of the causal subgraphs at the subproblem level can be formulated and solved as an integer linear programming (ILP) problem. Binary ILP variables are defined to represent the causal relations (causes, effects, spouses, and v-structures); logical constraints are defined to enforce their consistency, and the optimization of the ILP variables aims to find a causal graph as close as possible to the local subgraphs, subject to being consistent.

The primary strength of the approach lies in the highly parallelizable nature of its Phase-2, making it scaling up to a few thousand variables. Phase-3 corresponds to one single ILP problem that can be delegated to highly efficient ILP solvers. Note that DCDILP allows for flexible choices: (i) for the Markov blanket discovery task in Phase-1; (ii) for the causal discovery subproblems in Phase-2 (GES (Chickering, 2002b) and DAGMA (Bello et al., 2022) are used in the experiments); (iii) for the ILP solver in Phase-3 (Gurobi (Gurobi Optimization, 2023) is used in the experiments).

The paper is organized as follows. After presenting the background in Section 2, we detail DCDILP in Section 3. Section 4 presents the experimental setting and the comparative experimental evidence showing the merits of the approach on small to large-scale causal discovery problems. The paper concludes with a discussion and some perspectives for further research.

2 Formal background

2.1 Definitions and notation

Definition 1.

The linear Structural Equation Model (SEM) of a multivariate random variable X=(X1,,Xd)Xsubscript𝑋1subscript𝑋𝑑\mbox{\bf X}=(X_{1},\dots,X_{d})X = ( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ) is a set of d𝑑ditalic_d equations: for all i=1d𝑖1𝑑i=1\ldots ditalic_i = 1 … italic_d,

Xisubscript𝑋𝑖\displaystyle X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT =β1,iX1++βd,iXd+ϵiabsentsubscript𝛽1𝑖subscript𝑋1subscript𝛽𝑑𝑖subscript𝑋𝑑subscriptitalic-ϵ𝑖\displaystyle=\beta_{1,i}X_{1}+\dots+\beta_{d,i}X_{d}+\epsilon_{i}= italic_β start_POSTSUBSCRIPT 1 , italic_i end_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ⋯ + italic_β start_POSTSUBSCRIPT italic_d , italic_i end_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT + italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT

with ϵisubscriptitalic-ϵ𝑖\epsilon_{i}italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT an external random variable, independent of any Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT for ji𝑗𝑖j\neq iitalic_j ≠ italic_i. Coefficient βi,i=0subscript𝛽𝑖𝑖0\beta_{i,i}=0italic_β start_POSTSUBSCRIPT italic_i , italic_i end_POSTSUBSCRIPT = 0; at most one of βi,jsubscript𝛽𝑖𝑗\beta_{i,j}italic_β start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT and βj,isubscript𝛽𝑗𝑖\beta_{j,i}italic_β start_POSTSUBSCRIPT italic_j , italic_i end_POSTSUBSCRIPT is nonzero. If βj,i0subscript𝛽𝑗𝑖0\beta_{j,i}\neq 0italic_β start_POSTSUBSCRIPT italic_j , italic_i end_POSTSUBSCRIPT ≠ 0, Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is said to be a cause, or parent, of Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT; Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is said to be an effect of Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT.

The graph G:=(X,E)assign𝐺X𝐸G:=(\mbox{\bf X},E)italic_G := ( X , italic_E ) with adjacency matrix B=(βi,j)𝐵subscript𝛽𝑖𝑗B=(\beta_{i,j})italic_B = ( italic_β start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ) is a directed graph such that the edge set E𝐸Eitalic_E corresponds to the set of pairs (i,j)𝑖𝑗(i,j)( italic_i , italic_j ) with βi,j0subscript𝛽𝑖𝑗0\beta_{i,j}\neq 0italic_β start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ≠ 0. The directed graph of a linear SEM is usually required to be acyclic (DAG). G𝐺Gitalic_G is also called the causal graph or causal structure of the SEM, in the sense that any edge in G𝐺Gitalic_G, denoted as (XiXj)subscript𝑋𝑖subscript𝑋𝑗(X_{i}\to X_{j})( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT → italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ), signifies that Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a cause of Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (and Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT is an effect of Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT).

Given two directed graphs G1,G2subscript𝐺1subscript𝐺2G_{1},G_{2}italic_G start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_G start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT on X, the intersection G1G2subscript𝐺1subscript𝐺2G_{1}\cap G_{2}italic_G start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∩ italic_G start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT refers to the intersection of their edge sets. The number of nonzeros of matrix B𝐵Bitalic_B is denoted as nnz(B)nnz𝐵\operatorname{nnz}(B)roman_nnz ( italic_B ).

Definition 2.

Consider a DAG G=(X,B)𝐺X𝐵G=(\mbox{\bf X},B)italic_G = ( X , italic_B ) defined on X=(X1,,Xd)Xsubscript𝑋1subscript𝑋𝑑\mbox{\bf X}=(X_{1},\dots,X_{d})X = ( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ). The Markov blanket of variable Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, denoted as MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), is the smallest set MX𝑀XM\subset\mbox{\bf X}italic_M ⊂ X such that

XGX\(M{Xi}) given MX\perp\!\!\!\!\perp_{G}\mbox{\bf X}\backslash(M\cup\{X_{i}\})\text{~{}given~{}}Mitalic_X ⟂ ⟂ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT X \ ( italic_M ∪ { italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } ) given italic_M

where Gperpendicular-toabsentsubscriptperpendicular-to𝐺\perp\!\!\!\!\perp_{G}⟂ ⟂ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT denotes d-separation (e.g., (Peters et al., 2017, Definition 6.1)).

When the distribution of X and the DAG G𝐺Gitalic_G satisfy the Markov property, then the d-separation property above entails

XiX\(M{Xi}) given M.X_{i}\perp\!\!\!\!\perp\mbox{\bf X}\backslash(M\cup\{X_{i}\})\text{~{}given~{}% }M.italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⟂ ⟂ X \ ( italic_M ∪ { italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } ) given italic_M .

The Markov blanket MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) contains exactly the variables Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT that are causes or effects of Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (i.e., βj,i0subscript𝛽𝑗𝑖0\beta_{j,i}\neq 0italic_β start_POSTSUBSCRIPT italic_j , italic_i end_POSTSUBSCRIPT ≠ 0 or βi,j0subscript𝛽𝑖𝑗0\beta_{i,j}\neq 0italic_β start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT ≠ 0) and the spouse variables Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT (i.e., there exists a variable Xsubscript𝑋X_{\ell}italic_X start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT that is an effect of both Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT). A triplet (Xi,Xj,Xksubscript𝑋𝑖subscript𝑋𝑗subscript𝑋𝑘X_{i},X_{j},X_{k}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT) form a v-structure if Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT are causes of Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT while the first two are not directly linked.

2.2 Related work

Divide-and-conquer strategy.

The divide-and-conquer strategy is used by Gu and Zhou (2020) in their method named PEF (Partition, Estimation and Fusion), which consists of three phases: (i) partitioning the set of all variables into clusters, (ii) estimating causal structures cluster by cluster; and (iii) finally producing a fusion of all local learning results. Zhang et al. (2020) proposed a divide-and-conquer approach based on constraint-based methods. In earlier works of Gao et al. (2017), a strategy similar to divide-and-conquer is used, which consists of first Markov blanket learning for each variable and then a procedure for learning and fusion in a from-local-to-global manner.

Markov blanket.

A different but related approach is the recursive methods (Mokhtarian et al., 2021; Rahman et al., 2021). Mokhtarian et al. (2021) proposed the recursive variable elimination algorithm named MARVEL that involves operations similar to the divide step based on Markov boundary discovery. To obtain Markov blanket information for MARVEL as well as our proposed method (Phase-1), many different algorithms can be used including Grow-Shrink (GS) Margaritis and Thrun (1999), IAMB Tsamardinos et al. (2003), precision matrix-based methods (e.g., Loh and Bühlmann (2014)), and more recent methods by Wu et al. (2020, 2023) such as KMB (Kernel MB learning). The choice of which algorithm to use for computing the Markov boundaries should be made according to the data and application.

ILP methods for causal learning.

Jaakkola et al. (2010) tackle the problem of learning Bayesian network structures using linear programming relaxations. Cussens et al. (2017); Cussens (2023) introduce integer programming and linear programming methods for Bayesian network learning, shedding light on polytopes and facets regarding the search of DAGs.

3 DCDILP: a divide-and-conquer approach to causal discovery

In this section, we present the DCDILP approach by first introducing its principles based on the divide-and-conquer strategy and then introducing the integer linear programming method for the conquer step within this strategy.

3.1 Divide-and-conquer strategy

We consider the following three-phase procedure named DCDILP, as illustrated in Figure 1:

Refer to caption Refer to caption Refer to caption

Refer to caption
Refer to caption
Refer to caption
(a) B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT
Refer to caption
(b) B𝐵Bitalic_B
Figure 1: Illustration of the divide-and-conquer framework: (a) observational data; (b) d𝑑ditalic_d data subsets: the i𝑖iitalic_i-th data subset only includes variable Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and the variables in its Markov blanket MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ; (c) output matrix B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT of i𝑖iitalic_i-th subproblem for i=1,,d𝑖1𝑑i=1,\dots,ditalic_i = 1 , … , italic_d; (d) final solution B𝐵Bitalic_B.
  • Phase-1: this phase consists of identifying the Markov blankets MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) for each variable Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. As said, MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) only includes variables that are causes, effects or spouses of Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Under the assumption that MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is accurately identified, this subset of variables contains all variables related to Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. The Markov blanket-based divide scheme thus is structured in a way that favors separability.

  • Phase-2: this phase tackles the local causal discovery problems (subproblem) defined on Si:=MB(Xi){Xi}assignsubscriptS𝑖MBsubscript𝑋𝑖subscript𝑋𝑖\mbox{\bf S}_{i}:=\mbox{\bf MB}(X_{i})\cup\{X_{i}\}S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT := MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∪ { italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }, for each i=1d𝑖1𝑑i=1\ldots ditalic_i = 1 … italic_d. This restriction makes the subproblems much smaller than the original problem but it entails a causal insufficiency issue as the variables in SisubscriptS𝑖\mbox{\bf S}_{i}S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT may well be influenced by variables external to MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). In Phase-2, we partially mitigate this issue by retaining the causal relations involving only Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT; other relations found within SisubscriptS𝑖\mbox{\bf S}_{i}S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are discarded. This choice is motivated by the fact that Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the only variable that has all its causes and effects in MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) for certain (still under the assumption that MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) is accurately identified).

  • Phase-3: this phase aims to reconcile eventually all edges found in Phase-2, and enforce their consistency. As will be shown, this task is formalized as an integer linear programming problem. The constraints formalize the logical relations among the notions of causes, effects, spouses and v-structures, and the objective function aims at finding an overall causal graph B𝐵Bitalic_B as aligned as possible with the local relations found in Phase-2.

Algorithm.

The above framework, as detailed in Algorithm 1, can be realized in different ways depending on how Phase-1 and Phase-2 are carried out. In the present work, we consider two different algorithms—GES (Chickering, 2002b) and DAGMA (Bello et al., 2022)—for learning the preliminary subgraphs A(i)superscript𝐴𝑖A^{(i)}italic_A start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT in Phase-2 (line 3). For Phase-1, a discussion about our choices is given in Section 3.3.

Algorithm 1 (DCDILP) Distributed causal discovery using ILP
0:  Observational data 𝒳n×d𝒳superscript𝑛𝑑\mathcal{X}\in\mathbb{R}^{n\times d}caligraphic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_d end_POSTSUPERSCRIPT
1:  (Phase-1) Divide: Estimate Markov blanket MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) for i=1,,d𝑖1𝑑i=1,\dots,ditalic_i = 1 , … , italic_d
2:  (Phase-2) for i=1,,d𝑖1𝑑i=1,\dots,ditalic_i = 1 , … , italic_d do in parallel
3:                       A(i)superscript𝐴𝑖absentA^{(i)}\leftarrowitalic_A start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ← Causal discovery on Si:=MB(Xi){Xi}assignsubscriptS𝑖MBsubscript𝑋𝑖subscript𝑋𝑖\mbox{\bf S}_{i}:=\mbox{\bf MB}(X_{i})\cup\{X_{i}\}S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT := MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∪ { italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }
4:                       B^j,k(i)Aj,k(i)subscriptsuperscript^𝐵𝑖𝑗𝑘subscriptsuperscript𝐴𝑖𝑗𝑘\widehat{B}^{(i)}_{j,k}\leftarrow A^{(i)}_{j,k}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j , italic_k end_POSTSUBSCRIPT ← italic_A start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j , italic_k end_POSTSUBSCRIPT if j=i𝑗𝑖j=iitalic_j = italic_i or k=i𝑘𝑖k=iitalic_k = italic_i, otherwise B^j,k(i)0subscriptsuperscript^𝐵𝑖𝑗𝑘0\widehat{B}^{(i)}_{j,k}\leftarrow 0over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j , italic_k end_POSTSUBSCRIPT ← 0  for (j,k)[d]×[d]𝑗𝑘delimited-[]𝑑delimited-[]𝑑(j,k)\in[d]\times[d]( italic_j , italic_k ) ∈ [ italic_d ] × [ italic_d ]
5:  (Phase-3) Conquer:
BReconciliation from {B^(i),i=1d} through the ILP (5)–(11)𝐵Reconciliation from superscript^𝐵𝑖𝑖1𝑑 through the ILP (5)–(11)B\leftarrow\text{Reconciliation from }\{\widehat{B}^{(i)},i=1\ldots d\}\text{~% {}through the ILP \eqref{eq:fobj-lp}--\eqref{eq:v2}}italic_B ← Reconciliation from { over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT , italic_i = 1 … italic_d } through the ILP ( )–( )

Each subproblem in line 3 takes as input an n×|Si|𝑛subscriptS𝑖n\times|\mbox{\bf S}_{i}|italic_n × | S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | submatrix of the whole data matrix 𝒳𝒳\mathcal{X}caligraphic_X. Subsequently, the output of Phase-2, B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT (line 4) consists of only the direct causes and effects of Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Note that the estimation of B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT for each Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is mutually independent given the Markov blankets. Therefore, their computations can be distributed to a number of different CPUs in parallel.

Phase-3 is concerned with building a global causal graph from all partial relations in B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT constructed in Phase-2.

3.2 Phase-3: formulating causal graph reconciliation as an ILP problem

A naive approach is to concatenate all partial graphs B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT found in Phase-2:

B^=i=1dB^(i).^𝐵superscriptsubscript𝑖1𝑑superscript^𝐵𝑖\widehat{B}=\sum_{i=1}^{d}\widehat{B}^{(i)}.over^ start_ARG italic_B end_ARG = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT . (1)

In general, however, this approach does not yield a consistent solution due to diverse conflicts among the B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT. For example, Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT might be considered as a cause of Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT but, at the same time, it might be an effect of Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in B^(j)superscript^𝐵𝑗\widehat{B}^{(j)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT. In the rest of this subsection, we first analyse the properties of these conflicts and then present the proposed ILP method.

Merge conflicts after Phase-2.

In contrast to the naive merge B^^𝐵\widehat{B}over^ start_ARG italic_B end_ARG defined in (1), we propose a reconciliation process that combines the edges in matrices B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT in a selective manner. Since the global causal discovery problem is divided into the separate subproblems according to the Markov blankets, the local solutions B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT from Phase-2 generally have overlapping answers regarding the causal relations. In a general sense, we refer to the overlapping answers (of two graphs) as merge conflicts defined as follows.

Definition 3 (Merge conflict).

For i=1,2𝑖12i=1,2italic_i = 1 , 2, respectively, let 𝒮(i)[d]×[d]superscript𝒮𝑖delimited-[]𝑑delimited-[]𝑑\mathcal{S}^{(i)}\subset[d]\times[d]caligraphic_S start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ⊂ [ italic_d ] × [ italic_d ] be a set of edges on X=(X1,,Xd)Xsubscript𝑋1subscript𝑋𝑑\mbox{\bf X}=(X_{1},\dots,X_{d})X = ( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ), and let B(i)superscript𝐵𝑖B^{(i)}italic_B start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT be the binary adjacency matrix of a graph with edges restricted in 𝒮(i)superscript𝒮𝑖\mathcal{S}^{(i)}caligraphic_S start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT. Then, B(1)superscript𝐵1B^{(1)}italic_B start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT and B(2)superscript𝐵2B^{(2)}italic_B start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT are said to constitute a merge conflict if and only if there exists (j,k)𝒮(1)𝒮(2)𝑗𝑘superscript𝒮1superscript𝒮2(j,k)\in\mathcal{S}^{(1)}\cap\mathcal{S}^{(2)}( italic_j , italic_k ) ∈ caligraphic_S start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ∩ caligraphic_S start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT such that Bjk(1)Bjk(2)subscriptsuperscript𝐵1𝑗𝑘subscriptsuperscript𝐵2𝑗𝑘B^{(1)}_{jk}\neq B^{(2)}_{jk}italic_B start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ≠ italic_B start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT.

Proposition 4.

For i,j[d]𝑖𝑗delimited-[]𝑑i,j\in[d]italic_i , italic_j ∈ [ italic_d ], let B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT and B^(j)superscript^𝐵𝑗\widehat{B}^{(j)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT respectively denote the i𝑖iitalic_i-th and j𝑗jitalic_j-th binary adjacency matrix output by Phase-2 of Algorithm 1 (lines 4–5). A merge conflict between B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT and B^(j)superscript^𝐵𝑗\widehat{B}^{(j)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT is one of the following three types:

  1. 1.

    (Type-1) One of the two adjacency matrices gives a directed link between Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT while the other suggests XiXjX_{i}\perp\!\!\!\!\perp X_{j}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⟂ ⟂ italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT.

  2. 2.

    (Type-2) The two adjacency matrices result in two directed links between Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT with opposite directions: either one matrix contains the two directions while the other gives XiXjX_{i}\perp\!\!\!\!\perp X_{j}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⟂ ⟂ italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, or each one gives a directed edge opposite to the other.

  3. 3.

    (Type-3) One of the two adjacency matrices gives a undirected link between Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT while the other gives a directed link.

The proof is given in Appendix A. An illustration about the above proposition is shown in Figure 2.

B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT B^(j)superscript^𝐵𝑗\widehat{B}^{(j)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT B^^𝐵\widehat{B}over^ start_ARG italic_B end_ARG

Refer to caption

+

Refer to caption

\rightarrow

Refer to caption

Figure 2: Concatenation of two local results. A merge conflict, if it happens, is located on the two entries marked in red (Proposition 4).

Any pair of matrices B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT an B^(j)superscript^𝐵𝑗\widehat{B}^{(j)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT by Phase-2 can only have the above three types of conflicts because their edge sets intersect at most on the two directed edges between Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT; indeed, many more conflicts can otherwise be observed among A^(i)superscript^𝐴𝑖\widehat{A}^{(i)}over^ start_ARG italic_A end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT and A^(j)superscript^𝐴𝑗\widehat{A}^{(j)}over^ start_ARG italic_A end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT (in line 3 of Algorithm 1).

The ILP formulation.

Considering all conflicts among the output matrices of Phase-2, DCDILP delegates their resolution to an integer linear programming problem (Wolsey, 2020) by formulating all constraints on the sought solution as follows.

This problem involves: (i) binary variables noted Bijsubscript𝐵𝑖𝑗B_{ij}italic_B start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT and Sijsubscript𝑆𝑖𝑗S_{ij}italic_S start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT for all pairs (i,j)[d]×[d]𝑖𝑗delimited-[]𝑑delimited-[]𝑑(i,j)\in[d]\times[d]( italic_i , italic_j ) ∈ [ italic_d ] × [ italic_d ] such that XiMB(Xj)subscript𝑋𝑖MBsubscript𝑋𝑗X_{i}\in\mbox{\bf MB}(X_{j})italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ MB ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ); (ii) binary variables noted Vijksubscript𝑉𝑖𝑗𝑘V_{ijk}italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT for all triples (i,j,k)[d]×[d]×[d]𝑖𝑗𝑘delimited-[]𝑑delimited-[]𝑑delimited-[]𝑑(i,j,k)\in[d]\times[d]\times[d]( italic_i , italic_j , italic_k ) ∈ [ italic_d ] × [ italic_d ] × [ italic_d ] such that Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (respectively, Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT) belongs to both Markov blankets MB(Xj)MBsubscript𝑋𝑗\mbox{\bf MB}(X_{j})MB ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) and MB(Xk)MBsubscript𝑋𝑘\mbox{\bf MB}(X_{k})MB ( italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) (respectively, MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) and MB(Xk)MBsubscript𝑋𝑘\mbox{\bf MB}(X_{k})MB ( italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ); or MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) and MB(Xj)MBsubscript𝑋𝑗\mbox{\bf MB}(X_{j})MB ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT )):

Bij=1subscript𝐵𝑖𝑗1\displaystyle B_{ij}=1italic_B start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1 if XiXjif subscript𝑋𝑖subscript𝑋𝑗\displaystyle\text{if~{}}X_{i}\to X_{j}if italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT → italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (2)
Vijk=Vjik=1subscript𝑉𝑖𝑗𝑘subscript𝑉𝑗𝑖𝑘1\displaystyle V_{ijk}=V_{jik}=1italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT = italic_V start_POSTSUBSCRIPT italic_j italic_i italic_k end_POSTSUBSCRIPT = 1 if (Xi,Xj,Xk) form a v-structure (XiXkXj).if subscript𝑋𝑖subscript𝑋𝑗subscript𝑋𝑘 form a v-structure (XiXkXj)\displaystyle\text{if }(X_{i},X_{j},X_{k})\text{ form a v-structure $(X_{i}\to X% _{k}\leftarrow X_{j})$}.if ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) form a v-structure ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT → italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ← italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) . (3)
Sij=Sji=1subscript𝑆𝑖𝑗subscript𝑆𝑗𝑖1\displaystyle S_{ij}=S_{ji}=1italic_S start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = italic_S start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT = 1 if Xi and Xj are spouses, i.e. k s.t. Vijk=1.if Xi and Xj are spouses, i.e. k s.t. Vijk=1\displaystyle\text{if $X_{i}$ and $X_{j}$ are spouses, i.e. $\exists k$ s.t. $% V_{ijk}=1$}.if italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT are spouses, i.e. ∃ italic_k s.t. italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT = 1 . (4)

The constraints on the above variables express the fact that the sought solution denoted B𝐵Bitalic_B must be consistent with the given Markov blankets while the objective function of the problem is defined as the similarity of B𝐵Bitalic_B with the naive concatenation of matrices B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT, namely B^=i=1dB^(i)^𝐵superscriptsubscript𝑖1𝑑superscript^𝐵𝑖\widehat{B}=\sum_{i=1}^{d}\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT:

maximizeB𝔹d×d,S𝕊d×d,V𝔹d×d×dB^,Bsubject toformulae-sequence𝐵superscript𝔹𝑑𝑑formulae-sequence𝑆superscript𝕊𝑑𝑑𝑉superscript𝔹𝑑𝑑𝑑maximizeexpectation^𝐵𝐵subject to\displaystyle\underset{B\in\mathbb{B}^{d\times d},S\in\mathbb{S}^{d\times d},V% \in\mathbb{B}^{d\times d\times d}}{\text{maximize}}\braket{\widehat{B},B}\quad% \text{subject to }start_UNDERACCENT italic_B ∈ blackboard_B start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT , italic_S ∈ blackboard_S start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT , italic_V ∈ blackboard_B start_POSTSUPERSCRIPT italic_d × italic_d × italic_d end_POSTSUPERSCRIPT end_UNDERACCENT start_ARG maximize end_ARG ⟨ start_ARG over^ start_ARG italic_B end_ARG , italic_B end_ARG ⟩ subject to (5)
Bij=0 if B^ij=B^ji=0formulae-sequencesubscript𝐵𝑖𝑗0 if subscript^𝐵𝑖𝑗subscript^𝐵𝑗𝑖0\displaystyle\hskip 25.60747ptB_{ij}=0\hskip 130.88268pt\text{~{} if ~{}}% \widehat{B}_{ij}=\widehat{B}_{ji}=0italic_B start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 0 if over^ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = over^ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT = 0 (6)
Sij=Sji=0 if XiMB(Xj)formulae-sequencesubscript𝑆𝑖𝑗subscript𝑆𝑗𝑖0 if subscript𝑋𝑖MBsubscript𝑋𝑗\displaystyle\hskip 25.60747ptS_{ij}=S_{ji}=0\hskip 108.12047pt\text{~{}if~{}}% X_{i}\notin\mbox{\bf MB}(X_{j})italic_S start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = italic_S start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT = 0 if italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∉ MB ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) (7)
Vijk=0 if B^ik=0 or B^jk=0formulae-sequencesubscript𝑉𝑖𝑗𝑘0 if subscript^𝐵𝑖𝑘0 or subscript^𝐵𝑗𝑘0\displaystyle\hskip 25.60747ptV_{ijk}=0\hskip 130.88268pt\text{~{}if~{}}% \widehat{B}_{ik}=0\text{~{}or~{}}\widehat{B}_{jk}=0italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT = 0 if over^ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT = 0 or over^ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT = 0 (8)
Bij+Bji1,Bij+Bji+Sij1if XiMB(Xj)formulae-sequencesubscript𝐵𝑖𝑗subscript𝐵𝑗𝑖1subscript𝐵𝑖𝑗subscript𝐵𝑗𝑖subscript𝑆𝑖𝑗1if subscript𝑋𝑖MBsubscript𝑋𝑗\displaystyle\hskip 25.60747ptB_{ij}+B_{ji}\leq 1,\quad B_{ij}+B_{ji}+S_{ij}% \geq 1\quad\text{if~{}}X_{i}\in\mbox{\bf MB}(X_{j})italic_B start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT + italic_B start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT ≤ 1 , italic_B start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT + italic_B start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT + italic_S start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ≥ 1 if italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ MB ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) (9)
VijkBik,VijkBjk,VijkSijandformulae-sequencesubscript𝑉𝑖𝑗𝑘subscript𝐵𝑖𝑘formulae-sequencesubscript𝑉𝑖𝑗𝑘subscript𝐵𝑗𝑘subscript𝑉𝑖𝑗𝑘subscript𝑆𝑖𝑗and\displaystyle\hskip 25.60747ptV_{ijk}\leq B_{ik},~{}V_{ijk}\leq B_{jk},~{}V_{% ijk}\leq S_{ij}\quad\text{and}italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT ≤ italic_B start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT , italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT ≤ italic_B start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT , italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT ≤ italic_S start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT and (10)
Bik+Bjk1+Vijk,SijkVijkk,B^ik0,B^jk0.formulae-sequencesubscript𝐵𝑖𝑘subscript𝐵𝑗𝑘1subscript𝑉𝑖𝑗𝑘formulae-sequencesubscript𝑆𝑖𝑗subscript𝑘subscript𝑉𝑖𝑗𝑘for-all𝑘formulae-sequencesubscript^𝐵𝑖𝑘0subscript^𝐵𝑗𝑘0\displaystyle\hskip 25.60747ptB_{ik}+B_{jk}\leq 1+V_{ijk},~{}S_{ij}\leq\sum_{k% }V_{ijk}\quad\forall k,\widehat{B}_{ik}\neq 0,\widehat{B}_{jk}\neq 0.italic_B start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT + italic_B start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ≤ 1 + italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ≤ ∑ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT ∀ italic_k , over^ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ≠ 0 , over^ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ≠ 0 . (11)

More precisely, the constraints of the above ILP are motivated by the following reasons:

  • Sparsity. The constraints (6)–(8) enable us to discard all pairs in B,S𝐵𝑆B,Sitalic_B , italic_S or triplets in V𝑉Vitalic_V that are not involved with the given Markov blanket information.

  • 2-cycle exclusiveness. For any pair (i,j),ij𝑖𝑗𝑖𝑗(i,j),i\neq j( italic_i , italic_j ) , italic_i ≠ italic_j and XjMB(Xi)subscript𝑋𝑗MBsubscript𝑋𝑖X_{j}\in\mbox{\bf MB}(X_{i})italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (and vice-versa), the first constraint in (9), Bij+Bji1subscript𝐵𝑖𝑗subscript𝐵𝑗𝑖1B_{ij}+B_{ji}\leq 1italic_B start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT + italic_B start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT ≤ 1, excludes the 2-cycle between Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT given that the entries of B𝐵Bitalic_B are binary variables.

  • Markov blanket membership. The second constraint in (9), Bij+Bji+Sij1subscript𝐵𝑖𝑗subscript𝐵𝑗𝑖subscript𝑆𝑖𝑗1B_{ij}+B_{ji}+S_{ij}\geq 1italic_B start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT + italic_B start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT + italic_S start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ≥ 1, dictates that, when XjMB(Xi)subscript𝑋𝑗MBsubscript𝑋𝑖X_{j}\in\mbox{\bf MB}(X_{i})italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (and vice-versa), they must be either directly causally related or spouses.

  • V-structures. For any k[d]𝑘delimited-[]𝑑k\in[d]italic_k ∈ [ italic_d ], B^ik0subscript^𝐵𝑖𝑘0\widehat{B}_{ik}\neq 0over^ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT ≠ 0 and B^jk0subscript^𝐵𝑗𝑘0\widehat{B}_{jk}\neq 0over^ start_ARG italic_B end_ARG start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT ≠ 0, there is a chance that (Xi,Xj,Xk)subscript𝑋𝑖subscript𝑋𝑗subscript𝑋𝑘(X_{i},X_{j},X_{k})( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) form a v-structure. Therefore Vijksubscript𝑉𝑖𝑗𝑘V_{ijk}italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT is encoded (or created) as a binary variable. The constraints in (10) and (11) are necessary conditions for (Vijk,Bik,Bjk,Sij)subscript𝑉𝑖𝑗𝑘subscript𝐵𝑖𝑘subscript𝐵𝑗𝑘subscript𝑆𝑖𝑗(V_{ijk},B_{ik},B_{jk},S_{ij})( italic_V start_POSTSUBSCRIPT italic_i italic_j italic_k end_POSTSUBSCRIPT , italic_B start_POSTSUBSCRIPT italic_i italic_k end_POSTSUBSCRIPT , italic_B start_POSTSUBSCRIPT italic_j italic_k end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) to be consistent with the Markov blanket information.

Table 1: Example of Remark 5.
Concatenation B^^𝐵\widehat{B}over^ start_ARG italic_B end_ARG B𝐵Bitalic_B Bsuperscript𝐵B^{\prime}italic_B start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT
[Uncaptioned image]
[Uncaptioned image]
[Uncaptioned image]
Remark 5.

The constraints of DCDILP will exclude spurious solutions that are not conform with the Markov blanket information (obtained from Phase-1). In the first example of Table 1 (row (a)), if X𝑋Xitalic_X and Y𝑌Yitalic_Y do not belong to each other’s Markov blanket, then only the solution Bsuperscript𝐵B^{\prime}italic_B start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT is feasible to the ILP; otherwise, only the solution B𝐵Bitalic_B is feasible to the ILP. \square

3.3 Discussion

Phase-1.

In a classical setting of Gaussian graphical models the sparsity pattern of the inverse covariance matrix of X encodes conditional independence relations between the variables. Consider the usual covariance matrix Σ=cov(X)ΣcovX\Sigma=\operatorname*{cov}(\mbox{\bf X})roman_Σ = roman_cov ( X ). It is a well-known consequence of the Hammersley–Clifford theorem that the entries of the precision matrix Θ=Σ1ΘsuperscriptΣ1\Theta=\Sigma^{-1}roman_Θ = roman_Σ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT correspond to rescaled conditional correlations Loh and Wainwright (2012). In DCDILP, the empirical inverse covariance estimator is used to achieve Phase-1, that is, MB(Xi){i}MBsubscript𝑋𝑖𝑖\mbox{\bf MB}(X_{i})\cup\{i\}MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∪ { italic_i } is set as the support of Θi,:subscriptΘ𝑖:\Theta_{i,:}roman_Θ start_POSTSUBSCRIPT italic_i , : end_POSTSUBSCRIPT; details are given in Section B.1.

Time efficiency.

Apart from Phase-1, the computational cost of DCDILP is dominated by Phase-2, which is distributed on a number of parallel workers. Note that the wall time of DCDILP depends mostly on the maximal running time among the parallel workers. Therefore, the total running time of DCDILP is in principle dominated by the running time of the chosen causal discovery routine on the largest Markov blanket during Phase-2. The computational time of the three phases is shown empirically under different settings in Section 4.

4 Experiments

We conduct experiments on data generated on linear SEMs to assess the performance of DCDILP for causal causal discovery. The primary goal of the experiments is to examine the learning accuracy of the proposed method and its computational efficiency in different settings.

4.1 Experimental setting

Benchmark data.

The observational data are generated from linear SEMs following the usual settings of Zheng et al. (2018), where the causal structure B𝐵Bitalic_B is drawn from random DAGs with the Erdős–Rényi (ER) model. The coefficients (edge weights) of B𝐵Bitalic_B are drawn from the uniform distribution Unif([2,0.5][0.5,2])Unif20.50.52\text{Unif}([-2,-0.5]\cup[0.5,2])Unif ( [ - 2 , - 0.5 ] ∪ [ 0.5 , 2 ] ).

Baselines.

The baselines of the causal discovery benchmarks are GES Chickering (2002b) and DAGMA Bello et al. (2022). More precisely, the GES implementation used in the benchmark—labeled as GES (pcalg)—is from the R package pcalg (https://cran.r-project.org/web/packages/pcalg/index.html), which is a highly efficient implementation.

In these benchmarks, the proposed DCDILP method is represented by DCDILP-GES and DCDILP-DMA, which refers to the specific implementations of DCDILP using GES and DAGMA, respectively, during Phase 2 (line 3). The estimated graphs are evaluated by usual metrics (SHD, TPR, FDR and FPR) for DAG learning (details in Section C.1). The other algorithms are run on one CPU of Intel(R) Xeon(R) Gold 5120 14 cores @ 2.2GHz. The computations in Phase 2 of DCDILP are distributed on a maximal of 300 CPU cores.

The DCDILP method is tested in two scenarios. One scenario is to assess the proposed causal learning methodology irrespective of the statistical performance of Markov blanket discovery, in which case DCDILP (MB) will be used as the label, meaning that the Markov blankets are the ground truth ones. The other scenario is causal discovery, in which case DCDILP takes observational data as input and proceeds by following exactly the framework specified in Algorithm 1.

The implementation of DCDILP is made available at https://github.com/shuyu-d/dcdilp-exp.

4.2 Experimental results

DCDILP using GES in Phase 2.

In this experiment, DCDILP-GES represents the algorithm of DCDILP that uses GES during its Phase 2 (Algorithm 1, line 3), and is evaluated in causal discovery tasks in comparison with GES. The number d𝑑ditalic_d of nodes varies in {{\{{50, 100, 200, 400, 800, 1000, 1600}}\}} and the number of samples of X varies as n=50d𝑛50𝑑n=50ditalic_n = 50 italic_d. The choice of the sample sizes is discussed in Section 4.3. The computations in Phase 2 of DCDILP-GES is distributed on N=min(2d,300)𝑁2𝑑300N=\min(2d,300)italic_N = roman_min ( 2 italic_d , 300 ) CPU cores. The results are shown in Figure 3, Figure 4 and in Appendix C in the supplementary material.

Refer to caption
Refer to caption
Refer to caption
(a) Data with ER1
Refer to caption
Refer to caption
Refer to caption
(b) Data with ER2
Figure 3: Results of DCDILP-GES for learning the linear SEM on ER1 and ER2 graphs. Left: TPR (higher is better); Center: FDR (lower is better); Right: SHD (lower is better).
Refer to caption
Refer to caption
(a) Results on ER2 data
Figure 4: Wall time of DCDILP-GES and GES for learning linear SEMs on ER1 and ER2 graphs. Left: wall time of the three phases of DCDILP-GES in linear scale; Right: wall time of the two algorithms depending on d𝑑ditalic_d.

Figure 3 presents the learning scores of the two algorithms in (TPR, FDR, SHD), depending on the problem dimension d𝑑ditalic_d. The results in this figure shows that DCDILP-GES outperforms GES in all three learning scores for under all problem dimensions.

It is worth noting that DCDILP-GES uses partially the subgraphs obtained by GES during the local learning subproblems of Phase 2. It is therefore surprising to some extent that DCDILP-GES achieves significant gains over GES in the DAG learning accuracy. We believe that such gains are due to the following reasons: (i) the causal learning subproblems in Phase 2 of DCDILP enjoys a higher effective sample size, in the sense that the ratio n/|Si|𝑛subscriptS𝑖n/|\mbox{\bf S}_{i}|italic_n / | S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT |—for each subproblem on SisubscriptS𝑖\mbox{\bf S}_{i}S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT—is effectively higher than n/d𝑛𝑑n/ditalic_n / italic_d (for the global problem); (ii) the reconciliation by the ILP method of DCDILP-GES is a selection process that removes spurious directions (in terms of DAG learning) from the subgraphs given by GES during Phase 2, while GES is a method for producing CPDAGs.

Figure 4 presents the running time (wall time) of the two algorithms depending on the problem dimension d𝑑ditalic_d. The results in this figure shows that (i) GES (pcalg) enjoys a highly competitive time efficiency on small- to mid-side problems (for d𝑑ditalic_d ranging from 50505050 to around 700700700700), which is superior or comparable to DCDILP-GES; (ii) however, the running time of DCDILP-GES grows more slowly than GES with increasing d𝑑ditalic_d. The trend in this figure shows that when d𝑑ditalic_d surpasses 800800800800 (on ER1 data) or 700700700700 (on ER2 data), DCDILP-GES starts to gain speedups over GES with an increasing rate. When d=1600𝑑1600d=1600italic_d = 1600, for example, DCDILP-GES achieves around 4×4\times4 × and 5×5\times5 × speedups over GES on ER1 data and ER2 data respectively.

DCDILP using DAGMA in Phase 2.

In this experiment, DCDILP-DMA represents the algorithm of DCDILP that uses DAGMA during Phase 2 (Algorithm 1, line 3), and is evaluated in comparison with DAGMA. The number d𝑑ditalic_d of nodes varies in {{\{{50, 100, 200, 400, 800, 1000, 1600}}\}} and the number of samples of X varies as n=50d𝑛50𝑑n=50ditalic_n = 50 italic_d.

The computation in Phase 2 of DCDILP-DAGMA is distributed on N=min(2d,300)𝑁2𝑑300N=\min(2d,300)italic_N = roman_min ( 2 italic_d , 300 ) CPU cores. The results are shown in Figure 5 and Figure 6.

Refer to caption
Refer to caption
Refer to caption
(a) Data with ER1
Refer to caption
Refer to caption
Refer to caption
(b) Data with ER2
Figure 5: Results of DCDILP-DMA (short for DCDILP-DAGMA) for learning linear SEMs on ER1 and ER2 graphs.
Refer to caption
Refer to caption
(a) Results on ER2 data
Figure 6: Wall time of DCDILP-DMA (short for DCDILP-DAGMA) and DAGMA for learning the linear SEM on ER1 and ER2 graphs. Left: wall time of the three phases of DCDILP-DMA in linear scale; Right: wall time (in log scale) of the two algorithms depending on d𝑑ditalic_d.

The results in Figure 5 and Figure 6 show that DCDILP-DMA achieves significant gains over DAGMA in running time—under almost all problem dimensions—within a reasonable compromise in learning accuracy. More precisely, on ER1 data, the median TPRs of both DCDILP-DMA and DAGMA close to 100% and the median FDRs of DCDILP-DMA are under 10% under all dimensions d𝑑ditalic_d, while DAGMA has FDRs that are close to zero (indicating almost exact recovery of the underlying DAGs). On ER2 data, DCDILP-DMA has median TPRs slightly above 90% while DAGMA stays close to 100%; and DCDILP-DMA has median FDRs between 10% and 20%, which is reasonably low despite being higher than DAGMA.

On the other hand, the time efficiency gains of DCDILP-DMA over DAGMA are even more significant than the case with GES (pcalg). Despite that DCDILP-DMA uses DAGMA for the local learning subproblems in Phase 2, its overall wall time becomes 100×100\times100 × lower than DAGMA when d𝑑ditalic_d surpasses 400400400400. As discussed in Section 3.3, such speedups are largely due to the parallelization of the local learning subproblems in Phase 2. Despite that the parallelization of subproblem tasks in the present experiment actually encounters congestion (when d𝑑ditalic_d grows) given the predefined limitation on the maximal number of CPU cores, the speedups obtained by DCDILP are already considerable for problems with more than 1000 variables.

4.3 Discussions

Effects of the ILP in Phase 3.

In the same settings as the above experiments, we showcase the effects of reconciliation process via the ILP-based method (formulated in Section 3.2, (5)–(11)) in comparison with the naive merge result B^^𝐵\widehat{B}over^ start_ARG italic_B end_ARG (1).

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 7: Results of DCDILP for learning the linear SEM on ER2 graphs with n=50d𝑛50𝑑n=50ditalic_n = 50 italic_d samples. Upper: DCDILP-GES; Lower: DCDILP-DMA.

Figure 7 shows their comparisons in the representative case with ER2 data. The scores in this figure shows that for both DCDILP-GES and DCDILP-DMA, the reconciliation process by the ILP method achieves significant gains in learning accuracy than the naive merge result B^^𝐵\widehat{B}over^ start_ARG italic_B end_ARG (1) under all problem dimensions. More precisely, the median TPRs of DCDILP are slightly below those of the naive merge while staying around 90% or above; more interestingly, the median FDRs of DCDILP are significantly reduced than those of the naive merge. These comparisons provide solid validation to the effects of the ILP method.

Evaluation of finite-sample cases.

In the same settings as the above experiments, we evaluate learning accuracy of DCDILP under different sample sizes. The sample size of observational data varies between 5d5𝑑5d5 italic_d and 50d50𝑑50d50 italic_d for causal discovery tasks of d=100𝑑100d=100italic_d = 100 variables.

Refer to caption
Refer to caption
Refer to caption
(a) ER2, Gaussian noise
Refer to caption
Refer to caption
Refer to caption
(b) ER2, Gumbel noise
Figure 8: Results of DCDILP-GES using oracle and estimated MBs respectively. The observational data are generated from linear SEMs on ER2 graphs with different types of noise.

The comparative results in Figure 8 give an empirical insight in the sample requirement of DCDILP-GES under the given choices for Phase 1 (MBs by inverse covariance estimation) and Phase 2 (using GES). The learning accuracy of DCDILP-GES improves with increasing sample size. For Gaussian SEMs, the accuracy attains a level comparable to the case with oracle MBs for n𝑛nitalic_n reaches 20d20𝑑20d20 italic_d; and for Gumbel-noised SEMs, the accuracy of DCDILP-GES becomes comparable to DCDILP-GES (MB*) when n𝑛nitalic_n reaches 40d40𝑑40d40 italic_d. Future work should include more sophisticated or more sample-efficient methods for Markov blanket discovery.

5 Conclusion and perspectives

The main contribution of this paper, DCDILP, is a divide-and-conquer method for learning causal structures from data. Firstly, it takes advantage of the natural decomposition through the Markov blankets associated to each variable. Secondly, a causal learner is deployed on MB(Xi)MBsubscript𝑋𝑖\mbox{\bf MB}(X_{i})MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) to learn causal relations involving Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Lastly, these causal relations are reconciled through ILP. The main novelty of the approach is the reconciliation process formulated as an ILP problem.

The computational efficiency of DCDILP, empirically demonstrated on problems involving more than a thousand variables, can be explained by (i) on one hand, the learning phase can be achieved in parallel, considering the different (overlapping) subsets of variables; (ii) on the other hand, the reconciliation of the causal relations found in Phase-2 can be delegated to an efficient general-purpose ILP solver.

DCDILP defines a general scheme as all its three phases can be implemented by different algorithmic components. In this work, a basic empirical inverse covariance estimation method is considered in Phase-1, GES and DAGMA are considered in Phase-2, and the Gurobi-based ILP solver is considered in Phase-3. The limitation of the approach is that its accuracy depends naturally on sufficiently accurate Markov blankets to be estimated in Phase-1 and on the causal relations discovered in Phase-2.

A first research perspective is to consider the interactions of the algorithmic components and act on their gearing. For instance, the over-estimation of Markov blankets (including spurious variables) in Phase-1, or a high false discovery ratio in Phase-2  can be handled through relaxed formulations of the ILP problem tackled in Phase-3. A longer term perspective is to take advantage of the multiple solutions discovered by the ILP solver in Phase-3: their intersection could be used to characterize the ‘backbone’ of the sought causal graph, and to focus the Phase-2 on refining this backbone.

Appendix A Proof

Proof of Proposition 4.

For B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT (i[d]𝑖delimited-[]𝑑i\in[d]italic_i ∈ [ italic_d ]) obtained in Phase-2 of Algorithm 1, the edge set of its graph is restricted within

Ci={(i,k):kMB(Xi)}{(k,i):kMB(Xi)}.subscript𝐶𝑖conditional-set𝑖𝑘𝑘MBsubscript𝑋𝑖conditional-set𝑘𝑖𝑘MBsubscript𝑋𝑖\displaystyle C_{i}=\{(i,k):k\in\mbox{\bf MB}(X_{i})\}\cup\{(k,i):k\in\mbox{% \bf MB}(X_{i})\}.italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = { ( italic_i , italic_k ) : italic_k ∈ MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) } ∪ { ( italic_k , italic_i ) : italic_k ∈ MB ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) } .

Hence the intersection of the support graphs of B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT and B^(j)superscript^𝐵𝑗\widehat{B}^{(j)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT for any pair ij𝑖𝑗i\neq jitalic_i ≠ italic_j is included in CiCj={(i,j),(j,i)}subscript𝐶𝑖subscript𝐶𝑗𝑖𝑗𝑗𝑖C_{i}\cap C_{j}=\{(i,j),(j,i)\}italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∩ italic_C start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = { ( italic_i , italic_j ) , ( italic_j , italic_i ) }. Therefore, by Definition 3, B^(i)superscript^𝐵𝑖\widehat{B}^{(i)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT and B^(j)superscript^𝐵𝑗\widehat{B}^{(j)}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT result in a merge conflict if and only if

B^ij(i)B^ij(j) or B^ji(i)B^ji(j).subscriptsuperscript^𝐵𝑖𝑖𝑗subscriptsuperscript^𝐵𝑗𝑖𝑗 or subscriptsuperscript^𝐵𝑖𝑗𝑖subscriptsuperscript^𝐵𝑗𝑗𝑖\displaystyle\widehat{B}^{(i)}_{ij}\neq\widehat{B}^{(j)}_{ij}\text{~{}or~{}}% \widehat{B}^{(i)}_{ji}\neq\widehat{B}^{(j)}_{ji}.over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ≠ over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT or over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT ≠ over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT .

As a consequence, all merge conflicts can be classified into the following three types, according to the number of nonzeros in the quadruplet Qij={B^ij(i),B^ij(j),B^ji(i),B^ji(j)}subscript𝑄𝑖𝑗subscriptsuperscript^𝐵𝑖𝑖𝑗subscriptsuperscript^𝐵𝑗𝑖𝑗subscriptsuperscript^𝐵𝑖𝑗𝑖subscriptsuperscript^𝐵𝑗𝑗𝑖Q_{ij}=\{\widehat{B}^{(i)}_{ij},\widehat{B}^{(j)}_{ij},\widehat{B}^{(i)}_{ji},% \widehat{B}^{(j)}_{ji}\}italic_Q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = { over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT , over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT , over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT , over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT }:

Table 2: Classification of all merge conflicts from a pair of local results. The symbol * indicates a nonzero number.
Type B^ij(i)subscriptsuperscript^𝐵𝑖𝑖𝑗\hat{B}^{(i)}_{ij}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT B^ji(i)subscriptsuperscript^𝐵𝑖𝑗𝑖\hat{B}^{(i)}_{ji}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT B^ij(j)subscriptsuperscript^𝐵𝑗𝑖𝑗\hat{B}^{(j)}_{ij}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT B^ji(j)subscriptsuperscript^𝐵𝑗𝑗𝑖\hat{B}^{(j)}_{ji}over^ start_ARG italic_B end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT Graphical model Characteristics
(3) Undirected * 0 * * (ij)𝑖𝑗(i\rightarrow j)( italic_i → italic_j ) vs (ij)𝑖𝑗(i\leftrightarrow j)( italic_i ↔ italic_j ) nnz(Qij)=3nnzsubscript𝑄𝑖𝑗3\operatorname{nnz}(Q_{ij})=3roman_nnz ( italic_Q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) = 3
0 * * * (ij)𝑖𝑗(i\leftarrow j)( italic_i ← italic_j ) vs (ij)𝑖𝑗(i\leftrightarrow j)( italic_i ↔ italic_j )
* * 0 * (ij)𝑖𝑗(i\leftrightarrow j)( italic_i ↔ italic_j ) vs (ij)𝑖𝑗(i\leftarrow j)( italic_i ← italic_j )
(2) Acute * 0 0 * (ij)𝑖𝑗(i\rightarrow j)( italic_i → italic_j ) vs (ij)𝑖𝑗(i\leftarrow j)( italic_i ← italic_j ) nnz(Qij)=2nnzsubscript𝑄𝑖𝑗2\operatorname{nnz}(Q_{ij})=2roman_nnz ( italic_Q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) = 2
0 * * 0 (ij)𝑖𝑗(i\leftarrow j)( italic_i ← italic_j ) vs (ij)𝑖𝑗(i\rightarrow j)( italic_i → italic_j )
* * 0 0 (ij)𝑖𝑗(i\leftrightarrow j)( italic_i ↔ italic_j ) vs (ij)(i\perp\!\!\!\!\perp j)( italic_i ⟂ ⟂ italic_j )
0 0 * * (ij)(i\perp\!\!\!\!\perp j)( italic_i ⟂ ⟂ italic_j ) vs (ij)𝑖𝑗(i\leftrightarrow j)( italic_i ↔ italic_j )
(1) Addition * 0 0 0 (ij)𝑖𝑗(i\rightarrow j)( italic_i → italic_j ) vs (ij)(i\perp\!\!\!\!\perp j)( italic_i ⟂ ⟂ italic_j ) nnz(Qij)=1nnzsubscript𝑄𝑖𝑗1\operatorname{nnz}(Q_{ij})=1roman_nnz ( italic_Q start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) = 1
0 * 0 0 (ij)𝑖𝑗(i\leftarrow j)( italic_i ← italic_j ) vs (ij)(i\perp\!\!\!\!\perp j)( italic_i ⟂ ⟂ italic_j )
0 0 0 * (ij)(i\perp\!\!\!\!\perp j)( italic_i ⟂ ⟂ italic_j ) vs (ij)𝑖𝑗(i\leftarrow j)( italic_i ← italic_j )

\square

Appendix B Algorithms

B.1 An empirical inverse covariance estimator

In the implementation of DCDILP, we use a basic empirical inverse covariance estimator, detailed in Algorithm 2, for the inference of Markov blankets.

Algorithm 2 Empirical inverse covariance estimator
0:  Data matrix Xn×d𝑋superscript𝑛𝑑X\in\mathbb{R}^{n\times d}italic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_d end_POSTSUPERSCRIPT, parameter λ1(0,1)subscript𝜆101\lambda_{1}\in(0,1)italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ ( 0 , 1 )
0:  Θ^λ1d×dsubscript^Θsubscript𝜆1superscript𝑑𝑑\widehat{\Theta}_{\lambda_{1}}\in\mathbb{R}^{d\times d}over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT
1:  Compute empirical covariance and its inverse:
C^=1n(XX¯)T(XX¯)andΘ^=C^,^𝐶1𝑛superscript𝑋¯𝑋T𝑋¯𝑋and^Θsuperscript^𝐶\displaystyle\widehat{C}=\frac{1}{n}{(X-\bar{X})}^{\mathrm{T}}(X-\bar{X})\quad% \text{and}\quad\widehat{\Theta}=\widehat{C}^{\dagger},over^ start_ARG italic_C end_ARG = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ( italic_X - over¯ start_ARG italic_X end_ARG ) start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT ( italic_X - over¯ start_ARG italic_X end_ARG ) and over^ start_ARG roman_Θ end_ARG = over^ start_ARG italic_C end_ARG start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT , (12)
where C^superscript^𝐶\widehat{C}^{\dagger}over^ start_ARG italic_C end_ARG start_POSTSUPERSCRIPT † end_POSTSUPERSCRIPT denotes the pseudo-inverse of C^^𝐶\widehat{C}over^ start_ARG italic_C end_ARG.
2:  Element-wise thresholding on off-diagonal entries:
diag(Θ^λ1)diagsubscript^Θsubscript𝜆1\displaystyle\operatorname*{diag}(\widehat{\Theta}_{\lambda_{1}})roman_diag ( over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) :=diag(Θ^),assignabsentdiag^Θ\displaystyle:=\operatorname*{diag}(\widehat{\Theta}),:= roman_diag ( over^ start_ARG roman_Θ end_ARG ) ,
(Θ^λ1)offsubscriptsubscript^Θsubscript𝜆1off\displaystyle(\widehat{\Theta}_{\lambda_{1}})_{\mathrm{off}}( over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT :=(Θ^off,λ1Θ^offmax),assignabsentsubscript^Θoffsubscript𝜆1subscriptnormsubscript^Θoffmax\displaystyle:=\mathbb{H}(\widehat{\Theta}_{\mathrm{off}},\lambda_{1}\|% \widehat{\Theta}_{\mathrm{off}}\|_{\mathrm{max}}),:= blackboard_H ( over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT , italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∥ over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ) , (13)
where \mathbb{H}blackboard_H is defined as
(y,τ)={yif |y|τ0 otherwise.𝑦𝜏cases𝑦if 𝑦𝜏0 otherwise.\mathbb{H}(y,\tau)=\left\{\begin{array}[]{ll}y&\mbox{if }|y|\geq\tau\\ 0&\mbox{ }\text{otherwise.}\end{array}\right.blackboard_H ( italic_y , italic_τ ) = { start_ARRAY start_ROW start_CELL italic_y end_CELL start_CELL if | italic_y | ≥ italic_τ end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_otherwise. end_CELL end_ROW end_ARRAY

In the computation of (12), the pseudo-inverse coincides with the inverse of C^^𝐶\widehat{C}over^ start_ARG italic_C end_ARG when C^^𝐶\widehat{C}over^ start_ARG italic_C end_ARG is positive definite (e.g., when the number n𝑛nitalic_n of samples is sufficiently large). In (13), the subscript ‘off’ indicates the following filtering operation

Θoff={Θij:ij}subscriptΘoffconditional-setsubscriptΘ𝑖𝑗𝑖𝑗\Theta_{\mathrm{off}}=\{\Theta_{ij}:i\neq j\}roman_Θ start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT = { roman_Θ start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT : italic_i ≠ italic_j }

where the indices of the remaining (off-diagonal) entries are preserved.

Selection of λ1subscript𝜆1\lambda_{1}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT for Algorithm 2.

In the implementation of Phase 1 of DCDILP, a simple grid search is included for selecting values of λ1subscript𝜆1\lambda_{1}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT for the empirical inverse covariance estimator (Algorithm 2). The total time for this parameter selection corresponds to the computation time of Phase 1 in the benchmark of Section 4.

Given that the sought causal structures have an average degree 1deg41deg41\leq\operatorname{deg}\leq 41 ≤ roman_deg ≤ 4, the target sparsity of Θ^λ1subscript^Θsubscript𝜆1\widehat{\Theta}_{\lambda_{1}}over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT by Algorithm 2 is bounded by ρ¯deg=max(degd)2.0%subscript¯𝜌degdeg𝑑percent2.0\bar{\rho}_{\operatorname{deg}}=\max(\frac{\operatorname{deg}}{d})\approx 2.0\%over¯ start_ARG italic_ρ end_ARG start_POSTSUBSCRIPT roman_deg end_POSTSUBSCRIPT = roman_max ( divide start_ARG roman_deg end_ARG start_ARG italic_d end_ARG ) ≈ 2.0 % for graphs with d200𝑑200d\geq 200italic_d ≥ 200 nodes. This gives us an approximate target percentile of around 98%percent9898\%98 %, i.e., top 2%percent22\%2 % edges in terms of absolute weight of Θ^offsubscript^Θoff\widehat{\Theta}_{\mathrm{off}}over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT. In other words, the maximal value λ1maxsuperscriptsubscript𝜆1\lambda_{1}^{\max}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_max end_POSTSUPERSCRIPT of the grid search area is set as λ1max:=|Θ^off(τ98)|Θ^offmaxassignsuperscriptsubscript𝜆1subscript^Θoffsubscript𝜏98subscriptnormsubscript^Θoffmax\lambda_{1}^{\max}:=\frac{|\widehat{\Theta}_{\mathrm{off}}(\tau_{\mathrm{98}})% |}{\|\widehat{\Theta}_{\mathrm{off}}\|_{\mathrm{max}}}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_max end_POSTSUPERSCRIPT := divide start_ARG | over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT ( italic_τ start_POSTSUBSCRIPT 98 end_POSTSUBSCRIPT ) | end_ARG start_ARG ∥ over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT end_ARG, where τ98subscript𝜏98\tau_{\mathrm{98}}italic_τ start_POSTSUBSCRIPT 98 end_POSTSUBSCRIPT refers to the index of the 98989898-th percentile in {|Θ^off|}subscript^Θoff\{|\widehat{\Theta}_{\mathrm{off}}|\}{ | over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT roman_off end_POSTSUBSCRIPT | }. For the experiments with ER2 graphs in Section 4, the estimated λ1maxsuperscriptsubscript𝜆1\lambda_{1}^{\max}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_max end_POSTSUPERSCRIPT is 6.101superscript6.1016.10^{-1}6.10 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT. Hence, the search grid of λ1subscript𝜆1\lambda_{1}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is set up as nI1=20subscript𝑛subscript𝐼120n_{I_{1}}=20italic_n start_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 20 equidistant values on I1=[102,6.101]subscript𝐼1superscript102superscript6.101I_{1}=[10^{-2},6.10^{-1}]italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = [ 10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT , 6.10 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ].

Refer to caption
(a) Distance(Θ^λ1,Θ)subscript^Θsubscript𝜆1superscriptΘ(\widehat{\Theta}_{\lambda_{1}},\Theta^{\star})( over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , roman_Θ start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT )
Refer to caption
(b) Criterion used
Figure 9: Grid search of λ1subscript𝜆1\lambda_{1}italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT with Algorithm 2 based on criterion C(λ1)𝐶subscript𝜆1C(\lambda_{1})italic_C ( italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) (14). Data X𝑋Xitalic_X is from linear SEM with Gaussian noise, on ER2 graph with d=200𝑑200d=200italic_d = 200 nodes.

The selection criterion, similar to GraphicalLasso, is defined as

C(λ1):=tr(C^Θ^λ1)logdet(Θ~λ1),assign𝐶subscript𝜆1tr^𝐶subscript^Θsubscript𝜆1subscript~Θsubscript𝜆1\displaystyle C(\lambda_{1}):=\operatorname*{tr}(\widehat{C}\widehat{\Theta}_{% \lambda_{1}})-\log\det(\tilde{\Theta}_{\lambda_{1}}),italic_C ( italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) := roman_tr ( over^ start_ARG italic_C end_ARG over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) - roman_log roman_det ( over~ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) , (14)

where Θ~λ1=Θ^λ1+910diag(Θ^λ1)subscript~Θsubscript𝜆1subscript^Θsubscript𝜆1910diagsubscript^Θsubscript𝜆1\tilde{\Theta}_{\lambda_{1}}=\widehat{\Theta}_{\lambda_{1}}+\frac{9}{10}% \operatorname*{diag}(\widehat{\Theta}_{\lambda_{1}})over~ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + divide start_ARG 9 end_ARG start_ARG 10 end_ARG roman_diag ( over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) is used in the logdet\log\detroman_log roman_det-evaluation for an enhanced positive definiteness in all cases.

Figure 9 shows the criterion values compared to the Hamming distances with the oracle precision matrix Θ:=ϕ(B)assignsuperscriptΘitalic-ϕsuperscript𝐵\Theta^{\star}:=\phi(B^{\star})roman_Θ start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT := italic_ϕ ( italic_B start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT ). We observe that the selection criterion with argminI1C(λ1)subscriptargminsubscript𝐼1𝐶subscript𝜆1\operatorname*{arg\,min}_{I_{1}}C(\lambda_{1})start_OPERATOR roman_arg roman_min end_OPERATOR start_POSTSUBSCRIPT italic_I start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_C ( italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) gives an answer that is rather close to the optimal value in terms of distance of Θ^λ1subscript^Θsubscript𝜆1\widehat{\Theta}_{\lambda_{1}}over^ start_ARG roman_Θ end_ARG start_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT to the oracle precision matrix ΘsuperscriptΘ\Theta^{\star}roman_Θ start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT.

Appendix C Experiments

C.1 Evaluation metrics

The graph metrics for the comparison of graph edge sets are the commonly used (e.g., by the aforementioned baseline methods) ones as follows:
(1) TPR = TP/T  (higher is better),
(2) FDR = (R + FP)/P  (lower is better),
(3) FPR = (R + FP)/F  (lower is better),
(4) SHD = E +M + R  (lower is better).
More precisely, SHD is the (minimal) total number of edge additions (E), deletions (M), and reversals (R) needed to convert an estimated DAG into a true DAG. Since a pair of directed graphs are compared, a distinction between True Positives (TP) and Reversed edges (R) is needed: the former is estimated with correct direction whereas the latter is not. Likewise, a False Positive (FP) is an edge that is not in the undirected skeleton of the true graph. In addition, Positive (P) is the set of estimated edges, True (T) is the set of true edges, False (F) is the set of non-edges in the ground truth graph. Finally, let (E) be the extra edges from the skeleton, (M) be the missing edges from the skeleton.

Running time of DCDILP.

The running time benchmark of DCDILP-GES in the causal discovery tasks on ER1 and ER2 data is given in Figure 10.

The running time benchmark of DCDILP-DMA is given in Figure 11.

Refer to caption
Refer to caption
(a) Results on ER1 data
Refer to caption
Refer to caption
(b) Results on ER2 data
Figure 10: Wall time of DCDILP-GES and GES for learning linear SEMs on ER1 and ER2 graphs. Left: wall time of the three phases of DCDILP-GES in linear scale; Right: wall time of the two algorithms depending on d𝑑ditalic_d.
Refer to caption
Refer to caption
(a) Results on ER1 data
Refer to caption
Refer to caption
(b) Results on ER2 data
Figure 11: Wall time of DCDILP-DMA (short for DCDILP-DAGMA) and DAGMA for learning the linear SEM on ER1 and ER2 graphs. Left: wall time of the three phases of DCDILP-DMA in linear scale; Right: wall time (in log scale) of the two algorithms depending on d𝑑ditalic_d.

C.2 Effects of the reconciliation process by DCDILP (Phase-3)

The proposed ILP-based method is evaluated in comparison with the naive merge B^^𝐵\widehat{B}over^ start_ARG italic_B end_ARG (1) under the same experimental settings with different sample sizes. The supplementary results are shown in Figure 12Figure 16.

Refer to caption
Refer to caption
Refer to caption
Figure 12: Results of DCDILP-GES for learning the linear SEM on ER1 graphs, with n=10d𝑛10𝑑n=10ditalic_n = 10 italic_d samples. LP-based method versus the naive merge.
Refer to caption
Refer to caption
Refer to caption
Figure 13: Results of DCDILP-GES for learning the linear SEM on ER1 graphs, with n=50d𝑛50𝑑n=50ditalic_n = 50 italic_d samples. LP-based method versus the naive merge.
Refer to caption
Refer to caption
Refer to caption
Figure 14: Results of DCDILP-GES for learning the linear SEM on ER2 graphs, with n=10d𝑛10𝑑n=10ditalic_n = 10 italic_d samples. LP-based method versus the naive merge.
Refer to caption
Refer to caption
Refer to caption
Figure 15: Results of DCDILP-GES for learning the linear SEM on ER2 graphs, with n=50d𝑛50𝑑n=50ditalic_n = 50 italic_d samples. LP-based method versus the naive merge.
Refer to caption
Refer to caption
Refer to caption
Figure 16: Results of DCDILP-DAGMA for learning the linear SEM on ER3 graphs, with n=10d𝑛10𝑑n=10ditalic_n = 10 italic_d samples. LP-based method versus the naive merge.

References

  • Aragam et al. [2019] Bryon Aragam, Arash Amini, and Qing Zhou. Globally optimal score-based learning of directed acyclic graphs in high-dimensions. Advances in Neural Information Processing Systems, 32, 2019.
  • Arjovsky et al. [2019] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Bello et al. [2022] Kevin Bello, Bryon Aragam, and Pradeep Ravikumar. DAGMA: Learning dags via m-matrices and a log-determinant acyclicity characterization. Advances in Neural Information Processing Systems, 35:8226–8239, 2022.
  • Chickering [1996] David Maxwell Chickering. Learning bayesian networks is NP-complete. Learning from data: Artificial intelligence and statistics V, pages 121–130, 1996.
  • Chickering [2002a] David Maxwell Chickering. Learning equivalence classes of bayesian-network structures. The Journal of Machine Learning Research, 2:445–498, 2002a.
  • Chickering [2002b] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002b.
  • Cussens [2023] James Cussens. Branch-price-and-cut for causal discovery. In 2nd Conference on Causal Learning and Reasoning, 2023.
  • Cussens et al. [2017] James Cussens, Matti Järvisalo, Janne H Korhonen, and Mark Bartlett. Bayesian network structure learning with integer programming: Polytopes, facets and complexity. Journal of Artificial Intelligence Research, 58:185–229, 2017.
  • Gao et al. [2017] Tian Gao, Kshitij Fadnis, and Murray Campbell. Local-to-global bayesian network structure learning. In International Conference on Machine Learning, pages 1193–1202. PMLR, 2017.
  • Gu and Zhou [2020] Jiaying Gu and Qing Zhou. Learning big gaussian bayesian networks: Partition, estimation and fusion. Journal of machine learning research, 21(158):1–31, 2020.
  • Gurobi Optimization [2023] Gurobi Optimization. Gurobi Optimizer Reference Manual, 2023. URL https://www.gurobi.com.
  • Jaakkola et al. [2010] Tommi Jaakkola, David Sontag, Amir Globerson, and Marina Meila. Learning bayesian network structure using lp relaxations. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 358–365. JMLR Workshop and Conference Proceedings, 2010.
  • Loh and Bühlmann [2014] Po-Ling Loh and Peter Bühlmann. High-dimensional learning of linear causal networks via inverse covariance estimation. The Journal of Machine Learning Research, 15(1):3065–3105, 2014.
  • Loh and Wainwright [2012] Po-Ling Loh and Martin J Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. Advances in Neural Information Processing Systems, 25, 2012.
  • Lopez et al. [2022] Romain Lopez, Jan-Christian Hütter, Jonathan Pritchard, and Aviv Regev. Large-scale differentiable causal discovery of factor graphs. Advances in Neural Information Processing Systems, 35:19290–19303, 2022.
  • Margaritis and Thrun [1999] Dimitris Margaritis and Sebastian Thrun. Bayesian network induction via local neighborhoods. Advances in neural information processing systems, 12, 1999.
  • Meek [1995] Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI’95, pages 403–410. Morgan Kaufmann Publishers Inc., 1995. ISBN 1558603859.
  • Mokhtarian et al. [2021] Ehsan Mokhtarian, Sina Akbari, AmirEmad Ghassami, and Negar Kiyavash. A recursive Markov boundary-based approach to causal structure learning. In The KDD’21 Workshop on Causal Discovery, pages 26–54. PMLR, 2021.
  • Ng et al. [2020] Ignavier Ng, AmirEmad Ghassami, and Kun Zhang. On the role of sparsity and DAG constraints for learning linear DAGs. Advances in Neural Information Processing Systems, 33:17943–17954, 2020.
  • Ng et al. [2021] Ignavier Ng, Yujia Zheng, Jiji Zhang, and Kun Zhang. Reliable causal discovery with improved exact search and weaker assumptions. Advances in Neural Information Processing Systems, 34:20308–20320, 2021.
  • Pearl [2000] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
  • Peters et al. [2016] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947–1012, 2016.
  • Peters et al. [2017] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • Rahman et al. [2021] Md Musfiqur Rahman, Ayman Rasheed, Md Mosaddek Khan, Mohammad Ali Javidian, Pooyan Jamshidi, and Md Mamun-Or-Rashid. Accelerating recursive partition-based causal structure learning. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pages 1028–1036, 2021.
  • Ramsey et al. [2017] Joseph Ramsey, Madelyn Glymour, Ruben Sanchez-Romero, and Clark Glymour. A million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images. International journal of data science and analytics, 3:121–129, 2017.
  • Sauer and Geiger [2021] Axel Sauer and Andreas Geiger. Counterfactual generative networks. In International Conference on Learning Representations (ICLR), 2021.
  • Spirtes et al. [2000] Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman. Causation, prediction, and search. MIT press, 2000.
  • Tsamardinos et al. [2003] Ioannis Tsamardinos, Constantin F Aliferis, Alexander R Statnikov, and Er Statnikov. Algorithms for large scale Markov blanket discovery. In FLAIRS conference, volume 2, pages 376–380. St. Augustine, FL, 2003.
  • Tsirtsis et al. [2021] Stratis Tsirtsis, Amir-Hossein Karimi, Ana Lucic, Manuel Gomez-Rodriguez, Isabel Valera, and Hima Lakkaraju. ICML workshop on algorithmic recourse. 2021.
  • Wolsey [2020] Laurence A Wolsey. Integer programming. John Wiley & Sons, 2020.
  • Wu et al. [2020] Xingyu Wu, Bingbing Jiang, Kui Yu, chunyan Miao, and Huanhuan Chen. Accurate Markov boundary discovery for causal feature selection. IEEE Transactions on Cybernetics, 50(12):4983–4996, 2020. doi: 10.1109/TCYB.2019.2940509.
  • Wu et al. [2022] Xingyu Wu, Bingbing Jiang, Yan Zhong, and Huanhuan Chen. Multi-target Markov boundary discovery: Theory, algorithm, and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • Wu et al. [2023] Xingyu Wu, Bingbing Jiang, Tianhao Wu, and Huanhuan Chen. Practical Markov boundary learning without strong assumptions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10388–10398, 2023.
  • Zhang et al. [2020] Hao Zhang, Shuigeng Zhou, Chuanxu Yan, Jihong Guan, Xin Wang, Ji Zhang, and Jun Huan. Learning causal structures based on divide and conquer. IEEE Transactions on Cybernetics, 52(5):3232–3243, 2020.
  • Zheng et al. [2018] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. DAGs with NO TEARS: Continuous optimization for structure learning. In Advances in Neural Information Processing Systems, volume 31, 2018. URL https://proceedings.neurips.cc/paper/2018/file/e347c51419ffb23ca3fd5050202f9c3d-Paper.pdf.