


Regression Adjustments for Estimating the Global Treatment Effect in Experiments with Interference∗


Alex Chin†
June 5, 2022
arXiv:1808.08683v1 [stat.ME] 27 Aug 2018

Abstract
Standard estimators of the global average treatment effect can be biased in the presence of inter-
ference. This paper proposes regression adjustment estimators for removing bias due to interference
in Bernoulli randomized experiments. We use a fitted model to predict the counterfactual outcomes
of global control and global treatment. Our work differs from standard regression adjustments in that
the adjustment variables are constructed from functions of the treatment assignment vector, and that
we allow the researcher to use a collection of any functions correlated with the response, turning the
problem of detecting interference into a feature engineering problem. We characterize the distribution
of the proposed estimator in a linear model setting and connect the results to the standard theory of
regression adjustments under SUTVA. We then propose an estimator that allows for flexible machine
learning estimators to be used for fitting a nonlinear interference functional form, borrowing ideas from
the double machine learning literature. We propose conducting statistical inference via bootstrap and
resampling methods, which allow us to sidestep the complicated dependences implied by interference and
instead rely on empirical covariance structures. Such variance estimation relies on an exogeneity assump-
tion akin to the standard unconfoundedness assumption invoked in observational studies. In simulation
experiments, our methods are better at debiasing estimates than existing inverse propensity weighted
estimators based on neighborhood exposure modeling. We use our method to reanalyze an experiment
concerning weather insurance adoption conducted on a collection of villages in rural China.

Keywords: causal inference, peer effects, SUTVA, A/B testing, exposure models, off-policy evaluation

1 Introduction
The goal in a randomized experiment is often to estimate the total or global average treatment effect (GATE)
of a binary causal intervention variable on a response variable. The GATE is the difference in average
outcomes when all units are exposed to treatment versus when all units are exposed to control. Under the
standard assumption that units do not interfere with each other (Cox 1958), which forms a key part of the
stable unit treatment value assumption (SUTVA) (Rubin 1974, 1980), the global average treatment effect
reduces to the standard average treatment effect.
However, in many social, medical, and online settings the no-interference assumption may fail to hold (Rosen-
baum 2007; Walker and Muchnik 2014; Aral 2016; Taylor and Eckles 2017). In such settings, peer and
spillover effects can bias estimates of the global average treatment effect. In the past decade, there has been
a flurry of literature proposing methods for handling interference, mostly focusing on cases in which struc-
tural assumptions about the nature of interference are known. For example, if there is a natural grouping
structure to the data, such as households or schools or classrooms, it may be reasonable to assume that
interference exists within but not across groups. Versions of this assumption are known as partial or strat-
ified interference (Hudgens and Halloran 2008). In this case two-stage randomized designs can be used to
decompose direct and indirect effects, which is an approach studied by VanderWeele and Tchetgen Tchetgen
(2011); Tchetgen Tchetgen and VanderWeele (2012); Liu and Hudgens (2014); Baird et al. (2016); Basse
et al. (2017), among others. Baird et al. (2016) study how two-stage, random saturation designs can be
used to estimate dose response curves under the stratified interference assumption. Basse and Feller (2018)
study two-stage experiments in which households with multiple students are assigned to treatment or control.
Other works that propose methods of handling interference include Ogburn and VanderWeele (2014),
which maps out causal diagrams for interference; van der Laan (2014), which studies a targeted maximum
likelihood estimator for time-varying and network-connected units; Choi (2017), which shows how confidence
intervals can be constructed in the presence of monotone treatment effects; and Jagadeesan et al. (2017),
which studies designs for estimating the direct effect that strive to balance the network degrees of treated
and control units.

∗ The author thanks Johan Ugander for helpful comments and suggestions. This work was supported in part by NSF grant IIS-1657104.
† Department of Statistics, Stanford University, Stanford, CA, 94305 USA (ajchin@stanford.edu)

The modus operandi for general or arbitrary interference is the method of exposure modeling, in which
the researcher defines equivalence classes of treatments that inform the interference pattern. Aronow and
Samii (2017) develop a general framework for analyzing inverse propensity weighted (Horvitz-Thompson-
and Hájek-style) estimators under correct specification of local exposure models. The exposure model often
used is some version of an assumption that the potential outcomes of unit i are constant conditional on all
treatments in a local neighborhood of i, or that the potential outcomes are a monotone function of such
treatments. This assumption, known as neighborhood treatment response (NTR), is a generalization of partial
and stratified interference to the general network setting (Manski 2013). Methods for handling interference
often rely on neighborhood treatment response as a core assumption. For example, Sussman and Airoldi
(2017) develop unbiased estimators for various parametric models of interference that are all restrictions on
the NTR condition, and Forastiere et al. (2016) propose propensity score estimators for observational studies
using the NTR assumption.
Aronow and Samii (2017) use their methods to analyze the results of a field experiment on an anti-
conflict program in middle schools in New Jersey. By defining appropriate exposure models, they are able to
estimate a direct effect (the effect of receiving the anti-conflict intervention), a spillover effect (the effect of
being friends with some students who received the anti-conflict intervention), and a school effect (the effect
of attending a school in which some students received the anti-conflict intervention). The network structure
consists of 56 disjoint social networks (schools), comprising 24,191 students in the original Paluck et al.
(2016) study and a subset of 2,050 students studied in the Aronow and Samii (2017) analysis. There are a
number of similar studies in which the target of scientific inquiry is the quantification of peer or spillover
effects and where the dataset permits doing so by being comprised of “many sparse networks.” Studies which
consist of randomized experiments on such social networks include Banerjee et al. (2013), which studies a
microfinance loan program in villages in India; Cai et al. (2015), which studies a weather insurance program
for farmers in rural China; Kim et al. (2015), which concerns public health interventions such as water
purification and microvitamin tablets in villages in Honduras; and Beaman et al. (2018), which explores
social diffusion of a new agricultural technology among farmers in Malawi. (Some studies thereof do not
explicitly aim to understand spillover effects—for example Kim et al. (2015) and Beaman et al. (2018) are
concerned primarily with strategies for targeting influential individuals—but the presence of such effects is
still crucial for their purposes.) In these settings, exposure modeling may be (and has been) a successful way
of decomposing direct and spillover effects.
How should one proceed if the goal is estimation of the global treatment effect rather than a decomposition
into direct and spillover effects? In this setting interference is a nuisance, not an object of intrinsic scientific
interest. Any exposure model, then, is only useful if estimates of a global treatment effect resulting from the
exposure model are actually “close” to the true global treatment effect. Eckles et al. (2017) discuss some of
the difficulties of working in this setting:

It is unclear how substantive judgment can directly inform the selection of an exposure model
for interference in networks—at least when the vast majority of vertices are in a single connected
component. Interference is often expected because of social interactions (i.e., peer effects) where
vertices respond to their neighbors’ behaviors: in discrete time, the behavior of a vertex at t is
affected by the behavior of its neighbors at t − 1; if this is the case, then the behavior of a vertex
at t would also be affected by the behavior of its neighbors’ neighbors at t − 2, and so forth.
Such a process will result in violations of the NTR assumption, and many other assumptions
that would make analysis tractable.

In this setting, the primary tool that has developed in the literature is the method of graph cluster random-
ization (Ugander et al. 2013), where researchers use a clustered design in which the clusters are selected
according to the structure of the graph in order to lower the variance of NTR-based inverse propensity es-
timators. Eckles et al. (2017) provide theoretical results and simulation experiments to show how clustered
designs can reduce bias due to interference. Clusters can be obtained using algorithms developed in the
graph partitioning and community detection literature (Fortunato 2010; Ugander and Backstrom 2013).
While the graph clustering approach can be effective at removing some bias, the structure of real-world
empirical networks may make it difficult to obtain satisfactory bias reduction via clustering, which relies
on having good quality graph cuts. The “six degrees of separation” phenomenon is well-documented in
large social networks (Ugander et al. 2011; Backstrom et al. 2012), and the average distance between two
Facebook users in February 2016 was just 3.5 (Bhagat et al. 2016). Furthermore, most users belong to
one large connected component and are unlikely to separate cleanly into evenly-sized clusters. In a graph
clustered experiment run at LinkedIn, the optimal clustering strategy used maintained only 35.59% of edges
between nodes of the same cluster (Saveski et al. 2017, Table 1), suggesting that bias remains even after
clustering. Figure 1 provides an example illustration of how the structure of the network can markedly affect
how much we might expect cluster randomization to help.

Figure 1: (left) A subset of 16 nearly-disjoint Chinese villages, comprising 822 nodes, from an experiment
regarding weather insurance adoption conducted by Cai et al. (2015). The setup of many, sparse networks
is similar to that in the anti-conflict school dataset from Paluck et al. (2016). (right) The largest connected
component of the Caltech Facebook network, with 762 nodes, from a single day snapshot in September 2005,
taken from the facebook100 dataset (Traud et al. 2011, 2012). Networks were plotted with the ggnet2
function (Tyner et al. 2017) in the GGally package, using the default Fruchterman-Reingold force-directed
layout (Fruchterman and Reingold 1991). We should not be surprised if methods for handling interference
that might work well in the collection of networks on the left, such as exposure modeling and graph clustering,
do not work so well in the network on the right.

Such experimental designs also face practical hurdles. Though cluster randomized controlled trials are
commonly used in science and medicine, existing experimentation platform infrastructure in some organiza-
tions may only exist for standard (i.i.d.) randomized experiments, in which case adapting the design and
analysis pipelines for graph cluster randomization would require significant ad hoc engineering effort. In
regimes of only mild interference, it may simply not be worth the trouble to run a clustered or two-stage
experiment, especially if there is no way to know a priori how much bias from interference will be present.
Instead, the practitioner would prefer to have a data-adaptive debiasing mechanism that can be applied to
an experiment that has already been run. Ideally, such estimators provide robustness to deviations from
SUTVA yet do not sacrifice too much in precision loss if it turns out interference was weak or non-existent.

This paper contains two main contributions: (a) an alternative regression adjustment strategy for de-
biasing global treatment effect estimators, and (b) a class of bootstrapping and resampling methods for
constructing variance estimates of such estimators. We explore how well the analysis side of an experiment
can be improved in independently-assigned (non-clustered) experiments. Our approach is loosely motivated
by, but not restricted to, the linear-in-means (LIM) family of generative models in the econometrics research
community. A vein of econometrics research stemming from Manski (1993), separate but related to the
interference literature, concerns the identification of endogenous social effects. In linear-in-means models,
the behaviors of individuals are correlated with the average endogenous and exogenous behaviors of their
peers. Manski (1993) distinguishes between exogenous effects, in which the ego’s outcome is influenced by
peer characteristics, and endogenous effects, in which the ego’s outcome is influenced by peer outcomes. This
literature generally focuses on the identifiability of various peer effect parameters (Bramoullé et al. 2009).
For our purposes, the key feature of such a family of models with respect to estimating the global treatment
effect is that in the reduced form, the outcome can be seen to depend on only a low-dimensional number
of functions of peer exogenous characteristics, which can be controlled for in order to estimate the global
treatment effect.
Generally, our strategy is to learn a statistical model that captures the relationship between the out-
comes and a set of unit-level statistics constructed from the treatment vector and the observed network.
These statistics can be viewed as covariates, features, or control variables, and are to be constructed by the
practitioner using domain knowledge. The model is then used to predict the unobserved potential outcomes
of each unit under the two counterfactual scenarios of global treatment and global
control. The approach is thus reminiscent of regression adjustment estimators and off-policy evaluation.
Figure 2 demonstrates how covariate distributions differ between the observed design distribution and the
unobserved global counterfactual distributions of interest.
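
As a minimal sketch of this fit-then-predict strategy (not the paper's implementation), the Python snippet below fits a least-squares working model on the observed assignment and then averages its predictions under the two global counterfactuals. The ring-lattice graph, the interaction feature, the coefficient values, and the function names `neighbor_fraction`, `design`, and `plug_in_gate` are all illustrative assumptions.

```python
import numpy as np

def neighbor_fraction(A, W):
    """Fraction of treated neighbors for each unit (assumes no isolated units)."""
    return (A @ W) / A.sum(axis=1)

def design(W, X):
    """Hypothetical working model: intercept, own treatment, exposure, interaction."""
    W = np.asarray(W, dtype=float)
    return np.column_stack([np.ones_like(W), W, X, W * X])

def plug_in_gate(Y, W, A):
    """Fit on the observed assignment, then predict under global treatment and control."""
    coef = np.linalg.lstsq(design(W, neighbor_fraction(A, W)), Y, rcond=None)[0]
    ones, zeros = np.ones(len(W)), np.zeros(len(W))
    pred_treat = design(ones, neighbor_fraction(A, ones)) @ coef
    pred_ctrl = design(zeros, neighbor_fraction(A, zeros)) @ coef
    return float(np.mean(pred_treat - pred_ctrl))

# Illustrative data: ring lattice (each unit linked to its 4 nearest units),
# with noise-free outcomes generated from an assumed model.
rng = np.random.default_rng(0)
n = 200
idx = np.arange(n)
A = np.zeros((n, n))
for shift in (1, 2):
    A[idx, (idx + shift) % n] = 1
    A[idx, (idx - shift) % n] = 1
W = rng.binomial(1, 0.5, size=n)
X = neighbor_fraction(A, W)
Y = 1.0 + 2.0 * W + 1.5 * X + 0.5 * W * X
tau_hat = plug_in_gate(Y, W, A)  # assumed model implies a global contrast of 2.0 + 1.5 + 0.5
```

Because the counterfactual feature values (all ones, all zeros) lie at the edge of the observed feature range, the prediction step is an extrapolation, and its quality depends on how well the working model is specified.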

Figure 2: (left) Distributions for fraction of treated neighbors $d_i^{-1} \sum_{j \in \mathcal{N}_i} W_j$. (right) Distributions for
number of treated neighbors $\sum_{j \in \mathcal{N}_i} W_j$. Feature distributions are under global exposure to control W = 0
(orange), global exposure to treatment W = 1 (green), and a single observed treatment instance from an iid
Bernoulli(0.5) distribution (blue). Network is the Caltech social graph from the facebook100 dataset (Traud
et al. 2011, 2012). If the response is correlated with one or both of these features, then ideas from off-policy
evaluation of the counterfactual outcomes can guide estimation of the global treatment effect. Even if the
distributions are quite different, as in the left hand picture, if the response can be modeled by a low-dimensional
model then extrapolation may not be too unreasonable.

As mentioned above, our approach is related to the rich literature on regression adjustment estimators.
In the randomized controlled trial setting, regression adjustments are used to control for imbalances due
to randomized assignment of the empirical covariate distributions of different treatment groups, and thus
improve precision of treatment effect estimators. In the observational studies setting, regression adjustments
are used to control for inherent differences between the covariate distributions of different treatment groups.
We heavily borrow tools from that literature, both in the classical regime of using low-dimensional, linear
regression estimators (Freedman 2008a,b; Lin 2013; Berk et al. 2013) and more recent advancements that
can utilize high-dimensional regression and machine learning techniques (Bloniarz et al. 2016; Wager et al.
2016; Wu and Gagnon-Bartsch 2017; Athey et al. 2017b; Chernozhukov et al. 2018).
A curious feature of randomized experiments under interference is that they display characteristics of both
SUTVA randomized experiments and observational studies. Methods for causal inference in observational
studies often require the estimation of both a propensity model and a response model; doubly-robust estimators
are those that can handle misspecification of either the propensity model or the response model, but not
both. Because we work within the setting of a randomized controlled trial, the experimenter controls the
treatment assignment distribution and so no propensity model needs to be estimated. However, we do
need to estimate a response model, and so the presence of unobserved confounders becomes a concern. In
randomized experiments under interference, then, researchers must be wary of the same challenges that beset
drawing causal conclusions from observational datasets, even though the treatments were assigned randomly.
It is therefore necessary to make the same exogeneity assumptions that are used in the observational
studies setting, also known as unconfoundedness, ignorability, or selection on observables. Our version of
exogeneity can be stated roughly as, “Given covariate values, the treatment assignment is random and SUTVA
holds.” This assumption is not generally verifiable from the data, but such an assumption is necessary in
order to make any progress. Ideally, one has access to methods for conducting sensitivity analyses for
interference, but such methods are in their infancy and we refrain from addressing this issue here.
Our estimators have several advantages over existing exposure modeling estimators. The correct specifi-
cation of an exposure model is also a form of exogeneity assumption, yet our approach admits much more
flexible forms of interference. It can handle multiple types of graph features, which do not even have to
be constructed from the same network. Controlling for interference becomes a feature engineering problem
in which the practitioner is free to use his or her domain knowledge to construct appropriate features. If
a feature turns out to be noninformative for interference, no additional bias is incurred (though a penalty
in variance may be paid). Our adjustment framework also reduces to the standard, SUTVA regression
adjustment setup in the event that static, pretreatment covariates are used.
Finally, we propose methods for quantifying the variance of the proposed estimators. Variance estimation
in the presence of interference is generally difficult because of the complicated dependencies created by the
propagation of interference over the network structure. Confidence intervals based on asymptotic approx-
imations may not be reliable since the dependencies can drastically reduce the effective sample size. For
example, the variance of the sample mean of the covariates may not even scale at a $n^{-1}$ rate, where n is the
sample size. In this paper we propose a novel way of taking advantage of the randomization distribution to
produce bootstrap standard errors, assuming unconfoundedness. Since the covariates are constructed by the
researcher from the vector of treatments, and the distribution of treatments is known completely in a ran-
domized experiment, we can calculate via Monte Carlo simulation the sampling distribution of any function
of the design matrix under the randomization distribution. This approach ensures that we properly represent
all of the dependencies exhibited empirically by the data, and can then be used to construct standard errors.
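
The Monte Carlo idea in this paragraph can be sketched as follows. This is only an illustration of simulating a design-based statistic under the known Bernoulli assignment distribution, not the paper's variance estimator; the star graph, the statistic, and the helper names `randomization_draws`, `mean_neighbor_fraction`, and `star_graph` are assumptions chosen to make the dependence visible.

```python
import numpy as np

def randomization_draws(A, pi, statistic, n_draws, rng):
    """Simulate statistic(A, W) over fresh Bernoulli(pi) assignment vectors W."""
    n = A.shape[0]
    return np.array([statistic(A, rng.binomial(1, pi, size=n)) for _ in range(n_draws)])

def mean_neighbor_fraction(A, W):
    """Sample mean over units of the fraction of treated neighbors."""
    return float(np.mean((A @ W) / A.sum(axis=1)))

def star_graph(n):
    """Hypothetical extreme case: unit 0 is adjacent to every other unit."""
    A = np.zeros((n, n))
    A[0, 1:] = 1
    A[1:, 0] = 1
    return A

rng = np.random.default_rng(0)
variances = {}
for n in (50, 200):
    draws = randomization_draws(star_graph(n), 0.5, mean_neighbor_fraction, 2000, rng)
    variances[n] = draws.var(ddof=1)  # Monte Carlo variance of the covariate mean
```

On the star graph every leaf's feature equals the hub's treatment, so the variance of the covariate mean stays close to that of a single Bernoulli(0.5) draw instead of shrinking like 1/n, which is the kind of dependence the simulation is meant to capture.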
The remainder of this paper is structured as follows. In Section 2 we describe the problem and motivate
our approach with an informal discussion of a linear-in-means model. In Section 3 we develop the main
results for linear regression estimators and in Section 4 we show how to extend this to the non-linear setting.
In Section 5 we conduct simulation experiments, in Section 6 we consider an application to an existing field
experiment, and in Section 7 we conclude. All proofs are in the appendix.

2 Setup and estimation in LIM models


We work within the potential outcomes framework, or Rubin causal model (Neyman 1923; Rubin 1974).
Consider a population of $n$ units indexed on the set $[n] = \{1, \dots, n\}$ and let $\mathbf{W} = (W_1, \dots, W_n) \in \mathcal{W} = \{0, 1\}^n$
be a random vector of binary treatments. We will work only with treatments assigned according to
a Bernoulli randomized experimental design:

Assumption 1. $W_i \overset{\text{iid}}{\sim} \text{Bernoulli}(\pi)$ for every unit $i \in [n]$, where $\pi \in (0, 1)$ is the treatment assignment
probability.
The general spirit of our approach can likely be extended to more complicated designs, but our goal
in this paper is to show that substantial analysis-side improvements can be made even under the simplest
possible experimental design.
Suppose that each response lives in an outcome space $\mathcal{Y}$, and is determined by a mean function $\mu_i : \mathcal{W} \to \mathcal{Y}$:
$$Y_i = Y_i(\mathbf{W}) = \mu_i(\mathbf{W}) + \varepsilon_i. \tag{1}$$
In this section we limit ourselves to an informal discussion of point estimation and defer the question of
variance estimation to a future section. The only assumption we require on the residuals, therefore, is an
assumption of strict exogeneity:
$$\mathbb{E}[\varepsilon_i \mid W_1, \dots, W_n] = 0.$$
In particular, no independence or other assumptions about the correlational structure of the residuals are
made in this section, though such assumptions will be necessary for variance estimation, which we address
in Section 3.
Because the units are assumed to belong to a network structure, distinguishing between finite population
and infinite superpopulation setups is not so straightforward. In the SUTVA setting, good estimators
for finite population estimands (or conditional average treatment effects) are usually good estimators for
superpopulation estimands, and vice versa (Imbens 2004). In order to simplify the analysis, we do not work
with fixed potential outcomes Yi (w) for w ∈ W, and allow the residuals εi to be random variables. We
therefore consider additional variation of the potential outcomes coming from repetitions of the experiment,
but we do not consider the units to be sampled from a larger population. We do this because it is easier to
discuss the behavior of εi when they are random variables.
In this paper we focus on estimation of the total or global average treatment effect (GATE), defined by
$$\tau = \frac{1}{n} \sum_{i=1}^{n} \left[ \mathbb{E}[Y_i(\mathbf{1})] - \mathbb{E}[Y_i(\mathbf{0})] \right]. \tag{2}$$

This parameter is called a global treatment effect because it is a contrast of average outcomes between the
cases when the units are globally exposed to treatment (W = 1) and globally exposed to control (W = 0).
Under an assumption of strict exogeneity, in which E[εi |W] = 0, the treatment effect is the difference of
average global exposure means
$$\tau = \frac{1}{n} \sum_{i=1}^{n} \left[ \mu_i(\mathbf{1}) - \mu_i(\mathbf{0}) \right].$$
In order to proceed, we must make assumptions about the structure of the mean function µi .

2.1 A simple linear-in-means model


To illustrate our approach we start with a simple model. Let G be a network with adjacency matrix A. For
simplicity in this paper we will mostly assume that G is simple and undirected, but one can just as easily
use a weighted and directed graph. We emphasize that we assume G is completely known to the researcher.
Let Ni = {j ∈ [n] : Aij = 1} be the neighborhood of unit i and di = |Ni | be the network degree of unit i.
Define
$$X_i = \frac{1}{d_i} \sum_{j \in \mathcal{N}_i} W_j, \tag{3}$$

the fraction of neighbors of i that are in the treatment group. Then take the mean function µi in equation (1)
to be as follows.
Model 1 (Exogenous LIM model).
$$\mu_i(\mathbf{W}) = \alpha + \gamma W_i + \delta X_i.$$

This model is a simple version of a linear-in-means model (Manski 1993). The model contains an intercept
α as well as a direct effect γ, which captures the strength of individual i’s response to changes in its own
treatment assignment. Additionally, the response of unit i is correlated with mean treatment assignment of
its neighbors; Manski (1993) calls δ an exogenous social effect, because it captures the correlation of unit
i’s response with the exogenous characteristics of its neighbors. The interactions are also assumed to be
“anonymous” in that the unit i responds only to the mean neighborhood treatment assignment and not the
identities of those treated neighbors. In this model, unit i responds to its neighbors’ treatments but not to
its neighbors’ outcomes.
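
A minimal sketch of the covariate in equation (3), assuming the graph is given as a dense 0/1 adjacency matrix with no self-loops and no isolated units. The function name `fraction_treated_neighbors` and the four-node path graph are illustrative choices, not part of the paper.

```python
import numpy as np

def fraction_treated_neighbors(A, W):
    """Return X_i = (1 / d_i) * sum_{j in N_i} W_j for every unit i."""
    A = np.asarray(A, dtype=float)
    W = np.asarray(W, dtype=float)
    degrees = A.sum(axis=1)
    if np.any(degrees == 0):
        raise ValueError("every unit needs at least one neighbor")
    return (A @ W) / degrees

# Toy path graph 0 - 1 - 2 - 3 (hypothetical example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
W = np.array([1, 0, 1, 1])
X = fraction_treated_neighbors(A, W)

# Under the global counterfactuals the feature is constant across units.
X_all_treated = fraction_treated_neighbors(A, np.ones(4))
X_all_control = fraction_treated_neighbors(A, np.zeros(4))
```

The last two lines show why the global counterfactuals are easy to evaluate under Model 1: the feature equals one for every unit under global treatment and zero for every unit under global control.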
Now consider the estimand (2) under Model 1. If all units are globally exposed to treatment then it is
the case for all units i that Wi = 1 and Xi = 1. Therefore
$$\frac{1}{n} \sum_{i=1}^{n} \mu_i(\mathbf{1}) = \alpha + \gamma + \delta.$$

Similarly, if all units are globally exposed to control, then Wi = 0 and Xi = 0, and so
$$\frac{1}{n} \sum_{i=1}^{n} \mu_i(\mathbf{0}) = \alpha.$$

Therefore, the treatment effect under Model 1 is simply

τ = (α + γ + δ) − α = γ + δ.

This parametrization suggests that if we have access to unbiased estimators γ̂ and δ̂ for γ and δ, then an
unbiased estimate for τ is given by
τ̂ = γ̂ + δ̂.
In particular, one is tempted to estimate γ and δ with an OLS regression of Yi on Wi and Xi . Of course,
using τ̂ as an estimator for τ only makes sense if Model 1 accurately represents the true data generating
process. We build up more flexible models in the following sections.
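
To make the estimator concrete, here is a small simulation sketch in which data are generated from Model 1 and the coefficients are fit by OLS. The ring-lattice graph, the parameter values, the standard normal noise, and the sample size are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Assumed graph: ring lattice in which each unit is linked to its 4 nearest units.
idx = np.arange(n)
A = np.zeros((n, n))
for shift in (1, 2):
    A[idx, (idx + shift) % n] = 1
    A[idx, (idx - shift) % n] = 1

alpha, gamma, delta = 1.0, 2.0, 1.5      # assumed values; the GATE is gamma + delta
W = rng.binomial(1, 0.5, size=n)         # Bernoulli(0.5) design
X = (A @ W) / A.sum(axis=1)              # fraction of treated neighbors, equation (3)
Y = alpha + gamma * W + delta * X + rng.normal(0.0, 1.0, size=n)

# OLS of Y on an intercept, W, and X; the GATE estimate adds the two slopes.
design = np.column_stack([np.ones(n), W, X])
alpha_hat, gamma_hat, delta_hat = np.linalg.lstsq(design, Y, rcond=None)[0]
tau_hat = gamma_hat + delta_hat
```

In this sketch the estimate should land near the assumed value of 3.5, up to sampling noise; with a misspecified mean function there is no such guarantee.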
In contrast, we can easily see why the difference-in-means estimator, defined for sample sizes $N_1 = \sum_{i=1}^{n} W_i$
and $N_0 = \sum_{i=1}^{n} (1 - W_i)$ as

$$\hat{\tau}_{DM} = \frac{1}{N_1} \sum_{i=1}^{n} W_i Y_i - \frac{1}{N_0} \sum_{i=1}^{n} (1 - W_i) Y_i, \tag{4}$$

is biased under Model 1. The mean treated response is

E[Yi |Wi = 1] = α + γ + δE[Xi |Wi = 1] = α + γ + δE[Xi ],

where Xi is independent of Wi since the treatments are assigned independently and there are no self-loops
in G. Similarly,
E[Yi |Wi = 0] = α + δE[Xi |Wi = 0] = α + δE[Xi ].
Therefore, the difference-in-means estimator τ̂DM has expectation γ, which need not equal τ = γ + δ in
general. Only if δ = 0 do they coincide, in which case SUTVA holds and there is no interference. In other
words, the difference-in-means estimator marginalizes out the indirect effect rather than controlling for it; it
is an unbiased estimator not for the GATE but for the expected average treatment effect (EATE), defined as
$$\frac{1}{n} \sum_{i=1}^{n} \left[ \mathbb{E}[Y_i \mid W_i = 1] - \mathbb{E}[Y_i \mid W_i = 0] \right].$$

The EATE was introduced in Sävje et al. (2017) as a natural object of study for estimators which are
designed for the SUTVA setting. Sävje et al. (2017); Chin (2018) study the limiting behavior of estimators
such as τ̂DM under mild regimes of misspecification of SUTVA due to interference.
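
The bias argument above can be checked numerically. The sketch below repeats a Bernoulli(0.5) experiment on an assumed ring lattice under Model 1 and averages the difference-in-means estimate over replications; the graph, parameter values, and replication count are illustrative assumptions.

```python
import numpy as np

def ring_neighbor_fraction(W):
    """Fraction treated among the 4 nearest ring neighbors (no self-loops)."""
    return (np.roll(W, 1) + np.roll(W, -1) + np.roll(W, 2) + np.roll(W, -2)) / 4.0

rng = np.random.default_rng(0)
n, reps = 500, 400
alpha, gamma, delta = 1.0, 2.0, 1.5      # assumed Model 1 parameters
gate = gamma + delta                     # global average treatment effect

dm_estimates = np.empty(reps)
for r in range(reps):
    W = rng.binomial(1, 0.5, size=n)
    Y = alpha + gamma * W + delta * ring_neighbor_fraction(W) + rng.normal(size=n)
    dm_estimates[r] = Y[W == 1].mean() - Y[W == 0].mean()

mean_dm = dm_estimates.mean()            # Monte Carlo average of the DM estimator
```

Across replications the average sits near the direct effect gamma and well below the GATE, in line with the claim that the difference in means targets the EATE when delta is nonzero.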

2.2 Linear-in-means with endogenous effects
Now we move to the more interesting version of the linear-in-means model, which contains an endogenous
social effect in addition to an exogenous one. Let
$$Z_i = \frac{1}{d_i} \sum_{j \in \mathcal{N}_i} Y_j, \tag{5}$$

the average value of the neighboring responses. Now consider the following model:
Model 2(a) (LIM with endogenous social effect).

µi (W) = α + βZi + γWi + δXi .

In addition to direct and exogenous spillover effects, unit i now depends on the outcomes of its neighbors
through the spillover effect β. It is conventional and reasonable to assume that |β| < 1. Model 2(a) is often
more realistic than Model 1; as discussed in the introduction, we often believe that interference is caused by
individuals reacting to their peers’ behaviors rather than to their peers’ treatment assignments.
It is helpful to write Model 2(a) in vector-matrix form. Let G̃ be the weighted graph defined by degree-
normalizing the adjacency matrix of G; i.e., let G̃ be the graph corresponding to the adjacency matrix Ã
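with entries $\tilde{A}_{ij} = d_i^{-1} A_{ij}$. Then the matrix representation of Model 2(a) is

$$Y = \alpha + \beta \tilde{A} Y + \gamma W + \delta \tilde{A} W + \varepsilon, \tag{6}$$

where $Y$, $W$, and $\varepsilon$ are the $n$-vectors of responses, treatment assignments, and residuals, respectively. Using
the matrix identity $(I - \beta \tilde{A})^{-1} = \sum_{k=0}^{\infty} \beta^k \tilde{A}^k$, as in equation (6) of Bramoullé et al. (2009), one obtains the
reduced form

$$Y = \frac{\alpha}{1 - \beta} + \gamma W + (\gamma\beta + \delta) \sum_{k=0}^{\infty} \beta^k \tilde{A}^{k+1} W + \sum_{k=0}^{\infty} \beta^k \tilde{A}^k \varepsilon.$$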
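
As a numerical sanity check on the reduced form (a sketch under assumed values, not part of the paper), one can solve the structural equation (6) directly and compare the result with a long truncation of the series. The small ring-with-chord graph, the parameter values, and the truncation length of 200 terms are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
idx = np.arange(n)
A = np.zeros((n, n))
A[idx, (idx + 1) % n] = 1
A[idx, (idx - 1) % n] = 1
A[0, 15] = A[15, 0] = 1                          # one extra chord (assumed)
A_tilde = A / A.sum(axis=1, keepdims=True)       # degree-normalized adjacency matrix

alpha, beta, gamma, delta = 1.0, 0.4, 2.0, 1.5   # assumed values with |beta| < 1
W = rng.binomial(1, 0.5, size=n).astype(float)
eps = rng.normal(size=n)

# Structural form: (I - beta * A_tilde) Y = alpha + gamma W + delta A_tilde W + eps.
rhs = alpha + gamma * W + delta * (A_tilde @ W) + eps
Y_structural = np.linalg.solve(np.eye(n) - beta * A_tilde, rhs)

# Reduced form, truncating both infinite sums after 200 terms.
Y_reduced = alpha / (1.0 - beta) + gamma * W
power_W = A_tilde @ W        # holds A_tilde^(k+1) W, starting at k = 0
power_eps = eps.copy()       # holds A_tilde^k eps, starting at k = 0
for k in range(200):
    Y_reduced = Y_reduced + (gamma * beta + delta) * beta**k * power_W + beta**k * power_eps
    power_W = A_tilde @ power_W
    power_eps = A_tilde @ power_eps
```

The constant term simplifies to alpha / (1 - beta) because every row of the degree-normalized matrix sums to one, which requires that no unit be isolated.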

Unlike Manski (1993); Bramoullé et al. (2009) and other works in the “reflection problem” literature, we
are not concerned with the identification of the social effect parameters β and δ; these are only nuisance
parameters toward the end of estimating τ . We do note, however, that conditions for identifiability are
generally mild enough to be satisfied by real-world networks. For example, Bramoullé et al. (2009) show
that the parameters in Model 2(a) are identified whenever there exists a triple of individuals who are not all
pairwise friends with each other; such a triple nearly certainly exists in any network that we consider.
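Now, let $X_{i,k}$ be the $i$-th coordinate of $\tilde{A}^k W$. That is,

$$
\begin{aligned}
X_{i,1} &= \frac{1}{d_i} \sum_{j \in \mathcal{N}_i} W_j, \\
X_{i,2} &= \frac{1}{d_i} \sum_{j \in \mathcal{N}_i} \frac{1}{d_j} \sum_{k \in \mathcal{N}_j} W_k, \\
X_{i,3} &= \frac{1}{d_i} \sum_{j \in \mathcal{N}_i} \frac{1}{d_j} \sum_{k \in \mathcal{N}_j} \frac{1}{d_k} \sum_{\ell \in \mathcal{N}_k} W_\ell,
\end{aligned}
$$

and in general, for any $k \geq 1$,

$$X_{i,k} = \frac{1}{d_i} \sum_{j_1 \in \mathcal{N}_i} \frac{1}{d_{j_1}} \sum_{j_2 \in \mathcal{N}_{j_1}} \cdots \frac{1}{d_{j_{k-1}}} \sum_{j_k \in \mathcal{N}_{j_{k-1}}} W_{j_k}.$$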
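
These nested sums never need to be expanded by hand: since $X_{i,k}$ is the $i$-th coordinate of $\tilde{A}^k W$, repeated multiplication by the degree-normalized adjacency matrix produces them one hop at a time. A minimal sketch, where the function name `multi_hop_features` and the toy path graph are illustrative assumptions:

```python
import numpy as np

def multi_hop_features(A, W, K):
    """Return an n-by-K array whose k-th column (1-indexed) is A_tilde^k W."""
    A = np.asarray(A, dtype=float)
    A_tilde = A / A.sum(axis=1, keepdims=True)   # assumes no isolated units
    current = np.asarray(W, dtype=float)
    columns = []
    for _ in range(K):
        current = A_tilde @ current              # one more hop of neighbor averaging
        columns.append(current)
    return np.column_stack(columns)

# Toy path graph 0 - 1 - 2 - 3 (hypothetical example).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
W = np.array([1, 0, 1, 1])
X = multi_hop_features(A, W, K=2)
```

Each additional column costs one matrix-vector product, so with a sparse representation of the graph the features remain cheap to compute even for larger K.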

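Then Model 2(a) is the same as

$$Y_i = \tilde{\alpha} + \tilde{\gamma} W_i + \sum_{k=0}^{\infty} \tilde{\beta}_k X_{i,k+1} + \tilde{\varepsilon}_i, \tag{7}$$

where we have reparametrized the coefficients as

$$\tilde{\alpha} = \frac{\alpha}{1 - \beta}, \qquad \tilde{\gamma} = \gamma, \qquad \tilde{\beta}_k = (\gamma\beta + \delta)\beta^k, \qquad \tilde{\varepsilon} = \sum_{k=0}^{\infty} \beta^k \tilde{A}^k \varepsilon.$$

Notice that equation (7) respects exogeneity, as

$$\mathbb{E}[\tilde{\varepsilon} \mid W] = \sum_{k=0}^{\infty} \beta^k \tilde{A}^k \, \mathbb{E}[\varepsilon \mid W] = 0.$$

Each covariate $X_{i,k}$ represents the effect of treatments from units of graph distance $k$ on the response of
unit $i$. Since $|\beta| < 1$, the terms $\tilde{\beta}_k X_{i,k+1}$ do not contribute much to equation (7) when $k$ is large.
Therefore, for any finite integer $K$, we may consider approximating Model 2(a) with a finite-dimensional
model.
Model 2(b) (Finite linear-in-means).

$$Y_i = \tilde{\alpha} + \tilde{\gamma} W_i + \sum_{k=0}^{K} \tilde{\beta}_k X_{i,k+1} + \tilde{\varepsilon}_i. \tag{8}$$

The approximation error is of order $\tilde{\beta}_{K+1} = (\gamma\beta + \delta)\beta^{K+1}$ (recall that $|\beta| < 1$). Therefore, good estimates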
of the coefficients in equation (8) should be good estimates of the coefficients in equation (7) as well. Unless
spillover effects are extremely large, the approximation may be quite good for even small values of K. In
fact, it may be reasonable to take equation (8) rather than equation (7) as the truth where K is no larger
than the diameter of the network G, as spillovers for larger distances may not make sense.
As in Model 1, we can consider the counterfactuals of interest. If all units are globally exposed to
treatment, then Wi = 1 and Xi,k = 1 for all i and k. Similarly, if all units are globally exposed to control,
then Wi = 0 and Xi,k = 0 for all i and k. Therefore, by equation (7), the estimand τ under Model 2(a) is

$$\tau = \tilde{\gamma} + \sum_{k=0}^{\infty} \tilde{\beta}_k,$$

and under Model 2(b) it is

$$\tau = \tilde{\gamma} + \sum_{k=0}^{K} \tilde{\beta}_k.$$
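
As a worked check on these two expressions, the sums have closed forms. This uses only the reparametrization $\tilde{\gamma} = \gamma$, $\tilde{\beta}_k = (\gamma\beta + \delta)\beta^k$ and the condition $|\beta| < 1$ stated above, together with the geometric series:

```latex
\begin{align*}
\text{Model 2(a):}\quad \tau &= \gamma + (\gamma\beta + \delta) \sum_{k=0}^{\infty} \beta^k
  = \gamma + \frac{\gamma\beta + \delta}{1 - \beta}
  = \frac{\gamma + \delta}{1 - \beta}, \\
\text{Model 2(b):}\quad \tau &= \gamma + (\gamma\beta + \delta) \sum_{k=0}^{K} \beta^k
  = \gamma + \frac{(\gamma\beta + \delta)\,(1 - \beta^{K+1})}{1 - \beta}.
\end{align*}
```

The two estimands differ by $(\gamma\beta + \delta)\beta^{K+1}/(1 - \beta)$, which shrinks geometrically in $K$, and setting $\beta = 0$ in either line recovers $\tau = \gamma + \delta$ from Model 1.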

Now, since Model 2(b) has only K + 3 coefficients, given n > K + 3 individuals one can estimate the
coefficients using, say, ordinary least squares. The treatment effect estimator
$$\hat{\tau} = \hat{\gamma} + \sum_{k=0}^{K} \hat{\beta}_k$$

is then unbiased for τ under Model 2(b) and “approximately unbiased” for τ under Model 2(a). This
discussion is of course quite informal, and we make more formal arguments in Section 3.
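
The following simulation sketch illustrates the "approximately unbiased" claim: outcomes are generated from the endogenous Model 2(a) and the finite Model 2(b) is fit by OLS, pairing each coefficient $\tilde{\beta}_k$ with the $(k+1)$-hop feature as the reduced form implies. The ring-lattice graph, parameter values, noise scale, and the choice K = 2 are assumptions for illustration, and a single simulated draw says nothing about performance in general.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 2000, 2

# Assumed graph: ring lattice in which each unit is linked to its 4 nearest units.
idx = np.arange(n)
A = np.zeros((n, n))
for shift in (1, 2):
    A[idx, (idx + shift) % n] = 1
    A[idx, (idx - shift) % n] = 1
A_tilde = A / A.sum(axis=1, keepdims=True)

alpha, beta, gamma, delta = 1.0, 0.4, 2.0, 1.5    # assumed values with |beta| < 1
tau_true = (gamma + delta) / (1.0 - beta)         # GATE implied by the geometric series

W = rng.binomial(1, 0.5, size=n).astype(float)
eps = rng.normal(0.0, 0.5, size=n)
# Solve (I - beta * A_tilde) Y = alpha + gamma W + delta A_tilde W + eps for Y.
rhs = alpha + gamma * W + delta * (A_tilde @ W) + eps
Y = np.linalg.solve(np.eye(n) - beta * A_tilde, rhs)

# Regressors: intercept, W, and hop features A_tilde^1 W, ..., A_tilde^(K+1) W.
columns = [np.ones(n), W]
hop = W
for _ in range(K + 1):
    hop = A_tilde @ hop
    columns.append(hop)
coef = np.linalg.lstsq(np.column_stack(columns), Y, rcond=None)[0]

tau_hat = coef[1:].sum()                          # gamma_hat plus the K + 1 spillover slopes
tau_dm = Y[W == 1].mean() - Y[W == 0].mean()      # difference in means, for comparison
```

In this setup the adjusted estimate should fall much closer to the true GATE than the difference in means does, although the hop features are strongly correlated with one another, so the adjusted estimate can be noticeably more variable.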
One interpretation of the discussion in this section is that an endogenous social effect in the linear-in-means
model manifests as a propagation of exogenous effects through the social network, with the strength of
the exogenous effect diminishing as the network distance increases. Therefore, controlling for the exogenous
features within the first few neighborhoods is nearly equivalent to controlling for the behavior implied by
the endogenous social effect.

3 Interference control variables and the general linear model
In Section 2, we showed that the mean function in the linear-in-means model is comprised of a linear com-
bination of statistics Xi,k which are constructed as functions of the treatment vector. This fact suggests
extending our approach to a linear model containing other functions of the treatment vector that are corre-
lated with Yi , not just the ones implied by the linear-in-means model. We now formulate the general linear
model. We suppose that each unit i is associated with a p-dimensional vector of interference control variables
or covariates Xi ∈ Rp that inform the pattern of interference for unit i. We assume that the covariates are
low-dimensional (p ≪ n). Because Xi is to be used for adjustment, the main requirement is that it not be
a “post-treatment variable”; that is, that it not be correlated with the treatment Wi . Therefore, we require
the following assumption:
Assumption 2. Xi ⊥⊥ Wi for all i ∈ [n].
Let W−i denote the vector of indirect treatments, which is the n − 1 vector of all treatments except for
Wi . The key feature of our approach is that even though Xi must be independent of Wi , it is not necessary
that Xi be independent of the vector of indirect treatments W−i . In fact, in order for Xi to be useful for
adjusting for interference, we expect that Xi will be correlated with some entries of W−i . In particular, Xi
may be a deterministic function xi (·) of the indirect treatments,

Xi = xi (W−i ). (9)

Adjusting for such a variable Xi will not cause post-treatment adjustment bias as long as the entries of W
are independent of each other. This holds automatically in a Bernoulli randomized design (Assumption 1).
The covariates Xi may depend on static structural information about the units such as network information
provided by G, though since G is static we suppress this dependence in the notation. For example, Xi
defined as in equation (3), which represents the proportion of treated neighbors, captures a particular form
of exogenous social influence. Provided there are no self-loops in G so that Aii = 0, Wi does not appear on
the right-hand side of equation (3) and so Xi and Wi are independent.
We assume that we can easily sample from the distribution of Xi . In particular, if Xi = xi (W−i ), then
the distribution of Xi can be constructed by Monte Carlo sampling from the randomization distribution of
the treatment W. In this paper and in all the examples we use, we assume that Xi is a function of W−i as
in equation (9), so that conditioning on W−i removes all randomness in Xi . But the generalization is easily
handled.
In this section we assume that the response is linear in Xi ; we address nonparametric response surfaces
in Section 4.
Model 3 (Linear model). Given Xi , let the response Yi follow

Yi = Wi µ(1) (Xi ) + (1 − Wi )µ(0) (Xi ) + εi ,

where the conditional response surfaces


$$\mu^{(0)}(x) = E[Y_i^{(0)} \mid X_i = x], \qquad \mu^{(1)}(x) = E[Y_i^{(1)} \mid X_i = x]$$

satisfy
$$\mu^{(0)}(x) = \beta_0^\top x, \qquad \mu^{(1)}(x) = \beta_1^\top x$$
for x ∈ Rp and β0 , β1 ∈ Rp . That is, they follow a “separate slopes” linear model in Xi . We assume p < n.
In the above parametrization, we assume that the first coordinate of each Xi is set to 1, so that the
vectors β0 and β1 contain coefficients corresponding to the intercept as in the classical OLS formulation.

3.1 Feature engineering


Before considering assumptions on the residuals εi , we pause here to emphasize the flexibility provided by
modeling the interference pattern as in Model 3. In this framework, the researcher can use domain knowledge

to construct graph features that are expected to contribute to interference. In essence, we have transformed
the problem of determining the structure of the interference pattern into a feature engineering problem,
which is perhaps a more intuitive and accessible task for the practitioner.
To elaborate, consider the problem of selecting an exposure model. Ugander et al. (2013) propose and
study a number of different exposure models for targeting the global treatment effect, including fractional
exposure (based on the fraction of treated neighbors), absolute exposure (based on the raw number of treated
neighbors), and extensions based on the k-core structure of the network. In reality, it may be the case that
fractional and absolute exposure both contribute partial effects of interference, so ideally one wishes to avoid
having to choose between one of the two exposure models. On the other hand, both features are easily
included in Model 3 by encoding both the fraction and raw number of treated neighbors in Xi .
In a similar manner, the researcher may wish to handle longer-range interference, such as that coming
from two-step or greater neighborhoods. It is possible to handle two-step information by working with the
graph corresponding to the adjacency matrix A^2, but this approach is unsatisfactory because presumably
one-step interference is stronger than two-step interference, and this distinction is lost by using A^2. On the
other hand, if one-step and two-step network information are encoded as separate features, both effects are
included and the magnitudes of their coefficients will reflect the strength of the corresponding interference
contributed by each feature.
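As a minimal sketch of this kind of feature construction, the following NumPy function builds three columns from an adjacency matrix and a treatment vector: the fraction of treated neighbors, the raw number of treated neighbors, and the fraction of treated two-step neighbors. The function name `interference_features` and the convention that two-step neighbors exclude both the unit itself and its direct neighbors are assumptions made for illustration, not definitions from the paper.

```python
import numpy as np

def interference_features(A, W):
    """Hypothetical helper: interference covariates from a 0/1 adjacency
    matrix A (n x n, zero diagonal) and a 0/1 treatment vector W."""
    A = np.asarray(A, dtype=float)
    W = np.asarray(W, dtype=float)

    deg = A.sum(axis=1)
    num_treated = A @ W                      # absolute exposure
    frac_treated = np.divide(num_treated, deg,
                             out=np.zeros_like(deg), where=deg > 0)

    # Two-step neighbors: reachable in two hops, excluding the unit itself
    # and its direct neighbors (an assumption of this sketch).
    A2 = ((A @ A) > 0).astype(float)
    np.fill_diagonal(A2, 0.0)
    A2 = A2 * (1.0 - A)
    deg2 = A2.sum(axis=1)
    frac_treated_2 = np.divide(A2 @ W, deg2,
                               out=np.zeros_like(deg2), where=deg2 > 0)

    # Row i never depends on W[i], since neither A nor A2 has self-loops.
    return np.column_stack([frac_treated, num_treated, frac_treated_2])
```

Because both neighbor sets exclude unit i, each feature is a function of W−i only, which is what Assumption 2 requires under a Bernoulli design.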
Furthermore, nothing in our framework requires the variables to be constructed from a single network.
Often, the researcher has access to multiple networks defined on the same vertex set—i.e., a multilayer
network (Kivelä et al. 2014)—representing different types of interactions among the units. For example,
social networking sites such as Facebook and Twitter contain multiple friendship or follower networks based
on the strength and type of interpersonal relationship (e.g. family, colleagues, and acquaintances), as well
as activity-based networks constructed from event data such as posts, tweets, likes, or comments. Often
these networks are also dynamic in time. Given the sociological phenomenon that the strength of a tie is
an indicator of its capacity for social influence (Granovetter 1973) and that people use different mediums
differently when communicating online (Haythornthwaite and Wellman 1998), any or all of these network
layers can conceivably be a medium for interference in varying amounts depending on the causal intervention
and outcome metric in question. In our framework graph features from different network layers are easily
included in the model.

3.2 Exogeneity assumptions


Consider the following assumptions on the residuals.

Assumption 3. (a) The errors are strictly exogenous: E[εi |X1 , . . . , Xn ] = 0 for all i ∈ [n].
(b) The errors are independent.
(c) The errors are homoscedastic: Var(εi |X1 , . . . , Xn ) = σ 2 for all i ∈ [n].
Assumption 3(a) captures the requirement that the covariates contain all of the information needed to
control for the bias contributed by interference, and thus is similar to an unconfoundedness or ignorabil-
ity assumption often invoked in observational studies. Point estimates can be constructed based only on
Assumption 3(a), but variance estimation requires Assumption 3(b) so that each data point contributes
additional independent information.
Assumptions 3(a) and 3(b) therefore collectively state that once we control for the effects of X1, . . . , Xn,
then SUTVA holds. This assumption cannot be verified from the data, and so this setup borrows all of the
problems that come with selecting an exposure model or being able to verify unconfoundedness. However,
our setup is slightly different because of the flexibility afforded by the covariates. Compared to what we
envision as the usual observational studies setting, our covariates are constructed from the treatment vector
and social network rather than being collected in the wild, and so they are quite cheap to construct via
feature engineering. That said, more work for conducting sensitivity analysis for interference or spillover
effects is certainly needed.
Assumption 3(c) is the easiest to deal with if violated. One may use a heteroscedasticity-consistent
estimate of the covariance matrix, also known as the sandwich estimator or the Eicker-Huber-White estimator
(Eicker 1967; Huber 1967; White 1980). In this paper we invoke Assumption 3(c) mainly to simplify
notation, but heteroscedasticity-robust extensions are straightforward.
For Xi following equation (9), denote
$$X_i^{(0)} = x_i(W_{-i} = 0), \qquad X_i^{(1)} = x_i(W_{-i} = 1).$$
The variable X_i^{(0)} represents the context for unit i under the counterfactual scenario that i is exposed to
global control, and the variable X_i^{(1)} represents the context for unit i under the counterfactual scenario that
i is exposed to global treatment. Both of these values are deterministic.¹ For example, if Xi is the “mean
treated” covariate as defined in equation (3), then X_i^{(0)} = 0 and X_i^{(1)} = 1 for every unit i ∈ [n].
We now consider the estimand under Model 3 and Assumption 3. The GATE for Model 3 is
$$\begin{aligned}
\tau &= \frac{1}{n} \sum_{i=1}^{n} \left[ E[Y_i \mid W = 1] - E[Y_i \mid W = 0] \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ \mu^{(1)}(X_i^{(1)}) - \mu^{(0)}(X_i^{(0)}) \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \left[ (X_i^{(1)})^\top \beta_1 - (X_i^{(0)})^\top \beta_0 \right],
\end{aligned}$$

where the second equality is by Assumption 3(a). Now introduce the quantities
$$\omega_0 = \frac{1}{n} \sum_{i=1}^{n} X_i^{(0)}, \qquad \omega_1 = \frac{1}{n} \sum_{i=1}^{n} X_i^{(1)},$$

which are the mean counterfactual covariate values for global control and global treatment, averaged over
the population. We emphasize that ω0 and ω1 are deterministic and known, because the distribution of
Xi is assumed to be known. We then have

$$\tau = \omega_1^\top \beta_1 - \omega_0^\top \beta_0. \tag{10}$$

Such an estimand, which focuses on the covariates of the finite population at hand, is natural in the network
setting where there is no clear superpopulation or larger network of interest.
We now construct an estimator by estimating the regression coefficients with ordinary least squares. For
w = 0, 1, let Xw be the Nw × p design matrix corresponding to covariates belonging to treatment group w,
where the first column of Xw is a column of ones. Let yw be the Nw -vector of observed responses Yi for
treatment group w. Then we use the standard OLS estimator

$$\hat\beta_w = (X_w^\top X_w)^{-1} X_w^\top y_w. \tag{11}$$

The estimate of the treatment effect is taken to be the difference in mean predicted outcomes under the
global treatment and control counterfactual distributions,

$$\hat\tau = \omega_1^\top \hat\beta_1 - \omega_0^\top \hat\beta_0. \tag{12}$$

Assuming Model 3 holds, τ̂ is an unbiased estimate of τ , which follows from unbiasedness of the OLS
coefficients.
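The estimator in equations (11) and (12) can be sketched in a few lines of NumPy. The function name `ols_gate` is hypothetical, and the sketch assumes the design matrix X already carries a leading column of ones and that both treatment groups give a full-rank design.

```python
import numpy as np

def ols_gate(Y, W, X, omega0, omega1):
    """Sketch of tau_hat = omega1' beta1_hat - omega0' beta0_hat.

    Y: (n,) responses; W: (n,) 0/1 treatments;
    X: (n, p) covariates with a leading column of ones;
    omega0, omega1: (p,) mean counterfactual covariates, including the
    intercept coordinate (equal to 1).
    """
    Y = np.asarray(Y, dtype=float)
    W = np.asarray(W)
    X = np.asarray(X, dtype=float)
    beta_hat = {}
    for w in (0, 1):
        mask = W == w
        # Separate least-squares fit within each treatment group
        beta_hat[w], *_ = np.linalg.lstsq(X[mask], Y[mask], rcond=None)
    tau_hat = omega1 @ beta_hat[1] - omega0 @ beta_hat[0]
    return tau_hat, beta_hat[0], beta_hat[1]
```

For the single "fraction of treated neighbors" covariate, the counterfactual means would be `omega0 = [1, 0]` and `omega1 = [1, 1]`, since the fraction is 0 under global control and 1 under global treatment.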
Proposition 1. Suppose Model 3 and Assumptions 1, 2, and 3(a) hold. Let τ and τ̂ be defined as in
equations (10) and (12), and let β̂w for w = 0, 1 be OLS estimators as defined in equation (11). Then
conditionally on Xw being full (column) rank,² β̂w is an unbiased estimator of βw and τ̂ is an unbiased
estimator of τ.

¹ In the event that the Xi are not defined through a function xi(·), one may work with X_i^{(0)} = Xi | (W−i = 0) and
X_i^{(1)} = Xi | (W−i = 1), where this notation means that X_i^{(0)} follows the conditional distribution of Xi given the
event that W−i = 0, and similarly for X_i^{(1)}. In this case X_i^{(0)} and X_i^{(1)} may be random, and estimands can be
defined using E[X_i^{(0)}] and E[X_i^{(1)}] instead.
(Proofs for Proposition 1 and other results are deferred to the appendix.) Notice that the treatment
group predicted mean is
$$\omega_1^\top \hat\beta_1 = \omega_1^\top (X_1^\top X_1)^{-1} X_1^\top y_1$$
and the control group predicted mean is
$$\omega_0^\top \hat\beta_0 = \omega_0^\top (X_0^\top X_0)^{-1} X_0^\top y_0.$$

Therefore τ̂ is linear in the observed response vector y. That is, τ̂ = a_1^⊤ y_1 − a_0^⊤ y_0, where the weight vectors
a_0 ∈ R^{N_0} and a_1 ∈ R^{N_1} are given by
$$a_0^\top = \omega_0^\top (X_0^\top X_0)^{-1} X_0^\top, \tag{13}$$
$$a_1^\top = \omega_1^\top (X_1^\top X_1)^{-1} X_1^\top. \tag{14}$$

These weights allow us to compare the reweighting strategy with that of other linear estimators, such as the
Hájek estimator, which is a particular weighted mean of y. More details are provided in Section 5.1, with
an example given in Section 5.3.
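To make the weighting view concrete, here is a small sketch that computes the weight vector a_w of equations (13) and (14) and can be used to check numerically that a_w^⊤ y_w reproduces the predicted mean ω_w^⊤ β̂_w. The function name `regression_weights` is hypothetical. One fact worth noting: when Xw contains a column of ones and the first coordinate of ω_w is 1, the weights sum to one, which is what makes the comparison with the (self-normalized) Hájek weights natural.

```python
import numpy as np

def regression_weights(Xw, omega_w):
    """Sketch of a_w = Xw (Xw'Xw)^{-1} omega_w, i.e. the transpose of
    a_w' = omega_w' (Xw'Xw)^{-1} Xw'."""
    Xw = np.asarray(Xw, dtype=float)
    omega_w = np.asarray(omega_w, dtype=float)
    # Solve (Xw'Xw) z = omega_w rather than forming an explicit inverse
    z = np.linalg.solve(Xw.T @ Xw, omega_w)
    return Xw @ z
```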

3.3 Inference
Now we provide variance expressions under the assumption that the errors are exogenous, independent, and
homoscedastic, as in Assumption 3.
Theorem 1. Suppose Model 3 and Assumptions 1, 2, and 3 hold. Then

$$\mathrm{Var}(\hat\tau) = \sigma^2 \left( \|\omega_0\|^2_{\Gamma_0} + \|\omega_1\|^2_{\Gamma_1} \right), \tag{15}$$

where ‖v‖²_M = v^⊤ M v, Γw = E[(X_w^⊤ X_w)^{-1}], and ωw is the mean of the counterfactual covariate
distribution (including an intercept) for w = 0, 1.

3.3.1 Variance estimation


In order to estimate the variance (15), we must estimate the quantities Γ0 = E[(X_0^⊤ X_0)^{-1}] and Γ1 =
E[(X_1^⊤ X_1)^{-1}], which are the expected inverse sample covariance matrices. Of course, (X_0^⊤ X_0)^{-1} and
(X_1^⊤ X_1)^{-1} are observed and unbiased estimators. However, unlike standard covariates collected in the
wild, we envision that the Xi are constructed from the graph G and the treatment vector W, and so we
can take advantage of the fact that the distribution of Xi is completely known to the researcher. It is
thus possible to compute Γ0 and Γ1 up to arbitrary precision by repeated Monte Carlo sampling from the
randomization distribution of W. For clarity, this estimation procedure is illustrated in Algorithm 1.
Finally, we can estimate σ² in the usual way, with the residual mean squared error
$$\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - W_i (\hat\beta_1^\top X_i) - (1 - W_i)(\hat\beta_0^\top X_i) \right)^2.$$

Equipped with σ̂² and Monte Carlo estimates Γ̂w, we can use the variance estimate
$$\widehat{\mathrm{Var}}(\hat\tau) = \hat\sigma^2 \left( \|\omega_0\|^2_{\hat\Gamma_0} + \|\omega_1\|^2_{\hat\Gamma_1} \right). \tag{16}$$

² Since Xw is random and depends on W, conditioning on Xw having full column rank is necessary, even though this
condition may not be fulfilled for all realizations of the treatment vector. For example, if Xw contains a column for the fraction
of neighbors treated, then it is possible though highly unlikely for all units to be assigned to treatment, in which case this
column is collinear with the intercept and Xw is not full rank. We shall, for the most part, ignore this technicality and assume
that the covariates are chosen so that the event that X_w^⊤ X_w is singular rarely happens, and is in fact negligible
asymptotically.
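Algorithm 1 Estimating Γ0 and Γ1 by Monte Carlo
for b = 1, . . . , B do
    Sample treatment W_b ∈ W and compute corresponding covariates X_{b,i} and sample sizes N_{b,0} and N_{b,1}
    Calculate sample covariances
        $$(\tilde X_0^\top \tilde X_0)_b \leftarrow \frac{1}{N_{b,0}} \sum_{i=1}^{n} (1 - W_{b,i}) X_{b,i} X_{b,i}^\top$$
        $$(\tilde X_1^\top \tilde X_1)_b \leftarrow \frac{1}{N_{b,1}} \sum_{i=1}^{n} W_{b,i} X_{b,i} X_{b,i}^\top$$
end for
return Moment estimates
    $$\hat\Gamma_w \leftarrow \hat E[(\tilde X_w^\top \tilde X_w)^{-1}] = \frac{1}{B} \sum_{b=1}^{B} (\tilde X_w^\top \tilde X_w)_b^{-1}$$
for w = 0, 1.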
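A Monte Carlo procedure of this kind might be sketched as follows. Several choices here are assumptions of the sketch rather than the paper's specification: the names `monte_carlo_gamma`, `variance_estimate`, and `feature_fn` are hypothetical; the code averages inverses of the unnormalized Gram matrices X_w^⊤ X_w, following the definition of Γw given with equation (15); and draws in which either Gram matrix is singular are skipped, in the spirit of conditioning on full column rank.

```python
import numpy as np

def monte_carlo_gamma(A, pi, feature_fn, B=500, seed=0):
    """Approximate Gamma_w = E[(X_w' X_w)^{-1}] for w = 0, 1 by sampling
    Bernoulli(pi) treatment vectors.

    feature_fn(A, W) -> (n,) or (n, p-1) array of interference covariates
    (a hypothetical user-supplied function); an intercept is prepended.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    sums = {0: 0.0, 1: 0.0}
    used = 0
    for _ in range(B):
        W = rng.binomial(1, pi, size=n)
        X = np.column_stack([np.ones(n), feature_fn(A, W)])
        grams = {w: X[W == w].T @ X[W == w] for w in (0, 1)}
        # Skip draws where either group has a rank-deficient design
        if any(np.linalg.matrix_rank(G) < G.shape[0] for G in grams.values()):
            continue
        for w in (0, 1):
            sums[w] = sums[w] + np.linalg.inv(grams[w])
        used += 1
    if used == 0:
        raise ValueError("no full-rank draws; check the design or features")
    return sums[0] / used, sums[1] / used

def variance_estimate(sigma2_hat, omega0, omega1, gamma0, gamma1):
    """Plug-in variance: sigma^2 (omega0' G0 omega0 + omega1' G1 omega1)."""
    return sigma2_hat * (omega0 @ gamma0 @ omega0 + omega1 @ gamma1 @ omega1)
```

Increasing `B` reduces the Monte Carlo error in the estimated matrices; no outcome data are needed for this step because the distribution of the covariates is known from the design.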

3.4 Asymptotic results


Proposition 1 and Theorem 1 characterize the finite-n expectation and variance of the treatment effect
estimator under Model 3. Establishing an asymptotic result is more nuanced: because of the dependence
among units implied by interference, the quantities E[(X_0^⊤ X_0)^{-1}] and E[(X_1^⊤ X_1)^{-1}] may not be O(n^{-1}), in
which case τ̂ would not converge at a √n rate. This is a problem with dealing with interference in general,
making comparisons to the semiparametric efficiency bound (Hahn 1998), a standard benchmark in the
SUTVA case, difficult in this setting. However, we can state a √n central limit theorem in the event that the
sample mean and covariance do scale and converge appropriately. To do so, we implicitly assume existence of
a sequence of populations indexed by their size n, and that the parameters associated with each population
setup, such as β0 , β1 , π, and σ 2 , converge to appropriate limits. Such an asymptotic regime is the standard
for results of this sort (cf. Freedman 2008a,b; Lin 2013; Abadie et al. 2017a,b; Sävje et al. 2017; Chin 2018).
We suppress the index on n to avoid notational clutter.
So that we can compare to previous works, it is helpful to reparametrize the linear regression setup so
that the intercept and slope coefficients are written separately. That is, let Xi and ωw be redefined to exclude
the intercept, and let βw = (αw, ηw) so that the mean functions are written µ^{(w)}(x) = αw + η_w^⊤ x, where αw
is the intercept parameter and ηw is the vector of slope coefficients. Then the GATE is

$$\tau = (\alpha_1 + \omega_1^\top \eta_1) - (\alpha_0 + \omega_0^\top \eta_0).$$

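Denote the within-group sample averages by
$$\bar y_1 = \frac{1}{N_1} \sum_{i=1}^{n} W_i Y_i, \qquad \bar y_0 = \frac{1}{N_0} \sum_{i=1}^{n} (1 - W_i) Y_i$$

and
$$\bar X_1 = \frac{1}{N_1} \sum_{i=1}^{n} W_i X_i, \qquad \bar X_0 = \frac{1}{N_0} \sum_{i=1}^{n} (1 - W_i) X_i.$$

Since the intercept is determined by α̂w = ȳw − X̄_w^⊤ η̂w, the estimator τ̂, equation (12), is written as
$$\begin{aligned}
\hat\tau &= (\hat\alpha_1 + \omega_1^\top \hat\eta_1) - (\hat\alpha_0 + \omega_0^\top \hat\eta_0) \\
&= \bar y_1 - \bar y_0 + (\omega_1 - \bar X_1)^\top \hat\eta_1 - (\omega_0 - \bar X_0)^\top \hat\eta_0.
\end{aligned} \tag{17}$$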

Now, τ̂ is seen to be an adjustment of the difference-in-means estimator ȳ1 − ȳ0 . The adjustment depends
on both the estimated strength of interference, η̂w , and the discrepancy between the means of the observed

distribution and the reference or target distribution, X̄w − ωw . This linear shift is a motif in the regression
adjustment literature, and is reminiscent of, e.g., equation (16) of Aronow and Middleton (2013).
We now state a central limit theorem for τ̂ .
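Theorem 2. Assume the setup of Theorem 1. Assume further that the sample moments converge in probability:
$$\bar X = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{p} \mu_X, \qquad S = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)(X_i - \bar X)^\top \xrightarrow{p} \Sigma_X,$$

where ΣX is positive definite, and that all fourth moments are bounded. Then √n(τ̂ − τ) ⇒ N(0, V), where

$$V = \sigma^2 \left( \frac{1}{\pi(1-\pi)} + \frac{\|\omega_0 - \mu_X\|^2_{\Sigma_X^{-1}}}{1 - \pi} + \frac{\|\omega_1 - \mu_X\|^2_{\Sigma_X^{-1}}}{\pi} \right). \tag{18}$$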

The terms in expression (18) are unpacked versions of the terms in expression (15), and can be stated in
this way since the covariate moments converge at the appropriate rate.

3.5 Relationship with standard regression adjustments


The practitioner may also wish to perform standard regression adjustments to control for static, contextual
node-level variables such as age, gender, and other demographic variables. This fits easily into the framework
of Model 3, as any such static variable Xi can be viewed as simply a constant function of the indirect treatment
vector W−i . Then the adjustment is not used to remove bias but simply to reduce variance by balancing
the covariate distributions. In this case the counterfactual (global exposure) distribution is the same as the
observed distribution, and in particular, ω0 = µX and ω1 = µX . Hence we see that Theorem 2 reduces to
the standard asymptotic result for regression adjustments using OLS.

Corollary 1. Assume the setup of Theorem 2. Suppose Xi is independent of W−i. Then ω0 = µX and
ω1 = µX and
$$\sqrt{n}(\hat\tau - \tau) \Rightarrow N\left(0, \frac{\sigma^2}{\pi(1-\pi)}\right).$$
The variance in Corollary 1 is the same asymptotic variance as in the standard regression adjustment
setup (cf. Wager et al. 2016, Theorem 2). In practice, if some components of Xi are static variables and
some are interference variables, then the resulting variance will be decomposed into the components stated
in Theorem 2 and Corollary 1.

4 Nonparametric adjustments
In this section we relax the linear model, Model 3:
Model 4 (Non-linear response surface). Let Yi follow

Yi = Wi µ(1) (Xi ) + (1 − Wi )µ(0) (Xi ) + εi ,

with conditional mean response surfaces

$$\mu^{(0)}(x) = E[Y_i^{(0)} \mid X_i = x], \qquad \mu^{(1)}(x) = E[Y_i^{(1)} \mid X_i = x].$$

We make no parametric assumptions on the form of µ(0) (x) and µ(1) (x).

We maintain Assumption 3, namely that SUTVA holds conditionally on X1 , . . . , Xn .
In the SUTVA setting, adjustment with OLS works best when the adjustment variables are highly
correlated with the potential outcomes; that is, the precision improvement largely depends on the prediction
accuracy. This fact suggests that predicted outcomes obtained from an arbitrary machine learning model
can be used for adjustment, an idea formalized by Wager et al. (2016) and Wu and Gagnon-Bartsch (2017).
Based on ideas from Aronow and Middleton (2013), these papers propose using the estimator
$$\frac{1}{n} \sum_{i=1}^{n} \left( \hat\mu_{-i}^{(1)}(X_i) - \hat\mu_{-i}^{(0)}(X_i) \right) + \frac{1}{N_1} \sum_{i=1}^{n} W_i \left( Y_i - \hat\mu_{-i}^{(1)}(X_i) \right) - \frac{1}{N_0} \sum_{i=1}^{n} (1 - W_i) \left( Y_i - \hat\mu_{-i}^{(0)}(X_i) \right), \tag{19}$$

where µ̂_{-i}^{(0)} and µ̂_{-i}^{(1)} are predictions of the potential outcomes obtained without using the i-th observation.
This doubly-robust style approach is called cross-estimation by Wager et al. (2016) and the leave-one-out po-
tential outcomes (LOOP) estimator by Wu and Gagnon-Bartsch (2017) who focus on imputing the outcomes
using a version of leave-one-out cross validation. This estimator is also reminiscent of the double machine
learning (DML) cross-fitting estimators developed for the observational study setting (Chernozhukov et al.
2018), which consists of the following two-stage procedure: (a) train predictive machine learning models ê(·)
of Wi on Xi (the propensity model) and m̂(·) of Yi on Xi (the response model), and then (b) use the
out-of-sample residuals Wi − ê(Xi) and Yi − m̂(Xi) in a final stage regression. The difference in the experimental
setting is that the propensity scores are known and so no propensity model is needed. Wu and Gagnon-Bartsch
(2017) study the behavior of (19) in the finite population setting where the only randomization
comes from the treatment assignment, and Wager et al. (2016) provide asymptotic results for estimating the
population average treatment effect. As long as the predicted value µ̂_{-i}^{(w)} does not use the i-th observation,
estimator (19) allows us to obtain asymptotically unbiased adjustments and valid inference using machine
learning algorithms such as random forests or neural networks. In practice, such predictions are obtained by
a cross validation-style procedure in which the data are split into K folds, and the predictions for each fold
k are obtained using a model fitted on data from the other K − 1 folds. (Cross validation on graphs is in
general difficult (Chen and Lei 2018; Li et al. 2018), but our procedure is unrelated to that problem because
the features are constructed from the entire graph and fixed beforehand.)
In this section we apply insights from the above works to the interference setting. Under Model 4, the
global average treatment effect has the form
$$\tau = \frac{1}{n} \sum_{i=1}^{n} \left[ \mu^{(1)}(X_i^{(1)}) - \mu^{(0)}(X_i^{(0)}) \right].$$

To develop an estimator of τ , consider the form of the OLS estimator given by equation (17), which can be
rewritten as

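$$\begin{aligned}
\hat\tau &= \bar y_1 - \bar y_0 + (\omega_1 - \bar X_1)^\top \hat\eta_1 - (\omega_0 - \bar X_0)^\top \hat\eta_0 \\
&= \omega_1^\top \hat\eta_1 - \omega_0^\top \hat\eta_0 + (\bar y_1 - \bar X_1^\top \hat\eta_1) - (\bar y_0 - \bar X_0^\top \hat\eta_0) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( (X_i^{(1)})^\top \hat\eta_1 - (X_i^{(0)})^\top \hat\eta_0 \right) + \frac{1}{N_1} \sum_{i=1}^{n} W_i \left( Y_i - X_i^\top \hat\eta_1 \right) - \frac{1}{N_0} \sum_{i=1}^{n} (1 - W_i) \left( Y_i - X_i^\top \hat\eta_0 \right).
\end{aligned} \tag{20}$$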

Now, by analogy, we define the estimator for the nonparametric setting as

$$\hat\tau = \frac{1}{n} \sum_{i=1}^{n} \left( \hat\mu_{-i}^{(1)}(X_i^{(1)}) - \hat\mu_{-i}^{(0)}(X_i^{(0)}) \right) + \frac{1}{N_1} \sum_{i=1}^{n} W_i \left( Y_i - \hat\mu_{-i}^{(1)}(X_i) \right) - \frac{1}{N_0} \sum_{i=1}^{n} (1 - W_i) \left( Y_i - \hat\mu_{-i}^{(0)}(X_i) \right). \tag{21}$$

One sees that equations (20) and (21) agree whenever µ̂^{(w)}(x) = α̂w + x^⊤ η̂w. Furthermore, equation (21) is
equal to its SUTVA version, equation (19), whenever X_i^{(0)} = X_i^{(1)} = Xi.
Because the units can be arbitrarily connected, the cross-fitting component partitions are not immedi-
ately guaranteed to be exactly independent, and so any theoretical guarantees must assume some form of
approximate independence of the out-of-sample predictions. In this work we leave such theoretical results
open for future work; our primary contribution is the proposal of estimator (21) and a bootstrap variance
estimation method that respects the empirical structure of interference.
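As an illustration of estimator (21), the sketch below uses K-fold cross-fitting so that each unit's predictions come from models fitted without that unit. The names `cross_fit_gate`, `linear_fit`, and the `fit` callback interface are hypothetical, and random fold assignment is an assumption of the sketch; any learner that maps training data to a prediction function could be plugged in.

```python
import numpy as np

def linear_fit(X_train, y_train):
    """Example learner: OLS with intercept, returned as a predict function."""
    Z = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(Z, y_train, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def cross_fit_gate(Y, W, X, X0, X1, fit, n_folds=5, seed=0):
    """Sketch of equation (21).

    X : (n, p) observed covariates
    X0: (n, p) covariates under global control, X_i^{(0)}
    X1: (n, p) covariates under global treatment, X_i^{(1)}
    fit: callable (X_train, y_train) -> predict function
    """
    Y, W = np.asarray(Y, dtype=float), np.asarray(W)
    X, X0, X1 = (np.asarray(a, dtype=float) for a in (X, X0, X1))
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    mu_obs = np.zeros(n)   # out-of-fold prediction at the observed (W_i, X_i)
    mu0_cf = np.zeros(n)   # out-of-fold prediction at X_i^{(0)}
    mu1_cf = np.zeros(n)   # out-of-fold prediction at X_i^{(1)}
    for k in range(n_folds):
        test = folds == k
        train = ~test
        f0 = fit(X[train & (W == 0)], Y[train & (W == 0)])
        f1 = fit(X[train & (W == 1)], Y[train & (W == 1)])
        mu0_cf[test] = f0(X0[test])
        mu1_cf[test] = f1(X1[test])
        mu_obs[test] = np.where(W[test] == 1, f1(X[test]), f0(X[test]))
    resid = Y - mu_obs
    tau_hat = (np.mean(mu1_cf - mu0_cf)
               + resid[W == 1].mean() - resid[W == 0].mean())
    return tau_hat, resid
```

The returned residuals are the out-of-fold residuals, which is the quantity a residual bootstrap would resample.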

4.1 Bootstrap variance estimation
Here we discuss a method for placing error bars on the estimate τ̂ defined in equation (21). We propose using
a bootstrap estimator to estimate the sampling variance. Under exogeneity (Assumption 3), the covariates
and residuals contribute orthogonally to the total variance, and so the model and residuals can be resampled
separately. For the residual portion, we take the initial fitting functions µ̂_{-i}^{(0)}(·) and µ̂_{-i}^{(1)}(·) and compute the
residuals
$$\hat\varepsilon_i = Y_i - W_i \hat\mu_{-i}^{(1)}(X_i) - (1 - W_i) \hat\mu_{-i}^{(0)}(X_i).$$
Instead of using the fixed, observed X1, . . . , Xn as in a standard residual bootstrap, we propose capturing
the entire variance induced by the covariate distribution by sampling a new Xi from its population distribution
for each bootstrap replicate. That is, for each of B bootstrap repetitions, we sample a new treatment
vector W^b and compute bootstrap covariates X_i^b = x_i(W_{-i}^b). The means are then computed using the fitted
functions as µ̂_{-i}^{(0)}(X_i^b) and µ̂_{-i}^{(1)}(X_i^b). Provided that the adjustments are consistent in the sup norm sense, that
is, that
$$\sup_x |\hat\mu^{(0)}(x) - \mu^{(0)}(x)| \xrightarrow{p} 0, \qquad \sup_x |\hat\mu^{(1)}(x) - \mu^{(1)}(x)| \xrightarrow{p} 0,$$

then µ̂^{(0)}(·), µ̂^{(1)}(·) serve as appropriate stand-ins for µ^{(0)}(·), µ^{(1)}(·) in large samples.
We construct an artificial bootstrap response
$$Y_i^b = W_i^b \hat\mu_{-i}^{(1)}(X_i^b) + (1 - W_i^b) \hat\mu_{-i}^{(0)}(X_i^b) + \varepsilon_i^b,$$

where the residuals ε_1^b, . . . , ε_n^b are sampled with replacement from the observed residuals ε̂1, . . . , ε̂n. We then
compute τ̂^b using data (Y_i^b, X_i^b, W_i^b), and then take the bootstrap distribution {τ̂^b}_{b=1}^{B} as an approximation
to the true distribution of τ̂. To construct a 1 − α confidence interval, one can calculate the endpoints using
approximate Gaussian quantiles,
$$\hat\tau \pm z_{\alpha/2} \sqrt{\mathrm{Var}(\hat\tau^b)}.$$
Alternatively, one may use the α/2 and 1−α/2 quantiles of the empirical bootstrap distribution (a percentile
bootstrap), which is preferable if the distribution of τ̂ is skewed.
With this approach, the rich literature of bootstrap estimation stemming from Efron (1979) becomes
available to the researcher. For example, one can use more complicated bootstrap methods to be more
faithful to the empirical distribution, such as incorporating higher-order features of the distribution via
bias-corrected and accelerated (BCa) intervals (Efron 1987), or handling heteroscedasticity via the wild
bootstrap (Wu 1986).
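The bootstrap described in this section can be sketched as below. The function names and callback interfaces (`draw_covariates`, `estimator`) are hypothetical, and the sketch makes two simplifying assumptions: it uses a single fitted response function per arm rather than unit-specific leave-out fits, and it draws each bootstrap treatment vector from an independent Bernoulli(π) design.

```python
import numpy as np
from statistics import NormalDist

def bootstrap_gate(resid, mu0, mu1, draw_covariates, estimator, pi,
                   B=200, seed=0):
    """Return B bootstrap replicates of tau_hat.

    resid: (n,) observed residuals
    mu0, mu1: fitted response functions, each mapping (n, p) -> (n,)
    draw_covariates: callable W -> (n, p) array, X_i^b = x_i(W^b_{-i})
    estimator: callable (Y, W, X) -> float
    """
    rng = np.random.default_rng(seed)
    resid = np.asarray(resid, dtype=float)
    n = len(resid)
    taus = np.empty(B)
    for b in range(B):
        Wb = rng.binomial(1, pi, size=n)            # new treatment vector
        Xb = draw_covariates(Wb)                    # new covariates
        eps_b = rng.choice(resid, size=n, replace=True)
        Yb = np.where(Wb == 1, mu1(Xb), mu0(Xb)) + eps_b
        taus[b] = estimator(Yb, Wb, Xb)
    return taus

def gaussian_ci(tau_hat, taus, alpha=0.05):
    """Interval from Gaussian quantiles and the bootstrap standard deviation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = np.std(taus, ddof=1)
    return tau_hat - z * se, tau_hat + z * se

def percentile_ci(taus, alpha=0.05):
    """Percentile bootstrap interval."""
    lo, hi = np.quantile(taus, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

The percentile interval makes no symmetry assumption, so comparing it against the Gaussian interval is a quick check for skewness in the bootstrap distribution.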

5 Simulations
This section is devoted to running a number of simulation experiments. Our goals in these simulations are
to (a) verify that our adjustment estimators and variance estimates are behaving as intended, (b) compare
the performance of our proposed estimators to that of existing inverse propensity weighted estimators based
on exposure models, and (c) empirically explore the behavior of our estimators in regimes of mild model
misspecification.

5.1 Simulation setup and review of exposure modeling


For the network G we use a subset of empirical social networks from the facebook100 dataset, an assortment
of complete online friendship networks for one hundred colleges and universities collected from a single-day
snapshot of Facebook in September 2005. A detailed analysis of the social structure of these networks was
given in Traud et al. (2011, 2012). We use an empirical network rather than an instance of a random
graph model in order to replicate as closely as possible the structural characteristics observed in real-world
networks. We use the largest connected components of the Caltech and Stanford networks. Some summary
statistics for the networks are given in Table 1.

network                      Caltech   Stanford
number of nodes                  762      11586
number of edges                16651     568309
diameter                           6          9
average pairwise distance       2.33       2.82

Table 1: Summary statistics for the facebook100 networks.

In all simulation regimes we compare our regression estimators to two other estimators, which we describe
now. As a baseline we use the SUTVA difference-in-means estimator
$$\hat\tau_{\mathrm{DM}} = \frac{1}{N_1} \sum_{i=1}^{n} W_i Y_i - \frac{1}{N_0} \sum_{i=1}^{n} (1 - W_i) Y_i.$$

5.1.1 Exposure modeling IPW estimators


We also compare to an inverse propensity weighted estimator derived from a local neighborhood exposure
model. We now briefly describe the exposure model-based estimators framed in the language of constant
treatment response assumptions (Manski 2013). For Yi (w) = µi (w)+εi , this approach partitions the space of
treatments W into classes of treatments that map to the same mean response µi (·) for unit i. The partition
function is assumed known, and is called an exposure function. The no-interference portion of SUTVA can be
specified as an exposure model, since no-interference is equivalent to the requirement that µi (w1 ) = µi (w2 )
for any two treatment vectors w1 , w2 ∈ W in which the i-th components of w1 and w2 agree. Manski (2013)
refers to this formulation as individualistic treatment response (ITR).
The exposure model most commonly used for local interference is the neighborhood treatment response
(NTR) assumption, which given a graph G, posits that µi (w1 ) = µi (w2 ) whenever w1 and w2 agree in
all components j such that j ∈ Ni ∪ {i}. In other words, NTR assumes that Yi depends on unit i’s own
treatment and possibly any other unit in its neighborhood Ni , but that it does not respond to changes in the
treatments of any units outside of its immediate neighborhood. For the purposes of estimating the global
treatment effect, one may use fractional q-NTR, where given a threshold parameter q ∈ (0.5, 1], q-NTR
assumes that a unit is effectively in global treatment if at least a fraction q of its neighbors are assigned to
treatment, and similarly for global control. NTR is thus a graph analog of partial interference for groups and
q-NTR is a corresponding version of stratified interference. The threshold q is a tuning parameter; larger
values of q result in less bias due to interference, but greater variance because there are fewer units available
for estimation. Eckles et al. (2017) provide some theoretical results for characterizing the amount of bias
reduction. There is little guidance for selecting q to manage this bias-variance tradeoff; Eckles et al.
(2017) use q = 0.75.
Aronow and Samii (2017) study the behavior of inverse propensity weighted (IPW) estimators based on
a well-specified exposure model. Toward this end, let
$$E_i^{(1)} = 1\left\{ \frac{1}{d_i} \sum_{j \in N_i} W_j \ge q \right\}, \qquad E_i^{(0)} = 1\left\{ \frac{1}{d_i} \sum_{j \in N_i} W_j \le 1 - q \right\}$$

be the events that unit i is q-NTR exposed to global treatment and q-NTR exposed to global control,
respectively. Let their expectations be denoted by
$$\pi_i^{(1)} = E(E_i^{(1)}), \qquad \pi_i^{(0)} = E(E_i^{(0)}),$$

which represent the propensity scores for unit i being exposed to the global potential outcome conditions.

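Then the inverse propensity weighted estimators under consideration are defined as
$$\hat\tau_{\mathrm{HT}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{E_i^{(1)} Y_i}{\pi_i^{(1)}} - \frac{E_i^{(0)} Y_i}{\pi_i^{(0)}} \right],$$
$$\hat\tau_{\text{Hájek}} = \left( \sum_{i=1}^{n} \frac{E_i^{(1)}}{\pi_i^{(1)}} \right)^{-1} \sum_{i=1}^{n} \frac{E_i^{(1)} Y_i}{\pi_i^{(1)}} - \left( \sum_{i=1}^{n} \frac{E_i^{(0)}}{\pi_i^{(0)}} \right)^{-1} \sum_{i=1}^{n} \frac{E_i^{(0)} Y_i}{\pi_i^{(0)}}. \tag{22}$$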

The estimator τ̂HT is the Horvitz-Thompson estimator (Horvitz and Thompson 1952), and τ̂Hájek is the
Hájek estimator (Hájek 1971); these names stem from the survey sampling literature and are commonly used
in the interference literature. In the importance sampling and off-policy evaluation literatures, analogs of τ̂HT
and τ̂Hájek are known as unnormalized and self-normalized importance sampling estimators, respectively. In
the finite potential outcomes framework, the Horvitz-Thompson estimator is unbiased under the experimental
design distribution, but suffers from excessive variance when the probabilities of global exposure are small, as
is usually the case. The Hájek estimator, which forces the weights to sum to one and is thus interpretable as a
difference of weighted within-group means, incurs a small amount of finite sample bias but is asymptotically
unbiased, and is nearly always preferable to τ̂HT . For our simulations we will therefore avoid using τ̂HT .
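Both estimators are short to write down in code. The sketch below takes the exposure indicators and exposure probabilities as given inputs; the function name `ipw_estimators` is hypothetical, and the code assumes at least one unit is exposed to each condition so that the Hájek normalizers are nonzero.

```python
import numpy as np

def ipw_estimators(Y, E1, E0, p1, p0):
    """Horvitz-Thompson and Hajek contrasts between the two exposure conditions.

    Y: (n,) outcomes; E1, E0: (n,) 0/1 exposure indicators;
    p1, p0: (n,) exposure probabilities (all strictly positive).
    """
    Y, E1, E0, p1, p0 = (np.asarray(a, dtype=float) for a in (Y, E1, E0, p1, p0))
    w1, w0 = E1 / p1, E0 / p0                  # inverse propensity weights
    tau_ht = np.mean(w1 * Y - w0 * Y)          # normalizes by n
    tau_hajek = (w1 @ Y) / w1.sum() - (w0 @ Y) / w0.sum()  # self-normalized
    return tau_ht, tau_hajek
```

Because the Hájek weights sum to one within each arm, adding a constant to every outcome leaves the Hájek estimate unchanged, whereas the Horvitz-Thompson estimate generally shifts.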
One of the main insights in the exposure modeling framework developed by Aronow and Samii (2017)
is that even if the initial treatment assignment probability π is constant across units, the global treatment
propensity scores need not be; indeed, π_i^{(1)} and π_i^{(0)} depend on the network structure and choice of exposure
model. Therefore inverse propensity weighting is needed to produce unbiased (or consistent) estimators for
contrasts between exposures even in a Bernoulli randomized design.
Given a design and a (simple enough) exposure model, the propensities can be calculated exactly. If
the treatments are assigned according to independent Bernoulli coin flips, the exact exposure probabilities
are expressed straightforwardly using the binomial distribution function. That is, for treatment probability
π = P(Wi = 1) and degree di , the probability of unit i being q-NTR exposed to global treatment is
\[
\pi_i^{(1)} = \pi \left( 1 - F_{d_i,\pi}(\lfloor d_i q \rfloor) \right), \tag{23}
\]
where
\[
F_{n,p}(k) = \sum_{j=0}^{k} \binom{n}{j} p^j (1 - p)^{n-j}
\]
is the distribution function of a Binomial(n, p) random variable. Similarly, the probability that unit i is
q-NTR exposed to global control is
\[
\pi_i^{(0)} = (1 - \pi) F_{d_i,\pi}(\lfloor d_i (1 - q) \rfloor). \tag{24}
\]
In a cluster randomized design, exposure probabilities for fractional neighborhood exposure can be computed
using a dynamic program (Ugander et al. 2013).
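For the Bernoulli design, equations (23) and (24) can be evaluated directly. The following Python sketch implements the two formulas exactly as printed; the function names are illustrative assumptions, and the example unit (degree 4, π = 0.5, q = 0.75) is hypothetical.

```python
from math import comb, floor


def binom_cdf(k, n, p):
    """F_{n,p}(k) = P(Binomial(n, p) <= k)."""
    if k < 0:
        return 0.0
    k = min(k, n)
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def exposure_probs(d, pi, q):
    """Return (pi1, pi0) for a unit of degree d, per equations (23) and (24)."""
    pi1 = pi * (1.0 - binom_cdf(floor(d * q), d, pi))
    pi0 = (1.0 - pi) * binom_cdf(floor(d * (1 - q)), d, pi)
    return pi1, pi0


# Hypothetical unit with four neighbors under a fair-coin design.
p1, p0 = exposure_probs(d=4, pi=0.5, q=0.75)
```

Both probabilities fall quickly as the degree grows, which is the source of the large inverse weights discussed above.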
A further comment on the propensity scores π_i^(1) and π_i^(0) is necessary. Importantly, these propensity
scores are exact only to the extent to which the exposure model is correct. Thus, when the exposure model
is unknown, these propensity scores should be viewed as estimated propensities, in which case even small
estimation errors in the propensities can lead to large estimation errors in their inverses. It is therefore the
case that τ̂HT and τ̂Hájek can suffer from the same high-variance problems as IPW estimators based on a
fitted propensity model used in observational studies, even if the exposure model is only mildly misspecified.
In our simulations we use the Hájek estimator, τ̂Hájek , defined by equation (22) and the q-NTR exposure
probabilities (23) and (24). We fix q = 0.75, which is the same threshold used in Eckles et al. (2017). For
the other values of q that we tried, performance was roughly on par with or worse than q = 0.75.

5.2 Variance estimates in a linear model


We first run a basic simulation in which we compute estimates, variances and variance estimates in an
ordinary linear model. We consider two covariates,
\[
X_{1,i} = \frac{1}{d_i} \sum_{j \in N_i} W_j,
\]
the proportion of treated neighbors, and
\[
X_{2,i} = \sum_{j \in N_i} W_j,
\]

the number of treated neighbors. It is conceivable that Yi may depend on both of these features. Let the
data-generating process for Yi be as in Model 3; that is, the mean function for Yi is linear in Xi = (X1,i , X2,i ),
given parameters α_w ∈ R and β_w = (β_{w,1}, β_{w,2}) ∈ R² for w = 0, 1. We simulate ε_i ∼ N(0, σ²).
Let d̄ = n^{-1} ∑_{i=1}^{n} d_i be the average degree of G. Then the true global treatment effect is
\[
\alpha_1 - \alpha_0 + \beta_{1,1} + \bar{d} \beta_{1,2}.
\]
We fix α1 = 1 and α0 = 0, so that the direct effect is 1. We fix the noise variance at σ² = 1. We vary
the “proportion” coordinate of β0 in {0, 0.1}, the “number” coordinate of β0 in {0, 0.01}, the “proportion”
coordinate of β1 in {0, 0.2}, and the “number” coordinate of β1 in {0, 0.05}, giving 16 total parameter
configurations. SUTVA holds when β0 = β1 = 0.
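The setup above can be sketched in Python: building the two covariates from a neighbor list, enumerating the 16 configurations, and evaluating the stated formula for the true effect. All names, the four-node path graph, and the assignment vector are hypothetical; the sketch also assumes every unit has at least one neighbor.

```python
from itertools import product

ALPHA1, ALPHA0 = 1.0, 0.0  # intercepts fixed in the text


def neighbor_covariates(neighbors, w):
    """Return (X1, X2): proportion and number of treated neighbors per unit."""
    x1, x2 = [], []
    for nbrs in neighbors:
        treated = sum(w[j] for j in nbrs)
        x2.append(treated)
        x1.append(treated / len(nbrs))  # assumes len(nbrs) > 0
    return x1, x2


def global_treatment_effect(beta1, mean_degree):
    """alpha_1 - alpha_0 + beta_{1,1} + dbar * beta_{1,2}, as stated in the text."""
    prop_coef, num_coef = beta1
    return ALPHA1 - ALPHA0 + prop_coef + mean_degree * num_coef


# The 16 (beta_0, beta_1) configurations; each beta is (proportion, number).
configs = [
    ((b0_prop, b0_num), (b1_prop, b1_num))
    for b1_prop, b1_num, b0_prop, b0_num in product(
        (0.0, 0.2), (0.0, 0.05), (0.0, 0.1), (0.0, 0.01)
    )
]

# Hypothetical path graph 0-1-2-3 and one assignment vector.
neighbors = [[1], [0, 2], [1, 3], [2]]
w = [1, 0, 1, 1]
x1, x2 = neighbor_covariates(neighbors, w)
mean_degree = sum(len(nbrs) for nbrs in neighbors) / len(neighbors)
tau = global_treatment_effect((0.2, 0.05), mean_degree)
```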
We use equation (16) to estimate the variance of the adjusted estimator, using 200 bootstrap samples
from the covariate distribution to calculate the inverse covariance matrices. We also compute the difference-
in-means (DM) estimator for comparison purposes, for which we use the standard Neyman conservative
variance estimate
\[
\frac{S_0^2}{N_0} + \frac{S_1^2}{N_1},
\]
where S_0² and S_1² are the within-group sample variances. We compute confidence intervals based on Gaussian
quantiles for a 90% nominal coverage rate.
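A minimal Python sketch of the difference-in-means estimate with the Neyman conservative variance and a Gaussian interval follows. The function name and toy data are hypothetical, and the within-group variances use the usual n − 1 denominator.

```python
from statistics import NormalDist, mean, variance


def dm_with_neyman_ci(y, w, level=0.90):
    """Difference in means, Neyman variance estimate, and Gaussian interval."""
    y1 = [yi for yi, wi in zip(y, w) if wi == 1]
    y0 = [yi for yi, wi in zip(y, w) if wi == 0]
    estimate = mean(y1) - mean(y0)
    # S_0^2 / N_0 + S_1^2 / N_1 with sample variances
    var_hat = variance(y0) / len(y0) + variance(y1) / len(y1)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided Gaussian quantile
    half_width = z * var_hat**0.5
    return estimate, var_hat, (estimate - half_width, estimate + half_width)


# Hypothetical outcomes: three control units followed by three treated units.
y = [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]
w = [0, 0, 0, 1, 1, 1]
estimate, var_hat, (lo, hi) = dm_with_neyman_ci(y, w)
```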
We then run 1000 simulated experiments, sampling a new treatment vector W and computing the two
estimators each time. The results are shown in Table 2. The bias of the DM estimator increases with
greater departures from SUTVA, and confidence intervals for that estimator are only theoretically valid
under SUTVA (the first row in Table 2). Otherwise, the confidence intervals are anticonservative, both due
to bias of the DM estimator and due to invalidity of the Neyman variance estimate, which assumes fixed
potential outcomes. On the other hand, the adjustment estimator is unbiased and has valid coverage for all
parameter configurations.

β0 β1 τ Bias(DM) Bias(adj) SE(DM) SE(adj) SE-ratio(DM) SE-ratio(adj) Coverage(DM) Coverage(adj)
(0, 0) (0, 0) 1 0.007 -0.013 0.074 1.149 0.982 1.031 0.891 0.913
(0, 0.01) (0, 0) 1 -0.006 -0.028 0.072 1.189 1.004 0.996 0.899 0.901
(0.1, 0) (0, 0) 1 -0.053 0.027 0.072 1.151 1.004 1.029 0.808 0.919
(0.1, 0.01) (0, 0) 1 -0.052 0.017 0.074 1.155 0.973 1.025 0.801 0.906
(0, 0) (0, 0.05) 1.05 -0.025 0.005 0.073 1.211 0.990 0.976 0.866 0.882
(0, 0.01) (0, 0.05) 1.05 -0.026 0.005 0.070 1.173 1.036 1.010 0.894 0.909
(0.1, 0) (0, 0.05) 1.05 -0.075 0.058 0.075 1.232 0.960 0.961 0.707 0.884
(0.1, 0.01) (0, 0.05) 1.05 -0.078 -0.019 0.073 1.151 0.996 1.031 0.699 0.912
(0, 0) (0.2, 0) 1.2 -0.097 0.058 0.073 1.168 0.993 1.014 0.615 0.910
(0, 0.01) (0.2, 0) 1.2 -0.104 0.007 0.073 1.142 0.999 1.039 0.577 0.916
(0.1, 0) (0.2, 0) 1.2 -0.151 0.002 0.073 1.197 0.991 0.988 0.334 0.892
(0.1, 0.01) (0.2, 0) 1.2 -0.152 0.044 0.072 1.208 1.014 0.981 0.315 0.894
(0, 0) (0.2, 0.05) 1.25 -0.125 -0.054 0.072 1.154 1.003 1.025 0.476 0.908
(0, 0.01) (0.2, 0.05) 1.25 -0.130 -0.014 0.070 1.149 1.029 1.031 0.446 0.920
(0.1, 0) (0.2, 0.05) 1.25 -0.174 0.016 0.074 1.206 0.983 0.982 0.232 0.894
(0.1, 0.01) (0.2, 0.05) 1.25 -0.182 -0.014 0.074 1.172 0.985 1.012 0.194 0.903

Table 2: Results of the basic simulation setup from Section 5.2, showing bias, true standard error, ratio
of estimated standard error to true standard error, and coverage rate of 90% nominal Gaussian confidence
interval. Coverage rates which fall within a 99% one-sided interval of the nominal coverage rate (that is,
coverage rates above 0.9 − 2.326 √(0.9 × 0.1/1000) ≈ 0.878) are bolded.
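The threshold in the caption is a one-sided binomial bound on the simulated coverage rate. A short Python sketch of the arithmetic, with hypothetical variable names and two coverage values taken from Table 2 as examples:

```python
from math import sqrt

nominal = 0.90   # nominal coverage rate
n_sims = 1000    # number of simulated experiments
z99 = 2.326      # one-sided 99% Gaussian quantile, as used in the caption

threshold = nominal - z99 * sqrt(nominal * (1 - nominal) / n_sims)


def consistent_with_nominal(coverage_rate):
    """True when a simulated coverage rate lies above the caption's threshold."""
    return coverage_rate > threshold


adj_example = consistent_with_nominal(0.882)  # an adj entry from Table 2
dm_example = consistent_with_nominal(0.866)   # a DM entry from Table 2
```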

5.3 Estimator weights
Both the OLS adjustment estimator and the Hájek estimator are linear reweighting estimators. The OLS
weights are given by equations (13) and (14), and the Hájek weights are implied by the definition of the
Hájek estimator in equation (22). Both depend on only the network structure, treatment assignment, and
exposure model or choice of covariates, but not on the realized outcome variable. The weights for a single
Bernoulli(0.5) draw of the treatment vector W for the Caltech graph are displayed in Figure 3, assuming
that the Hájek estimator is to be constructed under the q-NTR exposure condition for q = 0.75, and the
OLS estimator uses the fraction of treated neighbors as the only adjustment variable. We see that the
Hájek estimator trusts a few select observations to be representative of the global exposure conditions. A
graph cluster randomized design would increase the number of units used in the Hájek estimator. The OLS
estimator, on the other hand, gives all units non-zero weight. Some units that are in the treatment group but
are surrounded by control individuals are treated as diagnostic for the control mean and vice versa, which is
a reasonable thing to do if the linear model is true.

Figure 3: Estimator weights for the case where the only covariate is the proportion of treated neighbors. (left)
The Hájek estimator selects a few individuals from treatment and control and takes a weighted average of
those individuals with weights determined by exposure probabilities. Vertical dotted lines are the thresholds
used for selecting observations. (right) The regression estimator takes a more democratic approach, giving
all units non-zero weight.

5.4 Dynamic linear-in-means


Here we replicate portions of the simulation experiments conducted by Eckles et al. (2017). That paper uses
a discrete-time dynamic model, which can be viewed as a noisy best-response model (Blume 1995), in which
individuals observe and respond to the behaviors of their peers, using that information to guide their actions
in the following time period. Given responses Yi,t−1 for time period t − 1, let

\[
Z_{i,t-1} = \frac{1}{d_i} \sum_{j \in N_i} Y_{j,t-1},
\]
a time-varying version of Zi defined in equation (5), which represents the average behavior of unit i’s
neighbors at time t − 1. Then we model

\[
Y_{i,t} = \alpha + \beta W_i + \gamma Z_{i,t-1} + \varepsilon_{i,t}. \tag{25}
\]

The noise is taken to be ε_{i,t} ∼ N(0, σ²), which is independent and homoscedastic across time and individuals.
Eckles et al. (2017) add an additional thresholding step that transforms equation (25) into a probit model
and Y into a binary outcome variable, but here we study the non-thresholded case which is closer to the
original linear-in-means model specified by Manski (1993). Starting from initial values Yi,0 = 0, the process
is run up to a maximum time T and then the final outcomes are taken to be Yi = Yi,T . The choice of T ,
along with the strength of the spillover effect γ, governs the amount of interference. If T is larger than the
diameter of the graph, then the interference pattern is fully dense, and no exposure model holds.
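The dynamic process can be sketched in Python as follows. This assumes the non-thresholded model in which each unit responds to its neighbors' average outcome from the previous period, starting from zero; the function name, the four-node ring, and the parameter values are hypothetical, and every unit is assumed to have at least one neighbor.

```python
import random


def simulate_linear_in_means(neighbors, w, alpha, beta, gamma, sigma, T, rng):
    """Run the process for T periods from Y_{i,0} = 0 and return Y_{i,T}."""
    n = len(neighbors)
    y = [0.0] * n
    for _ in range(T):
        # Average neighbor outcome from the previous period.
        z = [sum(y[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]
        y = [
            alpha + beta * w[i] + gamma * z[i] + rng.gauss(0.0, sigma)
            for i in range(n)
        ]
    return y


# Hypothetical ring of four units; sigma = 0 makes the illustration deterministic.
ring = [[1, 3], [0, 2], [1, 3], [0, 2]]
rng = random.Random(0)
y_mixed = simulate_linear_in_means(ring, [1, 0, 0, 0], 0.0, 1.0, 0.5, 0.0, 2, rng)
y_treat = simulate_linear_in_means(ring, [1, 1, 1, 1], 0.0, 1.0, 0.5, 0.0, 2, rng)
```

In the mixed assignment, the two neighbors of the single treated unit pick up a spillover after two periods, while the unit at distance two does not; one more period would reach it.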
We construct two different adjustment variables. First, let
\[
X_{1,i} = \frac{1}{d_i} \sum_{j \in N_i} W_j,
\]

the proportion of treated neighbors. Now let


\[
N_i^{(2)} = \{ k \in [n] \setminus \{i\} : \text{there exists } j \text{ such that } A_{ij} A_{jk} = 1 \}
\]

be the two-step neighborhood of unit i. Then define


\[
X_{2,i} = \frac{1}{|N_i^{(2)}|} \sum_{k \in N_i^{(2)}} A_{ij} A_{jk} W_k,
\]

the proportion of individuals belonging to N_i^(2) who are treated. (Note that unit i itself does not belong to
its own two-step neighborhood.)
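A small Python sketch of the two-step neighborhood and the associated covariate follows. It implements the verbal description (the proportion of units in the two-step neighborhood who are treated); the function names and toy graphs are hypothetical, and returning 0 for an empty two-step neighborhood is an assumption of this sketch.

```python
def two_step_neighborhood(neighbors, i):
    """Units k != i reachable from i by a path of exactly two edges."""
    reach = set()
    for j in neighbors[i]:
        for k in neighbors[j]:
            if k != i:
                reach.add(k)
    return reach


def frac_treated_two_step(neighbors, w, i):
    """Proportion of the two-step neighborhood of i that is treated."""
    n2 = two_step_neighborhood(neighbors, i)
    if not n2:
        return 0.0  # assumption: define the covariate as 0 when the set is empty
    return sum(w[k] for k in n2) / len(n2)


# Hypothetical path graph 0-1-2-3 and triangle 0-1-2.
path = [[1], [0, 2], [1, 3], [2]]
triangle = [[1, 2], [0, 2], [0, 1]]
w = [1, 0, 1, 1]

path_sets = [two_step_neighborhood(path, i) for i in range(4)]
path_x2 = [frac_treated_two_step(path, w, i) for i in range(4)]
triangle_set = two_step_neighborhood(triangle, 0)
```

The triangle shows that a direct neighbor also belongs to the two-step neighborhood whenever it can be reached through a common neighbor.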
We use a small-world network (Watts and Strogatz 1998), which is the random graph model used in
the simulations by Eckles et al. (2017), with n = 1000 vertices, initial neighborhood size 10, and rewiring
probability 0.1. We also run our simulation on the empirical Caltech network.
As in Eckles et al. (2017), we compute the “true” global treatment effects by Monte Carlo simulation.
For every parameter configuration we sample 5000 instances of the response vector under global exposure
to treatment W = 1, and 5000 instances of the response vector under global exposure to control W = 0,
and then average the resulting difference in response means. For the response model, we fix the intercept at
α = 0 and the direct effect at β = 1. We vary the spillover effect γ ∈ {0, 0.25, 0.5, 0.75, 1} and the maximum
number of time steps T ∈ {2, 4}. Larger values of γ and T indicate more interference. We also use two
different levels for the noise standard deviation, σ ∈ {1, 3}.
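As a cross-check on the Monte Carlo values, the non-thresholded model admits a simple closed form. The derivation below is a sketch under stated assumptions: the response at time t depends on the neighbor average from time t − 1, the noise has mean zero, Y_{i,0} = 0, and every unit has at least one neighbor. Under a global assignment w ∈ {0, 1}, induction on t shows that the expected response m_t^{(w)} = E[Y_{i,t}] is the same for every unit, so that

```latex
\begin{align*}
m_t^{(w)} &= \alpha + \beta w + \gamma\, m_{t-1}^{(w)}, \qquad m_0^{(w)} = 0, \\
m_T^{(w)} &= (\alpha + \beta w) \sum_{s=0}^{T-1} \gamma^{s}, \\
\tau &= m_T^{(1)} - m_T^{(0)} = \beta \sum_{s=0}^{T-1} \gamma^{s}.
\end{align*}
```

For example, with β = 1, γ = 0.5 and T = 2 this gives τ = 1.5, and with γ = 1 and T = 4 it gives τ = 4; the expression does not depend on α, σ, or the graph.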
We consider two versions of the linear adjustment estimator defined in equation (12), one that adjusts
for X1,i only, and one that adjusts for both X1,i and X2,i . The first model controls for one-step neighbor-
hood information, whereas the second model controls for both one- and two-step neighborhood information.
We compare to the difference-in-means estimator and the Hájek estimator with q = 0.75 fractional NTR
exposure.
We emphasize that all of the estimators that we consider are misspecified under the data generating
process that we use in this simulation. For T ≥ 2, local neighborhood exposure fails, so the propensity scores
used in the Hájek estimator do not align with the true propensity scores. Our adjustment estimators are also
misspecified for T ≥ 2; not only is the linear model misspecified, but the residuals are neither independent
nor exogenous, violating Assumption 3.
The results are displayed in Figure 4. We see that the two OLS adjustment estimators are uniformly
better at bias reduction than the Hájek estimator. The two-step adjustment is nearly unbiased even though
it is misspecified, even in the presence of strong spillover effects. This is because interference is dissipating
exponentially, so that units don’t really respond to the behavior of individuals that are distance 3 or 4 away.
The two-step adjustment has higher variance than the one-step adjustment because it involves fitting a
more complex model. Furthermore, estimators appear to have more trouble handling the real-world network
structure of the Caltech network, compared to the artificial small-world network.

Figure 4: Results for linear-in-means simulation. dm is the difference-in-means estimator, hajek is the Hájek
estimator, adj1 is adjustment based on a one-step neighborhood, and adj2 is adjustment based on a two-step
neighborhood.

The difference-in-means estimator outperforms the adjustment estimators in regimes of weak interference,
which is expected since difference-in-means is the best that can be done under correct specification of SUTVA.
In terms of RMSE the Hájek estimator sometimes outperforms the two-step adjustment estimator because
of large variance. However, if the main goal is robustness to interference, then unconfounded estimation
coupled with valid confidence intervals is likely the priority over optimizing an error metric such as RMSE.
In this case, since the Hájek estimator neither achieves sufficient bias reduction nor provides correct coverage,
it has no real advantage over the adjustment estimators.
Figure 5 displays the coverage rates obtained from variance estimates using equation (16) under the dynamic
treatment response setup. The coverage is not always correct due to misspecification, especially for adj1.
We see that coverage rates for adj2 are often conservative even though it too is misspecified. We note that
standard variance estimators for the difference-in-means estimator and those derived in Aronow and Samii
(2017) for the Hájek estimator also would fail here because they rely on correct specification of SUTVA and
an exposure model, respectively. In short, we are plagued with the same difficulties that beset attempting
to do valid inference in observational studies when we do not know whether unconfoundedness holds.

Figure 5: Coverage rates for 90% nominal interval.

5.5 Average + aggregate peer effects


In this example we consider a response model in which individuals respond partially to the average behavior
of their peers and partially to the aggregate behavior of their peers. Let
\[
X_i^{\text{frac}} = \frac{1}{d_i} \sum_{j \in N_i} W_j
\]

be the fraction of treated neighbors and

\[
X_i^{\text{num}} = \sum_{j \in N_i} W_j
\]

be the number of treated neighbors. X_i^frac captures a notion of fractional neighborhood exposure and X_i^num
captures a notion of absolute neighborhood exposure. It seems reasonable that both of these features may
contribute to interference. In order to use an exposure model estimator one would need to focus on either
fractional exposure or absolute exposure, or otherwise define a more complicated exposure model, but our
adjustments easily handle both features.
We consider the following response function:
\[
Y_i = -5 + 2(2 + E_i) W_i + 0.03 X_i^{\text{frac}} + \frac{1}{1 + 0.001 e^{-0.03 (X_i^{\text{num}} - 300)}} + \frac{10}{3 + e^{-8 (X_i^{\text{frac}} - 0.4)}} + \varepsilon_i,
\]

where Ei ∼ N (0, 2) introduces heterogeneity into the direct effect and εi ∼ N (0, 1) is homoscedastic noise.
This function captures a possible way in which individuals could respond nonlinearly to their peer exposures
through X_i^frac and X_i^num. Figure 6 plots a single draw of this response on individuals from the Stanford
network. The continuous response exhibits a logistic dependence on both covariates. We see that individuals
with less than half of their neighbors exposed to the treatment condition experience a steadily increasing
peer effect as the proportion of treated neighbors increases. For individuals with more than half of their
neighbors exposed to the treatment condition, the effect is nearly constant across values of X_i^frac, capturing
the idea that after a certain threshold observing additional peer exposures doesn’t add much. For X_i^num, we
see that a small number of treated neighbors essentially contributes no interference, but once a large number
of neighbors are exposed to treatment this has a measurable impact on the response. We also see that there
is a noticeable bump around X_i^frac = 0.5; this is because individuals with peers nearly equally assigned to
the two groups are more likely to have high degree. The model extends the idea of neighborhood exposure
to capture the intuition that having a high proportion of treated neighbors is evidence for being subject to
interference, but such evidence is stronger when the individual in question has many friends and not just
one or two friends. The true treatment effect is τ = 6.336, which was computed using 2000 Monte Carlo
draws each of global treatment and global control.

Figure 6: One draw of the covariates and response for the nonlinear setup. The left panel shows the
relationship between the two covariates, and the right two panels show the relationship of the response with
each covariate. The horizontal axis for “number of treated neighbors” (X_i^num) is on a logarithmic scale. A
local linear regression, for exploratory purposes, is plotted in blue.

In our experience larger populations seem to be needed for fitting the more complex, nonlinear functions,
so we work with the Stanford network which has 11586 nodes. We predict the response surfaces using a
generalized additive model (GAM) (Hastie and Tibshirani 1986), which is easy and fast to fit in R, but other
methods such as local regression or random forests could of course be used instead. We split the dataset into
K = 2 folds, and within each fold, train a GAM separately in the treatment and control groups for a total
of 4 fitted models. The models are then used to obtain predicted responses on the held-out fold. Standard
errors were computed via the bootstrap as described in Section 4.1, using 50 bootstrap replications.
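The fold mechanics can be sketched in Python. This is not the GAM fit described above: to stay self-contained it swaps in a one-covariate least-squares line for each arm, and it forms a simple plug-in contrast of predictions at the global-treatment and global-control covariate values, which is assumed to mirror the estimator defined earlier in the paper only in outline. All names and the toy data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single covariate."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope


def cross_fit_gte(x, w, y, folds, x_gt, x_gc):
    """Cross-fitted plug-in contrast.

    x, w, y    : observed covariate, treatment, and outcome per unit
    folds      : fold label (0..K-1) per unit
    x_gt, x_gc : each unit's covariate value under global treatment / control
    """
    n_folds = max(folds) + 1
    pred1, pred0 = [], []
    for k in range(n_folds):
        train = [i for i in range(len(y)) if folds[i] != k]
        held_out = [i for i in range(len(y)) if folds[i] == k]
        # One model per arm, fitted on the other fold(s) only.
        t_idx = [i for i in train if w[i] == 1]
        c_idx = [i for i in train if w[i] == 0]
        a1, b1 = fit_line([x[i] for i in t_idx], [y[i] for i in t_idx])
        a0, b0 = fit_line([x[i] for i in c_idx], [y[i] for i in c_idx])
        # Predict both counterfactuals for the held-out units.
        pred1 += [a1 + b1 * x_gt[i] for i in held_out]
        pred0 += [a0 + b0 * x_gc[i] for i in held_out]
    return sum(pred1) / len(pred1) - sum(pred0) / len(pred0)


# Hypothetical noise-free data: Y = 2 + 3x if treated, Y = 1 + 0.5x if control.
x = [0.0, 0.5, 1.0, 0.25, 0.0, 0.5, 1.0, 0.75]
w = [1, 1, 0, 0, 1, 1, 0, 0]
y = [2 + 3 * xi if wi else 1 + 0.5 * xi for xi, wi in zip(x, w)]
folds = [0, 0, 0, 0, 1, 1, 1, 1]
tau_hat = cross_fit_gte(x, w, y, folds, x_gt=[1.0] * 8, x_gc=[0.0] * 8)
```

With K = 2 folds and two arms, four models are fitted in total, matching the count given in the text.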
We compare to the difference-in-means estimator, the Hájek estimator using a threshold of q = 0.75
on the X_i^frac variable, and the OLS adjustment. The results are displayed in Table 3. The DM estimator
exhibits the most bias, as it does not control for any sort of interference. The Hájek estimator removes
some bias, but because it is based on a fractional exposure model it is unable to respond to the effect of
having a high treated degree. Both the OLS and GAM estimators remove about 95% of the bias. The GAM
adjustment does only slightly better than OLS; for this setup what matters most is controlling for both axes
of the interference covariate, and the added flexibility provided by the GAM does not seem to be crucial.
We note also that the average bootstrapped standard error is 1.076 times the true standard error,
suggesting that confidence intervals built on this standard error will have approximately the correct length.

estimator   estimate   absolute bias (relative)   SE (ratio)
DM          -0.002     6.339 (100%)               0.077 (-)
Hájek       2.653      3.683 (58.1%)              1.601 (-)
OLS         6.683      0.347 (5.5%)               0.252 (0.942)
GAM         6.655      0.319 (5.0%)               0.246 (1.076)

Table 3: Nonlinear simulation results. The bias column displays the absolute and relative bias from the
truth τ = 6.336. The SE column displays the true standard error over 200 simulation replications, and for
the adjustment estimators we display in parentheses the ratio of the estimated standard error to the true
standard error.

6 Application to a farmer’s insurance experiment


In this section we apply our methods to a field experiment conducted on individuals in 185 villages in rural
China (Cai et al. 2015). The purpose of the study was to quantify the network (spillover) effects of certain
information sessions for a farmer’s weather insurance product on the eventual adoption of that product.
Though they do not frame their approach explicitly in the language of exposure models as in Aronow and
Samii (2017), the estimands that are implied by the regression coefficients in the models that they use in that
paper can be thought of as contrasts between exposures in an appropriately-defined exposure model. The
authors did not consider estimating the global treatment effect; our proposed methods essentially allow us
to perform an off-policy analysis of that estimand.
In the original field experiment, the researchers consider four treatment groups obtained by assigning
villagers to either a simple or intensive information session in one of two rounds that were held three days
apart. Here, for simplicity, we ignore the temporal distinction between the two rounds and consider a villager
to be treated if they were exposed to either of the two intensive sessions.[3] The outcome variable is a binary
indicator for whether the villager decided to purchase weather insurance.
We drop all villagers that were missing information about the treatment or the response, as well as
villages lacking network information. Though the study was conducted in separate villages (for the purpose
of administering the insurance information sessions), we combine all of the villagers into one large graph
G. The network has 4,832 nodes and 17,069 edges. Because some social connections exist across villages,
the villages do not partition exactly into separate connected components; our graph G has 36 connected
components. The summary statistics for the processed dataset are given in Table 4.

number of nodes                    4832
number of edges                    17069
number (%) treated                 2406 (49.8%)
average takeup (mean response)     44.6%

Table 4: Summary statistics for the Cai et al. (2015) dataset.

Now let N_i and N_i^(2) be the one- and two-step neighborhoods for unit i, as we have denoted previously.
We construct four variables from the graph: the fraction of units in N_i who are treated (frac1), the fraction
of units in N_i^(2) who are treated (frac2), the number of units in N_i who are treated (num1), and the number
of units in N_i^(2) who are treated (num2). Figure 7 displays the scatterplot matrix for these four variables as
well as the response. As might be expected, these four variables are positively correlated with each other,
and each is (weakly) positively correlated with the response variable. This correlation with the response
suggests that these variables may be useful for adjustment.
We compute the OLS adjusted estimator as well as an adjustment estimator that uses predictions from
a logistic regression with K = 5 folds. We construct standard errors using the variance estimator given by
equation (16) in the OLS case, and the parametric bootstrap variance estimator described in Section 4.1 with
200 bootstrap replications for the logistic regression case. We compare to the difference-in-means estimator
and Hájek estimators based on thresholding on the frac1 and frac2 variables with q = 0.75.

Figure 7: Scatterplot matrix for the variables used in the Cai et al. (2015) analysis.

[3] According to Cai et al. (2015) the treatment groups in the study are stratified by household size and farm size, but it is not
clear from the data if and how exactly this was done, so for simplicity we analyze the experiment as if it were an unstratified,
Bernoulli randomized experiment.

The estimates are displayed in Table 5. Considering the strong positive spillover effects discovered by Cai
et al. (2015), the difference-in-means estimate of 0.0774 is likely to be an underestimate of the true global
treatment effect. The Hájek estimators produce estimates of 0.1630 (one-step fractional NTR) and 0.1672
(two-step fractional NTR). Though we do not know the truth, it may make us nervous that these estimates
are more than twice the magnitude of the difference-in-means estimator, which if true would suggest that
the magnitude of the spillover effect is larger than the magnitude of the direct effect. The true treatment effect
likely falls in between the estimates produced by difference-in-means and Hájek (though we have no way
of knowing for sure). The OLS (0.1218) and logistic regression estimates (0.1197) are similar to each other
and both within this range; an advantage they have over the Hájek estimators is that they incorporate
information about the raw number of treated neighbors. The standard error estimates of 0.0561 (linear
regression adjustment) and 0.0559 (logistic regression adjustment) are quite wide, suggesting some caution
when interpreting this result.
Note that we have omitted computation of standard error estimates for the difference-in-means and
Hájek estimators for several reasons. First, SUTVA and the neighborhood exposure conditions both likely fail
to hold, so it is unclear how we should interpret such standard errors. Secondly, the conservative variance
estimators proposed for the Hájek estimator (cf. Sections 5, 7.2, Aronow and Samii 2017) are themselves
inverse propensity estimators relying on small propensities, and consequently we found them to be quite
unstable. For example, the variance estimate was often much greater than 1, which is the maximum possible
variance of a [−1, 1]-valued random variable. Of course, we also do not know if the exogeneity assumptions
hold or if other variables should be included. In the regression analyses conducted by Cai et al. (2015),
they also consider some other social network measures including indicator variables for varying numbers of
friends and differentiation between strong and weak ties; a more sophisticated analysis here could include
these features as well.
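The bound of 1 on the variance invoked above follows from a one-line argument, sketched here for a generic random variable X taking values in [−1, 1]:

```latex
\[
\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \le \mathbb{E}[X^2] \le 1,
\]
```

with equality when X equals +1 or −1 with probability 1/2 each. A difference of two weighted means of binary outcomes lies in [−1, 1], so a variance estimate far above 1 cannot reflect the true sampling variance.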

estimator            estimate   standard error
DM                   0.0774     -
Hájek 1 (q = 0.75)   0.1630     -
Hájek 2 (q = 0.75)   0.1672     -
Linear               0.1218     0.0561
Logistic (5-fold)    0.1197     0.0559

Table 5: Estimates and standard errors for estimating the global treatment effect of intensive session on
insurance adoption.

7 Discussion
We propose regression adjustments for interference in randomized experiments, which opens the world of the
rich regression adjustment literature to the interference setting. We show in simulation experiments that the
adjustments can do well, and we show how to do inference under exogeneity/unconfoundedness assumptions.
Our reanalysis of the Cai et al. (2015) study shows that our approach can produce sensible estimates of the
global treatment effect on real data.
There is much work to do to ensure that this approach can be reliably used in practical settings. First, we
would like to extend the methods to handle more complicated designs. In reality a combination of design-side
methods (graph clustering) and analysis-side methods (adjustment) could be the most effective approach.
Secondly, interference causes a randomized experiment to behave in some ways like an observational study,
so the assumptions required tend to look more like those required of regression adjustments in observational
studies rather than regression adjustments in randomized experiments. The most unsatisfying part of our
proposed approach is the strong reliance on exogeneity/unconfoundedness assumptions. However, such
requirements are not new, and are also needed to employ both standard estimators for observational studies in
the SUTVA setting and exposure modeling estimators in the interference setting.
This issue simply highlights the need for better methods that can detect interference; there are several
budding possibilities here. First, several works have proposed ways of doing sensitivity analysis for inter-
ference. VanderWeele et al. (2014) extend Robins et al. (2000)-style sensitivity analysis to cover some of
the interference estimators studied in Hudgens and Halloran (2008), and Egami (2017) proposes using an
auxiliary network to perform sensitivity analysis on estimates obtained using the primary network. But
clearly more work in this area is needed. Second, hypothesis tests for network or spillover effects of the type
developed in Aronow (2012), Athey et al. (2017a), and Basse et al. (2017) could be informative if applied to the
residuals of a fitted interference model. Finally, one can always use more robust standard error constructions
such as Eicker-Huber-White (Eicker 1967; Huber 1967; White 1980) standard errors for heteroscedasticity
or cluster bootstrap methods for dependence, though if the network structure is such that graph cluster
randomization is unlikely to work well, then clustered bootstrap probably won’t work well either. It is also
possible that work based on dependency central limit theorems like the ones considered in Chin (2018) could
be used to develop more robust variance calculations. Broadly, any of the above ideas can be
applied to the residuals of an interference model. If the bulk of interference can be captured in the mean
function, then it is perhaps easier to deal with the remaining interference in the residuals.

References
A. Abadie, S. Athey, G. W. Imbens, and J. Wooldridge. When should you adjust standard errors for
clustering? Technical report, National Bureau of Economic Research, 2017a.

A. Abadie, S. Athey, G. W. Imbens, and J. M. Wooldridge. Sampling-based vs. design-based uncertainty in
regression analysis. arXiv preprint arXiv:1706.01778, 2017b.
S. Aral. Networked experiments. Oxford, UK: Oxford University Press, 2016.
P. M. Aronow. A general method for detecting interference between units in randomized experiments.
Sociological Methods & Research, 41(1):3–16, 2012.
P. M. Aronow and J. A. Middleton. A class of unbiased estimators of the average treatment effect in
randomized experiments. Journal of Causal Inference, 1(1):135–154, 2013.
P. M. Aronow and C. Samii. Estimating average causal effects under general interference, with application
to a social network experiment. The Annals of Applied Statistics, 11(4):1912–1947, 2017.
S. Athey, D. Eckles, and G. W. Imbens. Exact p-values for network interference. Journal of the American
Statistical Association, pages 1–11, 2017a.
S. Athey, G. W. Imbens, and S. Wager. Approximate residual balancing: De-biased inference of average
treatment effects in high dimensions. 2017b. URL: https://arxiv.org/pdf/1604.07125.pdf.

L. Backstrom, P. Boldi, M. Rosa, J. Ugander, and S. Vigna. Four degrees of separation. In Proceedings of
the 4th Annual ACM Web Science Conference, pages 33–42. ACM, 2012.
S. Baird, J. A. Bohren, C. McIntosh, and B. Özler. Optimal design of experiments in the presence of
interference. Review of Economics and Statistics, (0), 2016.

A. Banerjee, A. G. Chandrasekhar, E. Duflo, and M. O. Jackson. The diffusion of microfinance. Science,
341(6144):1236498, 2013.
G. Basse and A. Feller. Analyzing two-stage experiments in the presence of interference. Journal of the
American Statistical Association, 113(521):41–55, 2018.

G. Basse, A. Feller, and P. Toulis. Exact tests for two-stage randomized designs in the presence of interference.
arXiv preprint arXiv:1709.08036, 2017.
L. Beaman, A. BenYishay, J. Magruder, and A. M. Mobarak. Can network theory-based targeting increase
technology adoption? Technical report, National Bureau of Economic Research, 2018.
R. Berk, E. Pitkin, L. Brown, A. Buja, E. George, and L. Zhao. Covariance adjustments for the analysis of
randomized field experiments. Evaluation Review, 37(3-4):170–196, 2013.
S. Bhagat, M. Burke, C. Diuk, I. O. Filiz, and S. Edunov. Three and a half degrees of separation. Facebook
research note, 2016. URL: https://research.fb.com/three-and-a-half-degrees-of-separation/.
A. Bloniarz, H. Liu, C.-H. Zhang, J. S. Sekhon, and B. Yu. Lasso adjustments of treatment effect estimates
in randomized experiments. Proceedings of the National Academy of Sciences, 113(27):7383–7390, 2016.
L. E. Blume. The statistical mechanics of best-response strategy revision. Games and Economic Behavior,
11(2):111–145, 1995.
Y. Bramoullé, H. Djebbari, and B. Fortin. Identification of peer effects through social networks. Journal of
Econometrics, 150(1):41–55, 2009.

J. Cai, A. De Janvry, and E. Sadoulet. Social networks and the decision to insure. American Economic
Journal: Applied Economics, 7(2):81–108, 2015.

K. Chen and J. Lei. Network cross-validation for determining the number of communities in network data.
Journal of the American Statistical Association, 113(521):241–251, 2018.
V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins.
Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal,
21(1):C1–C68, 2018.

A. Chin. Central limit theorems via Stein’s method for randomized experiments under interference. arXiv
preprint arXiv:1804.03105, 2018.
D. Choi. Estimation of monotone treatment effects in network experiments. Journal of the American
Statistical Association, pages 1–9, 2017.

D. R. Cox. Planning of experiments. 1958.


D. Eckles, B. Karrer, and J. Ugander. Design and analysis of experiments in networks: Reducing bias from
interference. Journal of Causal Inference, 5(1), 2017.
B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1–26, 1979.

B. Efron. Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397):
171–185, 1987.
N. Egami. Unbiased estimation and sensitivity analysis for network-specific spillover effects: Application to
an online network experiment. arXiv preprint arXiv:1708.08171, 2017.

F. Eicker. Limit theorems for regressions with unequal and dependent errors. 1967.
L. Forastiere, E. M. Airoldi, and F. Mealli. Identification and estimation of treatment and interference effects
in observational studies on networks. arXiv preprint arXiv:1609.06245, 2016.
S. Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.

D. A. Freedman. On regression adjustments to experimental data. Advances in Applied Mathematics, 40(2):
180–193, 2008a.
D. A. Freedman. On regression adjustments in experiments with several treatments. The annals of applied
statistics, 2(1):176–196, 2008b.
T. M. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement. Software: Practice
and Experience, 21(11):1129–1164, 1991.
M. S. Granovetter. The strength of weak ties. American Journal of Sociology, 78(6):1360–1380, 1973.
J. Hahn. On the role of the propensity score in efficient semiparametric estimation of average treatment
effects. Econometrica, pages 315–331, 1998.

J. Hájek. Comment on ‘An essay on the logical foundations of survey sampling, part 1’ by D. Basu. In
V. Godambe and D. A. Sprott, editors, Foundations of Statistical Inference, page 236, Toronto, 1971.
Holt, Rinehart and Winston.
T. Hastie and R. Tibshirani. Generalized additive models. Statistical Science, 1(3):297–310, 1986.

C. Haythornthwaite and B. Wellman. Work, friendship, and media use for information exchange in a
networked organization. Journal of the American society for information science, 49(12):1101–1114, 1998.
D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe.
Journal of the American Statistical Association, 47(260):663–685, 1952.
P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. 1967.

M. G. Hudgens and M. E. Halloran. Toward causal inference with interference. Journal of the American
Statistical Association, 103(482):832–842, 2008.
G. W. Imbens. Nonparametric estimation of average treatment effects under exogeneity: A review. Review
of Economics and Statistics, 86(1):4–29, 2004.

R. Jagadeesan, N. Pillai, and A. Volfovsky. Designs for estimating the treatment effect in networks with
interference. arXiv preprint arXiv:1705.08524, 2017.
D. A. Kim, A. R. Hwong, D. Stafford, D. A. Hughes, A. J. O’Malley, J. H. Fowler, and N. A. Christakis.
Social network targeting to maximise population behaviour change: a cluster randomised controlled trial.
The Lancet, 386(9989):145–153, 2015.

M. Kivelä, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, and M. A. Porter. Multilayer networks.
Journal of Complex Networks, 2(3):203–271, 2014.
T. Li, E. Levina, and J. Zhu. Network cross-validation by edge sampling. arXiv preprint arXiv:1612.04717,
2018.

W. Lin. Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique.
The Annals of Applied Statistics, 7(1):295–318, 2013.
L. Liu and M. G. Hudgens. Large sample randomization inference of causal effects in the presence of
interference. Journal of the American Statistical Association, 109(505):288–301, 2014.
C. F. Manski. Identification of endogenous social effects: The reflection problem. The Review of Economic
Studies, 1993.
C. F. Manski. Identification of treatment response with social interactions. The Econometrics Journal, 16
(1), 2013.
J. Neyman. On the application of probability theory to agricultural experiments. Essay on Principles. section
9. (translated and edited by D. M. Dabrowska and T. P. Speed, Statistical Science (1990), 5, 465-480).
Annals of Agricultural Sciences, 10:1–51, 1923.
E. L. Ogburn and T. J. VanderWeele. Causal diagrams for interference. Statistical Science, 29(4):559–578,
2014.
E. L. Paluck, H. Shepherd, and P. M. Aronow. Changing climates of conflict: A social network experiment
in 56 schools. Proceedings of the National Academy of Sciences, 113(3):566–571, 2016.
J. M. Robins, A. Rotnitzky, and D. O. Scharfstein. Sensitivity analysis for selection bias and unmeasured
confounding in missing data and causal inference models. In Statistical Models in Epidemiology, the
Environment, and Clinical Trials, pages 1–94. Springer, 2000.

P. R. Rosenbaum. Interference between units in randomized experiments. Journal of the American Statistical
Association, 102(477):191–200, 2007.
D. B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of
Educational Psychology, 66(5):688, 1974.
D. B. Rubin. Randomization analysis of experimental data: The Fisher randomization test comment. Journal
of the American Statistical Association, 1980.
M. Saveski, J. Pouget-Abadie, G. Saint-Jacques, W. Duan, S. Ghosh, Y. Xu, and E. M. Airoldi. Detecting
network effects: Randomizing over randomized experiments. In Proceedings of the 23rd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 1027–1035. ACM, 2017.

F. Sävje, P. M. Aronow, and M. G. Hudgens. Average treatment effects in the presence of unknown
interference. arXiv preprint arXiv:1711.06399, 2017.

D. L. Sussman and E. M. Airoldi. Elements of estimation theory for causal effects in the presence of network
interference. arXiv preprint arXiv:1702.03578, 2017.
S. J. Taylor and D. Eckles. Randomized experiments to detect and estimate social influence in networks.
arXiv preprint arXiv:1709.09636, 2017.

E. J. Tchetgen Tchetgen and T. J. VanderWeele. On causal inference in the presence of interference. Statistical
methods in medical research, 21(1):55–75, 2012.
A. L. Traud, E. D. Kelsic, P. J. Mucha, and M. A. Porter. Comparing community structure to characteristics
in online collegiate social networks. SIAM review, 53(3):526–543, 2011.
A. L. Traud, P. J. Mucha, and M. A. Porter. Social structure of Facebook networks. Physica A: Statistical
Mechanics and its Applications, 391(16):4165–4180, 2012.
S. Tyner, F. Briatte, and H. Hofmann. Network visualization with ggplot2. The R Journal, 2017.
J. Ugander and L. Backstrom. Balanced label propagation for partitioning massive graphs. In Proceedings
of the sixth ACM International Conference on Web Search and Data Mining, pages 507–516. ACM, 2013.

J. Ugander, B. Karrer, L. Backstrom, and C. Marlow. The anatomy of the facebook social graph. arXiv
preprint arXiv:1111.4503, 2011.
J. Ugander, B. Karrer, L. Backstrom, and J. Kleinberg. Graph cluster randomization: Network exposure
to multiple universes. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 329–337. ACM, 2013.

M. J. van der Laan. Causal inference for a population of causally connected units. Journal of Causal
Inference, 2(1):13–74, 2014.
T. J. VanderWeele and E. J. Tchetgen Tchetgen. Effect partitioning under interference in two-stage
randomized vaccine trials. Statistics & Probability Letters, 81(7):861–869, 2011.

T. J. VanderWeele, E. J. Tchetgen Tchetgen, and M. E. Halloran. Interference and sensitivity analysis.
Statistical Science: A review journal of the Institute of Mathematical Statistics, 29(4):687, 2014.
S. Wager, W. Du, J. Taylor, and R. J. Tibshirani. High-dimensional regression adjustments in randomized
experiments. Proceedings of the National Academy of Sciences, 113(45):12673–12678, 2016.

D. Walker and L. Muchnik. Design of randomized experiments in networks. Proceedings of the IEEE, 102
(12):1940–1951, 2014.
D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440, 1998.
H. White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity.
Econometrica: Journal of the Econometric Society, pages 817–838, 1980.

C.-F. J. Wu. Jackknife, bootstrap and other resampling methods in regression analysis. The Annals of
Statistics, pages 1261–1295, 1986.
E. Wu and J. Gagnon-Bartsch. The LOOP estimator: Adjusting for covariates in randomized experiments.
arXiv preprint arXiv:1708.01229, 2017.

A Proofs for Section 3
A.1 Proof of Proposition 1
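Proof. Let $\varepsilon_w$ be the $N_w$-vector of $w$-group residuals. As $y_w = X_w \beta_w + \varepsilon_w$ for $w = 0, 1$, conditionally on
$X_w$ being full rank we have

\[
E[\hat\beta_w] = E[(X_w^\top X_w)^{-1} X_w^\top y_w] = E[(X_w^\top X_w)^{-1} X_w^\top (X_w \beta_w + \varepsilon_w)] = \beta_w + E[(X_w^\top X_w)^{-1} X_w^\top \varepsilon_w].
\]

By iterated expectations the second term equals $E[(X_w^\top X_w)^{-1} X_w^\top E(\varepsilon_w \mid X_w)]$, which Assumption 3(a) ensures
is zero, and thus $\hat\beta_w$ is unbiased for $\beta_w$.
Unbiasedness of $\hat\tau$ then follows by linearity of expectation.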
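
The unbiasedness argument can be illustrated numerically. The following is a minimal sketch, not the paper's simulation setup: it assumes a single covariate with an intercept, Gaussian covariates, and errors drawn independently of the covariates, so that the exogeneity condition holds by construction. The function names and parameter values are hypothetical.

```python
import random


def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def mean_ols_estimate(alpha, beta, n=50, reps=2000, sigma=1.0, seed=0):
    """Average the OLS estimates over repeated draws of (X, eps)."""
    rng = random.Random(seed)
    sum_a = 0.0
    sum_b = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Errors are drawn independently of x, so E[eps | X] = 0 by construction
        # (an assumption of this sketch, standing in for Assumption 3(a)).
        y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
        a_hat, b_hat = ols_simple(x, y)
        sum_a += a_hat
        sum_b += b_hat
    return sum_a / reps, sum_b / reps


if __name__ == "__main__":
    print(mean_ols_estimate(alpha=1.0, beta=2.0))
```

Averaged over many replications, the estimates should sit close to the true coefficients; any remaining gap is Monte Carlo error rather than bias.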

A.2 Proof of Theorem 1


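Proof. We first calculate the variance of $\hat\beta_w$. By the law of total variance, we have

\begin{align*}
\operatorname{Var}(\hat\beta_w) &= \operatorname{Var}[(X_w^\top X_w)^{-1} X_w^\top y_w] \\
&= \operatorname{Var}[(X_w^\top X_w)^{-1} X_w^\top \varepsilon_w] \\
&= E[(X_w^\top X_w)^{-1} X_w^\top \operatorname{Var}(\varepsilon_w \mid X_w) X_w (X_w^\top X_w)^{-1}] + \operatorname{Var}[(X_w^\top X_w)^{-1} X_w^\top E(\varepsilon_w \mid X_w)].
\end{align*}

The second term is equal to zero by Assumption 3(a), and so by Assumption 3(b) and (c),

\[
\operatorname{Var}(\hat\beta_w) = \sigma^2 E[(X_w^\top X_w)^{-1}].
\]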

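The coefficient estimates of the two groups are uncorrelated because the residuals are uncorrelated. That
is,

\begin{align*}
\operatorname{Cov}(\hat\beta_0, \hat\beta_1) &= E[\operatorname{Cov}(\hat\beta_0, \hat\beta_1 \mid X)] + \operatorname{Cov}(E[\hat\beta_0 \mid X], E[\hat\beta_1 \mid X]) \\
&= E[\operatorname{Cov}((X_0^\top X_0)^{-1} X_0^\top \varepsilon_0, (X_1^\top X_1)^{-1} X_1^\top \varepsilon_1 \mid X)] + 0 \\
&= 0.
\end{align*}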

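Therefore,

\begin{align*}
\operatorname{Var}(\hat\tau) &= \operatorname{Var}(\omega_1^\top \hat\beta_1 - \omega_0^\top \hat\beta_0) \\
&= \sigma^2 \left( \omega_0^\top E[(X_0^\top X_0)^{-1}] \omega_0 + \omega_1^\top E[(X_1^\top X_1)^{-1}] \omega_1 \right),
\end{align*}

which produces the variance expression in equation (15).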
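
The identity $\operatorname{Var}(\hat\beta_w) = \sigma^2 E[(X_w^\top X_w)^{-1}]$ can be checked by simulation. The sketch below is an illustration under simplifying assumptions that are not taken from the paper: a scalar regression without intercept (so $X^\top X$ is the scalar $\sum_i x_i^2$), Gaussian covariates, and independent errors with common variance $\sigma^2$. All names and default values are hypothetical.

```python
import random


def simulate_beta_variance(beta=1.5, sigma=2.0, n=20, reps=20000, seed=1):
    """Compare the Monte Carlo variance of beta-hat with sigma^2 * E[(X'X)^{-1}]."""
    rng = random.Random(seed)
    estimates = []
    inv_gram_sum = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [beta * xi + rng.gauss(0.0, sigma) for xi in x]
        gram = sum(xi * xi for xi in x)  # scalar version of X'X
        estimates.append(sum(xi * yi for xi, yi in zip(x, y)) / gram)
        inv_gram_sum += 1.0 / gram
    mean_est = sum(estimates) / reps
    empirical = sum((b - mean_est) ** 2 for b in estimates) / (reps - 1)
    theoretical = sigma ** 2 * inv_gram_sum / reps  # sigma^2 times the average of 1 / X'X
    return empirical, theoretical


if __name__ == "__main__":
    emp, theo = simulate_beta_variance()
    print(emp, theo)
```

The two returned numbers should agree up to Monte Carlo error. Note that the expectation is taken over the inverse Gram matrix, not the inverse of the expected Gram matrix.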

A.3 Proof of Theorem 2


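The following lemma establishes some basic convergence results.
Lemma 1. Let $\bar X_w$ and $S_w$ denote the within-group sample means and covariances. Under Assumptions 1,
2, and the assumptions in the statement of Theorem 2, the following statements hold for $w = 0, 1$.
(a) $\bar X_w \xrightarrow{p} \mu_X$.
(b) $S_w \xrightarrow{p} \Sigma_X$.
(c) $\hat\eta_w \xrightarrow{p} \eta_w$.
(d) $\sqrt{n\pi}\,(\bar X_1 - \mu_X) \Rightarrow N(0, \Sigma_X)$ and $\sqrt{n(1-\pi)}\,(\bar X_0 - \mu_X) \Rightarrow N(0, \Sigma_X)$.
(e) $\sqrt{n\pi}\,(\hat\eta_1 - \eta_1) \Rightarrow N(0, \sigma^2 \Sigma_X^{-1})$ and $\sqrt{n(1-\pi)}\,(\hat\eta_0 - \eta_0) \Rightarrow N(0, \sigma^2 \Sigma_X^{-1})$.
(f) $\sqrt{n}\,(\bar\varepsilon_1 - \bar\varepsilon_0) \Rightarrow N\left(0, \frac{\sigma^2}{\pi(1-\pi)}\right)$.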

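Proof. (a) Because of Bernoulli random sampling it holds that

\[
\lim_{n\to\infty} E[\bar X_1] = \lim_{n\to\infty} E\left[ \frac{1}{N_1} \sum_{i=1}^n W_i X_i \right] = \mu_X.
\]

By conditioning on $X$ we have

\[
\operatorname{Var}(\bar X_1) = E[\operatorname{Var}(\bar X_1 \mid X)] + \operatorname{Var}[E(\bar X_1 \mid X)].
\]

For the first term, we have

\[
E[\operatorname{Var}(\bar X_1 \mid X)] = E\left[ \operatorname{Var}\left( \frac{1}{n\pi} \sum_{i=1}^n W_i X_i + r_n \,\Big|\, X \right) \right],
\]

where

\[
\operatorname{Var}\left( \frac{1}{n\pi} \sum_{i=1}^n W_i X_i \,\Big|\, X \right) = \frac{1-\pi}{n^2 \pi} \sum_{i=1}^n X_i^2 = O_p(n^{-1})
\]

and

\[
r_n = \left( \frac{1}{N_1} - \frac{1}{n\pi} \right) \sum_{i=1}^n W_i X_i = O_p(n^{-1})
\]

since $N_1/n \to \pi$ in probability. For the second term, we have

\[
\operatorname{Var}[E(\bar X_1 \mid X)] = \operatorname{Var}(\bar X) \to 0
\]

since $\bar X - \mu_X = o_p(1)$. Therefore, we conclude $\operatorname{Var}(\bar X_1) \to 0$, and so consistency follows from Chebychev's
inequality.
The result similarly holds for $\bar X_0$.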
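
The conditional variance formula in part (a) depends only on the $W_i$ being independent Bernoulli($\pi$) draws, so it can be verified exactly for a small population. The sketch below enumerates every assignment vector for a made-up scalar covariate vector and compares the exact variance of $\frac{1}{n\pi}\sum_i W_i X_i$ with the closed form $\frac{1-\pi}{n^2\pi}\sum_i X_i^2$. The function names and the values of `x` and `pi` are illustrative assumptions, not quantities from the paper.

```python
from fractions import Fraction
from itertools import product


def exact_variance(x, pi):
    """Variance of (1 / (n * pi)) * sum_i W_i x_i, with x fixed and W_i iid Bernoulli(pi)."""
    n = len(x)
    mean = Fraction(0)
    second_moment = Fraction(0)
    for w in product((0, 1), repeat=n):
        k = sum(w)
        prob = pi ** k * (1 - pi) ** (n - k)  # probability of this assignment vector
        t = sum(wi * xi for wi, xi in zip(w, x)) / (n * pi)
        mean += prob * t
        second_moment += prob * t * t
    return second_moment - mean ** 2


def closed_form(x, pi):
    """The expression (1 - pi) / (n^2 * pi) * sum_i x_i^2 from part (a)."""
    n = len(x)
    return (1 - pi) / (n ** 2 * pi) * sum(xi * xi for xi in x)


if __name__ == "__main__":
    x = [1, 2, -3, 5]
    pi = Fraction(3, 10)
    print(exact_variance(x, pi), closed_form(x, pi))
```

Exact rational arithmetic is used so that the comparison is an equality check rather than a floating-point approximation.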
(b) This result is established in a similar manner to part (a), using the fact that

\[
\frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^\top (X_i - \bar X) \xrightarrow{p} \Sigma_X,
\]

and the fact that fourth moments are bounded.

(c) The convergence of $\hat\eta_w$ to $\eta_w$ follows conditionally on $X$ from standard OLS theory. Then, letting

\[
S_w = \frac{1}{n} (X_w - \bar X_w)^\top (X_w - \bar X_w)
\]

denote the sample covariance matrix, we find

\begin{align*}
\operatorname{Var}(\hat\eta_w) &= \operatorname{Var}[E[\hat\eta_w \mid X]] + E[\operatorname{Var}[\hat\eta_w \mid X]] \\
&= \operatorname{Var}[\eta_w] + \frac{\sigma^2}{n} E[S_w^{-1}] \to 0.
\end{align*}

Convergence in probability follows from Chebychev's inequality.

(d) This result follows from Bernoulli sampling and the convergence of the finite population means, $\bar X \xrightarrow{p} \mu_X$.
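(e) As in the proof of part (c), we write

\[
\hat\eta_w = \frac{1}{n} S_w^{-1} (X_w - \bar X_w)^\top (y_w - \bar y_w).
\]

Since $y_w = X_w \eta_w + \varepsilon_w$, we can write

\begin{align*}
\sqrt{n}(\hat\eta_w - \eta_w) &= \sqrt{n} \left( \frac{1}{n} S_w^{-1} (X_w - \bar X_w)^\top (y_w - \bar y_w) - \eta_w \right) \\
&= \frac{1}{\sqrt{n}} S_w^{-1} (X_w - \bar X_w)^\top (\varepsilon_w - \bar\varepsilon_w) \\
&= \frac{1}{\sqrt{n}} \Sigma_X^{-1} (X_w - \bar X_w)^\top (\varepsilon_w - \bar\varepsilon_w) + R,
\end{align*}

where the remainder is

\[
R = \frac{1}{\sqrt{n}} (S_w^{-1} - \Sigma_X^{-1}) (X_w - \bar X_w)^\top (\varepsilon_w - \bar\varepsilon_w).
\]

Since $S_w^{-1} - \Sigma_X^{-1} = o_p(1)$ is implied by $S_w \xrightarrow{p} \Sigma_X$, and $\sqrt{n}(X_w - \bar X_w)^\top = O_p(1)$ and $\sqrt{n}(\varepsilon_w - \bar\varepsilon_w) = O_p(1)$,
the remainder satisfies $R = o_p(1)$.

Then $\sqrt{n}(\hat\eta_w - \eta_w)$ is asymptotically Gaussian with mean zero and variance

\[
\lim_{n\to\infty} \operatorname{Var}\left( \frac{1}{\sqrt{n}} \Sigma_X^{-1} (X_w - \bar X_w)^\top (\varepsilon_w - \bar\varepsilon_w) \right) = \sigma^2 \Sigma_X^{-1} \lim_{n\to\infty} \operatorname{Var}(X_w) \Sigma_X^{-1}.
\]

Using the result of part (d), this variance equals $\frac{\sigma^2}{\pi} \Sigma_X^{-1} \Sigma_X \Sigma_X^{-1} = \frac{\sigma^2}{\pi} \Sigma_X^{-1}$ when $w = 1$ and $\frac{\sigma^2}{1-\pi} \Sigma_X^{-1}$ when
$w = 0$.

(f) From Assumption 3, $\bar\varepsilon_1$ is independent of $\bar\varepsilon_0$, with variances $\sigma^2/(n\pi)$ and $\sigma^2/(n(1-\pi))$, respectively.
A standard central limit theorem shows that $\sqrt{n}(\bar\varepsilon_1 - \bar\varepsilon_0)$ is asymptotically Gaussian with mean 0 and
variance

\[
\frac{\sigma^2}{\pi} + \frac{\sigma^2}{1-\pi} = \frac{\sigma^2}{\pi(1-\pi)}.
\]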
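
The limiting variance in part (f) can be illustrated with a small Monte Carlo experiment. This is a sketch under assumptions of its own: independent Gaussian errors with variance $\sigma^2$, and independent Bernoulli($\pi$) assignment. The function name and default values are hypothetical. At a finite sample size the simulated variance is expected to sit slightly above the limit, because the group sizes are random.

```python
import math
import random


def simulate_scaled_diff_variance(n=200, pi=0.3, sigma=1.0, reps=3000, seed=2):
    """Monte Carlo variance of sqrt(n) * (eps_bar_1 - eps_bar_0) and its limit."""
    rng = random.Random(seed)
    draws = []
    for _rep in range(reps):
        sums = [0.0, 0.0]
        counts = [0, 0]
        for _i in range(n):
            w = 1 if rng.random() < pi else 0  # Bernoulli(pi) treatment assignment
            sums[w] += rng.gauss(0.0, sigma)
            counts[w] += 1
        if counts[0] == 0 or counts[1] == 0:
            continue  # a group is empty, so the difference in means is undefined
        diff = sums[1] / counts[1] - sums[0] / counts[0]
        draws.append(math.sqrt(n) * diff)
    mean = sum(draws) / len(draws)
    empirical = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
    limit = sigma ** 2 / (pi * (1 - pi))  # sigma^2/pi + sigma^2/(1 - pi)
    return empirical, limit


if __name__ == "__main__":
    print(simulate_scaled_diff_variance())
```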

We now prove the main theorem.


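Proof. We characterize the treatment effect estimator as

\begin{align*}
\hat\tau - \tau &= \bar y_1 - \bar y_0 + (\omega_1 - \bar X_1)^\top \hat\eta_1 - (\omega_0 - \bar X_0)^\top \hat\eta_0 - (\alpha_1 - \alpha_0) - (\omega_1^\top \eta_1 - \omega_0^\top \eta_0) \\
&= \bar\varepsilon_1 - \bar\varepsilon_0 + (\omega_1 - \bar X_1)^\top (\hat\eta_1 - \eta_1) - (\omega_0 - \bar X_0)^\top (\hat\eta_0 - \eta_0),
\end{align*}

which implies that

\[
\sqrt{n}(\hat\tau - \tau) = \sqrt{n}(\bar\varepsilon_1 - \bar\varepsilon_0) + \sqrt{n}(\omega_1 - \bar X_1)^\top (\hat\eta_1 - \eta_1) - \sqrt{n}(\omega_0 - \bar X_0)^\top (\hat\eta_0 - \eta_0).
\]

Now,

\[
\sqrt{n}(\omega_w - \bar X_w)^\top (\hat\eta_w - \eta_w) = \sqrt{n}(\omega_w - \mu_X)^\top (\hat\eta_w - \eta_w) + \sqrt{n}(\mu_X - \bar X_w)^\top (\hat\eta_w - \eta_w),
\]

for $w = 0, 1$, where the second term is $o_p(1)$ since $\bar X_w \xrightarrow{p} \mu_X$ and $\hat\eta_w \xrightarrow{p} \eta_w$ following from parts (a) and (c)
of Lemma 1. Therefore,

\[
\sqrt{n}(\hat\tau - \tau) = \sqrt{n}(\bar\varepsilon_1 - \bar\varepsilon_0) + \sqrt{n}(\omega_1 - \mu_X)^\top (\hat\eta_1 - \eta_1) - \sqrt{n}(\omega_0 - \mu_X)^\top (\hat\eta_0 - \eta_0) + o_p(1).
\]

The three terms are uncorrelated, with

\begin{align*}
\sqrt{n}(\bar\varepsilon_1 - \bar\varepsilon_0) &\Rightarrow N\left(0, \frac{\sigma^2}{\pi(1-\pi)}\right), \\
\sqrt{n}(\omega_1 - \mu_X)^\top (\hat\eta_1 - \eta_1) &\Rightarrow N\left(0, \frac{\sigma^2}{\pi} \|\omega_1 - \mu_X\|^2_{\Sigma_X^{-1}}\right), \\
\sqrt{n}(\omega_0 - \mu_X)^\top (\hat\eta_0 - \eta_0) &\Rightarrow N\left(0, \frac{\sigma^2}{1-\pi} \|\omega_0 - \mu_X\|^2_{\Sigma_X^{-1}}\right),
\end{align*}

established in parts (e) and (f) of Lemma 1. Combining the terms produces the variance expression in
equation (18), and completes the proof.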
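
Because the three terms are uncorrelated, their limiting variances add. The sketch below computes that sum for user-supplied inputs. It is an illustration only: the function names, the nested-list matrix representation, and the example values are hypothetical, and equation (18) itself is not reproduced here, so the code simply adds the three variances listed in the proof.

```python
def quad_form(v, m):
    """Return v' M v for a vector v and a square matrix M given as nested lists."""
    size = len(v)
    return sum(v[i] * m[i][j] * v[j] for i in range(size) for j in range(size))


def limiting_variance(sigma2, pi, omega1, omega0, mu_x, sigma_x_inv):
    """Sum of the three limiting variances listed in the proof of Theorem 2."""
    d1 = [a - b for a, b in zip(omega1, mu_x)]  # omega_1 - mu_X
    d0 = [a - b for a, b in zip(omega0, mu_x)]  # omega_0 - mu_X
    noise = sigma2 / (pi * (1 - pi))
    adj1 = sigma2 / pi * quad_form(d1, sigma_x_inv)
    adj0 = sigma2 / (1 - pi) * quad_form(d0, sigma_x_inv)
    return noise + adj1 + adj0


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(limiting_variance(1.0, 0.5, [1.0, 0.0], [0.0, 0.0], [0.0, 0.0], identity))
```

The quadratic forms use the inverse covariance matrix, so a shift of $\omega_w$ away from $\mu_X$ inflates the variance more when it lies along directions in which the covariates vary little.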

A.4 Proof of Corollary 1
Proof. If $X_i$ is independent of $W_{-i}$, then

\[
\omega_0 = \frac{1}{n} \sum_{i=1}^n E[X_i \mid W_{-i} = 0] = \frac{1}{n} \sum_{i=1}^n E[X_i],
\]

and so is equal to $\mu_X$ in the limit (with the understanding that $\omega_0$ is actually a sequence associated with
each finite population). The same holds true for $\omega_1$. Then the result follows immediately from equation (18),
as the second and third terms are equal to zero.
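
As a short worked version of this last step, assuming (as the proof of Theorem 2 indicates) that the limiting variance is the sum of the noise term and the two quadratic-form terms:

```latex
\[
\omega_0 = \omega_1 = \mu_X
\;\Longrightarrow\;
\|\omega_1 - \mu_X\|^2_{\Sigma_X^{-1}} = \|\omega_0 - \mu_X\|^2_{\Sigma_X^{-1}} = 0,
\]
\[
\frac{\sigma^2}{\pi(1-\pi)} + \frac{\sigma^2}{\pi} \cdot 0 + \frac{\sigma^2}{1-\pi} \cdot 0
= \frac{\sigma^2}{\pi(1-\pi)}.
\]
```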
