
Difference-in-Differences With Interference: Ruonan Xu


Difference-in-Differences with Interference

arXiv:2306.12003v5 [econ.EM] 30 May 2024

Ruonan Xu∗

Abstract
In many scenarios, such as the evaluation of place-based policies, poten-
tial outcomes are not only dependent upon the unit’s own treatment but also
its neighbors’ treatment. Despite this, “difference-in-differences” (DID) type
estimators typically ignore such interference among neighbors. I show in this
paper that the canonical DID estimators generally fail to identify interesting
causal effects in the presence of neighborhood interference. To incorporate in-
terference structure into DID estimation, I propose doubly robust estimators
for the direct average treatment effect on the treated as well as the average
spillover effects under a modified parallel trends assumption. The approach in
this paper relaxes common restrictions in the literature, such as partial inter-
ference and correctly specified spillover functions. Moreover, robust inference
is discussed based on the asymptotic distribution of the proposed estimators.

Key words: Difference-in-differences, interference, spillover, doubly-robust,
spatial correlation, finite population
JEL codes: C10, C21, C23

∗ ruonan.xu@rutgers.edu, Rutgers University

1 Introduction
According to the stable unit treatment value assumption (SUTVA), potential outcomes
depend only on one’s own treatment assignment. In many cases, SUTVA fails
due to an unknown interference structure among neighbors. In the fields of environ-
mental economics, urban economics, criminal justice, and many other fields of social
sciences, place-based policies often generate spillover effects. One example is the minimum
wage increase in Seattle studied by Jardim, Long, Plotnick, van Inwegen, Vigdor, and Wething
(2022). Through the channels of competition in the regional labor market for work-
ers and the possibility of relocation of businesses, they find that significant spillover
effects on wages and hours are seen up to a 40-minute drive from Seattle city limits.
When spillover effects are of interest, one often needs to observe the entire pop-
ulation. For example, we can typically collect information about all counties in
the United States. In the example above, Jardim et al. (2022) use administrative
employment records in the state of Washington. If we take sampling from the super-
population/infinite population approach literally, what we are estimating turns out
to be the spillover effect in a researcher’s sample instead of in the population from
which the sample is drawn unless interactions are restricted within clusters of friends
or household members.
In this paper, I study the “difference-in-differences” (DID) type estimators that
allow interference from a finite population perspective, where inference is condi-
tional on covariates and the whole population is observed. This approach is clos-
est to the conditional inference discussed by Abadie, Imbens, and Zheng (2014) and
Jin and Rothenhäusler (2024) without reference to a superpopulation of covariates.
Conditional treatment effect parameters have also been mentioned in Abadie and Imbens
(2002), Imbens (2004), and Balzer, Petersen, and van der Laan (2015). Recently,
Viviano (2024) adopts the same inference framework when studying optimal treat-
ment allocation under network interference. The conditional inference approach
adopted here allows arbitrary spatial correlation and nonstationarity of covariates.
Meanwhile, stochastic potential outcomes allow the possibility of modeling the con-
ditional means of the outcome variables. Consequently, the proposed estimators
below are more robust to model specification with a straightforward causal interpre-
tation. One could legitimately argue that researchers should not stick to a single
inference framework. That said, many attribute variables containing locational
information and neighborhood characteristics, such as landlocked status, are deemed
non-stochastic for spatial data. I therefore consider the current approach a natural
starting point for studying population interference/spillover effects.
One challenge of incorporating interference is the modeling of spillover functions.
There is no guarantee that the spillover function will be specified correctly since
the true interference structure is rarely known. It is common practice to come up
with a functional form of the spillover pattern within a fixed neighborhood bound-
ary. One leading example is to specify exposure as the average value of treatment
statuses of neighbors within a distance d of a unit i. (Exposure mapping is a “func-
tion that maps an assignment vector and unit-specific traits to an exposure value”;
see Aronow and Samii (2017) for detailed definition.) This particular functional
specification might be chosen because of its straightforward presentation and policy
relevance. That said, potentially misspecified exposure mapping prevents the ap-
plication of standard causal inference techniques, especially in the context of DID,
where there is not much guidance on how to incorporate interference to begin with.
To overcome misspecification, I consider the expected direct treatment effect at
certain neighborhood exposure levels as the parameter of interest inspired by Sävje
(2024). The causal estimands I define coincide with the exact direct treatment effect
when the exposure mapping is correctly specified and remain well-defined even under
misspecification. The proposed estimators are consistent for the estimands regardless
of the correct specification of the exposure mapping.
Namely, practitioners can still use the spillover function they choose based on
their domain/institutional knowledge. Nevertheless, the estimands defined and the
estimators proposed in the current paper provide robustness with respect to misspec-
ification of the chosen spillover function. Given the cutoff of the spillover effect is
typically unknown, the chosen neighborhood boundary can also differ from the true
one. Applying the device of approximate neighborhood interference (ANI) in Leung
(2022) to spatial data, the only requirement of the data generating process is that
treatments assigned to units further from i have a smaller, but possibly nonzero, ef-
fect on i’s response. In addition, the assignment variables are allowed to be spatially
correlated as is often the case in practice with spatial data.
Putting all the pieces together, I propose doubly robust estimators for the direct
treatment effect and spillover effect. The proposed doubly robust estimator is a
modified version of the augmented inverse probability weighting (AIPW) estimator,
which only requires correct specification of either the propensity scores of treat-
ment/exposure or the conditional mean of the outcomes. The conditional inference
approach in the current paper leads to a different variance-covariance matrix which
may require a new variance estimator when necessary.
Besides the main contribution above, there are two other sets of results worth
mentioning. In Section 3.1 below and Section B in the online appendix, I study
the identification of canonical DID estimators available in the literature. I provide
conditions under which canonical estimators can still identify meaningful causal es-
timands. This discussion alone would be of interest to practitioners. In Section C
of the online appendix, I clarify what toolkit practitioners can use by comparing
various dimension reduction approaches in the interference literature. I also show
how these dimension reduction approaches relate to the ANI device. The proposed
doubly robust estimators are applied in Section 6 to study the policy effect of special
economic zones (SEZ) in China. Section D of the online appendix summarizes the
detailed steps for direct effect estimation.
Literature: Most of the methodological literature studies spillover effects in a
single cross section of experimental data and assumes partial interference or limits in-
terference to immediate neighbors. Additionally, they assume that the function of de-
pendence on neighbors’ treatments is known and correctly specified. See, for instance,
Hudgens and Halloran (2008) and Aronow and Samii (2017). Delgado and Florax
(2015), Clarke (2017), and Butts (2021) allow interference in DID estimation in a
two-way fixed effects (TWFE) estimating equation (often without covariates) from
a superpopulation perspective. Huber and Steinmayr (2021) also propose a DID ap-
proach to estimate spillover effect and total effect within a superpopulation frame-
work.1 All four papers mentioned above share some or all restrictions of the general
interference literature. Design-based DID estimation has been studied by Arkhangelsky, Imbens, Lei, an
(2021), Athey and Imbens (2022), and Rambachan and Roth (2022), but they keep
the SUTVA. Sant’Anna and Zhao (2020) have proposed AIPW estimators in the
DID context, maintaining SUTVA and the superpopulation perspective.

1 Their potential outcomes are defined as functions of individual and regional treatments, where
individual treatment status is a function of the regional treatment. Therefore, Huber and Steinmayr
(2021) is more applicable to studies of local equilibrium effects.

2 Setup
2.1 Environment
I start with the relatively simple setting of panel data with two time periods;
$t = 1, 2$ stands for the time period before and after treatment, respectively. Consider a
sequence of lattices of (possibly) unevenly placed locations in $\mathbb{R}^d$,
$\{D_M \subseteq \mathbb{R}^d, d \ge 1\}$, where $M$ indexes the sequence of finite populations.
Because I consider the case where the sample coincides with the population for spatial data,
I let the population size $|D_M|$ diverge to infinity in deriving the asymptotic properties,
where $|V|$ denotes the cardinality of a finite subset $V \subseteq D_M$.
I briefly summarize the notation used throughout the paper. I adopt the metric
$\rho(i, j) = \max_{1 \le l \le d} |j_l - i_l|$ in space $\mathbb{R}^d$, where $i_l$ is the
$l$-th component of $i$. The distance between any subsets $K, V \subseteq D_M$ is defined as
$\rho(K, V) = \inf\{\rho(i, j) : i \in K \text{ and } j \in V\}$. For any random vector $X$,
$\|X\|_p = (E\|X\|^p)^{1/p}$, $p \ge 1$, denotes its $L_p$-norm. Lastly, $C$ denotes a generic
positive constant that may vary under different circumstances.
For each unit $i$ in the population, there is a stochastic assignment variable
$W_i \in \{0, 1\}$, a vector of fixed attributes $z_i = (z_i^{ind}, z_i^{neigh})$ that possibly
includes attributes of $i$’s neighborhood $z_i^{neigh}$ in addition to individual characteristics
$z_i^{ind}$, and a vector of stochastic unobservables $U_{it}$. The potential outcome function
for any $i \in D_M$ is defined as
$h_{it}(\cdot) : \{0, 1\}^{|D_M|} \times \mathbb{R}^{\dim(z_i)} \times \mathbb{R}^{\dim(U_{it})} \to \mathbb{R}$.
I emphasize the treatment vector of the entire population by denoting the potential outcomes as
$y_{it}(w_i, w_{-i}) = h_{it}(w_i, w_{-i}, z_i, U_{it})$, where
$w_{-i} = \{w_j, j \in D_M, j \ne i\}$.2 The dependence of the potential outcomes on the fixed
attributes and stochastic unobservables is indicated by its $i, t$ subscript. The realized
potential outcomes are denoted by $Y_{it} = y_{it}(W)$. Notice that
$(W, z, Y, U) = \{(W_i, z_i, Y_{it}(\cdot), U_{it}), i \in D_M, M \ge 1\}$ are triangular
arrays of random fields defined on a probability space $(\Omega, \mathcal{F}, P)$. Exposure
mapping is defined by the function $G_i = G(i, W_{-i}) \in \mathcal{G}$, where $\mathcal{G}$ is
a discrete set. Therefore, $G(\cdot)$ maps the treatment status of all units except $i$ to an
exposure value.

2 As in Manski (2013), the potential outcome function defined here can be considered as the
response function, namely the reduced form of structural equations where the structural potential
outcome may depend on other units’ treatments as well as outcomes.

The construction of the G(·) function deserves some explanation. In an ideal
scenario, empirical researchers would like to come up with a functional form of
G(·) that captures actual interactions among neighbors as well as conveys clear
causal explanations. Because of the unknown interference structure and the high
dimensional treatment assignment vector of the entire population, choosing a G(·)
that achieves both goals is challenging.3 For instance, a leading choice is
$G_i = \sum_{j \in D_M, j \ne i} A_{ij} W_j / \sum_{j \in D_M, j \ne i} A_{ij}$, where
$A_{ij} = 1$ if the distance between units $i$ and $j$ is within a certain cutoff $d$.
Besides the somewhat arbitrary cutoff $d$, the impact of
i’s neighbors may not be exchangeable in reality, e.g., unit l may have greater influ-
ence on i compared to unit m. That said, the specification above might still capture
meaningful policy effects. Consequently, using domain knowledge in the context of
each specific empirical question, coming up with a G(·) that summarizes interesting
and relevant policy implications for both direct treatment and spillover effects might
be a good starting point. In this paper, I show how to proceed with DID estimation
with a chosen G(·) function that is potentially misspecified. For more discussion on
some common choices of G(·) and how they can be accommodated in the current
framework, I refer readers to Section C in the online appendix.
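
As a minimal sketch of the leading choice above (the share of treated neighbors within a cutoff, using the sup-norm metric $\rho$ from the notation paragraph), the following Python code may help fix ideas. The function names, the toy coordinates, and the convention of returning NaN for units with no neighbors are assumptions made for illustration; they are not part of the paper.

```python
import numpy as np

def chebyshev_distances(coords):
    """Pairwise sup-norm distances rho(i, j) = max_l |j_l - i_l|."""
    diff = np.abs(coords[:, None, :] - coords[None, :, :])
    return diff.max(axis=2)

def average_neighbor_exposure(coords, w, cutoff):
    """G_i = sum_{j != i} A_ij W_j / sum_{j != i} A_ij, A_ij = 1{rho(i, j) <= cutoff}.

    Assumption for this sketch: units with no neighbor inside the cutoff
    get NaN, because the ratio is undefined for them.
    """
    dist = chebyshev_distances(np.asarray(coords, dtype=float))
    adj = (dist <= cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)  # exclude unit i itself (j != i)
    n_neighbors = adj.sum(axis=1)
    treated_neighbors = adj @ np.asarray(w, dtype=float)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(n_neighbors > 0, treated_neighbors / n_neighbors, np.nan)

# Toy population: three nearby units and one isolated unit (hypothetical data).
coords = np.array([[0, 0], [1, 0], [0, 1], [5, 5]])
w = np.array([1, 0, 1, 1])
g = average_neighbor_exposure(coords, w, cutoff=1)
```

In this toy example the first three units are mutual neighbors, so their exposure is the treated share among the other two, while the isolated unit has no defined exposure. Any other specification of $G(\cdot)$ could be substituted without changing the rest of the workflow.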
In line with common practices, empirical researchers can construct the exposure
mapping G(·) in the following manner, which might not necessarily coincide with the
true interference structure. Given a fixed K, define the K-neighborhood of the unit
i as
\[ N(i, K) = \{j \in D_M : \rho(i, j) \le K, j \ne i\}. \]
Let $w_{N(i,K)} = (w_j : j \in N(i, K))$ be the treatment vector of units within $i$’s
$K$-neighborhood. There exists $K < \infty$ such that for all $w_{-i}$ and $w'_{-i}$ such that
$w_{N(i,K)} = w'_{N(i,K)}$, $G(i, w_{-i}) = G(i, w'_{-i})$. As a result, the specified exposure
mapping function restricts spillover effects within the immediate $K$-neighborhood of each
unit. Having said that, the actual potential outcome function does not restrict the
interference structure. Treatments of units outside of $i$’s $K$-neighborhood can
legitimately influence $i$’s potential outcome as long as treatments assigned to units
further from $i$ have a smaller, but possibly nonzero, effect on $i$’s response. Section 4
below describes the related assumptions. This way, the exposure mapping function
is allowed to be misspecified. The $G(\cdot)$ function is allowed to be multidimensional,
in which case $K$ would be the largest distance over which interference is allowed
under the specification across the fixed dimensions of $G(\cdot)$.

3 Manski (2013) and Basse and Airoldi (2018) formally point out that there exist no consistent
treatment effect estimators under arbitrary interference. It is therefore necessary to make dimension
reduction assumptions about the interference structure in order to identify meaningful treatment
effect parameters.

2.2 Estimands of Interest


This paper is interested in the expected finite population average, i.e., the average
of the expected potential outcome across all units in the finite population. In other
words, I focus on conditional inference given fixed attributes zi ; see Abadie et al.
(2014) and Jin and Rothenhäusler (2024) for detailed discussion of conditional pa-
rameters and conditional inference.
There are two types of estimands of interest. For the main part of the paper, I
will focus on the first type, the direct treatment effect. There is more than one way
to define the parameter of interest. As an analogy to the expected average treatment
effect in Sävje, Aronow, and Hudgens (2021), the overall direct effect can be defined
as
\[ \tau = \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ y_{i2}(1, W_{-i}) - y_{i2}(0, W_{-i}) \mid W_i = 1, z_i \big], \]

which marginalizes over the treatment assignment vector. The overall direct effect is a
natural extension of the average treatment effect on the treated (ATT) as it coincides
with the ATT when units do not interfere (yi2 (Wi , W−i ) reduces to yi2 (Wi )).
Oftentimes, in addition to a summary of the direct effects, researchers are
also interested in the direct effect at different exposure levels. The overall direct
effect is closely related to the expected direct average treatment effect on the treated
(EDATT) at exposure levels $g \in \mathcal{G}$ defined in equation (1) below.4

\[ \tau(g) = \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ y_{i2}(1, W_{-i}) - y_{i2}(0, W_{-i}) \mid W_i = 1, G_i = g, z_i \big] \tag{1} \]

Without interference, $\tau(g)$ becomes a conditional ATT with $G_i = g$ serving as another
characteristic of unit $i$. I focus on the identification and estimation of EDATT
as the estimation of the overall direct effect naturally follows using weighted
averages.5 Also, by contrasting different exposure levels, the definition of EDATT
facilitates the discussion of the second estimand, the spillover effect. In the interest
of space, this is relegated to Section A of the online appendix.
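
To see why weighted averages connect the two estimands, the law of iterated expectations can be applied at the unit level. The display below is only a sketch: the shorthand $\Delta_i$ is introduced here for illustration, and the exact relationship used in the paper is the one in equation (G.1) of the online appendix, which is not reproduced here.

```latex
% Shorthand (an assumption of this sketch): the unit-level direct effect
\Delta_i = y_{i2}(1, W_{-i}) - y_{i2}(0, W_{-i}).

% Law of iterated expectations over the discrete exposure values g
E[\Delta_i \mid W_i = 1, z_i]
  = \sum_{g \in \mathcal{G}} E[\Delta_i \mid W_i = 1, G_i = g, z_i]\,
    P(G_i = g \mid W_i = 1, z_i).

% Averaging over the finite population gives the overall direct effect
\tau = \frac{1}{|D_M|} \sum_{i \in D_M} \sum_{g \in \mathcal{G}}
       E[\Delta_i \mid W_i = 1, G_i = g, z_i]\,
       P(G_i = g \mid W_i = 1, z_i).
```

The weights are unit specific, so $\tau$ is a simple weighted average of the $\tau(g)$ only when the conditional exposure probabilities do not vary across units.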
The all-or-nothing effect,
$\frac{1}{|D_M|} \sum_{i \in D_M} E[y_{i2}(\mathbf{1}) - y_{i2}(\mathbf{0}) \mid z_i]$, where
$\mathbf{1}$ and $\mathbf{0}$ are vectors of ones and zeros, cannot be consistently estimated
(Basse and Airoldi (2018)). Instead, direct effects and spillover effects summarize different
aspects of policy effects. The EDATT I consider in the main text captures the direct treatment
effect at different exposure levels. In a vaccination example, if the direct effect of
vaccinating an additional individual is almost zero given that a certain fraction of the
population is already vaccinated, this can serve as an indicator of herd immunity
being achieved.
The key ingredient of the definition is the expected potential outcome at exposure
level $g$,
\[
E[y_{i2}(1, W_{-i}) \mid W_i = 1, G_i = g, z_i]
= \sum_{w_{-i} \in \{0,1\}^{|D_M|-1}} E[y_{i2}(1, w_{-i}) \mid W_i = 1, W_{-i} = w_{-i}, z_i] \, P(W_{-i} = w_{-i} \mid G_i = g, W_i = 1, z_i),
\]
where the expectation is taken over all possible realizations of $W_{-i}$ given the specified
exposure mapping $G(i, W_{-i}) = g$. The definition of the expected potential outcome
is different from what is initially proposed in Sävje (2024), in which the potential
outcome is fixed in an experimental setting and the expectation is with respect to
the assignment variables only. Not only do I split the entire treatment vector into
$w_i$ and $w_{-i}$, but the stochastic nature of the potential outcomes also needs to be
taken into account. The randomness of the potential outcome function poses a
challenge to the causal interpretation of the spillover effect estimand. Section A in the
online appendix provides more detailed reasoning.

4 Their relationship is explained in equation (G.1) in Section G of the online appendix.
5 As shown in Section 3.1, the canonical DID estimand, which ignores interference, fails to identify
$\tau$ for spatially correlated assignments.

In terms of the interpretation of EDATT, if the spillover effect and the direct
effect are additively separable, we can identify the exact direct ATT even if the
spillover function is misspecified. Without additivity, we can still identify the direct
ATT that would realize in expectation at the specified exposure level.
Under correct specification of the exposure mapping,

\[ \tau(g) = \tau^*(g) = \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ \tilde{y}_{i2}(1, g) - \tilde{y}_{i2}(0, g) \mid W_i = 1, G_i = g, z_i \big], \]

where ỹi2 (wi , g) = yi2 (wi , w−i ). What we identify in this case is exactly the direct
ATT at exposure level g.

2.3 A Motivating Example


I use the empirical application in Section 6 below to illustrate the relevant variables
and estimands. In addition, it serves as a numerical example of the difference between
DID estimates with or without taking spillover effects into account. The data I use
come from Lu, Wang, and Zhu (2019), who study how China’s SEZ policy impacts
various outcomes Yit , such as the logarithm of firm output at the village level. If
village i is located within the boundaries of a SEZ, direct treatment variable Wi is
equal to one and otherwise, it is equal to zero. Exposure mapping Gi is a binary
variable equal to one if the leave-one-out ratio of SEZ villages to the total number
of villages in the county c in which village i is located is greater than the mean ratio
among all counties. In this case, the EDATT captures the direct effect of establishing
a SEZ in village i given the fraction of SEZ villages within a county. It is possible that
the expected direct effect is lower when there is a higher proportion of neighboring
SEZ villages. This can help determine whether establishing an additional SEZ is
cost-effective.
On the basis of the SEZ dataset, I simulate how the usual DID estimates ignoring
interference differ from the overall direct effect estimates. The logarithm of
output is first regressed on direct treatment assignment status Wi , exposure mapping
Gi defined above, as well as a list of covariates used in Lu et al. (2019) and their
interactions with Wi . The new output variables are generated as the sum of the
fitted values from the regression above and random draws from a standard normal
distribution. For each set of generated outcomes, however, I scale the coefficient on Gi
by a factor ranging from −6 to 6 in increments of one. In terms of magnitude,
the largest scaled coefficient on Gi is nearly half as large as the coefficient on Wi .

[Figure 1 plot omitted. Horizontal axis: scalar of the coefficient on G, from -6 to 6; vertical axis from 0.1 to 0.9.]

Figure 1: Canonic DID Estimates and Overall Direct Effect Estimates

In Figure 1, I plot the estimates based on the usual DID method, $\hat{\tau}_{canonic}$, without
considering potential interference. In addition, I report the overall direct effect
estimates, $\hat{\tau}$.6 Given the model specification, $\hat{\tau}$ remains the same under
different scalings. Based on 100 simulations for each scaling factor, the mean of the generated
log output ranges between 9.46 and 9.96, which is close to the mean of the
actual log output of 9.75.

6 $\hat{\tau}$ is calculated as a weighted average of the estimates of $\tau(1)$ and $\tau(0)$.
The estimator for $\tau(g)$ is presented in Section 4 below.

In the actual dataset, τ̂canonic is 29.4% larger than τ̂ . A simple simulation study
shows that the average of τ̂canonic can be either 74.4% smaller or 116.0% larger than
the average of τ̂ in accordance with the magnitude of the spillover effect. By applying
larger scaling factors, it is possible to make the differences even more pronounced. As
shown in Remark 1 below, τcanonic is not the so-called total effect either. Therefore,
the usual DID estimates can be misleading for direct causal effects when there is
potential interference among units.

3 Identification
The first question when relaxing SUTVA is what the canonical DID estimator identi-
fies if spillover effects are incorrectly ignored. Namely, will the canonical DID estima-
tor still consistently estimate ATT in the presence of interference? Forastiere, Airoldi, and Mealli
(2021) discuss bias of the difference-in-means estimator when SUTVA is wrongly as-
sumed in observational studies on networks. To my knowledge, the literature has not
yet investigated DID type estimators. To facilitate the discussion of identification, I
impose the following assumptions.

Assumption 1. (Overlap) For all $i \in D_M$, there exists $\epsilon > 0$ such that
$\epsilon < p(z_i) < 1 - \epsilon$, $\pi_{1g}(z_i) > \epsilon$, and $\pi_{0g}(z_i) > \epsilon$, where

\[ p(z_i) = P(W_i = 1 \mid z_i), \tag{2} \]

\[ \pi_{1g}(z_i) = P(G_i = g \mid W_i = 1, z_i), \tag{3} \]

and
\[ \pi_{0g}(z_i) = P(G_i = g \mid W_i = 0, z_i). \tag{4} \]

To simplify notation, I assume that the overlap assumption applies to every unit
in the population. With certain exposure mapping specifications, this might not be
plausible. An easy fix is to change the estimand by averaging over the subpopulation
where Gi can take on the value g. Failure to satisfy the overlap condition for p(zi ) is
trickier. If one is willing to move the goalpost by redefining the population, one can
drop units that always or never take treatment. The good news is that for the rede-
fined population, we can still observe the treatment assignment vector of the original
population since the treatment status of the dropped units is fixed and known. This
way, dropping the always or never takers will not affect the exposure mapping. On
the other hand, to deal with weak overlap conditions in practice without changing
the population or estimand, one can consider approaches proposed by Ma and Wang
(2020) and Man, Sant’Anna, Sasaki, and Ura (2023) to trim propensity scores and
correct the resulting bias simultaneously.
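
A minimal sketch of how the overlap condition in Assumption 1 might be checked unit by unit is given below. It assumes that values (or estimates) of $p(z_i)$, $\pi_{1g}(z_i)$, and $\pi_{0g}(z_i)$ are already available as arrays; the function name `overlap_mask`, the threshold, and the numbers are hypothetical. It only flags units and does not implement the trimming and bias-correction procedures cited above.

```python
import numpy as np

def overlap_mask(p, pi1g, pi0g, eps):
    """Flag units with eps < p < 1 - eps, pi1g > eps, and pi0g > eps."""
    p, pi1g, pi0g = (np.asarray(a, dtype=float) for a in (p, pi1g, pi0g))
    return (p > eps) & (p < 1 - eps) & (pi1g > eps) & (pi0g > eps)

# Hypothetical propensity values for three units at a fixed exposure level g.
p_hat = [0.50, 0.02, 0.70]      # P(W_i = 1 | z_i)
pi1g_hat = [0.40, 0.30, 0.00]   # P(G_i = g | W_i = 1, z_i)
pi0g_hat = [0.30, 0.30, 0.20]   # P(G_i = g | W_i = 0, z_i)
mask = overlap_mask(p_hat, pi1g_hat, pi0g_hat, eps=0.05)
```

Here the second unit fails because its treatment probability is too close to zero, and the third fails because exposure level $g$ is unattainable when it is treated, which corresponds to the case where one averages over the subpopulation that can reach $g$.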

Assumption 2. (No Anticipation)

\[ y_{i1}(w_i, w_{-i}) = y_{i1}(0, \mathbf{0}) \]

Assumption 2 requires that the potential outcome in the first time period prior to
treatment is always equal to the potential outcome without treatment or spillover.
The no-anticipation assumption is quite standard in the literature and is sometimes
imposed implicitly.
If we focus on the case of correctly specified exposure mapping, we could impose
the following parallel trends assumption: for any $g^* \in \mathcal{G}^*$ and all $i$,
\[
E[\tilde{y}_{i2}(0, g^*) \mid W_i = 1, G_i^* = g^*, z_i] - E[y_{i1}(0, \mathbf{0}) \mid W_i = 1, G_i^* = g^*, z_i]
= E[\tilde{y}_{i2}(0, g^*) \mid W_i = 0, G_i^* = g^*, z_i] - E[y_{i1}(0, \mathbf{0}) \mid W_i = 0, G_i^* = g^*, z_i], \tag{5}
\]
where $G_i^* = G^*(i, W_{-i})$ stands for the unknown true exposure mapping and $\mathcal{G}^*$ is the
set of discrete values that $G^*$ can take. Notice that in the parallel trends assumption,
as no one is treated at t = 1 there is no spillover in the potential outcome function in
the first time period. Namely, moving from the first to the second time period, in the
absence of direct treatment the conditional mean of the potential outcomes for the
treated and the untreated with the same level of exposure in the second time period
follows the same trend. Equation (5) serves as our starting point for identification.

In order to provide a general framework for identifying EDATT, I impose the
following assumption instead.

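Assumption 3. (Parallel Trends)

\[
\frac{1}{|D_M|} \sum_{i \in D_M} \Big\{ E[y_{i2}(0, W_{-i}) \mid W_i = 1, G_i = g, z_i] - E[y_{i1}(0, \mathbf{0}) \mid W_i = 1, G_i = g, z_i] \Big\}
= \frac{1}{|D_M|} \sum_{i \in D_M} \Big\{ E[y_{i2}(0, W_{-i}) \mid W_i = 0, G_i = g, z_i] - E[y_{i1}(0, \mathbf{0}) \mid W_i = 0, G_i = g, z_i] \Big\} \tag{6}
\]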

If one removes the outer average and assumes that equality holds for each unit
$i \in D_M$, Assumption 3 becomes the conventional conditional parallel trends assumption.
Notice that equation (5) and Assumption 3 can be linked by the law of iterated expectations,
irrespective of the specified exposure mapping function, if the equality in equation (7) holds.

\[ P(G_i^* = g^* \mid W_i = 1, G_i = g, z_i) = P(G_i^* = g^* \mid W_i = 0, G_i = g, z_i) \tag{7} \]

In this sense, there are three subcases of Assumption 3.

Assumption 3-A. The exposure mapping is correctly specified such that G∗i = Gi .

When G∗i = Gi , equation (7) holds trivially regardless of the spatial correlation
among the assignment variables. This is the standard setting where Assumption 3
reduces to a population average of equation (5) under correct specification of the
exposure mapping. In this case, individual assignment variables can be spatially
correlated because of clustered assignments at a more aggregate geographical level
or simply because of geographical homophily such as the distribution of natural
resources. Spatial correlation between assignment variables can also be induced
by “peer effects.” For instance, individual units take up treatments based on their
neighbors’ adoption. Therefore, the asymptotic theory in Section 4 below allows for
spatial correlation as long as weak dependence holds. When G∗i is correctly specified,
we can identify the exact direct ATT.

Assumption 3-B. Individual assignments are independent such that
$W_i \perp\!\!\!\perp W_j \mid z$, $\forall\, i \ne j$.

This is the case where the exposure mapping G(·) is allowed to be arbitrarily
misspecified. It is empirically relevant because the interference structure is typi-
cally unknown. Nevertheless, the second case rules out clustered/spatially corre-
lated assignments as well as neighbors’ influence on treatment uptake after par-
tialing out individual and neighborhood characteristics. Notice that zi can contain
attributes of $i$’s neighborhood. Recall that $G_i^* = G^*(i, W_{-i})$. It is not hard to see
that equality (7) holds when individual assignment variables are independent conditional
on $z_i = z$. If $G^*(\cdot)$ is a function of the treatment vector of units within $i$’s
$K$-neighborhood, i.e., $G_i^* = G^*(i, W_{N(i,K)})$, $z_i$ can be a sub-vector of $z$, for instance,
$z_i = z_{N(i,K)} = (z_j^{ind} : j \in N(i, K))$.

Assumption 3-C. Exposure mapping is misspecified in the sense that $G_i^* \ne G_i$ and
individual assignments are correlated but equation (7) holds.

Assumption 3-C is the case in between, with both misspecification of the exposure
mapping and spatially correlated assignments, which merits further explanation. A
special case of Assumption 3-C is $G_i^* = f(G_i, \iota_i)$, where
$\iota_i \perp\!\!\!\perp W_i \mid G_i, z_i$ and $f(\cdot)$ is some arbitrary function. For example,
suppose a population is partitioned into clusters. Individual treatment assignments are
correlated (not perfectly correlated) within clusters but independent across clusters. If
partial interference is assumed, practitioners may specify the exposure mapping
$G_i = \sum_{C_j = C_i, j \ne i} W_j$ to be the number of unit $i$’s treated
neighbors within the cluster, where $C_i$ indicates the cluster that unit $i$ belongs to.
Nevertheless, the actual exposure might be $G_i^* = \sum_{j \in N(i,K)} W_j = G_i + \iota_i$, where
$\iota_i = \sum_{j \in N(i,K), C_j \ne C_i} W_j$ is the number of treated units outside of unit
$i$’s cluster but within $i$’s $K$-neighborhood, which is a larger group that contains $i$’s cluster.
Consider the empirical example in Section 6 below. Suppose the designation of
SEZ villages is clustered within counties but independent across counties; and the
exposure is specified as the number of other SEZ villages within the same county.
If the actual exposure is the total number of other SEZ villages in the same county
and bordering counties, Assumption 3-C holds.7
In another example, neighbors’ identities could matter. Suppose units are par-
titioned into clusters consisting of three units. The other two units in i’s clus-
ter could be assigned treatment status (0,1) or (1,0) with the same probability
under the cluster assignment mechanism.8 The exposure mapping is specified as
$G_i = \sum_{C_j = C_i, j \ne i} W_j / 2$. However, the true exposure is the assignment vector of the
two neighbors within the same cluster. Still,
$P(G_i^* = g^* \mid W_i = 1, G_i = 1/2) = P(G_i^* = g^* \mid W_i = 0, G_i = 1/2) = 1/2$
for $g^* = (1, 0)$ or $g^* = (0, 1)$, which are
two different exposures. In an example of information diffusion among villagers, the
identity of the initial information receivers could matter.
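
The three-unit cluster example can be checked by exact enumeration. The sketch below follows the two-step assignment mechanism described in the accompanying footnote (a cluster-specific probability $q_c$, then independent assignment within the cluster), but the particular two-point distribution for $q_c$ is a hypothetical choice made only so that the arithmetic is exact; the function names are likewise illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical two-point distribution for the cluster-specific probability q_c
# (values -> probabilities); any distribution with positive variance would do.
Q_DIST = {Fraction(1, 4): Fraction(1, 2), Fraction(3, 4): Fraction(1, 2)}

def joint_prob(w):
    """P((W_i, W_j, W_k) = w) for one three-unit cluster, mixing over q_c."""
    k = sum(w)
    return sum(pq * q**k * (1 - q)**(len(w) - k) for q, pq in Q_DIST.items())

def true_exposure_given(w_i, g_star):
    """P(G*_i = g_star | W_i = w_i, G_i = 1/2), with G_i = (W_j + W_k) / 2."""
    # G_i = 1/2 means exactly one of the two neighbors is treated.
    event = [nb for nb in product((0, 1), repeat=2) if sum(nb) == 1]
    denom = sum(joint_prob((w_i, *nb)) for nb in event)
    return joint_prob((w_i, *g_star)) / denom

# Conditional probabilities of the two true exposures, by own treatment status.
table = {(w_i, g): true_exposure_given(w_i, g)
         for w_i in (0, 1) for g in ((1, 0), (0, 1))}

# Within-cluster assignments are correlated: P(W_j = 1 | W_i = 1) vs. P(W_j = 1).
p_wi = sum(joint_prob((1, a, b)) for a, b in product((0, 1), repeat=2))
p_wj_given_wi = sum(joint_prob((1, 1, b)) for b in (0, 1)) / p_wi
```

Because the joint probability depends only on the number of treated units in the cluster, the two orderings of the neighbors are equally likely whether or not unit $i$ is treated, so equation (7) holds even though assignments are correlated within the cluster.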

Lemma 3.1. Under equation (5), if any of Assumption 3-A, 3-B, or 3-C is satisfied,
Assumption 3 holds.

Lemma 3.1 summarizes the three cases underlying the general formalization of
the parallel trends assumption in Assumption 3. While the first two cases are easier
to argue, certain data generating processes may lead to the third case. Regardless,
Assumption 3 identifies the EDATT.
There is a growing literature on justification and falsification of the parallel trends
assumption under SUTVA; see, for instance, Ghanem, Sant’Anna, and Wüthrich
(2022) and Roth and Sant’Anna (2023). When parallel trends might be violated,
Rambachan and Roth (2023) present confidence sets for the identified set of treat-
ment effects. The extension of these analyses to parallel trends with interference is
outside the scope of the current paper.
7 The empirical example is used to demonstrate the third case, but it does not necessarily fit into
this specific data generating process.
8 The individual assignment mechanism consists of two steps. In the first step, each cluster has
its own assignment probability $q_c \in (0, 1)$ drawn from a distribution with variance $\sigma^2 > 0$.
In the second step, units within cluster $c$ are assigned to treatment independently according to
the cluster-specific probability $q_c$.

3.1 Canonical DID
The usual ATT under the SUTVA is

\[ \tilde{\tau} = \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ y_{i2}(1) - y_{i2}(0) \mid W_i = 1, z_i \big]. \]

Here, the potential outcomes are determined solely by unit i’s own treatment. Sup-
pose the canonical DID estimator consistently estimates

\[ \tau_{canonic} = \frac{1}{|D_M|} \sum_{i \in D_M} \Big[ E(Y_{i2} - Y_{i1} \mid W_i = 1, z_i) - E(Y_{i2} - Y_{i1} \mid W_i = 0, z_i) \Big]. \]

Examples include the TWFE linear estimating equation in Remark 1 in Sant’Anna and Zhao
(2020) under the additional restrictions of the data generating process therein, as well
as the inverse probability weighting (IPW) estimator in Abadie (2005). If the usual
(conditional) parallel trends assumption holds without interference, τcanonic would be
equivalent to τ̃ .
If SUTVA is violated, τ̃ is not well defined. Also, EDATT is generally determined
by the specified exposure level. As a result, I use the overall direct effect as a
benchmark for comparison.
Suppose that the parallel trends assumption (8) below holds for any g ∈ G, ∀ i.
 
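\[
\begin{aligned}
& E\big[ y_{i2}(0, W_{-i}) \mid W_i = 1, G_i = g, z_i \big] - E\big[ y_{i1}(0, 0) \mid W_i = 1, G_i = g, z_i \big] \\
&\quad = E\big[ y_{i2}(0, W_{-i}) \mid W_i = 0, G_i = g, z_i \big] - E\big[ y_{i1}(0, 0) \mid W_i = 0, G_i = g, z_i \big]
\end{aligned}
\tag{8}
\]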

Using the law of iterated expectations, τ and τcanonic can be decomposed in the
following way:

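\[
\begin{aligned}
\tau = \frac{1}{|D_M|} \sum_{i \in D_M} \bigg\{ & \sum_{g \in G} \Big[ E(Y_{i2} \mid W_i = 1, G_i = g, z_i) - E(Y_{i1} \mid W_i = 1, G_i = g, z_i) \Big] P(G_i = g \mid W_i = 1, z_i) \\
& - \sum_{g \in G} \Big[ E(Y_{i2} \mid W_i = 0, G_i = g, z_i) - E(Y_{i1} \mid W_i = 0, G_i = g, z_i) \Big] P(G_i = g \mid W_i = 1, z_i) \bigg\}
\end{aligned}
\]

\[
\begin{aligned}
\tau_{canonic} = \frac{1}{|D_M|} \sum_{i \in D_M} \bigg\{ & \sum_{g \in G} \Big[ E(Y_{i2} \mid W_i = 1, G_i = g, z_i) - E(Y_{i1} \mid W_i = 1, G_i = g, z_i) \Big] P(G_i = g \mid W_i = 1, z_i) \\
& - \sum_{g \in G} \Big[ E(Y_{i2} \mid W_i = 0, G_i = g, z_i) - E(Y_{i1} \mid W_i = 0, G_i = g, z_i) \Big] P(G_i = g \mid W_i = 0, z_i) \bigg\}
\end{aligned}
\]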

Proposition 1. Suppose equation (8) holds for any g ∈ G, ∀ i. Under Assumption
2, τcanonic ≠ τ unless P (Gi = g|Wi = 0, zi ) = P (Gi = g|Wi = 1, zi ).

Corollary 1. Proposition 1 also holds for τ ∗ (g) under the correct specification of the
exposure mapping.

The equality of the conditional probabilities holds if Gi ⊥⊥ Wi | zi . However,
conditional independence can be easily violated if any of the following is true: (i)
Gi and Wi are linked through covariates not included in zi ; (ii) neighbors’ behavior
affects unit i’s treatment uptake; (iii) similar neighborhood characteristics drive the
assignment mechanism; see Forastiere et al. (2021) for a parallel discussion allowing
interference on networks under unconfoundedness.
As a consequence, the overall direct effect can be either underestimated or over-
estimated by the canonical DID estimators.

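Remark 1. When the exposure G takes the two values zero and one, a simple
calculation gives

\[
\begin{aligned}
\tau_{canonic} = \tau + \frac{1}{|D_M|} \sum_{i \in D_M} \Big[ & \big( E(Y_{i2} \mid W_i = 0, G_i = 1, z_i) - E(Y_{i2} \mid W_i = 0, G_i = 0, z_i) \big) \\
& - \big( E(Y_{i1} \mid W_i = 0, G_i = 1, z_i) - E(Y_{i1} \mid W_i = 0, G_i = 0, z_i) \big) \Big] \\
& \cdot \big( P(G_i = 1 \mid W_i = 1, z_i) - P(G_i = 1 \mid W_i = 0, z_i) \big).
\end{aligned}
\]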

τcanonic cannot be interpreted as a total effect. The terms E(Yi2 |Wi = 0, Gi = 1, zi ) −
E(Yi2 |Wi = 0, Gi = 0, zi ) and E(Yi1 |Wi = 0, Gi = 1, zi ) − E(Yi1 |Wi = 0, Gi = 0, zi )
represent the individual spillover effect defined in the online appendix and the
heterogeneity of the first period outcome associated with Gi , respectively. When the
spillover effect is moderate compared with the direct effect, the difference between
τcanonic and τ can be sizable, as shown in the motivating example in Section 2.3
above.
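
The arithmetic behind Remark 1 can be checked numerically. The following is a minimal sketch, not part of the paper's analysis: the function name `did_decomposition` and all numbers are hypothetical, and the conditional means and exposure probabilities are simply assumed known for a single unit.

```python
def did_decomposition(m, p_g_treated, p_g_control):
    """Return (tau_i, tau_canonic_i) for a single unit.

    m[(t, w, g)] stands for E(Y_it | W_i = w, G_i = g, z_i). The two dicts map
    each exposure level g to P(G_i = g | W_i = 1, z_i) and P(G_i = g | W_i = 0, z_i).
    """
    def trend(w, g):
        return m[(2, w, g)] - m[(1, w, g)]

    levels = list(p_g_treated)
    treated_part = sum(trend(1, g) * p_g_treated[g] for g in levels)
    # tau weights the control trends by the treated units' exposure distribution
    tau = treated_part - sum(trend(0, g) * p_g_treated[g] for g in levels)
    # the canonical DID weights them by the control units' exposure distribution
    tau_canonic = treated_part - sum(trend(0, g) * p_g_control[g] for g in levels)
    return tau, tau_canonic


# Hypothetical conditional means, chosen only for illustration.
m = {
    (1, 0, 0): 1.0, (1, 0, 1): 1.3, (2, 0, 0): 2.0, (2, 0, 1): 2.8,
    (1, 1, 0): 1.0, (1, 1, 1): 1.3, (2, 1, 0): 3.0, (2, 1, 1): 3.8,
}
p_treated = {0: 0.3, 1: 0.7}   # P(G_i = g | W_i = 1, z_i)
p_control = {0: 0.6, 1: 0.4}   # P(G_i = g | W_i = 0, z_i)

tau, tau_canonic = did_decomposition(m, p_treated, p_control)

# Remark 1: (period-2 spillover contrast - period-1 heterogeneity) * probability gap
spillover = m[(2, 0, 1)] - m[(2, 0, 0)]
heterogeneity = m[(1, 0, 1)] - m[(1, 0, 0)]
gap = (spillover - heterogeneity) * (p_treated[1] - p_control[1])
print(tau, tau_canonic, gap)
```

With these made-up inputs the gap between the two estimands is about 0.15 on a direct effect of 1, and it vanishes once the two exposure distributions coincide.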

In the empirical literature, researchers often augment the TWFE DID estimating
equation with another binary indicator of spillover. In the interest of space, I examine
the identification of the modified TWFE estimating equations in Section B of the
online appendix.

3.2 Doubly Robust Estimand


Since ignoring the spillover effect is only harmless under special scenarios, we need to
propose new estimators for the EDATT. Under parallel trends and overlap assumptions,
the EDATT can be identified by inverse weighting using propensity scores.

\[
\begin{aligned}
\tau(g) &= \frac{1}{|D_M|} \sum_{i \in D_M} E\left[ \frac{W_i - p(z_i)}{p(z_i)(1 - p(z_i))} \, \frac{1\{G_i = g\}}{W_i \pi_{1g}(z_i) + (1 - W_i) \pi_{0g}(z_i)} \, (Y_{i2} - Y_{i1}) \,\middle|\, z_i \right] \\
&= E_D\left[ \frac{W_i - p(z_i)}{p(z_i)(1 - p(z_i))} \, \frac{1\{G_i = g\}}{W_i \pi_{1g}(z_i) + (1 - W_i) \pi_{0g}(z_i)} \, (Y_{i2} - Y_{i1}) \right]
\end{aligned}
\tag{9}
\]

To simplify notation, I use ED to denote the finite population average conditional on
the attributes z from now on. Without the G indicator and the additional propensity
scores for spillover, the IPW-DID estimand is the same as the estimand proposed in
Abadie (2005).
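
A sample analogue of equation (9) can be sketched as follows. This is an illustration only: the function name `ipw_did` and the toy data are hypothetical, the propensity scores are passed in as known numbers rather than estimated, and the raw (unnormalized) weights of equation (9) are used for transparency.

```python
def ipw_did(y1, y2, w, g_obs, g, p, pi1, pi0):
    """Sample analogue of equation (9) at exposure level g.

    p[i]   : P(W_i = 1 | z_i)
    pi1[i] : P(G_i = g | W_i = 1, z_i)
    pi0[i] : P(G_i = g | W_i = 0, z_i)
    """
    n = len(y1)
    total = 0.0
    for i in range(n):
        if g_obs[i] != g:          # indicator 1{G_i = g}
            continue
        weight_w = (w[i] - p[i]) / (p[i] * (1.0 - p[i]))
        weight_g = 1.0 / (w[i] * pi1[i] + (1 - w[i]) * pi0[i])
        total += weight_w * weight_g * (y2[i] - y1[i])
    return total / n


# Hypothetical toy data: eight units, half of them at each exposure level.
y1 = [0.0] * 8
y2 = [3.0, 3.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0]
w = [1, 1, 0, 0, 1, 1, 0, 0]
g_obs = [1, 1, 1, 1, 0, 0, 0, 0]
half = [0.5] * 8               # assumed propensity scores

tau_1 = ipw_did(y1, y2, w, g_obs, 1, half, half, half)
tau_0 = ipw_did(y1, y2, w, g_obs, 0, half, half, half)
print(tau_1, tau_0)
```

In this toy design the treated-minus-control trend gap is 2 at exposure level one and 1 at exposure level zero, which the weighted averages reproduce.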
Alternatively, the EDATT can also be identified through regression adjustment.
Define the conditional means of the potential outcome as

\[
\mu_{t,wg}(z_i) = E(Y_{it} \mid W_i = w, G_i = g, z_i). \tag{10}
\]

The regression adjustment estimand is

\[
\tau(g) = \frac{1}{|D_M|} \sum_{i \in D_M} \Big[ \big( \mu_{2,1g}(z_i) - \mu_{2,0g}(z_i) \big) - \big( \mu_{1,1g}(z_i) - \mu_{1,0g}(z_i) \big) \Big]. \tag{11}
\]
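
Equation (11) averages a difference of fitted conditional means over the finite population. A minimal sketch, assuming the fitted means are available as a callable (the names `regression_adjustment` and `mu_hat`, and the linear form of `mu_hat`, are hypothetical):

```python
def regression_adjustment(z, mu, g):
    """Average of equation (11) over units; mu(t, w, g, z_i) is a fitted
    conditional mean E(Y_it | W_i = w, G_i = g, z_i)."""
    diffs = [
        (mu(2, 1, g, zi) - mu(2, 0, g, zi)) - (mu(1, 1, g, zi) - mu(1, 0, g, zi))
        for zi in z
    ]
    return sum(diffs) / len(diffs)


def mu_hat(t, w, g, z):
    """Hypothetical fitted means: a common level 1 + z in both periods and, in
    period 2 only, a shift of 1 + w + 0.5 * g + w * z."""
    return 1.0 + z + (t == 2) * (1.0 + w + 0.5 * g + w * z)


z = [0.0, 1.0, 2.0]
tau_ra = regression_adjustment(z, mu_hat, g=1)
print(tau_ra)
```

Under this made-up specification the unit-level contrast is 1 + z_i, so the average over z = (0, 1, 2) equals 2.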

To allow for more robustness against misspecification of the propensity scores
or the conditional means of the outcomes, the IPW-DID estimand can be extended
to an AIPW estimand. Let mt,wg (zi ) denote the model for equation (10). Denote
∆mt,g (zi ) = mt,1g (zi ) − mt,0g (zi ). Furthermore, let η(zi ), η1g (zi ), and η0g (zi ) be the
models for the propensity scores in equations (2)-(4), respectively. The doubly robust
estimand is

\[
\begin{aligned}
\tau(g) = E_D\bigg[ & \frac{W_i}{\eta(z_i)} \frac{1\{G_i = g\}}{\eta_{1g}(z_i)} \Big( \big( Y_{i2} - m_{2,1g}(z_i) \big) - \big( Y_{i1} - m_{1,1g}(z_i) \big) \Big) \\
& - \frac{1 - W_i}{1 - \eta(z_i)} \frac{1\{G_i = g\}}{\eta_{0g}(z_i)} \Big( \big( Y_{i2} - m_{2,0g}(z_i) \big) - \big( Y_{i1} - m_{1,0g}(z_i) \big) \Big) \\
& + \Delta m_{2,g}(z_i) - \Delta m_{1,g}(z_i) \bigg].
\end{aligned}
\tag{12}
\]

Proposition 2. Under Assumptions 1-3, equation (12) recovers the EDATT, τ (g),
as long as either the models for the propensity scores or the models for the conditional
means of the outcome are correctly specified.

Corollary 2. Proposition 2 also holds for τ ∗ (g) under the correct specification of the
exposure mapping.
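
The double robustness in Proposition 2 can be illustrated on toy data. The sketch below is a sample analogue of equation (12) under stated assumptions: the function name `aipw_did`, the data, and the "right" and "wrong" nuisance models are all hypothetical, and nothing is estimated.

```python
def aipw_did(y1, y2, w, g_obs, z, g, eta, eta1, eta0, m):
    """Sample analogue of equation (12) at exposure level g.

    eta(z)  : model for P(W = 1 | z)
    eta1(z) : model for P(G = g | W = 1, z)
    eta0(z) : model for P(G = g | W = 0, z)
    m(t, w, g, z) : model for E(Y_t | W = w, G = g, z)
    """
    n = len(y1)
    total = 0.0
    for i in range(n):
        zi = z[i]
        ind = 1.0 if g_obs[i] == g else 0.0
        res1 = (y2[i] - m(2, 1, g, zi)) - (y1[i] - m(1, 1, g, zi))
        res0 = (y2[i] - m(2, 0, g, zi)) - (y1[i] - m(1, 0, g, zi))
        treated = w[i] / eta(zi) * ind / eta1(zi) * res1
        control = (1 - w[i]) / (1.0 - eta(zi)) * ind / eta0(zi) * res0
        adjustment = (m(2, 1, g, zi) - m(2, 0, g, zi)) - (m(1, 1, g, zi) - m(1, 0, g, zi))
        total += treated - control + adjustment
    return total / n


# Hypothetical toy data (true trend gap at exposure level one is 2).
y1 = [0.0] * 8
y2 = [3.0, 3.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0]
w = [1, 1, 0, 0, 1, 1, 0, 0]
g_obs = [1, 1, 1, 1, 0, 0, 0, 0]
z = [0.0] * 8

# Case 1: propensity models right, outcome model wrong (identically zero).
dr_ps = aipw_did(y1, y2, w, g_obs, z, 1,
                 eta=lambda zi: 0.5, eta1=lambda zi: 0.5, eta0=lambda zi: 0.5,
                 m=lambda t, wv, gv, zi: 0.0)

# Case 2: outcome model right at g = 1, propensity models wrong.
def m_right(t, wv, gv, zi):
    return (3.0 if wv == 1 else 1.0) if t == 2 else 0.0

dr_or = aipw_did(y1, y2, w, g_obs, z, 1,
                 eta=lambda zi: 0.9, eta1=lambda zi: 0.2, eta0=lambda zi: 0.7,
                 m=m_right)
print(dr_ps, dr_or)
```

Both calls return the same value because either the weighted residual terms or the model-based adjustment term carries the full contrast.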

3.3 Pre-trends
In empirical research, tests for pre-trends remain common despite the caution
described in Roth (2022). A placebo DID is typically applied to multiple periods
observed before treatment by imposing a hypothetical period of adoption of
treatment. Something similar can be done in the context of interference. In the simplest
case, suppose there are two time periods t = 0, 1 prior to treatment. By imposing the
placebo treatment between time periods 0 and 1, we would like to test

\[
\begin{aligned}
& \frac{1}{|D_M|} \sum_{i \in D_M} \Big[ E\big( y_{i1}(0, W_{-i}) \mid W_i = 1, G_i = g, z_i \big) - E\big( y_{i0}(0, 0) \mid W_i = 1, G_i = g, z_i \big) \Big] \\
&\quad = \frac{1}{|D_M|} \sum_{i \in D_M} \Big[ E\big( y_{i1}(0, W_{-i}) \mid W_i = 0, G_i = g, z_i \big) - E\big( y_{i0}(0, 0) \mid W_i = 0, G_i = g, z_i \big) \Big].
\end{aligned}
\tag{13}
\]

The potential outcome yi1 (0, W−i ) is not observable because no unit is treated in
time period 1. Nevertheless, under the no anticipation assumption, equation (13)
reduces to

\[
\begin{aligned}
& \frac{1}{|D_M|} \sum_{i \in D_M} \Big[ E\big( y_{i1}(0, 0) \mid W_i = 1, G_i = g, z_i \big) - E\big( y_{i0}(0, 0) \mid W_i = 1, G_i = g, z_i \big) \Big] \\
&\quad = \frac{1}{|D_M|} \sum_{i \in D_M} \Big[ E\big( y_{i1}(0, 0) \mid W_i = 0, G_i = g, z_i \big) - E\big( y_{i0}(0, 0) \mid W_i = 0, G_i = g, z_i \big) \Big].
\end{aligned}
\tag{14}
\]
Namely, one can test whether subgroups defined by the combination of direct treat-
ment status and exposure level have different trends before actual treatment. In
reality, the testing equation (14) is a test of both no anticipation and parallel pre-
trends.
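
As a rough illustration of the comparison in equation (14), the sketch below contrasts raw pre-period trends across treatment status within one exposure level. It is a simplification under stated assumptions: it ignores the covariates z (so it is an unconditional version of the test), it reports only a point gap without a standard error, and the function name `placebo_trend_gap` and the data are hypothetical.

```python
def placebo_trend_gap(y_pre0, y_pre1, w, g_obs, g):
    """Mean pre-period trend of eventually treated units minus that of
    untreated units, among units with exposure level g."""
    def mean_trend(w_val):
        d = [b - a for a, b, wi, gi in zip(y_pre0, y_pre1, w, g_obs)
             if wi == w_val and gi == g]
        return sum(d) / len(d)   # raises ZeroDivisionError if the cell is empty
    return mean_trend(1) - mean_trend(0)


# Hypothetical pre-treatment outcomes in periods 0 and 1.
y_pre0 = [1.0, 2.0, 1.5, 2.5, 0.0, 0.0]
y_pre1 = [2.0, 3.0, 2.5, 3.5, 1.0, 2.0]
w = [1, 1, 0, 0, 1, 0]
g_obs = [1, 1, 1, 1, 0, 0]

gap_g1 = placebo_trend_gap(y_pre0, y_pre1, w, g_obs, g=1)
gap_g0 = placebo_trend_gap(y_pre0, y_pre1, w, g_obs, g=0)
print(gap_g1, gap_g0)
```

A gap near zero at a given exposure level is consistent with parallel pre-trends for that subgroup; in these made-up data the gap is zero at exposure level one and nonzero at exposure level zero.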

4 Asymptotic Properties of the Parametric Estimator
I am primarily concerned with estimating the EDATT in this section. Spillover effects
are defined in Section A in the online appendix. Their estimation is similar to that
of the EDATT. I propose a GMM estimator combining equation (12) with moment
conditions for the propensity scores and conditional means of outcomes chosen by the
empirical researcher. Since the inference is only conditional on the covariates z, all
unobservables are identically distributed conditional on zi .9 The inference framework
implies that the individual propensity score function and the individual conditional
mean function of the outcome remain the same across units.

9 It suffices that the first conditional moment is identical.

To make the estimators more robust to misspecification of these functions, one can
use various moment conditions to identify the propensity scores. One option is the
covariate balancing propensity scores (CBPS) in Imai and Ratkovic (2014) or simi-
larly the inverse probability tilting estimator in Graham, de Xavier Pinto, and Egel
(2012), which can be locally more robust than the propensity scores based on maxi-
mum likelihood estimation (MLE). The alternative would be estimating all functions
semiparametrically or nonparametrically, which is left as future work.
I denote a generic moment condition for propensity scores as

\[
E_D\big[ q_1(W_i, z_i, \gamma_1^*) \big] = 0 \tag{15}
\]

and

\[
E_D\big[ q_2(W_i, G_i, z_i, \gamma_2^*) \big] = 0, \tag{16}
\]

where zi can contain neighbors’ attributes. For instance, the moment conditions for
CBPS are

\[
E_D\left[ \frac{W_i}{P(W_i = 1 \mid z_i)} z_i - \frac{1 - W_i}{1 - P(W_i = 1 \mid z_i)} z_i \right] = 0 \tag{17}
\]

and for g = 1, 2, . . . , G − 1,

\[
E_D\left[ \frac{1\{G_i = g\}}{P(G_i = g \mid W_i, z_i)} (W_i, z_i) - \frac{1\{G_i = g - 1\}}{P(G_i = g - 1 \mid W_i, z_i)} (W_i, z_i) \right] = 0, \tag{18}
\]

where P (Wi = 1|zi ) is some probability for a binary response, such as
$\exp(z_i \gamma_1^*) / (1 + \exp(z_i \gamma_1^*))$,
and P (Gi = g|Wi , zi ) is some probability for discrete choices. Similarly, generic
conditional moment conditions are denoted by

\[
E_D\big[ q_3(Y_{i1}, W_i, G_i, z_i, \gamma_3^*) \big] = 0 \tag{19}
\]

and

\[
E_D\big[ q_4(Y_{i2}, W_i, G_i, z_i, \gamma_4^*) \big] = 0. \tag{20}
\]
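
To make the balancing condition (17) concrete, here is a minimal sketch of its sample analogue under the logit form mentioned above. The function names `logit_prob` and `cbps_moment` and the data are hypothetical; the sketch only evaluates the moment at a given parameter value and does not solve for the parameter.

```python
import math


def logit_prob(z, gamma):
    """Logit probability exp(z'gamma) / (1 + exp(z'gamma)) for one unit."""
    index = sum(zk * gk for zk, gk in zip(z, gamma))
    return 1.0 / (1.0 + math.exp(-index))


def cbps_moment(w, z, gamma):
    """Sample average of W_i z_i / p_i - (1 - W_i) z_i / (1 - p_i),
    returned as one entry per covariate."""
    n, k = len(w), len(gamma)
    avg = [0.0] * k
    for wi, zi in zip(w, z):
        p = logit_prob(zi, gamma)
        scale = wi / p - (1 - wi) / (1.0 - p)
        for j in range(k):
            avg[j] += scale * zi[j] / n
    return avg


# Hypothetical data: intercept-only covariate, three treated units out of four.
w = [1, 1, 1, 0]
z = [[1.0], [1.0], [1.0], [1.0]]

at_zero = cbps_moment(w, z, [0.0])              # p = 0.5: weights do not balance
at_log3 = cbps_moment(w, z, [math.log(3.0)])    # p = 0.75: weights balance
print(at_zero, at_log3)
```

With an intercept only, the moment is zero exactly when the fitted probability matches the treated share, which is the balancing property the condition imposes covariate by covariate.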

Alternatively, one can model the conditional mean for ∆Yi = Yi2 − Yi1 and formulate
the moment condition as

\[
E_D\big[ \tilde{q}_3(\Delta Y_i, W_i, G_i, z_i, \tilde{\gamma}_3^*) \big] = 0. \tag{21}
\]

If there are only a few possible values that the exposure levels Gi can take, one can
alternatively model the conditional outcomes for the subpopulation with Wi = w
and Gi = g as a function of zi , separately. Leading cases for outcome regression are
moment conditions from (nonlinear) least squares. Lastly, the moment condition for
τ (g) is a restatement of equation (12).10 Denote
$\theta_M^* = (\gamma_1^{*\prime}, \gamma_2^{*\prime}, \gamma_3^{*\prime}, \gamma_4^{*\prime}, \tau(g))'$.

\[
\begin{aligned}
E_D\big[ q_5(Y_{it}, W_i, G_i, z_i, \theta_M^*) \big]
= E_D\bigg[ & \frac{W_i}{\eta(z_i)} \frac{1\{G_i = g\}}{\eta_{1g}(z_i)} \Big( \big( Y_{i2} - m_{2,1g}(z_i) \big) - \big( Y_{i1} - m_{1,1g}(z_i) \big) \Big) \\
& - \frac{1 - W_i}{1 - \eta(z_i)} \frac{1\{G_i = g\}}{\eta_{0g}(z_i)} \Big( \big( Y_{i2} - m_{2,0g}(z_i) \big) - \big( Y_{i1} - m_{1,0g}(z_i) \big) \Big) \\
& + \Delta m_{2,g}(z_i) - \Delta m_{1,g}(z_i) - \tau(g) \bigg] = 0
\end{aligned}
\tag{22}
\]

Let Xi = {Yit , Wi , Gi , zi }, q(Xi , θ) = (q1′ (γ1 ), q2′ (γ2 ), q3′ (γ3 ), q4′ (γ4 ), q5 (θ))′ , and let
$\hat{\Psi}$ be the weighting matrix, with dimension greater than or equal to that of θ.

\[
\hat{\theta} = \arg\min_{\theta \in \Theta} \bigg[ \frac{1}{|D_M|} \sum_{i \in D_M} q(X_i, \theta) \bigg]' \hat{\Psi} \bigg[ \frac{1}{|D_M|} \sum_{i \in D_M} q(X_i, \theta) \bigg] \tag{23}
\]

The GMM estimator is the solution to the finite population minimization problem
in equation (23). The estimator of τ (g) is the last element of θ̂.
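
The criterion in equation (23) is a quadratic form in the averaged moments. A minimal sketch of evaluating it at a given parameter value (the function name `gmm_objective` and the numbers are hypothetical; the per-unit moment vectors are taken as already computed, and no minimization is performed):

```python
def gmm_objective(moments, psi):
    """Quadratic form q_bar' * psi * q_bar, where q_bar is the average of the
    per-unit moment vectors and psi is the weighting matrix (list of rows)."""
    n, k = len(moments), len(moments[0])
    q_bar = [sum(row[j] for row in moments) / n for j in range(k)]
    return sum(q_bar[a] * psi[a][b] * q_bar[b] for a in range(k) for b in range(k))


# Hypothetical per-unit moments for two units and two moment conditions.
moments = [[1.0, 0.0], [3.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
reweighted = [[2.0, 0.0], [0.0, 1.0]]

print(gmm_objective(moments, identity), gmm_objective(moments, reweighted))
```

Passing this function to a numerical optimizer over θ, with the moments recomputed at each candidate value, would give one way to obtain θ̂; that step is left out here.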
I impose the following assumptions to study the asymptotic distribution of the
GMM estimator.
10 In practice, it is recommended to normalize the weights for IPW type estimators. Changing the
moment condition with normalized propensity scores, where the weights sum to unity, does not
affect asymptotic normality of the GMM estimator. In fact, estimators with normalized weights
consistently show better finite sample performance in the simulations below.

Assumption 4. Suppose {DM } ⊆ Rd , d ≥ 1 is a sequence of finite sets such that
|DM | → ∞ as M → ∞. All elements in DM are located at distances of at least
ρ0 > 0 from each other, i.e., for all i, j ∈ DM : ρ(i, j) ≥ ρ0 ; w.l.o.g. I assume that
ρ0 > 1.

Consistent with the increasing domain asymptotics, the assumption of the mini-
mum distance ensures the expansion of the finite population region.

Assumption 5. (Approximate Neighborhood Interference) Let
$W^{(i,s)} = (W_{N(i,s)}, W'_{D_M \setminus N(i,s)})$, where W ′ is an independent copy of W ,
$W^{(i,s,0)} = (W_{N(i,s)}, 0)$, i.e., $W'_{D_M \setminus N(i,s)} = 0$, and

\[
\kappa_M(s) = \max_{i \in D_M} E\Big[ \big| y_{i2}(W) - y_{i2}\big( W^{(i,s,0)} \big) \big| \,\Big|\, z \Big]. \tag{24}
\]

Suppose that supM κM (s) → 0 as s → ∞.

Assumption 5 is a modified version of Assumption 4 in Leung (2022). Leung
(2022) varies $W'_{D_M \setminus N(i,s)}$ in an arbitrary way, but these treatments outside of the
s-neighborhood are fixed at zero here. Assumption 5 essentially implies that the
treatments of units farther than distance s from i have a vanishing impact on unit i's
outcome as s gets larger. This way, we can allow interference from outside the
immediate K-neighborhood while still being able to derive the asymptotic properties of the
proposed estimators. Leung (2022) has shown that several interference structures
satisfy the ANI assumption, including the linear-in-means model with endogenous
peer effects. Section C of the online appendix gives an overview of the different
approaches to modeling interference taken in the literature and compares them to
ANI.
I adopt ψ-dependence in Kojevnikov, Marmer, and Song (2021) as the notion of
weak dependence throughout the paper. Notice that α-mixing is a special case of
ψ-dependence. Let Lν,h denote the collection of bounded Lipschitz real functions f (·)
on Rν×h with the Lipschitz constant Lip(f ) < ∞ and ‖f‖∞ < ∞, where ‖f‖∞ =
supx |f (x)|. Denote the collection of subset pairs as

\[
P_M(h, h'; s) = \{ (H, H') : H, H' \subseteq D_M, \ |H| = h, \ |H'| = h', \ \rho(H, H') \ge s \}.
\]

Definition 1. A triangular array {Vi , i ∈ DM , M ≥ 1}, Vi ∈ Rν , is called ψ-
dependent if there exist uniformly bounded constants {κ̃M,s }s≥0 with κ̃M,0 = 1, and
a collection of nonrandom functions {ψh,h′ }h,h′∈N with ψh,h′ : Lν,h × Lν,h′ → [0, ∞)
such that for all (H, H ′ ) ∈ PM (h, h′ ; s) with s > 0 and all f ∈ Lν,h and f ′ ∈ Lν,h′ ,

\[
\big| Cov\big( f(V_H), f'(V_{H'}) \mid z \big) \big| \le \psi_{h,h'}(f, f') \, \tilde{\kappa}_{M,s},
\]

where VH = (Vi : i ∈ H).

I require κ̃M,s to approach zero as s grows. ψ-dependence bounds the covariances
of any two subsets of observations distant from each other.

Assumption 6. Let yit = φ(Wi , W−i , zi , Uit ), where φ(·) is some generic function
and Uit denotes the unobservables. Let εi = (Wi , Ui1 , Ui2 ). The random field ε =
{εi , i ∈ DM , M ≥ 1} is α-mixing under Definition 2 in Jenish and Prucha (2012).
The mixing coefficient is denoted by $\alpha_\epsilon(u, v, r) \le (u + v)\,\hat{\alpha}_\epsilon(r)$.

On top of possible interference, Assumption 6 allows assignment variables to be
spatially correlated as well.

Lemma 4.1. Under Assumptions 4, 5, 6, and Assumption A.1 in Appendix A, for
each θ ∈ Θ, each element of q(Xi , θ) and ∇θ q(Xi , θ) is ψ-dependent with

\[
\tilde{\kappa}_{M,s} = \big[ \kappa_M(s/3) + s^d \hat{\alpha}_\epsilon(s/3) \big] \, 1(s > 3\max\{K, \rho_0\}) + 1(s \le 3\max\{K, \rho_0\}).
\]

To adapt the limit theorems in Kojevnikov et al. (2021) to spatial data, I replace
the network denseness with the cardinality of the spatial sets implied by Lemma
A.1 in Jenish and Prucha (2009). As a result, Assumption 3.2 in Kojevnikov et al.
(2021) is modified as

Assumption 7.

\[
\sum_{s=1}^{\infty} s^{d-1} \tilde{\kappa}_{M,s} < \infty
\]

Assumption 7 is in line with Assumption 3(b) in Jenish and Prucha (2009) for
α-mixing random fields.

Let $\sigma_M^2 = Var\big( \sum_{i \in D_M} \lambda' q(X_i, \theta_M^*) \mid z \big)$ for a nonzero vector λ. Similarly,
Assumption 3.4 in Kojevnikov et al. (2021) is modified as

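Assumption 8. There exists a positive sequence rM → ∞ such that for k = 1, 2

\[
\frac{1}{\sigma_M^{2+k}} \sum_{i \in D_M} \sum_{s \ge 0} s^{d-1} \max_{j \in D_M,\, s \le \rho(i,j) < s+1} \big| N(i; r_M) \setminus N(j; s-1) \big|^{k} \, \tilde{\kappa}_{M,s}^{\,1 - \frac{2+k}{p}} \to 0
\]

and

\[
\frac{|D_M|^2 \, \tilde{\kappa}_{M,r_M}^{\,1 - (1/p)}}{\sigma_M} \to 0
\]

as M → ∞, where p > 4 is the constant that appears in Assumption A.1 in Appendix A.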

The decay rate of κ̃M,s is implicitly restricted by Assumption 8. A sufficient condition for
the first part of the assumption is

\[
\frac{|D_M| \, r_M^{kd}}{\sigma_M^{2+k}} \sum_{s \ge 0} s^{d-1} \tilde{\kappa}_{M,s}^{\,1 - \frac{2+k}{p}} \to 0.
\]

Analogous conditions can be found in Jenish and Prucha (2009) as equations (B.18)
and (B.19) therein.
The notation used in the asymptotic distribution of the GMM estimator is intro-
duced below. Define

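\[
\Omega_M = \Delta_{ehw,M} + \Delta_{spatial,M} - \Delta_{E,M} - \Delta_{ES,M}, \tag{25}
\]

where

\[
\Delta_{ehw,M} = \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ q(X_i, \theta_M^*) q(X_i, \theta_M^*)' \mid z \big], \tag{26}
\]

\[
\Delta_{E,M} = \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ q(X_i, \theta_M^*) \mid z \big] E\big[ q(X_i, \theta_M^*) \mid z \big]', \tag{27}
\]

\[
\Delta_{spatial,M} = \frac{1}{|D_M|} \sum_{i \in D_M} \sum_{j \in D_M, j \ne i} E\big[ q(X_i, \theta_M^*) q(X_j, \theta_M^*)' \mid z \big], \tag{28}
\]

and

\[
\Delta_{ES,M} = \frac{1}{|D_M|} \sum_{i \in D_M} \sum_{j \in D_M, j \ne i} E\big[ q(X_i, \theta_M^*) \mid z \big] E\big[ q(X_j, \theta_M^*) \mid z \big]'. \tag{29}
\]

∆ehw,M and ∆spatial,M account for heteroskedasticity and spatial correlation, respectively,
whereas ∆E,M and ∆ES,M are their finite population counterparts. Denote

\[
R_M^* = E_D\big[ \nabla_\theta q(X_i, \theta_M^*) \big]
\]

and

\[
V_M = \big( R_M^{*\prime} \Psi_M R_M^* \big)^{-1} R_M^{*\prime} \Psi_M \Omega_M \Psi_M R_M^* \big( R_M^{*\prime} \Psi_M R_M^* \big)^{-1}, \tag{30}
\]

where $\hat{\Psi} - \Psi_M \xrightarrow{p} 0$.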

Theorem 4.2. Under Assumptions 1-8, and Assumption A.1 in Appendix A, if
either equations (2)-(4) or equation (10) are correctly specified,

\[
V_M^{-1/2} \sqrt{|D_M|} \, (\hat{\theta} - \theta_M^*) \xrightarrow{d} N(0, I_k).
\]

Corollary 3. Under correct specification of the exposure mapping, Theorem 4.2 holds
for $\theta_M^* = (\gamma_1^{*\prime}, \gamma_2^{*\prime}, \gamma_3^{*\prime}, \gamma_4^{*\prime}, \tau^*(g))'$.

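With the consideration of interference and potentially spatially correlated
assignments, we need to make inference robust to spatial correlation. As a common
approach to adjust the variance estimator for spatial correlation, the usual spatial
heteroskedasticity and autocorrelation consistent (SHAC) variance estimator is
defined as

\[
\hat{V} = \big( \hat{R}' \hat{\Psi} \hat{R} \big)^{-1} \hat{R}' \hat{\Psi} \, \tilde{\Omega}(\hat{\theta}) \, \hat{\Psi} \hat{R} \big( \hat{R}' \hat{\Psi} \hat{R} \big)^{-1},
\]

where

\[
\hat{R} = \frac{1}{|D_M|} \sum_{i \in D_M} \nabla_\theta q(X_i, \hat{\theta})
\]

and

\[
\tilde{\Omega}(\theta) = \frac{1}{|D_M|} \sum_{s=0}^{\infty} \omega\Big( \frac{s}{b_M} \Big) \sum_{i \in D_M} \sum_{j \in D_M,\, s \le \rho(i,j) < s+1} q(X_i, \theta) q(X_j, \theta)'.
\]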
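
The middle term of the SHAC estimator groups pairs of units into unit-width distance bins and down-weights distant pairs. The sketch below is one way to compute it under stated assumptions: the function names `bartlett` and `shac_omega` are hypothetical, the Bartlett kernel is used as an example choice, and the per-unit moments and pairwise distances are taken as given.

```python
import math


def bartlett(x):
    """Bartlett kernel: 1 - |x| on [-1, 1], zero outside."""
    return max(0.0, 1.0 - abs(x))


def shac_omega(q, dist, b_m, kernel=bartlett):
    """Kernel-weighted sum of outer products q_i q_j' divided by the number of units.

    q    : per-unit moment vectors evaluated at the estimate
    dist : dist[i][j] = rho(i, j), with dist[i][i] = 0
    b_m  : bandwidth
    """
    n, k = len(q), len(q[0])
    omega = [[0.0] * k for _ in range(k)]
    for i in range(n):
        for j in range(n):
            s = math.floor(dist[i][j])     # bin index with s <= rho(i, j) < s + 1
            weight = kernel(s / b_m)
            if weight == 0.0:
                continue
            for a in range(k):
                for b in range(k):
                    omega[a][b] += weight * q[i][a] * q[j][b] / n
    return omega


# Hypothetical example: two units at distance 1.5 with scalar moments 1 and 2.
q = [[1.0], [2.0]]
dist = [[0.0, 1.5], [1.5, 0.0]]
print(shac_omega(q, dist, b_m=2.0), shac_omega(q, dist, b_m=1.0))
```

With bandwidth 2 the cross terms enter with weight 0.5; with bandwidth 1 they drop out and only the own-unit terms remain.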

I impose the following assumption for the estimation of the variance-covariance
matrix.

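Assumption 9. The weights satisfy:

(i) $\omega(0) = 1$, $\omega(s/b_M) = 0$ for any $s > b_M$, $|\omega(s/b_M)| < \infty$, ∀ M;

(ii)
\[
\sum_{s=1}^{\infty} \Big| \omega\Big( \frac{s}{b_M} \Big) - 1 \Big| \, s^{d-1} \tilde{\kappa}_{M,s}^{\,1 - 2/p} \to 0;
\]

(iii)
\[
\frac{1}{|D_M|} \sum_{s \ge 0} s^{d-1} b_M^{2d} \, \tilde{\kappa}_{M,s}^{\,1 - 4/p} \to 0
\]

as M → ∞, where $b_M = o\big( |D_M|^{1/(2d)} \big)$ and p > 4 is the constant that appears in
Assumption A.1 in Appendix A.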

Assumption 9(i) is satisfied by common choices of kernels, including the Bartlett
and Parzen kernels. Assumption 9(ii) requires that the kernel weights ω(s/bM ) converge
to one sufficiently fast as M → ∞. Part (iii) of Assumption 9 regulates the growth
rate of the bandwidth {bM }.
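
As an illustration of the kernel properties in Assumption 9(i) and of the convergence of the weights in 9(ii), here is a small sketch using the standard Parzen kernel. The function name `parzen` and the bandwidth values are only illustrative.

```python
def parzen(x):
    """Parzen kernel: 1 - 6x^2 + 6|x|^3 for |x| <= 1/2,
    2(1 - |x|)^3 for 1/2 < |x| <= 1, and zero otherwise."""
    a = abs(x)
    if a <= 0.5:
        return 1.0 - 6.0 * a ** 2 + 6.0 * a ** 3
    if a <= 1.0:
        return 2.0 * (1.0 - a) ** 3
    return 0.0


# Weight on distance bin s = 2 for growing (hypothetical) bandwidths b_M.
weights = [parzen(2 / b_m) for b_m in (4, 10, 100)]
print(weights)
```

The kernel equals one at zero and vanishes beyond the bandwidth, and the weight at a fixed distance bin rises toward one as the bandwidth grows.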

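Theorem 4.3. Under conditions in Theorem 4.2 and Assumption 9,

\[
\hat{V} - (V_M + V_E) \xrightarrow{p} 0,
\]

where

\[
V_E = \big( R_M^{*\prime} \Psi_M R_M^* \big)^{-1} R_M^{*\prime} \Psi_M \Omega_E \Psi_M R_M^* \big( R_M^{*\prime} \Psi_M R_M^* \big)^{-1}
\]

and

\[
\Omega_E = \frac{1}{|D_M|} \sum_{s=0}^{\infty} \omega\Big( \frac{s}{b_M} \Big) \sum_{i \in D_M} \sum_{j \in D_M,\, s \le \rho(i,j) < s+1} E\big[ q(X_i, \theta_M^*) \mid z \big] E\big[ q(X_j, \theta_M^*) \mid z \big]'.
\]

Corollary 4. Under correct specification of the exposure mapping, Theorem 4.3 holds
for $\theta_M^* = (\gamma_1^{*\prime}, \gamma_2^{*\prime}, \gamma_3^{*\prime}, \gamma_4^{*\prime}, \tau^*(g))'$.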

Remark 2. When we choose kernel functions that produce a positive semi-definite
(PSD) weighting matrix, the usual SHAC variance estimator is generally conservative
for the finite population conditional spatial-correlation robust variance-covariance
matrix.

The conservativeness of the usual variance estimator for conditional variance has
also been investigated in Abadie et al. (2014) under the independence assumption
for the heteroskedasticity-robust variance matrix. I extend it to the case with spatial
correlation here when ΩE is PSD based on a PSD kernel weighting matrix. An
exception to Remark 2 is when $E[q(X_i, \theta_M^*) \mid z] = 0$ for all i ∈ DM . In this case, the
usual variance-covariance matrix estimator is no longer conservative as VE = 0. With
heterogeneous direct treatment effect or misspecification of either the propensity
scores or conditional means, $E[q(X_i, \theta_M^*) \mid z] \ne 0$.
That said, I would like to highlight a few points. First, because Ω̃(θ̂) is a
conservative estimator for ΩM , even if we choose ΨM as the optimal weighting matrix
$\Omega_M^{-1}$, using $\hat{\Psi} = \tilde{\Omega}(\hat{\theta})$ in estimation is not going to achieve the most efficient GMM
estimator. The usual variance estimator is therefore conservative not only because
of the neglect of the additional terms in the variance-covariance matrix but also be-
cause the optimal weighting matrix is not consistently estimated. Of course, when
the model is just identified, the weighting matrix choice is irrelevant.
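
The irrelevance of the weighting matrix in the just-identified case follows from a short calculation. As a sketch, assume that $R_M^*$ is square and invertible and that $\Psi_M$ is nonsingular (assumptions added here for the illustration). Then the sandwich form in equation (30) collapses:

\[
\begin{aligned}
V_M &= \big( R_M^{*\prime} \Psi_M R_M^* \big)^{-1} R_M^{*\prime} \Psi_M \Omega_M \Psi_M R_M^* \big( R_M^{*\prime} \Psi_M R_M^* \big)^{-1} \\
&= R_M^{*-1} \Psi_M^{-1} \big( R_M^{*\prime} \big)^{-1} R_M^{*\prime} \Psi_M \Omega_M \Psi_M R_M^* R_M^{*-1} \Psi_M^{-1} \big( R_M^{*\prime} \big)^{-1} \\
&= R_M^{*-1} \, \Omega_M \, \big( R_M^{*\prime} \big)^{-1},
\end{aligned}
\]

which does not depend on $\Psi_M$.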
Second, unlike the finite population variance-covariance matrix in Xu and Wooldridge
(2022), the conditional spatial-correlation robust variance matrix is consistently es-
timable because it is no longer conditional on the unobserved potential outcomes.
There are different approaches one can take. However, since the usual SHAC vari-
ance estimator is known to suffer from downward bias especially when the spatial
correlation is high, it is not always necessary to estimate the smaller conditional
variance matrix.

5 Simulations
In the simulation, I show the finite sample performance of the proposed estimators
for EDATT. I consider an irregularly spaced lattice with M = 400 units. The
locations (s1,iM ,s2,iM ) are drawn once and kept fixed across replications. Each of
s1,iM and s2,iM is independently drawn from U(0, 20). The distance between units
i and j is measured by ρ(i, j) = max{|s1,iM − s1,jM |, |s2,iM − s2,jM |}. Units are
considered neighbors if ρ(i, j) ≤ 0.3 with the neighborhood structure summarized by
the normalized contiguity matrix, A. After ruling out units without neighbors, the
effective size of the subpopulation eligible for spillover reduces to 350.
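
The spatial layout described above can be sketched in a few lines. This is an illustration under assumptions rather than the code used for the simulations: the function name `simulate_layout`, the seed, and the smaller parameter values in the example call are hypothetical, the contiguity matrix is assumed to be row-normalized, and the number of units with neighbors depends on the particular draw.

```python
import random


def simulate_layout(m=400, side=20.0, radius=0.3, seed=0):
    """Draw m locations uniformly on [0, side]^2, define neighbors by the
    maximum-coordinate distance, and build a row-normalized contiguity matrix."""
    rng = random.Random(seed)
    loc = [(rng.uniform(0.0, side), rng.uniform(0.0, side)) for _ in range(m)]

    def rho(i, j):
        return max(abs(loc[i][0] - loc[j][0]), abs(loc[i][1] - loc[j][1]))

    neighbors = [[j for j in range(m) if j != i and rho(i, j) <= radius]
                 for i in range(m)]
    a = []
    for nb in neighbors:
        row = [0.0] * m
        for j in nb:
            row[j] = 1.0 / len(nb)   # each neighbor gets equal weight
        a.append(row)
    return loc, neighbors, a


# Small hypothetical example; units with an empty neighbor list would be
# dropped from the subpopulation eligible for spillover.
loc, neighbors, a = simulate_layout(m=60, side=5.0, radius=0.5, seed=1)
eligible = [i for i, nb in enumerate(neighbors) if nb]
print(len(eligible))
```

Locations are drawn once and then held fixed, which matches the design of keeping the lattice constant across replications.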
I consider panel data with two time periods. The potential outcome function in the
first time period remains the same across different designs.

y1 (0, 0) = 1 + z + e1 ,

where z is the individual covariate independently drawn from the standard normal
distribution and kept fixed, while e1 is the first time period unobservable. There is
a single binary treatment variable W = 1{p(z ∗ ) > u} with ui ∼ i.i.d. U(0, 1). I vary
the second time period potential outcome function and the assignment probability
p(z ∗ ) in different designs summarized in Table 1 below. Here z ∗ = (z, zu ), where the
vector of zu in the assignment probability is drawn from a multivariate normal
distribution with mean zero and a variance-covariance matrix equal to 0.5 raised to the
power of the distance between units. Thus, zu is a spatially correlated locational
covariate that stands for neighborhood similarity, which might be neglected in naive
estimation assuming away spillover effect. The unobservables e1 and e2 , the latter
being the individual second time period unobservable, satisfy
ei1 |Wi , W−i , zi ∼ N (Wi ∗ zi , 1), ei2 |Wi , W−i , zi ∼ N (Wi ∗ zi , 1),
and ei1 ⊥⊥ ei2 |Wi , W−i , zi , ∀ i. The specified exposure mapping is denoted by
G = 1{AW > 0}, which may or may not coincide with the true interference structure.
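
The assignment rule and the specified exposure mapping can be sketched as follows. This is a hedged illustration: the function name `assign_and_expose` and the inputs are hypothetical, the logit index 0.3z + 0.8zu is taken from the second assignment probability in Table 1, the spatially correlated covariate zu is supplied by the caller rather than drawn, and G = 1{AW > 0} is computed from neighbor lists, which is equivalent when A has positive weights exactly on neighbors.

```python
import math
import random


def assign_and_expose(z, z_u, neighbors, seed=0):
    """W_i = 1{p(z*_i) > u_i} with a logit p, and G_i = 1{some neighbor is treated}."""
    rng = random.Random(seed)
    w = []
    for zi, zui in zip(z, z_u):
        index = 0.3 * zi + 0.8 * zui
        p = 1.0 / (1.0 + math.exp(-index))
        w.append(1 if p > rng.random() else 0)   # u_i drawn i.i.d. from U(0, 1)
    g = [1 if any(w[j] for j in nb) else 0 for nb in neighbors]
    return w, g


# Hypothetical inputs: extreme covariate values make the assignment effectively
# deterministic so the output is easy to read.
z = [1000.0, -1000.0, 1000.0]
z_u = [0.0, 0.0, 0.0]
neighbors = [[1], [0], []]

w, g = assign_and_expose(z, z_u, neighbors)
print(w, g)
```

Unit 1 is untreated but exposed through its treated neighbor, unit 0 is treated but unexposed, and unit 2 has no neighbors and would fall outside the subpopulation eligible for spillover.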
Table 1: Simulation Designs
Design   Assignment probability   Second period potential outcomes
1        p1                       Y2 = 2 + W + G + z + e2
2        p1                       Y2 = 2 + W + G + 2z + e2
3        p2                       Y2 = 2 + W + G + 2z + e2
4        p2                       Y2 = 2 + W + 0.2A ∗ Y2 + 2z + e2
5        p2                       Y2 = 2 + W + 0.2A ∗ Y2 + 2z² + e2
6        p2                       Y2 = 2 + W ∗ G + 2z + e2
1 p1 = p(z) = exp(0.3z)/(1 + exp(0.3z)); p2 = p(z ∗ ) = exp(0.3z + 0.8zu )/(1 + exp(0.3z + 0.8zu )).
2 Y2 , W , z, and e2 stand for the M × 1 vectors of Y2 , W , z, and e2 .
3 In designs 4 and 5, the parallel trends assumption holds approximately with the
difference between the trends being less than 0.001.

I compare the following estimators: the canonical TWFE, Abadie’s IPW estimator,
the augmented TWFE, regression adjustment, IPW estimator with either
MLE or CBPS moment condition for the propensity scores, and the proposed AIPW
estimator with either MLE or CBPS moment condition for the propensity scores.
Section F of the online appendix contains the standard deviation of the proposed
estimators and the coverage rate of the 95% confidence intervals based on the usual
SHAC standard errors for the doubly robust estimators.
For the canonical Abadie’s IPW estimator, I only include z in the logit model of
W as interference is assumed away when employing the canonical DID estimators.
As an illustration of Proposition 1, I also report Abadie’s IPW estimator with z,
Az, and zu included in the logit model, which leads to conditional independence of
W and G. The estimation of the augmented TWFE is as in equation (B.1) in the
online appendix with Si = Gi . For the estimation of the proposed IPW, regression,
and AIPW estimator accounting for spillover effect, the propensity scores for W and
G are estimated based on a logit model on z, Az, and zu and a logit model on W ,
z, Az, and zu , respectively. For the first time period data, I regress Y1 on W , z,
and W ∗ z. As for the second period data, I regress Y2 on W , z, W ∗ z, and G.
All estimators involving weighting are weighted by the normalized propensity scores.
The results are summarized across 10,000 replications.
Table 2: Expected Direct ATT
Design          1       2       3       4       5       6
twfe            0.998   1.259   1.355   1.333   0.884   0.909
abadie(z)       0.999   1.000   1.098   1.069   1.147   0.654
abadie(z ∗ )    0.997   0.997   0.999   1.028   1.031   0.607
atwfe1          1.003   1.272   1.250   1.289   0.842   1.250
atwfe0          0.998   1.248   1.270   1.296   0.771   0.270
ra1             1.001   1.001   1.001   1.041   1.139   0.692
ra0             1.001   1.001   1.001   1.041   1.139   0.692
ipw mle1        0.999   0.998   1.001   1.028   1.034   1.001
ipw mle0        1.007   1.018   1.042   1.088   1.083   0.042
ipw cbps1       0.998   0.996   1.000   1.026   1.038   1.000
ipw cbps0       1.009   1.019   1.037   1.081   1.076   0.037
dr mle1         1.002   1.002   1.002   1.028   1.037   1.001
dr mle0         0.997   0.997   0.999   1.045   1.084   0.001
dr cbps1        1.002   1.002   1.002   1.028   1.040   1.001
dr cbps0        0.997   0.997   0.999   1.042   1.072   0.001
1 twfe stands for the canonical TWFE estimator; abadie(z) stands for the canonical
Abadie’s IPW estimator including only z as the covariate; abadie(z ∗ ) stands for the
canonical Abadie’s IPW estimator using z, Az, and zu as the covariates; atwfe stands
for the augmented TWFE estimator; ra stands for the regression adjustment estimator;
ipw mle stands for the proposed IPW estimator with MLE moment condition for the
propensity scores; ipw cbps stands for the proposed IPW estimator with CBPS moment
condition for the propensity scores; dr mle stands for the proposed doubly robust
estimator with MLE moment condition for the propensity scores; dr cbps stands for the
proposed doubly robust estimator with CBPS moment condition for the propensity scores.
2 All estimators ending with 1 or 0 correspond to the estimator for the direct treatment
effect at exposure levels one or zero, respectively.
3 τ (1) = τ (0) = 1 in designs 1-5; τ (1) = 1, τ (0) = 0 in design 6 with the overall direct
effect being approximately 0.607.

According to the population generating process, the direct effects are τ (1) =
τ (0) = 1 in designs 1-5 and τ (1) = 1, τ (0) = 0 in design 6. In the last design, the
overall direct effect is approximately 0.607. The point estimates for the direct effect
are summarized in Table 2. In designs 1 and 2, neighborhood similarity does not
drive treatment assignment. As a result, the canonical Abadie’s IPW estimator with
covariate z closely estimates the overall direct effect. The canonical TWFE only
performs well in design 1 as the estimating equation of TWFE rules out z-specific
time trends, which is violated in all other designs. The augmented TWFE estimators
suffer from the same linearity restriction in their estimating equation as the regular
TWFE. With the inclusion of both z and zu , Abadie’s IPW estimates are very close
to the overall direct effect.
The proposed estimators accounting for the spillover effects all perform relatively
well. Due to the specific exposure mapping functional form, the overlap condition
holds better for exposure level one than zero. Consequently, the point estimates
for the direct effect estimator at exposure level one are slightly more accurate than
the results for the estimator at exposure level zero. It is worth mentioning that the
propensity score model for G is always misspecified. The outcome regressions are
also misspecified in designs 4-6. Nevertheless, the estimates from the proposed IPW
and doubly robust estimators are all quite close to the truth and much more accurate
than the TWFE type estimators.
The doubly robust estimators improve upon regression adjustment and IPW
alone, especially at exposure level zero. The only exception is design 5. Since the
outcome regression is more severely misspecified than in other designs, we do not
see improvement moving from IPW to AIPW. Nevertheless, the AIPW estimates
are still better than the regression adjustment estimates. Estimators with CBPS
moment condition slightly improve upon estimators with MLE moment condition.
When the overlap condition is weaker in other population generating processes,
for instance, when the assignment probability is changed to
p(z ∗ ) = exp(z + 2zu )/(1 + exp(z + 2zu )), we can
see more noticeable improvement from using the CBPS moment condition instead
of the MLE moment condition. Moreover, the doubly robust estimator can perform
substantially better than the proposed IPW estimator at exposure level zero.

6 Empirical Illustration
I evaluate the effects of China’s special economic zones (SEZ) policy using the pro-
posed doubly robust estimators. SEZs are a prominent development strategy that
aims to foster agglomeration economies. The benefits of SEZs include corporate tax
concessions, customs duty exemptions, discounts on land use fees, and special bank
loan programs. SEZs are likely to affect neighboring non-SEZ areas through, for
instance, firm relocation or knowledge spillover.
The data for the empirical illustration come from Lu et al. (2019). There are
five waves of SEZ establishment in China. Each wave is different in nature and
targets different regions with earlier waves creating more national-level economic
zones.11 Since detailed village level data is only available starting from 2004, Lu et al.
(2019) focus on the latest wave of SEZs established between 2005 and 2008. China
established 663 SEZs at the provincial level in 2006, accounting for 42 percent of
the country’s SEZs. These SEZs cover the coastal, central, and western regions and
are considered small-scale regional SEZs. As a result, the policy effect is interpreted
as the treatment effect on villages that had not yet been treated prior to this wave.
This means that areas covered by zones from earlier waves are not included in the
finite population.
Lu et al. (2019) collect comprehensive data on China’s economic zones based on
the economic censuses conducted by China’s National Bureau of Statistics in 2004
and 2008 covering all manufacturing firms. Consequently, the entire finite population
of village-level data in 2004 and 2008 is observed, where 2004 is the period prior
to the treatment and 2008 is the period after the treatment. The units of observation
are villages, which are the most disaggregated geographical units and smaller than an SEZ.
Treated villages are referred to as SEZ villages. Unfortunately, the publicly available
data from Lu et al. (2019) contain neither village identifiers nor distances
among villages. Nevertheless, I can match counties in which villages are located from
separate datasets published by Lu et al. (2019). In the organized dataset, there are
3,963 SEZ villages and 99,259 non-SEZ villages. It would be ideal to set certain
neighborhood boundaries for each village based on the distance between villages.
The (potentially misspecified) exposure mapping is then a function of neighboring
villages’ SEZ assignment status. In the absence of detailed geographical data, I
consider each village’s neighborhood to be its corresponding county, with its neighbors
being the other villages within the same county. Without distance measures,
standard errors are clustered at the county level, which can be considered a special case
of spatial-correlation robust inference.

11 As a result, SEZ establishment in China cannot simply be considered as staggered adoption.

In equation (12), the outcome variables Yit include the logarithm of
capital, employment, and output of firms in a village. The direct treatment variable
Wi is equal to one if village i is located within the boundaries of SEZs and zero
otherwise. As an illustration, exposure mapping is defined in two ways. In the first
specification, Gi is equal to one if at least one of the other villages in the corresponding
county is an SEZ village; Gi is equal to zero if there is at least one other village in the
corresponding county but none of the other villages in the county are treated. As
one can see, the finite population only includes villages with eligible neighbors. In
the second specification, I define the SEZ ratio as the fraction of other SEZ villages over
the total number of other villages in a county. Gi is again a binary variable, equal
to one if this ratio in county c in which village i is located is above the mean of
the ratio among all counties.12 In the latter specification, villages are considered as
intensively exposed to neighbors’ economic zones if the corresponding county contains
a relatively high fraction of SEZ villages.13
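
The two exposure definitions can be computed from county membership and treatment status alone. The sketch below is an illustration under assumptions: the function name `county_exposures` and the toy data are hypothetical, villages without any other village in their county are marked `None`, and the cutoff for the second specification is left as a user-supplied `threshold` because the exact construction of the mean ratio is not reproduced here.

```python
from collections import defaultdict


def county_exposures(county, w):
    """Return (g_any, sez_ratio) per village, both computed leaving the village out."""
    treated = defaultdict(int)
    size = defaultdict(int)
    for c, wi in zip(county, w):
        treated[c] += wi
        size[c] += 1
    g_any, sez_ratio = [], []
    for c, wi in zip(county, w):
        others = size[c] - 1
        if others == 0:                    # no eligible neighbor in the county
            g_any.append(None)
            sez_ratio.append(None)
            continue
        others_treated = treated[c] - wi
        g_any.append(1 if others_treated > 0 else 0)
        sez_ratio.append(others_treated / others)
    return g_any, sez_ratio


# Hypothetical villages in three counties.
county = ["a", "a", "a", "b", "b", "c"]
w = [1, 0, 0, 0, 0, 1]
g_any, sez_ratio = county_exposures(county, w)

threshold = 0.25   # hypothetical cutoff standing in for the mean ratio
g_ratio = [None if r is None else int(r > threshold) for r in sez_ratio]
print(g_any, sez_ratio, g_ratio)
```

The treated village in county a is unexposed under the first definition because none of the other villages in its county is treated, while its two untreated neighbors are exposed.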
There are four baseline village characteristics including logs of a village’s distance
from an airport and port, log of the capital-to-labor ratio, and log of the number
of firms in the village in 2004. These baseline characteristics and their interactions
with the direct treatment variable are included as regressors in moment conditions
(19) and (20). The four baseline characteristics and their leave-one-out means at the
county level are covariates in moment conditions (17) and (18) for the propensity
scores. The spillover effects are estimated analogously using equations (A.3) and
(A.4) in the online appendix.
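As a small illustration of the covariate construction, the leave-one-out county mean of a baseline characteristic can be computed as follows. The helper name `leave_one_out_mean` and the toy inputs are hypothetical; the sketch only shows the arithmetic, not the paper's data pipeline.

```python
import numpy as np

def leave_one_out_mean(x, group):
    """Mean of x over the other members of each unit's group.

    Units that are alone in their group get NaN, since they have no
    neighbors to average over.
    """
    x = np.asarray(x, dtype=float)
    _, idx = np.unique(np.asarray(group), return_inverse=True)
    n = np.bincount(idx)[idx]                    # group size for each unit
    total = np.bincount(idx, weights=x)[idx]     # group total for each unit
    out = np.full(x.shape, np.nan)
    has_neighbors = n > 1
    out[has_neighbors] = (total[has_neighbors] - x[has_neighbors]) / (n[has_neighbors] - 1)
    return out

# Toy example: a characteristic for five villages in two counties.
loo = leave_one_out_mean([1.0, 2.0, 3.0, 10.0, 20.0], [0, 0, 0, 1, 1])
```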
12 Counties on average contain 148.3 villages. On average, there are 5.8 SEZ villages in a county.
13 These spillover functions are not necessarily supposed to be correctly specified.
Table 3: Direct and Spillover Effects of Special Economic Zones

                                log capital   log employment   log output
Canonical DID                      0.689          0.395           0.629
                                  (0.049)        (0.038)         (0.051)
Panel A: exposure G = 1 if at least one neighbor SEZ village in the same county
Direct effect (1)                  0.594          0.345           0.492
                                  (0.053)        (0.039)         (0.052)
Direct effect (0)                  0.634          0.400           0.567
                                  (0.058)        (0.048)         (0.062)
Spillover effect (treated)        -0.004         -0.008           0.002
                                  (0.056)        (0.047)         (0.062)
Spillover effect (untreated)       0.036          0.046           0.078
                                  (0.033)        (0.027)         (0.041)
Panel B: exposure G = 1 if county SEZ ratio above national average
Direct effect (1)                  0.438          0.219           0.302
                                  (0.070)        (0.057)         (0.072)
Direct effect (0)                  0.679          0.441           0.645
                                  (0.060)        (0.045)         (0.059)
Spillover effect (treated)        -0.138         -0.096          -0.174
                                  (0.089)        (0.073)         (0.091)
Spillover effect (untreated)       0.103          0.129           0.169
                                  (0.033)        (0.030)         (0.042)

1 The standard errors are clustered at the county level.
2 Canonical DID is estimated using inverse probability weighted DID in Abadie (2005).
3 Panel A reports expected direct and spillover effects when the exposure mapping is defined as
a binary variable equal to one if there is at least one neighbor SEZ village in the corresponding
county and zero otherwise.
4 Panel B reports expected direct and spillover effects when the exposure mapping is defined as
a binary variable equal to one if the ratio of SEZ villages in a county is above the mean ratio
across all counties and zero otherwise.
5 Direct effect (1) and direct effect (0) are τ (1) and τ (0) respectively; spillover effect (treated)
and spillover effect (untreated) are τ (1, 1, 0) and τ (0, 1, 0) as defined in the online appendix
respectively.

The first row of Table 3 reports DID estimates using the IPW approach
in Abadie (2005) with the four baseline village characteristics as covariates. Be-
cause of potential spillover effects, these canonical estimates are difficult to interpret
causally. The direct effects reported in Panels A and B are mostly smaller than the
canonical DID estimates, especially for SEZ villages with exposure level one. SEZ
establishment has positive and statistically significant direct effects at the 1% level.
This implies that the SEZ villages benefit from the program by gaining investment,
employing more labor, and producing more output.
Moving from Panel A to B, by increasing the intensity of spillovers from having
more neighboring SEZ villages, the direct effects of SEZ establishment, τ (1), decrease
quite a bit for all three outcomes. On the other hand, with relatively low spillover
intensity, τ (0) increases moderately. In terms of spillover effects, having neighboring
SEZ villages does not significantly affect economic activities in SEZ villages. The only
exception is for log output, where the spillover effect under more intensive exposure is
quite negative and marginally statistically significant. By contrast, Panel B reports
that non-SEZ villages benefit from SEZ neighbors, especially if there is a sufficient
number of SEZ villages in the same county. These patterns of direct and spillover
effects are not found in Lu et al. (2019).
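The significance statements about the Panel B spillover effects on SEZ villages can be checked directly from the point estimates and clustered standard errors in Table 3. The sketch below assumes a standard normal approximation and two-sided tests, which is an assumption about the testing convention rather than something stated in the text; the helper name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(estimate, se):
    """z-statistic and two-sided p-value under a standard normal approximation."""
    z = estimate / se
    return z, math.erfc(abs(z) / math.sqrt(2.0))

# Panel B, spillover effect (treated): (estimate, clustered s.e.) from Table 3.
panel_b_treated = {
    "log capital": (-0.138, 0.089),
    "log employment": (-0.096, 0.073),
    "log output": (-0.174, 0.091),
}
results = {name: two_sided_p(est, se) for name, (est, se) in panel_b_treated.items()}

for name, (z, p) in results.items():
    print(f"{name:15s} z = {z:6.3f}  p = {p:.3f}")
```

Under these assumptions only the log output estimate has a p-value between 0.05 and 0.10, in line with the description of it as marginally significant.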
As suggested in Section 3.3, I also examined pre-trends under both specifications
of exposure mapping as well as classical pre-trends for canonical DID estimation.
Unfortunately, there is only one economic census period prior to treatment. As
a result, the pre-trends are tested with China’s Annual Surveys of Industrial Firms
(ASIF) used by Lu et al. (2019), which contain more data in the pre-treatment period
but only cover firms with relatively large sizes. I use ASIF data from 2004 and 2005.
Given this is not the same dataset used for the main analysis, the results presented in
Table 4 are only demonstrative. For canonical DID, only the differential pre-trend for
log employment is statistically significant at the 10% level, and its magnitude is small.
None of the doubly robust estimates for placebo direct effects are significant.

Table 4: Direct Effects of Special Economic Zones: Pre-trends

                                log capital   log employment   log output
Canonical DID                      0.000          0.051           0.039
                                  (0.039)        (0.029)         (0.045)
Panel A: exposure G = 1 if at least one neighbor SEZ village in the same county
Direct effect (1)                 -0.041          0.024          -0.002
                                  (0.050)        (0.046)         (0.065)
Direct effect (0)                 -0.017          0.048          -0.009
                                  (0.052)        (0.043)         (0.056)
Panel B: exposure G = 1 if county SEZ ratio above national average
Direct effect (1)                 -0.042          0.022           0.102
                                  (0.080)        (0.083)         (0.105)
Direct effect (0)                 -0.010          0.055          -0.015
                                  (0.059)        (0.048)         (0.070)

1 ASIF data from 2004 and 2005 are used in the pre-trends test.
2 The standard errors are clustered at the county level.
3 Canonical DID is estimated using inverse probability weighted DID in Abadie (2005).
4 Panel A reports expected direct and spillover effects when the exposure mapping is defined as
a binary variable equal to one if there is at least one neighbor SEZ village in the corresponding
county and zero otherwise.
5 Panel B reports expected direct and spillover effects when the exposure mapping is defined as
a binary variable equal to one if the ratio of SEZ villages in a county is above the mean ratio
across all counties and zero otherwise.
6 Direct effect (1) and direct effect (0) are τ (1) and τ (0) respectively.

7 Conclusion
I propose doubly robust estimators for the expected direct treatment effect and
spillover effect in a DID context. The approach in the current paper is general in the
sense that misspecification of the exposure mapping is allowed and interference is not
restricted within a fixed boundary of neighborhoods. Given arbitrary spillover effects,
one needs to account for spatial correlation when conducting inference. With the
entire population observed, the usual spatial-correlation robust variance estimator
could be conservative. The immediate extension of the current framework to multiple
time periods with common treatment timing is summarized in Section E in the online
appendix.
I provide identification results for the direct and spillover effects based on the IPW
estimand, the outcome regression estimand, and the doubly robust estimand. From here,
researchers can approach these estimands using various parametric, semiparametric,
or nonparametric estimation methods. In the current paper, I prove the asymptotic
properties of GMM-type parametric estimators as an illustration of estimation.
Given the inclusion of neighbors’ treatments and attributes in the propensity score
and the conditional mean functions, nonparametric estimation is attractive because it allows
for arbitrary functional forms. This is left as future work.

References
Abadie, A. (2005), Semiparametric difference-in-differences estimators. Review of
Economic Studies 72(1), 1–19.

Abadie, A. and Imbens, G. (2002), Simple and bias-corrected matching estimators
for average treatment effects. Tech. rep., National Bureau of Economic Research.

Abadie, A., Imbens, G.W., and Zheng, F. (2014), Inference for misspecified models
with fixed regressors. Journal of the American Statistical Association 109(508),
1601–1614.

Arkhangelsky, D., Imbens, G.W., Lei, L., and Luo, X. (2021), Double-robust
two-way-fixed-effects regression for panel data. Tech. rep., arXiv preprint
arXiv:2107.13737.

Aronow, P.M. and Samii, C. (2017), Estimating average causal effects under general
interference, with application to a social network experiment. Annals of Applied
Statistics 11(4), 1912–1947.

Athey, S. and Imbens, G.W. (2022), Design-based analysis in difference-in-differences
settings with staggered adoption. Journal of Econometrics 226(1), 62–79.

Balzer, L.B., Petersen, M.L., and van der Laan, M.J. (2015), Targeted estimation
and inference for the sample average treatment effect. Tech. rep., U.C. Berkeley
Division of Biostatistics Working Paper Series.

Basse, G.W. and Airoldi, E.M. (2018), Limitations of design-based causal inference
and A/B testing under arbitrary and network interference. Sociological Methodology
48(1), 136–151.

Butts, K. (2021), Difference-in-differences estimation with spatial spillovers. Tech.
rep., arXiv preprint arXiv:2105.03737.

Clarke, D. (2017), Estimating difference-in-differences in the presence of spillovers.
Tech. rep., MPRA Paper No. 81604.

Delgado, M.S. and Florax, R.J. (2015), Difference-in-differences techniques for spatial
data: Local autocorrelation and spatial interaction. Economics Letters 137, 123–
126.

Forastiere, L., Airoldi, E.M., and Mealli, F. (2021), Identification and estimation of
treatment and interference effects in observational studies on networks. Journal of
the American Statistical Association 116(534), 901–918.

Gallant, A.R. and White, H. (1988), A unified theory of estimation and inference for
nonlinear dynamic models. Blackwell.

Ghanem, D., Sant’Anna, P.H., and Wüthrich, K. (2022), Selection and parallel
trends. Tech. rep., arXiv preprint arXiv:2203.09001.

Graham, B.S., de Xavier Pinto, C.C., and Egel, D. (2012), Inverse probability tilt-
ing for moment condition models with missing data. Review of Economic Studies
79(3), 1053–1079.

Huber, M. and Steinmayr, A. (2021), A framework for separating individual-level
treatment effects from spillover effects. Journal of Business & Economic Statistics
39(2), 422–436.

Hudgens, M.G. and Halloran, M.E. (2008), Toward causal inference with interference.
Journal of the American Statistical Association 103(482), 832–842.

Imai, K. and Ratkovic, M. (2014), Covariate balancing propensity score. Journal of
the Royal Statistical Society: Series B: Statistical Methodology pp. 243–263.

Imbens, G.W. (2004), Nonparametric estimation of average treatment effects under
exogeneity: A review. Review of Economics and Statistics 86(1), 4–29.

Jardim, E.S., Long, M.C., Plotnick, R., van Inwegen, E., Vigdor, J.L., and Wething,
H. (2022), Boundary discontinuity methods and policy spillovers. Tech. rep., Na-
tional Bureau of Economic Research.

Jenish, N. and Prucha, I.R. (2009), Central limit theorems and uniform laws of large
numbers for arrays of random fields. Journal of Econometrics 150(1), 86–98.

Jenish, N. and Prucha, I.R. (2012), On spatial processes and asymptotic inference
under near-epoch dependence. Journal of Econometrics 170(1), 178–190.

Jin, Y. and Rothenhäusler, D. (2024), Tailored inference for finite populations: con-
ditional validity and transfer across distributions. Biometrika 111(1), 215–233.

Kojevnikov, D., Marmer, V., and Song, K. (2021), Limit theorems for network de-
pendent random variables. Journal of Econometrics 222(2), 882–908.

Leung, M.P. (2022), Causal inference under approximate neighborhood interference.
Econometrica 90(1), 267–293.

Lu, Y., Wang, J., and Zhu, L. (2019), Place-based policies, creation, and agglomeration
economies: Evidence from China’s economic zone program. American Economic
Journal: Economic Policy 11(3), 325–360.

Ma, X. and Wang, J. (2020), Robust inference using inverse probability weighting.
Journal of the American Statistical Association 115(532), 1851–1860.

Man, Y., Sant’Anna, P.H., Sasaki, Y., and Ura, T. (2023), Doubly robust estimators
with weak overlap. Tech. rep., arXiv preprint arXiv:2304.08974.

Manski, C.F. (2013), Identification of treatment response with social interactions.
The Econometrics Journal 16(1), S1–S23.

Newey, W.K. (1991), Uniform convergence in probability and stochastic equicontinuity.
Econometrica 59, 1161–1167.

Newey, W.K. and McFadden, D. (1994), Large sample estimation and hypothesis
testing. Handbook of Econometrics 4, 2111–2245.

Rambachan, A. and Roth, J. (2022), Design-based uncertainty for quasi-experiments.
Tech. rep., arXiv preprint arXiv:2008.00602.

Rambachan, A. and Roth, J. (2023), A more credible approach to parallel trends.
Review of Economic Studies 90(5), 2555–2591.

Roth, J. (2022), Pretest with caution: Event-study estimates after testing for parallel
trends. American Economic Review: Insights 4(3), 305–22.

Roth, J. and Sant’Anna, P.H. (2023), When is parallel trends sensitive to functional
form? Econometrica 91(2), 737–747.

Sant’Anna, P.H. and Zhao, J. (2020), Doubly robust difference-in-differences estimators.
Journal of Econometrics 219(1), 101–122.

Sävje, F. (2024), Causal inference with misspecified exposure mappings: separating
definitions and assumptions. Biometrika 111(1), 1–15.

Sävje, F., Aronow, P., and Hudgens, M. (2021), Average treatment effects in the
presence of unknown interference. Annals of Statistics 49(2), 673.

Viviano, D. (2024), Policy targeting under network interference. Review of Economic
Studies.

Xu, R. and Wooldridge, J.M. (2022), A design-based approach to spatial correlation.
Tech. rep., arXiv preprint arXiv:2211.14354.

A Proofs
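Definition 2. The random function $g(X_i, \theta)$ is said to be Lipschitz in parameter $\theta$ on
$\Theta$ if there is $h(u) \downarrow 0$ as $u \downarrow 0$ and $b(\cdot) : W \to \mathbb{R}$ such that
$\sup_{M, i \in D_M} E\big[|b(X_i)|\big] < \infty$, and for all $\tilde{\theta}, \theta \in \Theta$,
$|g(X_i, \tilde{\theta}) - g(X_i, \theta)| \le b(X_i)\, h(\|\tilde{\theta} - \theta\|)$, $i \in D_M$, $M \ge 1$.

Assumption A.1. (i) $\hat{\Psi} - \Psi_M \overset{p}{\to} 0$, where $\Psi_M$ is positive semidefinite; (ii) $\Theta$ is
compact; (iii) let $Q_M(\theta) = E_D\big[q(X_i, \theta)\big]' \Psi\, E_D\big[q(X_i, \theta)\big]$. $\{Q_M(\theta)\}$ has identifiably
unique minimizers $\{\theta_M^*\}$ on $\Theta$ as in Definition 3.2 in Gallant and White (1988);
(iv) $q(X_i, \theta)$ is continuously differentiable on $\mathrm{int}(\Theta)$, $\forall\, i, M$; (v) $q(X_i, \theta)$ is Lipschitz
in $\theta$ on $\Theta$; (vi) $\sup_{M, i \in D_M} E\big[\sup_{\theta \in \Theta} \|q(X_i, \theta)\|^p \mid z\big] < \infty$ for some $p > 4$; (vii)
$\theta_M^* \in \mathrm{int}(\Theta)$ uniformly in $M$, and $E_D\big[q(X_i, \theta_M^*)\big] = 0$; (viii) $\inf_M \lambda_{\min}(\Omega_M) > 0$,
where $\lambda_{\min}(\cdot)$ is the smallest eigenvalue; (ix) $\nabla_\theta q(X_i, \theta)$ is Lipschitz in $\theta$ on $\Theta$; (x)
$\sup_{M, i \in D_M} E\big[\sup_{\theta \in \Theta} \|\nabla_\theta q(X_i, \theta)\|^2 \mid z\big] < \infty$; (xi) $R_M^{*\prime} \Psi_M R_M^*$ is nonsingular; (xii)
let $l_i = l(X_i, \theta)$ be a generic function standing for each element of either $q(X_i, \theta)$
or $\nabla_\theta q(X_i, \theta)$. $\forall\, \theta \in \Theta$, $l(X_i, \theta)$ is Lipschitz in $X_i$ on the domain of $X_i$ such that
$\sup_{M, i \in D_M} \mathrm{Lip}(l_i) < \infty$.

Notice that a necessary condition for Assumption A.1(xii) is $\sup_{M, i \in D_M} |Y_{it}| \le C < \infty$
and $\sup_{M, i \in D_M} \|z_i\| \le C < \infty$, which can often imply Assumption A.1(vi)
and (x).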

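Proof of Lemma 4.1:
Denote $l_i^{(r)} = l\big(X_i^{(r)}, \theta\big) = l\big(y_{it}(W^{(i,r,0)}), G(i, W_{-i}^{(i,r,0)}), W_i, z_i, \theta\big)$. Let $f \in L_{\nu,h}$
and $f' \in L_{\nu,h'}$. Let $s > 0$ and $(H, H') \in P_M(h, h'; s)$. Define $\xi = f(l_H)$, $\zeta = f'(l_{H'})$,
$\xi^{(s)} = f\big(l_i^{(s)} : i \in H\big)$, and $\zeta^{(s)} = f'\big(l_i^{(s)} : i \in H'\big)$.
First, for $s \le 3 \max\{K, \rho_0\}$, we have

\[
\big|\mathrm{Cov}(\xi, \zeta \mid z)\big| \le 2 \|f\|_\infty \|f'\|_\infty \le C_1 < \infty \tag{A.1}
\]

Next, consider $s > 3 \max\{K, \rho_0\}$.

\[
\begin{aligned}
\big|\mathrm{Cov}(\xi, \zeta \mid z)\big|
&= \big|\mathrm{Cov}\big(\xi - \xi^{(s/3)} + \xi^{(s/3)}, \zeta \mid z\big)\big| \\
&\le \big|\mathrm{Cov}\big(\xi - \xi^{(s/3)}, \zeta \mid z\big)\big| + \big|\mathrm{Cov}\big(\xi^{(s/3)}, \zeta - \zeta^{(s/3)} \mid z\big)\big| + \big|\mathrm{Cov}\big(\xi^{(s/3)}, \zeta^{(s/3)} \mid z\big)\big| \\
&\le 2 \|f'\|_\infty E\big[\big|\xi - \xi^{(s/3)}\big| \,\big|\, z\big] + 2 \|f\|_\infty E\big[\big|\zeta - \zeta^{(s/3)}\big| \,\big|\, z\big] + \big|\mathrm{Cov}\big(\xi^{(s/3)}, \zeta^{(s/3)} \mid z\big)\big|
\end{aligned}
\tag{A.2}
\]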
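For the first two terms in equation (A.2),

\[
\begin{aligned}
&\|f'\|_\infty E\big[\big|\xi - \xi^{(s/3)}\big| \,\big|\, z\big] + \|f\|_\infty E\big[\big|\zeta - \zeta^{(s/3)}\big| \,\big|\, z\big] \\
&\quad \le h \|f'\|_\infty \mathrm{Lip}(f) \sup_{M, i \in D_M} E\big[\big|l_i - l_i^{(s/3)}\big| \,\big|\, z\big] + h' \|f\|_\infty \mathrm{Lip}(f') \sup_{M, i \in D_M} E\big[\big|l_i - l_i^{(s/3)}\big| \,\big|\, z\big] \\
&\quad \le \big( h \|f'\|_\infty \mathrm{Lip}(f) + h' \|f\|_\infty \mathrm{Lip}(f') \big) \sup_{M, i \in D_M} \mathrm{Lip}(l_i) \sup_{M, i \in D_M} E\big[\big\|X_i - X_i^{(s/3)}\big\| \,\big|\, z\big]
\end{aligned}
\tag{A.3}
\]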

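Since $s/3 \ge K$,

\[
\big(Y_{i1}, y_{i2}(W^{(i,s/3,0)}), G(i, W_{-i}^{(i,s/3,0)}), W_i, z_i\big) = \big(Y_{i1}, y_{i2}(W^{(i,s/3,0)}), G(i, W_{-i}), W_i, z_i\big).
\]

As a result,

\[
E\big[\big\|X_i - X_i^{(s/3)}\big\| \,\big|\, z\big] = E\big[\big|y_{i2}(W) - y_{i2}(W^{(i,s/3,0)})\big| \,\big|\, z\big] \le \kappa_M(s/3). \tag{A.4}
\]

For any fixed $s$, $l_i^{(s/3)}$ is $\alpha$-mixing under Assumption 6. By Proposition 2.2 in
Kojevnikov et al. (2021), the last term in equation (A.2) is bounded by

\[
C_2\, \alpha^{l^{(s/3)}}(h, h', s) \le C_2\, \alpha^{\epsilon}\Big( h C_3 \Big(\frac{s}{3}\Big)^d, h' C_3 \Big(\frac{s}{3}\Big)^d, \frac{s}{3} \Big). \tag{A.5}
\]

Putting these together, equation (A.2) is bounded by

\[
C \big( \kappa_M(s/3) + s^d\, \hat{\alpha}^{\epsilon}(s/3) \big). \tag{A.6}
\]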

Proof of Theorem 4.2:

I prove the theorem by verifying Theorem 2.1 and Theorem 3.2 in Newey and McFadden
(1994). I first show $\hat{\theta} - \theta_M^* \overset{p}{\to} 0$.
Under Assumption A.1(vi) and Assumption 7,

\[
\frac{1}{|D_M|} \sum_{i \in D_M} \Big( q(X_i, \theta) - E_D\big[q(X_i, \theta)\big] \Big) \overset{p}{\to} 0 \tag{A.7}
\]

follows from Lemma 4.1 and Theorem 3.1 in Kojevnikov et al. (2021). Next,

\[
\sup_{\theta \in \Theta} \Big\| \frac{1}{|D_M|} \sum_{i \in D_M} \Big( q(X_i, \theta) - E_D\big[q(X_i, \theta)\big] \Big) \Big\| \overset{p}{\to} 0 \tag{A.8}
\]

follows from Corollary 3.1 in Newey (1991) and equation (A.7) under condition (v).
Also, $E_D\big[q(X_i, \theta)\big]$ is uniformly equicontinuous. Let

\[
\hat{Q}(\theta) = \frac{1}{|D_M|} \sum_{i \in D_M} q(X_i, \theta)'\, \hat{\Psi}\, \frac{1}{|D_M|} \sum_{i \in D_M} q(X_i, \theta).
\]

Finally, I need to show

\[
\sup_{\theta \in \Theta} \big| \hat{Q}(\theta) - Q_M(\theta) \big| \overset{p}{\to} 0 \tag{A.9}
\]

and $Q_M(\theta)$ is uniformly equicontinuous. The proof of equation (A.9) and the equicontinuity
is standard. One can follow, for instance, the proof of Theorem 3 in Jenish and Prucha
(2012).

Next, I prove the asymptotic normality. The key steps are to prove

\[
\Omega_M^{-1/2} \frac{1}{\sqrt{|D_M|}} \sum_{i \in D_M} q(X_i, \theta_M^*) \overset{d}{\to} N(0, I_k) \tag{A.10}
\]

and

\[
\sup_{\theta \in \Theta} \Big\| \frac{1}{|D_M|} \sum_{i \in D_M} \Big( \nabla_\theta q(X_i, \theta) - E_D\big[\nabla_\theta q(X_i, \theta)\big] \Big) \Big\| \overset{p}{\to} 0. \tag{A.11}
\]

Equation (A.10) is implied by Theorem 3.2 in Kojevnikov et al. (2021), Lemma
4.1, and the Cramer-Wold device under Assumption A.1(vi) and (viii) and Assumption 8.
By an argument analogous to the proof of consistency, equation (A.11)
holds under Assumption A.1(ix) and (x).

Online Appendix for “Difference-in-Differences with Interference”
Ruonan Xu∗

A Spillover Effect
In addition to the expected direct average treatment effect on the treated (EDATT),
empirical researchers might also be interested in spillover effects defined in equations
(A.1) and (A.2).

\[
\tau(1, g, g') = \frac{1}{|D_M|} \sum_{i \in D_M} \Big( E\big[y_{i2}(1, W_{-i}) \mid W_i = 1, G_i = g, z_i\big] - E\big[y_{i2}(1, W'_{-i}) \mid W_i = 1, G_i = g', z_i\big] \Big) \tag{A.1}
\]

\[
\tau(0, g, g') = \frac{1}{|D_M|} \sum_{i \in D_M} \Big( E\big[y_{i2}(0, W_{-i}) \mid W_i = 0, G_i = g, z_i\big] - E\big[y_{i2}(0, W'_{-i}) \mid W_i = 0, G_i = g', z_i\big] \Big) \tag{A.2}
\]

The spillover effect contrasts the expected potential outcomes between levels g and
g ′ and could differ with or without direct treatment. A leading case would be setting
g ′ to 0. The identification of the spillover effect is more straightforward because po-
tential outcomes under direct assignment and the specified exposures are observable.
Nevertheless, I impose a further condition to facilitate causal interpretation of the
spillover effects.

Condition 1. $\forall\, i \in D_M$, $y_{i2}(w_i, w_{-i}) \perp\!\!\!\perp W_{-i} \mid W_i, z_i$.

Since I average over all population units in $D_M$, there is no compositional change of
the subpopulation at exposure levels $g$ and $g'$. If it is not feasible for every population
unit to receive exposure $g$ and $g'$, one can instead average across the subpopulation composed
of units eligible for both exposure levels. On top of that, given a single
unit i ∈ DM , Condition 1 rules out heterogeneity bias across different exposure
levels. Suppose the potential outcome is composed of a fixed outcome plus some
measurement error; namely, yi2 (wi , w−i ) = hi2 (wi , w−i ) + ei2 . A sufficient condition
for Condition 1 would be that given unit i’s own treatment status and neighborhood
attributes, its measurement error does not depend on neighbors’ treatment statuses.
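Under Condition 1,

\[
\begin{aligned}
\tau(1, g, g')
&= \frac{1}{|D_M|} \sum_{i \in D_M} \Big( E\big[y_{i2}(1, W_{-i}) \mid W_i = 1, G_i = g, z_i\big] - E\big[y_{i2}(1, W'_{-i}) \mid W_i = 1, G_i = g', z_i\big] \Big) \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} \Big( \sum_{w_{-i} \in \Omega} E\big[y_{i2}(1, w_{-i}) \mid W_i = 1, W_{-i} = w_{-i}, z_i\big]\, P(W_{-i} = w_{-i} \mid G_i = g, W_i = 1, z_i) \\
&\qquad\qquad - \sum_{w'_{-i} \in \Omega} E\big[y_{i2}(1, w'_{-i}) \mid W_i = 1, W'_{-i} = w'_{-i}, z_i\big]\, P(W'_{-i} = w'_{-i} \mid G_i = g', W_i = 1, z_i) \Big) \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} \Big( \sum_{w_{-i} \in \Omega} E\big[y_{i2}(1, w_{-i}) \mid W_i = 1, z_i\big]\, P(W_{-i} = w_{-i} \mid G_i = g, W_i = 1, z_i) \\
&\qquad\qquad - \sum_{w'_{-i} \in \Omega} E\big[y_{i2}(1, w'_{-i}) \mid W_i = 1, z_i\big]\, P(W'_{-i} = w'_{-i} \mid G_i = g', W_i = 1, z_i) \Big),
\end{aligned}
\]

where $\Omega = \{0, 1\}^{|D_M| - 1}$. As a result, the spillover effect contrasts the expected potential
outcome with direct treatment but weighted by different conditional probabilities
of neighbors’ treatment realization at either exposure $g$ or $g'$.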

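Analogously, the doubly robust estimands for the spillover effects are

\[
\begin{aligned}
\tau(1, g, g') = E_D\bigg[ &\frac{W_i 1\{G_i = g\}}{\eta(z_i)\, \eta_{1g}(z_i)} \big( Y_{i2} - m_{2,1g}(z_i) \big) + m_{2,1g}(z_i) \\
&- \frac{W_i 1\{G_i = g'\}}{\eta(z_i)\, \eta_{1g'}(z_i)} \big( Y_{i2} - m_{2,1g'}(z_i) \big) - m_{2,1g'}(z_i) \bigg]
\end{aligned}
\tag{A.3}
\]

and

\[
\begin{aligned}
\tau(0, g, g') = E_D\bigg[ &\frac{(1 - W_i) 1\{G_i = g\}}{(1 - \eta(z_i))\, \eta_{0g}(z_i)} \big( Y_{i2} - m_{2,0g}(z_i) \big) + m_{2,0g}(z_i) \\
&- \frac{(1 - W_i) 1\{G_i = g'\}}{(1 - \eta(z_i))\, \eta_{0g'}(z_i)} \big( Y_{i2} - m_{2,0g'}(z_i) \big) - m_{2,0g'}(z_i) \bigg].
\end{aligned}
\tag{A.4}
\]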

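When the exposure mapping is correctly specified, the spillover effect reduces to

\[
\begin{aligned}
\tau^*(1, g, g')
&= \frac{1}{|D_M|} \sum_{i \in D_M} \Big( E\big[\tilde{y}_{i2}(1, g) \mid W_i = 1, G_i = g, z_i\big] - E\big[\tilde{y}_{i2}(1, g') \mid W_i = 1, G_i = g', z_i\big] \Big) \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} \Big( E\big[\tilde{y}_{i2}(1, g) \mid W_i = 1, z_i\big] - E\big[\tilde{y}_{i2}(1, g') \mid W_i = 1, z_i\big] \Big)
\end{aligned}
\]

under Condition 1. $\tau^*(0, g, g')$ can be decomposed in a similar way.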


The asymptotic distribution of the spillover effect estimators can be established
similarly by setting up a GMM problem. Notice that Condition 1 is not required for
estimation or inference but merely for causal interpretation.
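To make the structure of (A.3) and (A.4) concrete, the following sketch evaluates their sample analogue for given nuisance functions. It is a plug-in illustration under stated assumptions, not the GMM estimator studied in the paper: the function name `dr_spillover`, the simulated design, and the fact that propensity scores and conditional means are supplied rather than estimated are all choices made for the example.

```python
import numpy as np

def dr_spillover(y2, w, g_obs, p_w, p_g, p_g0, m_g, m_g0, w_val=1, g=1, g0=0):
    """Sample analogue of (A.3) when w_val = 1, or of (A.4) when w_val = 0.

    p_w       : P(W_i = w_val | z_i), i.e. eta(z_i) or 1 - eta(z_i)
    p_g, p_g0 : P(G_i = g | W_i = w_val, z_i) and P(G_i = g0 | W_i = w_val, z_i)
    m_g, m_g0 : conditional means of Y_i2 at exposure levels g and g0
    """
    in_w = (w == w_val)
    term_g = in_w * (g_obs == g) / (p_w * p_g) * (y2 - m_g) + m_g
    term_g0 = in_w * (g_obs == g0) / (p_w * p_g0) * (y2 - m_g0) + m_g0
    return float(np.mean(term_g - term_g0))

# Toy population (hypothetical design): the spillover effect on treated units is 0.5.
rng = np.random.default_rng(0)
n = 50_000
z = rng.normal(size=n)
w = rng.binomial(1, 0.5, size=n)
g_obs = rng.binomial(1, 0.4, size=n)
y2 = 1.0 + 2.0 * w + 0.5 * g_obs + z + rng.normal(size=n)

p_w, p_g, p_g0 = np.full(n, 0.5), np.full(n, 0.4), np.full(n, 0.6)
# Both nuisance components correct.
est_both = dr_spillover(y2, w, g_obs, p_w, p_g, p_g0, m_g=3.5 + z, m_g0=3.0 + z)
# Deliberately wrong outcome model, correct propensity scores.
est_wrong_mean = dr_spillover(y2, w, g_obs, p_w, p_g, p_g0,
                              m_g=np.zeros(n), m_g0=np.zeros(n))
```

In this toy design both calls should land near 0.5, which illustrates the doubly robust property: a misspecified outcome model is compensated by correct propensity scores, at the cost of a noisier estimate.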

B Modified Two-Way Fixed Effects


One way to estimate the spillover effect suggested in the existing literature is to
augment the TWFE DID estimating equation with another binary indicator $S_i$
equal to one if a unit is close to a treated unit; see, for instance, attempts in
Di Tella and Schargrodsky (2004) and Butts (2021). Using the notation in the current
paper, I modify the estimating equation to be

\[
Y_{it} = \beta_1 W_{it} + \beta_2 (1 - W_i) S_{it} + \beta_3 W_i S_{it} + \alpha_i + \lambda_t + \epsilon_{it}, \tag{B.1}
\]

where $W_{it} = W_i \cdot 1\{t = 2\}$ and $S_{it} = S_i \cdot 1\{t = 2\}$. $\hat{\beta}_1$ and $\hat{\beta}_1 + \hat{\beta}_3 - \hat{\beta}_2$ estimated
from equation (B.1) would be consistent for the EDATT defined by

\[
\bar{\tau}(0) = \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ y_{i2}(1, 0) - y_{i2}(0, 0) \mid W_i = 1, S_i = 0 \big]
\]

and

\[
\bar{\tau}(1) = \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ y_{i2}(1, W_{-i}) - y_{i2}(0, W_{-i}) \mid W_i = 1, S_i = 1 \big]
\]

respectively, under some parallel trends assumptions.
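With only two periods, the unit effects in (B.1) drop out of the first difference and the period effects collapse into an intercept, so the coefficients can be obtained from a cross-sectional regression of the outcome change on $W_i$, $(1 - W_i) S_i$ and $W_i S_i$. The sketch below illustrates this on noise-free simulated data; the function name `augmented_twfe_two_period` and the data-generating values are hypothetical.

```python
import numpy as np

def augmented_twfe_two_period(y1, y2, w, s):
    """OLS of Y_i2 - Y_i1 on a constant, W_i, (1 - W_i) S_i and W_i S_i.

    For a balanced two-period panel this reproduces beta_1, beta_2, beta_3
    of the augmented TWFE equation (B.1).
    """
    dy = np.asarray(y2, dtype=float) - np.asarray(y1, dtype=float)
    w = np.asarray(w, dtype=float)
    s = np.asarray(s, dtype=float)
    X = np.column_stack([np.ones_like(dy), w, (1 - w) * s, w * s])
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    b1, b2, b3 = coef[1:]
    return {"beta": (b1, b2, b3), "tau_bar_0": b1, "tau_bar_1": b1 + b3 - b2}

# Noise-free toy data with beta_1 = 1.0, beta_2 = 0.2, beta_3 = -0.4.
rng = np.random.default_rng(1)
n = 400
w = rng.integers(0, 2, size=n)
s = rng.integers(0, 2, size=n)
alpha = rng.normal(size=n)                      # unit fixed effects
y1 = alpha
y2 = alpha + 0.3 + 1.0 * w + 0.2 * (1 - w) * s - 0.4 * w * s
fit = augmented_twfe_two_period(y1, y2, w, s)
```

In this design `fit["tau_bar_0"]` recovers 1.0 and `fit["tau_bar_1"]` recovers 1.0 - 0.4 - 0.2 = 0.4, the combination $\hat{\beta}_1 + \hat{\beta}_3 - \hat{\beta}_2$ named in the text.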


We can see that given the estimating equation of the augmented TWFE, the
specified exposure mapping is fixed as 1{As W > 0}, where As is the adjacency
matrix with units being neighbors if their distance is less than or equal to ds . Only
when the interference structure coincides with the indicator function 1{As W > 0}
along with the correct distance cutoff, can the augmented TWFE identify the exact
direct ATT. In contrast, when the true interference structure is not 1{As W > 0},
the proposed estimands in the main text can still identify the exact direct ATT by
choosing the correct specification of the exposure mapping. These proposed estimands
can also identify the EDATT, τ (g), with multiple levels of neighborhood exposure
g allowing for misspecification of the spillover structure. Meanwhile, covariates can
be flexibly accounted for in the proposed estimands by assuming conditional parallel
trends.
Furthermore, the basic augmented TWFE regression linear in covariates,

\[
Y_{it} = \beta_0 + \beta_1 W_{it} + \beta_2 (1 - W_i) S_{it} + \beta_3 W_i S_{it} + \beta_4 W_i + z_i \gamma + \lambda_t + \epsilon_{it},
\]

suffers from the same drawbacks as the usual canonical TWFE regression for DID
estimation, as pointed out by Remark 1 in Sant’Anna and Zhao (2020). These limitations
include implicitly imposing homogeneous treatment effects and homogeneous
time trends.

Remark B.1 Inspired by the modified TWFE estimating equation above, for any
specification of the exposure mapping one can instead augment two-way fixed effects
in a saturated way:

\[
\begin{aligned}
Y_{it} ={}& \beta_0 + \beta_1 W_i + \beta_2 W_{it} + \eta_0 (1 - W_i) G_{2it} + \eta_1 W_i G_{2it} \\
&+ \delta_0 (1 - W_i) G_{3it} + \delta_1 W_i G_{3it} + \cdots + \xi_0 (1 - W_i) G_{|G|it} + \xi_1 W_i G_{|G|it} + z_i \gamma + \lambda_t + \epsilon_{it},
\end{aligned}
\tag{B.2}
\]

where one creates $|G| - 1$ binary exposure-level indicators, $G_{git} = 1\{G_i = g\} \cdot 1\{t = 2\}$.
The EDATT $\tau(g)$ can be consistently estimated by linear combinations of
the coefficient estimators if the linearity in equation (B.2) holds true. The saturated
TWFE, however, suffers from the same homogeneity restrictions and lacks flexibility
in controlling for covariates. Furthermore, it is possible that some $G_{git}$ may not be
well-defined for each unit because $G_i$ cannot take value $g$ for some unit $i$.

C Different Approaches to Dimension Reduction


Manski (2013) and Basse and Airoldi (2018) formally point out that there exist no
consistent treatment effect estimators under arbitrary interference. It is therefore
necessary to make dimension reduction assumptions about the interference structure
in order to identify meaningful treatment effect parameters. There are different ap-
proaches to dimension reduction in the literature; see, for instance, Auerbach and Tabord-Meehan
(2021), Agarwal, Cen, Shah, and Yu (2022), Emmenegger, Spohn, and Bühlmann (2022),
and Qu, Xiong, Liu, and Imbens (2022). In this section, I provide an overview of
some of the leading approaches in the literature and show how recent developments
relate to the general framework in the current paper. Each article referenced
proposes different estimation methods for various causal effect estimands. My focus
here is to compare the different approaches to modeling spillover effects.1
1 It is not supposed to be a comprehensive survey.
C.1 Partial Interference
The most popular approach to dimension reduction of the interference structure
is partial interference restricted within disjoint clusters. Qu et al. (2022) model the
potential outcome function as2

\[
y_{c,i}(w_{c,i}, w_{c,(i),1}, \cdots, w_{c,(i),m}) \equiv y_{c,i}(w_{c,i}, g_{c,1}, \cdots, g_{c,m}), \tag{C.1}
\]

where $c$ is the index of a cluster, $y_{c,i}$ and $w_{c,i}$ are the potential outcome and treatment
assignment of unit $i$ in cluster $c$, and $w_{c,(i),j}$ is the treatment assignment of unit $i$’s
neighbors in the disjoint subset j of cluster c. Units within each of the m disjoint
subsets are exchangeable. As a result, the impact of wc,(i),j can be summarized by gc,j ,
which measures the number of treated neighbors in subset j of cluster c. Compared
with the assumption of fully exchangeable neighbors in cluster c, the partition of m
subsets allows for more heterogeneity of neighbors’ influence. This allows for a more
flexible interference structure.
If (C.1) is correctly specified, one can choose $K$ to be $\max_{c = 1, \ldots, C} \max_{i, j \in c} \rho(i, j)$.
Given bounded cluster sizes, $K$ is finite. For all $s > K$ and any $i$, $y_i(W) -
y_i(W^{(i,s)}) = 0$. Therefore, potential outcomes in the form of (C.1) can be accommo-
dated in the approach I take. A trickier question is how to partition the m subsets
within each cluster c. On top of that, partial interference might be too strong an
assumption. If either the exchangeability or the partial interference assumption does
not hold, the approach in the current paper can still identify the expected exposure
effect as long as the interference from units further away is increasingly negligible.
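As a small illustration of the choice of K under partial interference, the sketch below computes the largest within-cluster distance for units with known coordinates. The Euclidean metric, the function name `max_within_cluster_distance`, and the toy coordinates are assumptions made for the example; the text only requires some distance ρ(i, j).

```python
import numpy as np

def max_within_cluster_distance(coords, cluster):
    """K = max over clusters c of max over i, j in c of rho(i, j).

    rho is taken to be the Euclidean distance between unit coordinates.
    """
    coords = np.asarray(coords, dtype=float)
    cluster = np.asarray(cluster)
    k = 0.0
    for c in np.unique(cluster):
        pts = coords[cluster == c]
        # All pairwise distances within cluster c.
        dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        k = max(k, float(dist.max()))
    return k

coords = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0), (10.0, 11.0)]
cluster = [0, 0, 1, 1]
K = max_within_cluster_distance(coords, cluster)
```

Changing the treatment of any unit farther than K from unit i leaves i's potential outcome unchanged under (C.1), because such a unit must belong to a different cluster.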

2 The potential outcome is defined for a single cross section.

C.2 Immediate Neighbors

A slightly different approach to dimension reduction is to restrict interference within
immediate neighbors. For instance, in Emmenegger et al. (2022), the spillover function
is specified as

\[
\big( f^1(\{W_j\}_{j \in D_M, j \ne i}), \cdots, f^r(\{W_j\}_{j \in D_M, j \ne i}) \big) \tag{C.2}
\]

of fixed dimension $r$. Each such function is specified by empirical researchers and
describes a one-dimensional spillover effect that unit $i$ receives from its neighbors.
In Example 2.1 in Emmenegger et al. (2022), the functions $f^l$ are specified
as the average number of treated neighbors of unit $i$ and the average number of
treated neighbors of neighbors of $i$, respectively, for $r = 2$. In this case, if one defines
neighbors of $i$ as units within distance $\bar{K}$ from $i$, then the approximate neighborhood
interference (ANI) assumption holds for any $s > 2\bar{K}$.
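A minimal sketch of two such spillover features on a given network follows. It assumes a symmetric 0/1 adjacency matrix, reads "average number of treated" as the share of treated units, and defines neighbors of neighbors as units exactly two steps away (excluding the unit itself and its direct neighbors). These are interpretive assumptions, and the function name `spillover_features` is hypothetical.

```python
import numpy as np

def spillover_features(adj, w):
    """Share of treated units among neighbors and among two-step neighbors."""
    adj = np.array(adj, dtype=bool)              # copy, so the input is untouched
    w = np.asarray(w, dtype=float)
    np.fill_diagonal(adj, False)
    # Units reachable in exactly two steps: not the unit itself, not a direct neighbor.
    two_step = (adj.astype(int) @ adj.astype(int)) > 0
    two_step &= ~adj
    np.fill_diagonal(two_step, False)

    def treated_share(mask):
        deg = mask.sum(axis=1)
        # Units with no such neighbors get a share of zero.
        return np.divide(mask @ w, deg, out=np.zeros(len(w)), where=deg > 0)

    return treated_share(adj), treated_share(two_step)

# Path network 0 - 1 - 2 - 3 with units 0 and 3 treated.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
f1, f2 = spillover_features(adj, [1, 0, 0, 1])
```

Because both features depend only on treatments within two steps of a unit, treatments farther than twice the neighbor radius do not enter, which is the reason ANI holds for any s above that cutoff.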
Equations (C.1) and (C.2) have recently been proposed in the literature allowing
for a more flexible interference structure. The purpose of the discussion is to show
that if empirical researchers assume these specifications of the spillover function are
correct, they can be well accommodated in the framework of the current paper.
Even if some dimension reduction assumptions fail, applied researchers are still able
to identify causal estimands as long as ANI is true.

C.3 Local Configuration


A more interesting discussion is the comparison of the local configuration approach
proposed by Auerbach and Tabord-Meehan (2021) and ANI. In a spatial setting,
unit $i$’s local configuration of radius $r$, denoted by $G_i^r$, refers to the units within
distance $r$ of $i$ and their treatments and characteristics.3 Units within a local configuration
remain anonymous, similar to the exchangeability assumption. ANI and
the expected exposure mapping were initially proposed to allow for misspecification
of the spillover function. The local configuration approach instead maintains correct
specification of the spillover function. However, it uses local configurations of various
radii $r$ to approximate the effective treatment according to the spillover function.4
Below, I provide another interpretation of the ANI assumption. Under correct specification
of the spillover function, the ANI approach is not too different from the local
configuration approach.

3 To maintain consistency, I follow the notation in Auerbach and Tabord-Meehan (2021). Therefore,
with some abuse of notation, $G_i$ is different from the exposure mapping defined above.
According to the metric definition in Auerbach and Tabord-Meehan (2021), for
effective treatments $g$ and $\tilde{g}$, if the distance $d(g, \tilde{g}) \le \frac{1}{1 + r}$, then $G_i^r = \tilde{G}_i^r$. Under
Assumption 4.5 therein,

\[
\big| h(g_0) - h(\tilde{g}) \big| \le \phi\big( d(g_0, \tilde{g}) \big), \tag{C.3}
\]

where $\phi(x) \to 0$ as $x \to 0$, $h(g) = E[h(g, U_i)]$, and $Y_i = h(G_i, U_i)$. Therefore, we can
see that (C.3) goes to 0 as $r \to \infty$, which is analogous to the ANI assumption in
Leung (2022).

\[
\sup_M \max_{i \in D_M} E\big[ \big| Y_i(W) - Y_i(W^{(i,r)}) \big| \big] \to 0, \quad \text{as } r \to \infty \tag{C.4}
\]

Examples 2.1 and 2.2 in Auerbach and Tabord-Meehan (2021) are essentially
examples of Sections C.1 and C.2, and hence I focus on their Example 2.3, the
linear-in-means peer effects model. Assuming correct specification,

\[
Y_i = \alpha + \delta \frac{1}{n_i} \sum_{j \in P_i} Y_j + W_i \gamma + e_i,
\]

where $P_i$ is the peer group of unit $i$ with size $n_i$. As usual, $|\delta| < 1$. The reduced
form of the potential outcome is solved to be

\[
Y_i = \lim_{S \to \infty} \sum_{s=1}^{S} h_s(G_i^s, U_i) = h(G_i, U_i)
\]

for some functions $h_s$ and $h$. Hence, for $d(g, \tilde{g}) \le \frac{1}{1 + r}$,

\[
\big| h(g) - h(\tilde{g}) \big| \le C |\beta|^r \quad \text{for some } |\beta| < 1,
\]

which is exactly the ANI coefficient given in Proposition 1 in Leung (2022).5

4 See Manski (2013) for the definition of “effective treatment.” The terms “exposure mapping”
from Aronow and Samii (2017) and “effective treatment” from Manski (2013) are used interchangeably
throughout the text.


Therefore, under a true interference structure, if one chooses a large enough r
neighborhood, the ANI approach can be thought of as using units with the effective
treatment closest to the actual effective treatment g to estimate the policy effect.
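The geometric decay in the linear-in-means model can be checked numerically. With a row-normalized peer matrix A, the reduced form is the Neumann series of (I - δA)^{-1} applied to α + W γ + e, and truncating it after r steps leaves a sup-norm error of at most |δ|^{r+1} / (1 - |δ|) times the largest absolute input. The ring network, the parameter values, and the function name `truncated_reduced_form` below are hypothetical choices for illustration.

```python
import numpy as np

def truncated_reduced_form(A, x, delta, r):
    """Partial sum of delta^s A^s x for s = 0, ..., r (Neumann series)."""
    total, term = x.copy(), x.copy()
    for _ in range(r):
        term = delta * (A @ term)
        total += term
    return total

n, delta = 30, 0.5
# Ring network: each unit's peer group is its two adjacent units (row sums equal one).
A = np.zeros((n, n))
for i in range(n):
    A[i, (i - 1) % n] = 0.5
    A[i, (i + 1) % n] = 0.5

rng = np.random.default_rng(2)
w = rng.integers(0, 2, size=n).astype(float)
x = 1.0 + 2.0 * w + 0.1 * rng.normal(size=n)       # alpha + W_i * gamma + e_i
y_full = np.linalg.solve(np.eye(n) - delta * A, x)  # exact reduced form

errors = [float(np.max(np.abs(y_full - truncated_reduced_form(A, x, delta, r))))
          for r in range(6)]
bounds = [delta ** (r + 1) / (1 - delta) * float(np.max(np.abs(x))) for r in range(6)]
```

Each entry of `errors` stays below the matching entry of `bounds`, so ignoring peers more than r steps away costs at most a geometrically shrinking amount, which is the pattern the ANI coefficient captures.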

D Practical Guide
The following steps summarize the procedure for estimation of the EDATT. Spillover
effects can be estimated in a similar manner.

1. Collect the data {Yit , Wi , zi }i∈DM from the population.

2. Specify the exposure mapping function G(i, ·) as a function of W−i .

3. Set up models for the propensity scores P (Wi = 1|zi ), P (Gi = g|Wi = 1, zi ),
and P (Gi = g|Wi = 0, zi ). Also set up models for the conditional means of the
outcomes in both time periods, E(Yit |Wi = w, Gi = g, zi ).

4. Combine the moment conditions based on models in Step 3 (e.g., first-order
conditions from maximum likelihood estimation or linear regression) and the
identification equation (12) in the main text.

5. Estimate the GMM model given in Step 4 and conduct spatial-correlation ro-
bust inference. The EDATT estimate is the last element of the GMM estimates.
5 I refer readers to Auerbach and Tabord-Meehan (2021) for the introduction to notation and
more detailed derivation.
E Multiple Time Periods with Common Treatment Timing
Extension to multiple time periods is straightforward. With common treatment
timing, the simplest approach is to aggregate the time periods prior to and post
treatment into a single time period, again denoted t = 1, 2. With the aggregated
data, we can directly apply the results in the main text. Alternatively, we might
be interested in the EDATT at different time periods. Denote the time periods by
$\{-T, \ldots, -1, 0, 1, \ldots, T\}$. Without loss of generality, suppose treatment starts at
$t = 2$. For any $t \ge 2$, the EDATT for time period $t$ at exposure level $g$ is defined as

$$\tau_t(g) = \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ y_{it}(1, W_{-i}) - y_{it}(0, W_{-i}) \mid W_i = 1, G_i = g, z_i \big] \tag{E.1}$$

Spillover effects for time period t can be defined analogously.


It is worth discussing different ways to formalize the parallel trends assumption.
One can either pick a single time period before treatment, say t = 1, as the comparison
period, or use the average potential outcomes across all time periods prior to treatment
as the comparison. The latter can improve efficiency since data from more time periods
are used in estimation. On the other hand, if the parallel trends assumption only holds
for the time periods closest to the treatment period, the second approach is less robust.
Hence, there is a tradeoff between robustness and efficiency.
Other than the slight modification of the estimands of interest, the estimation
and asymptotic properties remain the same as long as one contrasts the appropriate
time periods, for instance, using data from any t ≥ 2 and t = 1. This way, one can
estimate the dynamic treatment effects as the treatment duration progresses.
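
The two options discussed in this section, collapsing to two periods and contrasting each post-treatment period with a baseline, can be sketched in Python as follows. This is only a data-preparation illustration under the assumption that outcomes are stored as an N x T array with one column per period; the function names `collapse_pre_post` and `long_differences` are hypothetical. The resulting differences would take the place of the two-period outcome change in the estimators of the main text.

```python
import numpy as np

def collapse_pre_post(y, periods, first_treated):
    """Average an N x T outcome panel into one pre- and one post-treatment column."""
    y = np.asarray(y, dtype=float)
    periods = np.asarray(periods)
    pre = periods < first_treated
    return y[:, pre].mean(axis=1), y[:, ~pre].mean(axis=1)

def long_differences(y, periods, first_treated, baseline=None):
    """Return {t: Y_t - Y_baseline} for every post-treatment period t.

    baseline=None uses the average of all pre-treatment periods;
    otherwise the single pre-treatment period `baseline` is used.
    """
    y = np.asarray(y, dtype=float)
    periods = np.asarray(periods)
    if baseline is None:
        base = y[:, periods < first_treated].mean(axis=1)
    else:
        base = y[:, periods == baseline][:, 0]
    return {int(t): y[:, periods == t][:, 0] - base
            for t in periods[periods >= first_treated]}

# Two units observed in periods 0..3, with treatment starting at t = 2.
y = [[1, 3, 5, 9],
     [0, 2, 2, 4]]
periods = [0, 1, 2, 3]
pre, post = collapse_pre_post(y, periods, first_treated=2)
diffs_single = long_differences(y, periods, first_treated=2, baseline=1)
diffs_average = long_differences(y, periods, first_treated=2)
```

Using `baseline=1` corresponds to the more robust single-period comparison, while `baseline=None` corresponds to the potentially more efficient averaged comparison.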

F Additional Simulation Results
I examine the inference performance of the doubly robust estimators in finite samples
in this section. Section 5 of the main text describes how the population is generated.
The standard deviation of the τ(1) estimates is summarized in the top panel of Table
F.1 below. Regression adjustment has the smallest standard deviation. More
interestingly, the standard deviation of the doubly robust estimates can be one-third
smaller than that of the IPW estimates. With moderate misspecification, we can still
see efficiency gains from using the doubly robust estimator.
The bottom panel of Table F.1 summarizes the coverage rate of the 95% confidence
interval based on the usual standard error of the doubly robust estimator with CBPS
moment conditions. In this population generating process, the EHW standard errors
work well in designs 1-3 and 6. In designs 4-5, misspecification of the linear-in-means
outcome model induces more spatial correlation. As a result, the confidence interval
based on the SHAC standard errors provides better coverage than that based on the EHW
standard errors. With a homogeneous direct treatment effect and an effective population
size of 350, we do not see overcoverage of the 95% confidence interval based on the
usual standard errors. Here, the conservativeness of the usual standard errors is due
to misspecification of the propensity scores and conditional means. The typical
downward bias of the SHAC standard errors in finite samples also lowers the coverage
rate.
I introduce an additional design with heterogeneous direct treatment effects.
There are now 900 units in the lattice. Among them, 612 units have neighbors
and are thus eligible for spillover. The individual treatment assignment probability
remains $p(z^*) = \frac{\exp(0.3z + 0.8z_u)}{1 + \exp(0.3z + 0.8z_u)}$, but the second
period potential outcomes are $Y_2 = 2 + 3z \odot W + 0.2 A \ast Y_2 + 2z^2 + e_2$.
As a result, $\tau(1) = \tau(0) = 0$ for the entire population. For the subpopulation
composed of units with neighbors, $\tau(1) = \tau(0) = -0.016$. Point estimates follow
similar patterns as in Table 2 in the main text. The biases of the doubly robust
estimator with CBPS moment conditions are -0.017 and -0.021 for $\tau(1)$ and $\tau(0)$,
respectively. In this design, we do see (substantive) overcoverage of the 95% confidence
intervals for the average direct effect

Table F.1: Standard Deviation and Coverage of CI: τ(1)

Design          1      2      3      4      5      6
Standard deviation
  ra          0.154  0.154  0.155  0.163  0.324  0.156
  ipw mle     0.205  0.252  0.305  0.340  0.532  0.305
  ipw cbps    0.206  0.252  0.306  0.341  0.530  0.306
  dr mle      0.173  0.173  0.194  0.211  0.570  0.194
  dr cbps     0.173  0.173  0.194  0.211  0.562  0.194
Coverage rate
  cov ehw     0.947  0.947  0.941  0.932  0.932  0.942
  cov 0.6     0.945  0.946  0.939  0.935  0.944  0.940
  cov 1       0.944  0.944  0.937  0.934  0.945  0.938

1 The coverage rate is based on the standard error of the doubly robust estimator
with CBPS moment conditions.
2 cov ehw stands for the coverage rate of the 95% confidence interval based on the
EHW standard error; cov 0.6 stands for the coverage rate of the 95% confidence
interval based on the SHAC standard error with bandwidth 0.6; cov 1 stands for the
coverage rate of the 95% confidence interval based on the SHAC standard error with
bandwidth 1.
3 The confidence intervals are centered on the average of point estimates. Thus,
the coverage rate simply compares the magnitudes of the standard errors without
taking into account the bias of the point estimates.
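
The coverage calculation described in the third table note can be sketched as below. The sketch reflects one reading of that note, namely that each replication's interval is checked against the average of the point estimates rather than against the true parameter, so that bias does not enter. The function name `coverage_rate`, its arguments, and the 1.96 critical value for a 95% interval are assumptions made for illustration.

```python
import numpy as np

def coverage_rate(estimates, std_errors, center=None, crit=1.96):
    """Share of replications whose interval estimate +/- crit * se covers `center`.

    With center=None the target is the average of the point estimates, so the
    result reflects only the size of the standard errors, not the bias.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    if center is None:
        center = estimates.mean()
    covered = np.abs(estimates - center) <= crit * std_errors
    return float(covered.mean())

# Four hypothetical replications with unit standard errors.
estimates = [0.0, 1.0, 2.0, 5.0]
std_errors = [1.0, 1.0, 1.0, 1.0]
rate_bias_free = coverage_rate(estimates, std_errors)
rate_against_truth = coverage_rate(estimates, std_errors, center=1.0)
```

Passing the true parameter value as `center` instead gives the conventional coverage rate, which also penalizes bias in the point estimates.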

and parameters in the other moment conditions in the GMM estimation.
Table F.2 below summarizes results for a subset of the GMM parameters. The
first five columns are coverage rates for the parameters in $q_2$, the moment condition
for the propensity score for $G$. The next five columns are coverage rates for the
parameters in the outcome regression moment condition in the second time period,
$q_4$. The last two columns are coverage rates for the two direct effects at exposure
levels one and zero. Because of the spatial correlation induced by spillover, the SHAC
standard errors are the appropriate ones to consider. As expected, the EHW
standard errors can be a bit too small when spatial correlation is non-negligible.
Because the usual standard errors tend to be conservative, the coverage rates of the
confidence intervals constructed using the SHAC standard errors with an appropriate
bandwidth can exceed the nominal level of 0.95, with some coverage rates above 0.99.

Table F.2: Coverage of 95% Confidence Intervals

            |------------ q2 ------------| |------------ q4 ------------|  τ(1)  τ(0)
Column        1     2     3     4     5     6     7     8     9     10    11    12
cov ehw     0.869 0.845 0.964 0.950 0.914 0.965 0.963 0.991 0.969 0.955 0.949 0.928
cov 0.6     0.942 0.928 0.963 0.955 0.956 0.966 0.967 0.990 0.968 0.973 0.955 0.943
cov 1       0.957 0.942 0.961 0.959 0.965 0.966 0.966 0.990 0.968 0.977 0.956 0.947
cov 1.4     0.960 0.946 0.960 0.959 0.967 0.965 0.967 0.990 0.968 0.977 0.956 0.948

1 The results are for the doubly robust estimator with CBPS moment conditions for the
propensity scores in the GMM estimation.
2 In columns 1-5, the coverage rate is with regard to the parameters in the propensity
score moment condition for G, q2; in columns 6-10, the coverage rate is with regard to
the parameters in the outcome regression moment condition, q4; in the last two columns,
the coverage rate is with regard to τ(1) and τ(0).
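
To make the contrast between the EHW and SHAC rows concrete, the sketch below computes both variance estimates for a scalar, mean-zero score. It is a simplified illustration rather than the estimator analyzed in Theorem 4.3: the Bartlett kernel is an assumed choice, the weights use continuous distances instead of the integer distance shells that appear in the proof, and the function names `ehw_variance` and `shac_variance` are hypothetical.

```python
import numpy as np

def ehw_variance(q):
    """Heteroskedasticity-robust variance of mean-zero scalar scores (own terms only)."""
    q = np.asarray(q, dtype=float)
    return float(np.mean(q ** 2))

def shac_variance(q, coords, bandwidth):
    """Spatial HAC variance of mean-zero scalar scores with a Bartlett kernel."""
    q = np.asarray(q, dtype=float)
    coords = np.asarray(coords, dtype=float).reshape(len(q), -1)
    # Pairwise Euclidean distances rho(i, j).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Bartlett weights: 1 at distance zero, declining linearly to 0 at the bandwidth.
    weights = np.clip(1.0 - dist / bandwidth, 0.0, None)
    return float(q @ weights @ q / len(q))

# Three units on a line with negatively correlated neighboring scores.
q = [1.0, -1.0, 2.0]
coords = [0.0, 1.0, 2.0]
v_ehw = ehw_variance(q)
v_small_bw = shac_variance(q, coords, bandwidth=0.5)  # no neighbor within the bandwidth
v_large_bw = shac_variance(q, coords, bandwidth=2.0)  # adjacent units get weight 0.5
```

When the bandwidth is smaller than the distance between any two units, only own terms receive weight and the SHAC estimate coincides with the EHW estimate. Larger bandwidths add cross-products of nearby scores, which is why the two rows of Table F.2 diverge when spatial correlation matters.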

G Additional Proofs
Proof of Proposition 1
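Compare the canonical DID estimand with EDATT:

$$
\begin{aligned}
\tau &= \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ y_{i2}(1, W_{-i}) - y_{i2}(0, W_{-i}) \mid W_i = 1, z_i \big] \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} \sum_{g \in \mathcal{G}} E\big[ y_{i2}(1, W_{-i}) - y_{i2}(0, W_{-i}) \mid W_i = 1, G_i = g, z_i \big] P(G_i = g \mid W_i = 1, z_i) \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} \sum_{g \in \mathcal{G}} \Big\{ E\big[ y_{i2}(1, W_{-i}) \mid W_i = 1, G_i = g, z_i \big] - E\big[ y_{i1}(0, 0) \mid W_i = 1, G_i = g, z_i \big] \\
&\qquad - \Big[ E\big[ y_{i2}(0, W_{-i}) \mid W_i = 0, G_i = g, z_i \big] - E\big[ y_{i1}(0, 0) \mid W_i = 0, G_i = g, z_i \big] \Big] \Big\} P(G_i = g \mid W_i = 1, z_i) \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} \Big\{ \sum_{g \in \mathcal{G}} \big[ E(Y_{i2} \mid W_i = 1, G_i = g, z_i) - E(Y_{i1} \mid W_i = 1, G_i = g, z_i) \big] P(G_i = g \mid W_i = 1, z_i) \\
&\qquad - \sum_{g \in \mathcal{G}} \big[ E(Y_{i2} \mid W_i = 0, G_i = g, z_i) - E(Y_{i1} \mid W_i = 0, G_i = g, z_i) \big] P(G_i = g \mid W_i = 1, z_i) \Big\}
\end{aligned}
\tag{G.1}
$$

$$
\begin{aligned}
\tau_{canonic} &= \frac{1}{|D_M|} \sum_{i \in D_M} \big[ E(Y_{i2} - Y_{i1} \mid W_i = 1, z_i) - E(Y_{i2} - Y_{i1} \mid W_i = 0, z_i) \big] \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} \Big\{ \sum_{g \in \mathcal{G}} \big[ E(Y_{i2} \mid W_i = 1, G_i = g, z_i) - E(Y_{i1} \mid W_i = 1, G_i = g, z_i) \big] P(G_i = g \mid W_i = 1, z_i) \\
&\qquad - \sum_{g \in \mathcal{G}} \big[ E(Y_{i2} \mid W_i = 0, G_i = g, z_i) - E(Y_{i1} \mid W_i = 0, G_i = g, z_i) \big] P(G_i = g \mid W_i = 0, z_i) \Big\}
\end{aligned}
\tag{G.2}
$$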
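
Subtracting (G.1) from (G.2) isolates where the two estimands part ways. The display below is a direct consequence of the two expressions and is written out here only as a reading aid; it is not part of the original proof. The treated-group terms are identical in (G.1) and (G.2), so only the weights attached to the untreated trends differ:

```latex
\tau_{canonic} - \tau
  = \frac{1}{|D_M|} \sum_{i \in D_M} \sum_{g \in \mathcal{G}}
    \big[ E(Y_{i2} \mid W_i = 0, G_i = g, z_i) - E(Y_{i1} \mid W_i = 0, G_i = g, z_i) \big]
    \big[ P(G_i = g \mid W_i = 1, z_i) - P(G_i = g \mid W_i = 0, z_i) \big]
```

Because the probability differences sum to zero over g, the gap vanishes when the untreated trend does not vary with the exposure level, or when treated and untreated units share the same exposure distribution given the covariates.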

Proof of Proposition 2
Identification of the doubly robust estimand:
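When the propensity scores are correctly specified, $\eta(z) = p(z)$, $\eta_{1g}(z) = \pi_{1g}(z)$,
and $\eta_{0g}(z) = \pi_{0g}(z)$.

$$
\begin{aligned}
& E\left[ \frac{W_i 1\{G_i = g\}}{p(z_i)\, \pi_{1g}(z_i)} \Big( \big(Y_{i2} - m_{2,1g}(z_i)\big) - \big(Y_{i1} - m_{1,1g}(z_i)\big) \Big) \,\Big|\, z_i \right] \\
&= E\left[ \frac{W_i 1\{G_i = g\}}{p(z_i)\, \pi_{1g}(z_i)} \Big( \big(Y_{i2} - m_{2,1g}(z_i)\big) - \big(Y_{i1} - m_{1,1g}(z_i)\big) \Big) \,\Big|\, z_i, W_i = 1 \right] P(W_i = 1 \mid z_i) \\
&= E\left[ \frac{1\{G_i = g\}}{\pi_{1g}(z_i)} \Big( \big(Y_{i2} - m_{2,1g}(z_i)\big) - \big(Y_{i1} - m_{1,1g}(z_i)\big) \Big) \,\Big|\, z_i, W_i = 1, G_i = g \right] P(G_i = g \mid W_i = 1, z_i) \\
&= E(Y_{i2} \mid z_i, W_i = 1, G_i = g) - E(Y_{i1} \mid z_i, W_i = 1, G_i = g) - \big[ m_{2,1g}(z_i) - m_{1,1g}(z_i) \big]
\end{aligned}
\tag{G.3}
$$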

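Similarly,

$$
\begin{aligned}
& E\left[ \frac{(1 - W_i) 1\{G_i = g\}}{\big(1 - p(z_i)\big)\, \pi_{0g}(z_i)} \Big( \big(Y_{i2} - m_{2,0g}(z_i)\big) - \big(Y_{i1} - m_{1,0g}(z_i)\big) \Big) \,\Big|\, z_i \right] \\
&= E(Y_{i2} \mid z_i, W_i = 0, G_i = g) - E(Y_{i1} \mid z_i, W_i = 0, G_i = g) - \big[ m_{2,0g}(z_i) - m_{1,0g}(z_i) \big]
\end{aligned}
\tag{G.4}
$$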
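Hence,

$$
\begin{aligned}
& E_D\bigg[ \frac{W_i 1\{G_i = g\}}{p(z_i)\, \pi_{1g}(z_i)} \Big( \big(Y_{i2} - m_{2,1g}(z_i)\big) - \big(Y_{i1} - m_{1,1g}(z_i)\big) \Big) \\
&\qquad - \frac{(1 - W_i) 1\{G_i = g\}}{\big(1 - p(z_i)\big)\, \pi_{0g}(z_i)} \Big( \big(Y_{i2} - m_{2,0g}(z_i)\big) - \big(Y_{i1} - m_{1,0g}(z_i)\big) \Big) + \Delta m_{2,g}(z_i) - \Delta m_{1,g}(z_i) \bigg] \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} \Big\{ E(Y_{i2} \mid z_i, W_i = 1, G_i = g) - E(Y_{i1} \mid z_i, W_i = 1, G_i = g) \\
&\qquad - \big[ E(Y_{i2} \mid z_i, W_i = 0, G_i = g) - E(Y_{i1} \mid z_i, W_i = 0, G_i = g) \big] \\
&\qquad - \Big[ \big( m_{2,1g}(z_i) - m_{1,1g}(z_i) \big) - \big( m_{2,0g}(z_i) - m_{1,0g}(z_i) \big) \Big] + \Delta m_{2,g}(z_i) - \Delta m_{1,g}(z_i) \Big\} \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ y_{i2}(1, W_{-i}) - y_{i2}(0, W_{-i}) \mid W_i = 1, G_i = g, z_i \big]
\end{aligned}
\tag{G.5}
$$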

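When the conditional means are correctly specified, $m_{t,wg}(z) = \mu_{t,wg}(z)$.

$$
\begin{aligned}
& E\left[ \frac{W_i 1\{G_i = g\}}{\eta(z_i)\, \eta_{1g}(z_i)} \Big( \big(Y_{i2} - \mu_{2,1g}(z_i)\big) - \big(Y_{i1} - \mu_{1,1g}(z_i)\big) \Big) \,\Big|\, z_i \right] \\
&= E\left[ \frac{W_i 1\{G_i = g\}}{\eta(z_i)\, \eta_{1g}(z_i)} \Big( \big(Y_{i2} - \mu_{2,1g}(z_i)\big) - \big(Y_{i1} - \mu_{1,1g}(z_i)\big) \Big) \,\Big|\, z_i, W_i = 1 \right] P(W_i = 1 \mid z_i) \\
&= \frac{p(z_i)}{\eta(z_i)} E\left[ \frac{1\{G_i = g\}}{\eta_{1g}(z_i)} \Big( \big(Y_{i2} - \mu_{2,1g}(z_i)\big) - \big(Y_{i1} - \mu_{1,1g}(z_i)\big) \Big) \,\Big|\, z_i, W_i = 1, G_i = g \right] P(G_i = g \mid z_i, W_i = 1) \\
&= \frac{p(z_i)}{\eta(z_i)} \frac{\pi_{1g}(z_i)}{\eta_{1g}(z_i)} \Big[ \big( E(Y_{i2} \mid z_i, W_i = 1, G_i = g) - \mu_{2,1g}(z_i) \big) - \big( E(Y_{i1} \mid z_i, W_i = 1, G_i = g) - \mu_{1,1g}(z_i) \big) \Big] = 0
\end{aligned}
\tag{G.6}
$$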

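Analogously,

$$
E\left[ \frac{(1 - W_i) 1\{G_i = g\}}{\big(1 - \eta(z_i)\big)\, \eta_{0g}(z_i)} \Big( \big(Y_{i2} - \mu_{2,0g}(z_i)\big) - \big(Y_{i1} - \mu_{1,0g}(z_i)\big) \Big) \,\Big|\, z_i \right] = 0
\tag{G.7}
$$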

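As a result,

$$
\begin{aligned}
& E_D\bigg[ \frac{W_i 1\{G_i = g\}}{\eta(z_i)\, \eta_{1g}(z_i)} \Big( \big(Y_{i2} - \mu_{2,1g}(z_i)\big) - \big(Y_{i1} - \mu_{1,1g}(z_i)\big) \Big) \\
&\qquad - \frac{(1 - W_i) 1\{G_i = g\}}{\big(1 - \eta(z_i)\big)\, \eta_{0g}(z_i)} \Big( \big(Y_{i2} - \mu_{2,0g}(z_i)\big) - \big(Y_{i1} - \mu_{1,0g}(z_i)\big) \Big) + \Delta \mu_{2,g}(z_i) - \Delta \mu_{1,g}(z_i) \bigg] \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} \big[ \Delta \mu_{2,g}(z_i) - \Delta \mu_{1,g}(z_i) \big] \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} E\big[ y_{i2}(1, W_{-i}) - y_{i2}(0, W_{-i}) \mid W_i = 1, G_i = g, z_i \big]
\end{aligned}
\tag{G.8}
$$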
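
The double robustness argument can be checked numerically at a single covariate value. The Python sketch below evaluates the population value of the moment in closed form, using the same factorization as in (G.3) and (G.6), with the two periods collapsed into the change $Y_{i2} - Y_{i1}$. All numbers and the function name `dr_moment` are hypothetical and chosen only for illustration.

```python
def dr_moment(p, pi1, pi0, mu1, mu0, eta, eta1, eta0, m1, m0):
    """Population value of the doubly robust moment at a single covariate value.

    p, pi1, pi0     : true P(W=1), P(G=g | W=1), P(G=g | W=0)
    mu1, mu0        : true E(Y2 - Y1 | W=w, G=g) for w = 1, 0
    eta, eta1, eta0 : working propensity scores
    m1, m0          : working conditional means of Y2 - Y1
    """
    treated = p * pi1 / (eta * eta1) * (mu1 - m1)
    control = (1 - p) * pi0 / ((1 - eta) * eta0) * (mu0 - m0)
    return treated - control + (m1 - m0)

truth = dict(p=0.4, pi1=0.5, pi0=0.25, mu1=3.0, mu0=1.0)
target = truth["mu1"] - truth["mu0"]  # DID in the true conditional means

# Correct propensity scores, wrong outcome means.
ps_right = dr_moment(**truth, eta=0.4, eta1=0.5, eta0=0.25, m1=10.0, m0=-4.0)
# Wrong propensity scores, correct outcome means.
om_right = dr_moment(**truth, eta=0.5, eta1=0.5, eta0=0.5, m1=3.0, m0=1.0)
# Both sets of working models wrong: the moment no longer matches the target.
both_wrong = dr_moment(**truth, eta=0.5, eta1=0.5, eta0=0.5, m1=0.0, m0=0.0)
```

The first two cases return the target difference in trends, matching (G.5) and (G.8), while the third does not. Equating that difference in trends with the EDATT still relies on the modified parallel trends assumption from the main text.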

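Proof of Theorem 4.3:

Using analogous arguments in the proof of Theorem 4.2, $\hat{R} - R_M^* \overset{p}{\to} 0$. The key
step is to show that $\tilde{\Omega}(\hat{\theta}) - \Omega_M - \Omega_E \overset{p}{\to} 0$.
Notice that

$$
\begin{aligned}
\Omega_M &= \frac{1}{|D_M|} \sum_{i \in D_M} \sum_{j \in D_M} E\Big[ \big( q(X_i, \theta_M^*) - E[q(X_i, \theta_M^*) \mid z] \big) \cdot \big( q(X_j, \theta_M^*) - E[q(X_j, \theta_M^*) \mid z] \big)' \,\Big|\, z \Big] \\
&= \frac{1}{|D_M|} \sum_{i \in D_M} \sum_{j \in D_M} E\big[ \tilde{q}(X_i, \theta_M^*) \tilde{q}(X_j, \theta_M^*)' \big],
\end{aligned}
\tag{G.9}
$$

where

$$\tilde{q}(X_i, \theta_M^*) = q(X_i, \theta_M^*) - E\big[ q(X_i, \theta_M^*) \mid z \big] \tag{G.10}$$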

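with $E\big[ \tilde{q}(X_i, \theta_M^*) \mid z \big] = 0$.
Since a sequence of symmetric matrices $\{A_N\}$ converges to a symmetric matrix
$A_0$ if and only if $c' A_N c \to c' A_0 c$ for any vector $c$, we can reach our conclusion by
taking an arbitrary linear combination of $q(X_i, \theta)$. From now on, I focus on the case
of a scalar $q(X_i, \theta)$.

$$\big| \tilde{\Omega}(\hat{\theta}) - \Omega_M - \Omega_E \big| \le \big| \tilde{\Omega}(\hat{\theta}) - \tilde{\Omega}(\theta_M^*) \big| + \big| \tilde{\Omega}(\theta_M^*) - \Omega_M - \Omega_E \big|. \tag{G.11}$$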

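For the first term on the right hand side of (G.11), take a mean value expansion
of $\tilde{\Omega}(\hat{\theta})$ around $\theta_M^*$. Let $\check{\theta}$ denote the mean value from this expansion.

$$
\begin{aligned}
& \big| \tilde{\Omega}(\hat{\theta}) - \tilde{\Omega}(\theta_M^*) \big| \\
&= \bigg| (\hat{\theta} - \theta_M^*) \frac{1}{|D_M|} \sum_{s=0}^{\infty} \omega\Big( \frac{s}{b_M} \Big) \sum_{i \in D_M} \sum_{j \in D_M,\, s \le \rho(i,j) < s+1} \Big[ \nabla_\theta q(X_i, \check{\theta}) q(X_j, \check{\theta}) + q(X_i, \check{\theta}) \nabla_\theta q(X_j, \check{\theta}) \Big] \bigg| \\
&\le C_1 \Big\| \sqrt{|D_M|} (\hat{\theta} - \theta_M^*) \Big\| \frac{1}{|D_M|^{3/2}} \sum_{s=0}^{b_M} \sum_{i \in D_M} \sum_{j \in D_M,\, s \le \rho(i,j) < s+1} \sup_{\theta \in \Theta} \big\| \nabla_\theta q(X_i, \theta) q(X_j, \theta) \big\| \\
&\le C \Big\| \sqrt{|D_M|} (\hat{\theta} - \theta_M^*) \Big\| \frac{1}{\sqrt{|D_M|}} \Big( \sum_{s=1}^{b_M} s^{d-1} + 1 \Big) \frac{1}{|D_M|} \sum_{i \in D_M} \sup_{\theta \in \Theta} \big\| \nabla_\theta q(X_i, \theta) q(X_j, \theta) \big\|
\end{aligned}
\tag{G.12}
$$

Since

$$
\begin{aligned}
E\bigg[ \frac{1}{|D_M|} \sum_{i \in D_M} \sup_{\theta \in \Theta} \big\| \nabla_\theta q(X_i, \theta) q(X_j, \theta) \big\| \,\Big|\, z \bigg]
&\le \sup_{M,\, i \in D_M} E\Big[ \sup_{\theta \in \Theta} \big\| \nabla_\theta q(X_i, \theta) q(X_j, \theta) \big\| \,\Big|\, z \Big] \\
&\le \sup_{M,\, i \in D_M} E\Big[ \sup_{\theta \in \Theta} \big\| \nabla_\theta q(X_i, \theta) \big\|^2 \,\Big|\, z \Big]^{1/2} \cdot \sup_{M,\, i \in D_M} E\Big[ \sup_{\theta \in \Theta} q(X_i, \theta)^2 \,\Big|\, z \Big]^{1/2} < \infty,
\end{aligned}
\tag{G.13}
$$

$$\frac{1}{|D_M|} \sum_{i \in D_M} \sup_{\theta \in \Theta} \big\| \nabla_\theta q(X_i, \theta) q(X_j, \theta) \big\| = O_p(1) \tag{G.14}$$

by Markov's inequality. Given $b_M = o\big( |D_M|^{1/(2d)} \big)$, $\frac{1}{\sqrt{|D_M|}} \sum_{s=1}^{b_M} s^{d-1} = o(1)$. Also,
$\sqrt{|D_M|} (\hat{\theta} - \theta_M^*) = O_p(1)$ by Theorem 4.2. Hence, $|\tilde{\Omega}(\hat{\theta}) - \tilde{\Omega}(\theta_M^*)| = o_p(1)$.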
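
The bandwidth rate used in this step can be verified in one line. The display below is a reading aid, not part of the original proof, and it assumes that the bandwidth condition is read with exponent 1/(2d) and that d is at least 1, so that each term of the sum is at most the largest one:

```latex
\frac{1}{\sqrt{|D_M|}} \sum_{s=1}^{b_M} s^{d-1}
  \le \frac{b_M \cdot b_M^{d-1}}{\sqrt{|D_M|}}
  = \left( \frac{b_M}{|D_M|^{1/(2d)}} \right)^{d} \to 0
```

The extra term of 1 in the parentheses of (G.12) contributes only a factor of order $|D_M|^{-1/2}$, so it is negligible as well.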
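Let

$$\check{\Omega}_M = \frac{1}{|D_M|} \sum_{s=0}^{\infty} \omega\Big( \frac{s}{b_M} \Big) \sum_{i \in D_M} \sum_{j \in D_M,\, s \le \rho(i,j) < s+1} \tilde{q}(X_i, \theta_M^*) \tilde{q}(X_j, \theta_M^*). \tag{G.15}$$

Applying Proposition 4.1 in Kojevnikov, Marmer, and Song (2021), we have

$$\check{\Omega}_M - \Omega_M = o_p(1). \tag{G.16}$$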

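What is left is to show

$$
\begin{aligned}
\big| \tilde{\Omega}(\theta_M^*) - \Omega_E - \check{\Omega}_M \big|
&\le 2 \bigg| \frac{1}{|D_M|} \sum_{s=0}^{\infty} \omega\Big( \frac{s}{b_M} \Big) \sum_{i \in D_M} \sum_{j \in D_M,\, s \le \rho(i,j) < s+1} E\big[ q(X_j, \theta_M^*) \mid z \big] \tilde{q}(X_i, \theta_M^*) \bigg| \\
&= o_p(1).
\end{aligned}
\tag{G.17}
$$

Let $B_i = \sum_{s=0}^{\infty} \omega\big( \frac{s}{b_M} \big) \sum_{j \in D_M,\, s \le \rho(i,j) < s+1} E\big[ q(X_j, \theta_M^*) \mid z \big]$.

$$
\begin{aligned}
& \bigg\| \frac{1}{|D_M|} \sum_{s=0}^{\infty} \omega\Big( \frac{s}{b_M} \Big) \sum_{i \in D_M} \sum_{j \in D_M,\, s \le \rho(i,j) < s+1} E\big[ q(X_j, \theta_M^*) \mid z \big] \tilde{q}(X_i, \theta_M^*) \bigg\|_1 \\
&\le \bigg\| \frac{1}{|D_M|} \sum_{i \in D_M} \tilde{q}(X_i, \theta_M^*) B_i \bigg\|_2 \\
&\le \bigg[ \frac{1}{|D_M|^2} \sum_{i \in D_M} E\big[ \tilde{q}(X_i, \theta_M^*)^2 \mid z \big] B_i^2 + \frac{1}{|D_M|^2} \sum_{i \in D_M} \sum_{j \in D_M,\, j \ne i} E\big[ \tilde{q}(X_i, \theta_M^*) \tilde{q}(X_j, \theta_M^*) \mid z \big] B_i B_j \bigg]^{1/2} \\
&\le \bigg[ \frac{C_1}{|D_M|} b_M^{2d} + \frac{C_2}{|D_M|^2} \sum_{i \in D_M} \sum_{s=1}^{\infty} \sum_{j \in D_M,\, s \le \rho(i,j) < s+1} \tilde{\kappa}_{M,s} B_i B_j \bigg]^{1/2} \\
&\le \bigg[ o(1) + \frac{C_2}{|D_M|} \sum_{s=1}^{\infty} s^{d-1} b_M^{2d} \tilde{\kappa}_{M,s} \bigg]^{1/2} = o(1).
\end{aligned}
\tag{G.18}
$$

Hence, equation (G.17) follows from Markov's inequality. Theorem 4.3 follows by
continuity of matrix inversion and multiplication.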

Proof of Corollaries 3 and 4:


When the exposure mapping is correctly specified, Assumption 5 holds by definition
for any $s \ge K$. Hence, Lemma 4.1 holds with
$\tilde{\kappa}^*_{M,s} = s^d \hat{\alpha}_{\epsilon}(s/3) 1(s > 3\max\{K, \rho_0\}) + 1(s \le 3\max\{K, \rho_0\}) \le \tilde{\kappa}_{M,s}$.
Assumptions 7-9 are satisfied with $\tilde{\kappa}^*_{M,s}$.

References
Agarwal, A., Cen, S., Shah, D., and Yu, C.L. (2022), Network synthetic interventions:
A framework for panel data with network interference. Tech. rep., arXiv preprint
arXiv:2210.11355.

Aronow, P.M. and Samii, C. (2017), Estimating average causal effects under general
interference, with application to a social network experiment. Annals of Applied
Statistics 11(4), 1912–1947.

Auerbach, E. and Tabord-Meehan, M. (2021), The local approach to causal inference
under network interference. Tech. rep., arXiv preprint arXiv:2105.03810.

Basse, G.W. and Airoldi, E.M. (2018), Limitations of design-based causal inference
and a/b testing under arbitrary and network interference. Sociological Methodology
48(1), 136–151.

Butts, K. (2021), Difference-in-differences estimation with spatial spillovers. Tech.
rep., arXiv preprint arXiv:2105.03737.

Di Tella, R. and Schargrodsky, E. (2004), Do police reduce crime? Estimates using
the allocation of police forces after a terrorist attack. American Economic Review
94(1), 115–133.

Emmenegger, C., Spohn, M.L., and Bühlmann, P. (2022), Treatment effect estima-
tion from observational network data using augmented inverse probability weight-
ing and machine learning. Tech. rep., arXiv preprint arXiv:2206.14591.

Kojevnikov, D., Marmer, V., and Song, K. (2021), Limit theorems for network de-
pendent random variables. Journal of Econometrics 222(2), 882–908.

Leung, M.P. (2022), Causal inference under approximate neighborhood interference.
Econometrica 90(1), 267–293.

Manski, C.F. (2013), Identification of treatment response with social interactions.
The Econometrics Journal 16(1), S1–S23.

Qu, Z., Xiong, R., Liu, J., and Imbens, G. (2022), Efficient treatment effect estima-
tion in observational studies under heterogeneous partial interference. Tech. rep.,
arXiv preprint arXiv:2107.12420.

Sant’Anna, P.H. and Zhao, J. (2020), Doubly robust difference-in-differences estimators.
Journal of Econometrics 219(1), 101–122.
