Lectures
Figure: xkcd
Basic information
Textbooks
1 Angrist and Pischke (2009) Mostly Harmless Econometrics
(MHE)
2 Morgan and Winship (2014) Counterfactuals and Causal
Inference (MW)
Readings:
We will also discuss a number of papers in each lecture, each
of which you will need to learn inside and out.
Lecture slides and reading lists are available
The exam will cover material from all required readings.
Content of Part 1: The Core
y = β0 + β1 x + u (1)
addresses each of them.
The SLR model is a population model. When it comes to
estimating β1 (and β0 ) using a random sample of data, we
must restrict how u and x are related to each other.
What we must do is restrict the way u and x relate to each
other in the population.
First, we make a simplifying assumption (without loss of
generality): the average, or expected, value of u is zero in the
population:
E (u) = 0 (2)
where E (·) is the expected value operator.
The presence of β0 in
y = β0 + β1 x + u (3)
is what allows this normalization: any nonzero mean of u can be absorbed into the intercept.
Under the zero conditional mean assumption, E(u|x) = 0, taking the expectation of y conditional on x gives
E(y|x) = β0 + β1 x (7)
which shows the population regression function is a linear function of x.
The straight line in the graph on the next page is what
Wooldridge calls the population regression function, and
what Angrist and Pischke call the conditional expectation
function
E (y |x) = β0 + β1 x
The conditional distributions of y at three different values of x are superimposed. For a given value of x, we see a range of y values: remember, y = β0 + β1 x + u, and u has a distribution in the population.
Deriving the Ordinary Least Squares Estimates
yi = β0 + β1 xi + ui (8)
E (u) = 0 (9)
Cov (x, u) = 0 (10)
E (u|x) = 0 (11)
With E (u) = 0, Cov (x, u) = 0 is the same as E (xu) = 0. Next we
plug in for u:
E (y − β0 − β1 x) = 0 (12)
E [x(y − β0 − β1 x)] = 0 (13)
The sample analogues of these two population moment conditions are
(1/n) Σ_{i=1}^n (yi − β̂0 − β̂1 xi) = 0 (14)
(1/n) Σ_{i=1}^n xi (yi − β̂0 − β̂1 xi) = 0 (15)
where β̂0 and β̂1 are the estimates from the data.
These are two linear equations in the two unknowns β̂0 and β̂1 .
Pass the summation operator through the first equation:
(1/n) Σ_{i=1}^n (yi − β̂0 − β̂1 xi) (16)
= (1/n) Σ_{i=1}^n yi − (1/n) Σ_{i=1}^n β̂0 − (1/n) Σ_{i=1}^n β̂1 xi (17)
= (1/n) Σ_{i=1}^n yi − β̂0 − β̂1 ((1/n) Σ_{i=1}^n xi) (18)
implies
β̂1 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = Sample Covariance(xi, yi) / Sample Variance(xi) (27)
The previous formula for β̂1 is important. It shows us how to
take the data we have and compute the slope estimate.
β̂1 is called the ordinary least squares (OLS) slope estimate.
It can be computed whenever the sample variance of the xi is
not zero, which only rules out the case where each xi has the
same value.
The intuition is that the variation in x is what permits us to
identify its impact on y .
Once we have β̂1, we compute β̂0 = ȳ − β̂1 x̄. This is the OLS intercept estimate.
These days, we let the computer do the calculations, which
are tedious even if n is small.
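As a sanity check on the formulas above, here is a minimal Stata sketch (assuming variables named y and x are already in memory; the names are placeholders) that computes the slope and intercept by hand and compares them with regress:

* compute the OLS slope and intercept "by hand" and compare with -regress-
quietly summarize x
scalar xbar = r(mean)
quietly summarize y
scalar ybar = r(mean)
generate double devxy = (x - xbar) * (y - ybar)   // pieces of the sample covariance
generate double devxx = (x - xbar)^2              // pieces of the sample variance
quietly summarize devxy
scalar sxy = r(sum)
quietly summarize devxx
scalar sxx = r(sum)
scalar b1 = sxy / sxx            // slope: sample covariance / sample variance
scalar b0 = ybar - b1 * xbar     // intercept
display "b1 = " b1 "   b0 = " b0
regress y x                      // should reproduce the hand-computed numbers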
For any candidates β̂0 and β̂1, define a fitted value for each i as ŷi = β̂0 + β̂1 xi and the residual as ûi = yi − ŷi. Because the residuals sum to zero, the sample average of the fitted values equals ȳ.
Similar to the way we obtained our estimates, the second first-order condition implies
(1/n) Σ_{i=1}^n xi (yi − β̂0 − β̂1 xi) = (1/n) Σ_{i=1}^n xi ûi = 0 (34)
so the xi and the residuals are uncorrelated in the sample.
Because the ŷi are linear functions of the xi , the fitted values and
residuals are uncorrelated, too:
(1/n) Σ_{i=1}^n ŷi ûi = 0 (36)
Both properties hold by construction. β̂0 and β̂1 were chosen to
make them true.
A third property is that the point (x̄, ȳ) is always on the OLS regression line: if we plug the average of x into the fitted line, we predict the sample average of y, that is, ȳ = β̂0 + β̂1 x̄.
Next we ask whether OLS is unbiased, that is, whether
E(β̂) = β (38)
Remember, our objective is to estimate β1 , the slope
population parameter that describes the relationship between
y and x.
β̂1 is an estimator of that parameter obtained for a specific
sample.
Different samples will generate different estimates (β̂1 ) for the
“true” β1 , i.e. (β̂1 ) is a random variable.
Unbiasedness is the idea that if we could take as many
random samples on Y as we want from the population, and
compute an estimate each time, the average of these
estimates would be equal to β1 .
Assumption SLR.1 (Linear in Parameters)
The population model can be written as
y = β0 + β1 x + u (39)
where β0 and β1 are the (unknown) population parameters.
We view x and u as outcomes of random variables; thus, y is
random.
Stating this assumption formally shows that our goal is to
estimate β0 and β1 .
Assumption SLR.2 (Random Sampling)
We have a random sample of size n, {(xi , yi ) : i = 1, ..., n},
following the population model.
We know how to use this data to estimate β0 and β1 by OLS.
Because each i is a draw from the population, we can write,
for each i,
yi = β0 + β1 xi + ui (40)
Notice that ui here is the unobserved error for observation i.
It is not the residual that we compute from the data!
Assumption SLR.3 (Sample Variation in the Explanatory
Variable)
The sample outcomes on xi are not all the same value.
This is the same as saying the sample variance of
{xi : i = 1, ..., n} is not zero.
In practice, this is no assumption at all. If the xi are all the
same value, we cannot learn how x affects y in the population.
Assumption SLR.4 (Zero Conditional Mean)
In the population, the error term has zero mean given any value of the explanatory variable:
E(u|x) = E(u) = 0
THEOREM (Unbiasedness of OLS): Under Assumptions SLR.1–SLR.4,
E(β̂1) = β1 (42)
where the expected value means averaging across random samples.
Step 1: Write down a formula for β̂1. It is convenient to use
β̂1 = Σ_{i=1}^n (xi − x̄) yi / Σ_{i=1}^n (xi − x̄)² (43)
which is one of several equivalent forms.
It is convenient to define SSTx = Σ_{i=1}^n (xi − x̄)², the total variation in the xi, and write
β̂1 = Σ_{i=1}^n (xi − x̄) yi / SSTx (44)
Remember, SSTx is just some positive number. The existence of β̂1 is guaranteed by SLR.3.
β̂1 = (β1 SSTx + Σ_{i=1}^n (xi − x̄) ui) / SSTx = β1 + (Σ_{i=1}^n (xi − x̄) ui) / SSTx (49)
Note how the last piece is the slope coefficient from the OLS
regression of ui on xi , i = 1, ..., n. We cannot do this regression
because the ui are not observed.
Now define
wi = (xi − x̄)/SSTx (50)
so we have
β̂1 = β1 + Σ_{i=1}^n wi ui (51)
Taking expectations conditional on the xi and using E(ui|x) = 0 (SLR.4), E(β̂1) = β1 + Σ_{i=1}^n wi E(ui|x) = β1 (54)
E (u 2 |x) = σ 2 = E (u 2 ) (57)
Under the population Assumptions SLR.1 (y = β0 + β1 x + u), SLR.4 (E(u|x) = 0) and SLR.5 (Var(u|x) = σ²),
E (y |x) = β0 + β1 x (58)
Var(y|x) = σ² (59)
Var(β̂1|x) = σ² / Σ_{i=1}^n (xi − x̄)² = σ²/SSTx (60)
Var(β̂0|x) = σ² ((1/n) Σ_{i=1}^n xi²) / SSTx (61)
(conditional on the outcomes {x1 , x2 , ..., xn }).
To show this, write, as before,
β̂1 = β1 + Σ_{i=1}^n wi ui (62)
Var(β̂1|x) = Var(Σ_{i=1}^n wi ui | x) (63)
= Σ_{i=1}^n Var(wi ui | x) = Σ_{i=1}^n wi² Var(ui | x) (64)
= Σ_{i=1}^n wi² σ² = σ² Σ_{i=1}^n wi² (65)
Σ_{i=1}^n wi² = Σ_{i=1}^n (xi − x̄)²/(SSTx)² = (Σ_{i=1}^n (xi − x̄)²)/(SSTx)² (66)
= SSTx/(SSTx)² = 1/SSTx (67)
We have shown
Var(β̂1) = σ²/SSTx (68)
Usually we are interested in β1 . We can easily study the two
factors that affect its variance.
Var(β̂1) = σ²/SSTx (69)

In
Var(β̂1) = σ²/SSTx (73)
we can compute SSTx from {xi : i = 1, ..., n}, but we need to estimate σ².
Recall that
σ 2 = E (u 2 ). (74)
Therefore, if we could observe a sample on the errors,
{ui : i = 1, 2, ..., n}, an unbiased estimator of σ 2 would be the
sample average
(1/n) Σ_{i=1}^n ui² (75)
ui = yi − β0 − β1 xi (76)
ûi = yi − β̂0 − β̂1 xi (77)
ûi can be computed from the data because it depends on the
estimators β̂0 and β̂1 . Except by fluke,
ûi ≠ ui (78)
for any i.
Replacing the errors with the residuals gives the estimator (1/n) Σ_{i=1}^n ûi² = SSR/n. It is a true estimator and easily computed from the data after OLS.
As it turns out, this estimator is slightly biased: its expected value
is a little less than σ 2 .
The estimator does not account for the two restrictions on the
residuals, used to obtain β̂0 and β̂1 :
Σ_{i=1}^n ûi = 0 (82)
Σ_{i=1}^n xi ûi = 0 (83)
Adjusting for these two restrictions (two degrees of freedom are used up in estimating β̂0 and β̂1), we divide by n − 2 rather than n:
σ̂² = SSR/(n − 2) (84)
THEOREM: Unbiased Estimator of σ 2
Under Assumptions SLR.1 to SLR.5,
E (σ̂ 2 ) = σ 2 (85)
In regression output, it is
σ̂ = √(σ̂²) = √(SSR/(n − 2)) (86)
that is usually reported. This is an estimator of sd(u), the standard deviation of the population error. And SSR = Σ_{i=1}^n ûi².
σ̂ is called the standard error of the regression, which
means it is an estimate of the standard deviation of the error
in the regression. Stata calls it the root mean squared error.
Given σ̂, we can now estimate sd(β̂1 ) and sd(β̂0 ). The
estimates of these are called the standard errors of the β̂j .
We just plug σ̂ in for σ:
se(β̂1) = σ̂ / √SSTx (87)
where both the numerator and denominator are computed
from the data.
For reasons we will see, it is useful to report the standard
errors below the corresponding coefficient, usually in
parentheses.
OLS inference is generally faulty in the presence of
heteroskedasticity
Fortunately, OLS is still useful
Assume SLR.1-4 hold, but not SLR.5. Therefore
Var(β̂1) = (Σ_{i=1}^n (xi − x̄)² ûi²) / SSTx²
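In practice this heteroskedasticity-robust (White) variance is computed for you. A minimal Stata illustration (y and x are placeholder variable names):

regress y x                // classical standard errors (rely on SLR.5)
regress y x, vce(robust)   // heteroskedasticity-robust standard errors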
yg = xg β + ug
β̂ = (X′X)⁻¹ X′y
E (yi |xi = x)
Helpful result: Law of Iterated Expectations
E (Y ) = E (E [Y |X ])
Say that the population is divided by gender. We could take
conditional expectations by gender and combine them
(properly weighted) to get the unconditional expectation
E[IQ] = E(E[IQ|Sex])
= Σ_{Sex} Pr(Sex) · E[IQ|Sex]
= Pr(Male) · E[IQ|Male] + Pr(Female) · E[IQ|Female]
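A small simulated check of the law of iterated expectations (the distributions, magnitudes, and variable names below are invented purely for illustration):

* LIE: E[IQ] equals the probability-weighted average of the conditional means
clear
set seed 1
set obs 10000
generate male = runiform() < 0.5
generate iq = 100 + 5*male + rnormal(0, 15)   // arbitrary conditional means
quietly summarize iq
scalar uncond = r(mean)
quietly summarize iq if male
scalar m1 = r(mean)
scalar p1 = r(N)/10000
quietly summarize iq if !male
scalar m0 = r(mean)
scalar p0 = r(N)/10000
display "E[IQ] = " uncond "   weighted conditional means = " (p1*m1 + p0*m0)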
yi = E (yi |xi ) + ui
where
1 ui is mean independent of xi ; that is
E (ui |xi ) = 0
where V (·) is the variance and V (yi |xi ) is the conditional variance
of yi given xi .
workforpayi = β0 + β1 numkidsi + ui
. regress workforpay numkids
where the first line is the causal / econometric model, and the
second line is the regression command in STATA
If family size is random, then number of kids is uncorrelated
with the unobserved error term, which means we can interpret
βb1 as the causal effect.
Example: if Melissa has no children in reality (i.e.,
numkids= 0) and we wanted to know what the effect on labor
supply will be if we surgically manipulated her family size (i.e.,
numkids = 1) then βb1 would be our answer
Visual: Even better, we could plot all of the (workforpay, numkids) pairs in a scatter plot; the slope coefficient is the best linear fit through these points and, if family size is random, it also tells us the average causal effect of family size on labor supply (see the sketch below).
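A sketch of that plot in Stata, assuming a dataset with the hypothetical variables workforpay and numkids from the model above:

* scatter of the data with the fitted regression line overlaid
twoway (scatter workforpay numkids) (lfit workforpay numkids), ///
    ytitle("Work for pay") xtitle("Number of kids")
regress workforpay numkids   // slope = average causal effect only if numkids is random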
But how do we interpret βb1 if numkids is non-random?
Assume that family size is random once we condition on race,
age, marital status and employment. Then the model is:
with x̃1i = x1i − x̂1i being the residual from the auxiliary regression of x1i on the other covariates.
The parameter β1 can be rewritten as:
β1 = Cov(yi, x̃1i) / Var(x̃1i)
Because the auxiliary residual fi is orthogonal to every other regressor,
β1 E[fi x1i] = · · · = βk−1 E[fi xk−1,i] = βk+1 E[fi xk+1,i] = · · · = βK E[fi xKi] = 0
Regression Anatomy Proof (cont.)
3. Consider now the term E[ei fi]. This can be written as:
E[ei fi] = E[ei fi]
= E[ei x̃ki]
= E[ei (xki − x̂ki)]
= E[ei xki] − E[ei x̂ki]
which follows directly from the orthogonality between E[xki|X−k] and x̃ki. From previous derivations we finally get the regression anatomy result, βk = Cov(yi, x̃ki)/Var(x̃ki).
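A numerical check of the regression anatomy theorem in Stata (y, x1, x2, x3 are placeholder variable names for any dataset in memory):

* regression anatomy: the multivariate coefficient on x1 equals the
* bivariate coefficient from regressing y on the residualized x1
regress y x1 x2 x3             // note the coefficient on x1
regress x1 x2 x3               // auxiliary regression of x1 on the other covariates
predict double x1tilde, residuals
regress y x1tilde              // same coefficient on x1tilde as on x1 above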
Yi = β0 + β1 Si + β2 Ai + ui
Yi = log of earnings
Si = schooling measured in years
Ai = individual ability
When Cov (A, S) > 0 then ability and schooling are correlated.
When ability is unobserved, then not even multiple regression
will identify the causal effect of schooling on wages.
Here we see one of the main justifications for this class – what
will we do when the treatment variable is endogenous?
Because endogeneity means the causal effect has not been
identified.
Overview
Part 1: The Core (regression fundamentals, randomized experiments, directed acyclical graphs)
Part 2: Selection on observables
Part 3: Selection on unobservables
Part 4: Advanced material
Conclusion
References
Observed variables:
Treatment, Di , is observed as either 0 or 1 for each i unit.
Actual outcomes, Y , are observed for each unit i
Unobserved counterfactual variables:
Each individual i has one counterfactual (potential) outcome
that in principle exists but is not observed:
Potential outcome = Yi1 if Di = 1, and Yi0 if Di = 0
E [δ|D = 1] = E [Y 1 − Y 0 |D = 1]
= E [Y 1 |D = 1] − E [Y 0 |D = 1]
E [δ|D = 0] = E [Y 1 − Y 0 |D = 0]
= E [Y 1 |D = 0] − E [Y 0 |D = 0]
Simple difference in means
Notice that . . . :
All individuals in the population contribute twice to ATE,
whereas a sampled individual is used only once to estimate
SDO by contributing to either EN [yi |di = 1] or EN [yi |di = 0].
Statistical models, such as SDO, are valuable insofar as they
can provide unbiased and/or consistent estimates of the
parameter of interest (i.e., ATE). But notice the subtle
difference between the LHS and RHS:
SDO ≷ ATE
E[y|d = 1] − E[y|d = 0] ≷ E[Y1] − E[Y0]
E[Y1|D = 1] − E[Y0|D = 0] = ATE
+ (E[Y0|D = 1] − E[Y0|D = 0])
+ (1 − π)(ATT − ATU) (89)
ATE = E [Y 1 ] − E [Y 0 ]
= {πE [Y 1 |D = 1] + (1 − π)E [Y 1 |D = 0]}
−{πE [Y 0 |D = 1] + (1 − π)E [Y 0 |D = 0]}
E [Y 1 |D = 1] = a
E [Y 1 |D = 0] = b
E [Y 0 |D = 1] = c
E [Y 0 |D = 0] = d
ATE = e
Rewrite ATE
E[Y1|D = 1] − E[Y0|D = 0] = ATE
+ (E[Y0|D = 1] − E[Y0|D = 0])
+ (1 − π){E[Y1|D = 1] − E[Y0|D = 1]}
− (1 − π){E[Y1|D = 0] − E[Y0|D = 0]}

E[Y1|D = 1] − E[Y0|D = 0] = ATE
+ (E[Y0|D = 1] − E[Y0|D = 0])
+ (1 − π)(ATT − ATU)
Decomposition of difference in means
E[Y1|D = 1] − E[Y0|D = 0] = ATE
+ E[Y0|D = 1] − E[Y0|D = 0]   (selection bias)
+ (1 − π)(ATT − ATU)   (heterogeneous treatment effect bias)
Independence assumption
Treatment is independent of potential outcomes
(Y0, Y1) ⊥⊥ D
E [Y 1 |D = 1] = E [Y 1 |D = 0]
E [Y 0 |D = 1] = E [Y 0 |D = 0]
Random Assignment Solves the Selection Problem
E[Y1|D = 1] − E[Y0|D = 0] = ATE
+ E[Y0|D = 1] − E[Y0|D = 0]   (selection bias)
+ (1 − π)(ATT − ATU)   (heterogeneous treatment effect bias)
Notice that the selection bias from the second line of the
decomposition of SDO was:
E [Y 0 |D = 1] − E [Y 0 |D = 0]
If treatment is independent of potential outcomes, then swap
out equations and selection bias zeroes out:
E [Y 0 |D = 1] − E [Y 0 |D = 0] = E [Y 0 |D = 0] − E [Y 0 |D = 0]
= 0
Random Assignment Solves the Heterogeneous Treatment Effects
How does randomization affect the heterogeneous treatment effects bias from the third line? Rewrite the definitions of ATT and ATU:
ATT = E[Y1|D = 1] − E[Y0|D = 1]
ATU = E[Y1|D = 0] − E[Y0|D = 0]
Under independence, E[Y1|D = 1] = E[Y1|D = 0] and E[Y0|D = 1] = E[Y0|D = 0], so ATT = ATU and the third term is zero.
With randomization one could simply calculate SDO (simple difference in mean
outcomes) for the treatment and control group and know that SDO=ATE
because of independence
Nonetheless, it is often useful to analyze experimental data with regression
analysis (see MW section 3.2.2; MHE ch. 2)
Assume that treatment effects are constant – i.e., Yi1 − Yi0 = δ ∀i
Substitute into a rearranged switching equation (Definition 2):
Yi = Di Yi1 + (1 − Di )Yi0
Yi = Yi0 + (Yi1 − Yi0 )Di
Yi = Yi0 + δDi
Yi = E [Yi0 ] + δDi + Yi0 − E [Yi0 ]
Yi = α + δDi + ηi
Yi = α + δDi + Xi0 γ + ηi
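A small simulated check (the sample size, the value δ = 2, and the distributions below are invented for illustration): with random assignment, regressing Y on D recovers the constant treatment effect.

* simulated randomized experiment with constant treatment effect delta = 2
clear
set seed 2
set obs 5000
generate d = runiform() < 0.5        // random assignment
generate y0 = rnormal(10, 3)         // potential outcome under control
generate y1 = y0 + 2                 // constant treatment effect
generate y = d*y1 + (1 - d)*y0       // switching equation
regress y d                          // coefficient on d estimates delta (SDO = ATE)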
A. Kindergarten (columns are successive specifications; standard errors in parentheses)
Small class: 4.82 (2.19), 5.37 (1.26), 5.36 (1.21), 5.37 (1.19), 4.82 (2.19), 5.37 (1.25), 5.36 (1.21)
Regular/aide class: .12 (2.23), .29 (1.13), .53 (1.09), .31 (1.07), .12 (2.23), .29 (1.13), .53 (1.09)
White/Asian (1 = yes): —, —, 8.35 (1.35), 8.44 (1.36), —, —, 8.35 (1.35)
Girl (1 = yes): —, —, 4.48 (.63), 4.39 (.63), —, —, 4.48 (.63)
Free lunch (1 = yes): —, —, −13.15 (.77), −13.07 (.77), —, —, −13.15 (.77)
White teacher: —, —, —, −.57 (2.10), —, —, —
Teacher experience: —, —, —, .26 (.10), —, —, —
Master's degree: —, —, —, −.51 (1.06), —, —, —
School fixed effects: No, Yes, Yes, Yes, No, Yes, Yes
R²: .01, .25, .31, .31, .01, .25, .31
B. First grade
Regression results 1st grade
Problem 1: Attrition
TABLE VI
Exploration of Effect of Attrition. Dependent variable: average percentile score on SAT.
Columns: Grade | Coefficient on small class dummy | Sample size | Coefficient on small class dummy | Sample size
Estimates of reduced-form models are presented. Each regression includes the following explanatory variables: a dummy variable indicating initial assignment to a small class; a dummy variable indicating initial assignment to a regular/aide class; unrestricted school effects; a dummy variable for student gender; and a dummy variable for student race. The reported coefficient on the small class dummy is relative to regular classes. Standard errors are in parentheses.
“It is virtually impossible to prevent some students from switching between class types over time.” (Krueger 1999, p. 506)

A. Kindergarten to first grade (rows: kindergarten assignment; columns: first-grade assignment)
           Small   Regular   Reg/aide    All
Small       1292        60         48   1400
Regular      126       737        663   1526
Aide         122       761        706   1589
All         1540      1558       1417   4515

B. First grade to second grade
If students with stronger expected academic potential were more likely to move into the small classes, then these transitions would bias a simple comparison of outcomes across class types.

From Krueger (1999), “B. Data and Standardized Tests”: “Students were tested at the end of March or beginning of April of each year. The tests consisted of the Stanford Achievement Test (SAT), which measured achievement in reading, word recognition, and math in grades K–3, and the Tennessee Basic Skills First (BSF) test, which measured achievement in reading and math in grades 1–3. The tests were tailored to each grade […]”
Problem 2: Switch Classrooms after Random Assignment
Hawthorne effects
People behave differently if they are being observed in an
experiment. Similar to “placebo effects” in that this is a false
positive result.
If they operate differently on treatment and control groups,
then they may introduce biases
If people from the control group behave differently, these
effects are sometimes called “John Henry” effects
Substitution bias
Control group members may seek substitutes for treatment
This would bias the estimated treatment effects downward.
Can you see why?
Can also occur if the experiment frees up resources that can
now be concentrated on the control group.
Overview
Part 1: The Core (regression fundamentals, randomized experiments, directed acyclical graphs)
Part 2: Selection on observables
Part 3: Selection on unobservables
Part 4: Advanced material
Conclusion
References
[DAG: Z → X → Y, with U → X and U → Y]
U is a parent of X and Y
X and Y are descendants of Z
There is a directed path from Z to Y
There are two paths from Z to U (but no directed path)
X is a collider of the path Z → X ← U
X is a noncollider of the path Z → X → Y
Confounding
[DAG: D → Y, with X → D and X → Y (X is a common cause of D and Y)]
Blocked backdoor paths
Examples:
1. Conditioning on a noncollider blocks a path: X – Z – Y (Z is a noncollider on the path between X and Y)
2. Conditioning on a collider opens a path: Z → X ← Y (X is a collider)
3. Not conditioning on a collider blocks a path: Z → X ← Y (the path is blocked by default)
Backdoor criterion
Conditioning on X satisfies the backdoor criterion with respect to (D, Y) if:
1. all backdoor paths from D to Y are blocked by X, and
2. no element of X is a collider.
In words: if X satisfies the backdoor criterion with respect to (D, Y), then matching on X identifies the causal effect of D on Y.
Matching on all common causes is sufficient.
There are two backdoor paths from D to Y
[DAG figures: backdoor paths from D to Y running through observed covariates X1, X2, X3 and unobservables U, U1, U2]
No confounding – D is identified
But what if we condition on X ? Now a backdoor path opens.
[DAG: D → Y with U1 → D, U1 → X, U2 → X, and U2 → Y; conditioning on the collider X opens the path D ← U1 → X ← U2 → Y]
Ultimately, we can’t know if we have a collider bias problem,
or whether we’ve satisfied the backdoor criterion, without a
model
There’s no getting around it – all empirical work requires theory to guide the work. Otherwise how do you know whether you are conditioning on a noncollider or a collider?
Put differently, you cannot identify treatment effects without
making assumptions about the process that generated your
data
Simple example of collider bias
Movie Star
Talent Beauty
STATA code
clear all
set seed 3444
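The code on the slide stops after setting the seed. A minimal sketch of the simulation it sets up (the sample size, the 85th-percentile cutoff for stardom, and the plotting commands are assumptions, not taken from the slide): talent and beauty are drawn independently, but conditioning on being a movie star (a collider) induces a negative correlation between them.

set obs 2500
generate beauty = rnormal()
generate talent = rnormal()
generate score = beauty + talent
egen c85 = pctile(score), p(85)           // stars: top 15% of beauty + talent
generate star = (score >= c85)
corr beauty talent                         // roughly zero in the full sample
corr beauty talent if star == 1            // negative among stars: collider bias
twoway (scatter beauty talent), by(star)   // scatter plots by movie-star status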
[Figure: scatter plots of beauty (vertical axis) against talent (horizontal axis), by movie-star status]
Figure: Top left: non-star sample scatter plot of beauty (vertical axis) and talent (horizontal axis). Top right: star sample scatter plot of beauty and talent. Bottom left: entire sample (stars and non-stars combined) scatter plot of beauty and talent.
Final Remarks
Textbook readings
Read MW chapter 3-4
Read MHE chapter 3
Article readings
See website under Matching estimation (Job Trainings Papers)
GERRY B. HILL, WAYNE MILLAR and JAMES CONNELLY
Figure 1: Lung Cancer at Autopsy: Combined Results from 18 Studies (horizontal axis: year, 1860–1950; series: observed and fitted)
"The
TheGreat Debate"
Registrar General of England and Wales began publishing the num- 371
bers of deaths for specific cancer sites in 1911.W The death rates for can-
cer of the lung from 1911 to 1955 were
Figure 2(a)published by Percy Stocks.26The
rates increased exponentially
Mortality overof
from Cancer thetheperiod:
Lung in10% per year in males
Males
and 6% per year in females. Canadian rates for the period 1931-52 were
Rate per 100,000
published
120 by A. J. Phillips.27 The rates were consistently lower in Canada
than in- England and Wales, but also increased exponentially at 8% per
l00
year in males and 4% per year in females.
The
80 -British and Canadian rates are shown in Figure 2. The rates (a) for
males, and (b) for females have been age-standardized,28 and the trends
6 0 - to 1990, using data published by Richard Peto and colleagues, 29
extended
and40by- Statistics Canada.30In British males the rates reached a maxi-
I mum in the mid-1970's and then declined. In Canadian males the initial
rise 20
was- more prolonged, reaching a maximum in 1990.Among females
the age-standardized rates continue to climb in both countries, the rise
0
beingl910
steeper in Canada
1920 1930 1940than in Britain.
1960 1960 1970 l980 l990 2000
The fact that mortality was lower Yearat first in Canada than in Britain
may be explained by the difference in smoking in the two countries.
-England
Percy Stocks31 cited data on the+
& Wales annual
Canadaconsumption
+ United per adult of ciga-
Kingdom
rettes in various countries between 1939 and 1957. In 1939 the con-
[…] increases with the amount smoked.

Figure 4: Smoking and Lung Cancer, Case-control Studies. Odds ratios by amount smoked (less than 20, 20 or more, all), for males and females, with weighted means.

Cohort Studies

Cohort studies, though less prone to bias, are much more difficult to perform than case-control studies, since it is necessary to assemble many thousands of individuals, determine their smoking status, and follow them up for several years to determine how many develop lung cancer. Four such studies were mounted in the 1950s. The subjects used were British doctors, United States veterans, Canadian veterans, and volunteers assembled by the American Cancer Society. All four used mortality as the end-point.

Figure 5 shows the combined mortality ratios for cancer of the lung in males by level of cigarette smoking. Two of the studies involved females, but the numbers of lung cancer deaths were too small to provide precise estimates. Since all causes of death were recorded in the cohort studies it was possible to determine the relationship between smoking and diseases other than lung cancer. Significant associations were found in relation to several types of cancer (e.g. mouth, pharynx, larynx, esophagus, bladder) and with chronic respiratory disease and cardiovascular disease.

Figure 5: Smoking and Lung Cancer, Cohort Studies in Males. Mortality ratios by level of cigarette smoking.
Criticisms from Joseph Berkson, Jerzy Neyman and Ronald Fisher: (Hill, Millar
and Connelly 2003)
1 Correlation b/w smoking and lung cancer is spurious due to biased selection of
subjects (e.g., conditioning on collider problem)
2 Functional form complaints about using “risk ratios” and “odds ratios”
3 Confounder, Z , creates backdoor path between smoking and cancer
4 Implausible magnitudes
5 No experimental evidence to incriminate smoking as a cause of lung cancer
Fisher’s confounding theory
Fisher, equally famous as a geneticist, argued from logic, statistics and genetic
evidence for a hypothetical confounding genome, Z , and therefore smokers and
non-smokers were not exchangeable (violation of independence assumption)
Other studies showed that cigarette smokers and non-smokers were different on
observables – more extraverted than non-smokers and pipe smokers, differed in
age, differed in income, differed in education, etc.
Uh, I thought you told me Fisher was smart?
Older people die at a higher rate, and for reasons other than
just smoking cigars
Maybe cigar smokers' higher observed death rates arise because they are older on average
Subclassification
One way to think about the problem is that the covariates are
not balanced – their mean values differ for treatment and
control group. So let’s try to balance them.
Subclassification (also called stratification): Compare
mortality rates across the different smoking groups within age
groups so as to neutralize covariate imbalances in the
observed sample
Subclassification
Question: What would the average mortality rate be for pipe smokers
if they had the same age distribution as the non-smokers?
Subclassification: example
Question: What would the average mortality rate be for pipe smokers
if they had the same age distribution as the non-smokers?
15 · (29/40) + 35 · (9/40) + 50 · (2/40) = 21.2
Table: Adjusted death rates using 3 age groups (Cochran 1968)
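A sketch of the same subclassification adjustment in Stata, using only the numbers from the slide's calculation (age-group death rates 15, 35, 50 for pipe smokers, weighted by the non-smokers' age counts 29, 9, 2):

* age-standardized mortality rate for pipe smokers, using the
* non-smokers' age distribution as weights
clear
input death_rate nonsmoker_N
15 29
35  9
50  2
end
quietly summarize nonsmoker_N
generate weight = nonsmoker_N / r(sum)
generate contrib = death_rate * weight
quietly summarize contrib
display "adjusted death rate = " r(sum)    // 21.2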
Definition: Outcomes
Those variables, Y , that are (possibly) not predetermined are
called outcomes (for some individual i, Yi0 ≠ Yi1)
Subclassification
Matching
Propensity score methods
Regression
Identification under independence
(Y0, Y1) ⊥⊥ D
and therefore:
E[Y|D = 1] − E[Y|D = 0] = E[Y1|D = 1] − E[Y0|D = 0]   (by the switching equation)
= E[Y1] − E[Y0]   (by independence)
= E[Y1 − Y0]   (ATE)
E [Y 1 − Y 0 ] = E [Y 1 − Y 0 |D = 1]
Identification under conditional independence
Identification assumptions:
1. (Y1, Y0) ⊥⊥ D|X (conditional independence)
2. 0 < Pr(D = 1|X) < 1 with probability one (common support)
Identification result:
Given assumption 1:
E[Y1 − Y0|X] = E[Y1 − Y0|X, D = 1]
= E[Y|X, D = 1] − E[Y|X, D = 0]
Given assumption 2:
δATE = E[Y1 − Y0]
= ∫ E[Y1 − Y0|X, D = 1] dPr(X)
= ∫ (E[Y|X, D = 1] − E[Y|X, D = 0]) dPr(X)
Identification under conditional independence
Identification assumptions:
1. (Y1, Y0) ⊥⊥ D|X (conditional independence)
2. 0 < Pr(D = 1|X) < 1 with probability one (common support)
Identification result:
Similarly,
δATT = E[Y1 − Y0|D = 1]
= ∫ (E[Y|X, D = 1] − E[Y|X, D = 0]) dPr(X|D = 1)
Question: What is δ̂ATE = Σ_{k=1}^K (Y^{1,k} − Y^{0,k}) · (N^k/N)?
Subclassification by Age (K = 2)
Question: What is δ̂ATE = Σ_{k=1}^K (Y^{1,k} − Y^{0,k}) · (N^k/N)?
4 · (13/30) + 6 · (17/30) = 5.13
Subclassification by Age (K = 2)
Question: What is δ̂ATT = Σ_{k=1}^K (Y^{1,k} − Y^{0,k}) · (N_T^k/N_T)?
Subclassification by Age (K = 2)
Question: What is δ̂ATT = Σ_{k=1}^K (Y^{1,k} − Y^{0,k}) · (N_T^k/N_T)?
4 · (3/10) + 6 · (7/10) = 5.4
Subclassification by Age and Gender (K = 4)
Problem: What is δ̂ATE = Σ_{k=1}^K (Y^{1,k} − Y^{0,k}) · (N^k/N)?
Subclassification by Age and Gender (K = 4)
Problem: What is δ̂ATE = Σ_{k=1}^K (Y^{1,k} − Y^{0,k}) · (N^k/N)?
Not identified!
Subclassification by Age and Gender (K = 4)
Question: What is δ̂ATT = Σ_{k=1}^K (Y^{1,k} − Y^{0,k}) · (N_T^k/N_T)?
Subclassification by Age and Gender (K = 4)
Question: What is δ̂ATT = Σ_{k=1}^K (Y^{1,k} − Y^{0,k}) · (N_T^k/N_T)?
4 · (3/10) + 5 · (3/10) + 6 · (4/10) = 5.1
Curse of Dimensionality
Potential Outcomes
unit i   Yi1 (under Treatment)   Yi0 (under Control)   Di   Xi
1              6                       ?                1    3
2              1                       ?                1    1
3              0                       ?                1   10
4              ·                       0                0    2
5              ·                       9                0    3
6              ·                       1                0   -2
7              ·                       1                0   -4

Question: What is δ̂ATT = (1/NT) Σ_{Di=1} (Yi − Yj(i))?
Matching example with single covariate
Potential Outcomes
unit i   Yi1 (under Treatment)   Yi0 (under Control)   Di   Xi
1              6                       ?                1    3
2              1                       ?                1    1
3              0                       ?                1   10
4              ·                       0                0    2
5              ·                       9                0    3
6              ·                       1                0   -2
7              ·                       1                0   -4

Question: What is δ̂ATT = (1/NT) Σ_{Di=1} (Yi − Yj(i))?
Match and plug in!
Matching example with single covariate
Potential Outcomes
unit i   Yi1 (under Treatment)   Yi0 (under Control)   Di   Xi
1              6                       9                1    3
2              1                       0                1    1
3              0                       9                1   10
4              ·                       0                0    2
5              ·                       9                0    3
6              ·                       1                0   -2
7              ·                       1                0   -4

Question: What is δ̂ATT = (1/NT) Σ_{Di=1} (Yi − Yj(i))?

δ̂ATT = (1/3)·(6 − 9) + (1/3)·(1 − 0) + (1/3)·(0 − 9) = −3.7
A Training Example

Each trainee is matched to the nearest-age non-trainee (ties averaged), building the matched sample column by column:

Trainees                     Non-Trainees                 Matched Sample
unit  age  earnings          unit  age  earnings          unit   age  earnings
1     28   17700             1     43   20900             8      28    8800
2     34   10200             2     50   31000             14     34   24200
3     29   14400             3     30   21000             17     29    6200
4     25   20800             4     27    9300             15     25   23300
5     29    6100             5     54   41100             17     29    6200
6     23   28600             6     48   29800             20     23    9500
7     33   21900             7     39   42000             10     33   15500
8     27   28800             8     28    8800             4      27    9300
9     31   20300             9     24   25500             12     31   26600
10    26   28100             10    33   15500             11,13  26    8450
11    25    9400             11    26     400             15     25   23300
12    27   14300             12    31   26600             4      27    9300
13    29   12500             13    26   16500             17     29    6200
14    24   19700             14    34   24200             9,16   24   17700
15    25   10100             15    25   23300             15     25   23300
16    43   10700             16    24    9700             1      43   20900
17    28   11500             17    29    6200             8      28    8800
18    27   10700             18    35   30200             4      27    9300
19    28   16300             19    32   17800             8      28    8800
Avg:  28.5 16426             20    23    9500             Avg:   28.5 13982
                             21    32   25900
                             Avg:  33   20724
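The matched-sample construction above can be automated. A sketch with Stata's teffects, where train, re78, and age are hypothetical variable names standing in for the table's treatment indicator, earnings, and age:

* nearest-neighbor matching on age (one match per treated unit), ATT
teffects nnmatch (re78 age) (train), atet nneighbor(1)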
Age Distribution: Before Matching
[Figure: histograms of age (20 to 60) for trainees (panel A) and non-trainees (panel B)]
Age Distribution: After Matching
[Figure: histograms of age (20 to 60) for trainees (panel A) and matched non-trainees (panel B)]
Training Effect Estimates
After matching:
where
V̂⁻¹ = diag(1/σ̂1², 1/σ̂2², ..., 1/σ̂k²)
Thus, if there are changes in the scale of Xni, these changes also affect σ̂n², and the normalized Euclidean distance does not change.
Mahalanobis distance
where Σ̂X is the sample variance-covariance matrix of X.
where each i and j(i) units are matched, Xi ≈ Xj(i) and Dj(i) = 0.
Define potential outcomes and switching eq.
µ0 (x) = E [Y |X = x, D = 0] = E [Y 0 |X = x],
µ1 (x) = E [Y |X = x, D = 1] = E [Y 1 |X = x],
Yi = µDi (Xi ) + εi
Substitute and distribute terms
δ̂ATT = (1/NT) Σ_{Di=1} [(µ1(Xi) + εi) − (µ0(Xj(i)) + εj(i))]
= (1/NT) Σ_{Di=1} (µ1(Xi) − µ0(Xj(i))) + (1/NT) Σ_{Di=1} (εi − εj(i))
Deriving the matching bias
When matching on Xi is not exact, each matched pair contributes the discrepancy
µ0(Xi) − µ0(Xj(i))
to the bias.
Bias-corrected (BC) matching:
δ̂BC_ATT = (1/NT) Σ_{Di=1} [(Yi − Yj(i)) − (µ̂0(Xi) − µ̂0(Xj(i)))]
Potential Outcomes
unit i   Yi1 (under Treatment)   Yi0 (under Control)   Di   Xi
1             10                      8                 1    3
2              4                      1                 1    1
3             10                      9                 1   10
4              ·                      8                 0    4
5              ·                      1                 0    0
6              ·                      9                 0    8

δ̂ATT = (10 − 8)/3 + (4 − 1)/3 + (10 − 9)/3 = 2
Bias adjustment in matched data
Potential Outcomes
unit i   Yi1 (under Treatment)   Yi0 (under Control)   Di   Xi
1             10                      8                 1    3
2              4                      1                 1    1
3             10                      9                 1   10
4              ·                      8                 0    4
5              ·                      1                 0    0
6              ·                      9                 0    8

δ̂ATT = (10 − 8)/3 + (4 − 1)/3 + (10 − 9)/3 = 2

For the bias correction, estimate µ̂0(X) = β̂0 + β̂1 X = 2 + X (a linear regression of Y on X among the control units).
is valid.
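Carrying the correction through for the three matched pairs above (a worked check, using the matches implied by the filled-in Yi0 column and µ̂0(X) = 2 + X):

* bias-corrected ATT: subtract muhat0(Xi) - muhat0(Xj(i)) from each matched difference
display ((10-8) - ((2+3)-(2+4)))/3 + ((4-1) - ((2+1)-(2+0)))/3 + ((10-9) - ((2+10)-(2+8)))/3

This evaluates to (3 + 2 − 1)/3 = 4/3 ≈ 1.33, compared with the uncorrected δ̂ATT = 2.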
Large sample distribution for matching estimators
Overview
What do we use a propensity score for?
How do we construct the propensity score?
How do we implement propensity score estimation in STATA?
Discuss several articles using propensity score matching
Joke (sort of. . . )
(1/NT) Σ_{Di=1} Yi − (1/NC) Σ_{Di=0} Yi
This simple difference in means identifies the average treatment effect under "ignorability," which holds when we have randomized treatment assignment.
But what if ignorability is violated?
Propensity score matching relies on a weaker assumption, conditional independence, which we will discuss later.
OLS
Y = δD + βX + ε
Identification Assumptions:
1. (Y0, Y1) ⊥⊥ D|X (conditional independence)
2. 0 < Pr(D = 1|X) < 1 (common support)
Identifying assumption I: Conditional independence (Rosenbaum and Rubin 1983)
(Yi0, Yi1) ⊥⊥ D|Xi. There exists a set X of observable covariates such that, after controlling for these covariates, treatment assignment is independent of potential outcomes.
Yi0 = α + β′Xi + ε, and
Yi1 = Yi0 + δ
Pr(D = 1|Y0, Y1, p(X)) ≠ f(Y0, Y1)
because
Three-step procedure:
1 Estimate the conditional probability of treatment, or
propensity score, p(X ) = Pr (D = 1|X ), using any standard
probability model (logit or probit)
2 Do matching, subclassification (stratification), inverse probability weighting, or some other algorithmic procedure to estimate the average causal effect conditional on the estimated propensity score
3 Compute standard errors
Balancing property
[DAG with D, Y, X, and the propensity score p(X): conditioning on p(X) balances X across treatment and control]
Inverse Probability Weighting
Proposition
If (Y1, Y0) ⊥⊥ D|X, then
δATE = E[Y1 − Y0] = E[ Y · (D − p(X)) / (p(X) · (1 − p(X))) ]
δATT = E[Y1 − Y0|D = 1] = (1/Pr(D = 1)) · E[ Y · (D − p(X)) / (1 − p(X)) ]

Proof.
E[ Y (D − p(X)) / (p(X)(1 − p(X))) | X ] = E[ Y/p(X) | X, D = 1 ] · p(X) + E[ −Y/(1 − p(X)) | X, D = 0 ] · (1 − p(X))
= E[Y|X, D = 1] − E[Y|X, D = 0]
and the results follow from integrating over P(X) and P(X|D = 1).
Weighting on the propensity score
δATE = E[Y1 − Y0] = E[ Y · (D − p(X)) / (p(X) · (1 − p(X))) ]
δATT = E[Y1 − Y0|D = 1] = (1/Pr(D = 1)) · E[ Y · (D − p(X)) / (1 − p(X)) ]
Standard errors:
We need to adjust the standard errors for first-step estimation
of p(X )
Parametric first step: Newey and McFadden (1994)
Non-parametric first step: Newey (1994)
Or bootstrap the entire two-step procedure.
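A sketch of the ATE weighting estimator in Stata, echoing the NSW/CPS example used later in the deck (treated, re78, and the covariates are placeholder variable names):

* inverse probability weighting with an estimated propensity score
quietly probit treated age black hispanic married educ nodegree re75
predict double pscore, pr
* ATE: average of Y * (D - p(X)) / (p(X)(1 - p(X)))
generate double ipw_term = re78 * (treated - pscore) / (pscore * (1 - pscore))
quietly summarize ipw_term
display "IPW estimate of ATE = " r(mean)
* bootstrap the whole two-step procedure to obtain valid standard errors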
Other algorithmic methods
where
C(i) is the set of neighbors with W = 0 of the treated unit i, and
wij is the weight placed on control unit j ∈ C(i), with Σ_{j∈C(i)} wij = 1
Estimation (cont.)
We estimate it as follows
ÂTT = (1/NT) Σ_{i:Wi=1} (Yi − Yj(i))
Each treatment unit i is matched only with the control group unit
j whose propensity score falls into a predefined neighborhood of
the propensity score of the treatment unit.
All the control units with ρbj falling within a radius r from ρbi
are matched to the treatment unit i
The smaller the radius, the better the quality of the matches,
but the higher the possibility some treatment units are not
matched because the neighborhood does not contain control
group units j
Checking the common support assumption
                    CPS (All)          CPS Controls     NSW Trainees
                    mean   (s.d.)      mean             mean            t-stat    diff
                                       Nc = 15,992      Nt = 297
Black               0.09   (0.28)      0.07             0.80            47.04     -0.73
Hispanic            0.07   (0.26)      0.07             0.09             1.47     -0.02
Age                33.07  (11.04)     33.2             24.63            13.37      8.6
Married             0.70   (0.46)      0.71             0.17            20.54      0.54
No degree           0.30   (0.46)      0.30             0.73            16.27     -0.43
Education          12.0    (2.86)     12.03            10.38             9.85      1.65
1975 Earnings      13.51   (9.31)     13.65             3.1             19.63     10.6
1975 Unemp          0.11   (0.32)      0.11             0.37            14.29     -0.26
Dehejia and Wahba (1999)
X ⊥⊥ D|p(X)
. qui probit treated age black hispanic married educ nodegree re75
. margins, dydx(_all)
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0035844 .000462 -7.76 0.000 -.0044899 -.002679
black | .0766501 .0088228 8.69 0.000 .0593577 .0939426
hispanic | .0831734 .0157648 5.28 0.000 .0522751 .1140718
married | -.0850743 .0070274 -12.11 0.000 -.0988478 -.0713009
education | .0003458 .0023048 0.15 0.881 -.0041716 .0048633
nodegree | .0418875 .0108642 3.86 0.000 .0205942 .0631809
re75 | -6.89e-06 5.89e-07 -11.71 0.000 -8.04e-06 -5.74e-06
------------------------------------------------------------------------------
[Figure: distribution of the estimated propensity score, Pr(treated), by treatment status (treated = 1 vs. treated = 0), and overlap of the propensity score for untreated and treated units]
. // compute nearest neighbor matching with caliper and replacement
. psmatch2 treated, pscore(score) outcome(re78) caliper(0.01)
There are observations with identical propensity score values.
The sort order of the data could affect your results.
Make sure that the sort order is random before calling psmatch2.
----------------------------------------------------------------------------------------
Variable Sample | Treated Controls Difference S.E. T-stat
----------------------------+-----------------------------------------------------------
re78 Unmatched | 5976.35202 21553.9209 -15577.5689 913.328457 -17.06
ATT | 6067.8117 5758.47686 309.334834 1080.935 0.29
----------------------------+-----------------------------------------------------------
Note: S.E. does not take into account that the propensity score is estimated.
----------------------------------------------------------------------------
| Mean %reduct | t-test
Variable Sample | Treated Control %bias |bias| | t p>|t|
------------------------+----------------------------------+----------------
age Unmatched | 24.626 34.851 -116.6 | -16.48 0.000
Matched | 25.052 25.443 -4.5 96.2 | -0.61 0.545
| |
black Unmatched | .80135 .2506 132.1 | 20.86 0.000
Matched | .78967 .78967 0.0 100.0 | -0.00 1.000
| |
hispanic Unmatched | .09428 .03253 25.5 | 5.21 0.000
Matched | .09594 .08856 3.0 88.0 | 0.30 0.767
| |
married Unmatched | .16835 .86627 -194.9 | -33.02 0.000
Matched | .1845 .14022 12.4 93.7 | 1.40 0.163
| |
education Unmatched | 10.38 12.117 -68.6 | -9.51 0.000
Matched | 10.465 10.166 11.8 82.8 | 1.54 0.125
| |
nodegree Unmatched | .73064 .30522 94.0 | 15.10 0.000
Matched | .71587 .69373 4.9 94.8 | 0.56 0.573
| |
re75 Unmatched | 3066.1 19063 -156.6 | -20.12 0.000
Matched | 3197.4 3307.8 -1.1 99.3 | -0.28 0.778
| |
----------------------------------------------------------------------------
[Figure: standardized percentage bias across covariates (black, nodegree, hispanic, education, age, re75, married), unmatched vs. matched]
Kernel matching
psmatch2: Treatment assignment | On common support | Total
Untreated                      |             2,490 | 2,490
Treated                        |               297 |   297
Total                          |             2,787 | 2,787
. //evaluate quality of matching
. pstest2 age black hispanic married educ nodegree re75, sum graph
----------------------------------------------------------------------------
| Mean %reduct | t-test
Variable Sample | Treated Control %bias |bias| | t p>|t|
------------------------+----------------------------------+----------------
age Unmatched | 24.626 34.851 -116.6 | -16.48 0.000
Matched | 24.626 24.572 0.6 99.5 | 0.09 0.926
| |
black Unmatched | .80135 .2506 132.1 | 20.86 0.000
Matched | .80135 .81763 -3.9 97.0 | -0.50 0.614
| |
hispanic Unmatched | .09428 .03253 25.5 | 5.21 0.000
Matched | .09428 .08306 4.6 81.8 | 0.48 0.631
| |
married Unmatched | .16835 .86627 -194.9 | -33.02 0.000
Matched | .16835 .1439 6.8 96.5 | 0.82 0.413
| |
education Unmatched | 10.38 12.117 -68.6 | -9.51 0.000
Matched | 10.38 10.238 5.6 91.8 | 0.81 0.415
| |
nodegree Unmatched | .73064 .30522 94.0 | 15.10 0.000
Matched | .73064 .72101 2.1 97.7 | 0.26 0.793
| |
re75 Unmatched | 3066.1 19063 -156.6 | -20.12 0.000
Matched | 3066.1 3905.8 -8.2 94.8 | -1.99 0.047
| |
----------------------------------------------------------------------------
[Figure: standardized percentage bias across covariates (black, nodegree, hispanic, education, age, re75, married), unmatched vs. matched]
Matching vs. propensity score (excerpt from the empirical application):

"The normalized difference provides a scale-free measure of the difference in the location of the two distributions, and is useful for assessing the degree of difficulty in adjusting for differences in covariates. Panel A contains the results for pretreatment variables and Panel B for outcomes. Notice the large differences in background characteristics between the program participants and the PSID sample. This is what makes drawing causal inferences […]"

"[…] on the estimated propensity score. The covariate matching estimators use the matrix A_ne (the diagonal matrix with inverse sample variances on the diagonal) as the distance measure. Because we are focused on the average effect for the treated, the bias correction only requires an estimate of µ0(Xi). We estimate this regression function using linear regression on all nine pretreatment covariates in Table 1, panel A, but do not include any higher order terms or interactions, with only the control units that are used as a match [the units j such that Wj = 0 and […]"
Recently there has been an increase in a particular type of research design known as
regression discontinuity design (RDD). Cook (2008) has a fascinating history of
thought on how and why.
Donald Campbell is the originator of regression discontinuity design. First
study is Thistlethwaite and Campbell (1960). Merit awards were given to
students whose test scores were over some cutoff point. They compared award
winners to non-winners just around the threshold to identify the causal effect of
merit awards on future academic outcomes.
Pictures are helpful for understanding the RDD research design. Tell me what
you think these are saying.
RDD Visual Example

[Figure: local averages of the outcome against SAT points above the admission cutoff (-300 to 350)]

V. Results
A. Earnings Discontinuities at the Admission Cutoff
To the extent that there are economic returns to attending the flagship state university, one should observe a discontinuity in earnings at the admission cutoff. This is shown for white men in figure 2, which shows a regression of residual earnings on a cubic polynomial of adjusted SAT score. Table 1 shows the discontinuity estimates that result from varying functional form […]

Figure 2. Natural log of annual earnings for white men ten to fifteen years after high school graduation (fit with a cubic polynomial of adjusted SAT score). Vertical axis: (residual) natural log of earnings; horizontal axis: SAT points above the admission cutoff.
TABLE I. Unweighted Descriptive Statistics

A. Full sample
5th grade (2019 classes, 1002 schools, tested in 1991)
                         Mean    S.D.    Quantiles
Class size               29.9    6.5     21    26    31    35    38
Enrollment               77.7   38.8     31    50    72   100   128
Percent disadvantaged    14.1   13.5      2     4    10    20    35
Reading size             27.3    6.6     19    23    28    32    36
Math size                27.7    6.6     19    23    28    33    36
Average verbal           74.4    7.7   64.2  69.9  75.4  79.8  83.3
Average math             67.3    9.6   54.8  61.1  67.8  74.1  79.4

4th grade (2049 classes, 1013 schools, tested in 1991)
Class size               30.3    6.3     22    26    31    35    38
Enrollment               78.3   37.7     30    51    74   101   127
Percent disadvantaged    13.8   13.4      2     4     9    19    35
Reading size             27.7    6.5     19    24    28    32    36
Math size                28.1    6.5     19    24    29    33    36
Average verbal           72.5    8.0   62.1  67.7  73.3  78.2  82.0
Average math             68.9    8.8   57.5  63.6  69.3  75.0  79.4

B. Discontinuity sample (5th grade, 4th grade, 3rd grade)

Variable definitions are as follows: Class size = number of students in class in the spring; Enrollment = September grade enrollment; Percent disadvantaged = percent of students in the school from "disadvantaged backgrounds"; Reading size = number of students who took the reading test; Math size = number of students who took the math test; Average verbal = average composite reading score in the class; Average math = average composite math score in the class.
Class size function and enrollment size
fsc = es / (int((es − 1)/40) + 1) (90)
where es is the beginning-of-year enrollment in school s in a given grade
(e.g., 5th grade); fsc is class size assigned to class c in school s for that
grade; int(n) is the largest integer less than or equal to n
This equation captures the fact that Maimonides’ rule allows
enrollment cohorts of 1-40 to be grouped in a single class, but
enrollment cohorts of 41-80 are split into two classes of average size
20.5-40, enrollment cohorts of 81-120 are split into three classes of
average size 27-40, and so on.
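Equation (90) is easy to compute directly. A minimal Stata sketch, assuming the grade's beginning-of-year enrollment is stored in a variable named enrollment (an assumed name, not from the slides):
. * Maimonides' rule predicted class size; floor() plays the role of int(.)
. gen fsc = enrollment / (floor((enrollment - 1)/40) + 1)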
Class size function and enrollment size
The overall positive correlation between test scores and enrollment is partly
attributed to larger schools in Israel being geographically concentrated in
larger, more affluent cities (smaller schools in poorer “developmental towns”
outside the major urban centers)
They note that enrollment size and the PD index measuring the proportion of
students from disadvantaged backgrounds are negatively correlated
They control for the “trend” association between test scores and enrollment
size and plot the residuals from regressions of average scores and the average
of fsc on average enrollment and PD index for each interval
The estimates for fifth graders imply that a reduction in predicted class size of ten
students is associated with a 2.2 point increase in average reading scores – a
little more than one-quarter of a standard deviation in the distribution of class
averages
FIGURE III (Angrist and Lavy): Average test (reading/math) scores and predicted class size by enrollment, residuals from regressions on percent disadvantaged and enrollment, 5th grade.
Table (Angrist and Lavy), panels A (full sample) and B (discontinuity sample): the function fsc is equal to enrollment/[int((enrollment − 1)/40) + 1]. Standard errors are reported in parentheses and were corrected for correlation between classes. The unit of observation is the average score in the class.
Second step: use fitted values in grouped regression
ȳsc = Xs β + n̂sc δ + ηs + [µc + ε̄sc ] (94)
where n̂sc is the predicted class size from the previous (first-stage) regression
TABLE IV (Angrist and Lavy): 2SLS estimates for 1991 (fifth graders), full sample and discontinuity sample, columns (1)–(10). The unit of observation is the average score in the class. Standard errors (in parentheses) were corrected for within-school correlation. All estimates use fsc as an instrument for class size.
How large are their effects?
Question: So, where can I find these “jumps”? Answer: Humans are
embedding “jumps” in their rules all the time. Dumb rules – while usually bad
policy – are great for research.
Validity doesn’t require the assignment rule be arbitrary, only that it is known,
precise and free of manipulation. The most effective RDD studies involve
programs where X has a “hair trigger” that is not tightly related to the
outcome being studied. Examples:
Probability of being tried as an adult (higher penalties for a given crime) “jumps” at age 18
Probability of being arrested for DWI “jumps” at blood alcohol content >0.08
Probability of receiving universal healthcare insurance “jumps” at age 65
Probability of receiving medical attention “jumps” when birthweight falls below 1,500 grams
Probability of having to attend summer school “jumps” when grade falls below 60
Data requirements can be substantial. Large sample sizes are characteristic
features of the RDD
If there are strong trends, one typically needs a lot of data
Researchers are typically using administrative data or settings such as birth
records where there are many observations
More recent examples
Figure 2: Assignment in the sharp (dashed) and fuzzy (solid) RD design. Horizontal axis: selection variable S.
Sharp vs. Fuzzy RDD
Yi0 = α + βXi
Yi1 = Yi0 + δ
Continuity of conditional regression functions (Hahn, Todd and Van der Klaauw
2001; Lee 2008)
E [Yi0 |X = X0 ] and E [Yi1 |X = X0 ] are continuous (smooth) in X at X0 .
Yi = α + β(Xi − X0 ) + δDi + εi
This doesn’t change the interpretation of the treatment effect – only the
interpretation of the intercept.
Example: Medicare and age 65. Center the running variable (age) by
subtracting 65, so the model becomes Yi = α + β(agei − 65) + δDi + εi .
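A minimal Stata sketch of this centering, with hypothetical variable names (y, age, and a dummy D for being 65 or older); illustrative only, not code from the original slides:
. gen age_c = age - 65
. gen D = (age >= 65) if !missing(age)
. reg y D age_c              // common slope in age on both sides of 65
. reg y i.D##c.age_c         // lets the slope differ above and below 65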
Smoothness in E [Yi0 |Xi ] and linearity are different things. What if the trend relation
E [Yi0 |Xi ] does not jump at X0 but rather is simply nonlinear?
gen x2 = x*x
gen x3 = x*x*x
gen y = 10000 + 0*D - 100*x +x2 + rnormal(0, 1000)
Yi = f (Xi ) + δDi + ηi
To derive a regression model, first note that the observed values must be used
in place of the potential outcomes:
E [Y |X ] = E [Y 0 |X ] + (E [Y 1 |X ] − E [Y 0 |X ])D
Allowing a pth-order polynomial in Xi − X0 that differs on each side of the cutoff gives
Yi = α + β01 (Xi − X0 ) + β02 (Xi − X0 )2 + · · · + β0p (Xi − X0 )p + δDi + β1∗ Di (Xi − X0 ) + β2∗ Di (Xi − X0 )2 + · · · + βp∗ Di (Xi − X0 )p + εi
where β1∗ = β11 − β01 , β2∗ = β12 − β02 and βp∗ = β1p − β0p
The equation we looked at earlier a few slides back was just a special case of
the above equation with β1∗ = β2∗ = βp∗ = 0
The treatment effect at x0 is δ
The treatment effect at Xi − X0 = c > 0 is: δ + β1∗ c + β2∗ c 2 + · · · + βp∗ c p
Polynomial simulation example
capture drop y x2 x3
gen x2 = x*x
gen x3 = x*x*x
gen y = 10000 + 0*D - 100*x +x2 + rnormal(0, 1000)
reg y D x x2 x3
predict yhat
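A self-contained version of the simulation above, under assumed details the slides omit (sample size, seed, the distribution of the running variable x, and a cutoff at x = 0 for the treatment dummy D); the substantive point — a true treatment effect of zero with a nonlinear trend in x — is unchanged:
. clear
. set obs 1000
. set seed 1234
. gen x = rnormal(0, 50)            // assumed running variable
. gen D = (x >= 0)                  // assumed cutoff at zero
. gen x2 = x*x
. gen x3 = x*x*x
. gen y = 10000 + 0*D - 100*x + x2 + rnormal(0, 1000)
. reg y D x x2 x3
. predict yhat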
There is therefore systematic bias with the kernel method if f (X ) is upward or downward sloping.
Kernel Method - Local linear regression
Figure 1. Coverage by Any Insurance and by Two or More Policies, by Age and Demographic Group (fraction with coverage, by age in quarters, for high-ed whites, all groups, and low-ed minorities).
Estimation
C1ija = Xija βj1 + gj1 (a) + Da πj1 + v1ija
C2ija = Xija βj2 + gj2 (a) + Da πj2 + v2ija
where βj1 and βj2 are group-specific coefficients, gj1 (a) and gj2 (a) are smooth
age profiles for group j, and Da is a dummy for being age ≥ 65.
Reduced form from a couple slides back:
yija = Xija α + fj (a) + Σk Ckija δ k + uija
Substituting the coverage equations gives
yija = Xija (αj + βj1 δ 1 + βj2 δ 2 ) + hj (a) + Da πjy + vyija
where hj (a) = fj (a) + δ 1 gj1 (a) + δ 2 gj2 (a) is the reduced-form age profile for group j,
πjy = πj1 δ 1 + πj2 δ 2 , and vyija = uija + v1ija δ 1 + v2ija δ 2 is the error term.
Assuming that the profiles fj (a), gj1 (a) and gj2 (a) are continuous at age 65 ...
For some basic health care services (e.g., routine doctor visits), it may be that the
only thing that matters is having any insurance.
But, in those situations, the implied discontinuity in y at 65 for group j will be
proportional to the change in insurance status experienced by that group
For more expensive (or elective) services, the generosity of the coverage may
matter (for instance, if patients are unwilling to cover the required copay or if
the managed care program won’t cover the service).
This creates a potential identification problem in interpreting the discontinuity
in y for any one group
Since πjy is a linear combination of the discontinuities in coverage and
generosity, δ 1 and δ 2 can be estimated by a regression across groups:
Table 1—Insurance Characteristics Just before Age 65 and Estimated Discontinuities at Age 65
                         On Medicare        Any insurance      Private coverage   2+ forms coverage   Managed care
                         Age 63–64  RD      Age 63–64  RD      Age 63–64  RD      Age 63–64  RD       Age 63–64  RD
                                    at 65              at 65              at 65              at 65               at 65
                         (1)        (2)     (3)        (4)     (5)        (6)     (7)        (8)      (9)        (10)
Overall sample 12.3 59.7 87.9 9.5 71.8 22.9 10.8 44.1 59.4 228.4
14.12 10.62 11.12 12.82 12.12
Classified by ethnicity and education:
White non-Hispanic:
High school dropout 21.1 58.5 84.1 13.0 63.5 26.2 15.0 44.5 48.1 225.0
14.62 12.72 13.32 14.02 14.52
High school graduate 11.4 64.7 92.0 7.6 80.5 21.9 10.1 51.8 58.9 230.3
15.02 10.72 11.62 13.82 12.62
At least some college 6.1 68.4 94.6 4.4 85.6 22.3 8.8 55.1 69.1 240.1
14.72 10.52 11.82 14.02 12.62
Minority:
High school dropout 19.5 44.5 66.8 21.5 33.2 21.2 11.4 19.4 39.1 28.3
13.12 12.12 12.52 11.92 13.12
High school graduate 16.7 44.6 85.2 8.9 60.9 25.8 13.6 23.4 54.2 215.4
14.72 12.82 15.12 14.82 13.52
At least some college 10.3 52.1 89.1 5.8 73.3 25.4 11.1 38.4 66.2 222.3
14.92 12.02 14.32 13.82 17.22
Classified by ethnicity only:
White non-Hispanic 10.8 65.2 91.8 7.3 79.7 22.8 10.4 51.9 61.9 233.6
1all2 14.62 10.52 11.42 13.52 12.32
Black non-Hispanic 17.9 48.5 84.6 11.9 57.1 24.2 13.4 27.8 48.2 213.5
1all2 13.62 12.02 12.82 13.72 13.72
Hispanic 1all2 16.0 44.4 70.0 17.3 42.5 22.0 10.8 21.7 52.9 212.1
13.72 13.02 11.72 12.12 13.72
Note: Entries in odd-numbered columns are percentages of 63–64-year-olds in each group with the insurance characteristic
shown in the column heading. Entries in even-numbered columns are estimated regression discontinuities at age 65, from
models that include quadratic control for age, fully interacted with dummy for age 65 or older. Other controls include
indicators for gender, race/ethnicity, education, region, and sample year. Estimates are based on linear probability
models fit to pooled samples of 1999–2003 NHIS.
Medicare coverage rises by 60 percentage points at age 65, from a base level of 12 percent
among 63- 64-year-olds. Consistent with DI enrollment patterns (David H. Autor and Mark G.
Duggan 2003), Medicare enrollment prior to 65 is higher for minorities and people with below-
average schooling, and these groups experience relatively smaller gains at age 65 (see rows 2–7).
Other changes at age 65
Figure (Card, Dobkin, and Maestas 2008): Fraction employed by age, actual and predicted, for the overall sample, high-ed whites, and low-ed minorities.
The employment profiles, estimated separately by gender, show no large discontinuities at 65 for either men or women.
Effect of cutoff on access to care and utilization
Since 1997, NHIS has asked two questions: (1) “During the past 12 months has medical care been delayed
for this person because of worry about the cost?” and (2) “‘During the past 12 months was there any time
when this person needed medical care but did not get it because (this person) could not afford it?”
                       Delayed care       Did not get care   Saw doctor         Hospital stay
                       last year          last year          last year          last year
                       Age 63–64  RD      Age 63–64  RD      Age 63–64  RD      Age 63–64  RD
                                  at 65              at 65              at 65              at 65
                       (1)   (2)          (3)   (4)          (5)   (6)          (7)   (8)
Overall sample         7.2   −1.8         4.9   −1.3         84.8  1.3          11.8  1.2
                             (0.4)              (0.3)              (0.7)              (0.4)
Note: Entries in odd-numbered columns are the mean of the variable in the column heading among people ages 63–64. Entries in
even-numbered columns are estimated regression discontinuities at age 65, from models that include a linear control for
age interacted with a dummy for age 65 or older (columns 2 and 4) or a quadratic control for age, interacted with a dummy
for age 65 and older (columns 6 and 8). Other controls in models include indicators for gender, race/ethnicity, educa-
tion, region, and sample year. Sample in columns 1–4 is pooled 1997–2003 NHIS. Sample in columns 5–8 is pooled
1992–2003 NHIS. Samples for regression models include people ages 55–75 only. Standard errors (in parentheses) are
clustered by quarter of age.
Effect of cutoff on access to care and utilization
The right-hand columns of Table 3 present results for two key measures of healthcare utilization: (1) “Did
the individual have at least one doctor visit in the past year?” and (2) “Did the individual have one or more
overnight hospital stays in the past year?”
Changes in hospitalizations
Figure (Card, Dobkin, and Maestas 2008): Total hospital admissions and hospital admissions for hip and knee replacement (per person-years), actual and fitted, by age, shown for whites, Hispanics, and blacks.
Table 4—Hospital Admissions and Insurance Coverage at Age 65: California, Florida, and New York
                                    All                 Whites              Hispanics           Blacks
                                    Rate       RD       Rate       RD       Rate       RD       Rate       RD
                                    age 60–64  at 65    age 60–64  at 65    age 60–64  at 65    age 60–64  at 65
                                    (1)        (2)      (3)        (4)      (5)        (6)      (7)        (8)
Hospital admissions
All admissions 1,443 7.57 1,407 7.74 1,262 9.47 2,008 4.39
10.292 10.332 10.552 10.712
By route into hospital
ER admission 761 3.30 688 3.70 774 2.63 1,313 1.93
10.392 10.402 10.922 10.952
Non–ER admission 682 12.16 718 11.51 488 19.89 695 8.92
10.462 10.492 11.052 11.042
By admission diagnosis
Chronic ischemic heart disease 83 11.58 90 11.05 59 18.45 66 8.29
10.962 11.162 12.452 12.782
AMI 48 4.41 50 5.31 38 3.90 45 23.43
11.432 11.652 13.332 14.782
Heart failure 56 0.44 45 2.33 62 24.85 130 21.47
11.112 11.242 12.632 12.432
Chronic bronchitis 34 7.50 36 6.50 19 9.76 38 13.05
11.512 11.522 15.582 14.432
Osteoarthrosis 34 26.97 38 27.16 18 29.27 27 22.08
11.392 11.642 15.052 14.012
Pneumonia 34 2.44 32 2.05 30 3.39 51 3.81
11.422 11.742 14.342 13.212
By primary procedure
None 419 5.70 400 5.73 388 7.23 614 3.86
10.332 10.402 11.232 11.252
Diagnostic procedures on heart 51 9.18 53 8.17 40 16.78 58 8.76
11.032 11.202 13.212 13.322
Removal of coronary artery obstruction 38 10.67 43 10.49 23 18.77 22 0.49
11.462 11.602 13.942 15.332
Bypass anastomosis of heart 26 15.91 29 16.17 17 18.97 13 5.15
11.392 11.442 15.622 16.092
Joint replacement lower extremity 41 22.69 46 23.16 22 26.40 33 12.14
11.472 11.602 14.692 14.202
Diagnostic procedure on small intestine 35 7.35 31 6.60 37 13.07 58 4.09
11.272 11.472 13.272 13.132
Cholecystectomy 1gall bladder removal2 26 17.93 26 16.00 29 29.25 18 12.27
12.102 11.842 15.112 17.502
Insurance coverage
Probability of coverage 82.7 15.0 86.7 12.7 69.1 20.3 79.0 17.6
1March CPS data2 10.82 10.82 12.72 12.72
Notes: Insurance estimates are based on pooled March CPS 1996–2004 data for California, Florida, and New York.
Entries in the top row, columns 1, 3, 5, 7, are fractions of 60–64-year-olds with insurance coverage. Entries in the top row,
columns 2, 4, 6, 8, represent regression discontinuity estimates (×100) of the increase in coverage at age 65 from a
model with a quadratic in age, fully interacted with a post-65 dummy. Entries in lower rows, columns 1, 3, 5, 7, are hospi-
tal admission rates (per 10,000 person-years for 60- to 64-year-olds) for California, Florida, and New York 1992–2002.
Entries in lower rows, columns 2, 4, 6, 8, are regression discontinuity estimates (×100) of the increase in the log of the
number of admissions at age 65, from models with a quadratic in age, fully interacted with a post-65 dummy. Standard
errors are in parentheses.
Changes in ownership
Figure (Card, Dobkin, and Maestas 2008): Admissions by hospital ownership type (private nonprofit, church-run, private for-profit, Kaiser, hospital district, county), by age.
Notes: Each entry in panel A, column 1, is estimated coefficient from regression of RDs in listed health outcome on
RDs in insurance coverage over six ethnicity/education groups (rows 1–4) or nine state-ethnicity groups (rows 5–8).
All regressions weighted by the inverse sampling variance of the estimated discontinuity in each outcome, and regres-
sions in rows 5–8 include state dummies. Entries in column 2 are corresponding R-squared coefficients from each
regression. Entries in columns 3, 4, and 5 are the observed disparities in each health outcome at ages 63–64, and entries
in columns 6, 7, and 8 are the percent change in the disparity attributable to the change in insurance coverage based on
the coefficient in column 1. Health disparities measured in the NHIS are characterized in terms of low-ed minorities
versus hi-ed whites, whereas health disparities measured in the hospital discharge data are characterized in terms of
black-white or hispanic-white differences. Panel B is similar to panel A except that the RDs in each health outcome are
regressed on the RDs in the incidence of multiple coverage at 65. Panel B regressions are based on data for New York
and Florida only (i.e., six state-ethnicity groups).
Summary of Card, et al. (2009)
Log number of unplanned ED admissions, actual and fitted, by age from 60 to 70.
FIGURE II
Number of Admissions by Route into Hospital, California, 1992–2002
The lines are fitted values from regressions that include a second-order polyno-
mial in age fully interacted with a dummy for age ≥ 65 and a dummy variable for
the month before people turn 65. The dependent variable is the log of the number
of admissions by patient’s age (in days) at admission, for patients between 60 and
70 years of age. The count of admissions is based on hospital discharge records for
California and includes admissions from January 1, 1992, to November 30, 2002.
The points represent means of the dependent variable for 30-day cells. The age
profile for unplanned ED admissions includes admissions that occurred through
the emergency department and were unplanned. The category “Other Admissions”
includes all other admissions.
FIGURE IV
Primary Insurance Coverage of Admitted Patients
See notes for Figure II. In this figure the y-axis represents the fraction of
patients with different classes of primary insurance coverage. Sample includes
425,315 patients with nondeferrable primary diagnoses, defined as unplanned
admissions through the emergency department for diagnoses with a t-statistic for
the test of equal weekday and weekend admission rates of 0.965 or less. Medicare
eligibility status of patients within one month of their 65th birthdays is uncertain
and we have excluded these observations.
Impact of Medicare on type of coverage
TABLE III
Age over 65 (×100) 43.9 47.5 −24.8 −26.8 −10.1 −10.8 −7.4 −8.0
(0.4) (0.4) (0.4) (0.4) (0.3) (0.3) (0.2) (0.2)
Additional controls No Yes No Yes No Yes No Yes
Mean of dependent variable 24.0 43.3 43.3 9.7
for patients aged 64–65 (×100)
Notes. Standard errors in parentheses. Dependent variable is indicator for type of insurance listed as “primary insurer” on discharge record. Sample includes 425,315 observations
on patients between the ages of 60 and 70 admitted to California hospitals between January 1, 1992, and November 30, 2002 for an unplanned admission through the emergency
department, with a diagnosis (ICD-9) for which the t-test for equality of weekend and weekday admission rates is less than 0.96 in absolute value. All models include second-order
polynomial in age (in days) fully interacted with dummy for age over 65 and are fit by OLS. Models in even-numbered columns include the following additional controls: a dummy for
people who are within one month of their 65th birthday; dummies for month, year, sex, race/ethnicity, and admission on Saturday or Sunday; and a complete set of unrestricted fixed
effects for each ICD-9 admission diagnosis. In columns (1)–(8) the coefficient on “age over 65” and its standard error have been multiplied by 100.
Log list charges, length of stay (days), and number of procedures, actual and fitted, by age at admission from 60 to 69.
FIGURE V
Three Measures of Inpatient Treatment Intensity
See notes to Figure IV. Sample includes unplanned admissions through the
emergency department for diagnoses with a t-statistic for the test of equal weekday
and weekend admission rates of 0.965 or less. In this figure the sample is further
restricted to patients with valid SSNs (407,386 observations). Sample for log list
charges excludes patients admitted to Kaiser hospitals. Length of stay, number of
procedures, and list charges are cumulated over all consecutive hospitalizations.
List charges are measured in 2002 dollars.
Treatment intensity
TABLE IV
REGRESSION DISCONTINUITY MODELS FOR CHANGES IN TREATMENT INTENSITY
Notes. Standard errors in parentheses. Dependent variable is length of stay in days (columns (1) and (2)), number of procedures performed (columns (3) and (4)), and log of total
list charges (columns (5) and (6)). Sample includes 407,386 (352,652 in columns (5) and (6)) observations on patients with valid SSNs between the ages of 60 and 70 admitted to
California hospitals between January 1, 1992, and November 30, 2002 for an unplanned admission through the ED. Data on list charges are missing for Kaiser hospitals. See note to
Table III for additional details on sample, and list of additional covariates included in even-numbered columns. In columns (5) and (6) the coefficient on “age over 65” and its standard
error have been multiplied by 100.
Mortality
Mortality rates within 7, 14, 28, 90, 180, and 365 days of admission, actual and fitted, by age at admission from 60 to 70.
FIGURE VI
Patient Mortality Rates over Different Follow-Up Intervals
See notes to Figure IV. Sample includes unplanned admissions through the
emergency department for diagnoses with a t-statistic for the test of equal week-
day and weekend admission rates of 0.965 or less. In this figure the sample is
further restricted to patients with valid SSNs (407,386 observations). Deaths include in-hospital and out-of-hospital deaths.
Mortality
TABLE V
REGRESSION DISCONTINUITY ESTIMATES OF CHANGES IN MORTALITY RATES
Death rate in
Notes. Standard errors in parentheses. Dependent variable is indicator for death within interval indicated by column heading. Entries in rows (1)–(3) are estimated coefficients of
dummy for age over 65 from models that include a quadratic polynomial in age (rows (1) and (2)) or a cubic polynomial in age (row (3)) fully interacted with a dummy for age over 65.
Models in rows (2) and (3) include the following additional controls: a dummy for people who are within 1 month of their 65 birthdays, dummies for year, month, sex, race/ethnicity,
and Saturday or Sunday admissions, and unrestricted fixed effects for each ICD-9 admission diagnosis. Entries in row (4) are estimated discontinuities from a local linear regression
procedure, fit separately to the left and right, with independently selected bandwidths from a rule-of-thumb procedure suggested by Fan and Gijbels (1996). Sample includes 407,386
observations on patients between the ages of 60 and 70 admitted to California hospitals between January 1, 1992, and November 30, 2002, for unplanned admission through the ED
who have nonmissing Social Security numbers. All coefficients and their SEs have been multiplied by 100.
Figure: Imbens and Lemieux (2007), figure 3. Horizontal axis is the running variable. Vertical axis is the
conditional probability of treatment at each value of the running variable.
Visualization of identification strategy (i.e. smoothness)
Figure: Potential and observed outcome regressions (Imbens and Lemieux 2007)
Use the discontinuity as IV
One can use both Ti as well as the interaction terms as instruments for Di . If
one uses only Ti as an instrument, then it is a "just identified" model, which usually
has good finite sample properties.
In the just identified case, the first stage would be:
Di = γ0 + γ1 (Xi − X0 ) + πTi + ζi
As in the sharp RDD case one can allow the smooth function
to be different on both sides of the discontinuity.
The second stage model with interaction terms would be the
same as before:
As Hahn, Todd and van der Klaauw (2001) point out, one
needs the same assumptions as in the standard IV framework
As with other binary IVs, the fuzzy RDD is estimating LATE:
the average treatment effect for the compliers
In RDD, the compliers are those whose treatment status
changed as we moved the value of xi from just to the left of
x0 to just to the right of x0
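A minimal sketch of fuzzy RDD estimated as IV in Stata, using hypothetical variable names (y outcome, D treatment received, x running variable already centered at the cutoff) with T = 1[x ≥ 0] as the instrument; interactions of T with x could be added in the same way:
. gen T = (x >= 0) if !missing(x)
. ivregress 2sls y x (D = T), vce(robust)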
Overview
Part 1: The Core
Stratification
Part 2: Selection on observables
Matching
Part 3: Selection on unobservables
Propensity score matching
Part 4: Advanced material
Regression discontinuity designs
Conclusion
References
Challenges to RDD
We would expect waiting room A to become crowded. In the RDD context, sorting
on the running variable implies heaping on the “good side” of X0
McCrary (2008) suggests a formal test. Under the null the density should be
continuous at the cutoff point. Under the alternative hypothesis, the density
should jump at the cutoff (where treatment D is viewed as desirable)
1 Partition the assignment variable into bins and calculate frequencies (i.e., number
of observations) in each bin
2 Treat those frequency counts as dependent variable in a local linear regression
The McCrary Density Test has become mandatory for every analysis using
RDD.
If you can estimate the conditional expectations, you evidently have data on the
running variable. So in principle you can always do a density test
You can download the (no longer supported) STATA ado package, DCdensity, to
implement McCrary’s density test
(http://eml.berkeley.edu/~jmccrary/DCdensity/)
You can install rdd for R too
(http://cran.r-project.org/web/packages/rdd/rdd.pdf)
Caveats about McCrary Density Test
For RDD to be useful, you already need to know something about the mechanism generating the assignment variable and how susceptible it could be to manipulation. Note the rationality of economic actors that this test is built on.
A discontinuity in the density is "suspicious" – it suggests manipulation of X around the cutoff is probably going on. In principle one doesn't need continuity.
This is a high-powered test. You need a lot of observations at X0 to distinguish a discontinuity in the density from noise.
Fig. 1 (McCrary 2008): The agent's problem.
Figure: Panel C is the density of income when there is no pre-announcement and no manipulation. Panel D is the density of income when there is pre-announcement and manipulation. From McCrary (2008), Fig. 2 (hypothetical example: gaming the system with an income-tested job training program).
Visualizing manipulation
Figure: McCrary z-statistic at each minute threshold (finishing times from 2:30 to 7:00).
NOTE: The dark bars highlight the density in the minute bin just prior to each 30 minute threshold.
NOTE: The McCrary test is run at each minute threshold from 2:40 to 7:00 to test whether there is a significant discontinuity in the density function at that threshold.
Figure: Figures 2 and 3 from Eric Allen, Patricia Dechow, Devin Pope and George Wu’s (2013)
“Reference-Dependent Preferences: Evidence from Marathon Runners”.
http://faculty.chicagobooth.edu/devin.pope/research/pdf/Website_Marathons.pdf
More examples of testing for manipulation:
Lee (2008) Incumbency Effect
Figure: Democratic vote share relative to cutoff: popular elections to the House of Representatives, 1900–1990 (McCrary 2008). Frequency count and density estimate plotted against the Democratic margin.
More evidence of manipulation
Contrast this with roll call voting in the US House of Representatives
Coordination is expected because these are repeated games, votes are public records, and side payments are possible in the form of future votes
Bills around the cutoff are more likely to be passed than not. Seems like a good candidate for RDD
Fails McCrary Density Test; cannot use RDD because policy decisions are not quasi-randomly assigned around the cutoff
Roll Call Votes in the House, 1857-2008: A Not Very Smooth Density
Figure: Percent voting Yea: Roll Call Votes, US House of Representatives, 1857–2004 (McCrary 2008). Frequency count and density estimate plotted against the percent voting in favor of the proposed bill.
Test 2: Balance test on covariates
RDD graphs
Example: Outcomes by Forcing Variable – Smaller Bins
From Lee and Lemieux (2010), based on Lee (2008).
More RDD Graphs
More RDD graphs!
“Do Voters Affect or Elect Policies?”
by Lee, Moretti and Butler (2004)
How do voters affect policy? There are two fundamentally different views of the role
of elections in a representative democracy.
1 Convergence: Heterogeneous voter ideology forces each candidate to
moderate their positions (e.g., median voter theorem).
“Competition for votes can force even the most partisan Republicans
and Democrats to moderate their policy choices. In the extreme case,
competition may be so strong that it leads to ‘full policy
convergence’: opposing parties are forced to adopt identical policies”
(Lee, Moretti, and Butler 2004; Downs 1957).
2 Divergence: When partisan politicians cannot credibly commit to certain
policies, then convergence is undermined. The result can be full policy
divergence. Divergence is when the winning candidate, after taking office,
simply pursues his most-preferred policy. In this case, voters fail to compel
candidates to reach any kind of policy compromise.
Simplified model
maxl u(l) = −(1/2)(l − c)2          maxl v (l) = −(1/2)l 2
∂u(l)/∂l = −(l − c) = 0             ∂v (l)/∂l = −l = 0
l ∗ = c (> 0)                        l ∗ = 0
Two main datasets: liberal voting score from the Americans for Democratic
Action (ADA) linked with House of Representatives election results for
1946-1995
Authors use the ADA score for all US House Representatives from 1946 to 1995
as their voting record index
For each Congress, ADA chooses about twenty high-profile roll-call votes and
creates an index varying between 0 and 100 for each Representative of the House. Higher
scores correspond to a more “liberal” voting record.
RDD Jargon
The running variable is voteshare which is the share of all votes that went to a
Democrat. ADA scores are linked to election returns data during that period.
They use exogenous variation in Democratic wins to check whether convergence
or divergence is correct.
Discontinuity in the running variable occurs at voteshare= 0.5. When
voteshare> 0.5, the Democratic candidate wins.
Download STATA do file from my website: http://business.baylor.edu/
scott_cunningham/teaching/causalinf/lmb2004.txt, save it on your
computer, and rename the extension to say lmb2004.do. Open it in your
STATA do-editor.
Statistical results
TABLE I
RESULTS BASED ON ADA SCORES—CLOSE ELECTIONS SAMPLE
Variable:   ADA t+1    ADA t    DEM t+1    (col. (2))×(col. (3))    (col. (1)) − (col. (4))
            (1)        (2)      (3)        (4)                      (5)
Standard errors are in parentheses. The unit of observation is a district-congressional session. The
sample includes only observations where the Democrat vote share at time t is strictly between 48 percent and
52 percent. The estimated gap is the difference in the average of the relevant variable for observations for
which the Democrat vote share at time t is strictly between 50 percent and 52 percent and observations for
which the Democrat vote share at time t is strictly between 48 percent and 50 percent. Time t and t + 1 refer
to congressional sessions. ADA t is the adjusted ADA voting score. Higher ADA scores correspond to more
liberal roll-call voting records. Sample size is 915.
. * Next, control for the (centered) running variable as a linear control variable.
. * This is the simplest RDD.
. reg score democrat demvoteshare_c, cluster(id2)
. * Modeling the linearity such that slopes can differ above vs. below discontinuity
. xi: reg score i.democrat*demvoteshare_c, cluster(id2)
i.democrat _Idemocrat_0-1 (naturally coded; _Idemocrat_0 omitted)
i.demo~t*demv~c _IdemXdemvo_# (coded as above)
. gen demvoteshare3=demvoteshare^3
(11 missing values generated)
. gen demvoteshare4=demvoteshare^4
(11 missing values generated)
. gen demvoteshare5=demvoteshare^5
(11 missing values generated)
. reg score demvoteshare demvoteshare2 demvoteshare3 demvoteshare4 demvoteshare5 demvoteshare5 democrat, cluster(id2)
note: demvoteshare5 omitted because of collinearity
. * Center the running variable and use polynomials and interactions to model
* the nonlinearities below and above discontinuity
. gen x_c = demvoteshare - 0.5
(11 missing values generated)
.
. reg score i.democrat##(c.x_c c.x2_c c.x3_c c.x4_c c.x5_c)
---------------------------------------------------------------------------------
score | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
1.democrat | 47.73325 2.042906 23.37 0.000 43.72887 51.73763
x c | -28.79765 74.09167 -0.39 0.698 -174.0276 116.4323
x2 c | -1138.232 1144.497 -0.99 0.320 -3381.605 1105.141
x3 c | -10681.29 7137.315 -1.50 0.135 -24671.42 3308.839
x4 c | -33490.23 18844.92 -1.78 0.076 -70428.88 3448.424
x5 c | -32873.77 17212.08 -1.91 0.056 -66611.83 864.302
|
democrat#c.x c |
1 | -5.793828 97.79088 -0.06 0.953 -197.4775 185.8899
|
democrat#c.x2 c |
1 | 1768.225 1433.252 1.23 0.217 -1041.149 4577.599
|
democrat#c.x3 c |
1 | 6279.346 8553.95 0.73 0.463 -10487.58 23046.28
|
democrat#c.x4 c |
1 | 47111.44 21834.51 2.16 0.031 4312.767 89910.11
|
democrat#c.x5 c |
1 | 17786.85 19488.24 0.91 0.361 -20412.81 55986.51
|
cons | 17.05847 1.469785 11.61 0.000 14.17749 19.93946
---------------------------------------------------------------------------------
More polynomial and bandwidth regressions
---------------------------------------------------------------------------------
score | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
1.democrat | 45.9283 1.892566 24.27 0.000 42.21797 49.63863
xc | 38.63988 60.77525 0.64 0.525 -80.5086 157.7884
x2 c | 295.1723 594.3159 0.50 0.619 -869.9704 1460.315
|
democrat#c.x c |
1 | 6.507415 88.51418 0.07 0.941 -167.0226 180.0374
|
democrat#c.x2 c |
1 | -744.0247 862.0435 -0.86 0.388 -2434.041 945.9916
|
cons | 17.71198 1.310861 13.51 0.000 15.14207 20.28189
---------------------------------------------------------------------------------
Nonparametric estimation
Hahn, Todd and Van der Klaauw (2001) clarified assumptions about RDD (i.e.,
continuity in conditional expectation regression functions)
Also framed estimation as a non-parametric problem and emphasized using
local polynomial regressions
Nonparametric methods mean a lot of different things to different people in
statistics.
In RDD context, the idea is to estimate a model that doesn’t assume a
functional form for the relationship between Y (outcome variable) and X
(running variable)
That model would be something general like
Y = f (X ) + ε
Figure: Lee, Moretti, and Butler 2004, Figure I. γ ≈ 20.
ADA scores should be a continuous and smooth function of vote shares everywhere, except at the threshold that determines party membership. There is a large discontinuous jump in ADA scores at the 50 percent threshold.
Nonparametric estimation with cmogram
n = 13577
FIGURE IIb
Effect of Initial Win on Winning Next Election: (PD t+1 − PR t+1)
Top panel plots ADA scores after the election at time t against the Democrat vote share, time t. Bottom panel plots probability of Democrat victory at t + 1 against Democrat vote share, time t. See caption of Figure III for more details.
Figure: Lee, Moretti, and Butler 2004, Figure IIb. (PD t+1 − PR t+1) ≈ 0.50
Kernel weighted local polynomial regression
Hahn, Todd and Van der Klaauw (2001) showed that the one-sided kernel
estimation (such as lowess) may have poor properties because the point of
interest is at a boundary (i.e., the discontinuity), called the “boundary problem”
They proposed to use “local linear nonparametric regressions” instead
STATA's lpoly command estimates kernel-weighted local polynomial regressions. Think of
it as a weighted regression restricted to a window like we’ve been doing (hence
“local”) where the kernel provides the weights
A rectangular kernel would give the same result as E [Y ] at a given bin on X .
The triangular kernel gives more importance to observations close to the center.
This method will be sensitive to how large the bandwidth (window) you choose
. * Note kernel-weighted local polynomial regression is a smoothing method.
. lpoly score demvoteshare if democrat == 0, nograph kernel(triangle) gen(x0 sdem0) ///
> bwidth(0.1)
. lpoly score demvoteshare if democrat == 1, nograph kernel(triangle) gen(x1 sdem1) ///
> bwidth(0.1)
. scatter sdem1 x1, color(red) msize(small) || scatter sdem0 x0, msize(small) color(red) ///
> xline(0.5,lstyle(dot)) legend(off) xtitle("Democratic vote share") ytitle("ADA score")
. * Next, let’s get the treatment effect at the cutoff where demvoteshare=0.5
. capture drop sdem0 sdem1
. gen forat=0.5 in 1
. lpoly score demvoteshare if democrat==0, nograph kernel(triangle) gen(sdem0) at(forat) bwidth(0.1)
. lpoly score demvoteshare if democrat==1, nograph kernel(triangle) gen(sdem1) at(forat) bwidth(0.1)
. gen late=sdem1 - sdem0
. list sdem1 sdem0 late in 1/1
+----------------------------------+
| sdem1 sdem0 late |
|----------------------------------|
1. | 64.395204 16.908821 47.48639 |
+----------------------------------+
. * What happens when we change the bandwidth? Use 0.01, 0.05, 0.2, 0.3, 0.4
. capture drop smoothdem0* smoothdem1* x0* x1*
. local co 0
. foreach i in 0.01 0.05 0.1 0.20 0.30 0.40 {
2. local co = ‘co’ +1
3. lpoly score demvoteshare if democrat == 0, nograph kernel(triangle) gen(x0‘co’ smoothdem0‘co’) ///
> bwidth(‘i’)
4. lpoly score demvoteshare if democrat == 1, nograph kernel(triangle) gen(x1‘co’ smoothdem1‘co’) ///
> bwidth(‘i’)
5. }
. line smoothdem01 x01, msize(small) color(gray) sort || line smoothdem11 x11, sort color(gray) || ///
> line smoothdem02 x02, color(black) sort || line smoothdem12 x12, sort color(black) || ///
> line smoothdem03 x03, color(red) sort || line smoothdem13 x13, sort color(red) || ///
> line smoothdem04 x04, color(blue) sort || line smoothdem14 x14, sort color(blue) || ///
> line smoothdem05 x05, color(green)sort || line smoothdem15 x15, sort color(green)|| ///
> line smoothdem06 x06, color(orange) sort || line smoothdem16 x16, sort color(orange) ///
> xline(0.5,lstyle(dot)) legend(off) xtitle("Democratic vote share") ytitle("ADA score") ///
> title("Bandwidths: 0.01, 0.05, 0.1, 0.2, 0.3, 0.4")
Several methods for choosing the optimal bandwidth (window), but it’s always
a trade off between bias and variance
In practical applications, you want to check for balance around that window
Standard error of the treatment effects can be bootstrapped but there are also
other alternatives
You could add other variables to nonparametric methods.
Calonico, Cattaneo and Titiunik (2013b) propose local-polynomial regression
discontinuity estimators with robust confidence intervals
STATA ado package and R package are both called rdrobust
. rdrobust score demvoteshare, c(0.5) all bwselect(IK)
Preparing data.
Computing Bandwidth Selectors.
Computing Variance-Covariance Matrix.
Computing RD Estimates.
Estimation Completed.
All Estimates.
--------------------------------------------------------------------------------------
Method | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------------+---------------------------------------------------------------
Conventional | 47.171 .98058 48.1046 0.000 45.2488 49.0926
Bias-Corrected | 46.6 .98058 47.5226 0.000 44.678 48.5219
Robust | 46.6 1.2773 36.4839 0.000 44.0965 49.1034
--------------------------------------------------------------------------------------
McCrary Density Test
. * McCrary (2008) density test to check for manipulation of the running variable (DCdensity)
. DCdensity demvoteshare_c if (demvoteshare_c>-0.5 & demvoteshare_c<0.5), breakpoint(0) generate(Xj Yj r0 fhat se_fhat)
Using default bin size calculation, bin size = .003047982
Using default bandwidth calculation, bandwidth = .104944836
People with higher base earnings have less trouble finding a job (negative slope).
There is a kink: the relationship becomes shallower once benefits increase more.
Selection on unobservables
Independence: (Y 0 , Y 1 ) ⊥⊥ D
Pearl argues that there are three ways to estimate causal effects: the backdoor
criterion (“selection on observables”), instrumental variables, and the front
door criterion.
Now we move into the instrumental variables research design
Illustrious history – discovered by Philip Wright and published as an appendix
in his 1928 book
One of the most powerful methods for identifying causal effects in the social
sciences
Sometimes, the researcher can find the flag. That is, the researcher knows of a
variable (Z ) that actually is randomly assigned and that affects fertility
decisions. Such a variable is called an “instrument”.
Example: Angrist and Evans (1998), “Children and their parents’ labor supply”
American Economic Review,
Z is a dummy variable indicating whether the first two children born were of the
same gender
Many parents have a preference for having at least one child of each gender
Consider a couple whose first two kids were both boys; they will often have a
third, hoping to have a girl
Consider a couple whose first two kids were girls; they will often have a third,
hoping for a boy
Consider a couple with one boy and one girl; they will often not have a third kid
The gender of your kids is arguably randomly assigned (maybe not exactly, but
close enough)
"No causation without manipulation" (Holland, 1986). If you want to use IV,
then ask:
What moves around the covariate of interest that might be plausibly viewed as random?
In a pinch, you can even get by with two different data sets,
one of which has information on the outcome and the
instrument, and the other of which has information on the
covariate of interest and the instrument.
This is known as “Two sample IV” because there are two
samples involved, rather than the traditional one sample.
Once we define what IV is measuring carefully, you will see
why this works.
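A minimal sketch of the two-sample idea, assuming two hypothetical datasets — data_xz.dta containing the covariate of interest x and instrument z, and data_yz.dta containing the outcome y and z (file and variable names are assumptions):
. use data_xz, clear
. reg x z
. scalar first_stage = _b[z]
. use data_yz, clear
. reg y z
. scalar reduced_form = _b[z]
. display "Two-sample IV estimate: " reduced_form / first_stage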
Labor economists have been studying the returns to schooling a long time –
typically some version of a “Mincer regression”:
Yi = α + ρSi + γAi + νi
Yi = log of earnings
Si = schooling measured in years
Ai = individual ability
Typically the econometrician cannot observe Ai ; for instance, the CPS tells us
nothing about adult respondents’ family background, intelligence, or
motivation.
What are the consequences of leaving ability out of the regression? Suppose
you estimated this short regression instead:
Yi = α + ρSi + ηi
Cov (Y , Z ) = Cov (α + ρS + γA + ν, Z )
= E [(α + ρS + γA + ν)Z ] − E [α + ρS + γA + ν]E [Z ]
= {αE (Z ) − αE (Z )} + ρ{E (SZ ) − E (S)E (Z )} + γ{E (AZ ) − E (A)E (Z )} + E (νZ ) − E (ν)E (Z )
so that
Cov (Y , Z ) = ρCov (S, Z ) + γCov (A, Z ) + Cov (ν, Z )
If the instrument is unrelated to ability and to the error term, so that Cov (A, Z ) = 0 and Cov (ν, Z ) = 0, then
ρIV = Cov (Y , Z ) / Cov (S, Z ) = ρ
Figure: two instrumental-variables DAGs, (a) and (b), with instrument Z , treatment D, outcome Y , and unobservable e.
Yi = α + ρSi + ηi
Si = α + ρZi + ζi
Yi = α + ρSbi + νi
Yi = α + πZi + εi
Yi = α + βXi + εi
Xi = γ + δZi + νi
Note β̂IV is the ratio of the "reduced form" (π) to the "first stage" coefficient (δ):
β̂IV = Cov (Z , Y ) / Cov (Z , X ) = [Cov (Z , Y )/Var (Z )] / [Cov (Z , X )/Var (Z )] = π̂ / δ̂
Rewrite δ̂ as
δ̂ = Cov (Z , X ) / Var (Z ) ⇔ Cov (Z , X ) = δ̂ Var (Z ) (100)
Then rewrite β̂IV
β̂IV = Cov (Z , Y ) / Cov (Z , X ) = δ̂ Cov (Z , Y ) / [δ̂ Cov (Z , X )] = δ̂ Cov (Z , Y ) / [δ̂ 2 Var (Z )] = Cov (δ̂ Z , Y ) / Var (δ̂ Z ) (101)
Recall X = γ + δZ + ν; β̂IV = Cov (δ̂ Z , Y ) / Var (δ̂ Z ), and let X̂ = γ̂ + δ̂ Z .
Then the two-stage least squares (2SLS) estimator is
β̂IV = Cov (δ̂ Z , Y ) / Var (δ̂ Z ) = Cov (X̂ , Y ) / Var (X̂ )
Proof.
We will show that δ̂ Cov (Y , Z ) = Cov (X̂ , Y ); I will leave it to you to show that Var (δ̂ Z ) = Var (X̂ ). Since X̂ = γ̂ + δ̂ Z , we have Cov (X̂ , Y ) = Cov (γ̂ + δ̂ Z , Y ) = δ̂ Cov (Z , Y ).
The 2SLS estimator replaces X with the fitted values of X (i.e., X̂ ) from the
first stage regression of X onto Z and all other covariates.
In a sample of data, you could get the reduced form and first stage coefficients
manually by the following two regression commands in STATA:
. reg Y Z
. reg X Z
While it is always a good idea to run these two regressions, don’t compute your
IV estimate this way
Example: It is often the case that a pattern of missing data will differ between Y
and X ; in such a case, the usual procedure of “casewise deletion” is to keep the
subsample with non-missing data on Y , X , and Z .
But the reduced form and first stage regressions would be estimated off of
different sub-samples if you used the two step method above
The standard errors from the second stage regression are also wrong
Best practice is to use your built-in procedure (which also gives standard
errors):
. ivregress 2sls Y (X=Z)
You can also estimate 2SLS using the auxiliary regression approach we just
covered:
. reg X Z
. predict Xhat
. reg Y Xhat
For the same reasons that you shouldn’t actually implement 2SLS manually
using the ratio of the reduced form and first stage coefficients, you shouldn’t
manually use the auxiliary regression approach because, again, the standard
errors are incorrect, and any complex missing patterns may leave you with
different samples
This “two stage least squares” interpretation of IV – called an interpretation,
because it is not the actual suggested procedure – is useful for understanding
what IV does, but stick with ivregress
First Stage
Men born earlier in the year have lower schooling. This indicates that there is a first stage.
Reduced Form
Do differences in schooling due to different quarter of birth translate into different earnings?
Ŝi = X π̂10 + π̂11 Zi
where π̂1j for j = 0, 1 are OLS estimates of the first stage regression
2 Plug the first stage fitted values into the "second-stage equation" to then estimate
Yi = X α + ρŜi + error
But as we note, they don’t actually manually do this – I remind you of this
because the 2SLS intuition is very useful. 2SLS only retains the variation in S
generated by the quasi-experimental variation, which we hope is exogenous
Angrist and Krueger use more than one instrumental variable to instrument for
schooling: they include a dummy for each quarter of birth. Their estimated
first-stage regression is therefore:
The second stage is the same as before, but the fitted values are from the new
first stage
First Stage Regressions in Angrist & Krueger (1991)
Quarter of birth is a strong predictor of total years of education.
IV Results
IV Estimates, Birth Cohorts 20–29, 1980 Census
IV Results – Including Some Covariates
IV Estimates, Birth Cohorts 20–29, 1980 Census
Wald estimator
They also include specifications where they use 30 (quarter of birth × year) dummies and 150 (quarter of birth × state) dummies as instrumental variables.
What's the intuition here? The effect of quarter of birth may vary by birth year or by state.
This reduces the standard errors, but it comes at the cost of potentially having a weak instruments problem (see below).
Mechanism
Weak Instruments
y = βx + ν
x = Z ′π + η
If νi and ηi are correlated, estimating the first equation by OLS would lead to biased results, where the OLS bias is:
E [β̂OLS − β] = Cov (ν, x) / Var (x)
If νi and ηi are correlated the OLS bias is therefore σνη / σx2 .
E [β̂2SLS − β] ≈ (σνη / ση2 ) × 1/(F + 1)
where F is the population analogue of the F -statistic for the joint significance
of the instruments in the first stage regression. See Angrist and Pischke pp.
206-208 for a derivation.
If the first stage is weak (i.e., F → 0), then the bias of 2SLS approaches σνη / ση2 .
This is the same as the OLS bias: for π = 0 in the second equation on the
earlier slide (i.e., there is no first stage relationship between Z and D), σx2 = ση2 ,
and therefore the OLS bias σνη / σx2 becomes σνη / ση2 .
But if the first stage is very strong (F → ∞) then the IV bias goes to 0.
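A small simulation sketch of this mechanism (the data-generating process is an assumption for illustration, not from the slides): the instrument is deliberately weak, ν and η are correlated, and 2SLS gets pulled toward the biased OLS estimate:
. clear
. set obs 1000
. set seed 1
. gen z   = rnormal()
. gen eta = rnormal()
. gen nu  = 0.5*eta + rnormal()      // nu correlated with eta
. gen x   = 0.05*z + eta             // very weak first stage
. gen y   = 1*x + nu                 // true beta = 1
. reg y x                            // OLS, biased upward
. ivregress 2sls y (x = z)
. estat firststage                   // inspect the first-stage F-statistic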
Adding more weak instruments reduces the first stage F -statistic and moves the coefficient towards the OLS coefficient.
3 If you have many IVs, pick your best instrument and report the just identified
model (weak instrument problem is much less problematic)
4 Check over identified 2SLS models with LIML
5 Look at the reduced form
The reduced form is estimated with OLS and is therefore unbiased
If you can’t see the causal relationship of interest in the reduced form, it is
probably not there
Angrist (1990) uses the Vietnam draft lottery as an instrumental variable for
military service
In the 1960s and 1970s, young American men were drafted for military service
to serve in Vietnam
Concerns about the fairness of the conscription policy led to the introduction of
a draft lottery in 1970
From 1970 to 1972, random sequence numbers were randomly assigned to each
birth date in cohorts of 19-year-olds
Men with lottery numbers below a cutoff were drafted; in other words
Higher numbers were less likely to be drafted;
Lower numbers were more likely to be drafted.
The draft did not perfectly determine military service:
Many draft-eligible men were exempt for health and other reasons
Exempt men would sometimes volunteer
Up to this point, we only considered models where the causal effect was the
same for all individuals (i.e., homogenous treatment effects where Yi1 − Yi0 = δ
for all i units)
Let’s now try to understand what instrumental variables estimation is
measuring if treatment effects are heterogenous (i.e, Yi1 − Yi0 = δi which varies
across the population)
Why do we care?
1 We care about internal validity: Does the design successfully uncover causal
effects for the population that we are studying?
2 We care about external validity: Does the study’s results inform us about
different populations?
π0i = E [Di0 ]
π1i = (Di1 − Di0 ) is the heterogenous causal effect of the IV on Di .
E [π1i ] = The average causal effect of Zi on Di
Interpretation Potential outcomes for each person i are unrelated to the treatment
status of other individuals.
Example Veteran status of person at risk of being drafted is not affected by the
draft status of others at risk of being drafted.
Implication Rewrite Yi (D,Z) as Yi (Di , Zi ) and Di (Z) as Di (Zi ).
Independence assumption
E [Yi |Zi = 1] − E [Yi |Zi = 0] = E [Yi (Di1 , 1)|Zi = 1] − E [Yi (Di0 , 0)|Zi = 0]
= E [Yi (Di1 , 1)] − E [Yi (Di0 , 0)]
Independence means that the first stage measures the causal effect of Zi on Di :
Example Vietnam conscription for military service was based on randomly generated
draft lottery numbers. The assignment of draft lottery number was independent of
potential earnings or potential military service – as good as random.
Cunningham Causal Inference
Overview
Part 1: The Core Natural Experiments
Part 2: Selection on observables Instrumental variables
Part 3: Selection on unobservables Panel data
Part 4: Advanced material Differences-in-differences
Conclusion Synthetic Control
References
Exclusion Restriction
Y(D,Z) = Y(D,Z’) for all Z, Z’, and for all D
Exclusion restriction
Use the exclusion restriction to define potential outcomes indexed solely against
treatment status:
Yi = α0 + ρi Di
with α0 = E [Yi0 ] and ρi = Yi1 − Yi0
First stage
Monotonicity
Either π1i ≥ 0 for all i or π1i ≤ 0 for all i = 1, . . . , N
If all 1-5 assumptions are satisfied, then IV estimates the local average treatment
effect (LATE) of D on Y :
δIV,LATE = (Effect of Z on Y) / (Effect of Z on D)
The LATE parameter is the average causal effect of D on Y for those whose treatment status was changed by the instrument, Z
Vietnam draft lottery example: IV estimates the average effect of military
service on earnings for the subpopulation who enrolled in military service
because of the draft but would not have served otherwise.
In other words, LATE would not tell us what the causal effect of military service
was for volunteers or those who were exempted from military service for medical
reasons
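A small simulated Stata sketch of the Wald/LATE calculation; the data-generating process and the variable names (z, d, y) are invented purely for illustration and are not from the Angrist (1990) data:

clear
set seed 1234
set obs 5000
gen z = runiform() < .5                  // randomly assigned instrument (draft eligibility)
gen d = runiform() < cond(z, .7, .3)     // imperfect compliance: service is more likely if drafted
gen y = 2*d + rnormal()                  // outcome with a treatment effect of 2
quietly reg y z
scalar itt = _b[z]                       // effect of Z on Y (reduced form)
quietly reg d z
scalar fs = _b[z]                        // effect of Z on D (first stage)
display "Wald/LATE estimate: " itt/fs
ivregress 2sls y (d = z)                 // 2SLS reproduces the same ratio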
We have reviewed the properties of IV with heterogeneous treatment effects using a very simple example: a dummy endogenous variable, a dummy IV, and no additional controls.
The intuition of LATE generalizes to most cases where we have continuous
endogenous variables and instruments, and additional control variables.
More IV Jargon!
Never-takers: Di1 − Di0 = 0; Yi(0,1) − Yi(0,0) = 0. By the exclusion restriction, the causal effect of Z on Y is zero.
Compliers: Di1 − Di0 = 1; Yi(1,1) − Yi(0,0) = Yi(1) − Yi(0). The average treatment effect among compliers.
Defiers: Di1 − Di0 = −1; Yi(0,1) − Yi(1,0) = Yi(0) − Yi(1). By monotonicity, no one is in this group.
Always-takers: Di1 − Di0 = 0; Yi(1,1) − Yi(1,0) = 0. By the exclusion restriction, the causal effect of Z on Y is zero.
Example Someone at risk of draft (low lottery number) changes education plans to
retain draft deferments and avoid conscription.
Implication Bias is introduced into the IV estimand through two channels:
Average direct effect of Z on Y for compliers
Average direct effect of Z on Y for noncompliers multiplied by odds of being a
non-complier
Severity Depends on:
Odds of noncompliance (smaller → less bias)
“Strength” of instrument (stronger → less bias)
Effect of the alternative channel on Y
Example Someone who would have volunteered for Army when not at risk of draft
(high lottery number) chooses to avoid military service when at risk of being drafted
(low lottery number)
Implication Bias to the IV estimand (the product of two terms):
Proportion defiers relative to compliers
Difference in average causal effects of D on Y for compliers and defiers
Severity Depends on:
Proportion of defiers (small → less bias)
“Strength” of instrument (stronger → less bias)
Variation in effect of D on Y (less → less bias)
Summarizing
The average treatment effect on the treated, ATT, is a weighted average of the
effects on always-takers and compliers.
If there are no always-takers, however, ATT equals LATE and can therefore be estimated.
IV in Randomized Trials
δ(z1, z0) = (1/N) Σi E[Yi(Z = z1) − Yi(Z = z0)]
Yi (Zi = 1) − Yi (Zi = 0)
Yi (Di = 1) − Yi (Di = 0)
[DAG omitted: instrument Z, treatment D, outcome Y, and an unobserved confounder e of D and Y]
IV analysis
D=0 D=1
Z=0 Compliers and Never-takers Defiers and Always-takers
Z=1 Defiers and Never-takers Compliers and Always-takers
Heterogeneous treatment effects
Monotonicity: With probability 1, Di(z) ≥ Di(z′) for all z ≥ z′ and all i
Then local average treatment effect (LATE) is identified
In binary Z , D case, LATE is the average treatment effect for
the population of compliers
We always have to ask ourselves: are we interested in the
LATE?
Peer encouragement designs with a single behavior of interest
[DAG figures omitted: encouragement Zj of peer j, peer behavior Wj, and ego outcome Yi, shown for different peer compliance types (e.g., an always-taker peer). All variables represented by circles may have other common causes not shown; variables represented by squares are root nodes.]
Peer encouragements for dyads
[DAG figures omitted for the dyadic case: Z, W and Z-i, W-i affecting Yi]
Noncompliance with multiple peers
An ego's peers may be a mix of compliance types (e.g., never-takers and compliers)
Non-compliance
Lottery
Waiver to operate lottery
5-week sign-up period, heavy advertising (January to February
2008)
Low barriers to sign up, no eligibility pre-screening
Limited information on list
Randomly drew 30,000 out of 85,000 on list (March-October
2008)
Those selected given chance to apply
Treatment at household level
Had to return application within 45 days
60% applied; 50% of those deemed eligible → 10,000 enrollees
Oregon Health Insurance Experiment
Empirical Framework
Amy Finkelstein, et al. (2012). “The Oregon Health
Insurance Experiment: Evidence from the First Year”,
Quarterly Journal of Economics, vol. 127, issue 3, August.
Effects of Medicaid
Fielding protocol
∼70,000 people, surveyed at baseline and 12 months later
Basic protocol: three-stage mail survey protocol, English/Spanish
Intensive protocol on a 30% subsample included additional
tracking, mailings, phone attempts (done to adjust for
non-response bias)
Response rate
Effective response rate = 50%
Non-response bias always possible, but the response rate and pre-randomization measures in administrative data were balanced between treatment and control
Administrative data
Medicaid records
Pre-randomization demographics from list
Enrollment records to assess “first stage” (how many of the
selected got insurance coverage)
Hospital discharge data
Probabilistically matched to list, de-identified at Oregon
Health Plan
Includes dates and source of admissions, diagnoses,
procedures, length of stay, hospital identifier
Includes years before and after randomization
Other data
Mortality data from Oregon death records
Credit report data, probabilistically matched, de-identified
Sample
Outcomes
Effect of lottery on coverage (first stage)
Results: Access & Use of Care
Gaining insurance resulted in an increased probability of hospital admissions, primarily driven by non-emergency (non-ED) admissions.
Summary: Access and use of care
Summary: Financial Strain
Summary: Self-reported health
Measured:
Blood pressure
Cholesterol levels
Glycated hemoglobin
Depression
Reasons for selecting these:
Reasonably prevalent conditions
Clinically effective medications exist
Markers of longer term risk of cardiovascular disease
Can be measured by trained interviewers and lab tests
A limited window into health status
Results on specific conditions
Topics covered
Panel Methods
E [y |x1 , x2 , . . . , xk , c]
Single unit: stack the T observations for unit i
yi = (yi1, ..., yit, ..., yiT)′, a T × 1 vector
Xi = the T × K matrix whose t-th row is (Xi,t,1, Xi,t,2, ..., Xi,t,j, ..., Xi,t,K)
(β̂, ĉ1, ..., ĉN) = argmin over b, m1, ..., mN of Σ_{i=1}^{N} Σ_{t=1}^{T} (yit − xit b − mi)²
and the first-order conditions for each mi imply
Σ_{t=1}^{T} (yit − xit β̂ − ĉi) = 0
for i = 1, ..., N.
Derivation: fixed effects regression
Therefore, for i = 1, ..., N,
ĉi = (1/T) Σ_{t=1}^{T} (yit − xit β̂) = ȳi − x̄i β̂,
where
x̄i ≡ (1/T) Σ_{t=1}^{T} xit ;  ȳi ≡ (1/T) Σ_{t=1}^{T} yit
β̂ = ( Σ_{i=1}^{N} Σ_{t=1}^{T} ẍit′ ẍit )^{-1} ( Σ_{i=1}^{N} Σ_{t=1}^{T} ẍit′ ÿit )
Identification assumptions:
1 E[εit | xi1, xi2, ..., xiT, ci] = 0; t = 1, 2, ..., T
regressors are strictly exogenous conditional on the unobserved effect
allows xit to be arbitrarily related to ci
2 rank( Σ_{t=1}^{T} E[ẍit′ ẍit] ) = K
regressors vary over time for at least some i and are not collinear
Fixed effects estimator
1 Demean and regress ÿit on ẍit (need to correct degrees of
freedom)
2 Regress yit on xit and unit dummies (dummy variable
regression)
3 Regress yit on xit with canned fixed effects routine
STATA: xtreg y x, fe i(PanelID)
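A simulated Stata sketch showing that the three approaches return the same slope; the panel and all variable names (id, year, x, y) are invented for illustration:

clear
set seed 42
set obs 200
gen id = ceil(_n/10)                     // 20 units observed for 10 periods
bysort id: gen year = _n
gen c = rnormal()
bysort id (year): replace c = c[1]       // unit effect, constant within id
gen x = rnormal() + c                    // regressor correlated with the unit effect
gen y = x + c + rnormal()
xtset id year
xtreg y x, fe                            // 3: canned fixed effects routine
reg y x i.id                             // 2: dummy variable regression
egen xbar = mean(x), by(id)
egen ybar = mean(y), by(id)
gen xdd = x - xbar
gen ydd = y - ybar
reg ydd xdd                              // 1: demeaned regression (standard errors need a dof correction)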
Properties (under assumptions 1-2):
β̂FE is consistent: plim_{N→∞} β̂FE = β
β̂FE is unbiased conditional on X
Fixed effects regression: main issues
Inference:
Standard errors have to be “clustered” by panel unit (e.g.,
farm) to allow correlation in the εit ’s for the same i.
STATA: xtreg , fe i(PanelID) cluster( PanelID )
Yields valid inference as long as number of clusters is
reasonably large
Typically we care about β, but the unit fixed effects ci could be of interest
ĉi from the dummy variable regression is unbiased but not consistent for ci (based on fixed T and N → ∞)
the xtreg , fe routine demeans the data before running the regression and therefore does not estimate ĉi
the intercept shows the average ĉi across units
we can recover ĉi using ĉi = ȳi − x̄i β̂
predict c_i, u
Example: Direct Democracy and Naturalizations
Fixed Effects Regression
. ** pooled ols
. reg nat_rate repdem , cl(muniID)
(coefficient table omitted)
. xtsum nat_rate
(output omitted)
Fixed effects regression with demeaned data
. * regression with demeaned data
. reg dm_nat_rate dm_repdem , cl(muniID)
(coefficient table omitted)
Fixed Effects Regression with Canned Routine
. xtreg nat_rate repdem , fe cl(muniID) i(muniID)
F(1,244) = 265.18, Prob > F = 0.0000
corr(u_i, Xb) = -0.1373
(coefficient table omitted)
sigma_u 1.7129711
sigma_e 3.69998
rho .17650677 (fraction of variance due to u_i)
Fixed Effects Regression with Dummies
nat_rate Coef. Std. Err. t P>|t| [95% Conf. Interval]
muniID
2 1.367365 5.17e-14 2.6e+13 0.000 1.367365 1.367365
3 1.292252 5.17e-14 2.5e+13 0.000 1.292252 1.292252
9 1.284652 5.17e-14 2.5e+13 0.000 1.284652 1.284652
10 1.271783 5.17e-14 2.5e+13 0.000 1.271783 1.271783
13 .3265469 5.17e-14 6.3e+12 0.000 .3265469 .3265469
(remaining municipality dummies omitted)
Applying fixed effects
where yit is the murder rate and xit is police spending per capita
What happens when we regress y on x and city fixed effects?
β̂FE is inconsistent unless strict exogeneity conditional on ci holds
E [εit |xi1 , xi2 , . . . , xiT , ci ] = 0; t = 1, 2, . . . , T
implies εit uncorrelated with past, current and future regressors
Most common violations
1 Time-varying omitted variables
Economic boom leads to more police spending and fewer murders
Can include time-varying controls, but avoid post-treatment bias (i.e., conditioning on a collider)
2 Simultaneity
if the city adjusts police spending based on the past murder rate, then spending at t+1 is correlated with εt (since higher εt leads to a higher murder rate at t)
strictly exogenous x cannot react to what happens to y in the past or the future!
Fixed effects do not obviate need for good research design!
Random Effects
Given assumptions 1-3, pooled OLS is consistent, since composite error vit is
uncorrelated with xit for all t
However, pooled OLS ignores the serial correlation in vit
Random effects assumptions
As λ → 1, β̂RE → β̂FE
As λ → 0, β̂RE → β̂Pooled OLS
λ → 1 as T → ∞ or if the variance of ci is large relative to the variance of εit
λ can be estimated from the data: λ̂ = 1 − [ σ̂ε² / (σ̂ε² + T σ̂c²) ]^(1/2)
Usually wise to cluster the standard errors since assumption 4
is strong
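A short sketch putting the three estimators side by side with clustered standard errors; it assumes an xtset panel with placeholder names y, x, id, and year already in memory:

xtset id year
xtreg y x, re vce(cluster id)            // random effects (quasi-demeaned GLS)
xtreg y x, fe vce(cluster id)            // fixed effects (within estimator)
reg y x, vce(cluster id)                 // pooled OLS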
Random effects regression
(coefficient table omitted)
sigma_u 1.3866768
sigma_e 3.69998
rho .1231606 (fraction of variance due to u_i)
Summary: Fixed effects, random effects, Pooled OLS
Main assumptions
1 Regressors are strictly exogenous conditional on the
time-invariant unobserved effects
2 Regressors are uncorrelated with the time-invariant unobserved
effects
Results
Fixed effects estimator is consistent given assumption 1, but
rules out time-invariant regressors
Random effects estimators and pooled OLS are consistent
under assumptions 1-2, and allow for time-invariant regressors
Given homoskedasticity assumptions (random effects
assumption 4), the random effects estimator is asymptotically
efficient
Assumption 2 is strong so fixed effects are typically more
credible
Often the main reason for using panel data is to rule out all
time-invariant unobserved confounders!
Hausman test
                              β̂RE                        β̂FE
H0 : Cov[xit, ci] = 0         Consistent and efficient    Consistent
H1 : Cov[xit, ci] ≠ 0         Inconsistent                Consistent
Then,
Under H0, β̂RE − β̂FE should be close to zero
Under H1, β̂RE − β̂FE should be different from zero
It can be shown that in large samples, under H0, the test statistic
(β̂FE − β̂RE)′ ( Var̂[β̂FE] − Var̂[β̂RE] )^{-1} (β̂FE − β̂RE) →d χ²k
Hausman Test
. ** hausman test
. quietly: xtreg nat_rate repdem , fe i(muniID)
. estimates store FE
.
. quietly: xtreg nat_rate repdem , re i(muniID)
. estimates store RE
. hausman FE RE
Coefficients
(b) (B) (b-B) sqrt(diag(V_b-V_B))
FE RE Difference S.E.
chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 28.79
Prob>chi2 = 0.0000
Hausman test
The Hausman test does not test whether the fixed effects model is correct; the test assumes that the fixed effects estimator is consistent!
The conventional Hausman test assumes a homoskedastic model and does not allow for clustering
There are Hausman-like tests that allow for clustered standard errors
. xtoverid
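A sketch of such a test using the user-written xtoverid command (ssc install xtoverid); it is run after a cluster-robust random effects regression, here with the variable names from the naturalization example:

quietly xtreg nat_rate repdem , re cl(muniID) i(muniID)
xtoverid                                 // cluster-robust test of fixed vs. random effects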
. xtgraph repdem
[figures omitted: panel-average plots of the naturalization rate (percent) and of representative democracy (1/0) over time]
Fixed effects: adding time effects
. egen time = group(year)
Fixed effects: linear time trend
F(2,244) = 247.57, Prob > F = 0.0000
corr(u_i, Xb) = -0.0079
(coefficient table omitted)
sigma_u 1.6271657
sigma_e 3.584409
rho .17086519 (fraction of variance due to u_i)
Fixed effects: year fixed effects
F(19,244) = 31.48, Prob > F = 0.0000
corr(u_i, Xb) = -0.0168
nat_rate Coef. Std. Err. t P>|t| [95% Conf. Interval]
time
2 .3829173 .1723225 2.22 0.027 .0434879 .7223468
3 .2789777 .1514124 1.84 0.067 -.0192644 .5772198
4 .7034078 .167466 4.20 0.000 .3735443 1.033271
(remaining year effects omitted)
Fixed effects: unit-specific time trends
F(18,244) = . , Prob > F = .
corr(u_i, Xb) = -0.3963
nat_rate Coef. Std. Err. t P>|t| [95% Conf. Interval]
muniID#c.time
1 .333343 .024298 13.72 0.000 .2854823 .3812036
2 .2914274 .024298 11.99 0.000 .2435667 .339288
3 .248985 .024298 10.25 0.000 .2011244 .2968457
(remaining unit-specific trends omitted)
Unit specific time trends often eliminate “results”

Estimated effects of labor regulation on the performance of firms in Indian states

                                   (1)       (2)       (3)       (4)
Labor regulation (lagged)        -.186     -.185     -.104     .0002
                                 (.064)    (.051)    (.039)    (.020)
Log development expenditure                 .240      .184      .241
  per capita                               (.128)    (.119)    (.106)
Log installed electricity                   .089      .082      .023
  capacity per capita                      (.061)    (.054)    (.033)
Log state population                        .720      0.310   -1.419
                                           (.96)     (1.192)  (2.326)
Congress majority                                    -.0009     .020
                                                     (.01)     (.010)
Hard left majority                                   -.050     -.007
                                                     (.017)    (.009)
Janata majority                                       .008     -.020
                                                     (.026)    (.033)
Regional majority                                     .006      .026
                                                     (.009)    (.023)
State-specific trends              No        No        No       Yes
Adjusted R2                       .93       .93       .94       .95

Notes: Adapted from Besley and Burgess (2004), table IV. The table reports regression DD estimates of the effects of labor regulation on productivity. The dependent variable is log manufacturing output per capita. All models include state and year effects. Robust standard errors clustered at the state level are reported in parentheses. State amendments to the Industrial Disputes Act are coded 1 = pro-worker, 0 = neutral, -1 = pro-employer and then cumulated over the period to generate the labor regulation measure. Log of installed electrical capacity is measured in kilowatts, and log development expenditure is real per capita state spending on social and economic services. Congress, hard left, Janata, and regional majority are counts of the number of years [note truncated]

Columns 2 and 3 add state-specific covariates, such as government expenditure and state population, and these change the Besley and Burgess estimates little. Adding state-specific trends in column 4 kills the labor regulation effect: apparently, “labor regulation increased in states where output was declining anyway”, and the trend therefore drives the estimated regulation effect.
Distributed Lag model
Interpretation of coefficients:
Consider a permanent increase in xit from level m to m + 1 at time t, i.e., xs = m for s < t and xs = m + 1 for s ≥ t
yt−1 = mβ0 + mβ1 + mβ2 + ci
yt = (m + 1)β0 + mβ1 + mβ2 + ci
yt+1 = (m + 1)β0 + (m + 1)β1 + mβ2 + ci
yt+2 = (m + 1)β0 + (m + 1)β1 + (m + 1)β2 + ci
yt+3 = (m + 1)β0 + (m + 1)β1 + (m + 1)β2 + ci
After one period y has increased by β0 + β1 , after two periods
y has increased by β0 + β1 + β2 and there are no further
increases after two periods
Long-run increase in y : β0 + β1 + β2 (long-run propensity)
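A sketch of estimating a distributed lag model and its long-run propensity in Stata; the panel and variable names (y, x, id, year) are placeholders:

xtset id year
xtreg y x L1.x L2.x i.year, fe vce(cluster id)
lincom _b[x] + _b[L1.x] + _b[L2.x]       // long-run propensity beta0 + beta1 + beta2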
Lagged effects of direct democracy
F(19,244) = 21.63, Prob > F = 0.0000
corr(u_i, Xb) = -0.0206
nat_rate Coef. Std. Err. t P>|t| [95% Conf. Interval]
repdem
--. .6364802 .3593924 1.77 0.078 -.0714272 1.344388
L1. 1.201065 .4233731 2.84 0.005 .367133 2.034998
L2. -.1648692 .4697434 -0.35 0.726 -1.090139 .7604003
L3. -.5245206 .4109918 -1.28 0.203 -1.334065 .2850239
Lags and Leads model
. xtreg nat_rate F1.repdem repdem L1.repdem L2.repdem L3.repdem i.year, fe cl(muniID) i(muniID)
F(19,244) = 20.34, Prob > F = 0.0000
corr(u_i, Xb) = -0.0353
nat_rate Coef. Std. Err. t P>|t| [95% Conf. Interval]
repdem
F1. .1707685 .3212906 0.53 0.596 -.4620886 .8036255
--. .6975731 .4397095 1.59 0.114 -.1685376 1.563684
L1. .8723962 .4619322 1.89 0.060 -.0374873 1.78228
L2. .014941 .4583628 0.03 0.974 -.8879119 .9177939
L3. -.2904252 .4108244 -0.71 0.480 -1.09964 .5187895
The Autor Test
yit = Di,t+2 β−2 +Di,t+1 β−1 +Dit β0 +Di,t−1 β1 +Di,t−2 β2 +ci +εit
Interpretation of coefficients:
Leads β−1 , β−2 , etc. test for anticipation effects
Switch β0 tests for immediate effect
Lags β1 , β2 , etc. test for long-run effects
highest lag dummy can be coded 1 for all post-switch years
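A sketch of this leads-and-lags specification using Stata's time-series operators; y, d, id, and year are placeholder names and d is the treatment switch dummy:

xtset id year
xtreg y F2.d F1.d d L1.d L2.d i.year, fe vce(cluster id)
* F2.d and F1.d are the leads (anticipation effects), d the switch, L1.d and L2.d the lags;
* to code the longest lag as 1 for all post-switch years, generate that dummy by hand instead of using L2.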
(data excerpt omitted; variables: muni_n~e year repdem switch_t sw_lag1 sw_lag2 sw_lag3 sw_lead1 sw_lead2 sw_lead3)
Dynamic Effect of Switching to Representative Democracy
F(27,244) = 23.76, Prob > F = 0.0000
corr(u_i, Xb) = -0.0162
(coefficient table and event-study figure omitted)
Lagged Dependent Variable
With T = 3, first-differencing the model with a lagged dependent variable gives Δyi3 = ρ Δyi2 + Δεi3
Since εi2 affects both Δyi2 = yi2 − yi1 and Δεi3 = εi3 − εi2, the regressor Δyi2 is correlated with the error Δεi3, so the estimate of ρ is biased
Duty to retreat
“For purposes of subsection (a), in determining whether an actor described by
subsection (e) reasonably believed that the use of force was necessary, a finder of
fact may not consider whether the actor failed to retreat.”
Also: Language stating a person is justified using deadly force against another “if
a reasonable person in the actor’s situation would not have retreated” is removed
from the statute
Presumption of reasonableness
“Except as provided in subsection (b), a person is justified in using force against
another when and to the degree the actor [he] reasonably believes the force is
immediately necessary to protect the actor [himself] against the other’s use or
attempted use of unlawful force. The actor’s belief that the force was
immediately necessary as described by this subsection is presumed to be
reasonable if the actor . . . ”
Civil Liability
“A defendant who uses force or deadly force that is justified under Chapter 9
Penal code is immune from civil liability for personal injury or death that results
from the defendant’s use of force or deadly force, as applicable.”
Economic theory
Summary:
21 states passed laws removing “duty to retreat” in places
outside the home
17 states removed “duty to retreat” in any place one had a
legal right to be
13 states include a presumption of reasonable fear
18 states remove civil liability when force was justified under
law
Cheng and Hoekstra’s identification strategy
Research design:
Estimate the state-level causal effect: what would’ve happened
to these same states had they not passed the law?
Compare the changes in outcomes after castle doctrine law
adoption to changes in the outcomes in other states in the
same region of the country
Estimation: Panel fixed effects estimation
Statistical inference:
Cluster the standard errors at the state level
Are the disturbances independent random draws from an identical distribution?
It’s likely that within a state, unobserved determinants of
crime are serially correlated
Bertrand, Duflo and Mullainathan (2004) recommend adjusting for serial correlation in unobserved disturbances within states by clustering at the level of the treatment
“Randomization inference”
“how likely is it that we estimate effects of this magnitude
when using randomly chosen pre-treatment time periods and
randomly assigning placebo treatments?”
This is becoming increasingly common
Region-by-year fixed effects
Cheng and Hoekstra (2013) present the falsification first, to show the reader that they find no association within region over time between the passage of these laws and either the larceny rate or the motor vehicle theft rate
The idea is to immediately address concerns that what they show you later is due to generic crime trends in the states that pass the laws
It's a useful way to assuage doubts readers may have: remember, policy-makers are not randomly flipping coins when passing laws, but presumably do so because of things they observe on the ground
Results will be presented separately under six different specifications
Each new specification adds more controls
What should you expect to find on key variables of interest?
No statistically significant association between CDL passage and the placebos; preferably small magnitudes too
No association on the one-year lead either
How do you interpret coefficients?
Their model regresses log outcomes onto a dummy variable (a level), so the coefficients are semi-elasticities and approximate percentage changes, but you should transform them by exponentiating each coefficient and differencing from one to get the exact percentage change
Ex: CDL = -0.0137 (column 12, Table 3, “Log (larceny rate)” outcome). Exp(-0.0137) = 0.986, and 1 − 0.986 = 0.014. Thus, CDL reduced larceny rates by about 1.4 percent, which is not statistically significant.
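A one-line check of that transformation in Stata:

display 1 - exp(-0.0137)                 // = .0136, i.e., roughly a 1.4 percent reduction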
Results – Falsification Exercise
The key finding in this study is the very large effect that CDL
had on homicides and non-negligent manslaughter
As the effects are quite large, their strategy is first to present
pictures
The pictures are a bit tricky, though, since they’re going to
also present pictures for the control and treatment group units
This is going to be useful for eye-balling the parallel trends
pre-treatment.
Remember, though, that they need parallel trends within region, and these figures don't show that
But you should start with pictures; don’t fetishize regression
Log Homicide Rates – 2005 Adopter = Florida
[figures omitted: log homicide rates, 2000-2010, for adopting states vs. comparison states, and the residual log homicide rate by years after passing castle doctrine/stand-your-ground]
Before going into the estimation results, here's what you are looking for
This second hypothesis, wherein reductions in the expected penalties and costs associated with self-defense due to CDL cause lethal violence to increase (non-deterrence escalation of violence), should show up as a positive coefficient on the DD variable
It should be statistically different from zero and economically meaningful
They estimate the model using panel fixed effects estimation and “negative binomial count models”
Because of the small number of annual homicides in each state, they move away from homicide rates in some specifications and look at “count” outcomes
They use a class of estimators more appropriate for “counts”, called “count models”, like the negative binomial estimated with maximum likelihood
Results are robust to least squares and count models
Homicide – OLS
1 2 3 4 5 6
Panel A: Log Homicide Rate (OLS - Weighted)
State and Year Fixed Effects Yes Yes Yes Yes Yes Yes
Region-by-Year Fixed Effects Yes Yes Yes Yes Yes
Time-Varying Controls Yes Yes Yes Yes
Contemporaneous Crime Rates Yes
State-Specific Linear Time Trends Yes
Homicide – Negative Binomial; Murder – OLS
1 2 3 4 5 6
Panel C: Homicide (Negative Binomial - Unweighted)
State and Year Fixed Effects Yes Yes Yes Yes Yes Yes
Region-by-Year Fixed Effects Yes Yes Yes Yes Yes
Time-Varying Controls Yes Yes Yes Yes
Contemporaneous Crime Rates Yes
State-Specific Linear Time Trends Yes
Homicide – identification test
Card and Krueger surveyed about 400 fast food stores in both New Jersey and Pennsylvania before and after the minimum wage increase in New Jersey
DD Strategy
DD Strategy II
In New Jersey
Employment in February is
E (Yist |s = NJ, t = Feb) = γNJ + λFeb
Employment in November is:
E (Yist |s = NJ, t = Nov ) = γNJ + λNov + δ
Difference between November and February
E (Yist |s = NJ, t = Nov )−E (Yist |s = NJ, t = Feb) = λN −λF +δ
In Pennsylvania
Employment in February is
E (Yist |s = PA, t = Feb) = γPA + λFeb
Employment in November is:
E (Yist |s = PA, t = Nov ) = γPA + λNov
Difference between November and February
E (Yist |s = PA, t = Nov ) − E (Yist |s = PA, t = Feb) = λN − λF
DD Strategy III
Stores by state: average FTE employment before and after the minimum wage increase for PA, NJ, and the NJ-PA difference (rows: FTE employment before and after, change in mean FTE employment, and the change in a balanced sample of stores; table values omitted)
Surprisingly, employment rose in NJ relative to PA after the minimum wage change
Regression DD
Graph - DD
Yist = α + γ NJs + λ dt + δ (NJs × dt) + εist
[figures omitted: graphical illustration of the DD setup and estimate]
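A sketch of the regression DD estimate in Stata; it assumes Card and Krueger style store-level data in memory with placeholder variable names fte (full-time-equivalent employment), nj (= 1 for New Jersey), post (= 1 for the second survey wave), and storeid:

gen nj_post = nj*post
reg fte nj post nj_post, vce(cluster storeid)   // the coefficient on nj_post is the DD estimate of delta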
Key assumption of any DD strategy: Parallel trends
Treatment and control groups would have followed parallel trends in the absence of the treatment
This does not mean that they have to have the same mean of the outcome!
The common trend assumption is difficult to verify, but one often uses pre-treatment data to show that the trends are the same
Even if pre-trends are the same, one still has to worry about other policies changing at the same time (omitted variable bias)
Regression DD Including Leads and Lags
The lags show the effect increases during the first years of the
treatment and then remains relatively constant.
Threats to validity in DD strategies
1 Non-parallel trends
2 Compositional differences
3 Long-term effects vs. reliability
4 Functional form dependence
Non-parallel trends
[Figure 1 omitted: Internet diffusion and average quarterly music expenditure in the CEX (in 1998 dollars), 1996-2001]
[Table 1 omitted: descriptive statistics for Internet user and non-user groups]
Motivating Example: The Mariel Boatlift
How do inflows of immigrants affect the wages and employment of natives in local labor markets?
Card (1990) uses the Mariel Boatlift of 1980 as a natural experiment to measure the effect of a sudden influx of immigrants on unemployment among less-skilled natives
Comparative case studies
Advantages:
Policy interventions often take place at an aggregate level
Aggregate/macro data are often available
Problems:
Selection of control group is often ambiguous
Standard errors do not reflect uncertainty about the ability of
the control group to reproduce the counterfactual of interest
Synthetic Control method
Precludes extrapolation
Does not require access to post-treatment outcomes in the
“design” phase of the study, when synthetic controls are
calculated
Makes explicit the contribution of each comparison unit to the
counterfactual of interest
Allows researchers to use quantitative and qualitative
techniques to analyze the similarities and differences between
the units representing the case of interest and the synthetic
control
Formalizing the way comparison units are chosen not only systematizes comparative case studies, it also has direct implications for inference
Synthetic control method: estimation
The estimated effect of the treatment on the treated unit is α̂1t = Y1t − Σ_{j=2}^{J+1} wj* Yjt for t > T0, where Y1t is the outcome for unit one at time t
Synthetic control method: implementation
Choose the weights W* = (w2*, ..., wJ+1*) to minimize Σ_{m=1}^{k} vm ( X1m − Σ_{j=2}^{J+1} wj Xjm )²
Cross-validation
Divide the pre-treatment period into an initial training period
and a subsequent validation period
For any given V , calculate W ∗ (V ) in the training period.
Minimize the MSPE of W ∗ (V ) in the validation period
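A sketch using the user-written synth package (ssc install synth). The dataset (smoking.dta, distributed with the package) and the variable names and unit/period numbers below follow the package's Proposition 99 example and should be treated as assumptions rather than a recipe:

* ssc install synth
use smoking, clear
tsset state year
synth cigsale lnincome retprice age15to24 beer cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989)
* trunit(3) marks California as the treated unit; trperiod(1989) is the first post-Proposition 99 year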
What about unobserved factors?
[figure omitted: per-capita cigarette sales (in packs), California vs. the rest of the U.S., with the passage of Proposition 99 marked]
Cigarette Consumption: CA and synthetic CA
[figure omitted: per-capita cigarette sales (in packs), California vs. synthetic California, with the passage of Proposition 99 marked]
Predictor Means: Actual vs. Synthetic California
Variables: Real California / Synthetic California / Average of 38 control states
Ln(GDP per capita) 10.08 9.86 9.86
Percent aged 15-24 17.40 17.40 17.29
Retail price 89.42 89.41 87.27
Beer consumption per capita 24.28 24.20 23.75
Cigarette sales per capita 1988 90.10 91.62 114.20
Cigarette sales per capita 1980 120.20 120.43 136.58
Cigarette sales per capita 1975 127.10 126.99 132.81
Note: All variables except lagged cigarette sales are averaged for the 1980-
1988 period (beer consumption is averaged 1984-1988).
Smoking Gap Between CA and Synthetic CA
[figure omitted: gap in per-capita cigarette sales (in packs) between California and synthetic California, with the passage of Proposition 99 marked]
Inference
[figures omitted: the gap in per-capita cigarette sales for California together with placebo gaps for the control states, shown for all control states and after discarding poorly fitting placebos (34, 29, and 19 control states)]
Ratio Post-Prop. 99 RMSPE to Pre-Prop. 99 RMSPE
[figure omitted: histogram of the ratio of post-Proposition 99 RMSPE to pre-Proposition 99 RMSPE across states, with California marked]
[figure omitted: per-capita GDP, West Germany vs. the rest of the OECD sample, with reunification marked]
Covariate averages before 1990
[figure omitted: per-capita GDP, West Germany vs. synthetic West Germany, with reunification marked]
Country Weights in the Synthetic West Germany
[dot plot omitted: weights for Italy, Australia, Norway, Greece, Netherlands, USA, New Zealand, Spain, Belgium, UK, Switzerland, Japan, France, Denmark, Austria, and Portugal]
[figure omitted: in-time placebo with an earlier placebo reunification date, West Germany vs. synthetic West Germany]
Comparison to Regression
Let B̂ = (X0 X0′)⁻¹ X0 Y0′ be the (k × T1) matrix of regression coefficients of Y0 on X0
That is, each column of B̂ contains the regression coefficients of Y0 on X0 for a post-intervention period.
A regression-based counterfactual of the outcome for the treated unit in the absence of the treatment is given by the (T1 × 1) vector B̂′ X1
Notice that B̂′ X1 = Y0 W^reg, where W^reg = X0′ (X0 X0′)⁻¹ X1
As a result, the regression-based estimate of the
counterfactual of interest is a linear combination of
post-treatment outcomes for the untreated units, with weights
W reg
Let ι be a (J × 1) vector of ones. The sum of the regression weights is ι′ W^reg. It can be proven that ι′ W^reg = 1
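A small numerical check of this result in Stata's matrix language. It assumes the characteristics include a constant (a row of ones in X0 and a leading 1 in X1), which is what delivers the sum-to-one property; the numbers are arbitrary:

matrix X0 = (1, 1, 1, 1 \ 2, 5, 3, 6 \ 7, 1, 4, 2)   // k = 3 characteristics (incl. constant) for J = 4 control units
matrix X1 = (1 \ 4 \ 3)                               // characteristics of the treated unit
matrix Wreg = X0' * invsym(X0*X0') * X1               // regression weights
matrix s = J(1, 4, 1) * Wreg                          // sum of the weights
matrix list Wreg
matrix list s                                         // equals 1, though individual weights need not lie in [0,1]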
Synthetic vs. Regression Weights
Final thoughts
A good research design is one you are excited to tell people about; in some respects, that's basically what characterizes all good research designs, whether propensity score matching or regression discontinuity designs
Most important thing is to be honest with yourself and the reader
Always check for covariate balance in everything that you do
Causality is easy and hard. Don’t get confused which is the hard part and
which is the easy part. Don’t get enamored by statistical modeling that
obscures the identification problem from plain sight. Always understand what
assumptions you must make, be clear which parameters you are and are not
identifying, and don’t be afraid of your answers.