Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs
Author(s): Rajeev H. Dehejia and Sadek Wahba
Source: Journal of the American Statistical Association, Vol. 94, No. 448 (Dec., 1999), pp. 1053-1062
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association
Stable URL: https://www.jstor.org/stable/2669919
This article uses propensity score methods to estimate the treatment impact of the National Supported Work (NSW) Demonstration,
a labor training program, on postintervention earnings. We use data from Lalonde's evaluation of nonexperimental methods that
combine the treated units from a randomized evaluation of the NSW with nonexperimental comparison units drawn from survey
datasets. We apply propensity score methods to this composite dataset and demonstrate that, relative to the estimators that Lalonde
evaluates, propensity score estimates of the treatment impact are much closer to the experimental benchmark estimate. Propensity
score methods assume that the variables associated with assignment to treatment are observed (referred to as ignorable treatment
assignment, or selection on observables). Even under this assumption, it is difficult to control for differences between the treatment
and comparison groups when they are dissimilar and when there are many preintervention variables. The estimated propensity score
(the probability of assignment to treatment, conditional on preintervention variables) summarizes the preintervention variables.
This offers a diagnostic on the comparability of the treatment and comparison groups, because one has only to compare the
estimated propensity score across the two groups. We discuss several methods (such as stratification and matching) that use the
propensity score to estimate the treatment impact. When the range of estimated propensity scores of the treatment and comparison
groups overlap, these methods can estimate the treatment impact for the treatment group. A sensitivity analysis shows that our
estimates are not sensitive to the specification of the estimated propensity score, but are sensitive to the assumption of selection on
observables. We conclude that when the treatment and comparison groups overlap, and when the variables determining assignment
to treatment are observed, these methods provide a means to estimate the treatment impact. Even though propensity score methods
are not always applicable, they offer a diagnostic on the quality of nonexperimental comparison groups in terms of observable
preintervention variables.
The article is organized as follows. Section 2 reviews Lalonde's data and reproduces his results. Section 3 identifies the treatment effect under the potential outcomes causal model and discusses estimation strategies for the treatment effect. Section 4 applies our methods to Lalonde's dataset, and Section 5 discusses the sensitivity of the results to the methodology. Section 6 concludes the article.

2. LALONDE'S RESULTS

2.1 The Data

The NSW Demonstration [Manpower Demonstration Research Corporation (MDRC) 1983] was a federally and privately funded program implemented in the mid-1970s to provide work experience for a period of 6-18 months to individuals who had faced economic and social problems prior to enrollment in the program. Those randomly selected to join the program participated in various types of work, such as restaurant and construction work. Information on preintervention variables (preintervention earnings as well as education, age, ethnicity, and marital status) was obtained from initial surveys and Social Security Administration records. Both the treatment and control groups participated in follow-up interviews at specific intervals. Lalonde (1986) offered a separate analysis of the male and female participants. In this article we focus on the male participants, as estimates for this group were the most sensitive to functional-form specification, as indicated by Lalonde.

Candidates eligible for the NSW were randomized into the program between March 1975 and July 1977. One consequence of randomization over a 2-year period was that individuals who joined early in the program had different characteristics than those who entered later; this is referred to as the "cohort phenomenon" (MDRC 1983, p. 48). Another consequence is that data from the NSW are delineated in terms of experimental time. Lalonde annualized earnings data from the experiment because the nonexperimental comparison groups that he used (discussed later) are delineated in calendar time. By limiting himself to those assigned to treatment after December 1975, Lalonde ensured that retrospective earnings information from the experiment included calendar 1975 earnings, which he then used as preintervention earnings. By likewise limiting himself to those who were no longer participating in the program by January 1978, he ensured that the postintervention data included calendar 1978 earnings, which he took to be the outcome of interest. Earnings data for both these years are available for both nonexperimental comparison groups. This reduces the NSW sample to 297 treated observations and 425 control observations for male participants.

However, it is important to look at several years of preintervention earnings in determining the effect of job training programs (Angrist 1990, 1998; Ashenfelter 1978; Ashenfelter and Card 1985; Card and Sullivan 1988). Thus we further limit ourselves to the subset of Lalonde's NSW data for which 1974 earnings can be obtained: those individuals who joined the program early enough for the retrospective earnings information to include 1974, as well as those individuals who joined later but were known to have been unemployed prior to randomization. Selection of this subset is based only on preintervention variables (month of assignment and employment history). Assuming that the initial randomization was independent of preintervention covariates, the subset retains a key property of the full experimental data: The treatment and control groups have the same distribution of preintervention variables, although this distribution could differ from the distribution of covariates for the larger sample. A difference in means remains an unbiased estimator of the average treatment impact for the reduced sample. The subset includes 185 treated and 260 control observations.

We present the preintervention characteristics of the original sample and of our subset in the first four rows of Table 1. Our subset differs from Lalonde's original sample, especially in terms of 1975 earnings; this is a consequence both of the cohort phenomenon and of the fact that our subsample contains more individuals who were unemployed prior to program participation. The distribution of preintervention variables is very similar across the treatment and control groups for each sample; none of the differences is significantly different from 0 at a 5% level of significance, with the exception of the indicator for "no degree."

Lalonde's nonexperimental estimates of the treatment effect are based on two distinct comparison groups: the Panel Study of Income Dynamics (PSID-1) and Westat's Matched Current Population Survey-Social Security Administration File (CPS-1). Table 1 presents the preintervention characteristics of the comparison groups. It is evident that both PSID-1 and CPS-1 differ dramatically from the treatment group in terms of age, marital status, ethnicity, and preintervention earnings; all of the mean differences are significantly different from 0 well beyond a 1% level of significance, except the indicator for "Hispanic." To bridge the gap between the treatment and comparison groups in terms of preintervention characteristics, Lalonde extracted subsets from PSID-1 and CPS-1 (denoted PSID-2 and -3 and CPS-2 and -3) that resemble the treatment group in terms of single preintervention characteristics (such as age or employment status; see Table 1). Table 1 reveals that the subsets remain substantially different from the treatment group; the mean differences in age, ethnicity, marital status, and earnings are smaller but remain statistically significant at a 1% level.
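As a concrete illustration (not code from the article), the experimental benchmark for the reduced sample is simply the difference in mean 1978 earnings between treated and control units, with the usual standard error for a difference in means. The column names below (`treat`, `re78`) are illustrative assumptions. A minimal sketch:

```python
# Minimal sketch, assuming a DataFrame of NSW experimental units with a
# treatment indicator "treat" and 1978 earnings "re78" (illustrative names).
import numpy as np
import pandas as pd

def experimental_benchmark(nsw: pd.DataFrame) -> tuple[float, float]:
    """Difference in mean 1978 earnings (treated minus control) and its SE."""
    treated = nsw.loc[nsw["treat"] == 1, "re78"]
    control = nsw.loc[nsw["treat"] == 0, "re78"]
    diff = treated.mean() - control.mean()
    # Standard error of a difference in means (unequal variances).
    se = np.sqrt(treated.var(ddof=1) / len(treated) +
                 control.var(ddof=1) / len(control))
    return diff, se
```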
Table 1. Preintervention Characteristics of the NSW Samples and the Nonexperimental Comparison Groups

| Sample | No. of observations | Age | Education | Black | Hispanic | No degree | Married | RE74 (U.S. $) | RE75 (U.S. $) |
|---|---|---|---|---|---|---|---|---|---|
| NSW/Lalonde (a): Treated | 297 | 24.63 (.32) | 10.38 (.09) | .80 (.02) | .09 (.01) | .73 (.02) | .17 (.02) | n/a | 3,066 (236) |
| NSW/Lalonde (a): Control | 425 | 24.45 (.32) | 10.19 (.08) | .80 (.02) | .11 (.02) | .81 (.02) | .16 (.02) | n/a | 3,026 (252) |
| RE74 subset (b): Treated | 185 | 25.81 (.35) | 10.35 (.10) | .84 (.02) | .059 (.01) | .71 (.02) | .19 (.02) | 2,096 (237) | 1,532 (156) |
| RE74 subset (b): Control | 260 | 25.05 (.34) | 10.09 (.08) | .83 (.02) | .10 (.02) | .83 (.02) | .15 (.02) | 2,107 (276) | 1,267 (151) |
| Comparison groups (c): PSID-1 | 2,490 | 34.85 [.78] | 12.11 [.23] | .25 [.03] | .032 [.01] | .31 [.04] | .87 [.03] | 19,429 [991] | 19,063 [1,002] |
| PSID-2 | 253 | 36.10 [1.00] | 10.77 [.27] | .39 [.04] | .067 [.02] | .49 [.05] | .74 [.04] | 11,027 [853] | 7,569 [695] |
| PSID-3 | 128 | 38.25 [1.17] | 10.30 [.29] | .45 [.05] | .18 [.03] | .51 [.05] | .70 [.05] | 5,566 [686] | 2,611 [499] |
| CPS-1 | 15,992 | 33.22 [.81] | 12.02 [.21] | .07 [.02] | .07 [.02] | .29 [.03] | .71 [.03] | 14,016 [705] | 13,650 [682] |
| CPS-2 | 2,369 | 28.25 [.87] | 11.24 [.19] | .11 [.02] | .08 [.02] | .45 [.04] | .46 [.04] | 8,728 [667] | 7,397 [600] |
| CPS-3 | 429 | 28.03 [.87] | 10.23 [.23] | .21 [.03] | .14 [.03] | .60 [.04] | .51 [.04] | 5,619 [552] | 2,467 [288] |

NOTE: Standard errors are in parentheses. The standard error on the difference in means with the RE74 subset/treated group is given in brackets. Age = age in years; Education = number of years of schooling; Black = 1 if black, 0 otherwise; Hispanic = 1 if Hispanic, 0 otherwise; No degree = 1 if no high school degree, 0 otherwise; Married = 1 if married, 0 otherwise; REx = earnings in calendar year 19x.
a. NSW sample as constructed by Lalonde (1986); RE74 is not available for this sample.
b. The subset of the Lalonde sample for which RE74 is available.
c. Definition of comparison groups (Lalonde 1986):
PSID-1: All male household heads under age 55 who did not classify themselves as retired in 1975.
PSID-2: Selects from PSID-1 all men who were not working when surveyed in the spring of 1976.
PSID-3: Selects from PSID-2 all men who were not working in 1975.
CPS-1: All CPS males under age 55.
CPS-2: Selects from CPS-1 all males who were not working when surveyed in March 1976.
CPS-3: Selects from CPS-2 all the unemployed males in 1976 whose income in 1975 was below the poverty level.
PSID1-3 and CPS-1 are identical to those used by Lalonde. CPS2-3 are similar to those used by Lalonde, but Lalonde's original subset could not be recreated.
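The bracketed standard errors in Table 1 are the diagnostics on comparability with the RE74-subset treated group. A minimal sketch of that computation, under assumed column names and not taken from the article, is:

```python
# Sketch of the Table 1 comparability diagnostics (illustrative names only):
# for each preintervention variable, report the comparison-group mean, the
# treated mean, their difference, and the standard error of the difference.
import numpy as np
import pandas as pd

COVARIATES = ["age", "education", "black", "hispanic", "nodegree",
              "married", "re74", "re75"]  # assumed column names

def mean_difference_table(treated: pd.DataFrame,
                          comparison: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for var in COVARIATES:
        t, c = treated[var], comparison[var]
        se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
        rows.append({"variable": var,
                     "treated_mean": t.mean(),
                     "comparison_mean": c.mean(),
                     "difference": c.mean() - t.mean(),
                     "se_difference": se})
    return pd.DataFrame(rows)
```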
2.2 Lalonde's Results

Because our analysis in Section 4 uses a subset of Lalonde's original data and an additional variable (1974 earnings), in Table 2 we reproduce Lalonde's results using his original data and variables (Table 2, panel A), and then apply the same estimators to our subset of his data both without and with the additional variable (Table 2, panels B and C). We show that when his analysis is applied to the data and variables that we use, his basic conclusions remain unchanged. In Section 5 we discuss the sensitivity of our propensity score results to dropping the additional earnings data. In his article, Lalonde considered linear regression, fixed-effects, and latent variable selection models of the treatment impact. Because our analysis focuses on the importance of preintervention variables, we focus on the first of these.

Table 2, panel A, reproduces the results of Lalonde (1986, Table 5). Comparing panels A and B, we note that the treatment effect, as estimated from the randomized experiment, is higher in the latter ($1,794 compared to $886). This reflects differences in the composition of the two samples, as discussed in the previous section: A higher treatment effect is obtained for those who joined the program earlier or who were unemployed prior to program participation. The results in terms of the success of nonexperimental estimates are qualitatively similar across the two samples. The simple difference in means, reported in column (1), yields negative treatment effects for the CPS and PSID comparison groups in both samples (except PSID-3). The fixed-effects-type differencing estimator in the third column fares somewhat better, although many estimates are still negative or deteriorate when we control for covariates in both panels. The estimates in the fifth column are closest to the experimental estimate, consistently closer than those in the second column, which do not control for earnings in 1975. The treatment effect is underestimated by about $1,000 for the CPS comparison groups and by $1,500 for the PSID groups. Lalonde's conclusion from panel A, which also holds in panel B, is that the regression specifications and comparison groups fail to replicate the treatment impact.

Including 1974 earnings as an additional variable in the regressions in Table 2, panel C, does not alter Lalonde's basic message, although the estimates improve compared to those in panel B. In columns (1) and (3), many estimates remain negative, but less so than in panel B. In column (2) the estimates for PSID-1 and CPS-1 are negative, but the estimates for the subsets improve. In columns (4) and (5) the estimates are closer to the experimental benchmark than in panel B, off by about $1,000 for PSID1-3 and CPS1-2 and by $400 for CPS-3. Overall, the results closest to the experimental benchmark in Table 2 are for CPS-3, panel C. This raises a number of issues. The strategy of considering subsets of the comparison group improves estimates of the treatment effect relative to the benchmark. However, Table 1 reveals that significant differences remain between the comparison groups and the treatment group. These subsets are created based on one or two preintervention variables. In Sections 3 and 4 we show that propensity score methods provide a systematic means of creating such subsets.
Table 2. Lalonde's estimated training effects, reproduced for his original sample (panel A) and for the RE74 subset without and with 1974 earnings (panels B and C).
3. IDENTIFYING AND ESTIMATING THE AVERAGE TREATMENT EFFECT

3.1 Identification

Let $Y_{i1}$ represent the value of the outcome when unit $i$ is exposed to regime 1 (called treatment), and let $Y_{i0}$ represent the value of the outcome when unit $i$ is exposed to regime 0 (called control). Only one of $Y_{i0}$ or $Y_{i1}$ can be observed for any unit, because one cannot observe the same unit under both treatment and control. Let $T_i$ be a treatment indicator (1 if exposed to treatment, 0 otherwise). Then the observed outcome for unit $i$ is $Y_i = T_i Y_{i1} + (1 - T_i) Y_{i0}$, and the treatment effect for unit $i$ is $\tau_i = Y_{i1} - Y_{i0}$.

In an experimental setting where assignment to treatment is randomized, the treatment and control groups are drawn from the same population. The average treatment effect for this population is $\tau = E(Y_{i1}) - E(Y_{i0})$. But randomization implies that $\{Y_{i1}, Y_{i0} \perp T_i\}$ [using Dawid's (1979) notation, $\perp$ represents independence], so that, for $j = 0, 1$,

$$E(Y_{ij} \mid T_i = 1) = E(Y_{ij} \mid T_i = 0) = E(Y_{ij} \mid T_i = j)$$

and therefore

$$\tau = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 0) = E(Y_i \mid T_i = 1) - E(Y_i \mid T_i = 0),$$

which is readily estimated.

In an observational study, the treatment and comparison groups are often drawn from different populations. In our application the treatment group is drawn from the population of interest: welfare recipients eligible for the program. The (nonexperimental) comparison group is drawn from a different population. (In our application both the CPS and PSID are more representative of the general U.S. population.) Thus the treatment effect that we are trying to identify is the average treatment effect for the treated population,

$$\tau|_{T=1} = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 1).$$

The second term, $E(Y_{i0} \mid T_i = 1)$, is not observable. If assignment to treatment is ignorable conditional on the preintervention variables $X_i$, that is, $\{Y_{i1}, Y_{i0} \perp T_i\} \mid X_i$, then

$$\tau|_{T=1} = E\{E(Y_i \mid X_i, T_i = 1) - E(Y_i \mid X_i, T_i = 0) \mid T_i = 1\}, \qquad (1)$$

where the expectation is over the distribution of $X_i$ for the NSW population.

One method for estimating the treatment effect that stems from (1) is estimating $E(Y_i \mid X_i, T_i = 1)$ and $E(Y_i \mid X_i, T_i = 0)$ as two nonparametric equations. This estimation strategy becomes difficult, however, if the covariates, $X_i$, are high dimensional. The propensity score theorem provides an intermediate step.

Proposition 1 (Rosenbaum and Rubin 1983). Let $p(X_i)$ be the probability of unit $i$ having been assigned to treatment, defined as $p(X_i) \equiv \Pr(T_i = 1 \mid X_i) = E(T_i \mid X_i)$. Assume that $0 < p(X_i) < 1$ for all $X_i$, and that $\Pr(T_1, T_2, \ldots, T_N \mid X_1, X_2, \ldots, X_N) = \prod_{i=1}^{N} p(X_i)^{T_i} (1 - p(X_i))^{1 - T_i}$ for the $N$ units in the sample. Then $T_i \perp X_i \mid p(X_i)$.

Corollary. If $\{(Y_{i1}, Y_{i0}) \perp T_i\} \mid X_i$ and the assumptions of Proposition 1 hold, then

$$\tau|_{T=1} = E\{E(Y_i \mid T_i = 1, p(X_i)) - E(Y_i \mid T_i = 0, p(X_i)) \mid T_i = 1\}, \qquad (2)$$

assuming that the expectations are defined. The outer expectation is over the distribution of $p(X_i) \mid T_i = 1$.

One intuition for the propensity score is that whereas in (1) we are trying to condition on $X_i$ (intuitively, to find observations with similar covariates), in (2) we are trying to condition just on the propensity score, because the proposition implies that observations with the same propensity score have the same distribution of the full vector of covariates, $X_i$.
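The balancing implication of Proposition 1 can be checked numerically. Below is a minimal, self-contained simulation, not taken from the article: treatment is assigned according to a known propensity score, the covariate means differ sharply between treated and control units overall, but they are approximately equal within narrow propensity-score bins.

```python
# Illustrative simulation of the balancing property (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=(n, 2))                                   # two covariates
p = 1.0 / (1.0 + np.exp(-(0.8 * x[:, 0] - 0.5 * x[:, 1])))    # true p(X)
t = rng.binomial(1, p)                                        # treatment

print("overall mean of x1: treated %.3f vs control %.3f"
      % (x[t == 1, 0].mean(), x[t == 0, 0].mean()))

bins = np.linspace(0, 1, 21)                                  # 20 strata on p(X)
for lo, hi in zip(bins[:-1], bins[1:]):
    inside = (p >= lo) & (p < hi)
    if (t[inside] == 1).sum() > 50 and (t[inside] == 0).sum() > 50:
        d = x[inside & (t == 1), 0].mean() - x[inside & (t == 0), 0].mean()
        print(f"stratum [{lo:.2f},{hi:.2f}): diff in mean x1 = {d:+.3f}")
```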
With many preintervention covariates, however, the propensity score itself is difficult to estimate using nonparametric techniques. Hence we use a parametric model for the propensity score. This is preferable to applying a parametric model directly to (1) because, as we will see, the results are less sensitive to the logit specification than regression models, such as those in Table 2. Finally, depending on the estimator that one adopts (e.g., stratification), a precise estimate of the propensity score is not required. The process of validating the propensity score estimate produces at least one partition structure that balances preintervention covariates across the treatment and comparison groups within each stratum, which, by (1), is all that is needed for an unbiased estimate of the treatment impact.

To validate the estimated propensity score, we stratify the observations on the score and check whether we succeed in balancing the covariates within each stratum. We use tests for the statistical significance of differences in the distribution of covariates, focusing on first and second moments (see Rosenbaum and Rubin 1984). If there are no significant differences between the two groups within each stratum, then we accept the specification. If there are significant differences, then we add higher-order terms and interactions of the covariates until this condition is satisfied. In Section 5 we demonstrate that the results are not sensitive to the selection of higher-order and interaction variables.

In the second step, given the estimated propensity score, we need to estimate a univariate nonparametric regression of the outcome on the estimated propensity score.
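A rough sketch of the specification search just described, under assumed column names and using a logistic propensity-score model; it stratifies on the estimated score and applies t-tests to the first moment of each covariate within strata. This is an illustration of the procedure, not the authors' original code.

```python
# Sketch of the balance-checking loop, assuming a pooled DataFrame `df` with
# a treatment indicator "treat" and the listed covariate columns.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LogisticRegression

def estimate_pscore(df: pd.DataFrame, covariates: list[str]) -> np.ndarray:
    """Logit propensity score for a given specification of the covariates."""
    model = LogisticRegression(max_iter=1000)
    model.fit(df[covariates].values, df["treat"].values)
    return model.predict_proba(df[covariates].values)[:, 1]

def covariates_balanced(df, covariates, pscore, n_strata=5, alpha=0.05):
    """True if no covariate mean differs significantly between treated and
    comparison units within any propensity-score stratum."""
    edges = np.quantile(pscore[df["treat"] == 1],
                        np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, pscore, side="right") - 1,
                     0, n_strata - 1)
    for s in range(n_strata):
        block = df[strata == s]
        tr, co = block[block["treat"] == 1], block[block["treat"] == 0]
        if len(tr) < 2 or len(co) < 2:
            continue
        for var in covariates:
            if stats.ttest_ind(tr[var], co[var], equal_var=False).pvalue < alpha:
                return False
    return True

# If a specification fails, add higher-order and interaction terms and retry,
# as described in the text (e.g., df["age2"] = df["age"] ** 2).
```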
Figure 1. Histogram of the Estimated Propensity Score for NSW Treated Units and PSID Comparison Units (horizontal axis: estimated p(Xi)). The 1,333 PSID units whose estimated propensity score is less than the minimum estimated propensity score for the treatment group are discarded. The first bin contains 928 PSID units. There is minimal overlap between the two groups. Three bins (.8-.85, .85-.9, and .9-.95) contain no comparison units. There are 97 treated units with an estimated propensity score greater than .8 and only 7 comparison units.
Figure 2. Histogram of the Estimated Propensity Score for NSW Treated Units and CPS Comparison Units (horizontal axis: estimated p(Xi)). The 12,611 CPS units whose estimated propensity score is less than the minimum estimated propensity score for the treatment group are discarded. The first bin contains 2,969 CPS units. There is minimal overlap between the two groups, but the overlap is greater than in Figure 1; only one bin (.45-.5) contains no comparison units, and there are 35 treated and 7 comparison units with an estimated propensity score greater than .8.
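The overlap diagnostic behind Figures 1 and 2 can be sketched as follows, assuming an array of estimated propensity scores and a treatment indicator for the pooled treated-plus-comparison sample (illustrative names, not the authors' code):

```python
# Sketch of the overlap diagnostic: discard comparison units whose estimated
# propensity score is below the minimum treated score, then count units by bin.
import numpy as np

def overlap_report(pscore: np.ndarray, treat: np.ndarray, n_bins: int = 20):
    min_treated = pscore[treat == 1].min()
    keep = (treat == 1) | (pscore >= min_treated)
    discarded = int(((treat == 0) & ~keep).sum())
    print(f"comparison units discarded (score below treated minimum): {discarded}")
    edges = np.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = keep & (pscore >= lo) & (pscore < hi)
        n_t = int((treat[in_bin] == 1).sum())
        n_c = int((treat[in_bin] == 0).sum())
        print(f"[{lo:.2f}, {hi:.2f}): {n_t:4d} treated, {n_c:6d} comparison")
```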
Figures 1 and 2 plot histograms of the estimated propensity scores for the treated and comparison units; comparison units whose estimated propensity score is less than the minimum estimated propensity score for the treatment group are discarded (although the treatment impact still could be estimated in the range of overlap). With limited overlap, we can proceed cautiously with estimation. Because in our application we have the benchmark experimental estimate, we are able to evaluate the accuracy of the estimates. Even in the absence of an experimental estimate, we show in Section 5 that the use of multiple comparison groups provides another means of evaluating the estimates.

We use stratification and matching on the propensity score to group the treatment units with the small number of comparison units whose estimated propensity scores are greater than the minimum (or less than the maximum) propensity score for treatment units. We estimate the treatment effect by summing the within-stratum difference in means between the treatment and comparison observations (of earnings in 1978), where the sum is weighted by the number of treated observations within each stratum [Table 3, column (4)]. An alternative is a within-block regression, again taking a weighted sum over the strata [Table 3, column (5)]. When the covariates are well balanced, such a regression should have little effect, but it can help eliminate the remaining within-block differences. Likewise for matching, we can estimate a difference in means between the treatment and matched comparison groups for earnings in 1978 [column (7)], and also perform a regression of 1978 earnings on covariates [column (8)].
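A minimal sketch of the stratification estimator in column (4), assuming the pooled sample has already been assigned to propensity-score strata (column names are illustrative, not the authors' code):

```python
# Within each stratum, take the difference in mean 1978 earnings between
# treated and comparison units; weight strata by their number of treated units.
import pandas as pd

def stratification_estimate(df: pd.DataFrame) -> float:
    """df has columns "treat", "re78", and "stratum" (assumed names)."""
    effects, weights = [], []
    for _, block in df.groupby("stratum"):
        treated = block.loc[block["treat"] == 1, "re78"]
        comparison = block.loc[block["treat"] == 0, "re78"]
        if len(treated) == 0 or len(comparison) == 0:
            continue  # stratum with no overlap contributes nothing
        effects.append(treated.mean() - comparison.mean())
        weights.append(len(treated))
    total = sum(weights)
    if total == 0:
        return float("nan")
    return sum(w * e for w, e in zip(weights, effects)) / total
```

The within-block regression alternative of column (5) replaces the simple difference in means in each stratum with a regression-adjusted difference, then takes the same weighted sum.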
Table 3. Estimated Training Effects for the NSW Male Participants Using Comparison Groups From PSID and CPS
Table 3 presents the results. For the PSID sample, the stratification estimate is $1,608 and the matching estimate is $1,691, compared to the benchmark randomized-experiment estimate of $1,794. The estimates from a difference in means and regression on the full sample are -$15,205 and $731. In columns (5) and (8), controlling for covariates has little impact on the stratification and matching estimates. Likewise for the CPS, the propensity-score-based estimates ($1,713 and $1,582) are much closer to the experimental benchmark than estimates from the full comparison sample (-$8,498 and $972).

We also consider estimates from the subsets of the PSID and CPS. In Table 2 the estimates tend to improve when applied to narrower subsets. However, the estimates still range from -$3,822 to $1,326. In Table 3 the estimates do not improve for the subsets, although the range of fluctuation is narrower, from $587 to $2,321. Tables 1 and 4 shed light on this.

Table 1 presents the preintervention characteristics of the various comparison groups. We note that the subsets (PSID-2 and -3 and CPS-2 and -3), although more closely resembling the treatment group, are still considerably different in a number of important dimensions, including ethnicity, marital status, and especially earnings. Table 4 presents the characteristics of the matched subsamples from the comparison groups. The characteristics of the matched subsets of CPS-1 and PSID-1 correspond closely to the treatment group; none of the differences is statistically significant. But as we create subsets of the comparison groups, the quality of the matches declines, most dramatically for the PSID. PSID-2 and -3 earnings now increase from 1974 to 1975, whereas they decline for the treatment group. The training literature has identified the "dip" in earnings as an important characteristic of participants in training programs (see Ashenfelter 1974, 1978). The CPS subsamples retain the dip, but 1974 earnings are substantially higher for the matched subset of CPS-3 than for the treatment group. This illustrates one of the important features of propensity score methods, namely that creation of ad hoc subsamples from the nonexperimental comparison group is neither necessary nor desirable; subsamples based on single preintervention characteristics may dispose of comparison units that still provide good overall comparisons with treatment units. The propensity score sorts out which comparison units are most relevant, considering all preintervention characteristics simultaneously, not just one characteristic at a time.

Column (3) in Table 3 illustrates the value of allowing both for a heterogeneous treatment effect and for a nonlinear functional form in the propensity score. The estimators in columns (4)-(8) have both of these characteristics, whereas column (3) regresses 1978 earnings on a less nonlinear function [quadratic, as opposed to the step function in columns (4) and (5)] of the estimated propensity score and a treatment indicator. The estimates are comparable to those in column (2), where we regress the outcome on all preintervention characteristics, and are farther from the experimental benchmark than the estimates in columns (4)-(8). This demonstrates the ability of the propensity score to summarize all preintervention variables, but underlines the importance of using the propensity score in a sufficiently nonlinear functional form.

Finally, it must be noted that even though the estimates presented in Table 3 are closer to the experimental benchmark than those presented in Table 2, with the exception of the adjusted matching estimator, their standard errors are higher. In Table 3, column (5), the standard errors are 1,152 and 1,581 for the CPS and PSID, compared to 550 and 886 in Table 2, panel C, column (5). This is because the propensity score estimators use fewer observations. When stratifying on the propensity score, we discard irrelevant controls, so that the strata may contain as few as seven treated observations. However, the standard errors for the adjusted matching estimator (751 and 809) are similar to those in Table 2.

By summarizing all of the covariates in a single number, the propensity score method allows us to focus on the comparability of the comparison group to the treatment group. Hence it allows us to address the issues of functional form and treatment effect heterogeneity much more easily.
Table 4. Characteristics of the Matched Subsamples From the Comparison Groups

| Matched sample | No. of observations | Age | Education | Black | Hispanic | No degree | Married | RE74 (U.S. $) | RE75 (U.S. $) |
|---|---|---|---|---|---|---|---|---|---|
| NSW | 185 | 25.81 | 10.35 | .84 | .06 | .71 | .19 | 2,096 | 1,532 |
| MPSID-1 | 56 | 26.39 [2.56] | 10.62 [.63] | .86 [.13] | .02 [.06] | .55 [.13] | .15 [.12] | 1,794 [1,406] | 1,126 [1,146] |
| MPSID-2 | 49 | 25.32 [2.63] | 11.10 [.83] | .89 [.14] | .02 [.08] | .57 [.16] | .19 [.16] | 1,599 [1,905] | 2,225 [1,228] |
| MPSID-3 | 30 | 26.86 [2.97] | 10.96 [.84] | .91 [.13] | .01 [.08] | .52 [.16] | .25 [.16] | 1,386 [1,680] | 1,863 [1,494] |
| MCPS-1 | 119 | 26.91 [1.25] | 10.52 [.32] | .86 [.06] | .04 [.04] | .64 [.07] | .19 [.06] | 2,110 [841] | 1,396 [563] |
| MCPS-2 | 87 | 26.21 [1.43] | 10.21 [.37] | .85 [.08] | .04 [.05] | .68 [.09] | .20 [.08] | 1,758 [896] | 1,204 [661] |
| MCPS-3 | 63 | 25.94 [1.68] | 10.69 [.48] | .87 [.09] | .06 [.06] | .53 [.10] | .13 [.09] | 2,709 [1,285] | 1,587 [760] |

NOTE: The standard error on the difference in means with the NSW sample is given in brackets.
MPSID1-3 and MCPS1-3 are the subsamples of PSID1-3 and CPS1-3 that are matched to the treatment group.
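A minimal sketch of nearest-neighbor matching on the estimated propensity score (with replacement), which is how matched subsamples such as MPSID-1 to MCPS-3 above can be formed; names are illustrative assumptions and this is not the authors' original code.

```python
# Each treated unit is matched, with replacement, to the comparison unit
# with the closest estimated propensity score; the matched comparison group
# can then be summarized (as in Table 4) or used for a difference in means.
import numpy as np
import pandas as pd

def match_on_score(treated: pd.DataFrame, comparison: pd.DataFrame,
                   score_col: str = "pscore") -> pd.DataFrame:
    """Return matched comparison units, one row per treated unit."""
    comp_scores = comparison[score_col].to_numpy()
    idx = [int(np.abs(comp_scores - s).argmin()) for s in treated[score_col]]
    return comparison.iloc[idx]

def matched_difference_in_means(treated: pd.DataFrame,
                                comparison: pd.DataFrame,
                                outcome: str = "re78") -> float:
    matched = match_on_score(treated, comparison)
    return treated[outcome].mean() - matched[outcome].mean()

# The adjusted matching estimate (column (8) of Table 3) additionally runs a
# regression on covariates, weighting each comparison unit by the number of
# times it is matched to a treated unit (see note b of Table 5).
```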
Table 5. Sensitivity of Estimated Training Effects to the Specification of the Estimated Propensity Score

| Comparison group (specification) | (1) Unadjusted | (2) Adjusted (a) | (3) Quadratic in score (c) | (4) Stratifying on the score: unadjusted | (5) Stratifying on the score: adjusted | (6) Observations (d) | (7) Matching on the score: unadjusted | (8) Matching on the score: adjusted (b) |
|---|---|---|---|---|---|---|---|---|
| NSW | 1,794 (633) | 1,672 (638) |  |  |  |  |  |  |
| Dropping higher-order terms |  |  |  |  |  |  |  |  |
| PSID-1: Specification 1 | -15,205 (1,154) | 218 (866) | 294 (1,389) | 1,608 (1,571) | 1,254 (1,616) | 1,255 | 1,691 (2,209) | 1,054 (831) |
| PSID-1: Specification 2 | -15,205 (1,154) | 105 (863) | 539 (1,344) | 1,524 (1,527) | 1,775 (1,538) | 1,533 | 2,281 (1,732) | 2,291 (796) |
| PSID-1: Specification 3 | -15,205 (1,154) | 105 (863) | 1,185 (1,233) | 1,237 (1,144) | 1,155 (1,280) | 1,373 | 1,140 (1,720) | 855 (906) |
| CPS-1: Specification 4 | -8,498 (712) | 738 (547) | 1,117 (747) | 1,713 (1,115) | 1,774 (1,152) | 4,117 | 1,582 (1,069) | 1,616 (751) |
| CPS-1: Specification 5 | -8,498 (712) | 684 (546) | 1,248 (731) | 1,452 (632) | 1,454 (2,713) | 6,365 | 835 (1,007) | 904 (769) |
| CPS-1: Specification 6 | -8,498 (712) | 684 (546) | 1,241 (671) | 1,299 (547) | 1,095 (925) | 6,017 | 1,103 (877) | 1,471 (787) |
| Dropping RE74 |  |  |  |  |  |  |  |  |
| PSID-1: Specification 7 | -15,205 (1,154) | -265 (880) | -697 (1,279) | -869 (1,410) | -1,023 (1,493) | 1,284 | 1,727 (1,447) | 1,340 (845) |
| PSID-2: Specification 8 | -3,647 (959) | 297 (1,004) | 521 (1,154) | 405 (1,472) | 304 (1,495) | 356 | 530 (1,848) | 276 (902) |
| PSID-3: Specification 8 | 1,069 (899) | 243 (1,100) | 1,195 (1,261) | 482 (1,449) | -53 (1,493) | 248 | 87 (1,508) | 11 (938) |
| CPS-1: Specification 9 | -8,498 (712) | 525 (557) | 1,181 (698) | 1,234 (695) | 1,347 (683) | 4,558 | 1,402 (1,067) | 861 (786) |
| CPS-2: Specification 9 | -3,822 (670) | 371 (662) | 482 (731) | 1,473 (1,313) | 1,588 (1,309) | 1,222 | 1,941 (1,500) | 1,668 (755) |
| CPS-3: Specification 9 | -635 (657) | 844 (807) | 722 (942) | 1,348 (1,601) | 1,262 (1,600) | 504 | 1,097 (1,366) | 1,120 (783) |

NOTE: Standard errors are in parentheses. Specification 1: Same as Table 3, note (c). Specification 2: Specification 1 without higher powers. Specification 3: Specification 2 without higher-order terms. Specification 4: Same as Table 3, note (e). Specification 5: Specification 4 without higher powers. Specification 6: Specification 5 without higher-order terms. Specification 7: Same as Table 3, note (c), with RE74 removed. Specification 8: Same as Table 3, note (d), with RE74 removed. Specification 9: Same as Table 3, note (e), with RE74 removed.
a. Least squares regression: RE78 on a constant, a treatment indicator, age, education, no degree, black, Hispanic, RE74, RE75.
b. Weighted least squares: treatment observations weighted as 1, and control observations weighted by the number of times they are matched to a treatment observation [sa
c. Least squares regression of RE78 on a quadratic on the estimated propensity score and a treatment indicator, for observations used under stratification; see note (d).
d. Number of observations refers to the actual number of comparison and treatment units used for (3)-(5); namely, all treatment units and those comparison units whose estimated propensity score is greater than the minimum, and less than the maximum, estimated propensity score for the treatment group.
5. SENSITIVITY ANALYSIS

5.1 Sensitivity to the Specification of the Propensity Score

The upper half of Table 5 demonstrates that the estimates of the treatment impact are not particularly sensitive to the specification used for the propensity score. Specifications 1 and 4 are the same as those in Table 3 (and hence they balance the preintervention characteristics). In specifications 2-3 and 5-6, we drop the squares and cubes of the covariates, and then the interactions and dummy variables. In specifications 3 and 6, the logits simply use the covariates linearly. These estimates are farther from the experimental benchmark than those in Table 3, ranging from $835 to $2,291, but they remain concentrated compared to the range of estimates from Table 2. Furthermore, for the alternative specifications, we are unable to find a partition structure such that the preintervention characteristics are balanced within each stratum, which then constitutes a well-defined criterion for rejecting these alternative specifications. Indeed, the specification search begins with a linear specification, then adds higher-order and interaction terms until within-stratum balance is achieved.

5.2 Sensitivity to Selection on Observables

One important assumption underlying propensity score methods is that all of the variables that influence assignment to treatment and that are correlated with the potential outcomes, Yi1 and Yi0, are observed. This assumption led us to restrict Lalonde's data to the subset for which 2 years (rather than 1 year) of preintervention earnings data is available. In this section we consider how our estimators would fare in the absence of 2 years of preintervention earnings data. In the bottom part of Table 5, we reestimate the treatment impact without using 1974 earnings. For PSID1-3, the stratification estimates (ranging from -$1,023 to $482) are more variable than the regression estimates in column (2) (ranging from -$265 to $297) and the estimates in Table 3, which use 1974 earnings (ranging from $1,494 to $2,321). The estimates from matching vary less than those from stratification. Compared to the PSID estimates, the estimates from the CPS are closer to the experimental benchmark (ranging from $1,234 to $1,588 for stratification and from $861 to $1,941 for matching). They are also closer than the regression estimates in column (2).

The results clearly are sensitive to the set of preintervention variables used, but the degree of sensitivity varies with the comparison group. This illustrates the importance of a sufficiently lengthy preintervention earnings history
for training programs. Table 5 also demonstrates the value of using multiple comparison groups. Even if we did not know the experimental estimate, the variation in estimates between the CPS and PSID would raise the concern that the variables that we observe (assuming that earnings in 1974 are not observed) do not control fully for the differences between the treatment and comparison groups. If all relevant variables are observed, then the estimates from both groups should be similar (as they are in Table 3). When an experimental benchmark is not available, multiple comparison groups are valuable, because they can suggest the existence of important unobservables. Rosenbaum (1987) has developed this idea in more detail.

6. CONCLUSIONS

In this article we have demonstrated how to estimate the treatment impact in an observational study using propensity score methods. Our contribution is to demonstrate the use of propensity score methods and to apply them in a context that allows us to assess their efficacy. Our results show that the estimates of the training effect for Lalonde's hybrid of an experimental and nonexperimental dataset are close to the benchmark experimental estimate and are robust to the specification of the comparison group and to the functional form used to estimate the propensity score. A researcher using this method would arrive at estimates of the treatment impact ranging from $1,473 to $1,774, close to the benchmark unbiased estimate from the experiment of $1,794. Furthermore, our methods succeed for a transparent reason: They use only the subset of the comparison group that is comparable to the treatment group, and discard the complement. Although Lalonde attempted to follow this strategy in his construction of other comparison groups, his method relies on an informal selection based on preintervention variables. Our application illustrates that even among a large set of potential comparison units, very few may be relevant, and that even a few comparison units may be sufficient to estimate the treatment impact.

The methods we suggest are not relevant in all situations. There may be important unobservable covariates, for which the propensity score method cannot account. However, rather than giving up, or relying on assumptions about the unobserved variables, there is substantial reward in exploring first the information contained in the variables that are observed. In this regard, propensity score methods can offer both a diagnostic on the quality of the comparison group and a means to estimate the treatment impact.

[Received October 1998. Revised May 1999.]

REFERENCES

Angrist, J. D. (1998), "Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants," Econometrica, 66, 249-288.
Ashenfelter, O. (1974), "The Effect of Manpower Training on Earnings: Preliminary Results," in Proceedings of the Twenty-Seventh Annual Winter Meetings of the Industrial Relations Research Association, eds. J. Stern and B. Dennis, Madison, WI: Industrial Relations Research Association.
Ashenfelter, O. (1978), "Estimating the Effects of Training Programs on Earnings," Review of Economics and Statistics, 60, 47-57.
Ashenfelter, O., and Card, D. (1985), "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs," Review of Economics and Statistics, 67, 648-660.
Card, D., and Sullivan, D. (1988), "Measuring the Effect of Subsidized Training Programs on Movements In and Out of Employment," Econometrica, 56, 497-530.
Dawid, A. P. (1979), "Conditional Independence in Statistical Theory," Journal of the Royal Statistical Society, Ser. B, 41, 1-31.
Dehejia, R. H., and Wahba, S. (1998), "Matching Methods for Estimating Causal Effects in Non-Experimental Studies," Working Paper 6829, National Bureau of Economic Research.
Härdle, W., and Linton, O. (1994), "Applied Nonparametric Regression," in Handbook of Econometrics, Vol. 4, eds. R. Engle and D. L. McFadden, Amsterdam: Elsevier, pp. 2295-2339.
Heckman, J., and Hotz, J. (1989), "Choosing Among Alternative Nonexperimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training," Journal of the American Statistical Association, 84, 862-874.
Heckman, J., Ichimura, H., Smith, J., and Todd, P. (1998), "Characterizing Selection Bias Using Experimental Data," Econometrica, 66, 1017-1098.
Heckman, J., Ichimura, H., and Todd, P. (1997), "Matching As An Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program," Review of Economic Studies, 64, 605-654.
Heckman, J., Ichimura, H., and Todd, P. (1998), "Matching as an Econometric Evaluation Estimator," Review of Economic Studies, 65, 261-294.
Heckman, J., and Robb, R. (1985), "Alternative Methods for Evaluating the Impact of Interventions," in Longitudinal Analysis of Labor Market Data, Econometric Society Monograph No. 10, eds. J. Heckman and B. Singer, Cambridge, U.K.: Cambridge University Press, pp. 63-113.
Holland, P. W. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945-960.
Lalonde, R. (1986), "Evaluating the Econometric Evaluations of Training Programs," American Economic Review, 76, 604-620.
Manpower Demonstration Research Corporation (1983), Summary and Findings of the National Supported Work Demonstration, Cambridge, MA: Ballinger.
Manski, C. F., and Garfinkel, I. (1992), "Introduction," in Evaluating Welfare and Training Programs, eds. C. Manski and I. Garfinkel, Cambridge, MA: Harvard University Press, pp. 1-22.
Manski, C. F., Sandefur, G., McLanahan, S., and Powers, D. (1992), "Alternative Estimates of the Effect of Family Structure During Adolescence on High School Graduation," Journal of the American Statistical Association, 87, 25-37.
Rosenbaum, P. (1987), "The Role of a Second Control Group in an Observational Study," Statistical Science, 2, 292-316.
Rosenbaum, P., and Rubin, D. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal Effects," Biometrika, 70, 41-55.
Rosenbaum, P., and Rubin, D. (1984), "Reducing Bias in Observational Studies Using Subclassification on the Propensity Score," Journal of the American Statistical Association, 79, 516-524.
Rubin, D. (1974), "Estimating Causal Effects of Treatments in Randomized and Non-Randomized Studies," Journal of Educational Psychology, 66, 688-701.