The Econometric Society Econometrica: This Content Downloaded From 130.194.20.173 On Tue, 14 Apr 2020 23:59:28 UTC

Sample Selection Bias as a Specification Error
Author(s): James J. Heckman

Source: Econometrica, Vol. 47, No. 1 (Jan., 1979), pp. 153-161
Published by: The Econometric Society
Stable URL: https://www.jstor.org/stable/1912352
Accessed: 14-04-2020 23:59 UTC
Econometrica, Vol. 47, No. 1 (January, 1979)
SAMPLE SELECTION BIAS AS A SPECIFICATION ERROR
BY JAMES J. HECKMAN1
This paper discusses the bias that results from using nonrandomly selected samples to
estimate behavioral relationships as an ordinary specification error or "omitted variables"
bias. A simple consistent two stage estimator is considered that enables analysts to utilize
simple regression methods to estimate behavioral functions by least squares methods. The
asymptotic distribution of the estimator is derived.
THIS PAPER DISCUSSES the bias that results from using nonrandomly selected
samples to estimate behavioral relationships as an ordinary specification bias that
arises because of a missing data problem. In contrast to the usual analysis of
"omitted variables" or specification error in econometrics, in the analysis of
sample selection bias it is sometimes possible to estimate the variables which when
omitted from a regression analysis give rise to the specification error. The
estimated values of the omitted variables can be used as regressors so that it is
possible to estimate the behavioral functions of interest by simple methods. This
paper discusses sample selection bias as a specification error and presents a simple
consistent estimation method that eliminates the specification error for the case of
censored samples. The argument presented in this paper clarifies and extends the
analysis in a previous paper [6] by explicitly developing the asymptotic
distribution of the simple estimator for the general case rather than the special case of
a null hypothesis of no selection bias implicitly discussed in that paper. Accordingly,
for reasons of readability, this paper recapitulates some of the introductory
material of the previous paper in what is hoped to be an improved and simplified
form.
Sample selection bias may arise in practice for two reasons. First, there may
be self selection by the individuals or data units being investigated. Second,
sample selection decisions by analysts or data processors operate in much the
same fashion as self selection.
There are many examples of self selection bias. One observes market wages for
working women whose market wage exceeds their home wage at zero hours of
work. Similarly, one observes wages for union members who found their
nonunion alternative less desirable. The wages of migrants do not, in general,
afford a reliable estimate of what nonmigrants would have earned had they
migrated. The earnings of manpower trainees do not estimate the earnings that
nontrainees would have earned had they opted to become trainees. In each of
these examples, wage or earnings functions estimated on selected samples do not,
in general, estimate population (i.e., random sample) wage functions.
Comparisons of the wages of migrants with the wages of nonmigrants (or trainee
earnings with nontrainee earnings, etc.) result in a biased estimate of the effect of a
random "treatment" of migration, manpower training, or unionism.
1 This research was supported by a HEW grant to the Rand Corporation and a U.S. Department of
Labor grant to the National Bureau of Economic Research. A previous version of this paper circulated
under the title "Shadow Prices, Market Wages and Labor Supply Revisited: Some Computational
Simplifications and Revised Estimates," June, 1975. An embarrassingly large number of colleagues
have made valuable comments on this paper, and its various drafts. Particular thanks go to Takeshi
Amemiya, Zvi Griliches, Reuben Gronau, Mark Killingsworth, Ed Leamer, Tom MaCurdy, Bill
Rodgers, and Paul Schultz. I bear full responsibility for any remaining errors.
Data may also be nonrandomly selected because of decisions taken by data
analysts. In studies of panel data, it is common to use "intact" observations. For
example, stability of the family unit is often imposed as a requirement for entry
into a sample for analysis. In studies of life cycle fertility and manpower training
experiments, it is common practice to analyze observations followed for the full
length of the sample, i.e., to drop attriters from the analysis. Such procedures have
the same effect on structural estimates as self selection: fitted regression functions
confound the behavioral parameters of interest with parameters of the function
determining the probability of entrance into the sample.
1. A SIMPLE CHARACTERIZATION OF SELECTION BIAS
To simplify the exposition, consider a two equation model. Few new points arise
in the multiple equation case, and the two equation case has considerable
pedagogical merit.
Consider a random sample of $I$ observations. Equations for individual $i$ are
(1a) $Y_{1i} = X_{1i}\beta_1 + U_{1i},$
(1b) $Y_{2i} = X_{2i}\beta_2 + U_{2i} \quad (i = 1, \ldots, I),$
where $X_{ji}$ is a $1 \times K_j$ vector of exogenous regressors, $\beta_j$ is a $K_j \times 1$ vector of
parameters, and
$E(U_{ji}) = 0, \quad E(U_{ji}U_{j'i''}) = \sigma_{jj'}$ if $i = i''$, and $= 0$ if $i \neq i''$.
The final assumption is a consequence of a random sampling scheme. The joint
density of $U_{1i}$, $U_{2i}$ is $h(U_{1i}, U_{2i})$. The regressor matrix is of full rank so that if all
data were available, the parameters of each equation could be estimated by least
squares.
Suppose that one seeks to estimate equation (1a) but that data are missing on $Y_1$
for certain observations. The critical question is "why are the data missing?"
The population regression function for equation (1a) may be written as
$E(Y_{1i} \mid X_{1i}) = X_{1i}\beta_1 \quad (i = 1, \ldots, I).$
The regression function for the subsample of available data is
$E(Y_{1i} \mid X_{1i}, \text{sample selection rule}) = X_{1i}\beta_1 + E(U_{1i} \mid \text{sample selection rule}),$
$i = 1, \ldots, I_1$, where the convention is adopted that the first $I_1 < I$ observations
have data available on $Y_{1i}$.
If the conditional expectation of $U_{1i}$ is zero, the regression function for the
selected subsample is the same as the population regression function. Least
squares estimators may be used to estimate $\beta_1$ on the selected subsample. The
only cost of having an incomplete sample is a loss in efficiency.
In the general case, the sample selection rule that determines the availability of
data has more serious consequences. Suppose that data are available on $Y_{1i}$ if
$Y_{2i} \geq 0$ while if $Y_{2i} < 0$, there are no observations on $Y_{1i}$. The choice of zero as a
threshold involves an inessential normalization.
In the general case
$E(U_{1i} \mid X_{1i}, \text{sample selection rule}) = E(U_{1i} \mid X_{1i}, Y_{2i} \geq 0) = E(U_{1i} \mid X_{1i}, U_{2i} \geq -X_{2i}\beta_2).$
In the case of independence between $U_{1i}$ and $U_{2i}$, so that the data on $Y_{1i}$ are
missing randomly, the conditional mean of $U_{1i}$ is zero. In the general case, it is
nonzero and the subsample regression function is
(2) $E(Y_{1i} \mid X_{1i}, Y_{2i} \geq 0) = X_{1i}\beta_1 + E(U_{1i} \mid U_{2i} \geq -X_{2i}\beta_2).$
The selected sample regression function depends on $X_{1i}$ and $X_{2i}$. Regression
estimators of the parameters of equation (1a) fit on the selected sample omit the
final term of equation (2) as a regressor, so that the bias that results from using
nonrandomly selected samples to estimate behavioral relationships is seen to arise
from the ordinary problem of omitted variables.
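The omitted-variable reading of equation (2) can be illustrated with a small simulation. The sketch below is not from the paper: the coefficient values, the correlation of 0.8 between the disturbances, the sample size, and the choice of a single shared regressor are all hypothetical. It fits the outcome equation by least squares on the full sample and on the subsample with $Y_{2i} \geq 0$, and the selected-sample slope drifts away from the value used to generate the data.

```python
import math
import random

rng = random.Random(0)          # fixed seed so the run is reproducible
n, rho = 20000, 0.8             # assumed sample size and corr(U1, U2)

def slope(pairs):
    """Least squares slope of y on x (with an intercept)."""
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx

full, selected = [], []
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    u2 = rng.gauss(0.0, 1.0)
    # U1 is built to have unit variance and correlation rho with U2.
    u1 = rho * u2 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    y1 = 1.0 + 2.0 * x + u1     # equation (1a), hypothetical beta_1 = (1, 2)
    y2 = x + u2                 # equation (1b), hypothetical beta_2 = 1
    full.append((x, y1))
    if y2 >= 0:                 # Y1 is observed only when Y2 >= 0
        selected.append((x, y1))

slope_full = slope(full)
slope_selected = slope(selected)
print(f"slope, full sample:     {slope_full:.3f}")
print(f"slope, selected sample: {slope_selected:.3f}")
```

Because the same regressor drives both the outcome and the selection rule in this setup, the conditional mean of $U_{1i}$ falls as the regressor rises, and the selected-sample slope is pulled below the generating value of 2.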
Several points are worth noting. First, if the only variable in the regressor vector
$X_{2i}$ that determines sample selection is "1" so that the probability of sample
inclusion is the same for all observations, the conditional mean of $U_{1i}$ is a
constant, and the only bias in $\beta_1$ that results from using selected samples to
estimate the population structural equation arises in the estimate of the intercept.
One can also show that the least squares estimator of the population variance $\sigma_{11}$
is downward biased. Second, a symptom of selection bias is that variables that do
not belong in the true structural equation (variables in $X_{2i}$ not in $X_{1i}$) may appear
to be statistically significant determinants of $Y_{1i}$ when regressions are fit on
selected samples. Third, the model just outlined contains a variety of previous
models as special cases. For example, if $h(U_{1i}, U_{2i})$ is assumed to be a singular
normal density ($U_{1i} \equiv U_{2i}$) and $X_{2i} = X_{1i}$, $\beta_1 \equiv \beta_2$, the "Tobit" model emerges.
For a more complete development of the relationship between the model
developed here and previous models for limited dependent variables, censored
samples and truncated samples, see Heckman [6]. Fourth, multivariate extensions
of the preceding analysis, while mathematically straightforward, are of considerable
substantive interest. One example is offered. Consider migrants choosing
among K possible regions of residence. If the self selection rule is to choose to
migrate to that region with the highest income, both the self selection rule and the
subsample regression functions can be simply characterized by a direct extension
of the previous analysis.
2. A SIMPLE ESTIMATOR FOR NORMAL DISTURBANCES
AND ITS PROPERTIES2
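Assume that $h(U_{1i}, U_{2i})$ is a bivariate normal density. Using well known results
(see [10, pp. 112-113]),
$E(U_{1i} \mid U_{2i} \geq -X_{2i}\beta_2) = \frac{\sigma_{12}}{(\sigma_{22})^{1/2}} \lambda_i,$
$E(U_{2i} \mid U_{2i} \geq -X_{2i}\beta_2) = \frac{\sigma_{22}}{(\sigma_{22})^{1/2}} \lambda_i,$
where
$\lambda_i = \frac{\phi(Z_i)}{1 - \Phi(Z_i)} = \frac{\phi(Z_i)}{\Phi(-Z_i)},$
where $\phi$ and $\Phi$ are, respectively, the density and distribution function for a
standard normal variable, and
$Z_i = -\frac{X_{2i}\beta_2}{(\sigma_{22})^{1/2}}.$
"$\lambda_i$" is the inverse of Mill's ratio. It is a monotone decreasing function of the
probability that an observation is selected into the sample, $\Phi(-Z_i)$ ($= 1 - \Phi(Z_i)$).
In particular, $\lim_{\Phi(-Z_i) \to 1} \lambda_i = 0$, $\lim_{\Phi(-Z_i) \to 0} \lambda_i = \infty$, and $\partial\lambda_i/\partial\Phi(-Z_i) < 0$.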
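The behavior of the inverse Mills ratio described above is easy to check numerically. The following is a minimal sketch using only the Python standard library; the function names and the grid of $Z_i$ values are hypothetical choices made for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, variance 1

def inverse_mills(z):
    """lambda(z) = phi(z) / Phi(-z), the inverse of Mills' ratio."""
    return _STD_NORMAL.pdf(z) / _STD_NORMAL.cdf(-z)

def selection_probability(z):
    """Phi(-z): probability that an observation is selected."""
    return _STD_NORMAL.cdf(-z)

grid = [-3.0, -1.0, 0.0, 1.0, 3.0]
lams = [inverse_mills(z) for z in grid]
probs = [selection_probability(z) for z in grid]
for z, lam, p in zip(grid, lams, probs):
    print(f"Z = {z:+.1f}  P(selected) = {p:.4f}  lambda = {lam:.4f}")
```

As the selection probability falls across the grid, the computed ratio rises, which is the monotone relationship stated in the text. For very large positive $Z_i$ the denominator underflows in floating point, so a production implementation would need a more careful tail approximation.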
The full statistical model for normal population disturbances can now be
developed. The conditional regression function for selected samples may be
written as
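$E(Y_{1i} \mid X_{1i}, Y_{2i} \geq 0) = X_{1i}\beta_1 + \frac{\sigma_{12}}{(\sigma_{22})^{1/2}} \lambda_i,$
$E(Y_{2i} \mid X_{2i}, Y_{2i} \geq 0) = X_{2i}\beta_2 + \frac{\sigma_{22}}{(\sigma_{22})^{1/2}} \lambda_i,$
(4a) $Y_{1i} = E(Y_{1i} \mid X_{1i}, Y_{2i} \geq 0) + V_{1i},$
(4b) $Y_{2i} = E(Y_{2i} \mid X_{2i}, Y_{2i} \geq 0) + V_{2i},$
where
(4c) $E(V_{1i} \mid X_{1i}, \lambda_i, U_{2i} \geq -X_{2i}\beta_2) = 0,$
(4d) $E(V_{2i} \mid X_{2i}, \lambda_i, U_{2i} \geq -X_{2i}\beta_2) = 0,$
(4e) $E(V_{ji}V_{j'i'} \mid X_{1i}, X_{2i}, \lambda_i, U_{2i} \geq -X_{2i}\beta_2) = 0,$
for $i \neq i'$. Further,
(4f) $E(V_{1i}^2 \mid X_{1i}, \lambda_i, U_{2i} \geq -X_{2i}\beta_2) = \sigma_{11}\left((1 - \rho^2) + \rho^2(1 + Z_i\lambda_i - \lambda_i^2)\right),$
(4g) $E(V_{1i}V_{2i} \mid X_{1i}, X_{2i}, \lambda_i, U_{2i} \geq -X_{2i}\beta_2) = \sigma_{12}(1 + Z_i\lambda_i - \lambda_i^2),$
(4h) $E(V_{2i}^2 \mid X_{2i}, \lambda_i, U_{2i} \geq -X_{2i}\beta_2) = \sigma_{22}(1 + Z_i\lambda_i - \lambda_i^2),$
where
$\rho^2 = \frac{\sigma_{12}^2}{\sigma_{11}\sigma_{22}}$
and
(5) $0 \leq 1 + \lambda_i Z_i - \lambda_i^2 \leq 1.$
2 A grouped data version of the estimation method discussed here was first proposed by Gronau [4]
and Lewis [11]. However, they do not investigate the statistical properties of the method or develop
the micro version of the estimator presented here.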
If one knew $Z_i$ and hence $\lambda_i$, one could enter $\lambda_i$ as a regressor in equation (4a)
and estimate that equation by ordinary least squares. The least squares estimators
of $\beta_1$ and $\sigma_{12}/(\sigma_{22})^{1/2}$ are unbiased but inefficient. The inefficiency is a consequence
of the heteroscedasticity apparent from equation (4f) when $X_{2i}$ (and hence $Z_i$)
contains nontrivial regressors. As a consequence of inequality (5), the standard
least squares estimator of the population variance $\sigma_{11}$ is downward biased. As a
consequence of equation (4g) and inequality (5), the usual estimator of the
interequation covariance is downward biased. A standard GLS procedure can be
used to develop appropriate standard errors for the estimated coefficients of the
first equation (see Heckman [6]).
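The downward bias in the usual variance estimator follows from a short rearrangement of equation (4f). The worked step below is a sketch, writing $C = \sigma_{12}/(\sigma_{22})^{1/2}$ for the coefficient on $\lambda_i$ in equation (4a), so that $C^2 = \rho^2\sigma_{11}$; the dot in the conditioning set abbreviates the conditioning variables of (4f).

```latex
\begin{aligned}
E(V_{1i}^2 \mid \cdot)
  &= \sigma_{11}\left[(1-\rho^2) + \rho^2\,(1 + Z_i\lambda_i - \lambda_i^2)\right] \\
  &= \sigma_{11} + \rho^2\sigma_{11}\,(Z_i\lambda_i - \lambda_i^2) \\
  &= \sigma_{11} + C^2\,(Z_i\lambda_i - \lambda_i^2).
\end{aligned}
```

Inequality (5) implies $Z_i\lambda_i - \lambda_i^2 \leq 0$, so each conditional variance is at most $\sigma_{11}$, and the average squared residual from the selected sample understates $\sigma_{11}$ whenever $C \neq 0$. The same rearrangement shows how far each observation's variance departs from $\sigma_{11}$, which is the source of the heteroscedasticity noted above.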
In practice, one does not know $\lambda_i$. But in the case of a censored sample, in which
one does not have information on $Y_{1i}$ if $Y_{2i} \leq 0$, but one does know $X_{2i}$ for
observations with $Y_{2i} \leq 0$, one can estimate $\lambda_i$ by the following procedure:
(1) Estimate the parameters of the probability that $Y_{2i} \geq 0$ (i.e., $\beta_2/(\sigma_{22})^{1/2}$)
using probit analysis for the full sample.3
(2) From this estimator of $\beta_2/(\sigma_{22})^{1/2}$ ($= \beta_2^*$) one can estimate $Z_i$ and hence $\lambda_i$.
All of these estimators are consistent.
(3) The estimated value of $\lambda_i$ may be used as a regressor in equation (4a) fit on
the selected subsample. Regression estimators of equation (4a) are consistent for
$\beta_1$ and $\sigma_{12}/(\sigma_{22})^{1/2}$ (the coefficients of $X_{1i}$ and $\lambda_i$, respectively).4
(4) One can consistently estimate $\sigma_{11}$ by the following procedure. From step 3,
one consistently estimates $C = \rho(\sigma_{11})^{1/2} = \sigma_{12}/(\sigma_{22})^{1/2}$. Denote the residual for the
ith observation obtained from step 3 as $\hat V_{1i}$, and the estimator of $C$ by $\hat C$. Then an
estimator of $\sigma_{11}$ is
$\hat\sigma_{11} = \frac{\sum_{i=1}^{I_1} \hat V_{1i}^2}{I_1} - \frac{\hat C^2}{I_1} \sum_{i=1}^{I_1} (\hat\lambda_i \hat Z_i - \hat\lambda_i^2),$
where $\hat\lambda_i$ and $\hat Z_i$ are the estimated values of $\lambda_i$ and $Z_i$ obtained from step 2. This
estimator of $\sigma_{11}$ is consistent and positive since the term in the second summation
must be negative (see inequality (5)).
3 In the case in which $Y_{2i}$ is observed, one can estimate $\beta_2$, $\sigma_{22}$, and hence $\beta_2/(\sigma_{22})^{1/2}$ by ordinary
least squares.
4 It is assumed that vector $X_{2i}$ contains nontrivial regressors or that $\beta_1$ contains no intercept or both.
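The four steps above can be sketched in code. This is an illustrative implementation under stated assumptions, not the program the paper refers to: the data are simulated, all numeric values are hypothetical, the probit is fit by Fisher scoring written out by hand, and the helper names `solve`, `ols`, and `probit` are introduced here for the example. The selection equation includes a regressor `w` that is excluded from the outcome equation.

```python
import math
import random
from statistics import NormalDist

N01 = NormalDist()  # standard normal

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def ols(X, y):
    """Least squares coefficients from the normal equations."""
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    return solve(xtx, xty)

def probit(X, d, iters=15):
    """Probit ML by Fisher scoring; estimates beta_2 / sigma_22^(1/2)."""
    k = len(X[0])
    g = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, di in zip(X, d):
            idx = sum(a * b for a, b in zip(x, g))
            p = min(max(N01.cdf(idx), 1e-10), 1.0 - 1e-10)
            f = N01.pdf(idx)
            w = f / (p * (1.0 - p))
            for a in range(k):
                grad[a] += x[a] * (di - p) * w
                for b in range(k):
                    info[a][b] += x[a] * x[b] * f * w
        g = [gi + si for gi, si in zip(g, solve(info, grad))]
    return g

# Simulated data; every numeric value below is a hypothetical choice.
rng = random.Random(0)
rho = 0.8                       # sigma_11 = sigma_22 = 1, so C = 0.8
X2, d, x1, y1 = [], [], [], []
for _ in range(4000):
    x, w = rng.gauss(0, 1), rng.gauss(0, 1)
    u2 = rng.gauss(0, 1)
    u1 = rho * u2 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    X2.append([1.0, x, w])      # w enters the selection equation only
    d.append(1.0 if 0.5 + x + w + u2 >= 0 else 0.0)
    x1.append(x)
    y1.append(1.0 + 2.0 * x + u1)

# Step 1: probit on the full sample.
b2_star = probit(X2, d)
# Step 2: Z_i = -X_2i b2*, lambda_i = phi(Z_i) / Phi(-Z_i), selected obs only.
keep = [i for i, di in enumerate(d) if di == 1.0]
Z = [-sum(a * b for a, b in zip(X2[i], b2_star)) for i in keep]
lam = [N01.pdf(z) / N01.cdf(-z) for z in Z]
# Step 3: regress Y1 on (1, x, lambda_hat) in the selected subsample.
X1 = [[1.0, x1[i], l] for i, l in zip(keep, lam)]
ys = [y1[i] for i in keep]
b0, b1, C_hat = ols(X1, ys)
# Step 4: sigma_11 from the residuals, corrected using C_hat.
resid = [y - (b0 + b1 * r[1] + C_hat * r[2]) for r, y in zip(X1, ys)]
n1 = len(keep)
sigma11_hat = (sum(v * v for v in resid) / n1
               - C_hat ** 2 * sum(l * z - l * l for l, z in zip(lam, Z)) / n1)
print(f"b1 = {b1:.3f}, C_hat = {C_hat:.3f}, sigma11_hat = {sigma11_hat:.3f}")
```

The sketch reports point estimates only. The standard errors printed by an ordinary regression routine at step 3 would not be appropriate when $C \neq 0$, for the reasons the paper goes on to discuss.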
The usual formulas for standard errors for least squares coefficients are not
appropriate except in the important case of the null hypothesis of no selection bias
($C = \sigma_{12}/(\sigma_{22})^{1/2} = 0$). In that case, the usual regression standard errors are
appropriate and an exact test of the null hypothesis $C = 0$ can be performed using the $t$
distribution. If $C \neq 0$, the usual procedure for computing standard errors
understates the true standard errors and overstates estimated significance levels.
The derivation of the correct limiting distribution for this estimator in the
general case requires some argument.5 Note that equation (4a) with an estimated
value of $\lambda_i$ used in place of the true value of $\lambda_i$ may be written as
(4a') $Y_{1i} = X_{1i}\beta_1 + C\hat\lambda_i + C(\lambda_i - \hat\lambda_i) + V_{1i}.$
The error term in the equation consists of the final two terms in the equation.
Since $\lambda_i$ is estimated by $\beta_2/(\sigma_{22})^{1/2}$ ($= \beta_2^*$) which is estimated from the entire
sample of $I$ observations by a maximum likelihood probit analysis,6 and since $\lambda_i$ is
a twice continuously differentiable function of $\beta_2^*$, $\sqrt{I}(\hat\lambda_i - \lambda_i)$ has a well defined
limiting normal distribution
$\sqrt{I}(\hat\lambda_i - \lambda_i) \sim N(0, \Sigma_i)$
where $\Sigma_i$ is the asymptotic variance-covariance matrix obtained from that of $\beta_2^*$
by the following equation:
$\Sigma_i = \left(\frac{\partial\lambda_i}{\partial Z_i}\right)^2 X_{2i} \Sigma X_{2i}',$
where $\partial\lambda_i/\partial Z_i$ is the derivative of $\lambda_i$ with respect to $Z_i$, and $\Sigma$ is the asymptotic
variance-covariance matrix of $\sqrt{I}(\hat\beta_2^* - \beta_2^*)$.
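The variance formula for the estimated inverse Mills ratio is a first-order (delta method) expansion. The derivation below is a sketch that uses only the definitions $\lambda_i = \phi(Z_i)/(1 - \Phi(Z_i))$ and $Z_i = -X_{2i}\beta_2^*$ together with the standard fact $\phi'(Z) = -Z\phi(Z)$.

```latex
\begin{aligned}
\frac{\partial \lambda_i}{\partial Z_i}
  &= \frac{\phi'(Z_i)\,\bigl(1-\Phi(Z_i)\bigr) + \phi(Z_i)^2}{\bigl(1-\Phi(Z_i)\bigr)^2}
   = \lambda_i^2 - Z_i\lambda_i, \\[4pt]
\hat\lambda_i - \lambda_i
  &\approx \frac{\partial \lambda_i}{\partial Z_i}\,(\hat Z_i - Z_i)
   = -\frac{\partial \lambda_i}{\partial Z_i}\, X_{2i}\,(\hat\beta_2^* - \beta_2^*), \\[4pt]
\operatorname{Var}\!\left(\sqrt{I}\,(\hat\lambda_i - \lambda_i)\right)
  &\approx \left(\frac{\partial \lambda_i}{\partial Z_i}\right)^{2} X_{2i}\,\Sigma\,X_{2i}'.
\end{aligned}
```

By inequality (5), the derivative $\lambda_i^2 - Z_i\lambda_i$ lies between 0 and 1, so the sampling error in $\hat\lambda_i$ is never larger in magnitude, to first order, than the sampling error in the estimated probit index $X_{2i}\hat\beta_2^*$.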
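We seek the limiting distribution of
$\sqrt{I_1}\begin{pmatrix} \hat\beta_1 - \beta_1 \\ \hat C - C \end{pmatrix} = \sqrt{I_1}\begin{pmatrix} \sum X_{1i}'X_{1i} & \sum X_{1i}'\hat\lambda_i \\ \sum \hat\lambda_i X_{1i} & \sum \hat\lambda_i^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum X_{1i}'\left(C(\lambda_i - \hat\lambda_i) + V_{1i}\right) \\ \sum \hat\lambda_i\left(C(\lambda_i - \hat\lambda_i) + V_{1i}\right) \end{pmatrix}.$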
In the ensuing analysis, it is important to recall that the probit function is
estimated on the entire sample of $I$ observations whereas the regression analysis is
performed solely on the subsample of $I_1$ ($< I$) observations where $Y_{1i}$ is observed.
Further, it is important to note that unlike the situation in the analysis of two stage
least squares procedures, the portion of the residual that arises from the use of an
estimated value of $\lambda_i$ in place of the actual value of $\lambda_i$ is not orthogonal to the $X_1$
data vector.
5 This portion of the paper was stimulated by comments from T. Amemiya. Of course, he is not
responsible for any errors in the argument.
6 The ensuing analysis can be modified in a straightforward fashion if $Y_{2i}$ is observed and $\beta_2^*$ is
estimated by least squares.
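Under general conditions for the regressors discussed extensively in Amemiya
[1] and Jennrich [9],
$\operatorname{plim}_{I_1 \to \infty} I_1 \begin{pmatrix} \sum X_{1i}'X_{1i} & \sum X_{1i}'\hat\lambda_i \\ \sum \hat\lambda_i X_{1i} & \sum \hat\lambda_i^2 \end{pmatrix}^{-1} = \operatorname{plim}_{I_1 \to \infty} I_1 \begin{pmatrix} \sum X_{1i}'X_{1i} & \sum X_{1i}'\lambda_i \\ \sum \lambda_i X_{1i} & \sum \lambda_i^2 \end{pmatrix}^{-1} = B,$
where $B$ is a finite positive definite matrix.7 Under these assumptions,
$\sqrt{I_1}\begin{pmatrix} \hat\beta_1 - \beta_1 \\ \hat C - C \end{pmatrix} \sim N(0, B\psi B'),$
where
$\psi = \operatorname{plim}_{I_1 \to \infty,\, I \to \infty} \left[ \frac{\sigma_{11}}{I_1} \sum_{i=1}^{I_1} \eta_i \begin{pmatrix} X_{1i}'X_{1i} & X_{1i}'\lambda_i \\ \lambda_i X_{1i} & \lambda_i^2 \end{pmatrix} + C^2 \frac{I_1}{I} \frac{1}{I_1^2} \sum_{i=1}^{I_1} \sum_{i'=1}^{I_1} \theta_{ii'} \begin{pmatrix} X_{1i}'X_{1i'} & X_{1i}'\lambda_{i'} \\ \lambda_i X_{1i'} & \lambda_i\lambda_{i'} \end{pmatrix} \right],$
$\operatorname{plim}_{I \to \infty} \frac{I_1}{I} = k, \quad 0 < k < 1,$
where
$C = \sigma_{12}/(\sigma_{22})^{1/2},$
$\eta_i = 1 + \frac{C^2}{\sigma_{11}}(Z_i\lambda_i - \lambda_i^2),$
$\theta_{ii'} = \left(\frac{\partial\lambda_i}{\partial Z_i}\right)\left(\frac{\partial\lambda_{i'}}{\partial Z_{i'}}\right) X_{2i} \Sigma X_{2i'}',$
where $\partial\lambda_i/\partial Z_i$ is the derivative of $\lambda_i$ with respect to $Z_i$,
$\frac{\partial\lambda_i}{\partial Z_i} = \lambda_i^2 - Z_i\lambda_i.$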
Note that if $C = 0$, $B\psi B'$ collapses to the standard variance-covariance matrix for
the least squares estimator. Note further that because the second matrix in $\psi$ is
positive definite, if $C \neq 0$, the correct asymptotic variance-covariance matrix
($B\psi B'$) produces standard errors of the regression coefficients that are larger than
those given by the incorrect "standard" variance-covariance matrix $\sigma_{11}B$. Thus
the usual procedure for estimating standard errors, which would be correct if $\lambda_i$
were known, leads to an understatement of true standard errors and an
overstatement of significance levels when $\lambda_i$ is estimated and $C \neq 0$.
7 Note that this requires that $X_{2i}$ contain nontrivial regressors or that there be no intercept in the
equation, or both.
Under the Amemiya-Jennrich conditions previously cited, $\psi$ is a bounded
positive definite matrix. $\psi$ and $B$ can be simply estimated. Estimated values of $\lambda_i$,
$C$, and $\sigma_{11}$ can be used in place of actual values to obtain a consistent estimator of
$B\psi B'$. Estimation of the variance-covariance matrix requires inversion of a
$(K_1 + 1) \times (K_1 + 1)$ matrix and so is computationally simple. A copy of a program that
estimates the probit function coefficients $\beta_2^*$ and the regression coefficients $\beta_1$ and
$C$, and produces the correct asymptotic standard errors for the general case is
available on request from the author.8
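One way to see why the computation is simple is that the double sum in the second term of $\psi$ factors into a single matrix product. The sketch below assumes that this term is $C^2 (I_1/I)(1/I_1^2)\sum_i\sum_{i'} \theta_{ii'}\, w_i w_{i'}'$ with $\theta_{ii'} = (\partial\lambda_i/\partial Z_i)(\partial\lambda_{i'}/\partial Z_{i'})\, X_{2i}\Sigma X_{2i'}'$. The symbols $w_i$, $\delta_i$, and $G$ are notation introduced here for illustration and do not appear in the paper.

```latex
\begin{aligned}
w_i &= \begin{pmatrix} X_{1i}' \\ \lambda_i \end{pmatrix}, \qquad
\delta_i = \frac{\partial \lambda_i}{\partial Z_i} = \lambda_i^2 - Z_i\lambda_i, \qquad
G = \sum_{i=1}^{I_1} w_i\,\delta_i\,X_{2i}, \\[4pt]
\sum_{i=1}^{I_1}\sum_{i'=1}^{I_1} \theta_{ii'}\, w_i w_{i'}'
  &= \sum_{i=1}^{I_1}\sum_{i'=1}^{I_1} (w_i \delta_i X_{2i})\,\Sigma\,(w_{i'} \delta_{i'} X_{2i'})'
   = G\,\Sigma\,G', \\[4pt]
C^2\,\frac{I_1}{I}\,\frac{1}{I_1^2}\sum_{i=1}^{I_1}\sum_{i'=1}^{I_1} \theta_{ii'}\, w_i w_{i'}'
  &= \frac{C^2}{I_1\, I}\; G\,\Sigma\,G'.
\end{aligned}
```

Under this factorization, $G$ is a $(K_1 + 1) \times K_2$ matrix accumulated in one pass over the selected subsample, so the $I_1^2$ terms of the double sum never need to be formed. Replacing $\lambda_i$, $Z_i$, $C$, and $\Sigma$ by their estimates would give the sample analog.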
It is possible to develop a GLS procedure (see Heckman [7]). This procedure is
computationally more expensive and, since the GLS estimates are not
asymptotically efficient, is not recommended.
The estimation method discussed in this paper has already been put to use.
There is accumulating evidence [3 and 6] that the estimator provides good starting
values for maximum likelihood estimation routines in the sense that it provides
estimates quite close to the maximum likelihood estimates. Given its simplicity
and flexibility, the procedure outlined in this paper is recommended for
exploratory empirical work.
3. SUMMARY
In this paper the bias that results from using nonrandomly selected samples to
estimate behavioral relationships is discussed within the specification error
framework of Griliches [2] and Theil [12]. A computationally tractable technique
is discussed that enables analysts to use simple regression techniques to estimate
behavioral functions free of selection bias in the case of a censored sample.
Asymptotic properties of the estimator are developed.
An alternative simple estimator that is also applicable to the case of truncated
samples has been developed by Amemiya [1]. A comparison between his
estimator and the one discussed here would be of great value, but is beyond the scope of
this paper. A multivariate extension of the analysis of my 1976 paper has been
performed in a valuable paper by Hanoch [5]. The simple estimator developed
here can be used in a variety of statistical models for truncation, sample selection
and limited dependent variables, as well as in simultaneous equation models with
dummy endogenous variables (Heckman [6, 8]).
University of Chicago
Manuscript received March, 1977; final revision received July, 1978.
8 This offer expires two years after the publication of this paper. The program will be provided at
cost.
REFERENCES
[1] AMEMIYA, T.: "Regression Analysis when the Dependent Variable is Truncated Normal,"
Econometrica, 41 (1973), 997-1017.
[2] GRILICHES, ZVI: "Specification Bias in Estimates of Production Functions," Journal of Farm
Economics, 39 (1957), 8-20.
[3] GRILICHES, Z., B. HALL, AND J. HAUSMAN: "Missing Data and Self Selection in Large
Panels," Harvard University, July, 1977.
[4] GRONAU, R.: "Wage Comparisons-A Selectivity Bias," Journal of Political Economy, 82
(1974), 1119-1144.
[5] HANOCH, G.: "A Multivariate Model of Labor Supply: Methodology for Estimation," Rand
Corporation Paper R-1980, September, 1976.
[6] HECKMAN, J.: "The Common Structure of Statistical Models of Truncation, Sample Selection
and Limited Dependent Variables and a Simple Estimator for Such Models," The Annals of
Economic and Social Measurement, 5 (1976), 475-492.
[7] HECKMAN, J.: "Sample Selection Bias as a Specification Error with an Application to the Estimation of
Labor Supply Functions," NBER Working Paper # 172, March, 1977 (revised).
[8] HECKMAN, J.: "Dummy Endogenous Variables in a Simultaneous Equation System," April, 1977
(revised), Econometrica, 46 (1978), 931-961.
[9] JENNRICH, R.: "Asymptotic Properties of Nonlinear Least Squares Estimators," Annals of
Mathematical Statistics, 40 (1969), 633-643.
[10] JOHNSON, N., AND S. KOTZ: Distribution in Statistics: Continuous Multivariate Distributions.
New York: John Wiley & Sons, 1972.
[11] LEWIS, H.: "Comments on Selectivity Biases in Wage Comparisons," Journal of Political
Economy, 82 (1974), 1145-1155.
[12] THEIL, H.: "Specification Errors and the Estimation of Economic Relationships," Revue de
l'Institut International de Statistique, 25 (1957), 41-51.
