14.382 Inference: Creative Commons BY-NC-SA
Massachusetts Institute
of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA.
VICTOR CHERNOZHUKOV AND IVÁN FERNÁNDEZ-VAL
Abstract. Here we give an overview of least squares from several angles. We discuss
Frisch-Waugh-Lovell partialling out and point out its adaptivity property, which is useful in
establishing approximate normality of the regression estimators of a set of target regression
coefficients. We then discuss the construction of simultaneous confidence sets for this set of
coefficients. We use these methods to analyze the gender wage gap and the impact of
reemployment incentives on the duration of unemployment.
1. Notation
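For a vector $v = (v_1, \dots, v_k)'$, the $\ell_2$ and $\ell_1$ norms are
$$\|v\|_2 := \Big(\sum_{i=1}^k v_i^2\Big)^{1/2}, \qquad \|v\|_1 := \sum_{i=1}^k |v_i|.$$
The $\ell_0$-"norm", $\|\cdot\|_0$, denotes the number of nonzero components of a vector, and
$\|\cdot\|_\infty$ denotes the max norm:
$$\|v\|_0 := \sum_{i=1}^k 1\{v_i \neq 0\}, \qquad \|v\|_\infty := \max\{|v_i| : i \in \{1, \dots, k\}\}.$$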
We use the notation a ∨ b = max(a, b) and a ∧ b = min(a, b). We use x' to denote the
transpose of a column vector x. In what follows, the notation $\mathbb{E}_n f(W)$ abbreviates the
empirical average $n^{-1}\sum_{i=1}^n f(W_i)$.
2. Least Squares
We define the least squares or projection parameter β in the population as the solution
of the following prediction problem:
$$\beta := \arg\min_{b \in \mathbb{R}^p} \mathrm{E}(Y - X'b)^2.$$
We define the least squares estimator or projection estimator β̂ in the sample as the solution
of the following prediction problem:
$$\hat\beta := \arg\min_{b \in \mathbb{R}^p} \mathbb{E}_n(Y - X'b)^2.$$
Note that the least squares estimator makes sense only if p is not bigger than n. If p > n
other estimators must be used, for example, penalized least squares estimators or post-
selection least squares estimators.
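As a minimal sketch of the sample problem above, the estimator can be computed by solving the normal equations (E_n XX') b = E_n XY. The helper names `solve` and `least_squares` and the simulated data are assumptions made for illustration only; they are not part of the notes.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def least_squares(X, y):
    """Return argmin_b E_n (Y - X'b)^2 by solving (E_n XX') b = E_n XY."""
    n, p = len(X), len(X[0])
    if p > n:
        raise ValueError("least squares requires p <= n")
    XtX = [[sum(row[j] * row[k] for row in X) / n for k in range(p)]
           for j in range(p)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) / n for j in range(p)]
    return solve(XtX, Xty)

# Hypothetical simulated data: Y = 1 + 2 * x + noise.
random.seed(0)
X = [[1.0, random.gauss(0, 1)] for _ in range(500)]
y = [1.0 * row[0] + 2.0 * row[1] + random.gauss(0, 0.5) for row in X]
beta_hat = least_squares(X, y)
print(beta_hat)  # should be close to [1, 2]
```

The explicit `p > n` check mirrors the remark above that least squares only makes sense when p does not exceed n.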
What does the regression coefficient β1 measure here? It measures how our linear
prediction of Y changes if we set the gender variable D from 0 to 1, holding the
controls W fixed. We can call this the predictive effect (PE), as it measures the impact
of a variable on the prediction we make. The PE is a measure of statistical dependence
or association between variables: it indicates that D predicts Y even after we linearly
partial out the controls W. The PE should not in general be interpreted as a causal or
treatment effect (TE), since correlation is not equivalent to causation. We shall study
the assumptions needed for causal interpretability of the estimates later in the course. An
important case where β1 measures the TE is that of randomized controlled trials, where
D is randomly assigned and is therefore independent of W.
In the population, define the partialling-out operator with respect to a vector W that takes a
random variable V such that $\mathrm{E}V^2 < \infty$ and creates $\tilde V$ according to the rule:
$$\tilde V = V - W'\gamma_{VW}, \qquad \gamma_{VW} = \arg\min_{b \in \mathbb{R}^{p_2}} \mathrm{E}(V - W'b)^2.$$
It is not difficult to see that the partialling-out operator is linear on the space of random
variables with finite second moments, i.e., for V and U such that $\mathrm{E}U^2 + \mathrm{E}V^2 < \infty$,
$$Y = V + U \implies \tilde Y = \tilde V + \tilde U.$$
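Thus we apply this operator to both sides of the identity (3.1), $Y = D'\beta_1 + W'\beta_2 + \varepsilon$, to get
$\tilde Y = \tilde D'\beta_1 + \tilde W'\beta_2 + \tilde\varepsilon$. Since $\tilde W = 0$ and $\tilde\varepsilon = \varepsilon$ (because $\mathrm{E}\varepsilon W = 0$), this
simplifies to
$$\tilde Y = \tilde D'\beta_1 + \varepsilon, \qquad \mathrm{E}\varepsilon\tilde D = 0. \tag{3.2}$$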
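Equation (3.2) states that $\mathrm{E}\varepsilon\tilde D = 0$, which is the first-order condition for the population
regression of $\tilde Y$ on $\tilde D$. That is, the projection coefficient $\beta_1$ can be recovered from the
regression of $\tilde Y$ on $\tilde D$:
$$\beta_1 = \arg\min_{b \in \mathbb{R}^{p_1}} \mathrm{E}(\tilde Y - \tilde D'b)^2 = (\mathrm{E}\tilde D\tilde D')^{-1}\mathrm{E}\tilde D\tilde Y.$$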
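Similarly to the population case, the sample partialling-out operator, which maps $V_i$ to the
residual $\check V_i$ from the sample least squares regression of $V_i$ on $W_i$, is linear. Thus, application
of the operator to the decomposition identity $Y_i \equiv D_i'\hat\beta_1 + W_i'\hat\beta_2 + \hat\varepsilon_i$ gives
$$\check Y_i = \check D_i'\hat\beta_1 + \hat\varepsilon_i, \qquad \mathbb{E}_n\check D\hat\varepsilon = 0,$$
since partialling out $W_i$ sets $W_i$ itself to zero and leaves the residual $\hat\varepsilon_i$ unchanged.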
The partialling-out operation defined above works well when the dimension of W is low
in relation to the sample size. When the dimension is high we need to use variable selection
or penalization for regularization purposes. We shall get to that later in the course.
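Theorem 1 (Frisch-Waugh-Lovell). Work with the set-up above. The population projection
coefficient $\beta_1$ can be recovered from the population regression of $\tilde Y$ on $\tilde D$:
$$\beta_1 = (\mathrm{E}\tilde D\tilde D')^{-1}\mathrm{E}\tilde D\tilde Y,$$
assuming $\mathrm{E}\tilde D\tilde D'$ is of full rank. The sample projection coefficient $\hat\beta_1$ can be recovered from the
sample regression of $\check Y_i$ on $\check D_i$:
$$\hat\beta_1 = (\mathbb{E}_n\check D\check D')^{-1}\mathbb{E}_n\check D\check Y,$$
assuming $\mathbb{E}_n\check D\check D'$ is of full rank.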
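The sample part of Theorem 1 can be checked numerically. The sketch below is an illustration on simulated data with a scalar D; the helper names (`solve`, `ols`, `residuals`) and the data-generating values are assumptions, not part of the notes. It compares the coefficient on D from the full regression with the one from the partialled-out regression.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ols(X, y):
    """Least squares coefficients from the normal equations."""
    p = len(X[0])
    XtX = [[sum(r[j] * r[k] for r in X) for k in range(p)] for j in range(p)]
    Xty = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(p)]
    return solve(XtX, Xty)

def residuals(X, y):
    """Sample partialling-out: residuals of y after regressing on X."""
    b = ols(X, y)
    return [yi - sum(bj * xj for bj, xj in zip(b, r)) for r, yi in zip(X, y)]

# Hypothetical simulated data with a scalar D and controls W (incl. a constant).
random.seed(1)
n = 300
W = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
D = [0.5 * w[1] - 0.3 * w[2] + random.gauss(0, 1) for w in W]
Y = [-0.2 * d + 1.0 + 0.7 * w[1] + 0.4 * w[2] + random.gauss(0, 1)
     for d, w in zip(D, W)]

# (a) Coefficient on D from the full regression of Y on X = (D, W).
beta1_full = ols([[d] + w for d, w in zip(D, W)], Y)[0]

# (b) Frisch-Waugh-Lovell: partial W out of both D and Y, then regress.
D_check, Y_check = residuals(W, D), residuals(W, Y)
beta1_fwl = (sum(d * y for d, y in zip(D_check, Y_check))
             / sum(d * d for d in D_check))

print(beta1_full, beta1_fwl)  # the two agree up to floating-point error
```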
It is of interest to examine the behavior of the estimator β̂1 . In what follows, we can
assume that dimension p1 of the target parameter β1 is fixed, but the dimension p2 of the
nuisance parameter β2 may grow with n but slowly enough so that p2 /n → 0. In practical
terms, the latter condition simply means that p2 is small compared to n.
Lemma 1 (Adaptivity Property for Partialling Out). Consider the sample projection
coefficient $\hat\beta_1$ obtained from the sample regression of $\check Y_i$ on $\check D_i$:
$$\hat\beta_1 = (\mathbb{E}_n\check D\check D')^{-1}\mathbb{E}_n\check D\check Y,$$
and the sample projection coefficient $\tilde\beta_1$ obtained from the sample regression of the infeasible
$\tilde Y_i$ on $\tilde D_i$:
$$\tilde\beta_1 = (\mathbb{E}_n\tilde D\tilde D')^{-1}\mathbb{E}_n\tilde D\tilde Y.$$
There exist regularity conditions such that, provided that the dimension $p_2$ is small compared
to n, namely
$$p_2/n \to 0,$$
we have the following asymptotic equivalence result:
$$\sqrt{n}(\hat\beta_1 - \tilde\beta_1) \to_{\mathrm{P}} 0.$$
That is, to first order the estimator is not affected by the estimation errors in the partialling-out
steps: these errors are asymptotically negligible.
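The following sketch illustrates the lemma on simulated data, under an assumed design in which the population residuals are known by construction. All names and parameter values are hypothetical. It compares the feasible estimator, which partials out W in the sample, with the infeasible one that uses the true population residuals.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def residualize(W, v):
    """Sample partialling-out: residuals from least squares of v on W."""
    p = len(W[0])
    WtW = [[sum(w[j] * w[k] for w in W) for k in range(p)] for j in range(p)]
    Wtv = [sum(w[j] * vi for w, vi in zip(W, v)) for j in range(p)]
    g = solve(WtW, Wtv)
    return [vi - sum(gj * wj for gj, wj in zip(g, w)) for w, vi in zip(W, v)]

def slope(d, y):
    """Least squares slope of y on a scalar regressor d (no intercept)."""
    return sum(a * b for a, b in zip(d, y)) / sum(a * a for a in d)

# Assumed design: W, D_tilde and eps are independent, mean-zero normals, so the
# population residuals are D_tilde and Y_tilde = beta1 * D_tilde + eps.
random.seed(2)
n, p2, beta1 = 2000, 3, 0.5
gamma_D = [0.4, -0.2, 0.1]
beta2 = [1.0, 0.5, -0.5]
W = [[random.gauss(0, 1) for _ in range(p2)] for _ in range(n)]
D_tilde = [random.gauss(0, 1) for _ in range(n)]
eps = [random.gauss(0, 1) for _ in range(n)]
D = [sum(g * w for g, w in zip(gamma_D, wi)) + dt for wi, dt in zip(W, D_tilde)]
Y = [beta1 * d + sum(b * w for b, w in zip(beta2, wi)) + e
     for d, wi, e in zip(D, W, eps)]
Y_tilde = [beta1 * dt + e for dt, e in zip(D_tilde, eps)]

beta1_hat = slope(residualize(W, D), residualize(W, Y))  # feasible
beta1_tilde = slope(D_tilde, Y_tilde)                    # infeasible
gap = math.sqrt(n) * (beta1_hat - beta1_tilde)
print(beta1_hat, beta1_tilde, gap)  # gap should be small relative to 1
```

With p2 = 3 and n = 2000 the scaled gap is expected to be of much smaller order than the sampling spread of either estimator, which is the content of the lemma.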
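We have that
$$\begin{aligned}
\tilde\beta_1 - \beta_1 &= (\mathbb{E}_n\tilde D\tilde D')^{-1}\mathbb{E}_n\tilde D\tilde Y - \beta_1 && (4.1)\\
&= (\mathbb{E}_n\tilde D\tilde D')^{-1}\mathbb{E}_n\tilde D(\tilde D'\beta_1 + \varepsilon) - \beta_1 && (4.2)\\
&= (\mathbb{E}_n\tilde D\tilde D')^{-1}\mathbb{E}_n\tilde D\varepsilon. && (4.3)
\end{aligned}$$
Then we conclude that under mild regularity conditions
$$\sqrt{n}(\tilde\beta_1 - \beta_1) \overset{a}{\sim} N(0, V_{11}),$$
where $\overset{a}{\sim}$ reads as “approximately distributed” and
$$V_{11} = (\mathrm{E}\tilde D\tilde D')^{-1}\,\mathrm{Var}(\sqrt{n}\,\mathbb{E}_n\tilde D\varepsilon)\,(\mathrm{E}\tilde D\tilde D')^{-1}.$$
Given the equivalence stated in Lemma 1 above, we further conclude that
$$\sqrt{n}(\hat\beta_1 - \beta_1) \overset{a}{\sim} N(0, V_{11}).$$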
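Theorem 2. There exist regularity conditions such that, provided that $p_2/n \to 0$, we have
that
$$\sqrt{n}(\hat\beta_1 - \beta_1) \overset{a}{\sim} N(0, V_{11}),$$
as $n \to \infty$, namely that
$$\sup_{A \in \mathcal{A}} \Big|\mathrm{P}\big(\sqrt{n}(\hat\beta_1 - \beta_1) \in A\big) - \mathrm{P}\big(N(0, V_{11}) \in A\big)\Big| \to 0,$$
where $\mathcal{A}$ is a collection of sets in $\mathbb{R}^{p_1}$ (e.g. convex sets or rectangles).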
The proof of this result is simple under fixed p2 and is rather technical when p2 → ∞, so
we won’t pursue it here, but conceptually it is a more technical version of the result under
fixed p asymptotics that you have seen in the introductory regression course.
Remark 1. Alternatively, the result above could also be derived or conjectured from the
statement that the whole parameter vector is approximately normally distributed as
follows:
$$\sqrt{n}(\hat\beta - \beta) \overset{a}{\sim} N(0, V). \tag{4.4}$$
Here
$$V = Q^{-1}\Omega Q^{-1}, \qquad Q = \mathrm{E}XX', \qquad \Omega = \mathrm{Var}(\sqrt{n}\,\mathbb{E}_n X\varepsilon).$$
Then $V_{11}$ corresponds to the $p_1 \times p_1$ upper-left block of V. This result is straightforward
when p is fixed as $n \to \infty$. On the other hand, when p is increasing with n, proving that
the whole p-dimensional parameter vector $\sqrt{n}(\hat\beta - \beta)$ is normally distributed is usually
much more demanding, in terms of regularity conditions and the sense in which normal
approximations hold.
We shall rely on a suitable estimator $\hat V_{11}$ of $V_{11}$, for example, the White estimator under
independent sampling or the Newey-West estimator for the time series case. We then shall
use the normal law $N(0, \hat V_{11}/n)$ for quantification of uncertainty about $\beta_1$, that is, for building
confidence bands for $\beta_1$ and various functionals of $\beta$. With $\hat V_{11}$ used instead of $V_{11}$, the
statement $\sqrt{n}(\hat\beta_1 - \beta_1) \overset{a}{\sim} N(0, \hat V_{11})$ is defined to mean the following:
$$\sup_{A \in \mathcal{A}} \Big|\mathrm{P}\big(\sqrt{n}(\hat\beta_1 - \beta_1) \in A\big) - \mathrm{P}\big(N(0, \bar V) \in A\big)\big|_{\bar V = \hat V_{11}}\Big| \to_{\mathrm{P}} 0.$$
Basically, we just insert $\hat V_{11}$ wherever $V_{11}$ previously appeared and we require the same
statements to hold stochastically.
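Lemma 2 (Using Estimated Variance is Ok). Suppose that $\sqrt{n}(\hat\beta_1 - \beta_1) \overset{a}{\sim} N(0, V_{11})$ and $\hat V_{11}$
is consistent for $V_{11}$, namely $\hat V_{11}V_{11}^{-1} \to_{\mathrm{P}} I$, and $V_{11}$ is bounded away from zero. Then
$\sqrt{n}(\hat\beta_1 - \beta_1) \overset{a}{\sim} N(0, \hat V_{11})$.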
This lemma is a consequence of the Gaussian vector N (0, V11 ) having bounded density,
so that estimation errors in V̂11 have a negligible effect on probabilities of the containment
events.
Suppose $\beta_1$ is scalar or we are interested in its j-th component $\beta_{1j}$. The above results
mean that we can report $\sqrt{\hat V_{11,jj}/n}$ as the (estimated) standard error for $\hat\beta_{1j}$, and report
$$[\ell_j, u_j] = \Big[\hat\beta_{1j} - z\sqrt{\hat V_{11,jj}/n},\ \hat\beta_{1j} + z\sqrt{\hat V_{11,jj}/n}\Big],$$
where z is the $(1 - \alpha/2)$-quantile of the standard normal variable N(0, 1), as the approximate
$(1 - \alpha) \times 100\%$ confidence interval for $\beta_{1j}$. That this is a confidence interval follows from
a more general result we discuss below.
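As a hedged sketch of how such an interval might be computed for a scalar target coefficient, the code below uses a heteroskedasticity-robust variance estimate of the basic White (HC0) form, written in terms of the partialled-out regressor. The simulated design and all names are assumptions for illustration; in practice one would typically rely on an established routine such as the R package sandwich mentioned later in these notes.

```python
import math
import random
from statistics import NormalDist

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def residualize(W, v):
    """Sample partialling-out: residuals from least squares of v on W."""
    p = len(W[0])
    WtW = [[sum(w[j] * w[k] for w in W) for k in range(p)] for j in range(p)]
    Wtv = [sum(w[j] * vi for w, vi in zip(W, v)) for j in range(p)]
    g = solve(WtW, Wtv)
    return [vi - sum(gj * wj for gj, wj in zip(g, w)) for w, vi in zip(W, v)]

# Hypothetical heteroskedastic design with a scalar D.
random.seed(3)
n, beta1, alpha = 1000, -0.2, 0.05
W = [[1.0, random.gauss(0, 1)] for _ in range(n)]
D = [0.5 * w[1] + random.gauss(0, 1) for w in W]
Y = [beta1 * d + 1.0 + 0.3 * w[1] + (0.5 + abs(w[1])) * random.gauss(0, 1)
     for d, w in zip(D, W)]

D_check, Y_check = residualize(W, D), residualize(W, Y)
S_dd = sum(d * d for d in D_check) / n                        # E_n D_check^2
beta1_hat = sum(d * y for d, y in zip(D_check, Y_check)) / n / S_dd
resid = [y - beta1_hat * d for d, y in zip(D_check, Y_check)]  # OLS residuals

# White (HC0) form: V11_hat = E_n[D_check^2 * resid^2] / (E_n D_check^2)^2.
V11_hat = sum((d * e) ** 2 for d, e in zip(D_check, resid)) / n / S_dd ** 2
se = math.sqrt(V11_hat / n)
z = NormalDist().inv_cdf(1 - alpha / 2)  # (1 - alpha/2)-quantile of N(0, 1)
ci = (beta1_hat - z * se, beta1_hat + z * se)
print(beta1_hat, se, ci)
```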
We consider an empirical application to the gender wage gap using data from the U.S.
March Supplement of the Current Population Survey (CPS) in 2015. We select white
non-Hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week
during at least 50 weeks of the year. We exclude self-employed workers; individuals living in
group quarters; individuals in the military, agricultural or private household sectors;
individuals with inconsistent reports on earnings and employment status; individuals with
allocated or missing information in any of the variables used in the analysis; and
individuals with hourly wage below $3.¹ The resulting sample consists of 32,523 workers, including
18,137 men and 14,386 women. The variable of interest Y is the logarithm of the hourly
wage rate constructed as the ratio of the annual earnings to the total number of hours
worked, which is constructed in turn as the product of the number of weeks worked and the
usual number of hours worked per week. Table 1 reports descriptive statistics for the
variables used in the analysis. Working women are less likely to be married and more highly
educated than working men, but have slightly less experience. The unconditional average
gender wage gap is 24%.
¹ The sample selection criteria are similar to [5].
To estimate the gender wage gap, we consider the linear regression model:
$$Y = \beta_1 D + \beta_2'W + \varepsilon,$$
where Y is the log hourly rate, D is an indicator for female worker, and W is a set of
p = 1,082 controls including 5 marital status indicators (widowed, divorced, separated,
never married, and married); 5 educational attainment indicators (less than high school
graduates, high school graduates, some college, college graduate, and advanced degree);
4 region indicators (midwest, south, west, and northeast); and a quartic in potential
experience, constructed as the maximum of age minus years of schooling minus 7 and zero, i.e.,
max(age − years of schooling − 7, 0).
Table 2 reports the results of a regression analysis using the CPS data. The first row
reports the coefficient of D from the OLS regression of Y on D; the second row reports
the coefficient of D from the OLS regression of Y on X = (D, W), which is the same
estimate that the Frisch-Waugh-Lovell theorem delivers by partialling out the controls
via OLS; and the third row reports the coefficient of D using a variant of the procedure in
[1] that partials out the controls via Lasso instead of OLS.⁴ We will study this procedure
later in the course. All the standard errors are computed with the R package sandwich and
are robust to heteroskedasticity. Using Lasso for partialling out here gives similar results
to using OLS. Lasso is a penalized OLS estimator, and it produces high-quality estimates
of the regression function, especially in high-dimensional settings. The penalty takes
the form of the sum of the absolute values of the coefficients times a penalty level.
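To make the verbal description of the penalty concrete, one common way of writing the Lasso problem is sketched below. The symbol λ for the penalty level and the exact scaling of the penalty term are assumptions made here for illustration; implementations may scale the penalty differently or attach coefficient-specific weights.

```latex
\hat\beta^{\mathrm{Lasso}} \in \arg\min_{b \in \mathbb{R}^p} \ \mathbb{E}_n (Y - X'b)^2 + \lambda \|b\|_1,
\qquad \|b\|_1 = \sum_{j=1}^p |b_j| .
```

Setting λ = 0 in this display returns the least squares problem of Section 2.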
What do the estimated regression coefficients β1 measure here? The first row measures
the unconditional gender gap, i.e. the difference in the average wage of working women
and men. The rest measure how our linear prediction of the wage changes if we set the gender
variable D from 0 to 1, holding the controls W fixed. We can call this the predictive effect
(PE), as it measures the impact of a variable on the prediction we make. The PE should
not in general be interpreted as a causal or treatment effect (TE), since correlation is not
equivalent to causation. A causal interpretation of the PE here could suggest that β1 is solely
a measure of discrimination, while in reality it may reflect discrimination, selection effects
(e.g., sorting of women and men into different occupations), sample imbalances, etc. In
this case the unconditional wage gap for women of 24% decreases to around 19-20% after
controlling for worker characteristics.
² The occupation categories are: management; business and financial operations; computer and
mathematics; architecture and engineering; life, physical, and social science; community and social service; legal;
education, training, and library; arts, design, entertainment, sports, and media; healthcare practitioners and
technical; healthcare support; protective service; food preparation and serving; building and grounds
cleaning and maintenance; personal care and service; sales; office and administrative support; farming, fishing,
and forestry; construction and extraction; installation, maintenance, and repair occupations; production; and
transportation and material moving.
³ The industry categories are: mining; utilities; construction; nondurable goods manufacturing; durable
goods manufacturing; durable goods wholesale; nondurable goods wholesale; retail trade; transportation
and warehousing; information; finance and insurance; real estate, rental and leasing; professional, scientific,
and technical services; management of companies and enterprises; administrative, support and waste
management services; educational services; health care and social assistance; arts, entertainment, and recreation;
accommodation and food services; other services except public administration; and public administration.
⁴ We use the R package hdm to obtain the estimates in the third row.
We repeat the analysis for the more homogeneous subpopulation of never married
workers. Table 3 reports descriptive statistics for the corresponding subsample from the CPS
2015 data. There are 5,150 never married workers, 2,861 men and 2,289 women. Never
married working women are also relatively more educated than working men. Compared
to Table 1, never married workers earn lower average wages and have much less
experience than the rest of the workers. The regression analysis in Table 4 shows that the
unconditional gender wage gap is less than 4% for this group. This gap increases to
6-7% once we control for worker characteristics.⁵ A possible explanation of the lower wage
gap for never married working women could be related to fertility and childcare decisions.
Never married women are young and less likely to have children. They can therefore
be more career oriented and have working experiences not interrupted by childbearing or
childcare.
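Suppose we want to build simultaneous confidence bands for all the components $(\beta_{1j})_{j=1}^{p_1}$
of $\beta_1$, that is, bands
$$[\ell_j, u_j] = \Big[\hat\beta_{1j} - c\sqrt{V_{11,jj}/n},\ \hat\beta_{1j} + c\sqrt{V_{11,jj}/n}\Big], \quad j = 1, \dots, p_1,$$
such that
$$\mathrm{P}(\beta_{1j} \in [\ell_j, u_j] \text{ for all } j) \to 1 - \alpha,$$
where we set the critical value c such that the previous display holds. Here we use $V_{11,jj}$
to denote the (j, j)-th element of the matrix $V_{11}$.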
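The critical value c is the $(1 - \alpha)$-quantile of
$$\|N(0, C)\|_\infty, \qquad C = S^{-1/2}V_{11}S^{-1/2},$$
where $S = \mathrm{diag}(V_{11})$ is the diagonal matrix with the diagonal of $V_{11}$ on its diagonal and
zeroes elsewhere, so that C is the correlation matrix associated with $V_{11}$. The constant c can
be approximated by simulation.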
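One way the simulation could be carried out is sketched below: draw many Gaussian vectors with correlation matrix C, record the largest absolute component of each draw, and take the empirical (1 − α)-quantile. The function names, the number of draws, and the example matrix are assumptions for illustration only.

```python
import math
import random

def cholesky(A):
    """Lower-triangular L with L L' = A, for a positive definite matrix A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def sup_norm_quantile(V, alpha, draws=20000, seed=0):
    """Simulate the (1 - alpha)-quantile of the max norm of N(0, C),
    where C is the correlation matrix associated with V."""
    p = len(V)
    sd = [math.sqrt(V[j][j]) for j in range(p)]
    C = [[V[j][k] / (sd[j] * sd[k]) for k in range(p)] for j in range(p)]
    L = cholesky(C)
    rng = random.Random(seed)
    maxima = []
    for _ in range(draws):
        z = [rng.gauss(0, 1) for _ in range(p)]
        g = [sum(L[j][k] * z[k] for k in range(j + 1)) for j in range(p)]
        maxima.append(max(abs(x) for x in g))
    maxima.sort()
    return maxima[math.ceil((1 - alpha) * draws) - 1]

# Hypothetical 3 x 3 variance matrix (standard deviations 2, 3, 1;
# correlations 0.6, 0.5, 0.4).
V11 = [[4.0, 3.6, 1.0],
       [3.6, 9.0, 1.2],
       [1.0, 1.2, 1.0]]
c = sup_norm_quantile(V11, alpha=0.10)
print(c)  # lies between the pointwise and the Bonferroni critical values
```

Because only the correlation matrix enters, rescaling any component of the variance matrix leaves the simulated critical value unchanged.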
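To see why this choice of c works, note that
$$\begin{aligned}
\mathrm{P}(\beta_{1j} \in [\ell_j, u_j] \text{ for all } j) &= \mathrm{P}\big(\sqrt{n}(\hat\beta_1 - \beta_1) \in S^{1/2}[-c, c]^{p_1}\big)\\
&= \mathrm{P}\big(N(0, V_{11}) \in S^{1/2}[-c, c]^{p_1}\big) + o(1)\\
&= \mathrm{P}\big(\|N(0, C)\|_\infty \le c\big) + o(1) = 1 - \alpha + o(1),
\end{aligned}$$
where the second equality holds by (6.1), because $S^{1/2}[-c, c]^{p_1}$ is a rectangular set in $\mathbb{R}^{p_1}$.
Note that in practice we shall need to replace $V_{11}$ with a consistent estimator $\hat V_{11}$. This
replacement does not affect the approximate coverage property of the confidence regions
in view of Lemma 2.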
Theorem 3 (Joint Confidence Band For Target Coefficients). Suppose that
$\hat\beta_1 - \beta_1 \overset{a}{\sim} N(0, V_{11}/n)$ in the sense of (6.1). We have that the confidence band
$$[\ell_j, u_j] = \Big[\hat\beta_{1j} - c\sqrt{V_{11,jj}/n},\ \hat\beta_{1j} + c\sqrt{V_{11,jj}/n}\Big], \quad j = 1, \dots, p_1,$$
with c the $(1 - \alpha)$-quantile of $\|N(0, C)\|_\infty$, where C is the correlation matrix associated with $V_{11}$,
jointly covers all target parameter values $(\beta_{1j})_{j=1}^{p_1}$ with probability approaching the nominal
level, that is, as $n \to \infty$,
$$\mathrm{P}(\beta_{1j} \in [\ell_j, u_j] \text{ for all } j) \to 1 - \alpha.$$
The results continue to hold if $V_{11}$ is replaced by $\hat V_{11}$ such that $\hat V_{11}V_{11}^{-1} \to_{\mathrm{P}} I$, and $V_{11}$ is
bounded away from zero.
Here we re-analyze the Pennsylvania re-employment bonus experiment, which was
previously studied in [2], among others. Note that the inferential results on simultaneous
bands we report below will be new. These experiments were conducted in the 1980s by the
U.S. Department of Labor to test the incentive effects of alternative compensation schemes
for unemployment insurance (UI). In these experiments, UI claimants were randomly
assigned either to a control group or one of five treatment groups.⁶ In the control group the
current rules of the UI applied. Individuals in the treatment groups were offered a cash
bonus if they found a job within some pre-specified period of time (qualification period),
provided that the job was retained for a specified duration. The treatments differed in
the level of the bonus, the length of the qualification period, and whether the bonus was
declining over time in the qualification period; see [2] for further details on data.
⁶ There are six treatment groups in the experiments. Following [2] we merge the groups 4 and 6.
Figure 7 shows 90% confidence intervals for the five treatment effects β1,
constructed using a sample of 13,913 observations.
• The critical value for the simultaneous bands, c = 2.27, is greater than the
pointwise critical value, 1.65.
• It is less than the critical value from the Bonferroni correction, 2.33, obtained as the
$(1 - \bar\alpha/2)$-quantile of the normal distribution with $\bar\alpha = \alpha/5$. The idea of Bonferroni
• In this case, of the three treatment levels with a statistically significant effect on
unemployment duration based on pointwise confidence intervals, only one remains
significant after accounting for simultaneous inference.
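The pointwise and Bonferroni critical values quoted in the list above can be reproduced with a short calculation. This is a minimal sketch using only Python's standard library; the variable names are illustrative, and the value 2.27 is simply copied from the text rather than recomputed.

```python
from statistics import NormalDist

alpha = 0.10  # 90% confidence level, as in the text
p1 = 5        # number of treatment effects

# Pointwise critical value: (1 - alpha/2)-quantile of N(0, 1).
z_pointwise = NormalDist().inv_cdf(1 - alpha / 2)

# Bonferroni critical value: same quantile with alpha_bar = alpha / p1.
alpha_bar = alpha / p1
z_bonferroni = NormalDist().inv_cdf(1 - alpha_bar / 2)

c_simultaneous = 2.27  # reported in the text, not recomputed here
print(z_pointwise, z_bonferroni)
```

The two computed values are approximately 1.645 and 2.326, which correspond to the rounded figures 1.65 and 2.33 quoted above, and the simultaneous critical value lies between them.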
The last observation illustrates how econometrics and this class offer better concepts and
tools than standard empirical practice often uses. It also explains why econometricians
sometimes teach what “people never use in practice”: they simply teach the correct
things to use, and it is up to you to decide whether you want to do wrong or correct things
in practice.
Next we consider a more flexible version of the basic model, where we take the controls
to include the original set of controls as well as all two-way interactions, giving
us a total of p = 120 controls. We repeat the exercise above and reach roughly
similar conclusions. Figure 7 shows 90% confidence intervals for the five treatment effects
β1. We see that the addition of many more controls does not change the inferential results
noticeably. This highlights the robustness of the conclusions with respect to enriching the
set of controls, and is also in line with our asymptotic theory, which states that the
inference is not impacted in the regime where the number of controls p is much smaller than
n, despite the fact that the number of controls is substantial here.
Notes
Least squares was first published by Legendre in 1805, although Gauss claimed the
credit. Frisch and Waugh discovered the partialling-out interpretation of the least
squares coefficients in the 1930s, and Lovell extended it in the 1960s. The adaptivity
result of Lemma 1 has gone unnoticed by empiricists and has also managed to escape
statistics and econometrics textbooks, which is why we note this property here. Regularity
conditions under which Lemma 1 and Theorem 2 hold under fixed p asymptotics can be found
in the introductory econometrics texts, for example, [6], and under p → ∞ and p/n → 0
asymptotics in [4] and [3]. The results of the latter reference allow for p/n → c, which
introduces an additional asymptotic variance term, and the case with c = 0 recovers Theorem 2.
Problems
(1) Briefly explain partialling-out and the adaptivity property for the linear regression
model, and use the gender wage gap data to illustrate your points. Present your
results as a brief section of a professionally done empirical paper.
[Figure: two panels, each plotting the Average Effect (log of weeks) against Treatment (T1 to T5).]
(2) Briefly explain the idea of joint confidence bands, and use the Penn data to replicate
the second set of results of our re-analysis of the Pennsylvania re-employment
experiment. Present your results as a brief section of a professionally done empirical
paper.
(3) In the wage gap example and reemployment experiment, discuss whether the
empirical results have a causal or treatment effect interpretation. Does the estimated
wage gap measure discrimination? Perhaps in part? Do the reductions in
unemployment duration have a causal meaning? Present your discussion as a brief
section of a professionally done empirical paper.
(4) Explain why in randomized controlled trials, where the assigned treatment D is
independent of the controls W, we can estimate the linear predictive effect of D on Y
controlling linearly for W without actually controlling for W. However, including
W may still be a good idea, because using W can lower (and does not increase) the
asymptotic variance of the least squares estimator.
(5) Prove that the population partialling-out operator is linear on the space of random
variables with finite second moments, i.e., for V and U such that $\mathrm{E}U^2 + \mathrm{E}V^2 < \infty$,
$$Y = V + U \implies \tilde Y = \tilde V + \tilde U.$$
(6) Provide a set of sufficient regularity conditions for Lemma 1 and Theorem 1 and
prove them. Extra credit is given for handling the case where p2 → ∞ as n → ∞,
but don’t spend too much time on this, as this is difficult.
References
[1] Belloni, A., V. Chernozhukov, and C. Hansen (2014): “Inference on Treatment Effects
After Selection Amongst High-Dimensional Controls,” Review of Economic Studies,
81, 608–650.
[2] Bilias, Y. (2000): “Sequential Testing of Duration Data: The Case of the Pennsylvania
Reemployment Bonus Experiment,” Journal of Applied Econometrics, 15(6), 575–594.
[3] Cattaneo, M. D., M. Jansson, and W. K. Newey (2015): “Inference in Linear Regression
Models with Many Covariates and Heteroskedasticity,” ArXiv e-prints.
[4] Chernozhukov, V., D. Chetverikov, and K. Kato (2015): “Some New Asymptotic Theory
for Least Squares Series: Pointwise and Uniform Results,” Journal of Econometrics, 186,
345–366.
[5] Mulligan, C. B., and Y. Rubinstein (2008): “Selection, Investment, and Women’s Relative
Wages Over Time,” The Quarterly Journal of Economics, 123(3), 1061–1110.
[6] White, H. (2014): Asymptotic Theory for Econometricians. Academic Press.
MIT OpenCourseWare
https://ocw.mit.edu
14.382 Econometrics
Spring 2017
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.