14.382 Inference: Creative Commons BY-NC-SA

Victor Chernozhukov and Ivan Fernandez-Val. 14.382 Econometrics. Spring 2017.
Massachusetts Institute
of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA.
14.382 L1. LEAST SQUARES, ADAPTIVE PARTIALLING-OUT, SIMULTANEOUS INFERENCE

VICTOR CHERNOZHUKOV AND IVÁN FERNÁNDEZ-VAL
Abstract. Here we give an overview of least squares from several interesting angles. We discuss Frisch-Waugh-Lovell partialling-out and point out its adaptivity property in establishing approximate normality of the regression estimators of a set of target regression coefficients. We then discuss the construction of simultaneous confidence sets for this set of coefficients. We use these methods to analyze the gender wage gap and the impact of reemployment incentives on the duration of unemployment.
1. Notation
For two sequences of real numbers, $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$, the notation $a_n \lesssim b_n$ means that $a_n \le C b_n$ for all $n$, for some constant $C$ that does not depend on $n$. For a vector $v = (v_1, v_2, \ldots, v_k)' \in \mathbb{R}^k$, the $\ell_2$ and $\ell_1$ norms are denoted by $\|\cdot\|_2$ (or simply $\|\cdot\|$) and $\|\cdot\|_1$, respectively:
$$\|v\|_2 := \Big(\sum_{i=1}^{k} v_i^2\Big)^{1/2}, \qquad \|v\|_1 := \sum_{i=1}^{k} |v_i|.$$
The $\ell_0$-“norm”, $\|\cdot\|_0$, denotes the number of non-zero components of a vector, and $\|\cdot\|_\infty$ denotes the max norm:
$$\|v\|_0 := \sum_{i=1}^{k} 1\{v_i \neq 0\}, \qquad \|v\|_\infty := \max\{|v_i| : i \in \{1, \ldots, k\}\}.$$
When applied to a matrix, $\|\cdot\|$ denotes the operator norm, namely
$$\|A\| := \max\{\|Av\| : \|v\| \le 1\}.$$
We use the notation $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. We use $x'$ to denote the transpose of a column vector $x$. In what follows, the notation $\mathbb{E}_n f(W)$ abbreviates the empirical expectation of $f(W)$ as $W$ ranges over the sample $(W_i)_{i=1}^{n}$:
$$\mathbb{E}_n f(W) = \frac{1}{n} \sum_{i=1}^{n} f(W_i).$$
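To make the notation concrete, the following is a minimal sketch in Python with NumPy (an assumption of this illustration, not something the notes use). The vector `v`, the matrix `A`, the sample, and the helper `empirical_expectation` are all hypothetical examples.

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])

l2_norm = np.sqrt(np.sum(v**2))      # ||v||_2
l1_norm = np.sum(np.abs(v))          # ||v||_1
l0_norm = np.count_nonzero(v)        # ||v||_0: number of non-zero components
max_norm = np.max(np.abs(v))         # ||v||_inf

# Operator norm of a matrix: max of ||Av|| over ||v|| <= 1,
# which equals the largest singular value of A.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
op_norm = np.linalg.norm(A, 2)

def empirical_expectation(f, sample):
    """E_n f(W): the average of f(W_i) over the sample."""
    return np.mean([f(w) for w in sample])

sample = np.array([1.0, 2.0, 3.0, 4.0])
en_square = empirical_expectation(lambda w: w**2, sample)
```

Here `np.linalg.norm(A, 2)` returns the spectral norm, which coincides with the operator norm defined above when vectors are measured in the $\ell_2$ norm.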
2. Least Squares
Let $Y$ be a scalar random variable and $X$ be a $p$-vector of covariates called regressors. We observe $n$ i.i.d. copies $\{(Y_i, X_i')\}_{i=1}^{n}$ of $(Y, X')$. Note that independence is not needed in many places, as is clear from the context. Throughout we assume that $\mathrm{E}Y^2$ and $\mathrm{E}XX'$ are finite.
We then define the least squares or projection parameter $\beta$ in the population as the solution of the following prediction problem:
$$\beta := \arg\min_{b \in \mathbb{R}^p} \mathrm{E}(Y - X'b)^2,$$
where $\beta$ obeys the first-order condition
$$\mathrm{E}(Y - X'\beta)X = 0$$
and, provided that $\mathrm{E}XX'$ is of full rank, which amounts to the absence of multicollinearity, has the closed-form expression
$$\beta = (\mathrm{E}XX')^{-1}\mathrm{E}XY.$$
Defining $\varepsilon = Y - X'\beta$, we obtain the decomposition identity
$$Y \equiv X'\beta + \varepsilon, \qquad \mathrm{E}\varepsilon X = 0.$$
Observe that we did not need any linearity assumption to obtain this decomposition.
We define the least squares estimator or projection estimator $\hat\beta$ in the sample as the solution of the following prediction problem:
$$\hat\beta := \arg\min_{b \in \mathbb{R}^p} \mathbb{E}_n(Y - X'b)^2,$$
which obeys the first-order condition
$$\mathbb{E}_n(Y - X'\hat\beta)X = 0$$
and has the closed-form solution
$$\hat\beta = (\mathbb{E}_n XX')^{-1}\mathbb{E}_n XY,$$
provided that $\mathbb{E}_n XX'$ is of full rank, which amounts to the absence of multicollinearity in the sample. Defining $\hat\varepsilon_i = Y_i - X_i'\hat\beta$, we obtain the decomposition identity
$$Y_i \equiv X_i'\hat\beta + \hat\varepsilon_i, \qquad \mathbb{E}_n \hat\varepsilon X = 0.$$
Note that the least squares estimator makes sense only if p is not bigger than n. If p > n
other estimators must be used, for example, penalized least squares estimators or post-
selection least squares estimators.
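As an illustration of the closed form and the sample first-order condition, here is a small sketch in Python with NumPy. The simulated design is hypothetical and deliberately nonlinear in the regressors, to stress that the projection is defined without any linearity assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
# Y is nonlinear in X on purpose: the projection coefficient still exists.
Y = np.exp(0.5 * X[:, 1]) + X[:, 2] + rng.normal(size=n)

# beta_hat = (E_n XX')^{-1} E_n XY, computed by solving the normal equations.
Exx = X.T @ X / n
Exy = X.T @ Y / n
beta_hat = np.linalg.solve(Exx, Exy)

# Sample first-order condition: E_n[eps_hat X] is zero up to rounding error.
resid = Y - X @ beta_hat
foc = X.T @ resid / n
```

Solving the normal equations with `np.linalg.solve` mirrors the formula in the text; in applied work a QR-based or `lstsq` solver is usually preferred for numerical stability.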
3. Partialling Out. Frisch-Waugh-Lovell Theorem
This is an important tool that provides a conceptual understanding of least squares as well as a very practical tool for estimation and visualization of results. We partition the vector of regressors $X$ into two groups:
$$X = (D', W')',$$
where the $p_1$-dimensional subvector $D$ represents the “target” regressors of interest, and the $p_2$-dimensional subvector $W$ represents other regressors, sometimes called the controls. For example, in gender wage gap analysis, $Y$ is the wage, $D$ is the gender indicator, and $W$ collects various other variables explaining variation in wages. In program evaluation, $D$ is often a treatment or policy variable and $W$ are controls. Write
Y = D' β1 + W ' β2 + ε. (3.1)
What does the regression coefficient $\beta_1$ measure here? It measures how our linear prediction of $Y$ changes if we set the gender variable $D$ from 0 to 1, holding the controls $W$ fixed. We can call this the predictive effect (PE), as it measures the impact of a variable on the prediction we make. The PE is a measure of statistical dependence or association between variables, suggesting that $D$ predicts $Y$ even after we linearly partial out the controls $W$. The PE should not in general be interpreted as a causal or treatment effect (TE), since correlation is not equivalent to causation. We shall study the assumptions needed for a causal interpretation of the estimates later in the course. An important case where $\beta_1$ measures the TE is that of randomized controlled trials, where $D$ is randomly assigned and is therefore independent of $W$.
In the population, define the partialling-out operator with respect to a vector $W$ as the operator that takes a random variable $V$ such that $\mathrm{E}V^2 < \infty$ and creates $\tilde V$ according to the rule:
$$\tilde V = V - W'\gamma_{VW}, \qquad \gamma_{VW} = \arg\min_{b \in \mathbb{R}^{p_2}} \mathrm{E}(V - W'b)^2.$$
When $V$ is a vector, we interpret the application of the operator as componentwise. The vector $W$ needs to have finite second moments in order for this to be well-defined.
It is not difficult to see that the partialling-out operator is linear on the space of random variables with finite second moments, i.e., for $V$ and $U$ such that $\mathrm{E}U^2 + \mathrm{E}V^2 < \infty$,
$$Y = V + U \implies \tilde Y = \tilde V + \tilde U.$$
Thus we apply this operator to both sides of the identity (3.1) to get:
Ỹ = D̃' β1 + W̃ ' β2 + ε̃,

which implies that
Ỹ = D̃' β1 + ε, EεD̃ = 0. (3.2)
The last line follows from $\tilde W = 0$, which holds by definition, and $\tilde\varepsilon = \varepsilon$, which holds because of the orthogonality $\mathrm{E}\varepsilon X = 0$; moreover, since $\tilde D$ is a linear combination of components of $X$, we have that $\mathrm{E}\varepsilon\tilde D = 0$.
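Equation (3.2) states that $\mathrm{E}\varepsilon\tilde D = 0$, which is the first-order condition for the population regression of $\tilde Y$ on $\tilde D$. That is, the projection coefficient $\beta_1$ can be recovered from the regression of $\tilde Y$ on $\tilde D$:
$$\beta_1 = \arg\min_{b \in \mathbb{R}^{p_1}} \mathrm{E}(\tilde Y - \tilde D'b)^2 = (\mathrm{E}\tilde D\tilde D')^{-1}\mathrm{E}\tilde D\tilde Y.$$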
This is a remarkable fact, known as the Frisch-Waugh-Lovell (FWL) theorem. It asserts that $\beta_1$ is a regression coefficient of $Y$ on $D$ after partialling-out the linear effect of $W$ from $Y$ and $D$. In other words, it measures linearly the predictive effect (PE) of $D$ on $Y$, after taking out the linear predictive effect of $W$ on both of these variables.
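In the sample, the partialling-out operation works similarly. Define it as an operator that converts $V_i$ into $\check V_i$ via
$$\check V_i = V_i - W_i'\hat\gamma_{VW}, \qquad \hat\gamma_{VW} = \arg\min_{b \in \mathbb{R}^{p_2}} \mathbb{E}_n(V - W'b)^2.$$
Similarly to the population case, the operator is linear. Thus, application of the operator to the decomposition identity $Y_i \equiv D_i'\hat\beta_1 + W_i'\hat\beta_2 + \hat\varepsilon_i$ gives
$$\check Y_i = \check D_i'\hat\beta_1 + \hat\varepsilon_i, \qquad \mathbb{E}_n\hat\varepsilon\check D = 0.$$
This implies that
$$\hat\beta_1 = \arg\min_{b \in \mathbb{R}^{p_1}} \mathbb{E}_n(\check Y - \check D'b)^2 = (\mathbb{E}_n\check D\check D')^{-1}\mathbb{E}_n\check D\check Y.$$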
This is the sample version of the FWL Theorem.
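The sample FWL identity can be checked numerically. The sketch below uses Python with NumPy on a simulated data set; the design and the helper names `ols` and `partial_out` are hypothetical and serve only this illustration.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def partial_out(V, W):
    """Sample partialling-out: residual of V after least squares on W."""
    return V - W @ ols(W, V)

rng = np.random.default_rng(1)
n = 1000
W = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
D = (W[:, 1] + rng.normal(size=n) > 0).astype(float)   # dummy correlated with W
Y = -0.2 * D + W @ np.array([1.0, 0.5, -0.3, 0.2, 0.1]) + rng.normal(size=n)

# Long regression of Y on X = (D, W): the first coefficient is beta_hat_1.
beta_long = ols(np.column_stack([D, W]), Y)[0]

# FWL route: regress the residualized Y on the residualized D.
D_check = partial_out(D, W)
Y_check = partial_out(Y, W)
beta_fwl = (D_check @ Y_check) / (D_check @ D_check)
```

The two numbers `beta_long` and `beta_fwl` agree up to floating-point error, which is exactly what the sample version of the theorem asserts.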
The partialling-out operation defined above works well when the dimension of W is low
in relation to the sample size. When the dimension is high we need to use variable selection
or penalization for regularization purposes. We shall get to that later in the course.
We summarize the discussion as a theorem.

Theorem 1 (Frisch-Waugh-Lovell). Work with the set-up above. The population projection coefficient $\beta_1$ can be recovered from the population regression of $\tilde Y$ on $\tilde D$:
$$\beta_1 = (\mathrm{E}\tilde D\tilde D')^{-1}\mathrm{E}\tilde D\tilde Y,$$
assuming $\mathrm{E}\tilde D\tilde D'$ is of full rank. The sample projection coefficient $\hat\beta_1$ can be recovered from the sample regression of $\check Y_i$ on $\check D_i$:
$$\hat\beta_1 = (\mathbb{E}_n\check D\check D')^{-1}\mathbb{E}_n\check D\check Y,$$
assuming $\mathbb{E}_n\check D\check D'$ is of full rank.
4. Approximate Distributions for β̂1
It is of interest to examine the behavior of the estimator β̂1 . In what follows, we can
assume that dimension p1 of the target parameter β1 is fixed, but the dimension p2 of the
nuisance parameter β2 may grow with n but slowly enough so that p2 /n → 0. In practical
terms, the latter condition simply means that p2 is small compared to n.
Lemma 1 (Adaptivity Property for Partialling Out). Consider the sample projection coefficient $\hat\beta_1$ obtained from the sample regression of $\check Y_i$ on $\check D_i$:
$$\hat\beta_1 = (\mathbb{E}_n\check D\check D')^{-1}\mathbb{E}_n\check D\check Y,$$
and the sample projection coefficient $\tilde\beta_1$ obtained from the sample regression of the infeasible $\tilde Y_i$ on $\tilde D_i$:
$$\tilde\beta_1 = (\mathbb{E}_n\tilde D\tilde D')^{-1}\mathbb{E}_n\tilde D\tilde Y.$$
There exist regularity conditions such that, provided that the dimension $p_2$ is small compared to $n$, namely
$$p_2/n \to 0,$$
we have the following asymptotic equivalence result:
$$\sqrt{n}(\hat\beta_1 - \tilde\beta_1) \to_P 0.$$
That is, the estimation errors from the partialling-out steps are approximately negligible and do not affect the estimator to first order.
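We have that
$$\begin{aligned}
\tilde\beta_1 - \beta_1 &= (\mathbb{E}_n\tilde D\tilde D')^{-1}\mathbb{E}_n\tilde D\tilde Y - \beta_1 && (4.1)\\
&= (\mathbb{E}_n\tilde D\tilde D')^{-1}\mathbb{E}_n\tilde D(\tilde D'\beta_1 + \varepsilon) - \beta_1 && (4.2)\\
&= (\mathbb{E}_n\tilde D\tilde D')^{-1}\mathbb{E}_n\tilde D\varepsilon. && (4.3)
\end{aligned}$$
Then we conclude that under mild regularity conditions
$$\sqrt{n}(\tilde\beta_1 - \beta_1) \overset{a}{\sim} N(0, V_{11}),$$
where $\overset{a}{\sim}$ reads as “approximately distributed” and
$$V_{11} = (\mathrm{E}\tilde D\tilde D')^{-1}\,\mathrm{Var}(\sqrt{n}\,\mathbb{E}_n\tilde D\varepsilon)\,(\mathrm{E}\tilde D\tilde D')^{-1}.$$
Given the equivalence stated in Lemma 1 above, we further conclude that
$$\sqrt{n}(\hat\beta_1 - \beta_1) \overset{a}{\sim} N(0, V_{11}).$$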
Theorem 2. There exist regularity conditions such that, provided that $p_2/n \to 0$, we have that
$$\sqrt{n}(\hat\beta_1 - \beta_1) \overset{a}{\sim} N(0, V_{11})$$
as $n \to \infty$, namely that
$$\sup_{A \in \mathcal{A}} \big| P(\sqrt{n}(\hat\beta_1 - \beta_1) \in A) - P(N(0, V_{11}) \in A) \big| \to 0,$$
where $\mathcal{A}$ is a collection of sets in $\mathbb{R}^{p_1}$ (e.g. convex sets or rectangles).
The proof of this result is simple under fixed p2 and is rather technical when p2 → ∞, so
we won’t pursue it here, but conceptually it is a more technical version of the result under
fixed p asymptotics that you have seen in the introductory regression course.
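The adaptivity property of Lemma 1 can be illustrated by simulation. The following Python/NumPy sketch assumes a hypothetical design in which the population residuals are known by construction (W, V, and ε are independent with mean zero), so the infeasible estimator can actually be computed. It is a numerical illustration under those assumptions, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p2 = 5000, 5
beta1 = 1.0

W = rng.normal(size=(n, p2))
V = rng.normal(size=n)                 # D_tilde: the part of D orthogonal to W
eps = rng.normal(size=n)
D = W @ np.full(p2, 0.5) + V
Y = beta1 * D + W @ np.full(p2, 0.3) + eps

def residualize(v, W):
    """Sample partialling-out of W from v by least squares."""
    return v - W @ np.linalg.lstsq(W, v, rcond=None)[0]

# Feasible estimator: partial out W in the sample.
D_check, Y_check = residualize(D, W), residualize(Y, W)
beta_hat = (D_check @ Y_check) / (D_check @ D_check)

# Infeasible estimator: uses the population residuals D_tilde = V and
# Y_tilde = beta1 * V + eps, known here only because the data are simulated.
Y_tilde = beta1 * V + eps
beta_tilde = (V @ Y_tilde) / (V @ V)

# sqrt(n) * (beta_hat - beta_tilde) should be small relative to the
# O(1) sampling variation of sqrt(n) * (beta_hat - beta1).
gap = np.sqrt(n) * (beta_hat - beta_tilde)
```

Increasing `p2` towards `n` in this sketch makes `gap` grow, which is consistent with the requirement that $p_2/n \to 0$.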
Remark 1. Alternatively, the result above could also be derived or conjectured from the statement that the whole parameter vector is approximately normally distributed as follows:
$$\sqrt{n}(\hat\beta - \beta) \overset{a}{\sim} N(0, V). \qquad (4.4)$$
Here
$$V = Q^{-1}\Omega Q^{-1}, \qquad Q = \mathrm{E}XX', \qquad \Omega = \mathrm{Var}(\sqrt{n}\,\mathbb{E}_n X\varepsilon).$$
Then $V_{11}$ corresponds to the $p_1 \times p_1$ upper-left block of $V$. This result is straightforward when $p$ is fixed as $n \to \infty$. On the other hand, when $p$ is increasing with $n$, proving that the whole $p$-dimensional parameter vector $\sqrt{n}(\hat\beta - \beta)$ is normally distributed is usually much more demanding, in terms of regularity conditions and the sense in which normal approximations hold.
We shall rely on a suitable estimator $\hat V_{11}$ of $V_{11}$, for example, the White estimator under independent sampling or the Newey-West estimator for the time series case. We shall then use the normal law $N(0, \hat V_{11}/n)$ to quantify uncertainty about $\beta_1$, that is, to build confidence bands for $\beta_1$ and various functionals of $\beta_1$. With $\hat V_{11}$ used instead of $V_{11}$, the statement $\sqrt{n}(\hat\beta_1 - \beta_1) \overset{a}{\sim} N(0, \hat V_{11})$ is defined to mean the following:
$$\sup_{A \in \mathcal{A}} \Big| P(\sqrt{n}(\hat\beta_1 - \beta_1) \in A) - P(N(0, \bar V) \in A)\big|_{\bar V = \hat V_{11}} \Big| \to_P 0.$$
Basically, we just insert $\hat V_{11}$ wherever $V_{11}$ previously appeared and we require the same statements to hold in probability.
Lemma 2 (Using Estimated Variance is Ok). Suppose that $\sqrt{n}(\hat\beta_1 - \beta_1) \overset{a}{\sim} N(0, V_{11})$ and $\hat V_{11}$ is consistent for $V_{11}$, namely $\hat V_{11} V_{11}^{-1} \to_P I$, and $V_{11}$ is bounded away from zero. Then $\sqrt{n}(\hat\beta_1 - \beta_1) \overset{a}{\sim} N(0, \hat V_{11})$.
This lemma is a consequence of the Gaussian vector $N(0, V_{11})$ having bounded density, so that estimation errors in $\hat V_{11}$ have a negligible effect on the probabilities of the containment events.
Suppose $\beta_1$ is scalar or we are interested in its $j$-th component $\beta_{1j}$. The above results mean that we can report $\sqrt{\hat V_{11,jj}/n}$ as the (estimated) standard error for $\hat\beta_{1j}$, and report
$$[\ell_j, u_j] = \Big[\hat\beta_{1j} - z\sqrt{\hat V_{11,jj}/n},\ \hat\beta_{1j} + z\sqrt{\hat V_{11,jj}/n}\Big],$$
where $z$ is the $(1 - \alpha/2)$-quantile of the standard normal variable $N(0, 1)$, as the approximate $(1 - \alpha) \times 100\%$ confidence interval for $\beta_{1j}$. That this is a confidence interval follows from a more general result we discuss below.
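The standard error and pointwise interval above can be sketched as follows in Python with NumPy. The data are simulated and hypothetical, and the variance estimate is a simple heteroskedasticity-robust plug-in of White (HC0) type for a scalar target coefficient; this is an assumption of the sketch and not necessarily the exact variant used for the tables in these notes.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
n = 2000
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
D = rng.normal(size=n) + 0.5 * W[:, 1]
# Heteroskedastic errors: the noise scale depends on W.
eps = rng.normal(size=n) * (1.0 + 0.5 * np.abs(W[:, 1]))
Y = 0.3 * D + W @ np.array([1.0, 0.2, -0.1, 0.4]) + eps

def residualize(v, W):
    """Sample partialling-out of W from v by least squares."""
    return v - W @ np.linalg.lstsq(W, v, rcond=None)[0]

D_check, Y_check = residualize(D, W), residualize(Y, W)
beta1_hat = (D_check @ Y_check) / (D_check @ D_check)
eps_hat = Y_check - beta1_hat * D_check

# Plug-in estimate of V11 = (E D~^2)^{-1} E[D~^2 eps^2] (E D~^2)^{-1}.
Q = np.mean(D_check**2)
Omega = np.mean(D_check**2 * eps_hat**2)
V11_hat = Omega / Q**2

alpha = 0.10
z = NormalDist().inv_cdf(1 - alpha / 2)   # (1 - alpha/2)-quantile of N(0, 1)
se = np.sqrt(V11_hat / n)
ci = (beta1_hat - z * se, beta1_hat + z * se)
```

Under independent sampling, $\mathrm{Var}(\sqrt{n}\,\mathbb{E}_n\tilde D\varepsilon)$ reduces to $\mathrm{E}\tilde D^2\varepsilon^2$ in the scalar case, which is what `Omega` estimates.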
5. Gender Wage Gap in 2015
We consider an empirical application to the gender wage gap using data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015. We select white non-Hispanic individuals, aged 25 to 64 years, who work more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below $3.¹ The resulting sample consists of 32,523 workers, including 18,137 men and 14,386 women. The variable of interest Y is the logarithm of the hourly wage rate, constructed as the ratio of annual earnings to the total number of hours worked, which is constructed in turn as the product of the number of weeks worked and the usual number of hours worked per week. Table 1 reports descriptive statistics for the variables used in the analysis. Working women are less likely to be married and more highly educated than working men, but have slightly less experience. The unconditional average gender wage gap is 24%.

¹ The sample selection criteria are similar to those in [5].
Table 1. Descriptive Statistics

                   All     Men   Women
log wage          3.16    3.26    3.02
female            0.44    0.00    1.00
married           0.70    0.73    0.65
widowed           0.01    0.01    0.02
separated         0.02    0.01    0.02
divorced          0.12    0.09    0.15
never married     0.16    0.16    0.16
lhs               0.02    0.03    0.01
hsg               0.25    0.28    0.21
sc                0.28    0.27    0.30
cg                0.28    0.27    0.29
ad                0.17    0.15    0.19
ne                0.19    0.19    0.20
mw                0.26    0.26    0.26
so                0.33    0.33    0.34
we                0.22    0.23    0.21
experience       21.21   21.35   21.03

Source: March Supplement CPS 2015
To estimate the gender wage gap, we consider the linear regression model:
Y = Dβ1 + W'β2 + ε,   Eε(D, W')' = 0,
where Y is the log hourly wage rate, D is an indicator for female worker, and W is a set of p = 1,082 controls including 5 marital status indicators (widowed, divorced, separated, never married, and married); 5 educational attainment indicators (less than high school graduates, high school graduates, some college, college graduate, and advanced degree); 4 region indicators (midwest, south, west, and northeast); a quartic in potential experience constructed as the maximum of age minus years of schooling minus 7 and zero, i.e., experience = max(age − education − 7, 0); 22 occupation indicators;² 21 industry indicators;³ and all the two-way interactions between the previous variables.
Table 2 reports the results of a regression analysis using the CPS data. The first row reports the coefficient of D from the OLS regression of Y on D; the second row reports the coefficient of D from the OLS regression of Y on X = (D, W); the third row obtains the same estimate using the Frisch-Waugh-Lovell theorem for partialling-out the controls via OLS; and the fourth row reports the coefficient of D using a variant of the procedure in [1] that partials out the controls via Lasso instead of OLS.⁴ We will study this procedure later in the course. All the standard errors are computed with the R package sandwich and are robust to heteroskedasticity. Using Lasso for partialling-out here gives similar results to using OLS. Lasso is a penalized OLS estimator and it produces high-quality estimates of the regression function, especially in high-dimensional settings. The penalty takes the form of the sum of the absolute values of the coefficients times the penalty level.
Table 2. Regression Analysis of the Wage Gap

                          Estimate   Std. Error†
no controls                 -0.239       0.0067
all controls                -0.185       0.0069
partial reg                 -0.185       0.0069
partial reg with lasso      -0.195       0.0068

† Standard errors are robust to heteroskedasticity
² The occupation categories are: management; business and financial operations; computer and mathematics; architecture and engineering; life, physical, and social science; community and social service; legal; education, training, and library; arts, design, entertainment, sports, and media; healthcare practitioners and technical; healthcare support; protective service; food preparation and serving; building and grounds cleaning and maintenance; personal care and service; sales; office and administrative support; farming, fishing, and forestry; construction and extraction; installation, maintenance, and repair occupations; production; and transportation and material moving.

³ The industry categories are: mining; utilities; construction; nondurable goods manufacturing; durable goods manufacturing; durable goods wholesale; nondurable goods wholesale; retail trade; transportation and warehousing; information; finance and insurance; real estate, rental and leasing; professional, scientific, and technical services; management of companies and enterprises; administrative, support and waste management services; educational services; health care and social assistance; arts, entertainment, and recreation; accommodation and food services; other services except public administration; and public administration.

⁴ We use the R package hdm to obtain the estimates in the fourth row.

What do the estimated regression coefficients β1 measure here? The first row measures the unconditional gender gap, i.e. the difference in the average wage of working women and men. The rest measure how our linear prediction of wage changes if we set the gender variable D from 0 to 1, holding the controls W fixed. We can call this the predictive effect (PE), as it measures the impact of a variable on the prediction we make. The PE should not in general be interpreted as a causal or treatment effect (TE), since correlation is not equivalent to causation. A causal interpretation of the PE here could suggest that β1 is solely a measure of discrimination, while in reality it may reflect discrimination, selection effects (e.g., sorting of women and men into different occupations), sample imbalances, etc. In this case the unconditional wage gap for women of 24% decreases to around 19-20% after controlling for worker characteristics.
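The statement that the regression without controls measures the difference in average (log) wages can be verified directly. The Python/NumPy sketch below uses synthetic data whose parameters are loosely modeled on Table 1; it does not use the CPS microdata, and it assumes the regression of Y on D includes an intercept.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
# Synthetic sample: about 44% women, log wages with a gap of 0.24.
female = (rng.uniform(size=n) < 0.44).astype(float)
log_wage = 3.26 - 0.24 * female + rng.normal(scale=0.5, size=n)

# OLS of log wage on an intercept and the female indicator.
X = np.column_stack([np.ones(n), female])
intercept, slope = np.linalg.lstsq(X, log_wage, rcond=None)[0]

# Difference between the average log wage of women and of men.
diff_in_means = log_wage[female == 1].mean() - log_wage[female == 0].mean()
```

The slope on the indicator equals `diff_in_means` up to rounding error, and the intercept equals the average log wage of men.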
We repeat the analysis for the more homogeneous subpopulation of never married workers. Table 3 reports descriptive statistics for the corresponding subsample from the CPS 2015 data. There are 5,150 never married workers: 2,861 men and 2,289 women. Never married working women are also relatively more educated than working men. Compared to Table 1, never married workers earn lower average wages and have much less experience than the rest of the workers. The regression analysis in Table 4 shows that the unconditional gender wage gap is less than 4% for this group. This gap increases to 6-7% once we control for worker characteristics.⁵ A possible explanation of the lower wage gap for never married working women could be related to fertility and childcare decisions. Never married women are younger and less likely to have children. They can therefore be more career oriented and have working experience not interrupted by childbearing or childcare.
Table 3. Descriptive Statistics: Never Married Workers

                   All    Male  Female
log wage          2.97    2.99    2.95
female            0.44    0.00    1.00
lhs               0.02    0.03    0.01
hsg               0.24    0.29    0.18
sc                0.28    0.27    0.28
cg                0.32    0.29    0.35
ad                0.14    0.11    0.18
ne                0.23    0.22    0.24
mw                0.26    0.26    0.26
so                0.30    0.30    0.29
we                0.22    0.22    0.21
experience       13.76   13.78   13.73

Source: March Supplement CPS 2015

⁵ Without the marital status indicators, there are p = 775 controls.

Table 4. Regression Analysis of the Wage Gap: Never Married Workers

                          Estimate   Std. Error†
No controls                 -0.038        0.016
All controls                -0.061        0.015
partial reg                 -0.061        0.015
partial reg via lasso       -0.070        0.015

† Standard errors are robust to heteroskedasticity
6. Joint Confidence Bands for β1
Consider a $p_1$-dimensional subvector $\beta_1$ of the coefficient vector $\beta$. Assume, without loss of generality, that these are the first $p_1$ components. Assume that
$$\hat\beta_1 - \beta_1 \overset{a}{\sim} N(0, V_{11}/n),$$
where $V_{11}$ is the upper-left $p_1 \times p_1$ sub-block of $V$, in the sense that
$$\sup_{A \in \mathcal{A}} \big| P(\sqrt{n}(\hat\beta_1 - \beta_1) \in A) - P(N(0, V_{11}) \in A) \big| \to 0, \quad n \to \infty, \qquad (6.1)$$
where $\mathcal{A}$ is a collection of rectangles in $\mathbb{R}^{p_1}$.
Suppose we want to build simultaneous confidence bands for all the components $(\beta_{1j})_{j=1}^{p_1}$ of $\beta_1$. To give a context, suppose that $D$ represents a collection of indicator (“dummy”) variables, capturing different types of treatment. For instance, in the Pennsylvania treatment experiments example below, the components of $D$ describe various kinds of incentives that participants received to find a job quicker. We want to create a confidence set
$$[\ell, u] = ([\ell_j, u_j])_{j=1}^{p_1}$$
such that
$$P(\beta_1 \in [\ell, u]) = P(\beta_{1j} \in [\ell_j, u_j] \text{ for all } j) \to 1 - \alpha.$$

Using such confidence bands allows us to answer a number of very interesting questions
about both economic and statistical significance of the components of β1 . For example, we
can create a confidence set for a set of treatments that result in more than some target level
of impact.
We consider confidence bands in the form of the rectangle:
$$[\ell_j, u_j] = \Big[\hat\beta_{1j} - c\sqrt{V_{11,jj}/n},\ \hat\beta_{1j} + c\sqrt{V_{11,jj}/n}\Big], \quad j = 1, \ldots, p_1,$$
where we set the critical value $c$ such that the previous display holds. Here we use $V_{11,jj}$ to denote the $(j, j)$-th element of the matrix $V_{11}$.
The value of $c$ can be determined as the $(1 - \alpha)$-quantile of
$$\|N(0, C)\|_\infty,$$
where $C$ is the correlation matrix associated with $V_{11}$, that is,
$$C = S^{-1/2} V_{11} S^{-1/2},$$
where $S = \mathrm{diag}(V_{11})$ is a diagonal matrix with the diagonal of $V_{11}$ on its diagonal and zeroes elsewhere. The constant $c$ can be approximated by simulation.
This constant $c$ is the right one by the following argument:
$$\begin{aligned}
P(\beta_1 \in [\ell, u]) &= P(\sqrt{n}(\hat\beta_1 - \beta_1) \in S^{1/2}[-c, c]^{p_1})\\
&= P(N(0, V_{11}) \in S^{1/2}[-c, c]^{p_1}) + o(1)\\
&= P(S^{-1/2} N(0, V_{11}) \in [-c, c]^{p_1}) + o(1)\\
&= P(\|N(0, S^{-1/2} V_{11} S^{-1/2})\|_\infty \le c) + o(1)\\
&= 1 - \alpha + o(1),
\end{aligned}$$
where the second equality holds by (6.1), because $S^{1/2}[-c, c]^{p_1}$ is a rectangular set in $\mathbb{R}^{p_1}$.
Note that in practice we shall need to replace V11 with a consistent estimator V̂11 . This
replacement does not affect the approximate coverage property of the confidence regions
in view of Lemma 2.
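The simulation of the critical value $c$ described above might look as follows in Python with NumPy. The function name `simultaneous_critical_value`, its defaults, and the example covariance matrix are hypothetical choices for this sketch; in practice $V_{11}$ would be replaced by an estimate $\hat V_{11}$.

```python
import numpy as np

def simultaneous_critical_value(V11, alpha=0.10, reps=100_000, seed=0):
    """Simulated (1 - alpha)-quantile of ||N(0, C)||_inf,
    where C is the correlation matrix associated with V11."""
    V11 = np.asarray(V11, dtype=float)
    s = np.sqrt(np.diag(V11))
    C = V11 / np.outer(s, s)                  # C = S^{-1/2} V11 S^{-1/2}
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(np.zeros(len(s)), C, size=reps)
    max_abs = np.max(np.abs(draws), axis=1)   # ||N(0, C)||_inf for each draw
    return np.quantile(max_abs, 1 - alpha)

# Hypothetical 5 x 5 covariance matrix with equicorrelated components.
p1 = 5
V11 = 2.0 * (0.3 * np.ones((p1, p1)) + 0.7 * np.eye(p1))
c = simultaneous_critical_value(V11)

# The band for component j is then
#   beta_hat_1j -/+ c * sqrt(V11[j, j] / n).
```

Because only the correlation matrix enters, rescaling $V_{11}$ leaves `c` unchanged. For $p_1 = 1$ the simulated value approaches the usual two-sided normal critical value.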
We summarize the discussion as follows.
Theorem 3 (Joint Confidence Band For Target Coefficients). Suppose that $\hat\beta_1 - \beta_1 \overset{a}{\sim} N(0, V_{11}/n)$ in the sense of (6.1). We have that the confidence band
$$[\ell_j, u_j] = \Big[\hat\beta_{1j} - c\sqrt{V_{11,jj}/n},\ \hat\beta_{1j} + c\sqrt{V_{11,jj}/n}\Big], \quad j = 1, \ldots, p_1,$$
with $c$ equal to the $(1 - \alpha)$-quantile of $\|N(0, C)\|_\infty$, where $C$ is the correlation matrix associated with $V_{11}$, jointly covers all target parameter values $(\beta_{1j})_{j=1}^{p_1}$ with probability approaching the nominal level, that is, as $n \to \infty$,
$$P(\beta_{1j} \in [\ell_j, u_j] \text{ for all } j) \to 1 - \alpha.$$
The results continue to hold if $V_{11}$ is replaced by $\hat V_{11}$ such that $\hat V_{11} V_{11}^{-1} \to_P I$, and $V_{11}$ is bounded away from zero.
7. The Pennsylvania re-employment bonus experiment
Here we re-analyze the Pennsylvania re-employment bonus experiment, which was previously studied in [2], among others. Note that the inferential results on simultaneous bands we report below will be new. These experiments were conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI). In these experiments, UI claimants were randomly assigned either to a control group or to one of five treatment groups.⁶ In the control group the current rules of the UI applied. Individuals in the treatment groups were offered a cash bonus if they found a job within some pre-specified period of time (qualification period), provided that the job was retained for a specified duration. The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus was declining over time in the qualification period; see [2] for further details on the data.
To evaluate the impact of the treatments on unemployment duration, we consider the linear regression model:
Y = D'β1 + W'β2 + ε,   Eε(D', W')' = 0,
where Y is the log of duration of unemployment, D is a vector of 5 treatment indicators, and W is a set of p = 16 controls including age group dummies, gender, race, number of dependents, quarter of the experiment, location within the state, existence of recall expectations, and type of occupation.
The assignment of units to treatment D is random. We commonly refer to such a case as a randomized controlled trial (RCT). Under an RCT, the projection coefficient β1 has the interpretation of the causal effect of the treatment on the average outcome. We thus refer to β1 as the average treatment effect (ATE). Note that the covariates W here are independent of the treatment D, so we can identify β1 by just regressing Y on D, without adding covariates. However, we do add covariates in an effort to improve the precision of our estimates of the average treatment effect.
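The precision gain from adding covariates under random assignment can be illustrated with a short simulation in Python with NumPy. The design below is hypothetical (it is not the Pennsylvania data), and the helper `coef_and_robust_se` implements a plain HC0 sandwich formula $Q^{-1}\Omega Q^{-1}$ as an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000
W = rng.normal(size=(n, 3))
D = (rng.uniform(size=n) < 0.5).astype(float)   # randomized, independent of W
Y = -0.1 * D + W @ np.array([1.0, -0.5, 0.5]) + rng.normal(size=n)

def coef_and_robust_se(X, y, j=1):
    """OLS coefficient j and its heteroskedasticity-robust (HC0) standard error."""
    m = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    Q_inv = np.linalg.inv(X.T @ X / m)
    Omega = (X * e[:, None]**2).T @ X / m        # E_n[eps_hat^2 X X']
    V = Q_inv @ Omega @ Q_inv
    return beta[j], np.sqrt(V[j, j] / m)

ones = np.ones(n)
# Short regression: Y on an intercept and D only.
b_short, se_short = coef_and_robust_se(np.column_stack([ones, D]), Y)
# Long regression: Y on an intercept, D, and the covariates W.
b_long, se_long = coef_and_robust_se(np.column_stack([ones, D, W]), Y)
```

Both regressions target the same coefficient on D, but in this design the long regression has the smaller standard error because W absorbs part of the outcome variation.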
Figure 1 shows 90% confidence intervals for the five treatment effects β1, constructed using a sample of 13,913 observations.
• The critical value for the simultaneous bands, c = 2.27, is greater than the pointwise critical value, 1.65.
• It is less than the critical value from the Bonferroni correction, 2.33, obtained as the $(1 - \bar\alpha/2)$-quantile of the standard normal distribution with $\bar\alpha = \alpha/5$. The idea of the Bonferroni correction is to use the union bound $P(\cup_{j=1}^{p_1} \text{event}_j) \le \sum_{j=1}^{p_1} P(\text{event}_j)$ to bound the probability of the noncoverage events, i.e. $\text{event}_j = \{\beta_{1j} \notin [\ell_j, u_j]\}$.
• In this case, of the three treatments with a statistically significant effect on unemployment duration based on pointwise confidence intervals, only one remains significant after accounting for simultaneous inference.

⁶ There are six treatment groups in the experiments. Following [2] we merge the groups 4 and 6.
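The pointwise and Bonferroni critical values quoted above can be reproduced with the Python standard library alone. This sketch assumes α = 0.10 and five treatment effects, as in the text; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.10
p1 = 5   # number of treatment effects

# Pointwise: (1 - alpha/2)-quantile of the standard normal.
pointwise = NormalDist().inv_cdf(1 - alpha / 2)

# Bonferroni: same quantile with alpha replaced by alpha_bar = alpha / p1.
alpha_bar = alpha / p1
bonferroni = NormalDist().inv_cdf(1 - alpha_bar / 2)
```

These evaluate to roughly 1.645 and 2.326, in line with the values 1.65 and 2.33 reported in the text. The simulated simultaneous critical value lies between them because it exploits the correlation structure instead of relying on the union bound.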
The last observation illustrates how econometrics and this class offer better concepts and tools than standard empirical practice often uses. It also explains why econometricians sometimes teach what “people never use in practice”: they simply teach correct things to use, and it is up to you to decide whether you want to do wrong or correct things in practice.
Next we consider a more flexible version of the basic model, where we take the controls to include the original set of controls as well as all two-way interactions, giving us a total of p = 120 controls. We repeat the exercise above and reach roughly similar conclusions. Figure 2 shows 90% confidence intervals for the five treatment effects β1. We see that the addition of many more controls does not change the inferential results noticeably. This highlights the robustness of the conclusions with respect to enriching the set of controls, and is also in line with our asymptotic theory, which states that the inference is not impacted in the regime where the number of controls p is much smaller than n, despite the fact that the number of controls is substantial here.
Notes
Least squares were invented by Legendre around 1800, although Gauss claimed the credit. Frisch, Waugh, and Lovell discovered the partialling-out interpretation of the least squares coefficients in the 1930s. The adaptivity results of Lemma 1 went unnoticed by empiricists, and also managed to escape statistics and econometrics textbooks; we note this property here though. Regularity conditions under which Lemma 1 and Theorem 2 hold under fixed p asymptotics can be found in the introductory econometrics texts, for example, [6], and under p → ∞ and p/n → 0 asymptotics in [4] and [3]. The results of the latter reference allow for p/n → c, which introduces an additional asymptotic variance term, and the case with c = 0 recovers Theorem 2.
[Figure 1. 90% Confidence Intervals for Treatment Effects on Unemployment Duration. Number of controls is 16. Critical value for simultaneous confidence interval obtained by simulation with 100,000 repetitions. Plot title: Treatment Effects on Unemployment Duration. Horizontal axis: Treatment (T1 to T5). Vertical axis: Average Effect (log of weeks). Legend: Regression Estimate; 90% Simultaneous Confidence Interval; 90% Pointwise Confidence Interval.]

[Figure 2. 90% Confidence Intervals for Treatment Effects on Unemployment Duration. Number of controls is 120. Critical value for simultaneous confidence interval obtained by simulation with 100,000 repetitions. Plot title: Treatment Effects on Unemployment Duration. Horizontal axis: Treatment (T1 to T5). Vertical axis: Average Effect (log of weeks). Legend: Regression Estimate; 90% Simultaneous Confidence Interval; 90% Pointwise Confidence Interval.]

Problems
(1) Briefly explain partialling-out and the adaptivity property for the linear regression model, and use the gender wage gap data to illustrate your points. Present your discussion as a brief section of a professionally done empirical paper.
(2) Briefly explain the idea of joint confidence bands, and use the Penn data to replicate the second set of results of our re-analysis of the Pennsylvania re-employment experiment. Present your results as a brief section of a professionally done empirical paper.
(3) In the wage gap example and the reemployment experiment, discuss whether the empirical results have a causal or treatment effect interpretation. Does the estimated wage gap measure discrimination? Perhaps in part? Do the reductions in unemployment duration have a causal meaning? Present your discussion as a brief section of a professionally done empirical paper.
(4) Explain why in randomized controlled trials, where the assigned treatment D is independent of the controls W, we can estimate the linear predictive effect of D on Y controlling linearly for W without actually controlling for W. However, including W may still be a good idea, because using W can lower (and does not increase) the asymptotic variance of the least squares estimator.
(5) Prove that the population partialling-out operator is linear on the space of random variables with finite second moments, i.e., for $V$ and $U$ such that $\mathrm{E}U^2 + \mathrm{E}V^2 < \infty$,
$$Y = V + U \implies \tilde Y = \tilde V + \tilde U.$$
(6) Provide a set of sufficient regularity conditions for Lemma 1 and Theorem 1 and prove them. Extra credit is given for handling the case where p2 → ∞ as n → ∞, but don’t spend too much time on this, as this is difficult.
References
[1] Belloni, A., V. Chernozhukov, and C. Hansen (2014): “Inference on Treatment Effects After Selection Amongst High-Dimensional Controls,” Review of Economic Studies, 81, 608–650.
[2] Bilias, Y. (2000): “Sequential testing of duration data: the case of the Pennsylvania Reemployment bonus experiment,” Journal of Applied Econometrics, 15(6), 575–594.
[3] Cattaneo, M. D., M. Jansson, and W. K. Newey (2015): “Inference in Linear Regression Models with Many Covariates and Heteroskedasticity,” ArXiv e-prints.
[4] Chernozhukov, V., D. Chetverikov, and K. Kato (2015): “Some New Asymptotic Theory for Least Squares Series: Pointwise and Uniform Results,” Journal of Econometrics, 186, 345–366.
[5] Mulligan, C. B., and Y. Rubinstein (2008): “Selection, Investment, and Women’s Relative Wages Over Time,” The Quarterly Journal of Economics, 123(3), 1061–1110.
[6] White, H. (2014): Asymptotic theory for econometricians. Academic press.
MIT OpenCourseWare
https://ocw.mit.edu
14.382 Econometrics
Spring 2017
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
