Best Linear Predictor
1 Prediction
(Source: Amemiya, ch. 4)
Best Linear Predictor: a motivation for linear univariate regression
Consider two random variables X and Y . What is the “best” predictor of Y , among all the
possible linear functions of X?
“Best” linear predictor minimizes the mean squared error of prediction:
min_{α,β} E[Y − α − βX]²   (1)
Solving:
β∗ = Cov(X, Y)/VX   (2)
α∗ = EY − β∗EX
Define Ŷ ≡ α∗ + β∗X (the b.l.p.) and the prediction error U ≡ Y − Ŷ. Then:
• EU = 0 (immediate from α∗ = EY − β∗EX)
• V Ŷ = (β∗)²VX = Cov²(X, Y)/VX = ρ²XY · VY
Hence, the b.l.p. accounts for a ρ²XY proportion of the variance in Y; in this sense, the
correlation measures the linear relationship between Y and X.
Moreover, the prediction error is uncorrelated with the b.l.p.:
Cov(Ŷ, U) = Cov(Ŷ, Y − Ŷ)
= E[(Ŷ − EŶ)(Y − Ŷ − EY + EŶ)]
= E[(Ŷ − EŶ)(Y − EY)] − E[(Ŷ − EŶ)(Ŷ − EŶ)]
= Cov(Ŷ, Y) − V Ŷ
= E[(α∗ + β∗X − α∗ − β∗EX)(Y − EY)] − V Ŷ   (3)
= β∗E[(X − EX)(Y − EY)] − V Ŷ
= β∗Cov(X, Y) − V Ŷ
= Cov²(X, Y)/VX − Cov²(X, Y)/VX
= 0.
Similarly, Cov(X, U ) = 0.
Hence, for any random variable X, the random variable Y can be written as the sum of
a part which is a linear function of X, and a part which is uncorrelated with X. This
decomposition of Y is done when you regress Y on X.
Finally, note that (obviously) the BLP of the BLP – that is, the best linear predictor given
X of the BLP of Y given X – is just the BLP itself. There is no gain, in predicting Y , by
iterating the procedure.
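As a quick numerical check (not part of the original notes), the following sketch simulates a joint distribution for (X, Y) — the particular data-generating process and variable names are illustrative assumptions — and verifies that the prediction error U has mean zero, is uncorrelated with X and Ŷ, and that Ŷ accounts for a ρ²XY share of VY:

```python
# Sketch: verify the b.l.p. decomposition Y = Yhat + U numerically.
# The DGP below is an illustrative assumption; any joint distribution works.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.normal(2.0, 1.5, size=n)
Y = 1.0 + 0.5 * X + rng.normal(0.0, 1.0, size=n)

beta_star = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)   # Cov(X,Y)/VX, eq. (2)
alpha_star = Y.mean() - beta_star * X.mean()         # EY - beta* EX
Yhat = alpha_star + beta_star * X                    # best linear predictor
U = Y - Yhat                                         # prediction error

rho2 = np.corrcoef(X, Y)[0, 1] ** 2
print(U.mean())                         # ~ 0  (EU = 0)
print(np.cov(X, U, ddof=0)[0, 1])       # ~ 0  (Cov(X, U) = 0)
print(np.cov(Yhat, U, ddof=0)[0, 1])    # ~ 0  (Cov(Yhat, U) = 0)
print(np.var(Yhat) / np.var(Y), rho2)   # both ~ rho^2: share of VY explained
```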
■■■
Note: in practice, with a finite sample of Y, X, the minimization problem (1) is infeasible.
In practice, we minimize the sample counterpart
min_{α,β} Σi (Yi − α − βXi)²   (4)
which is the objective function in ordinary least squares regression. The OLS values for α
and β are the sample versions of Eq. (2).
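A minimal sketch of this equivalence (the data-generating process and names are made up for illustration): the least-squares solution of (4) coincides with the sample analogues of Eq. (2).

```python
# Sketch: OLS coefficients equal the sample versions of eq. (2).
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(0, 10, size=n)
Y = 2.0 - 0.7 * X + rng.normal(0, 2, size=n)   # illustrative DGP

# Sample analogues of eq. (2)
beta_hat = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
alpha_hat = Y.mean() - beta_hat * X.mean()

# Direct least-squares solution of (4)
A = np.column_stack([np.ones(n), X])
(alpha_ls, beta_ls), *_ = np.linalg.lstsq(A, Y, rcond=None)

print(alpha_hat, beta_hat)   # identical (up to rounding) to:
print(alpha_ls, beta_ls)
```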
■■■
Next we consider some intuition for least-squares regression. Go back to the population
problem. Assume that the “true” model describing the generation of the Y process is:
Y = α0 + β0X + ϵ,   Eϵ = 0.   (5)
What we mean by “true model” is that this is a causal model, in the sense that a one-unit
increase in X would raise Y by β0 units. (In the previous section, we just assumed that Y
and X move jointly, so there was no sense in which changes in X “cause” changes in Y.)
Question: under what assumptions does doing least-squares on Y, X (which leads to the
best linear predictor from the previous section) recover the true model, i.e. α∗ = α0 and
β∗ = β0?
• For α∗:
α∗ = EY − β∗EX
= α0 + β0EX + Eϵ − β∗EX
= α0 + (β0 − β∗)EX,
which is equal to α0 if β0 = β∗.
• For β∗:
β∗ = Cov(α0 + β0X + ϵ, X) / VarX
= (1/VarX) · {E[X(α0 + β0X + ϵ)] − EX · E[α0 + β0X + ϵ]}
= (1/VarX) · {α0EX + β0EX² + E[ϵX] − α0EX − β0(EX)² − EX·Eϵ}
= (1/VarX) · {β0[EX² − (EX)²] + E[ϵX]},
which is equal to β0 if
E[ϵX] = 0.
This is an “exogeneity” assumption: (roughly) that X and the disturbance term ϵ are
uncorrelated. Under this assumption, the best linear predictors from the infeasible problem
(1) coincide with the true values α0, β0. Correspondingly, it turns out that the feasible
finite-sample least-squares estimates from (4) are “good” (in some sense) estimators of α0,
β0.
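The following hedged simulation sketch (the DGP is an illustrative assumption, not from the notes) shows the point: when ϵ is independent of X the OLS slope is close to β0, while when X is correlated with ϵ the slope converges to β∗ ≠ β0.

```python
# Sketch: OLS recovers beta0 under E[eps * X] = 0, and is biased otherwise.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
alpha0, beta0 = 1.0, 2.0

def ols_slope(x, y):
    """Sample analogue of Cov(X, Y)/VX."""
    return np.cov(x, y, ddof=0)[0, 1] / np.var(x)

# Exogenous case: eps independent of X
X = rng.normal(size=n)
eps = rng.normal(size=n)
Y = alpha0 + beta0 * X + eps
print(ols_slope(X, Y))          # ~ 2.0

# Endogenous case: X correlated with eps, so E[eps * X] != 0
eps = rng.normal(size=n)
X = rng.normal(size=n) + 0.8 * eps
Y = alpha0 + beta0 * X + eps
print(ols_slope(X, Y))          # biased away from 2.0 (beta* != beta0)
```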
■■■
Best prediction
We now generalize the above results to general (not just linear) prediction. What if we
don’t restrict ourselves to linear functions of X? What general function of X is the optimal
predictor of Y? The problem is now
min_{ϕ(·)} E[Y − ϕ(X)]².
Note:
E[Y − ϕ(X)]² = E[(Y − E(Y|X)) + (E(Y|X) − ϕ(X))]²
= E[Y − E(Y|X)]² + 2E[(Y − E(Y|X))(E(Y|X) − ϕ(X))] + E[E(Y|X) − ϕ(X)]².   (6)
Define U ≡ Y − E(Y|X). By the law of iterated expectations,
EU = EX EY|X [Y − E(Y|X)] = EX [0] = 0,
E[U · E(Y|X)] = EX [E(Y|X) · EY|X U] = EX [E(Y|X) · 0] = 0,
and in fact E[U · g(X)] = 0 for any function g(X). Hence the middle (cross) term in (6) is
zero for every ϕ, and the objective is minimized by setting the last term to zero: the best
predictor of Y is the conditional expectation, ϕ(X) = E(Y|X).
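A small illustrative sketch (the quadratic DGP is an assumption chosen for clarity): when the relationship is nonlinear, the conditional expectation E(Y|X) achieves a strictly smaller mean squared error than the best linear predictor.

```python
# Sketch: best predictor E(Y|X) vs. best *linear* predictor for a nonlinear DGP.
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
X = rng.normal(size=n)
Y = X**2 + rng.normal(size=n)          # here E(Y|X) = X^2

# Best linear predictor
beta = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
alpha = Y.mean() - beta * X.mean()
mse_blp = np.mean((Y - (alpha + beta * X)) ** 2)

# Best predictor: the conditional expectation E(Y|X) = X^2
mse_bp = np.mean((Y - X**2) ** 2)

print(mse_blp)   # ~ 3.0 (equals VY: Cov(X, X^2) = 0 for standard normal X)
print(mse_bp)    # ~ 1.0 (the noise variance)
```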
Projection theorem: Let S be a linear space of random variables with finite second
moments. Then Ŝ is the projection of Y onto S if and only if Ŝ ∈ S and the orthogonality
condition
E[(Y − Ŝ)S] = 0 for all S ∈ S   (7)
is satisfied.
Proof: For any S ∈ S, write
E(Y − S)² = E[(Y − Ŝ) + (Ŝ − S)]² = E(Y − Ŝ)² + 2E[(Y − Ŝ)(Ŝ − S)] + E(Ŝ − S)².
If Ŝ satisfies the orthogonality condition, then the middle term is zero (note that Ŝ − S ∈ S),
and we conclude that E(Y − S)² ≥ E(Y − Ŝ)², with strict inequality unless E(Ŝ − S)² = 0.
Thus the orthogonality condition implies that Ŝ is a projection, and also that it is unique.
Conversely, if Ŝ is a projection, consider any S ∈ S and any α; since Ŝ + αS ∈ S,
E[Y − (Ŝ + αS)]² − E(Y − Ŝ)² = −2αE[(Y − Ŝ)S] + α²ES²
must be nonnegative for every α. But the parabola −2αE[(Y − Ŝ)S] + α²ES² is nonnegative
for all α iff E[(Y − Ŝ)S] = 0, i.e. iff the orthogonality condition is satisfied. ■
In the b.l.p. case: S is the space of all linear (affine) functions a + bX of X, and the
orthogonality condition (7) implies that both Cov(U, X) = 0 and Cov(U, Ŷ) = 0 (because
both X and Ŷ are in S), which we showed directly above. More generally, the projection
theorem implies that Cov(U, g(X)) = 0 for any linear function g(X) of X.
In the b.p. case: S is the space of all transformations of X, say g(X) with finite second
moments (i.e., Eg(X)2 < ∞).
■■■
Obviously, the projection of the projection is just the projection itself. Letting PS Y denote
the projection of Y on the space S, we have that PS [PS Y ] = PS Y . That is, projections are
idempotent operators; idempotency is even a defining feature of a projection. (You will use
this fact ubiquitously next quarter.)
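As a finite-sample illustration (a sketch; the design matrix below is arbitrary simulated data), the hat matrix H = X(X′X)⁻¹X′ is the sample analogue of the projection operator onto the column space of X, and it is idempotent:

```python
# Sketch: the hat (projection) matrix is idempotent, so P_S[P_S y] = P_S y.
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T    # projection (hat) matrix
y_hat = H @ y                           # projection of y onto col(X)

print(np.allclose(H @ H, H))            # True: H is idempotent
print(np.allclose(H @ y_hat, y_hat))    # True: projecting the projection changes nothing
```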
y = βd + ϵ,   (8)
y = β′ d(z, x) + ϵ,
where the notation d(z, x) makes explicit that the treatment d depends on both the instru-
ment z (the “exogenous” variation) and other factors x (which are correlated with ϵ, leading
d itself to be correlated with the disturbance).
In the special case when we have a binary auxiliary variable Z ∈ {0, 1}, we obtain the
following estimator:
{E[Y|Z = 1] − E[Y|Z = 0]} / {E[D|Z = 1] − E[D|Z = 0]}.
This is the classical Wald estimator. A number of the treatment effect estimators we consider
below take this form, for different choices of the auxiliary variable Z.
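A minimal sketch of the Wald estimator on simulated data (the DGP, with a constant treatment effect of 2, is an illustrative assumption, not from the notes):

```python
# Sketch: Wald estimator with a binary auxiliary variable Z.
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
Z = rng.integers(0, 2, size=n)                             # binary instrument
D = (rng.uniform(size=n) < 0.2 + 0.5 * Z).astype(float)    # Z shifts treatment probability
Y = 1.0 + 2.0 * D + rng.normal(size=n)                     # true effect of D is 2

wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())
print(wald)    # ~ 2.0
```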
3 Cross-sectional approaches
Here we consider the situation where each individual in the dataset is only observed once.
We also restrict attention to the binary treatment case. (Most common case for policy
evaluation.)
A naive estimator of the ATE is just the difference in conditional means, E[Y|D = 1] − E[Y|D = 0].
This is obviously not a good thing to do unless Y0, Y1 ⊥ D – that is, unless treatment
is randomly assigned (as it would be in a controlled lab setting, or in a tightly controlled
field experiment). Under random assignment,
E[Y|D = 1] − E[Y|D = 0] = E[Y1|D = 1] − E[Y0|D = 0] = E[Y1] − E[Y0] = E[Y1 − Y0].
Otherwise, typically E[Y|D = 0] = E[Y0|D = 0] ̸= E[Y0], and similarly for E[Y|D = 1].
3.2 Selection on observables
• Selection on observables assumption: (Y0, Y1) ⊥ D | Z, where Z is a vector of observed
covariates.
• Define the propensity score Q ≡ Prob(D = 1|Z). This can be estimated for each individual
in the sample. Hence we assume that we observe (Y, D, Z, Q) for everyone in the sample.
Remember that Q is just a function of Z.
• Rosenbaum and Rubin (1983) theorem: under the selection on observables assump-
tion, we also have (Y0 , Y1 ) ⊥ D|Q.
Proof: We want to show that P(D, Y1, Y0|Q) = P(D|Q)P(Y1, Y0|Q). By the definition of
conditional probability, P(D, Y1, Y0|Q) = P(D|Y1, Y0, Q)P(Y1, Y0|Q), so it suffices to show
P(D|Y1, Y0, Q) = P(D|Q). Since D is binary, we can focus on showing P(D = 1|Y1, Y0, Q) =
P(D = 1|Q). Note that
P(D = 1|Y1, Y0, Q) = E[D|Y1, Y0, Q] = E{E[D|Y1, Y0, Z] | Y1, Y0, Q} = E{E[D|Z] | Y1, Y0, Q} = E[Q|Y1, Y0, Q] = Q,
where the third equality uses the selection on observables assumption and the iterated
expectation uses the fact that Q is a function of Z. Similarly, P(D = 1|Q) = E{E[D|Z] | Q} =
E[Q|Q] = Q, so the two coincide. ■
3.2.1 Inverse PS weighting
• Main result: E(Y1) = E[D·Y / Q]   (Horvitz-Thompson estimator).
Proof: Since D·Y = D·Y1 and Q is a function of Z,
E[D·Y/Q] = E{E[D·Y/Q | Z]} = E{(1/Q) E[D·Y1 | Z]}
= E{(1/Q) E[E(D·Y1 | Z, D) | Z]}
= E{(1/Q) E[D · E(Y1 | Z, D) | Z]}
= E{(1/Q) E[D · E(Y1 | Z) | Z]}
= E{(E(Y1|Z)/Q) · E[D|Z]}
= E{(E(Y1|Z)/Q) · Q} = E[E(Y1|Z)] = E(Y1),
where the fourth line uses the selection on observables assumption (Y1 ⊥ D | Z). ■
Similarly, E(Y0) = E[(1 − D)·Y / (1 − Q)].
• This is inverse propensity score weighting. Intuitively, in the case of E(Y1), you weight
each individual in the treated sample by the inverse of the probability Q of that individual
being treated, so that treated individuals who were unlikely to be treated count for more.
• The derivation above requires Q > 0 (and, for E(Y0), Q < 1) with probability one.
This is known as the overlap assumption. Practically, it implies that for any Z,
individuals with those covariates have a nonzero chance of being treated. Obviously,
if there is any set of Z with positive probability for which Q = 0, then this set must
be excluded from the expectation above, and so it is invalid to interpret it as the
unconditional mean of Y1.
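The following sketch (simulated data; the logistic propensity score and outcome equations are illustrative assumptions) implements the Horvitz-Thompson weighting above and contrasts it with the naive difference in means, which is biased here because Z drives both treatment and outcomes:

```python
# Sketch: inverse propensity score weighting recovers E(Y1) - E(Y0).
import numpy as np

rng = np.random.default_rng(6)
n = 500_000
Z = rng.normal(size=n)                          # observed covariate
Q = 1.0 / (1.0 + np.exp(-Z))                    # propensity score P(D=1|Z), in (0,1)
D = (rng.uniform(size=n) < Q).astype(float)
Y1 = 2.0 + Z + rng.normal(size=n)               # potential outcomes depend on Z
Y0 = 0.0 + Z + rng.normal(size=n)
Y = D * Y1 + (1 - D) * Y0

EY1_ipw = np.mean(D * Y / Q)                    # Horvitz-Thompson estimate of E(Y1)
EY0_ipw = np.mean((1 - D) * Y / (1 - Q))        # estimate of E(Y0)
print(EY1_ipw - EY0_ipw)                        # ~ 2.0, the ATE
print(Y[D == 1].mean() - Y[D == 0].mean())      # naive difference in means, biased upward
```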
3.2.2 PS matching
This is essentially dimension reduction: instead of matching on the full covariate vector Z,
one matches on the scalar propensity score Q. Let FQ denote the distribution of propensity scores. We
have that
∫ {E[Y|D = 1, Q] − E[Y|D = 0, Q]} dFQ = ∫ {E[Y1|D = 1, Q] − E[Y0|D = 0, Q]} dFQ
= ∫ {E[Y1|Q] − E[Y0|Q]} dFQ
= E[Y1 − Y0],
which is the average treatment effect. The penultimate equality uses the Rosenbaum-Rubin
theorem.
This is “matching” in the sense that, for each value of Q, you compare the average outcome
of treated vs. untreated individuals with that Q. There are many variants, depending on
how individuals in the treated and untreated samples are matched; a simple one is sketched
below.
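One simple variant — stratifying on the propensity score (here treated as known) and averaging within-stratum treated/untreated differences over dFQ — is sketched below on simulated data; the DGP and the choice of 20 quantile bins are illustrative assumptions:

```python
# Sketch: propensity-score stratification estimate of the ATE.
import numpy as np

rng = np.random.default_rng(7)
n = 500_000
Z = rng.normal(size=n)
Q = 1.0 / (1.0 + np.exp(-Z))                    # propensity score
D = (rng.uniform(size=n) < Q).astype(float)
Y = 2.0 * D + Z + rng.normal(size=n)            # treatment effect = 2

# Stratify on Q: within each bin, compare treated vs. untreated means,
# then average the within-bin differences with weights dF_Q.
bins = np.quantile(Q, np.linspace(0, 1, 21))
idx = np.clip(np.digitize(Q, bins) - 1, 0, 19)
diffs, weights = [], []
for b in range(20):
    mask = idx == b
    treated, control = Y[mask & (D == 1)], Y[mask & (D == 0)]
    if len(treated) and len(control):
        diffs.append(treated.mean() - control.mean())
        weights.append(mask.mean())              # weight of the bin under F_Q
print(np.average(diffs, weights=weights))        # ~ 2.0, the ATE
```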
Example: Angrist and Lavy (1999): y is test scores, d is class size, z is an indicator for
whether total enrollment was “just above” a multiple of 40. Maimonides’ rule states (roughly)
that no class size should exceed forty, so that if enrollment (treated as exogenous) is “just
below” a multiple of 40, class sizes will be bigger, whereas if enrollment is “just above” a
multiple of 40, class sizes will be smaller. They restrict their sample to school-cohorts where
total enrollment was within +/− 5 of a multiple of 40.
This leads to a Wald-type estimator.¹
• Assumption A2 (Independence): Y1 , Y0 , D1 , D0 ⊥ Z
• A3 (“rank”): E[D1 − D0 ] ̸= 0.
¹ See Hahn, Todd, and van der Klaauw (2001).
• The observed treatment and outcome are:
– D = (1 − Z)D0 + ZD1
– Y = (1 − D)Y0 + DY1 (by the exclusion restriction, Z does not enter directly)
• Main result: the Wald estimator {E[Y|Z = 1] − E[Y|Z = 0]} / {E[D|Z = 1] − E[D|Z = 0]}
estimates E[Y1 − Y0 | D1 > D0].
Proof: using the independence and exclusion assumptions, together with monotonicity (D1 ≥ D0), we have
E[Y|Z = 1] − E[Y|Z = 0] = E[Y0 + D1(Y1 − Y0)] − E[Y0 + D0(Y1 − Y0)] = E[(D1 − D0)(Y1 − Y0)] = E[Y1 − Y0 | D1 > D0] · P(D1 > D0),
E[D|Z = 1] − E[D|Z = 0] = E[D1 − D0] = P(D1 > D0),
and dividing gives the result. ■
Here, the Wald estimator measures the average effect of d on y for those for whom a change
in z from 0 to 1 would have affected the treatment d. This insight is known by several terms,
including local IV and local average treatment effect (LATE) (see Angrist and Imbens (1994)
for more details).
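A hedged simulation sketch of this result (the population shares of always-takers, never-takers, and compliers, and the effect sizes, are illustrative assumptions): the Wald estimator recovers the compliers' effect rather than the unconditional ATE.

```python
# Sketch: with heterogeneous effects, the Wald estimator recovers the LATE.
import numpy as np

rng = np.random.default_rng(8)
n = 1_000_000
Z = rng.integers(0, 2, size=n)

# Potential treatments: always-takers (D0=D1=1), never-takers (0,0), compliers (0,1)
type_ = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
D0 = (type_ == "always").astype(float)
D1 = ((type_ == "always") | (type_ == "complier")).astype(float)
D = (1 - Z) * D0 + Z * D1

# Heterogeneous effects: compliers' effect is 1, others' is 3
effect = np.where(type_ == "complier", 1.0, 3.0)
Y0 = rng.normal(size=n)
Y1 = Y0 + effect
Y = (1 - D) * Y0 + D * Y1

wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (D[Z == 1].mean() - D[Z == 0].mean())
print(wald)                # ~ 1.0 = E[Y1 - Y0 | D1 > D0], the compliers' effect
print((Y1 - Y0).mean())    # ~ 2.0 = the unconditional ATE
```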
Examples: Angrist (1990): y is lifetime earnings, d is Vietnam-era military service, z is
draft lottery number. Angrist and Krueger (1991): y is earnings, d is years of schooling, z is
quarter of birth (which, via compulsory schooling laws, shifts schooling).
4 Panel data
In panel data, one observes the same individual over several time periods, including (ideally)
periods both before and after a policy change. For example, d is often a policy change which
affects some states but not others.
In this richer data environment, one can estimate the effect of the policy change while
controlling arbitrarily for individual-specific heterogeneity, as well as for time-specific effects.
This is the difference-in-difference approach.
Abstractly, consider outcome variables indexed by the triple (i, t, d), with i, t, d ∈ {0, 1} (all
binary). Here i denotes a subsample, with i = 1 being the treated subsample. t denotes time
period, with t = 1 denoting the period when individuals in subsample i = 1 are treated. d
is the treatment variable, as before. Of the eight possible combinations, we only observe
Y000 , Y010 , Y100 , Y111 .
The DID is typically obtained by linear regression. Consider the following linear model:
Yitd = α + γ·t + δ·i + β·d + εitd,
so that the observable cell means are E[Y000] = α, E[Y010] = α + γ, E[Y100] = α + δ, and
E[Y111] = α + γ + δ + β (with E[Y110] = α + γ + δ the unobserved counterfactual), and
the DID, {E[Y111] − E[Y100]} − {E[Y010] − E[Y000]}, equals β. Taking first differences
within each subsample gives the regression
∆yi = γ + β∆di + ηi,
with η ⊥ ∆di . By running this regression, the estimated β̂ is an estimate of the DID.
In the regression context, it is easy to control for additional variables Zit which also affect
outcomes.
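A sketch of the DID on simulated two-period, two-group data (all parameter values and names are illustrative assumptions); it uses the equivalent dummies-plus-interaction regression and checks it against the difference in differences of the four cell means:

```python
# Sketch: difference-in-differences via regression and via cell means.
import numpy as np

rng = np.random.default_rng(9)
n = 50_000                                    # observations per (group, period) cell
beta_true = 1.5                               # treatment effect

rows = []
for i in (0, 1):                              # group: i=1 is treated in t=1
    for t in (0, 1):
        d = float(i == 1 and t == 1)
        y = 0.5 + 0.8 * t + 1.2 * i + beta_true * d + rng.normal(size=n)
        rows.append(np.column_stack([np.full(n, i), np.full(n, t), np.full(n, d), y]))
data = np.vstack(rows)
i_, t_, d_, y_ = data.T

# Regress y on a constant, group dummy, time dummy, and treatment (interaction)
Xmat = np.column_stack([np.ones(len(y_)), i_, t_, d_])
coef, *_ = np.linalg.lstsq(Xmat, y_, rcond=None)
print(coef[-1])                               # ~ 1.5, the DID estimate of beta

# Equivalently, the difference in differences of the four observed cell means
cell = lambda i, t: y_[(i_ == i) & (t_ == t)].mean()
print((cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0)))   # ~ 1.5
```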
There are many examples of this approach. Two are:
Card and Krueger (1994): y is employment, d is the minimum wage (looking for evidence of
general-equilibrium effects of the minimum wage). They exploit a policy shift which raised
the minimum wage in New Jersey but not in Pennsylvania; the sample is fast-food restaurants
on the NJ/Pennsylvania border.
Kim and Singal (1993): y is price, d is the concentration of a particular flight market. They
exploit the merger of Northwest and Republic airlines, which affected only markets (so we
hope) in which Northwest or Republic offered flights.
References
Angrist, J. (1990): “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence
from Social Security Administrative Records,” American Economic Review, 80, 313–336.
Angrist, J., and W. Evans (1990): “Lifetime Earnings and the Vietnam Era Draft
Lottery: Evidence from Social Security Administrative Records,” American Economic
Review, 80, 313–336.
Angrist, J., and G. Imbens (1994): “Identification and Estimation of Local Average
Treatment Effects,” Econometrica, 62, 467–476.
Angrist, J., and A. Krueger (1991): “Does Compulsory School Attendance Affect
Schooling and Earnings?,” Quarterly Journal of Economics, 106, 979–1014.
Angrist, J., and V. Lavy (1999): “Using Maimonides’ Rule to Estimate the Effect of
Class Size on Scholastic Achievement,” Quarterly Journal of Economics, 114, 533–575.
Angrist, J., and J. Pischke (2009): Mostly Harmless Econometrics. Princeton University
Press.
Card, D., and A. Krueger (1994): “Minimum Wages and Employment: A Case Study
of the Fast-Food Industry in New Jersey and Pennsylvania,” American Economic Review,
84, 772–93.
Hahn, J., P. Todd, and W. van der Klaauw (2001): “Identification and Estimation of
Treatment Effects with a Regression-Discontinuity Design,” Econometrica, 69, 201–209.
Kim, E., and V. Singal (1993): “Mergers and Market Power: Evidence from the Airline
Industry,” American Economic Review, 83, 549–569.
Rosenbaum, P., and D. Rubin (1983): “The central role of the propensity score in
observational studies of causal effects,” Biometrika, 70, 41–55.