Linear Regression
Given a new pair (X, Y) we want to predict Y from X. The conditional prediction risk is

r(m̂) = E[(Y − m̂(X))² | D] = ∫ (y − m̂(x))² dP(x, y).

Averaging over the training data D, the risk decomposes as

E[r(m̂)] = σ² + ∫ b_n²(x) dP(x) + ∫ v_n(x) dP(x)

where

σ² = E[(Y − m(X))²],   b_n(x) = E[m̂(x)] − m(x),   v_n(x) = Var(m̂(x)).
A linear predictor has the form g(x) = β^T x. The best linear predictor minimizes E(Y − β^T X)².
The minimizer is β = Σ^{−1} α where Σ = E[XX^T] and α = E(Y X). We will use linear
predictors, but we should never assume that m(x) is linear.
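To make this concrete, here is a minimal numpy sketch of the plug-in estimate β̂ = Σ̂^{−1} α̂ of the best linear predictor. The simulated data and variable names are purely illustrative (not part of the notes), and the simulated regression function is deliberately nonlinear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: m(x) is nonlinear, but we still fit the best *linear* predictor.
n, d = 500, 3
X = rng.normal(size=(n, d))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

# Plug-in estimates of Sigma = E[X X^T] and alpha = E[Y X].
Sigma_hat = X.T @ X / n
alpha_hat = X.T @ Y / n

# beta = Sigma^{-1} alpha, the coefficients of the best linear predictor.
beta_hat = np.linalg.solve(Sigma_hat, alpha_hat)
Y_pred = X @ beta_hat
print("estimated beta:", beta_hat)
```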
a_j = E[⟨X, v_j⟩ Y] / E[⟨X, X⟩²].
From this we can show that, for any vector b,

E[(Y − ⟨b, X⟩)²] − E[(Y − ⟨β, X⟩)²] = E[⟨b − β, X⟩²] = (b − β)^T Σ (b − β).
Theorem 1 (Theorem 11.3 of Gyorfi, Kohler, Krzyzak and Walk, 2002) Let σ² = sup_x Var(Y | X = x) < ∞. Assume that all the random variables are bounded by L < ∞. Then

E ∫ |β̂^T x − m(x)|² dP(x) ≤ 8 inf_β ∫ |β^T x − m(x)|² dP(x) + C d (log(n) + 1)/n.
The proof is straightforward but very long. The strategy is to first bound n^{−1} Σ_i (β̂^T X_i − m(X_i))² using the properties of least squares. Then, using concentration of measure, one can relate n^{−1} Σ_i f²(X_i) to ∫ f²(x) dP(x).
Theorem 2 (Hsu, Kakade and Zhang 2014) Let m(x) = E[Y | X = x] and ε = Y − m(X). Suppose there exists σ ≥ 0 such that

E[e^{tε} | X = x] ≤ e^{t²σ²/2}

for all x and all t ∈ R. Let β^T x be the best linear approximation to m(x). With probability at least 1 − 3e^{−t},

r(β̂) − r(β) ≤ (2A/n)(1 + √(8t))² + σ²(d + 2√(dt) + 2t)/n + o(1/n)

where A = E[‖Σ^{−1/2} X (m(X) − β^T X)‖²].
Theorem 3 We have

√n (β̂ − β) ⇝ N(0, Γ)

where

Γ = Σ^{−1} E[(Y − X^T β)² X X^T] Σ^{−1}.
The covariance matrix Γ can be consistently estimated by

Γ̂ = Σ̂^{−1} M̂ Σ̂^{−1}

where

M̂(j, k) = (1/n) Σ_{i=1}^n X_i(j) X_i(k) ε̂_i²

and ε̂_i = Y_i − β̂^T X_i.
The matrix Γ̂ is called the sandwich estimator. The Normal approximation can be used to construct confidence intervals for β. For example, β̂(j) ± z_α √(Γ̂(j, j)/n) is an asymptotic 1 − α confidence interval for β(j). We can also get confidence intervals by using the bootstrap. See Buja et al (2015) for details.
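As an illustration, here is a numpy/scipy sketch of the sandwich estimator and the resulting asymptotic confidence intervals. The function name sandwich_ci and the use of the two-sided normal quantile are my choices for the sketch, not something taken from the notes.

```python
import numpy as np
from scipy import stats

def sandwich_ci(X, Y, alpha=0.05):
    """Least squares fit with sandwich standard errors and asymptotic
    (1 - alpha) confidence intervals for each coefficient."""
    n = X.shape[0]
    beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
    eps_hat = Y - X @ beta_hat                         # residuals
    Sigma_hat = X.T @ X / n
    # M(j, k) = n^{-1} sum_i X_i(j) X_i(k) eps_i^2
    M_hat = (X * eps_hat[:, None] ** 2).T @ X / n
    Sigma_inv = np.linalg.inv(Sigma_hat)
    Gamma_hat = Sigma_inv @ M_hat @ Sigma_inv          # sandwich estimator
    se = np.sqrt(np.diag(Gamma_hat) / n)
    z = stats.norm.ppf(1 - alpha / 2)                  # two-sided normal quantile
    return beta_hat, np.column_stack([beta_hat - z * se, beta_hat + z * se])
```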
Now suppose that d > n. We can no longer use least squares. There are many approaches. The simplest is to preprocess the data to reduce the dimension. For example, we can perform PCA on the X's and use the first k principal components where k < n. Alternatively, we can cluster the covariates based on their correlations. We can then use one feature from each cluster or take the average of the covariates within each cluster. Another approach is to screen the variables by choosing the k features with the largest correlation with Y. After dimension reduction, we can then use least squares. These preprocessing methods can be very effective.
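Two of these preprocessing steps are easy to sketch in numpy. The helper names below are illustrative, and details such as centering are my own choices rather than part of the notes.

```python
import numpy as np

def screen_then_ls(X, Y, k):
    """Keep the k features most correlated (in absolute value) with Y,
    then run least squares on the reduced design."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean()
    corr = np.abs(Xc.T @ Yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(Yc) + 1e-12)
    keep = np.argsort(corr)[::-1][:k]
    beta = np.linalg.lstsq(X[:, keep], Y, rcond=None)[0]
    return keep, beta

def pca_then_ls(X, Y, k):
    """Project the covariates onto their first k principal components
    and regress Y on the component scores."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                  # first k principal component scores
    gamma = np.linalg.lstsq(Z, Y, rcond=None)[0]
    return Vt[:k], gamma
```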
A different approach is to use all the covariates but, instead of least squares, we shrink the
coefficients towards 0. This is called ridge regression and is discussed in the next section.
Yet another approach is model selection, where we try to find a good subset of the covariates. Let S be a subset of {1, ..., d} and let X_S = (X(j) : j ∈ S). If the size of S is not too large, we can regress Y on X_S instead of X.
In particular, fix k < n and let S_k denote all subsets of size k. For a given S ∈ S_k, let β_S be the best linear predictor β_S = Σ_S^{−1} α_S for the subset S. We would like to choose S ∈ S_k to minimize

E(Y − β_S^T X_S)².
This is equivalent to:
There will be a bias-variance tradeoff. As k increases, the bias decreases but the variance increases.
We can approximate the risk with the training error. But the minimization is over all subsets of size k. This minimization is NP-hard, so best subset regression is infeasible. We can approximate best subset regression in two different ways: a greedy approximation or a convex relaxation. The former leads to forward stepwise regression. The latter leads to the lasso. All these methods involve a tuning parameter which can be chosen by cross-validation.
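A generic K-fold cross-validation loop for choosing such a tuning parameter might look as follows. The callables fit and predict are placeholders for whichever method (stepwise, lasso, ridge) is being tuned; the function name is illustrative.

```python
import numpy as np

def kfold_cv_error(fit, predict, X, Y, params, n_folds=5, seed=0):
    """Return the cross-validated mean squared error for each candidate
    tuning parameter in `params`."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(Y)) % n_folds
    errors = []
    for p in params:
        fold_err = []
        for f in range(n_folds):
            train, test = folds != f, folds == f
            model = fit(X[train], Y[train], p)
            fold_err.append(np.mean((Y[test] - predict(model, X[test])) ** 2))
        errors.append(np.mean(fold_err))
    return np.array(errors)   # choose the parameter with the smallest entry
```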
3 Ridge Regression
Theorem 4 (Hsu, Kakade and Zhang 2014) Suppose that ‖X_i‖ ≤ r. Let β^T x be the best linear approximation to m(x). Then, with probability at least 1 − 4e^{−t},

r(β̂) − r(β) ≤ (1 + O((1 + r²/λ)/n)) ( λ‖β‖²/(2n) + σ² tr(Σ)/(2λ) ).
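The notes do not reproduce the ridge objective at this point, so the sketch below uses the standard penalized form, minimizing n^{−1} Σ_i (Y_i − X_i^T β)² + λ‖β‖²; that normalization is one common convention and is an assumption of mine.

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge regression in closed form: solve (X^T X / n + lam I) beta = X^T Y / n.
    The matrix is positive definite for lam > 0, so this works even when d > n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)
```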
Here is the forward stepwise regression algorithm (a code sketch follows the steps):
1. Input k. Let S = ∅.
2. Let r_j = n^{−1} Σ_i Y_i X_i(j) denote the correlation between Y and the j-th feature. Let J = argmax_j |r_j|. Let S = S ∪ {J}.
3. Compute the regression of Y on X_S = (X(j) : j ∈ S). Compute the residuals e = (e_1, ..., e_n) where e_i = Y_i − β̂_S^T X_i.
4. Compute the correlations r_j between the residuals e and the remaining features.
5. Let J = argmax_j |r_j|. Let S = S ∪ {J}.
6. Repeat steps 3-5 until |S| = k.
7. Output S.
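The sketch below implements these steps in numpy. It assumes the covariates are standardized so that n^{−1} Σ_i Y_i X_i(j) can be read as a correlation; the function name is illustrative.

```python
import numpy as np

def forward_stepwise(X, Y, k):
    """Greedy forward selection: repeatedly add the feature most correlated
    with the current residuals, refitting least squares after each addition."""
    S = []
    resid = Y.copy()
    for _ in range(k):
        corr = np.abs(X.T @ resid) / len(Y)
        if S:
            corr[S] = -np.inf              # do not reselect chosen features
        S.append(int(np.argmax(corr)))
        beta_S = np.linalg.lstsq(X[:, S], Y, rcond=None)[0]
        resid = Y - X[:, S] @ beta_S
    return S, beta_S
```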
Now we will discuss the theory of forward stepwise regression. We will use the following notation. If β = (β_1, β_2, ...) is a sequence of real numbers then the ℓ_p norm is

‖β‖_p = ( Σ_j |β_j|^p )^{1/p}.
Let

|β_(1)| ≥ |β_(2)| ≥ |β_(3)| ≥ ···

denote the values of β_j ordered by their decreasing absolute values. The weak ℓ_p norm of β, denoted by ‖β‖_{w,p}, is the smallest C such that

|β_(j)| ≤ C / j^{1/p},   j = 1, 2, ....
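As a quick check of these definitions, here is a small numpy sketch computing both norms; it uses the fact that the smallest such C is max_j j^{1/p} |β_(j)|.

```python
import numpy as np

def lp_norm(beta, p):
    """Ordinary l_p norm: (sum_j |beta_j|^p)^(1/p)."""
    return np.sum(np.abs(beta) ** p) ** (1.0 / p)

def weak_lp_norm(beta, p):
    """Weak l_p norm: the smallest C with |beta_(j)| <= C / j^(1/p),
    which equals max_j j^(1/p) |beta_(j)| for the decreasingly sorted |beta|'s."""
    sorted_abs = np.sort(np.abs(beta))[::-1]      # |beta_(1)| >= |beta_(2)| >= ...
    j = np.arange(1, len(sorted_abs) + 1)
    return np.max(j ** (1.0 / p) * sorted_abs)
```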
Let Σ_N denote all linear combinations of elements of D with at most N terms. Define the best N-term approximation error

σ_N(f) = inf_{|Λ|≤N} inf_{g∈Span(Λ)} ‖f − g‖   (1)

where Λ denotes a subset of D and Span(Λ) is the set of linear combinations of functions in Λ.
A dictionary D = {ψ_1, ψ_2, ...} is an orthonormal basis for H if ⟨ψ_j, ψ_k⟩ = 0 for j ≠ k, ‖ψ_j‖ = 1 for all j, and the linear span of D is dense in H. In this case, any f ∈ H can be written as f = Σ_j β_j ψ_j where β_j = ⟨f, ψ_j⟩. The L_p ball is defined by

L_p(C) = { f = Σ_j β_j ψ_j : ‖β‖_p ≤ C }.   (2)
Now we drop the assumption that D is an orthonormal basis. A function may then have more than one expansion of the form f = Σ_j β_j ψ_j. We define the norm

‖f‖_{L1} = inf { Σ_j |β_j| : f = Σ_j β_j ψ_j },

where the infimum is over all such expansions.
The orthogonal greedy algorithm is as follows.
1. Input: f.
2. Initialize: r_0 = f, f_0 = 0, V_0 = ∅.
3. At step N, let g_N ∈ D maximize |⟨r_{N−1}, g⟩|. Set V_N = V_{N−1} ∪ {g_N}. Let f_N be the projection of f onto Span(V_N) and set r_N = f − f_N.

The residuals of this algorithm satisfy

‖r_N‖ ≤ ‖f‖_{L1} / √(N + 1)   (4)

for all N ≥ 1.
Proof. Note that f_N is the best approximation to f from Span(V_N). On the other hand, the best approximation from the set {a g_N : a ∈ R} is ⟨f, g_N⟩ g_N. The error of the former must be smaller than the error of the latter. In other words, ‖f − f_N‖² ≤ ‖f − f_{N−1} − ⟨r_{N−1}, g_N⟩ g_N‖². Thus,

‖r_N‖² ≤ ‖r_{N−1}‖² − ⟨r_{N−1}, g_N⟩².
Now, f = f_{N−1} + r_{N−1} and ⟨f_{N−1}, r_{N−1}⟩ = 0. So, ‖r_{N−1}‖² = ⟨r_{N−1}, f⟩ ≤ ‖f‖_{L1} sup_{g∈D} |⟨r_{N−1}, g⟩| = ‖f‖_{L1} |⟨r_{N−1}, g_N⟩|. Combining this with the previous display and using induction gives (4).
A related bound holds for any h ∈ L1:

‖r_N‖² ≤ ‖f − h‖² + 4‖h‖²_{L1} / N.   (6)
Proof. Choose any h ∈ L1 and write h = Σ_j β_j ψ_j where ‖h‖_{L1} = Σ_j |β_j|. Write f = f_{N−1} + (f − f_{N−1}) = f_{N−1} + r_{N−1} and note that r_{N−1} is orthogonal to f_{N−1}. Hence, ‖r_{N−1}‖² = ⟨r_{N−1}, f⟩ and so
1. Input: Y ∈ R^n.
2. Initialize: r_0 = Y, f̂_0 = 0, V_0 = ∅.
3. At step N, let g_N ∈ D maximize |⟨r_{N−1}, g⟩_n|, where ⟨a, b⟩_n = n^{−1} Σ_{i=1}^n a_i b_i. Set V_N = V_{N−1} ∪ {g_N}. Let f̂_N be the projection of Y onto Span(V_N). Let r_N = Y − f̂_N.

(A code sketch of this empirical algorithm is given below.)
Hence,
|⟨r_{N−1}, g_N⟩|² ≥ (‖r_{N−1}‖² − ‖f − h‖²)² / (4‖h‖²_{L1}).

Thus,

a_N ≤ a_{N−1} ( 1 − a_{N−1} / (4‖h‖²_{L1}) )

where a_N = ‖r_N‖² − ‖f − h‖². By induction, the last displayed inequality implies that a_N ≤ 4‖h‖²_{L1}/N and the result follows.
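The empirical algorithm above is, in effect, orthogonal matching pursuit over the dictionary. Here is a numpy sketch in which the columns of D hold the dictionary elements evaluated at the data points; the function name and the rule of never reselecting a column are my own choices.

```python
import numpy as np

def orthogonal_greedy(Y, D, N):
    """Empirical orthogonal greedy algorithm: at each step pick the dictionary
    column with the largest empirical inner product with the current residual,
    then project Y onto the span of all columns selected so far."""
    n = len(Y)
    V = []
    resid = Y.copy()
    for _ in range(N):
        scores = np.abs(D.T @ resid) / n      # <r_{N-1}, g>_n for each column g
        if V:
            scores[V] = -np.inf               # keep each element at most once
        V.append(int(np.argmax(scores)))
        coef = np.linalg.lstsq(D[:, V], Y, rcond=None)[0]
        f_hat = D[:, V] @ coef                # projection of Y onto Span(V_N)
        resid = Y - f_hat
    return V, f_hat
```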
By combining the previous results with concentration of measure arguments (see appendix
for details) we get the following result, due to Barron, Cohen, Dahmen and DeVore (2008).
Theorem 10 Let h_n = argmin_{h∈F_N} ‖f_0 − h‖². Suppose that lim sup_{n→∞} ‖h_n‖_{L1,n} < ∞. Let N ∼ √n. Then, for every γ > 0, there exists C > 0 such that

‖f − f̂_N‖² ≤ 4σ_N² + C log n / n^{1/2}

except on a set of probability n^{−γ}.
Let us compare this with the lasso, which we will discuss next. Let f_L = Σ_j β_j ψ_j minimize ‖f − f_L‖² subject to ‖β‖_1 ≤ L. Then, we will see that

‖f − f̂_L‖² ≤ ‖f − f_L‖² + O_P( (log n / n)^{1/2} )

which is the same rate.
The n−1/2 is in fact optimal. It might be surprising that the rate is independent of the
dimension. Why do you think this is the case?
The lasso optimization is a convex problem, so the estimator can be found efficiently. The estimator is sparse: for large enough λ, many of the components of β̂ are 0. This is proved in the course on convex optimization. Now we discuss some theoretical properties of the lasso.¹
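For computation one can use any off-the-shelf solver. The sketch below uses scikit-learn's Lasso, which solves the penalized (Lagrangian) form rather than the constrained form ‖β‖_1 ≤ L that appears later in these notes; the simulated data are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Illustrative sparse problem: only 3 of the 50 coefficients are nonzero.
n, d = 200, 50
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.0, 0.5]
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Larger alpha (the penalty weight) forces more coefficients to be exactly 0.
fit = Lasso(alpha=0.1).fit(X, Y)
print("indices of nonzero coefficients:", np.flatnonzero(fit.coef_))
```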
The following result was proved in Zhao and Yu (2006), Meinshausen and Bühlmann (2005) and Wainwright (2006). The version we state is from Wainwright (2006). Let β = (β_1, ..., β_s, 0, ..., 0) and decompose the design matrix as X = (X_S  X_{S^c}) where S = {1, ..., s}. Let β_S = (β_1, ..., β_s).
2. The design matrix satisfies
3. φ_n(d_n) > 0.
5. λ_n satisfies

n λ_n² / log(d_n − s_n) → ∞

and

(1 / min_{1≤j≤s_n} |β_j|) ( √(log s_n / n) + λ_n ‖(n^{−1} X^T X)^{−1}‖_∞ ) → 0.   (8)
The conditions of this theorem are very strong. They are not checkable and they are unlikely
to ever be true in practice.
Then

‖β_n − β̂_n‖² = O_P( (log n / n) · s_n log n / φ_n²(s_n log n) ) + O(1 / log n).   (9)
If

s_n log d_n (log n / n) → 0   (10)
and

λ_n = √( σ_y² Φ_n(min(n, d_n)) n² / (s_n log n) )   (11)

then ‖β̂_n − β_n‖² → 0 in probability.
Once again, the conditions of this theorem are very strong. They are not checkable and they
are unlikely to ever be true in practice.
The next theorem is the most important one. It does not require unrealistic conditions. We
state the theorem for bounded covariates. A more general version appears in Greenshtein
and Ritov (2004).
Theorem 13 Let Z = (Y, X). Assume that |Y| ≤ B and max_j |X(j)| ≤ B. Let

β_* = argmin_{‖β‖_1 ≤ L} r(β)

where r(β) = E(Y − β^T X)². Thus, β_*^T x is the best, sparse linear predictor (in the L1 sense). Let β̂ be the lasso estimator:

β̂ = argmin_{‖β‖_1 ≤ L} r̂(β)

where r̂(β) = n^{−1} Σ_{i=1}^n (Y_i − X_i^T β)². With probability at least 1 − δ,

r(β̂) ≤ r(β_*) + √( (16(L+1)⁴ B² / n) log(√2 d / √δ) ).
Proof. Let Z = (Y, X) and Z_i = (Y_i, X_i). Define γ ≡ γ(β) = (−1, β). Then

r(β) = E(Y − β^T X)² = γ^T Λ γ

where Λ = E[ZZ^T]. Note that ‖γ‖_1 = ‖β‖_1 + 1. Let B = {β : ‖β‖_1 ≤ L}. The training error is

r̂(β) = n^{−1} Σ_{i=1}^n (Y_i − X_i^T β)² = γ^T Λ̂ γ

where Λ̂ = n^{−1} Σ_{i=1}^n Z_i Z_i^T.
Therefore,

|r̂(β) − r(β)| = |γ^T(Λ̂ − Λ)γ| ≤ Σ_{j,k} |γ(j)| |γ(k)| |Λ̂(j,k) − Λ(j,k)| ≤ ‖γ‖_1² ∆_n ≤ (L+1)² ∆_n

where

∆_n = max_{j,k} |Λ̂(j,k) − Λ(j,k)|.
So,

r(β̂) ≤ r̂(β̂) + (L+1)² ∆_n ≤ r̂(β_*) + (L+1)² ∆_n ≤ r(β_*) + 2(L+1)² ∆_n.
Note that |Z(j)Z(k)| ≤ B² < ∞. By Hoeffding's inequality, for each pair (j, k),

P( |Λ̂(j,k) − Λ(j,k)| ≥ ε ) ≤ 2 e^{−nε²/(2B²)}.

Taking a union bound over the pairs (j, k) and setting the resulting bound equal to δ gives, with probability at least 1 − δ,

r(β̂) ≤ r(β_*) + √( (16(L+1)⁴ B² / n) log(√2 d / √δ) ).
Problems With Sparsity. Sparse estimators are convenient and popular but they can have some problems. Say that β̂ is weakly sparsistent if, for every β,

P_β( I(β̂_j ≠ 0) ≤ I(β_j ≠ 0) for all j ) → 1.   (12)
Theorem 14 (Leeb and Pötscher (2007)) Suppose that the following conditions hold:
1. d is fixed.
2. The covariates are nonstochastic and n^{−1} X^T X → Q for some positive definite matrix Q.
3. The errors ε_i are independent with mean 0, finite variance σ², and have a density f satisfying

0 < ∫ (f′(x)/f(x))² f(x) dx < ∞.
If β̂ is weakly sparsistent then, for every nonnegative loss function ℓ with sup_s ℓ(s) = ∞,

sup_β E_β[ℓ(n^{1/2}(β̂ − β))] → ∞.

Proof. Choose any s ∈ R^d and let β_n = −s/√n. Then,

sup_β E_β[ℓ(n^{1/2}(β̂ − β))] ≥ E_{β_n}[ℓ(n^{1/2}(β̂ − β_n))] ≥ E_{β_n}[ℓ(n^{1/2}(β̂ − β_n)) I(β̂ = 0)]
= ℓ(−√n β_n) P_{β_n}(β̂ = 0) = ℓ(s) P_{β_n}(β̂ = 0).
Now, P_0(β̂ = 0) → 1 by assumption. It can be shown that we also have P_{β_n}(β̂ = 0) → 1.² Hence, with probability tending to 1,

sup_β R(β̂_n) / R_n → ∞.
The implication is that when d is much smaller than n, sparse estimators have poor behavior.
However, when dn is increasing and dn > n, the least squares estimator no longer satisfies
(13). Thus we can no longer say that some other estimator outperforms the sparse estimator.
In summary, sparse estimators are well-suited for high-dimensional problems but not for low
dimensional problems.
5 Inference?
Is it possible to do inference after model selection? Do we need to? I’ll discuss this in class.
² This follows from a property called contiguity.
References
Buja, Berk, Brown, George, Pitkin, Traskin, Zhao and Zhang (2015). Models as Approximations: A Conspiracy of Random Regressors and Model Deviations Against Classical Inference in Regression. Statistical Science.

Hsu, Kakade and Zhang (2014). Random design analysis of ridge regression. arXiv:1106.2363.
Appendix: L2 Boosting
Define estimators m̂_n^{(0)}, ..., m̂_n^{(k)}, ..., as follows. Let m̂^{(0)}(x) = 0 and then iterate the following steps:
1. Compute the current residuals U_i = Y_i − m̂^{(k)}(X_i).
2. For each j, let β̂_j be the least squares coefficient from regressing the residuals U = (U_1, ..., U_n) on the j-th covariate.
3. Find J = argmin_j RSS_j where RSS_j = Σ_i (U_i − β̂_j X_{ij})².
4. Set m̂^{(k+1)}(x) = m̂^{(k)}(x) + β̂_J x_J.
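Here is a numpy sketch of this componentwise L2 boosting scheme. The optional shrinkage factor is a common practical variant and is my addition, not part of the notes.

```python
import numpy as np

def l2_boost(X, Y, n_steps, shrinkage=1.0):
    """Componentwise L2 boosting: at each step regress the current residuals
    on the single best covariate and add that univariate fit to the model."""
    n, d = X.shape
    m_hat = np.zeros(n)
    coef = np.zeros(d)
    for _ in range(n_steps):
        U = Y - m_hat                                    # current residuals
        b = (X.T @ U) / np.sum(X ** 2, axis=0)           # univariate LS coefficients
        rss = np.sum((U[:, None] - X * b) ** 2, axis=0)  # RSS_j for each feature j
        J = int(np.argmin(rss))
        coef[J] += shrinkage * b[J]
        m_hat += shrinkage * b[J] * X[:, J]
    return coef, m_hat
```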
The resulting fit is linear in Y:

Ŷ^{(k)} = B_k Y   (16)

where Ŷ^{(k)} = (m̂^{(k)}(X_1), ..., m̂^{(k)}(X_n))^T, and

H_j = X_j X_j^T / ‖X_j‖².   (18)
Theorem 16 (Bühlmann 2005) Let m_n(x) = Σ_{j=1}^{d_n} β_{j,n} x_j be the best linear approximation based on d_n terms. Suppose that:
(A1 Growth) d_n ≤ C_0 e^{C_1 n^{1−ξ}} for some C_0, C_1 > 0 and some 0 < ξ ≤ 1.
(A3 Bounded Covariates) sup_n max_{1≤j≤d_n} max_i |X_{ij}| < ∞ with probability 1.
Then

E_X |m̂_n(X) − m_n(X)|² → 0   (19)

as n → ∞.
for some 0 < t_k ≤ 1. In the weak greedy algorithm we take F_k = F_{k−1} + ⟨f, g_k⟩ g_k. In the weak orthogonal greedy algorithm we take F_k to be the projection of R_{k−1}(f) onto {g_1, ..., g_k}. Finally set R_k(f) = f − F_k.
L2 boosting essentially replaces ⟨f, X_j⟩ with ⟨Y, X_j⟩_n = n^{−1} Σ_i Y_i X_{ij}. Now ⟨Y, X_j⟩_n has mean ⟨f, X_j⟩. The main burden of the proof is to show that ⟨Y, X_j⟩_n is close to ⟨f, X_j⟩ with high probability and then apply Temlyakov's result. For this we use Bernstein's inequality.
Recall that if the |Z_j| are bounded by M and Z_j has variance σ² then

P( |Z̄ − E(Z_j)| > ε ) ≤ 2 exp( − (1/2) n ε² / (σ² + Mε/3) ).   (22)
Hence, the probability that any of the empirical inner products differs from its functional counterpart by more than ε is no more than

2 d_n² exp( − (1/2) n ε² / (σ² + Mε/3) ) → 0   (23)

because of the growth condition.
The L1 norm depends on n and so we denote it by ‖h‖_{L1,n}. For technical reasons, we assume that ‖f‖_∞ ≤ B, that f̂_n is truncated to be no more than B and that ‖ψ‖_∞ ≤ B for all ψ ∈ D_n.
Theorem 18 Suppose that p_n ≡ |D_n| ≤ n^c for some c ≥ 0. Let f̂_N be the output of the stepwise regression algorithm after N steps. Let f(x) = E(Y | X = x) denote the true regression function. Then, for every h ∈ D_n,

P( ‖f − f̂_N‖² > 4‖f − h‖² + 8‖h‖²_{L1,n}/N + C N log n / n ) < 1/n^γ.
Before proving this theorem, we need some preliminary results. For any Λ ⊂ D, let S_Λ = Span(Λ). Define

F_N = ∪ { S_Λ : |Λ| ≤ N }.

Recall that, if F is a set of functions, then N_p(ε, F, ν) is the L_p covering entropy with respect to the probability measure ν and N_p(ε, F) is the supremum of N_p(ε, F, ν) over all probability measures ν.
Also,

N_1(t, F_N) ≤ 12 p^N ( (2eB/t) log(3eB/t) )^{N+1},   N_2(t, F_N) ≤ 12 p^N ( (2eB²/t²) log(3eB²/t²) )^{N+1}.
Proof. The first two equations follow from standard covering arguments. The second two equations follow from the fact that the number of subsets Λ of D of size at most N is at most

Σ_{j=1}^N (p choose j) ≤ Σ_{j=1}^N (ep/j)^j ≤ N (ep/N)^N ≤ p^N max_{N≥1} N (e/N)^N ≤ 4 p^N.
The following lemma is from Chapter 11 of Gyorfi et al. The proof is long and technical and we omit it.

≤ 14 N_1( β/(20B), F ) exp( − ε²(1 − ε) α n / (214 (1 + ε) B⁴) ).
Apply Lemma 20 with ε = 1/2 together with Lemma 19 to conclude that, for C_0 > 0 large enough,

P( A_1 > C_0 N log n / n for some f ) < 1/n^γ.
To bound A_2, apply Theorem 8 with norm ‖·‖_n and with Y replacing f. Then,

‖Y − f̂‖²_n ≤ ‖Y − h‖²_n + 4‖h‖²_{1,n}/k
and hence A_2 ≤ 8‖h‖²_{1,n}/k. Next, we have that