
Linear Regression

We observe D = {(X1 , Y1 ), . . . , (Xn , Yn )} where Xi = (Xi (1), . . . , Xi (d)) ∈ Rd and Yi ∈ R.


For notational simplicity, we will always assume that Xi (1) = 1.

Given a new pair (X, Y ) we want to predict Y from X. The conditional prediction risk is
\[
r(\hat m) = E\big[(Y - \hat m(X))^2 \mid D\big] = \int (y - \hat m(x))^2\, dP(x,y)
\]
and the prediction risk of $\hat m$ is
\[
R(\hat m) = E(Y - \hat m(X))^2 = E[r(\hat m)].
\]

The true regression function is


m(x) = E[Y |X = x].

We have the following decomposition:


\[
R(\hat m) = \sigma^2 + \int b_n^2(x)\, dP(x) + \int v_n(x)\, dP(x)
\]
where
\[
\sigma^2 = E[Y - m(X)]^2, \qquad b_n(x) = E[\hat m(x)] - m(x), \qquad v_n(x) = \mathrm{Var}(\hat m(x)).
\]

We do not assume that $m(x)$ is linear. Let $\epsilon = Y - m(X)$. Note that
\[
E[\epsilon] = E[Y - m(X)] = E\big[E[Y - m(X) \mid X]\big] = 0.
\]
A linear predictor has the form $g(x) = \beta^T x$. The best linear predictor minimizes $E(Y - \beta^T X)^2$.
The minimizer is $\beta = \Sigma^{-1}\alpha$ where $\Sigma = E[XX^T]$ and $\alpha = E(YX)$. We will use linear
predictors, but we should never assume that $m(x)$ is linear.

1 Low Dimensional Linear Regression

Let $\Sigma = E[XX^T]$. We assume that $\Sigma$ is non-singular. Let $v_1, \ldots, v_d$ be the eigenvectors of
$\Sigma$ and let $\lambda_1, \ldots, \lambda_d$ be the corresponding eigenvalues.

Homework: We can write $\beta = \sum_j a_j v_j$ where
\[
a_j = \frac{E[\langle X, v_j\rangle Y]}{E[\langle X, v_j\rangle^2]}.
\]

From this we can show that, for any vector $b$,
\[
E[\langle X, b\rangle Y] = E[\langle X, b\rangle \langle X, \beta\rangle]
\]
and
\[
E[(Y - \langle b, X\rangle)^2] - E[(Y - \langle \beta, X\rangle)^2] = E[\langle b - \beta, X\rangle^2] = (b - \beta)^T \Sigma (b - \beta).
\]
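One way to verify these (my own filling-in, using only $\alpha = E[YX]$ and $\beta = \Sigma^{-1}\alpha$):
\[
E[\langle X, b\rangle Y] = b^T E[XY] = b^T\alpha = b^T\Sigma\beta = E[(b^TX)(X^T\beta)] = E[\langle X, b\rangle\langle X, \beta\rangle],
\]
and expanding both squares gives
\[
E[(Y - \langle b, X\rangle)^2] - E[(Y - \langle\beta, X\rangle)^2] = b^T\Sigma b - 2b^T\Sigma\beta + \beta^T\Sigma\beta = (b-\beta)^T\Sigma(b-\beta).
\]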

The training error is
\[
\frac{1}{n}\sum_i (Y_i - X_i^T\beta)^2
\]
which is minimized by
\[
\hat\beta = \hat\Sigma^{-1}\hat\alpha
\]
where $\hat\Sigma = n^{-1}\sum_{i=1}^n X_i X_i^T$ and $\hat\alpha = n^{-1}\sum_{i=1}^n Y_i X_i$.
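As a concrete illustration, here is a minimal NumPy sketch of this plug-in estimator (the function name and the simulated example are mine; as in these notes, the first column of X is taken to be the constant 1):

import numpy as np

def least_squares(X, Y):
    """Plug-in least squares: beta_hat = Sigma_hat^{-1} alpha_hat."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n          # n^{-1} sum_i X_i X_i^T
    alpha_hat = X.T @ Y / n          # n^{-1} sum_i Y_i X_i
    return np.linalg.solve(Sigma_hat, alpha_hat)

# Example usage on simulated data with an intercept column.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
Y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=100)
beta_hat = least_squares(X, Y)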

Theorem 1 (Theorem 11.3 of Gyorfi, Kohler, Krzyzak and Walk, 2002) Let $\sigma^2 =
\sup_x \mathrm{Var}(Y|X = x) < \infty$. Assume that all the random variables are bounded by $L < \infty$.
Then
\[
E\int |\hat\beta^T x - m(x)|^2\, dP(x) \le 8 \inf_\beta \int |\beta^T x - m(x)|^2\, dP(x) + \frac{Cd(\log(n) + 1)}{n}.
\]

The proof is straightforward but is very long. The strategy is to first bound $n^{-1}\sum_i (\hat\beta^T X_i -
m(X_i))^2$ using the properties of least squares. Then, using concentration of measure, one can
relate $n^{-1}\sum_i f^2(X_i)$ to $\int f^2(x)\, dP(x)$.

Theorem 2 (Hsu, Kakade and Zhang 2014) Let $m(x) = E[Y|X = x]$ and $\epsilon = Y -
m(X)$. Suppose there exists $\sigma \ge 0$ such that
\[
E[e^{t\epsilon} \mid X = x] \le e^{t^2\sigma^2/2}
\]
for all $x$ and all $t \in \mathbb{R}$. Let $\beta^T x$ be the best linear approximation to $m(x)$. With probability
at least $1 - 3e^{-t}$,
\[
r(\hat\beta) - r(\beta) \le \frac{2A}{n}(1 + \sqrt{8t})^2 + \frac{\sigma^2(d + 2\sqrt{dt} + 2t)}{n} + o(1/n)
\]
where $A = E[\|\Sigma^{-1/2} X (m(X) - \beta^T X)\|^2]$.

We have the following central limit theorem for $\hat\beta$.

Theorem 3 We have
\[
\sqrt{n}(\hat\beta - \beta) \rightsquigarrow N(0, \Gamma)
\]
where
\[
\Gamma = \Sigma^{-1} E[(Y - X^T\beta)^2 XX^T]\Sigma^{-1}.
\]
The covariance matrix $\Gamma$ can be consistently estimated by
\[
\hat\Gamma = \hat\Sigma^{-1}\hat M\hat\Sigma^{-1}
\]
where
\[
\hat M(j,k) = \frac{1}{n}\sum_{i=1}^n X_i(j) X_i(k)\,\hat\epsilon_i^2
\]
and $\hat\epsilon_i = Y_i - \hat\beta^T X_i$.

The matrix $\hat\Gamma$ is called the sandwich estimator. The Normal approximation can be used to
construct confidence intervals for $\beta$. For example, $\hat\beta(j) \pm z_{\alpha}\sqrt{\hat\Gamma(j,j)/n}$ is an asymptotic $1-\alpha$
confidence interval for $\beta(j)$. We can also get confidence intervals by using the bootstrap.
See Buja et al (2015) for details.
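A minimal NumPy/SciPy sketch of the sandwich estimator and the resulting intervals (the function name is mine; it follows the $z_{\alpha}$ convention used in the text):

import numpy as np
from scipy.stats import norm

def sandwich_ci(X, Y, alpha=0.05):
    """Asymptotic confidence intervals for beta based on the sandwich estimator."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    beta_hat = np.linalg.solve(Sigma_hat, X.T @ Y / n)
    eps_hat = Y - X @ beta_hat                       # residuals
    M_hat = (X * eps_hat[:, None] ** 2).T @ X / n    # M_hat(j,k) = n^{-1} sum_i X_i(j) X_i(k) eps_i^2
    Sigma_inv = np.linalg.inv(Sigma_hat)
    Gamma_hat = Sigma_inv @ M_hat @ Sigma_inv        # sandwich covariance
    z = norm.ppf(1 - alpha)                          # z_alpha, as written in the notes
    half_width = z * np.sqrt(np.diag(Gamma_hat) / n)
    return beta_hat - half_width, beta_hat + half_width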

2 High Dimensional Linear Regression

Now suppose that d > n. We can no longer use least squares. There are many approaches.

The simplest is to preprocess the data to reduce the dimension. For example, we can perform
PCA on the $X$'s and use the first $k$ principal components where $k < n$. Alternatively, we
can cluster the covariates based on their correlations. We can then use one feature from each
cluster or take the average of the covariates within each cluster. Another approach is to
screen the variables by choosing the $k$ features with the largest correlation with $Y$. After
dimension reduction, we can then use least squares. These preprocessing methods can be very
effective.
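A rough sketch of two of these preprocessing strategies, PCA and correlation screening, each followed by least squares (the helper names and the choice of $k$ are mine; here X holds the non-constant covariates and the intercept is added after the reduction):

import numpy as np

def pca_then_ols(X, Y, k):
    """Project onto the first k principal components, then run least squares."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = np.column_stack([np.ones(len(Y)), Xc @ Vt[:k].T])   # intercept + k component scores
    return np.linalg.lstsq(Z, Y, rcond=None)[0], Vt[:k]

def screen_then_ols(X, Y, k):
    """Keep the k covariates most correlated with Y, then run least squares."""
    corr = np.abs(np.corrcoef(X, Y, rowvar=False)[-1, :-1])
    keep = np.argsort(corr)[-k:]
    Xk = np.column_stack([np.ones(len(Y)), X[:, keep]])
    return np.linalg.lstsq(Xk, Y, rcond=None)[0], keep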

A different approach is to use all the covariates but, instead of least squares, we shrink the
coefficients towards 0. This is called ridge regression and is discussed in the next section.

Yet another approach is model selection where we try to find a good subset of the covariates.
Let $S$ be a subset of $\{1, \ldots, d\}$ and let $X_S = (X(j) : j \in S)$. If the size of $S$ is not too
large, we can regress $Y$ on $X_S$ instead of $X$.

In particular, fix $k < n$ and let $\mathcal{S}_k$ denote all subsets of size $k$. For a given $S \in \mathcal{S}_k$, let $\beta_S$ be
the best linear predictor $\beta_S = \Sigma_S^{-1}\alpha_S$ for the subset $S$. We would like to choose $S \in \mathcal{S}_k$ to
minimize
\[
E(Y - \beta_S^T X_S)^2.
\]
This is equivalent to:
\[
\text{minimize } E(Y - \beta^T X)^2 \ \text{ subject to } \|\beta\|_0 \le k
\]
where $\|\beta\|_0$ is the number of non-zero elements of $\beta$.

There will be a bias-variance tradeoff. As $k$ increases, the bias decreases but the variance
increases.

We can approximate the risk with the training error. But the minimization is over all subsets
of size $k$. This minimization is NP-hard. So best subset regression is infeasible. We can
approximate best subset regression in two different ways: a greedy approximation or a convex
relaxation. The former leads to forward stepwise regression. The latter leads to the lasso.

All these methods involve a tuning parameter which can be chosen by cross-validation.

3 Ridge Regression

In this case we minimize
\[
\frac{1}{n}\sum_i (Y_i - X_i^T\beta)^2 + \lambda\|\beta\|^2
\]
where $\lambda \ge 0$. The minimizer is
\[
\hat\beta = (\hat\Sigma + \lambda I)^{-1}\hat\alpha.
\]
As λ increases, the bias increases and the variance decreases.
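A short NumPy sketch of this closed form (the function name is mine; the sketch penalizes all coefficients, exactly as the displayed formula does):

import numpy as np

def ridge(X, Y, lam):
    """Ridge regression: beta_hat = (Sigma_hat + lambda I)^{-1} alpha_hat."""
    n, d = X.shape
    Sigma_hat = X.T @ X / n
    alpha_hat = X.T @ Y / n
    return np.linalg.solve(Sigma_hat + lam * np.eye(d), alpha_hat)

# As lambda increases the coefficients shrink toward 0 (more bias, less variance).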

Theorem 4 (Hsu, Kakade and Zhang 2014) Suppose that $\|X_i\| \le r$. Let $\beta^T x$ be the
best linear approximation to $m(x)$. Then, with probability at least $1 - 4e^{-t}$,
\[
r(\hat\beta) - r(\beta) \le \left(1 + O\!\left(\frac{1 + r^2/\lambda}{n}\right)\right)\left(\frac{\lambda\|\beta\|^2}{2} + \frac{\sigma^2\,\mathrm{tr}(\Sigma)}{2n\lambda}\right).
\]

Proposition 5 If $Y = X^T\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2)$ and $\beta \sim N(0, \tau^2 I)$, then the posterior mean
is the ridge regression estimator with $\lambda = \sigma^2/\tau^2$.
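A brief sketch of why this holds (my own filling-in of the standard calculation, conditioning on the design):
\[
p(\beta \mid Y, X_1, \ldots, X_n) \propto \exp\!\left(-\frac{1}{2\sigma^2}\sum_i (Y_i - X_i^T\beta)^2 - \frac{1}{2\tau^2}\|\beta\|^2\right).
\]
This is a Gaussian density in $\beta$, so its mean equals its mode, which minimizes
$\sum_i (Y_i - X_i^T\beta)^2 + \frac{\sigma^2}{\tau^2}\|\beta\|^2$, i.e. the ridge criterion with $\lambda = \sigma^2/\tau^2$ (up to whether the sum of
squares is scaled by $1/n$).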

4 Forward Stepwise Regression (Greedy Regression)

Forward stepwise regression is a greedy approximation to best subset regression. In what
follows, we will assume that the features have been standardized to have sample mean 0 and
sample variance $n^{-1}\sum_i X_i(j)^2 = 1$.

Here is the algorithm:

Forward Stepwise Regression

1. Input $k$. Let $S = \emptyset$.
2. Let $r_j = n^{-1}\sum_i Y_i X_i(j)$ denote the correlation between $Y$ and the $j$th feature.
   Let $J = \mathrm{argmax}_j |r_j|$. Let $S = S \cup \{J\}$.
3. Compute the regression of $Y$ on $X_S = (X(j) : j \in S)$. Compute the residuals
   $e = (e_1, \ldots, e_n)$ where $e_i = Y_i - \hat\beta_S^T X_i$.
4. Compute the correlations $r_j$ between the residuals $e$ and the remaining features.
5. Let $J = \mathrm{argmax}_j |r_j|$. Let $S = S \cup \{J\}$.
6. Repeat steps 3-5 until $|S| = k$.
7. Output S.
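A minimal NumPy implementation of the algorithm above (a sketch under the standardization assumption stated earlier; the function name and return value are my own):

import numpy as np

def forward_stepwise(X, Y, k):
    """Greedy forward stepwise regression: returns the selected index set S."""
    n, d = X.shape
    S = []
    residuals = Y.copy()
    for _ in range(k):
        remaining = [j for j in range(d) if j not in S]
        # correlation (inner product) of each remaining feature with the residuals
        r = np.array([residuals @ X[:, j] / n for j in remaining])
        J = remaining[int(np.argmax(np.abs(r)))]
        S.append(J)
        # regress Y on the selected features and recompute the residuals
        beta_S = np.linalg.lstsq(X[:, S], Y, rcond=None)[0]
        residuals = Y - X[:, S] @ beta_S
    return S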

Now we will discuss the theory of forward stepwise regression. We will use the following
notation. If $\beta = (\beta_1, \beta_2, \ldots)$ is a sequence of real numbers then the $\ell_p$ norm is
\[
\|\beta\|_p = \left(\sum_j |\beta_j|^p\right)^{1/p}.
\]
Let
\[
|\beta_{(1)}| \ge |\beta_{(2)}| \ge |\beta_{(3)}| \ge \cdots
\]
denote the values of $\beta_j$ ordered by their decreasing absolute values. The weak $\ell_p$ norm of $\beta$,
denoted by $\|\beta\|_{w,p}$, is the smallest $C$ such that
\[
|\beta_{(j)}| \le \frac{C}{j^{1/p}}, \qquad j = 1, 2, \ldots
\]

We start by studying greedy approximations when there is no stochastic error. Let $H$ be
a Hilbert space of functions. Let $D$ be a dictionary, which is any set of functions from
$H$ whose linear span is $H$. The elements of $D$ are called atoms. We will assume that the
atoms are normalized, that is, $\|f\| = 1$ for all $f \in D$. We will also assume that $D$ is finite or
countable so we can write $D = \{\psi_1, \psi_2, \ldots\}$. The goal is to approximate some $f \in H$ with
linear combinations of atoms from the dictionary.

Let $\Sigma_N$ denote all linear combinations of elements of $D$ with at most $N$ terms. Define the
best $N$-term approximation error
\[
\sigma_N(f) = \inf_{|\Lambda| \le N}\ \inf_{g \in \mathrm{Span}(\Lambda)} \|f - g\| \qquad (1)
\]
where $\Lambda$ denotes a subset of $D$ and $\mathrm{Span}(\Lambda)$ is the set of linear combinations of functions in
$\Lambda$.

A dictionary $D$ is an orthonormal basis for $H$ if

(i) $\|\psi\| = 1$ for all $\psi \in D$,

(ii) $\langle\psi, \phi\rangle = 0$ for all $\psi, \phi \in D$ such that $\psi \ne \phi$, and

(iii) $\langle h, \psi\rangle = 0$ for all $\psi \in D$ implies that $h = 0$.

In this case, any $f \in H$ can be written as $f = \sum_j \beta_j\psi_j$ where $\beta_j = \langle f, \psi_j\rangle$. The $L_p$ ball is
defined by
\[
L_p(C) = \left\{ f = \sum_j \beta_j\psi_j : \|\beta\|_p \le C \right\}. \qquad (2)
\]

The weak $L_p$ ball is defined as
\[
L_{w,p}(C) = \left\{ f = \sum_j \beta_j\psi_j : \|\beta\|_{w,p} \le C \right\}. \qquad (3)
\]
It can be shown that $L_p(C) \subset L_{w,p}(C)$.

Theorem 6 If $f \in L_{w,p}(C)$ with $0 < p < 2$, then
\[
\sigma_N = O\left(\frac{1}{N^s}\right)
\]
where $s = \frac{1}{p} - \frac{1}{2}$. The best $N$-term approximation is $\sum_{j\in\Lambda}\beta_j\psi_j$ where $\Lambda$ corresponds to
$\beta_{(1)}, \ldots, \beta_{(N)}$.

Homework: Prove the theorem above.

Homework: Suppose that $H$ is the set of all functions $f : [0,1] \to \mathbb{R}$ of the form $f = \sum_j \beta_j\psi_j$
where $\psi_1, \psi_2, \ldots$ is an orthonormal basis and such that $\sum_j \beta_j^2 j^{2p} \le C^2$. Let $f_N = \sum_{j=1}^N \beta_j\psi_j$.
Find $\|f - f_N\|^2$.

Now we drop the assumption that $D$ is an orthonormal basis. A function may then have
more than one expansion of the form $f = \sum_j \beta_j\psi_j$. We define the norm
\[
\|f\|_{L_p} = \inf \|\beta\|_p
\]
where the infimum is over all expansions of $f$. The $L_p$ ball is defined by
\[
L_p(C) = \left\{ f : \|f\|_{L_p} \le C \right\}.
\]
1. Input: $f$.

2. Initialize: $r_0 = f$, $f_0 = 0$, $V = \emptyset$.

3. Repeat: At step $N$ define
   \[
   g_N = \mathrm{argmax}_{\psi\in D} |\langle r_{N-1}, \psi\rangle|
   \]
   and set $V_N = V_{N-1} \cup \{g_N\}$. Let $f_N$ be the projection of $r_{N-1}$ onto $\mathrm{Span}(V_N)$.
   Let $r_N = f - f_N$.

Figure 1: The Orthogonal Greedy Algorithm.

We now describe a functional version of stepwise regression known as the Orthogonal
Greedy Algorithm (OGA), also known as Orthogonal Matching Pursuit. The algorithm
is given in Figure 1.

The algorithm produces a series of approximations $f_N$ with corresponding residuals $r_N$. We
have the following two theorems from Barron et al (2008), the first dating back to DeVore
and Temlyakov (1996).

Theorem 7 For all $f \in L_1$, the residual $r_N$ after $N$ steps of OGA satisfies
\[
\|r_N\| \le \frac{\|f\|_{L_1}}{\sqrt{N+1}} \qquad (4)
\]
for all $N \ge 1$.

Proof. Note that $f_N$ is the best approximation to $f$ from $\mathrm{Span}(V_N)$. On the other hand, the
best approximation from the set $\{a g_N : a \in \mathbb{R}\}$ is $\langle f, g_N\rangle g_N$. The error of the former must be
smaller than the error of the latter. In other words, $\|f - f_N\|^2 \le \|f - f_{N-1} - \langle r_{N-1}, g_N\rangle g_N\|^2$.
Thus,
\begin{align*}
\|r_N\|^2 &\le \|r_{N-1} - \langle r_{N-1}, g_N\rangle g_N\|^2 \\
&= \|r_{N-1}\|^2 + |\langle r_{N-1}, g_N\rangle|^2\underbrace{\|g_N\|^2}_{=1} - 2|\langle r_{N-1}, g_N\rangle|^2 \\
&= \|r_{N-1}\|^2 - |\langle r_{N-1}, g_N\rangle|^2. \qquad (5)
\end{align*}

Now, $f = f_{N-1} + r_{N-1}$ and $\langle f_{N-1}, r_{N-1}\rangle = 0$. So,
\begin{align*}
\|r_{N-1}\|^2 &= \langle r_{N-1}, r_{N-1}\rangle = \langle r_{N-1}, f - f_{N-1}\rangle = \langle r_{N-1}, f\rangle - \underbrace{\langle r_{N-1}, f_{N-1}\rangle}_{=0} \\
&= \langle r_{N-1}, f\rangle = \sum_j \beta_j\langle r_{N-1}, \psi_j\rangle \le \sup_{\psi\in D}|\langle r_{N-1}, \psi\rangle| \sum_j |\beta_j| \\
&= \sup_{\psi\in D}|\langle r_{N-1}, \psi\rangle|\,\|f\|_{L_1} = |\langle r_{N-1}, g_N\rangle|\,\|f\|_{L_1}.
\end{align*}

Continuing from equation (5), we have
\begin{align*}
\|r_N\|^2 &\le \|r_{N-1}\|^2 - |\langle r_{N-1}, g_N\rangle|^2 = \|r_{N-1}\|^2\left(1 - \frac{\|r_{N-1}\|^2|\langle r_{N-1}, g_N\rangle|^2}{\|r_{N-1}\|^4}\right) \\
&\le \|r_{N-1}\|^2\left(1 - \frac{\|r_{N-1}\|^2|\langle r_{N-1}, g_N\rangle|^2}{|\langle r_{N-1}, g_N\rangle|^2\|f\|_{L_1}^2}\right) = \|r_{N-1}\|^2\left(1 - \frac{\|r_{N-1}\|^2}{\|f\|_{L_1}^2}\right).
\end{align*}
If $a_0 \ge a_1 \ge a_2 \ge \cdots$ are nonnegative numbers such that $a_0 \le M$ and $a_N \le a_{N-1}(1 -
a_{N-1}/M)$ then it follows from induction that $a_N \le M/(N+1)$. The result follows by setting
$a_N = \|r_N\|^2$ and $M = \|f\|_{L_1}^2$. $\square$
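For completeness, here is one way to carry out the induction step that the proof invokes (my own filling-in). Assuming $a_{N-1} \le M/N$,
\[
a_N \le a_{N-1}\left(1 - \frac{a_{N-1}}{M}\right) \le \frac{M}{N}\left(1 - \frac{1}{N}\right) = \frac{M(N-1)}{N^2} \le \frac{M}{N+1},
\]
using that $x(1 - x/M)$ is increasing for $x \le M/2$ (and is at most $M/4$ in any case, which handles $N = 1$) and that $(N-1)(N+1) = N^2 - 1 \le N^2$.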

If $f$ is not in $L_1$, it is still possible to bound the error as follows.

Theorem 8 For all $f \in H$ and $h \in L_1$,
\[
\|r_N\|^2 \le \|f - h\|^2 + \frac{4\|h\|_{L_1}^2}{N}. \qquad (6)
\]
P P
Proof. Choose any $h \in L_1$ and write $h = \sum_j \beta_j\psi_j$ where $\|h\|_{L_1} = \sum_j |\beta_j|$. Write $f =
f_{N-1} + f - f_{N-1} = f_{N-1} + r_{N-1}$ and note that $r_{N-1}$ is orthogonal to $f_{N-1}$. Hence, $\|r_{N-1}\|^2 =
\langle r_{N-1}, f\rangle$ and so
\begin{align*}
\|r_{N-1}\|^2 &= \langle r_{N-1}, f\rangle = \langle r_{N-1}, h + f - h\rangle = \langle r_{N-1}, h\rangle + \langle r_{N-1}, f - h\rangle \\
&\le \langle r_{N-1}, h\rangle + \|r_{N-1}\|\,\|f - h\| \\
&= \sum_j \beta_j\langle r_{N-1}, \psi_j\rangle + \|r_{N-1}\|\,\|f - h\| \\
&\le \sum_j |\beta_j|\,|\langle r_{N-1}, \psi_j\rangle| + \|r_{N-1}\|\,\|f - h\| \\
&\le \max_j |\langle r_{N-1}, \psi_j\rangle| \sum_j |\beta_j| + \|r_{N-1}\|\,\|f - h\| \\
&= |\langle r_{N-1}, g_N\rangle|\,\|h\|_{L_1} + \|r_{N-1}\|\,\|f - h\| \\
&\le |\langle r_{N-1}, g_N\rangle|\,\|h\|_{L_1} + \frac{1}{2}\big(\|r_{N-1}\|^2 + \|f - h\|^2\big).
\end{align*}

1. Input: $Y \in \mathbb{R}^n$.

2. Initialize: $r_0 = Y$, $\hat f_0 = 0$, $V = \emptyset$.

3. Repeat: At step $N$ define
   \[
   g_N = \mathrm{argmax}_{\psi\in D} |\langle r_{N-1}, \psi\rangle_n|
   \]
   where $\langle a, b\rangle_n = n^{-1}\sum_{i=1}^n a_i b_i$. Set $V_N = V_{N-1} \cup \{g_N\}$. Let $f_N$ be the projection
   of $r_{N-1}$ onto $\mathrm{Span}(V_N)$. Let $r_N = Y - f_N$.

Figure 2: The Greedy (Forward Stepwise) Regression Algorithm: Dictionary Version

Hence,
\[
|\langle r_{N-1}, g_N\rangle|^2 \ge \frac{(\|r_{N-1}\|^2 - \|f - h\|^2)^2}{4\|h\|_{L_1}^2}.
\]
Thus,
\[
a_N \le a_{N-1}\left(1 - \frac{a_{N-1}}{4\|h\|_{L_1}^2}\right)
\]
where $a_N = \|r_N\|^2 - \|f - h\|^2$. By induction, the last displayed inequality implies that
$a_N \le 4\|h\|_{L_1}^2/N$ and the result follows. $\square$

Corollary 9 For each $N$,
\[
\|r_N\|^2 \le \frac{4\theta_N^2}{N} + \sigma_N^2
\]
where $\theta_N$ is the $L_1$ norm of the best $N$-atom approximation.

In Figure 2 we re-express forward stepwise regression in a form closer to the notation we
have been using. In this version, we have a finite dictionary $D_n$ and a data vector $Y =
(Y_1, \ldots, Y_n)^T$ and we use the empirical norm defined by
\[
\|h\|_n = \sqrt{\frac{1}{n}\sum_{i=1}^n h^2(X_i)}.
\]
We assume that the dictionary is normalized in this empirical norm.

By combining the previous results with concentration of measure arguments (see appendix
for details) we get the following result, due to Barron, Cohen, Dahmen and DeVore (2008).

Theorem 10 Let $h_n = \mathrm{argmin}_{h\in F_N}\|f_0 - h\|^2$. Suppose that $\limsup_{n\to\infty}\|h_n\|_{L_{1,n}} < \infty$. Let
$N \sim \sqrt{n}$. Then, for every $\gamma > 0$, there exists $C > 0$ such that
\[
\|f - \hat f_N\|^2 \le 4\sigma_N^2 + \frac{C\log n}{n^{1/2}}
\]
except on a set of probability $n^{-\gamma}$.

Let us compare this with the lasso, which we will discuss next. Let $f_L = \sum_j \beta_j\psi_j$ minimize
$\|f - f_L\|^2$ subject to $\|\beta\|_1 \le L$. Then, we will see that
\[
\|f - \hat f_L\|^2 \le \|f - f_L\|^2 + O_P\left(\sqrt{\frac{\log n}{n}}\right)
\]
which is the same rate.

The $n^{-1/2}$ rate is in fact optimal. It might be surprising that the rate is independent of the
dimension. Why do you think this is the case?

4.1 The Lasso

The lasso approximates best subset regression by using a convex relaxation. In particular,
the norm $\|\beta\|_0$ is replaced with $\|\beta\|_1 = \sum_j |\beta_j|$.

The lasso estimator $\hat\beta$ is defined as the minimizer of
\[
\sum_i (Y_i - \beta^T X_i)^2 + \lambda\|\beta\|_1.
\]

This is a convex problem so the estimator can be found efficiently. The estimator is sparse:
for large enough $\lambda$, many of the components of $\hat\beta$ are 0. This is proved in the course on
convex optimization. Now we discuss some theoretical properties of the lasso.$^1$
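As an illustration, a minimal sketch using scikit-learn's Lasso (which minimizes a $1/(2n)$-scaled version of this objective, so its alpha corresponds to $\lambda$ only up to that rescaling; the simulated data and variable names are mine):

import numpy as np
from sklearn.linear_model import Lasso

# Simulated sparse high-dimensional example (d > n).
rng = np.random.default_rng(0)
n, d = 50, 200
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.0, 1.5]
Y = X @ beta_true + rng.normal(size=n)

lam = 10.0
# sklearn minimizes (1/(2n))||Y - X b||^2 + alpha ||b||_1, so rescale lambda accordingly.
fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X, Y)
beta_hat = fit.coef_   # many entries are exactly 0 for large enough lambda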

The following result was proved in Zhao and Yu (2006), Meinshausen and Bühlmann (2005)
and Wainwright (2006). The version we state is from Wainwright (2006). Let $\beta = (\beta_1, \ldots, \beta_s, 0, \ldots, 0)$
and decompose the design matrix as $\mathbf{X} = (\mathbf{X}_S\ \mathbf{X}_{S^c})$ where $S = \{1, \ldots, s\}$. Let $\beta_S =
(\beta_1, \ldots, \beta_s)$.

Theorem 11 (Sparsistency) Suppose that:

1. The true model is linear.

2. The design matrix satisfies
   \[
   \|\mathbf{X}_{S^c}^T\mathbf{X}_S(\mathbf{X}_S^T\mathbf{X}_S)^{-1}\|_\infty \le 1 - \epsilon \ \text{ for some } 0 < \epsilon \le 1. \qquad (7)
   \]

3. $\phi_n(d_n) > 0$.

4. The $\epsilon_i$ are Normal.

5. $\lambda_n$ satisfies
   \[
   \frac{n\lambda_n^2}{\log(d_n - s_n)} \to \infty
   \]
   and
   \[
   \frac{1}{\min_{1\le j\le s_n}|\beta_j|}\left(\sqrt{\frac{\log s_n}{n}} + \lambda_n\left\|\Big(\frac{1}{n}\mathbf{X}^T\mathbf{X}\Big)^{-1}\right\|_\infty\right) \to 0. \qquad (8)
   \]

Then the lasso is sparsistent, meaning that $P(\mathrm{support}(\hat\beta) = \mathrm{support}(\beta)) \to 1$ where $\mathrm{support}(\beta) =
\{j : \beta(j) \ne 0\}$.

The conditions of this theorem are very strong. They are not checkable and they are unlikely
to ever be true in practice.

$^1$The norm $\|\beta\|_1$ can be thought of as a measure of sparsity. For example, the vectors $x =
(1/\sqrt{d}, \ldots, 1/\sqrt{d})$ and $y = (1, 0, \ldots, 0)$ have the same $L_2$ norm. But $\|y\|_1 = 1 < \|x\|_1 = \sqrt{d}$.

Theorem 12 (Consistency: Meinshausen and Yu 2006) Assume that

1. The true regression function is linear.

2. The columns of $\mathbf{X}$ have norm $n$ and the covariates are bounded.

3. $E(\exp|\epsilon_i|) < \infty$ and $E(\epsilon_i^2) = \sigma^2 < \infty$.

4. $E(Y_i^2) \le \sigma_y^2 < \infty$.

5. $0 < \phi_n(k_n) \le \Phi_n(k_n) < \infty$ for $k_n = \min\{n, d_n\}$.

6. $\liminf_{n\to\infty}\phi_n(s_n\log n) > 0$ where $s_n = \|\beta_n\|_0$.

Then
\[
\|\beta_n - \hat\beta_n\|^2 = O_P\left(\frac{\log n}{n}\,\frac{s_n\log n}{\phi_n^2(s_n\log n)}\right) + O\left(\frac{1}{\log n}\right). \qquad (9)
\]
If
\[
s_n\log d_n\left(\frac{\log n}{n}\right) \to 0 \qquad (10)
\]
and
\[
\lambda_n = \sqrt{\frac{\sigma_y^2\,\Phi_n(\min\{n, d_n\})\, n^2}{s_n\log n}} \qquad (11)
\]
then $\|\hat\beta_n - \beta_n\|^2 \stackrel{P}{\to} 0$.

Once again, the conditions of this theorem are very strong. They are not checkable and they
are unlikely to ever be true in practice.

The next theorem is the most important one. It does not require unrealistic conditions. We
state the theorem for bounded covariates. A more general version appears in Greenshtein
and Ritov (2004).

Theorem 13 Let $Z = (Y, X)$. Assume that $|Y| \le B$ and $\max_j |X(j)| \le B$. Let
\[
\beta_* = \mathrm{argmin}_{\|\beta\|_1 \le L}\ r(\beta)
\]
where $r(\beta) = E(Y - \beta^T X)^2$. Thus, $x^T\beta_*$ is the best, sparse linear predictor (in the $L_1$ sense).
Let $\hat\beta$ be the lasso estimator:
\[
\hat\beta = \mathrm{argmin}_{\|\beta\|_1 \le L}\ \hat r(\beta)
\]
where $\hat r(\beta) = n^{-1}\sum_{i=1}^n (Y_i - X_i^T\beta)^2$. With probability at least $1 - \delta$,
\[
r(\hat\beta) \le r(\beta_*) + \sqrt{\frac{16(L+1)^4 B^2}{n}\log\left(\frac{\sqrt{2}\, d}{\sqrt{\delta}}\right)}.
\]

Proof. Let $Z = (Y, X)$ and $Z_i = (Y_i, X_i)$. Define $\gamma \equiv \gamma(\beta) = (-1, \beta)$. Then
\[
r(\beta) = E(Y - \beta^T X)^2 = \gamma^T\Lambda\gamma
\]
where $\Lambda = E[ZZ^T]$. Note that $\|\gamma\|_1 = \|\beta\|_1 + 1$. Let $B = \{\beta : \|\beta\|_1 \le L\}$. The training
error is
\[
\hat r(\beta) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2 = \gamma^T\hat\Lambda\gamma
\]
where $\hat\Lambda = \frac{1}{n}\sum_{i=1}^n Z_i Z_i^T$. For any $\beta \in B$,
\begin{align*}
|\hat r(\beta) - r(\beta)| &= |\gamma^T(\hat\Lambda - \Lambda)\gamma| \\
&\le \sum_{j,k} |\gamma(j)|\,|\gamma(k)|\,|\hat\Lambda(j,k) - \Lambda(j,k)| \le \|\gamma\|_1^2\,\Delta_n \\
&\le (L+1)^2\Delta_n
\end{align*}
where
\[
\Delta_n = \max_{j,k}|\hat\Lambda(j,k) - \Lambda(j,k)|.
\]
So,
\[
r(\hat\beta) \le \hat r(\hat\beta) + (L+1)^2\Delta_n \le \hat r(\beta_*) + (L+1)^2\Delta_n \le r(\beta_*) + 2(L+1)^2\Delta_n.
\]
Note that $|Z(j)Z(k)| \le B^2 < \infty$. By Hoeffding's inequality,
\[
P(\Delta_n(j,k) \ge \epsilon) \le 2e^{-n\epsilon^2/(2B^2)}
\]
and so, by the union bound,
\[
P(\Delta_n \ge \epsilon) \le 2d^2 e^{-n\epsilon^2/(2B^2)} = \delta
\]
if we choose $\epsilon = \sqrt{(4B^2/n)\log\big(\sqrt{2}\,d/\sqrt{\delta}\big)}$. Hence,
\[
r(\hat\beta) \le r(\beta_*) + \sqrt{\frac{16(L+1)^4 B^2}{n}\log\left(\frac{\sqrt{2}\, d}{\sqrt{\delta}}\right)}
\]
with probability at least $1 - \delta$. $\square$
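As a quick check of that choice of $\epsilon$ (my own arithmetic, spelling out the step the proof compresses): setting $2d^2 e^{-n\epsilon^2/(2B^2)} = \delta$ and solving gives
\[
\frac{n\epsilon^2}{2B^2} = \log\frac{2d^2}{\delta}
\quad\Longrightarrow\quad
\epsilon^2 = \frac{2B^2}{n}\log\frac{2d^2}{\delta} = \frac{4B^2}{n}\log\left(\frac{\sqrt{2}\,d}{\sqrt{\delta}}\right),
\]
and then $2(L+1)^2\epsilon$ is exactly the square-root term in the statement of Theorem 13.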

Problems With Sparsity. Sparse estimators are convenient and popular but they can have
some problems. Say that $\hat\beta$ is weakly sparsistent if, for every $\beta$,
\[
P_\beta\Big(I(\hat\beta_j \ne 0) \le I(\beta_j \ne 0) \ \text{for all } j\Big) \to 1 \qquad (12)
\]
as $n \to \infty$. In particular, if $\hat\beta_n$ is sparsistent, then it is weakly sparsistent. Suppose that $d$
is fixed. Then the least squares estimator $\hat\beta_n$ is minimax and satisfies
\[
\sup_\beta E_\beta(n\|\hat\beta_n - \beta\|^2) = O(1). \qquad (13)
\]
But sparsistent estimators have much larger risk:

Theorem 14 (Leeb and Pötscher (2007)) Suppose that the following conditions hold:

1. $d$ is fixed.

2. The covariates are nonstochastic and $n^{-1}\mathbf{X}^T\mathbf{X} \to Q$ for some positive definite matrix
$Q$.

3. The errors $\epsilon_i$ are independent with mean 0, finite variance $\sigma^2$ and have a density $f$
satisfying
\[
0 < \int\left(\frac{f'(x)}{f(x)}\right)^2 f(x)\, dx < \infty.
\]
If $\hat\beta$ is weakly sparsistent then
\[
\sup_\beta E_\beta(n\|\hat\beta_n - \beta\|^2) \to \infty. \qquad (14)
\]
More generally, if $\ell$ is any nonnegative loss function then
\[
\sup_\beta E_\beta\big(\ell(n^{1/2}(\hat\beta_n - \beta))\big) \to \sup_s \ell(s). \qquad (15)
\]

Proof. Choose any $s \in \mathbb{R}^d$ and let $\beta_n = -s/\sqrt{n}$. Then,
\begin{align*}
\sup_\beta E_\beta\big(\ell(n^{1/2}(\hat\beta - \beta))\big) &\ge E_{\beta_n}\big(\ell(n^{1/2}(\hat\beta - \beta_n))\big) \ge E_{\beta_n}\big(\ell(n^{1/2}(\hat\beta - \beta_n))\, I(\hat\beta = 0)\big) \\
&= \ell(-\sqrt{n}\,\beta_n)\, P_{\beta_n}(\hat\beta = 0) = \ell(s)\, P_{\beta_n}(\hat\beta = 0).
\end{align*}
Now, $P_0(\hat\beta = 0) \to 1$ by assumption. It can be shown that we also have $P_{\beta_n}(\hat\beta = 0) \to 1$.$^2$
Hence, with probability tending to 1,
\[
\sup_\beta E_\beta\big(\ell(n^{1/2}(\hat\beta - \beta))\big) \ge \ell(s).
\]
Since $s$ was arbitrary the result follows. $\square$

It follows that, if $R_n$ denotes the minimax risk, then
\[
\sup_\beta \frac{R(\hat\beta_n)}{R_n} \to \infty.
\]

The implication is that when d is much smaller than n, sparse estimators have poor behavior.
However, when dn is increasing and dn > n, the least squares estimator no longer satisfies
(13). Thus we can no longer say that some other estimator outperforms the sparse estimator.
In summary, sparse estimators are well-suited for high-dimensional problems but not for low
dimensional problems.

5 Inference?

Is it possible to do inference after model selection? Do we need to? I’ll discuss this in class.
$^2$This follows from a property called contiguity.

References

Buja, Berk, Brown, George, Pitkin, Traskin, Zhao and Zhang (2015). Models as Approximations
— A Conspiracy of Random Regressors and Model Deviations Against Classical
Inference in Regression. Statistical Science.

Hsu, Kakade and Zhang (2014). Random design analysis and ridge regression. arXiv:1106.2363.

Gyorfi, Kohler, Krzyzak and Walk (2002). A Distribution-Free Theory of Nonparametric
Regression. Springer.

Appendix: L2 Boosting
Define estimators $\hat m_n^{(0)}, \ldots, \hat m_n^{(k)}, \ldots$ as follows. Let $\hat m^{(0)}(x) = 0$ and then iterate the follow-
ing steps:

1. Compute the residuals $U_i = Y_i - \hat m^{(k)}(X_i)$.

2. Regress the residuals on the covariates: $\hat\beta_j = \sum_i U_i X_{ij}/\sum_i X_{ij}^2$, $j = 1, \ldots, d$.

3. Find $J = \mathrm{argmin}_j \mathrm{RSS}_j$ where $\mathrm{RSS}_j = \sum_i (U_i - \hat\beta_j X_{ij})^2$.

4. Set $\hat m^{(k+1)}(x) = \hat m^{(k)}(x) + \hat\beta_J x_J$.

The version above is called L2 boosting or matching pursuit. A variation is to set
$\hat m^{(k+1)}(x) = \hat m^{(k)}(x) + \nu\hat\beta_J x_J$ where $0 < \nu \le 1$. Another variation is to set $\hat m^{(k+1)}(x) =
\hat m^{(k)}(x) + \nu\,\mathrm{sign}(\hat\beta_J) x_J$, which is called forward stagewise regression. Yet another variation
is to set $\hat m^{(k)}$ to be the linear regression estimator based on all variables selected up to that
point. This is forward stepwise regression or orthogonal matching pursuit.
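A minimal NumPy sketch of the L2 boosting / matching pursuit iteration described above (the function name, the step count, and the optional shrinkage factor nu are mine):

import numpy as np

def l2_boost(X, Y, n_steps, nu=1.0):
    """L2 boosting (matching pursuit): repeatedly fit one coordinate to the residuals."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_steps):
        U = Y - X @ beta                                    # current residuals
        b = (X.T @ U) / np.sum(X ** 2, axis=0)              # beta_hat_j for every coordinate j
        rss = np.array([np.sum((U - b[j] * X[:, j]) ** 2) for j in range(d)])
        J = int(np.argmin(rss))                             # coordinate with smallest RSS
        beta[J] += nu * b[J]                                # update m_hat by nu * beta_hat_J * x_J
    return beta                                             # m_hat(x) = x^T beta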

Theorem 15 The matching pursuit estimator is linear. In particular,
\[
\hat Y^{(k)} = B_k Y \qquad (16)
\]
where $\hat Y^{(k)} = (\hat m^{(k)}(X_1), \ldots, \hat m^{(k)}(X_n))^T$,
\[
B_k = I - (I - H_k)(I - H_{k-1})\cdots(I - H_1), \qquad (17)
\]
and
\[
H_j = \frac{X_j X_j^T}{\|X_j\|^2}. \qquad (18)
\]

Theorem 16 (Bühlmann 2005) Let $m_n(x) = \sum_{j=1}^{d_n}\beta_{j,n} x_j$ be the best linear approxima-
tion based on $d_n$ terms. Suppose that:

(A1 Growth) $d_n \le C_0 e^{C_1 n^{1-\xi}}$ for some $C_0, C_1 > 0$ and some $0 < \xi \le 1$.

(A2 Sparsity) $\sup_n \sum_{j=1}^{d_n} |\beta_{j,n}| < \infty$.

(A3 Bounded Covariates) $\sup_n \max_{1\le j\le d_n}\max_i |X_{ij}| < \infty$ with probability 1.

(A4 Moments) $E|\epsilon|^s < \infty$ for some $s > 4/\xi$.

Then there exists $k_n \to \infty$ such that
\[
E_X|\hat m_n(X) - m_n(X)|^2 \to 0 \qquad (19)
\]
as $n \to \infty$.

We won't prove the theorem but we will outline the idea. Let $H$ be a Hilbert space with
inner product $\langle f, g\rangle = \int f(x)g(x)\, dP(x)$. Let $D$ be a dictionary, that is, a set of functions,
each of unit norm, that span $H$. Define a functional version of matching pursuit, known as
the weak greedy algorithm, as follows. Let $R_0(f) = f$, $F_0 = 0$. At step $k$, find $g_k \in D$ so
that
\[
|\langle R_{k-1}(f), g_k\rangle| \ge t_k \sup_{h\in D}|\langle R_{k-1}(f), h\rangle|
\]
for some $0 < t_k \le 1$. In the weak greedy algorithm we take $F_k = F_{k-1} + \langle f, g_k\rangle g_k$. In the weak
orthogonal greedy algorithm we take $F_k$ to be the projection of $R_{k-1}(f)$ onto $\{g_1, \ldots, g_k\}$.
Finally set $R_k(f) = f - F_k$.

Theorem 17 (Temlyakov 2000) Let $f(x) = \sum_j \beta_j g_j(x)$ where $g_j \in D$ and $\sum_{j=1}^\infty |\beta_j| \le
B < \infty$. Then, for the weak orthogonal greedy algorithm
\[
\|R_k(f)\| \le \frac{B}{\left(1 + \sum_{j=1}^k t_j^2\right)^{1/2}} \qquad (20)
\]
and for the weak greedy algorithm
\[
\|R_k(f)\| \le \frac{B}{\left(1 + \sum_{j=1}^k t_j^2\right)^{t_k/(2(2+t_k))}}. \qquad (21)
\]

L2 boosting essentially replaces $\langle f, X_j\rangle$ with $\langle Y, X_j\rangle_n = n^{-1}\sum_i Y_i X_{ij}$. Now $\langle Y, X_j\rangle_n$ has
mean $\langle f, X_j\rangle$. The main burden of the proof is to show that $\langle Y, X_j\rangle_n$ is close to $\langle f, X_j\rangle$ with
high probability and then apply Temlyakov's result. For this we use Bernstein's inequality.
Recall that if $|Z_j|$ are bounded by $M$ and $Z_j$ has variance $\sigma^2$ then
\[
P(|\bar Z - E(Z_j)| > \epsilon) \le 2\exp\left(-\frac{1}{2}\,\frac{n\epsilon^2}{\sigma^2 + M\epsilon/3}\right). \qquad (22)
\]
Hence, the probability that any empirical inner products differ from their functional coun-
terparts is no more than
\[
d_n^2\exp\left(-\frac{1}{2}\,\frac{n\epsilon^2}{\sigma^2 + M\epsilon/3}\right) \to 0 \qquad (23)
\]
because of the growth condition.

Appendix: Proof of Theorem 10

The $L_1$ norm depends on $n$ and so we denote this by $\|h\|_{L_{1,n}}$. For technical reasons, we
assume that $\|f\|_\infty \le B$, that $\hat f_n$ is truncated to be no more than $B$ and that $\|\psi\|_\infty \le B$ for
all $\psi \in D_n$.

Theorem 18 Suppose that $p_n \equiv |D_n| \le n^c$ for some $c \ge 0$. Let $\hat f_N$ be the output of
the stepwise regression algorithm after $N$ steps. Let $f(x) = E(Y|X = x)$ denote the true
regression function. Then, for every $h \in D_n$,
\[
P\left(\|f - \hat f_N\|^2 > 4\|f - h\|^2 + \frac{8\|h\|_{L_{1,n}}^2}{N} + \frac{CN\log n}{n}\right) < \frac{1}{n^\gamma}
\]
for some positive constants $\gamma$ and $C$.

Before proving this theorem, we need some preliminary results. For any $\Lambda \subset D$, let $S_\Lambda =
\mathrm{Span}(\Lambda)$. Define
\[
F_N = \bigcup\Big\{ S_\Lambda : |\Lambda| \le N \Big\}.
\]
Recall that, if $F$ is a set of functions then $N_p(\epsilon, F, \nu)$ is the $L_p$ covering entropy with respect
to the probability measure $\nu$ and $N_p(\epsilon, F)$ is the supremum of $N_p(\epsilon, F, \nu)$ over all probability
measures $\nu$.

Lemma 19 For every $t > 0$, and every $\Lambda \subset D_n$,
\[
N_1(t, S_\Lambda) \le 3\left(\frac{2eB}{t}\log\frac{3eB}{t}\right)^{|\Lambda|+1}, \qquad
N_2(t, S_\Lambda) \le 3\left(\frac{2eB^2}{t^2}\log\frac{3eB^2}{t^2}\right)^{|\Lambda|+1}.
\]
Also,
\[
N_1(t, F_N) \le 12p^N\left(\frac{2eB}{t}\log\frac{3eB}{t}\right)^{N+1}, \qquad
N_2(t, F_N) \le 12p^N\left(\frac{2eB^2}{t^2}\log\frac{3eB^2}{t^2}\right)^{N+1}.
\]

Proof. The first two equations follow from standard covering arguments. The second two
equations follow from the fact that the number of subsets of $D_n$ of size at most $N$ is
\[
\sum_{j=1}^N \binom{p}{j} \le \sum_{j=1}^N \left(\frac{ep}{j}\right)^j \le N\max_j\left(\frac{ep}{j}\right)^j \le N\left(\frac{ep}{N}\right)^N \le 4p^N.
\]

The following lemma is from Chapter 11 of Gyorfi et al. The proof is long and technical and
we omit it.

Lemma 20 Suppose that $|Y| \le B$, where $B \ge 1$, and $F$ is a set of real-valued functions
such that $\|f\|_\infty \le B$ for all $f \in F$. Let $f_0(x) = E(Y|X = x)$ and $\|g\|^2 = \int g^2(x)\, dP(x)$.
Then, for every $\alpha, \beta > 0$ and $\epsilon \in (0, 1/2]$,
\begin{align*}
&P\Big((1-\epsilon)\|f - f_0\|^2 \ge \|Y - f\|_n^2 - \|Y - f_0\|_n^2 + (\alpha + \beta) \ \text{for some } f \in F\Big) \\
&\qquad \le 14\, N_1\!\left(\frac{\beta}{20B}, F\right)\exp\left(-\frac{\epsilon^2(1-\epsilon)\alpha n}{214(1+\epsilon)B^4}\right).
\end{align*}

Proof of Theorem 18. For any $h \in F_n$ we have
\begin{align*}
\|\hat f - f_0\|^2 &= \underbrace{\|\hat f - f_0\|^2 - 2\Big(\|Y - \hat f\|_n^2 - \|Y - f_0\|_n^2\Big)}_{A_1} \\
&\quad + \underbrace{2\Big(\|Y - \hat f\|_n^2 - \|Y - h\|_n^2\Big)}_{A_2} + \underbrace{2\Big(\|Y - h\|_n^2 - \|Y - f_0\|_n^2\Big)}_{A_3}.
\end{align*}
Apply Lemma 20 with $\epsilon = 1/2$ together with Lemma 19 to conclude that, for $C_0 > 0$ large
enough,
\[
P\left(A_1 > \frac{C_0 N\log n}{n} \ \text{for some } f\right) < \frac{1}{n^\gamma}.
\]
To bound $A_2$, apply Theorem 8 with norm $\|\cdot\|_n$ and with $Y$ replacing $f$. Then,
\[
\|Y - \hat f\|_n^2 \le \|Y - h\|_n^2 + \frac{4\|h\|_{L_{1,n}}^2}{N}
\]
and hence $A_2 \le \frac{8\|h\|_{L_{1,n}}^2}{N}$. Next, we have that
\[
E(A_3) = \|f_0 - h\|^2
\]
and for large enough $C_1$,
\[
P\left(A_3 > \|f_0 - h\|^2 + \frac{C_1 N\log n}{n} \ \text{for some } f\right) < \frac{1}{n^\gamma}.
\]
