0% found this document useful (0 votes)

4 views

Linear Regression

Uploaded by

idhitappu

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4 views

Linear Regression

Uploaded by

idhitappu

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 19

Linear Regression

We observe D = {(X1 , Y1 ), . . . , (Xn , Yn )} where Xi = (Xi (1), . . . , Xi (d)) ∈ Rd and Yi ∈ R.

For notational simplicity, we will always assume that Xi (1) = 1.

Given a new pair (X, Y ) we want to predict Y from X. The conditional prediction risk is
Z
2 2
b = E[(Y − m(X))
r(m) b |D] = (y − m(x))
b dP (x, y)

and the prediction risk of m

b is
2
b = E(Y − m(X))
R(m) b = E[r(m)].
b

The true regression function is

m(x) = E[Y |X = x].

We have the following decomposition:

Z Z
2 2
R(m)
b = σ + bn (x)dP (x) + vn (x)dP (x)

where
σ 2 = E[Y − m(X)]2 , bn (x) = E[m(x)]
b − m(x), vn (x) = Var(m(x)).
b

We do not assume that m(x) is linear. Let = Y − m(X). Note that

E[] = E[Y − m(X)] = E[E[Y − m(X) | X]] = 0.

A linear predictor has the form g(x) = β T x. The best linear predictor minimzes E(Y −β T X)2 .
The minimizer is β = Σ−1 α where Σ = E[XX T ] and α = E(Y X). We will use linear
predictors; but we should never assume that m(x) is linear.

1 Low Dimensional Linear Regression

Let Σ = E[XX T ]. We assume that Σ is non-singular. Let v1 , . . . , vd be the eigenvectors of

Σ and let λ1 , . . . , λd be the corresponding eigenvectors.
P
Homework: We can write β = j aj vj where

E[hX, vj iY ]
aj = .
E[hX, Xi2 ]

1
From this we can show that, for any vector b,

E[hX, biY ] = E[hX, bi hX, βi]

and
E[(Y − hb, Xi)2 ] − E[(Y − hβ, Xi)2 ] = E[hb − β, Xi2 ] = (b − β)T Σ(b − β).

The training error is

1X
(Yi − XiT β)2
n i
which is minimized by
b −1 α
βb = Σ b
b = n−1 n Xi X T and α
where Σ
P
b = n−1 ni=1 Yi Xi .
P
i=1 i

The proof is straightforward but is very long. The strategy is to first bound n−1 i (βbT Xi −
P
m(Xi ))2 using
P the properties of least squares. Then, using concentration of measure one can
relate n−1 i f 2 (Xi ) to f 2 (x)dP (x).
R

Theorem 2 (Hsu, Kakade and Zhang 2014) Let m(x) = E[Y |X = x] and = Y −
m(X). Suppose there exists σ ≥ 0 such that
2 σ 2 /2
E[et |X = x] ≤ et

for all x and all t ∈ R. Let β T x be the best linear apprximation to m(x). With probability
at least 1 − 3e−t ,
√
2A √ 2 σ 2 (d + 2 dt + 2t)
b − r(β) ≤
r(β) (1 + 8t) + + o(1/n)
n n
where A = E[||Σ−1/2 X(m(X) − β T X)||2 ].

We have the following central limit theorem for β.

2
Theorem 3 We have √
n(βb − β) N (0, Γ)
where
Γ = Σ−1 E[(Y − X T β)2 XX T ]Σ−1
The covariance matrix Γ can be consistently estimated by
Γ b −1 M
b=Σ cΣb −1

where n
c(j, k) = 1
X
M 2i
Xi (j)Xi (k)b
n i=1

i = Yi − βbT Xi .
and b

The matrix Γb is called the sandwich estimator. The Normal approximation can be used to
q
construct confidence intervals for β. For example, β(j)±z
b α Γ(j,
b j)/n is an asymptotic 1−α
confidence interval for β(j). We can also get confidence intervals by using the bootstrap.
See Buja et al (2015) for details.

2 High Dimensional Linear Regression

Now suppose that d > n. We can no longer use least squares. There are many approaches.

The simplest is to preprocess the data to reduce the dimension. For example, we can perform
PCA on the X 0 s and use the first k principal components where k < n. Alternatively, we
can cluster the covariates based on their correlations. We can the use one feature from each
cluster or take the average of the covariates within each cluster. Another approach is to
screen the variables by choosing the k features with the largest correlation with Y . After
dimension reduction, we can the use least squares. These preprocessing methods can be very
effective.

A different approach is to use all the covariates but, instead of least squares, we shrink the
coefficients towards 0. This is called ridge regression and is discussed in the next section.

Yet another approach is model selection where we try to find a good subset of the covariates.
Let S be a subset of {1, . . . , d} and let XS = (X(j) : j ∈ S). If the size of S is not too
large, we can regress Y on XS instead of S.

In particular, fix k < n and let Sk denote all subsets of size k. For a given S ∈ Sk , let βS be
the best linear predictor βS = Σ−1S αS for the subset S. We would like to choose S ∈ Sk to
minimize
E(Y − βST XS )2 .

3
This is equivalent to:

minimize E(Y − β T X)2 subject to ||β||0 ≤ k

where ||β||0 is the number of non-zero elements of β.

There will be a bias-variance tradeoff. As k increases, the bias decreases but the varaince
increases.

We can approximate the risk with the training error. But the minimization is over all subsets
of size k. This minimization is NP-hard. So best subset regression is infeasible. We can
approximate best subset regression in two different ways: a greedy approxmation or a convex
relaxation. The former leads to forward stepwise regression. The latter leads to the lasso.

All these methods involve a tuning parameter which can be chosen by cross-validation.

3 Ridge Regression

In this case we minimize

1X
(Yi − XiT β)2 + λ||β||2
n i
where λ ≥ 0. The minimizer is
b + λI)−1 α
βb = (Σ b.
As λ increases, the bias increases and the variance decreases.

Theorem 4 (Hsu, Kakade and Zhang 2014) Suppose that ||Xi || ≤ r. Let β T x be the
best linear apprximation to m(x). Then, with probability at least 1 − 4e−t ,
!!
r2
b − r(β) ≤ 1 + O 1 + λ λ||β||2 σ 2 tr(Σ)
r(β) + .
n 2 n 2λ

Proposition 5 If Y = X T β + , ∼ N (0, σ 2 ) and β ∼ N (0, τ 2 I). Then the posterior mean

is the ridge regression estimator with λ = σ 2 /τ 2 .

4 Forward Stepwise Regression (Greedy Regression)

Forward stepwise regression is a greedy approximation to best subset regression. In what

follows, we will assume
P that the features have been standardized to have sample mean 0 and
−1 2
sample variance n i Xi (j) = 1.

4
Here is the algorithm:

Forward Stepwise Regression

1. Input k. Let S = ∅.
2. Let rj = n−1 i Yi Xi (j) denote the corrleation between Y and the j th feature.
P
Let J = argmaxj |rj |. Let S = S ∪ {J}.
3. Compute the regression of Y on XS = (X(j) : j ∈ S). Compute the residuals
e = (e1 , . . . , en ) where ei = Yi − βbST Xi .
4. Compute the correlations rj between the residuals e and the remaining features.
5. Let J = argmaxj |rj |. Let S = S ∪ {J}.
6. Repeat steps 3-5 until |S| = k.
7. Output S.

Now we will discuss the theory of forward stepwise regression. We will use the following
notation. If β = (β1 , β2 , . . .) is a sequence of real numbers then the `p norm is
!1/p
X
kβkp = |βj |p .
j

Let
|β(1) | ≥ |β(2) | ≥ |β(3) | ≥ · · ·
denote the values of βj ordered by their decreasing absolute values. The weak `p norm of β,
denoted by kβkw,p , is the smallest C such that
C
|β(j) | ≤ , j = 1, 2, . . .
j 1/p

We start by studying greedy approximations when there is no stochastic error. Let H be

a Hilbert space of functions. Let D be a dictionary which is any set of functions from
H whose linear span is H. The elements of D are called atoms. We will assume that the
atoms are normalized, that is, kf k = 1 for all f ∈ D. We will also assume that D is finite or
countable so we can write D = {ψ1 , ψ2 , . . .}. The goal is to approximate some f ∈ H with
linear combinations of atoms from the dictionary.

Let ΣN denote all linear combinations of elements of D with at most N terms. Define the
best N -term approximation error
σN (f ) = inf inf kf − gk (1)
|Λ|≤N g∈Span(Λ)

where Λ denotes a subset of D and Span(Λ) is the set of linear combinations of functions in
Λ.

5
A dictionary D is an orthonormal basis for H if

(i) kψk = 1 for all ψ ∈ D,

(ii) hψ, φi = 0 for all ψ, φ ∈ D such that ψ 6= φ and

(iii) hh, ψi = 0 for all ψ ∈ D implies that h = 0.

P
In this case, any f ∈ H can be written as f = j βj ψj where βj = hf, ψj i. The Lp ball is
defined by ( )
X
Lp (C) = f = βj ψj : kβkp ≤ C . (2)
j

The weak Lp ball is defined as

( )
X
Lw,p (C) = f= βj ψj : kβkw,p ≤ C . (3)
j

It can be shown that Lp (C) ⊂ Lw,p (C).

Theorem 6 If f ∈ Lw,p (C) with 0 < p < 2, then

1
σN = O
Ns

where s = p1 − 12 . The best N -term approximation is

P
j∈Λ βj fj where Λ corresponds to
β(1) , . . . , β(N ) .

Homework: Prove the theorem above.

P
Homework: Suppose that H is the set of all functions f : [0, 1] → R of the form f = j βj ψj
where ψ1 , ψ2 , . . . , is an orthonormal basis and such that j βj2 j 2p ≤ C 2 . Let fN = N
P P
j=1 βj ψj .
2
Find ||f − fN || .

Now we drop the assumption that D is anPorthonormal basis. A function may then have
more than one expansion of the form f = j βj ψj . We define the norm

kf kLp = inf kβkp

where the infimum is over all expansions of f . The Lp ball is defined by

( )
Lp (C) = f : kf kLp ≤ C .

6
1. Input: f .

2. Initialize: r0 = f , f0 = 0, V = ∅.

3. Repeat: At step N define

gN = argmaxψ∈D |hrN −1 , ψi|

and set VN = VN −1 ∪ {gN }. Let fN be the projection of rN −1 onto Span(VN ).

Let rN = f − fN .

Figure 1: The Orthogonal Greedy Algorithm.

We now describe a functional version of stepwise regression known as the Orthogonal

Greedy Algorithm (OGA), also known as Orthogonal Matching Pursuit. The algorithm
is given in Figure 1.

The algorithm produces a series of approximations fN with corresponding residuals rN . We

have the following two theorems from Barron et al (2008), the first dating back to DeVore
and Temlyakov (1996).

Theorem 7 For all f ∈ L1 , the residual rN after N steps of OGA satsifies

kf kL1
krN k ≤ √ (4)
N +1
for all N ≥ 1.

Proof. Note that fN is the best approximation to f from Span(VN ). On the other hand, the
best approximation from the set {a gN : a ∈ R} is hf, gN igN . The error of the former must be
smaller than the error of the latter. In other words, ||f −fN ||2 ≤ ||f −fN −1 −hrN −1 , gN igN ||2 .
Thus,

krN k2 ≤ krN −1 − hrN −1 , gN igN k2

= krN −1 k2 + |hrN −1 , gN i|2 kgN k2 −2|hrN −1 , gN i|2
| {z }
=1
2 2
= krN −1 k − |hrN −1 , gN i| . (5)

7
Now, f = fN −1 + rN −1 and hfN −1 , rN −1 i = 0. So,

krN −1 k2 = hrN −1 , rN −1 i = hrN −1 , f − fN −1 i = hrN −1 , f i − hrN −1 , fN −1 i

| {z }
=0
X X
= hrN −1 , f i = βj hrN −1 , ψj i ≤ sup |hrN −1 , ψi| |βj |
j ψ∈D j
= sup |hrN −1 , ψi| kf kL1 = |hrN −1 , gN i| kf kL1 .
ψ∈D

Continuing from equation (5), we have

krN −1 k2 |hrN −1 , gN i|2

If a0 ≥ a1 ≥ a2 ≥ · · · are nonnegative numbers such that a0 ≤ M and aN ≤ aN −1 (1 −

aN −1 /M ) then it follows from induction that aN ≤ M/(N + 1). The result follows by setting
aN = krN k2 and M = kf k2L1 .

If f is not in L1 , it is still possible to bound the error as follows.

Theorem 8 For all f ∈ H and h ∈ L1 ,

4khk2L1
krN k2 ≤ kf − hk2 + . (6)
N
P P
Proof. Choose any h ∈ L1 and write h = j βj ψj where khkL1 = j |βj |. Write f =
fN −1 + f − fN −1 = fN −1 + rN −1 and note that rN −1 is orthogonal to fN −1 . Hence, krN −1 k2 =
hrN −1 , f i and so

krN −1 k2 = hrN −1 , f i = hrN −1 , h + f − hi = hrN −1 , hi + hrN −1 , f − hi

8
1. Input: Y ∈ Rn .

2. Initialize: r0 = Y , fb0 = 0, V = ∅.

3. Repeat: At step N define

gN = argmaxψ∈D |hrN −1 , ψin |

where ha, bin = n−1 ni=1 ai bi . Set VN = VN −1 ∪ {gN }. Let fN be the projection
P
of rN −1 onto Span(VN ). Let rN = Y − fN .

Figure 2: The Greedy (Forward Stepwise) Regression Algorithm: Dictionary Version

Hence,
(krN −1 k2 − kf − hk2 )2
|hrN −1 , gk i|2 ≥ .
4khk2L1
Thus,
aN −1
aN ≤ aN −1 1 −
4khk2L1
where aN = krN k2 − kf − hk2 . By induction, the last displayed inequality implies that
aN ≤ 4khk2L1 /k and the result follows.

Corollary 9 For each N ,

2
2 4θN2
krN k ≤ + σN
N
where θN is the L1 norm of the best N -atom approximation.

In Figure 2 we re-express forward stepwise regression in a form closer to the notation we

have been using. In this version, we have a finite dictionary Dn and a data vector Y =
(Y1 , . . . , Yn )T and we use the empirical norm defined by
v
u n
u1 X
khkn = t h2 (Xi ).
n i=1

We assume that the dictionary is normalized in this empirical norm.

By combining the previous results with concentration of measure arguments (see appendix
for details) we get the following result, due to Barron, Cohen, Dahmen and DeVore (2008).

9
Theorem
√ 10 Let hn = argminh∈FN kf0 − hk2 . Suppose that lim supn→∞ khn kL1,n < ∞. Let
N ∼ n. Then, for every γ > 0, there exist C > 0 such that
C log n
kf − fbN k2 ≤ 4σN
2
+
n1/2
except on a set of probability n−γ .

P
Let us compare this with the lasso which we will discuss next. Let fL = j βj ψj minimize
kf − fL k2 subject to kβk1 ≤ L. Then, we will see that
1/2
2 2 log n
kf − fbL k ≤ kf − fL k + OP
n
which is the same rate.

The n−1/2 is in fact optimal. It might be surprising that the rate is independent of the
dimension. Why do you think this is the case?

4.1 The Lasso

The lasso approximates best subset regression

P by using a convex relaxation. In particular,
the norm ||β||0 is replaced with ||β||1 = j |βj |.

The lasso estimator βb is defined as the minimizer of

X
(Yi − β T Xi )2 + λ||β||1 .
i

This is a convex problem so the estimator can be found efficiently. The estimator is sparse:
for large enough λ, many of the components of βb are 0. This is proved in the course on
convex optimization. Now we discuss some theoretical properties of the lasso.1

The following result was proved in Zhao and Yu (2006), Meinshausen and Bühlmann (2005)
and Wainwright (2006). The version we state is from Wainwright (2006). Let β = (β1 , . . . , βs , 0, . . . , 0)
and decompose the design matrix as X = (XS XS c ) where S = {1, . . . , s}. Let βS =
(β1 , . . . , βs ).

Theorem 11 (Sparsistency) Suppose that:

1. The true model is linear.

1
√The norm√ ||β||1 can be thought of as a measure of sparsity. For example, √ the vectors x =
(1/ d, . . . , 1/ d) and y = (1, 0, . . . , 1) have the same L2 norm. But ||y||1 = 1 < ||x||1 = d.

10
2. The design matrix satisfies

kXS c XS (XTS XS )−1 k∞ ≤ 1 − for some 0 < ≤ 1. (7)

3. φn (dn ) > 0.

4. The i are Normal.

5. λn satisfies
nλ2n
→∞
log(dn − sn )
and r −1 !
1 log sn 1 T
+ λn X X → 0. (8)
min1≤j≤sn |βj | n n
∞

b = support(β)) → 1 where support(β) =

Then the lasso is sparsistent, meaning that P (support(β)
{j : β(j) 6= 0.

The conditions of this theorem are very strong. They are not checkable and they are unlikely
to ever be true in practice.

Theorem 12 (Consistency: Meinshausen and Yu 2006) Assume that

1. The true regression function is linear.

2. The columns of X have norm n and the covariates are bounded.

3. E(exp |i |) < ∞ and E(2i ) = σ 2 < ∞.

4. E(Yi2 ) ≤ σy2 < ∞.

5. 0 < φn (kn ) ≤ Φn (kn ) < ∞ for kn = min{n, dn }.

6. lim inf n→∞ φn (sn log n) > 0 where sn = kβn k0 .

Then
log n sn log n 1
kβn − βbn k2 = OP +O (9)
n φ2n (sn log n) log n
If
log n
sn log dn →0 (10)
n

11
and s
σy2 Φn (min n, dn )n2
λn = (11)
sn log n
P
then kβbn − βn k2 → 0.

Once again, the conditions of this theorem are very strong. They are not checkable and they
are unlikely to ever be true in practice.

The next theorem is the most important one. It does not require unrealistic conditions. We
state the theorem for bounded covariates. A more general version appears in Greenshtein
and Ritov (2004).

Theorem 13 Let Z = (Y, X). Assume that |Y | ≤ B and maxj |X(j)| ≤ B. Let

β∗ = argmin r(β)
||β||1 ≤L

where r(β) = E(Y − β T X)2 . Thus, cT β∗ is the best, sparse linear predictor (in the L1 sense).
Let βb be the lasso estimator:
βb = argmin rb(β)
||β||1 ≤L
−1
Pn
where rb(β) = n i=1 (Yi− XiT β)2 . With probabilty at least 1 − δ,
√ !
v
u
u 16(L + 1)4 B 2 2d
b ≤ r(β∗ ) + t
r(β) log √ .
n δ

Proof. Let Z = (Y, X) and Zi = (Yi , Xi ). Define γ ≡ γ(β) = (−1, β). Then

r(β) = E(Y − β T X)2 = γ T Λγ

where Λ = E[ZZ T ]. Note that ||γ||1 = ||β||1 + 1. Let B = {β : ||β||1 ≤ L}. The training
error is n
1X
rb(β) = (Yi − XiT β)2 = γ T Λγ
b
n i=1

where Λ b = 1 Pn Zi Z T . For any β ∈ B,

n i=1 i

r(β) − r(β)| = |γ T (Λ
|b b − Λ)γ|
X
≤ b k) − Λ(j, k)| ≤ ||γ||2 δn
|γ(j)| |γ(k)| |Λ(j, 1
j,k

≤ (L + 1)2 ∆n

12
where
∆n = max |Λ(j,
b k) − Λ(j, k)|.
j,k

So,
b + (L + 1)2 ∆n ≤ rb(β∗ ) + (L + 1)2 ∆n ≤ r(β∗ ) + 2(L + 1)2 ∆n .
b ≤ rb(β)
r(β)
Note that |Z(j)Z(k)| ≤ B 2 < ∞. By Hoeffding’s inequality,
2 /(2B 2 )
P(∆n (j, k) ≥ ) ≤ 2e−n

and so, by the union bound,

2 /(2B 2 )
P(∆n ≥ ) ≤ 2d2 e−n =δ
r √
if we choose = (4B 2 /n) log √2δd . Hence,

√ !
v
u
4 2
b ≤ r(β∗ ) + t 16(L + 1) B log 2d
u
r(β) √ .
n δ

with probabiloty at least 1 − δ.

Problems With Sparsity. Sparse estimators are convenient and popualr but they can
some problems. Say that βb is weakly sparsistent if, for every β,

Pβ I(βbj = 1) ≤ I(βj = 1) for all j → 1 (12)

as n → ∞. In particular, if βbn is sparsistent, then it is weakly sparsistent. Suppose that d

is fixed. Then the least squares estimator βbn is minimax and satisfies

sup Eβ (nkβbn − βk2 ) = O(1). (13)

But sparsistent estimators have much larger risk:

Theorem 14 (Leeb and Pötscher (2007)) Suppose that the following condiitons hold:

1. d is fixed.

2. The covariariates are nonstochastic and n−1 XT X → Q for some positive definite matrix
Q.

3. The errors i are independent with mean 0, finite variance σ 2 and have a density f
satisfying
Z 0 2
f (x)
0< f (x)dx < ∞.
f (x)

13
If βb is weakly sparsistent then

sup Eβ (nkβbn − βk2 ) → ∞. (14)

More generally, if ` is any nonnegative loss function then

sup Eβ (`(n1/2 (βbn − β))) → sup `(s). (15)

β s

√
Proof. Choose any s ∈ Rd and let βn = −s/ n. Then,

sup Eβ (`(n1/2 (βb − β)) ≥ Eβn (`(n1/2 (βb − β)) ≥ Eβn (`(n1/2 (βb − β))I(βb = 0))
β
√
= `(− nβn )Pβn (βb = 0) = `(s)Pβn (βb = 0).

Now, P0 (βb = 0) → 1 by assumption. It can be shown that we also have Pβn (βb = 0) → 1.2
Hence, with probability tending to 1,

sup Eβ (`(n1/2 (βb − β)) ≥ `(s).

Since s was arbitrary the result follows.

It follows that, if Rn denotes the minimax risk then

R(βbn )
sup → ∞.
β Rn

The implication is that when d is much smaller than n, sparse estimators have poor behavior.
However, when dn is increasing and dn > n, the least squares estimator no longer satisfies
(13). Thus we can no longer say that some other estimator outperforms the sparse estimator.
In summary, sparse estimators are well-suited for high-dimensional problems but not for low
dimensional problems.

5 Inference?

Is it possible to do inference after model selection? Do we need to? I’ll discuss this in class.
2
This follows from a property called contiguity.

14
References

Buja, Berk, Brown, George, Pitkin, Traskin, Zhao and Zhang (2015). Models as Apprx-
imations — A Conspiracy of Random Regressors and Model Deviations Against Classical
Inference in Regression. Statistical Science.

Hsu, Kakade and Zhang (2014). Random design analysis and ridge regression. arXiv:1106.2363.

Gyorfi, Kohler, Krzyzak and Walk. (2002). A Distribution-Free Theory of Nonparametric

Regression. Springer.

Appendix: L2 Boosting
(0) (k)
Define estimators m
bn ,...,m b (0) (x) = 0 and then iterate the follow-
b n , . . . , as follows. Let m
ing steps:

1. Compute the residuals Ui = Yi − m b (k) (Xi ).

2. Regress the residuals on the Yi ’s: βbj = i Ui Xij / i Xij2 , j = 1, . . . , d.

P P

− βbJ XiJ )2 .
P
3. Find J = argminj RSSj where RSSj = i (Ui

b (k+1) (x) = m
4. Set m b (k) (x) + βbJ xJ .

The version above is called L2 boosting or matching pursuit. A variation is to set

mb (k+1) (x) = m b (k) (x) + ν βbJ xJ where 0 < ν ≤ 1. Another variation is to set m b (k+1) (x) =
mb (k) (x)+νsign(βbJ )xJ which is called forward stagewise regression. Yet another variation
is to set mb (k) to be the linear regression estimator based on all variables selected up to that
point. This is forward stepwise regression or orthogonal matching pursuit.

Theorem 15 The matching pursuit estimator is linear. In particular,

Yb (k) = Bk Y (16)

where Yb (k) = (m
b (k) (X1 ), . . . , m
b (k) (Xn ))T ,

Bk = I − (I − Hk )(I − Hk−1 ) · · · (I − H1 ), (17)

and
Xj XTj
Hj = . (18)
kXj k2

15
Pdn
Theorem 16 (Bühlmann 2005) Let mn (x) = j=1 βj,n xj be the best linear approxima-
tion based on dn terms. Suppose that:
1−ξ
(A1 Growth) dn ≤ C0 eC1 n for some C0 , C1 > 0 and some 0 < ξ ≤ 1.

(A2 Sparsity) supn dj=1

Pn
|βj,n | < ∞.

(A3 Bounded Covariates) supn max1≤j≤dn maxi |Xij | < ∞ with probability 1.

(A4 Moments) E||s < ∞ for some s > 4/ξ.

Then there exists kn → ∞ such that

b n (X) − mn (x)|2 → 0
EX |m (19)

as n → 0.

We won’t prove the theorem

R but we will outline the idea. Let H be a Hilbert space with
inner product hf, gi = f (x)g(x)dP (x). Let D be a dictionary, that is a set of functions,
each of unit norm, that span H. Define a functional version of matching pursuit, known as
the weak greedy algorithm, as follows. Let R0 (f ) = f , F0 = 0. At step k, find gk ∈ D so
that
|hRk−1 (f ), gk i| ≥ tk sup |hRk−1 (f ), hi|
h∈D

for some 0 < tk ≤ 1. In the weak greedy algorithm we take Fk = Fk−1 +hf, gk igk . In the weak
orthogonal greedy algorithm we take Fk to be the projection of Rk−1 (f ) onto {g1 , . . . , gk }.
Finally set Rk (f ) = f − Fk .

Theorem 17 (Temlyakov 2000) Let f (x) = j βj gj (x) where gj ∈ D and ∞

P P
j=1 |βj | ≤
B < ∞. Then, for the weak orthogonal greedy algorithm
B
kRk (f )k ≤ 1/2 (20)
Pk 2
1+ j=1 tj

and for the weak greedy algorithm

B
kRk (f )k ≤ tk /(2(2+tk )) . (21)
Pk 2
1+ j=1 tj

L2 boosting essentially replaces hf, Xj i with hY, Xj in = n−1 i Yi Xij . Now hY, Xj in has
P
mean hf, Xj i. The main burden of the proof is to show that hY, Xj in is close to hf, Xj i with

16
high probability and then apply Temlyakov’s result. For this we use Bernstein’s inequality.
Recall that if |Zj | are bounded by M and Zj has variance σ 2 then

n2

1
P(|Z − E(Zj )| > ) ≤ 2 exp − 2 . (22)
2 σ + M /3
Hence, the probability that any empirical inner products differ from their functional coun-
terparts is no more than
n2

2 1
dn exp − 2 →0 (23)
2 σ + M /3
because of the growth condition.

Appendix: Proof of Theorem 10

The L1 norm depends on n and so we denote this by khkL1,n . For technical reasons, we
assume that kf k∞ ≤ B, that fbn is truncated to be no more than B and that kψk∞ ≤ B for
all ψ ∈ Dn .

Theorem 18 Suppose that pn ≡ |D|n ≤ nc for some c ≥ 0. Let fbN be the output of
the stepwise regression algorithm after N steps. Let f (x) = E(Y |X = x) denote the true
regression function. Then, for every h ∈ Dn ,
!
2
8khkL CN log n 1
P kf − fbN k2 > 4kf − hk2 + 1,n
+ < γ
N n n

for some positive constants γ and C.

Before proving this theorem, we need some preliminary results. For any Λ ⊂ D, let SΛ =
Span(Λ). Define ( )
[
FN = SΛ : |Λ| ≤ N .

Recall that, if F is a set of functions then Np (, F, ν) is the Lp covering entropy with respect
to the probability measure ν and Np (, F) is the supremum of Np (, F, ν) over all probability
measures ν.

Lemma 19 For every t > 0, and every Λ ⊂ Dn ,

|Λ|+1 |Λ|+1
2eB 2 3eB 2

2eB 3eB
N1 (t, SΛ ) ≤ 3 log , N2 (t, SΛ ) ≤ 3 log .
t t t2 t2

17
Also,
N +1 N +1
2eB 2 3eB 2

N 2eB 3eB N
N1 (t, FN ) ≤ 12p log , N2 (t, FN ) ≤ 12p log .
t t t2 t2

Proof. The first two equation follow from standard covering arguments. The second two
equations follow from the fact that the number of subsets of Λ of size at most N is
N X N j
X p ep ep N
N
p N
≤ ≤N ≤ p max N ≤ 4pN .
j j N N N
j=1 j=1

The following lemma is from Chapter 11 of Gyorfi et al. The proof is long and technical and
we omit it.

Lemma 20 Suppose that |Y | ≤ B, where B ≥ 1, and F is a set of real-valued

R functions
such that kf k∞ ≤ B for all f ∈ F. Let f0 (x) = E(Y |X = x) and kgk2 = g 2 (x)dP (x).
Then, for every α, β > 0 and ∈ (0, 1/2],
!
P (1 − )kf − f0 k2 ≥ kY − f k2n − kY − f0 k2n + (α + β) for some f ∈ F

2 (1 − )αn

β
≤ 14N1 ,F exp − .
20B 214(1 + )B 4

Proof of Theorem 18. For any h ∈ Fn we have

!
kfb − f0 k2n = kfb − f0 k2 − 2 kY − fbk2n − kY − f0 k2n
| {z }
A1
! !
+ 2 kY − fbk2n − kY − hk2n + 2 kY − hk2n − kY − f0 k2n .
| {z } | {z }
A2 A3

Apply Lemma 20 with = 1/2 together with Lemma 19 to conclude that, for C0 > 0 large
enough,
C0 N log n 1
P A1 > for some f < γ .
n n
To bound A2 , apply Theorem 8 with norm k · kn and with Y replacing f . Then,
4khk21,n
kY − fbk2n ≤ kY − hk2n +
k
18
8khk21,n
and hence A2 ≤ k
. Next, we have that

E(A3 ) = kf0 − hk2

and for large enough C1 ,

C1 N log n 1
P A3 > kf0 − hk2 + for some f < .
n nγ

Advanced Econometrics PDF
No ratings yet
Advanced Econometrics PDF
58 pages
36-708 Statistical Machine Learning Homework #4 Solutions: DUE: April 19, 2019
No ratings yet
36-708 Statistical Machine Learning Homework #4 Solutions: DUE: April 19, 2019
16 pages
HW 03 Sol
No ratings yet
HW 03 Sol
9 pages
Homework For Day 1
17% (12)
Homework For Day 1
4 pages
Linear Regression: 1 1 N N I I I D I I
No ratings yet
Linear Regression: 1 1 N N I I I D I I
20 pages
Hdnotes 2021
No ratings yet
Hdnotes 2021
31 pages
Local Linear Regression For Functional Data: Alain Berlinet, Abdallah Elamine, André Mas Université Montpellier 2
No ratings yet
Local Linear Regression For Functional Data: Alain Berlinet, Abdallah Elamine, André Mas Université Montpellier 2
23 pages
Week6
No ratings yet
Week6
34 pages
Intro&NP Stat
No ratings yet
Intro&NP Stat
122 pages
HW1 Solutions
No ratings yet
HW1 Solutions
3 pages
Practice 1130
No ratings yet
Practice 1130
20 pages
Weather Wax Hastie Solutions Manual
No ratings yet
Weather Wax Hastie Solutions Manual
18 pages
4 Linear Regression Additional Notes
No ratings yet
4 Linear Regression Additional Notes
8 pages
Inference in Linear Regression Models With Many Covariates and Heteroskedasticity
No ratings yet
Inference in Linear Regression Models With Many Covariates and Heteroskedasticity
47 pages
Notes MSM
No ratings yet
Notes MSM
66 pages
Chapter 6: Regression
No ratings yet
Chapter 6: Regression
7 pages
Kernel Ridge Regression: Max Welling
No ratings yet
Kernel Ridge Regression: Max Welling
3 pages
CH 11-Regression
No ratings yet
CH 11-Regression
52 pages
05 Regression Least Squares
No ratings yet
05 Regression Least Squares
5 pages
Linear Stochastic Models: 5.1 Least Squares
No ratings yet
Linear Stochastic Models: 5.1 Least Squares
12 pages
14-AOS1221
No ratings yet
14-AOS1221
37 pages
Linear Model and Extensions
No ratings yet
Linear Model and Extensions
400 pages
Functional Local Linear Relative Regression: Abdelkader Chahad Ali Laksaci Ait-Hennani Larbi
No ratings yet
Functional Local Linear Relative Regression: Abdelkader Chahad Ali Laksaci Ait-Hennani Larbi
7 pages
Final_Exam_Solution_20220201
No ratings yet
Final_Exam_Solution_20220201
14 pages
Technometrics
No ratings yet
Technometrics
14 pages
Stat 378
No ratings yet
Stat 378
73 pages
Ecom 165 Notes
No ratings yet
Ecom 165 Notes
98 pages
Multiple Linear Reegression
No ratings yet
Multiple Linear Reegression
21 pages
RigNotes15 PDF
No ratings yet
RigNotes15 PDF
130 pages
Kondor Regression
No ratings yet
Kondor Regression
4 pages
note8 (1)
No ratings yet
note8 (1)
12 pages
Cours 2 MVA
No ratings yet
Cours 2 MVA
5 pages
MathModel_Lecture 8 1
No ratings yet
MathModel_Lecture 8 1
8 pages
Machine learning
No ratings yet
Machine learning
19 pages
Stat 353 Study Guide
No ratings yet
Stat 353 Study Guide
44 pages
Design and Analysis of Computer Experiments: Theory: 1 Density Estimation
No ratings yet
Design and Analysis of Computer Experiments: Theory: 1 Density Estimation
9 pages
Chapter2 Annotated Part2
No ratings yet
Chapter2 Annotated Part2
30 pages
Representer Function
No ratings yet
Representer Function
12 pages
Econometrics: Problem Set 2: Professor: Mauricio Sarrias
No ratings yet
Econometrics: Problem Set 2: Professor: Mauricio Sarrias
10 pages
COL774 Practice Problems
No ratings yet
COL774 Practice Problems
22 pages
Regression Analysis
100% (1)
Regression Analysis
280 pages
Regression Interpolation
No ratings yet
Regression Interpolation
34 pages
Mathematical model
No ratings yet
Mathematical model
34 pages
Non Parametric Prediction
No ratings yet
Non Parametric Prediction
16 pages
Notes Part 2 PDF
No ratings yet
Notes Part 2 PDF
63 pages
Industrial Mathematics Institute: Research Report
No ratings yet
Industrial Mathematics Institute: Research Report
25 pages
Lecture Notes on High Dimensional Linear Regression
No ratings yet
Lecture Notes on High Dimensional Linear Regression
73 pages
Chap7
No ratings yet
Chap7
7 pages
Ec2 1
No ratings yet
Ec2 1
11 pages
Online Learning Lecture Notes 2011 Oct 20
No ratings yet
Online Learning Lecture Notes 2011 Oct 20
125 pages
4 - Multiple Linear Regressions
No ratings yet
4 - Multiple Linear Regressions
61 pages
Wk05 machine learning
No ratings yet
Wk05 machine learning
6 pages
Estimation of Time-Varying Par in STAT Models - Bertsimas Et - Al. (1999) - PUB
No ratings yet
Estimation of Time-Varying Par in STAT Models - Bertsimas Et - Al. (1999) - PUB
21 pages
Regress
No ratings yet
Regress
11 pages
Raghu Meka notes
No ratings yet
Raghu Meka notes
7 pages
Cheat Sheet 1
No ratings yet
Cheat Sheet 1
2 pages
8. Linear Regression
No ratings yet
8. Linear Regression
29 pages
FECO Note 2 - Simple Linear Regression: Xuan Chinh Mai
No ratings yet
FECO Note 2 - Simple Linear Regression: Xuan Chinh Mai
7 pages
Long-Memory Time Series: Theory and Methods
From Everand
Long-Memory Time Series: Theory and Methods
Wilfredo Palma
No ratings yet
Differential Forms
From Everand
Differential Forms
Henri Cartan
5/5 (2)
Group Theory I Essentials
From Everand
Group Theory I Essentials
Emil Milewski
No ratings yet
02 The Noisy Channel Model of Spelling 19-30
No ratings yet
02 The Noisy Channel Model of Spelling 19-30
12 pages
05 Smoothing - Add-One 6-30
No ratings yet
05 Smoothing - Add-One 6-30
3 pages
03 Real-Word Spelling Correction 9-19
No ratings yet
03 Real-Word Spelling Correction 9-19
4 pages
08 Kneser-Ney Smoothing 8-59
No ratings yet
08 Kneser-Ney Smoothing 8-59
3 pages
02 Regular Expressions in Practical NLP 6-04
No ratings yet
02 Regular Expressions in Practical NLP 6-04
3 pages
05 Sentence Segmentation 5-31
No ratings yet
05 Sentence Segmentation 5-31
3 pages
Tcs NQT - 2024 Recorded Classes Schedule
No ratings yet
Tcs NQT - 2024 Recorded Classes Schedule
3 pages
Managerial Economics in A Global Economy: Bab 5 Peramalan Permintaan (Demand Forecasting
No ratings yet
Managerial Economics in A Global Economy: Bab 5 Peramalan Permintaan (Demand Forecasting
14 pages
12th Maths EM Unit Test 1 Model Question Paper English Medium PDF Download
No ratings yet
12th Maths EM Unit Test 1 Model Question Paper English Medium PDF Download
2 pages
Trees, Binary Trees, Expression Trees
No ratings yet
Trees, Binary Trees, Expression Trees
67 pages
Laplace of Gaussian
No ratings yet
Laplace of Gaussian
7 pages
Question Bank
100% (1)
Question Bank
7 pages
Discrete Cumulative - Probability Distribution
No ratings yet
Discrete Cumulative - Probability Distribution
50 pages
Python Week 5 - 6 GrPA (Made by Unknown)
No ratings yet
Python Week 5 - 6 GrPA (Made by Unknown)
4 pages
Lab Report 3
No ratings yet
Lab Report 3
11 pages
IA - DCA2102 - DBMS - Set 1 and 2 - Dec2023
No ratings yet
IA - DCA2102 - DBMS - Set 1 and 2 - Dec2023
1 page
Tutorial 1 Part 1 With Answers
No ratings yet
Tutorial 1 Part 1 With Answers
2 pages
Source Code For Chatbot
No ratings yet
Source Code For Chatbot
22 pages
Attendance Marking System Using Image Recognition: Professor: Sanjay Srivastava
No ratings yet
Attendance Marking System Using Image Recognition: Professor: Sanjay Srivastava
15 pages
Assignment 1
No ratings yet
Assignment 1
4 pages
LP Examples
No ratings yet
LP Examples
2 pages
BE368 Lecture 4
No ratings yet
BE368 Lecture 4
28 pages
1013 - Mathematics Matrices Questions
No ratings yet
1013 - Mathematics Matrices Questions
3 pages
L18 K Means
No ratings yet
L18 K Means
27 pages
DAV Previous Year Papers
No ratings yet
DAV Previous Year Papers
6 pages
Sharpe Index Model: Portfolio Expected Return E (R) o
No ratings yet
Sharpe Index Model: Portfolio Expected Return E (R) o
2 pages
Homework 6 Sol
No ratings yet
Homework 6 Sol
2 pages
Chapter 9 - Dynamic Behavior of Closed Loop Systems
No ratings yet
Chapter 9 - Dynamic Behavior of Closed Loop Systems
45 pages
Anomaly Detection PPT
No ratings yet
Anomaly Detection PPT
14 pages
PPT06 - Eigenvalues and Eigenvectors
No ratings yet
PPT06 - Eigenvalues and Eigenvectors
13 pages
Purple Gradient Artificial Intelligence Presentation
No ratings yet
Purple Gradient Artificial Intelligence Presentation
9 pages
Homework 1: EE 737 Spring 2019-20 Assigned: 30 Jan Due: Beginning of Class, 06 Feb
100% (1)
Homework 1: EE 737 Spring 2019-20 Assigned: 30 Jan Due: Beginning of Class, 06 Feb
2 pages
Math Quantum Theory Notes
No ratings yet
Math Quantum Theory Notes
133 pages
Ai Introduction
No ratings yet
Ai Introduction
31 pages
Properties of 2D Systems
No ratings yet
Properties of 2D Systems
19 pages