
Cross-Validation

compiled by Alvin Wan from Professor Benjamin Recht's lecture

1 Generalization

We note that not generalizing means not doing well on new data; generalizing means that we perform about the same on new data as on the data we trained on. In any machine learning problem, we want to achieve the following.

$$\min_w R[w] = \mathbb{E}_{(x,y)}[\text{loss}(w; (x, y))]$$

Let us consider a model for empirical risk: the error

$$R_T[w] = \frac{1}{n_T} \sum_{i=1}^{n_T} \text{loss}(w, (x_i, y_i))$$

on our training set $(x_1, y_1), (x_2, y_2), \dots, (x_{n_T}, y_{n_T})$.
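As a concrete sketch, the empirical risk above can be computed directly; the squared loss and the linear model here are illustrative assumptions, not choices made in the lecture.

```python
import numpy as np

def empirical_risk(w, X, y):
    """R_T[w]: average loss over the training set (squared loss assumed)."""
    return np.mean((X @ w - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))     # n_T = 100 training points
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                    # noiseless labels, for illustration

print(empirical_risk(w_true, X, y))  # 0.0 at the true weights
print(empirical_risk(np.zeros(3), X, y) > 0)  # any other w incurs loss
```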

$$R[w] = (R[w] - R_T[w]) + R_T[w]$$

We can see that this statement is trivially true. The term $R[w] - R_T[w]$ is our generalization error.

Theorem Empirical risk is an unbiased estimate of risk.

Suppose $(x_i, y_i)$ are sampled i.i.d. and $w$ is fixed (independent of $(x_i, y_i)$). Compute the expectation of $R_T[w]$:

$$\mathbb{E}[R_T[w]] = \frac{1}{n_T} \sum_{i=1}^{n_T} \mathbb{E}[\text{loss}(w, (x_i, y_i))] = \frac{1}{n_T} \sum_{i=1}^{n_T} R[w] = R[w]$$

*Note: This assumes that the loss is bounded and positive. In other words, for some $B$, $0 \le \text{loss}(w, (x, y)) \le B$.

We now consider the following: $\mathrm{var}(X - \mathbb{E}[X]) = \mathrm{var}(X)$, where $X = R_T[w]$.

$$\mathrm{var}(R_T[w] - R[w]) = \frac{1}{n_T} \mathrm{var}(\text{loss}(w, (x, y)) - R[w]) \le \frac{B^2}{n_T}$$

So, the more data you get (the greater nT is), the less variance you have in your estimate
for total risk.
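The $1/n_T$ scaling of the variance can be checked by simulation; drawing losses i.i.d. uniform on $[0, 1]$ (so $B = 1$) is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_variance(n_T, trials=2000):
    """Variance of the empirical-risk estimate across many resampled
    training sets of size n_T. Losses are i.i.d. uniform on [0, B], B = 1."""
    losses = rng.uniform(0.0, 1.0, size=(trials, n_T))
    return losses.mean(axis=1).var()

v10, v1000 = estimate_variance(10), estimate_variance(1000)
print(v10, v1000)  # the larger training set gives a far smaller variance
```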

2 Hoeffding's Inequality

Definition: Hoeffding's Inequality Let $z_1, z_2, \dots, z_n$ be i.i.d. random variables. Assume the mean is $\mu$, $\mathbb{E}[z_i] = \mu$, and that the values are bounded by some $B$: $\Pr(0 \le z_i \le B) = 1$. Then we find that the estimate for our mean gets exponentially better with an increase in the amount of training data:

$$\Pr\left(\mu - \frac{1}{n} \sum_{i=1}^n z_i \ge t\right) \le \exp\left(-\frac{2nt^2}{B^2}\right)$$

We refer to this as a quantitative prediction, because it provides us with probability estimates. Applying Hoeffding's to the above, we find that the actual risk is bounded within a certain distance of our empirical risk.

$$R[w] \le R_T[w] + t \quad \text{with probability at least } 1 - \exp\left(-\frac{2nt^2}{B^2}\right)$$

Plug in $t = \frac{B}{\sqrt{n}}$, and we get the following.

$$R[w] \le R_T[w] + \frac{B}{\sqrt{n}} \quad \text{with probability at least } 86\%$$
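The 86% figure is just $1 - e^{-2}$: with $t = B/\sqrt{n}$, the exponent $2nt^2/B^2$ equals 2. A quick check (the values of $n$ and $B$ are illustrative; the result is independent of them):

```python
import math

n, B = 100, 1.0
t = B / math.sqrt(n)
exponent = 2 * n * t**2 / B**2
print(exponent)                  # ≈ 2, up to float rounding
print(1 - math.exp(-exponent))   # ≈ 0.865, i.e. at least 86%
```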

Theorem Given that we take the minimum of $k$ empirical-risk estimates for $w$, the empirical risk remains close to the actual risk:

$$R[\hat{w}] \le R_T[\hat{w}] + B\sqrt{\frac{\log k}{n_T}}$$

with probability at least $1 - \frac{1}{k}$. More formally, we have the following.

Suppose we have $w_1, w_2, \dots, w_k$, and set

$$\hat{w} = \operatorname{argmin}_{1 \le i \le k} R_T[w_i]$$

We now consider the risk of this new $\hat{w}$:

$$R[\hat{w}] \le R_T[\hat{w}] + B\sqrt{\frac{\log k}{n_T}}$$

with probability at least $1 - \frac{1}{k}$. We can prove this using the union bound. Note that the union bound states the following.

$$\Pr\left(\bigcup_{i=1}^k E_i\right) \le \sum_{i=1}^k \Pr(E_i)$$

Let us plug into the Hoeffding inequality, where $R_T[w_i] = \frac{1}{n_T} \sum_{j=1}^{n_T} z_j$.

$$\Pr(R[w_i] - R_T[w_i] \ge t) \le \exp\left(-\frac{2 n_T t^2}{B^2}\right)$$

$$\Pr(\exists i \text{ s.t. } R[w_i] - R_T[w_i] \ge t) \le k \exp\left(-\frac{2 n_T t^2}{B^2}\right)$$

If you look at the test set and adapt to it, the term $\sqrt{\frac{\log k}{n_V}}$ actually approaches $\sqrt{\frac{k}{n_V}}$.
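The danger of selecting among $k$ models can be seen in simulation: the minimum of $k$ empirical risks is optimistically biased even when every model has the same true risk. The Bernoulli(0.5) losses (so $B = 1$) and the sample sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def selection_gap(k, n, trials=500):
    """Average gap between the true risk (0.5) and the minimum of k
    empirical risks, each computed from n Bernoulli(0.5) losses."""
    losses = rng.random(size=(trials, k, n)) < 0.5
    emp = losses.mean(axis=2)            # empirical risk of each of the k models
    return float((0.5 - emp.min(axis=1)).mean())

g1 = selection_gap(k=1, n=100)
gk = selection_gap(k=100, n=100)
print(g1)   # ≈ 0: a single fixed model is an unbiased estimate
print(gk)   # clearly positive: the "winner" of k models looks too good
```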

Theorem If you have $d$ parameters (i.e., $x \in \mathbb{R}^d$, $w \in \mathbb{R}^d$), consider $w_T = \operatorname{argmin}_{\|w\| \le M} R_T[w]$.

If you instead choose to minimize $R_T[w] + \lambda \|w\|^2$, we find that our $w$ is bounded, $\|w_T\|^2 \le \frac{1}{\lambda} R_T[0]$, since $R_T[0] \ge R_T[w_T] + \lambda \|w_T\|^2 \ge \lambda \|w_T\|^2$. Let us consider an example.

Bound the norm of $w$ by $M$, and take the sign vectors $w = \frac{M}{\sqrt{d}}(\pm 1, \pm 1, \dots, \pm 1)$. This gives us $2^d$ possible vectors. Using Hoeffding's, we get that $R[w_T] \le R_T[w_T] + CM\sqrt{\frac{d}{n}}$. Intuitively, we can prove this by considering all points whose norm is at most $M$: every point within $\epsilon$ of a given point must have similar loss, so the finite-set bound extends to the whole ball of radius $M$.
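Plugging $k = 2^d$ into the finite-class bound above makes the $\sqrt{d/n}$ dependence explicit; absorbing the $\log 2$ and the $M$-scaling of the bounded loss into the constant $C$ is an informal step of this sketch.

```latex
R[w_T] \;\le\; R_T[w_T] + B\sqrt{\frac{\log 2^d}{n}}
       \;=\;   R_T[w_T] + B\sqrt{\frac{d \log 2}{n}}
       \;\le\; R_T[w_T] + C M \sqrt{\frac{d}{n}}
```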

If you have more points ($n$) than parameters ($d$), your risk is bounded. Otherwise, when the number of parameters is much larger than the number of points, the bound provided above does not give us much information. Instead, we need a new technique and a new bound: cross-validation.

3 Cross Validation

We take a validation set $(x_1, y_1), \dots, (x_{n_V}, y_{n_V})$. We then take our $k$ parameter settings and, using the training set, generate $w_1, w_2, \dots, w_k$. Then

$$R[w] \le R_V[w] + B\sqrt{\frac{\log k}{n_V}}.$$

Validation error, or test error, then tells us which $w_i$ gives the smallest risk.

In practice, we are not provided with a validation set. So, we take our data $(x_1, y_1), \dots, (x_n, y_n)$ and partition it randomly into the new training set $T = \{(x_i, y_i)\}_{i=1}^{n_T}$ and the validation set $V = \{(x_i, y_i)\}_{i=1}^{n_V}$, such that we don't lose data ($n = n_T + n_V$) and the training set is larger than the validation set ($n_T > n_V$).

It is highly recommended that you repeat this process.
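A minimal sketch of the random partition described above; the 80/20 ratio is an assumption, and any split with $n_T > n_V$ fits the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def split(X, y, frac_train=0.8):
    """Randomly partition (X, y) into training and validation sets,
    so that n = n_T + n_V and n_T > n_V."""
    n = len(y)
    idx = rng.permutation(n)          # random partition: shuffle indices
    n_T = int(frac_train * n)
    return X[idx[:n_T]], y[idx[:n_T]], X[idx[n_T:]], y[idx[n_T:]]

X = rng.normal(size=(50, 2))
y = rng.normal(size=50)
X_T, y_T, X_V, y_V = split(X, y)
print(len(y_T), len(y_V))  # 40 10
```

Repeating this split and averaging the validation errors is the repetition the notes recommend.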



Note that the training set must be large in order for the bound $\sqrt{\frac{d}{n_T}}$ to be meaningful. Additionally, we need a large enough test set so that $B\sqrt{\frac{\log k}{n_V}}$ is likewise meaningful. This is why we split into validation and training sets.

Here are the takeaways from both this section and all of this course.

1. Have more points than parameters.

2. Don't adapt to the test data.

3. Overfitting can happen in many different ways; for one, in how we cross-validate.
