n14 PDF
1 Generalization
Not generalizing means doing poorly on new data; generalizing means performing as well on new data as on the training data. We want to achieve generalization in any machine learning problem.
Let us consider the empirical risk, which is the error
$$R_T[w] = \frac{1}{n_T} \sum_{i=1}^{n_T} \mathrm{loss}(w, (x_i, y_i))$$
on our training set $(x_1, y_1), (x_2, y_2), \ldots, (x_{n_T}, y_{n_T})$.
The decomposition $R[w] = R_T[w] + (R[w] - R_T[w])$ is trivially true. The term $R[w] - R_T[w]$ is our generalization error.
Suppose the $(x_i, y_i)$ are sampled i.i.d. and $w$ is fixed (independent of the $(x_i, y_i)$). Let us compute the expectation of $R_T[w]$.
$$E[R_T[w]] = \frac{1}{n_T} \sum_{i=1}^{n_T} E[\mathrm{loss}(w, (x_i, y_i))] = \frac{1}{n_T} \sum_{i=1}^{n_T} R[w] = R[w]$$
*Note: This assumes that the loss is bounded and positive. In other words, for some $B$, $0 \le \mathrm{loss}(w, (x, y)) \le B$.
We now use the fact that $\mathrm{var}(X - E[X]) = \mathrm{var}(X)$, where $X = R_T[w]$:
$$\mathrm{var}(R_T[w] - R[w]) = \frac{1}{n_T} \mathrm{var}(\mathrm{loss}(w, (x, y)) - R[w]) \le \frac{B^2}{n_T}$$
So, the more data you get (the greater nT is), the less variance you have in your estimate
for total risk.
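As a sanity check of the $B^2/n_T$ variance bound, here is a small simulation (a sketch, with losses drawn uniformly on $[0, B]$ standing in for a real loss):

```python
import random

# A sketch, assuming i.i.d. losses uniform on [0, B] (a stand-in for a real
# loss): the empirical risk R_T[w] is their average, and its variance should
# shrink like 1/n_T, staying below the B^2 / n_T bound from the notes.
random.seed(0)
B = 1.0

def empirical_risk(n_T):
    """Average of n_T bounded losses: one draw of R_T[w]."""
    return sum(random.uniform(0, B) for _ in range(n_T)) / n_T

def variance_of_estimate(n_T, trials=2000):
    """Monte Carlo estimate of var(R_T[w]) over repeated training sets."""
    risks = [empirical_risk(n_T) for _ in range(trials)]
    mean = sum(risks) / trials
    return sum((r - mean) ** 2 for r in risks) / trials

for n_T in [10, 100, 1000]:
    v = variance_of_estimate(n_T)
    print(n_T, v)               # variance drops roughly 10x per step
    assert v <= B ** 2 / n_T    # the bound from the notes
```

Increasing $n_T$ tenfold cuts the measured variance by roughly a factor of ten, matching the $1/n_T$ behavior.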
2 Hoeffding's Inequality
Definition (Hoeffding's Inequality): Let $z_1, z_2, \ldots, z_n$ be i.i.d. random variables. Assume the mean is $\mu$, i.e. $E[z_i] = \mu$, and that the $z_i$ are bounded: $\Pr(0 \le z_i \le B) = 1$. Then the estimate of the mean gets exponentially better with an increase in the amount of training data:
$$\Pr\left(\mu - \frac{1}{n} \sum_{i=1}^{n} z_i \ge t\right) \le \exp\left(-\frac{2nt^2}{B^2}\right)$$
$$R[w] \le R_T[w] + t \quad \text{with probability at least } 1 - \exp\left(-\frac{2nt^2}{B^2}\right)$$
Plug in $t = \frac{B}{\sqrt{n}}$, and we get the following:
$$R[w] \le R_T[w] + \frac{B}{\sqrt{n}} \quad \text{with probability at least } 86\%$$
(since $1 - e^{-2} \approx 0.86$).
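To illustrate, a minimal simulation (assuming samples uniform on $[0, B]$, so the true mean is $B/2$) checks that the deviation $t = B/\sqrt{n}$ is respected at least 86% of the time:

```python
import math
import random

# Sketch: check empirically that the mean of n bounded samples falls within
# t = B / sqrt(n) of the true mean at least 1 - exp(-2) ≈ 86% of the time.
# Uniform samples on [0, B] are an assumption for illustration.
random.seed(0)
B, n, trials = 1.0, 200, 5000
t = B / math.sqrt(n)
true_mean = B / 2  # mean of the uniform distribution on [0, B]

hits = 0
for _ in range(trials):
    m = sum(random.uniform(0, B) for _ in range(n)) / n
    if true_mean - m <= t:   # one-sided deviation, as in the bound above
        hits += 1

print(hits / trials)  # should be at least 1 - exp(-2) ≈ 0.86
assert hits / trials >= 1 - math.exp(-2)
```

In practice the empirical frequency is far above 86%, since Hoeffding's bound is loose for well-behaved distributions.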
Theorem: Suppose we take the best (minimum empirical risk) of $k$ candidate estimates for $w$. Then the empirical risk of the chosen model is close to its actual risk:
$$R[\hat{w}] \le R_T[\hat{w}] + B\sqrt{\frac{\log k}{n_T}}$$
with probability at least $1 - \frac{1}{k}$.
More formally, we have that:
$$\hat{w} = \mathrm{argmin}_{1 \le i \le k} R_T[w_i]$$
and, by the union bound,
$$\Pr\left(\bigcup_{i=1}^{k} E_i\right) \le \sum_{i=1}^{k} \Pr(E_i)$$
Let us plug into Hoeffding's inequality, where $R_T[w_i] = \frac{1}{n_T} \sum_{j=1}^{n_T} z_j$:
$$\Pr(R[w_i] - R_T[w_i] \ge t) \le \exp\left(-\frac{2n_T t^2}{B^2}\right)$$
$$\Pr(\exists i \text{ s.t. } R[w_i] - R_T[w_i] \ge t) \le k \exp\left(-\frac{2n_T t^2}{B^2}\right)$$
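The union-bound step above can be checked numerically. A sketch, assuming $k$ models whose losses are i.i.d. uniform on $[0, B]$ (so every model's true risk is $0.5$):

```python
import math
import random

# Sketch: k candidate models, each evaluated on n i.i.d. bounded losses
# (uniform on [0, B] is an assumption; true risk is 0.5). We check that the
# chance ANY model's empirical risk under-reports its true risk by >= t is
# at most k * exp(-2 n t^2 / B^2), per the union bound above.
random.seed(0)
B, n, k, t, trials = 1.0, 100, 5, 0.15, 2000
bound = k * math.exp(-2 * n * t ** 2 / B ** 2)

violations = 0
for _ in range(trials):
    bad = False
    for _ in range(k):
        emp = sum(random.uniform(0, B) for _ in range(n)) / n
        if 0.5 - emp >= t:  # true risk exceeds empirical risk by >= t
            bad = True
    if bad:
        violations += 1

print(violations / trials, "<=", round(bound, 3))
assert violations / trials <= bound
```

The observed violation rate sits well under the bound, which is the slack that lets us take the best of $k$ models safely.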
If you look at the test set and adapt, the term $\sqrt{\frac{\log k}{n_V}}$ actually approaches $\sqrt{\frac{k}{n_V}}$.
If you choose to minimize $R_T[w] + \lambda \|w\|^2$, we find that the minimizer $w_T$ is bounded, $\|w_T\|^2 \le \frac{1}{\lambda} R_T[0]$, since $R_T[0] \ge R_T[w_T] + \lambda \|w_T\|^2 \ge \lambda \|w_T\|^2$ (the zero vector is one candidate in the minimization). Let us consider an example.
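A minimal sketch using ridge regression (squared loss on synthetic data; $\lambda$ written as `lam`, both hypothetical choices for illustration) verifies the norm bound $\|w\|^2 \le R_T[0]/\lambda$:

```python
import numpy as np

# Sketch of the norm bound above: minimize R_T[w] + lam * ||w||^2 for squared
# loss (ridge regression) and check ||w||^2 <= R_T[0] / lam, where R_T[0] is
# the empirical risk of the zero vector. Data here is synthetic.
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Closed-form minimizer of (1/n)||Xw - y||^2 + lam * ||w||^2.
w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

R_T_zero = np.mean(y ** 2)        # empirical risk at w = 0
print(float(w @ w), "<=", float(R_T_zero / lam))
assert w @ w <= R_T_zero / lam    # the bound from the notes
```

The bound holds because the regularized objective at the minimizer can be no larger than its value at $w = 0$.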
If you have more points ($n$) than parameters ($d$), your risk is bounded. If instead the number of parameters is much larger than the number of points, the bound above does not give us much information. We then need a new technique and a new bound: cross-validation.
3 Cross Validation
In practice, we are not provided with a validation set. So, we take our data $(x_1, y_1), \ldots, (x_n, y_n)$ and randomly partition it into a new training set $T = \{(x_i, y_i)\}_{i=1}^{n_T}$ and a validation set $V = \{(x_i, y_i)\}_{i=1}^{n_V}$, such that we don't lose data ($n = n_T + n_V$) and the training set is larger than the validation set ($n_T > n_V$).
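The random split described above can be sketched as follows (toy data; the sizes $n_T = 7$, $n_V = 3$ are arbitrary choices for illustration):

```python
import random

# Minimal sketch of the random split described above: n points are shuffled
# and partitioned into a training set T (size n_T) and a validation set V
# (size n_V), with n = n_T + n_V and n_T > n_V.
random.seed(0)
data = [(x, 2 * x) for x in range(10)]   # toy (x_i, y_i) pairs

def split(data, n_T):
    """Randomly partition data into a training set and a validation set."""
    shuffled = data[:]
    random.shuffle(shuffled)
    return shuffled[:n_T], shuffled[n_T:]

T, V = split(data, n_T=7)
assert len(T) + len(V) == len(data)      # no data is lost
assert len(T) > len(V)                   # training set larger than validation
print(len(T), len(V))
```

Shuffling before slicing is what makes the partition random; slicing a sorted list instead would bias both sets.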
Here are the takeaways from both this section and all of this course.
3. Overfitting can happen in many different ways; for one, in how we cross-validate.