
The Problem of Overfitting


[Figure: Overfitting on the BR data — neural network with 20% classification noise, 307 training examples]

Overfitting: h ∈ H overfits training set S if there exists h’ ∈ H that
has higher training set error but lower test error on new data points.
(More specifically, if learning algorithm A explicitly considers and
rejects h’ in favor of h, we say that A has overfit the data.)
Overfitting

H1 ⊂ H2 ⊂ H3 ⊂ …

If we use a hypothesis space Hi that is too large, eventually we can
trivially fit the training data. In other words, the VC dimension will
eventually be equal to the size of our training sample m.
This is sometimes called “model selection”, because we can think of
each Hi as an alternative “model” of the data.
Approaches to Preventing
Overfitting
Penalty methods
– MAP provides a penalty based on P(H)
– Structural Risk Minimization
– Generalized Cross-validation
– Akaike’s Information Criterion (AIC)
Holdout and Cross-validation methods
– Experimentally determine when overfitting occurs
Ensembles
– Full Bayesian methods vote many hypotheses
∑h P(y|x,h) P(h|S)
– Many practical ways of generating ensembles
Penalty methods
Let εtrain be our training set error and εtest be our test
error. Our real goal is to find the h that minimizes εtest.
The problem is that we can’t directly evaluate εtest. We
can measure εtrain, but it is optimistically biased.
Penalty methods attempt to find some penalty such that
εtest ≈ εtrain + penalty

The penalty term is also called a regularizer or regularization term.
During training, we set our objective function J to be
J(w) = εtrain(w) + penalty(w)
and find the w that minimizes this function.
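
As a concrete (hypothetical) illustration of minimizing J(w) = εtrain(w) + penalty(w), here is a C++ sketch of one gradient-descent step on squared training error plus an L2 penalty for a linear model; the penalty form and the step size are illustrative choices, not prescribed by the slides.

#include <cstddef>
#include <vector>

// One gradient-descent step on J(w) = sum_i (w.x_i - y_i)^2 + lambda*||w||^2.
// X holds the training inputs, y the targets; w is updated in place.
void GradStep(const std::vector<std::vector<double>>& X,
              const std::vector<double>& y,
              std::vector<double>& w,
              double lambda, double eta)
{
    std::vector<double> grad(w.size(), 0.0);
    for (std::size_t i = 0; i < X.size(); ++i) {
        double pred = 0.0;
        for (std::size_t j = 0; j < w.size(); ++j) pred += w[j] * X[i][j];
        double err = pred - y[i];
        for (std::size_t j = 0; j < w.size(); ++j)
            grad[j] += 2.0 * err * X[i][j];      // gradient of eps_train
    }
    for (std::size_t j = 0; j < w.size(); ++j) {
        grad[j] += 2.0 * lambda * w[j];          // gradient of the penalty
        w[j] -= eta * grad[j];                   // descend on J(w)
    }
}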
MAP penalties
hmap = argmaxh P(S|h) P(h)
As h becomes more complex, we can assign it a lower
prior probability. A typical approach is to assign equal
probability to each of the nested hypothesis spaces so
that
P(h ∈ H1) = P(h ∈ H2) = … = α
Because H2 contains more hypotheses than H1, each
individual h ∈ H2 will have lower prior probability:
P(h) = ∑i α/|Hi|, summing over each i such that h ∈ Hi
If there are infinitely many Hi, this will not work, because
the probabilities must sum to 1. In this case, a common
approach is
P(h) = ∑i 2^(−i)/|Hi|, summing over each i such that h ∈ Hi
This is not usually a big enough penalty to prevent
overfitting, however
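
This prior does at least normalize correctly; a short check (not in the original slides), summing over every space each h belongs to:

\[
\sum_h P(h) \;=\; \sum_{i=1}^{\infty} \sum_{h \in H_i} \frac{2^{-i}}{|H_i|}
\;=\; \sum_{i=1}^{\infty} 2^{-i} \;=\; 1
\]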
Structural Risk Minimization
Define regularization penalty using PAC theory
" #
k 4 2em 4
² <= 2²T + dk log + log
m dk δ
" #
C R2 + kξk2 2 1
² <= 2
log m + log
m γ δ
Other Penalty Methods
Generalized Cross Validation
Akaike’s Information Criterion
Mallows’ Cp

Simple Holdout Method
Subdivide S into Strain and Seval
For each Hi, find hi ∈ Hi that best fits Strain
Measure the error rate of each hi on Seval
Choose hi with the best error rate

Example: let Hi be the set of neural network weights after i epochs of
training on Strain. Our goal is to choose i.
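
A minimal C++ sketch of this use of holdout for early stopping (the trainOneEpoch, evalError, and saveWeights callbacks are hypothetical placeholders, not from the slides):

#include <functional>

// Early stopping by simple holdout: train for up to maxEpochs epochs and
// return the epoch whose weights gave the lowest error on Seval.
int ChooseEpochByHoldout(const std::function<void()>& trainOneEpoch,
                         const std::function<double()>& evalError,
                         const std::function<void()>& saveWeights,
                         int maxEpochs)
{
    int bestEpoch = 0;
    double bestErr = evalError();   // error of the initial weights on Seval
    saveWeights();
    for (int i = 1; i <= maxEpochs; ++i) {
        trainOneEpoch();            // one pass of training on Strain
        double err = evalError();   // measure on the held-out set Seval
        if (err < bestErr) { bestErr = err; bestEpoch = i; saveWeights(); }
    }
    return bestEpoch;
}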
Simple Holdout Assessment
Advantages
– Guaranteed to perform within a constant factor of any
penalty method (Kearns et al., 1995)
– Does not rely on theoretical approximations
Disadvantages
– Strain is smaller than S, so h is likely to be less
accurate
– If Seval is too small, the error rate estimates will be
very noisy
Simple Holdout is widely applied to make other
decisions such as learning rates, number of
hidden units, SVM kernel parameters, relative
size of penalty, which features to include, feature
encoding methods, etc.
k-fold Cross-Validation to
determine Hi
Randomly divide S into k equal-sized subsets
Run the learning algorithm k times, each time using
one subset for Seval and the rest for Strain
Average the results
K-fold Cross-Validation to determine Hi
Partition S into K disjoint subsets S1, S2, …, SK
Repeat simple holdout assessment K times
– In the k-th assessment, Strain = S – Sk and Seval = Sk
– Let hik be the best hypothesis from Hi from iteration k.
– Let εi be the average Seval error of hik over the K iterations
– Let i* = argmini εi
Train on S using Hi* and output the resulting
hypothesis
[Figure: grid of folds k = 1…5 (rows) by hypothesis spaces i = 0…9 (columns)]
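
In code, the selection loop might look like the following C++ sketch (trainAndEval is a hypothetical callback that trains the best h ∈ Hi on S − Sk and returns its error on Sk):

#include <limits>

// k-fold cross-validation to choose among hypothesis spaces H_0..H_{numSpaces-1}.
int ChooseSpaceByCV(int numSpaces, int K,
                    double (*trainAndEval)(int i, int k))
{
    int bestI = 0;
    double bestErr = std::numeric_limits<double>::infinity();
    for (int i = 0; i < numSpaces; ++i) {
        double sum = 0.0;
        for (int k = 0; k < K; ++k)
            sum += trainAndEval(i, k);   // holdout error of h_ik on fold k
        double eps = sum / K;            // average eval error eps_i
        if (eps < bestErr) { bestErr = eps; bestI = i; }
    }
    return bestI;                        // i* = argmin_i eps_i
}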
Ensembles
Bayesian Model Averaging. Sample hypotheses hi
according to their posterior probability P(h|S). Vote
them. A method called Markov chain Monte Carlo
(MCMC) can do this (but it is quite expensive)
Bagging. Overfitting is caused by high variance.
Variance reduction methods such as bagging can help.
Indeed, better results are often obtained by bagging
overfitted classifiers (e.g., unpruned decision trees, over-
trained neural networks) than by bagging well-fitted
classifiers (e.g., pruned trees).
Randomized Committees. We can train several
hypotheses hi using different random starting weights for
backpropagation
Random Forests. Grow many decision trees and vote
them. When growing each tree, randomly (at each
node) choose a subset of the available features (e.g., √n
out of n features). Compute the best split using only
those features.
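
A sketch of the per-node feature subsampling step in C++ (the split-scoring code that would consume this subset is omitted; the √n choice follows the slide):

#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// At each node of a random-forest tree, consider only a random subset of
// roughly sqrt(n) of the n features when computing the best split.
std::vector<int> SampleFeatureSubset(int n, std::mt19937& rng)
{
    std::vector<int> features(n);
    std::iota(features.begin(), features.end(), 0);   // 0, 1, ..., n-1
    std::shuffle(features.begin(), features.end(), rng);
    int k = std::max(1, static_cast<int>(std::sqrt(static_cast<double>(n))));
    features.resize(k);                               // keep ~sqrt(n) features
    return features;
}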
Overfitting Summary
Minimizing training set error (εtrain) does not necessarily
minimize test set error (εtest).
– This is true when the hypothesis space is too large (too
expressive)
Penalty methods add a penalty to εtrain to approximate
εtest
– Bayesian, MDL, and Structural Risk Minimization
Holdout and Cross-Validation methods withhold a subset
of the training data, Seval, to determine the proper
hypothesis space Hi and its complexity
Ensemble Methods take a combination of several
hypotheses, which tends to cancel out overfitting errors
Penalty Methods for decision trees,
neural networks, and SVMs
Decision Trees
– pessimistic pruning
– MDL pruning
Neural Networks
– weight decay
– weight elimination
– pruning methods
Support Vector Machines
– maximizing the margin
Pessimistic Pruning of Decision Trees

Error rate on training data is 4/20 = 0.20 = p.
The binomial confidence interval (using the normal approximation to
the binomial distribution) is

\[
p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}
\;\le\; p_{\mathrm{true}} \;\le\;
p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}
\]

If we use α = 0.25, then z_{α/2} = 1.150, so we obtain
0.097141 ≤ p ≤ 0.302859
We use the upper bound of this interval as our error rate estimate.
Hence, we estimate 0.302859 × 20 = 6.06 errors.
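
These numbers are easy to verify; a small, self-contained C++ check of the example’s arithmetic:

#include <cmath>
#include <cstdio>

int main() {
    double p = 4.0 / 20.0, n = 20.0, z = 1.150;   // values from the example
    double half = z * std::sqrt(p * (1.0 - p) / n);
    std::printf("interval: [%f, %f]\n", p - half, p + half); // [0.097141, 0.302859]
    std::printf("pessimistic errors: %f\n", (p + half) * n); // about 6.06
    return 0;
}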
Pruning Algorithm (1):
Traversing the Tree
float Prune(Node & node)
{
if (node.leaf) return PessError(node);
float childError = Prune(node.left) + Prune(node.right);
float prunedError = PessError(node);
if (prunedError < childError) { // prune
node.leaf = true;
node.left = node.right = NULL;
return prunedError}
else // don't prune
return childError;
}
Pruning Algorithm (2):
Computing the Pessimistic Error
const float zalpha2 = 1.150;   // z for alpha = 0.25, two-sided

float PessError(Node & node)
{
  float n  = node.classCount[0] + node.classCount[1];
  float nl = n + 2.0;           // Laplace correction: one extra per class
  float wrong = std::min(node.classCount[0], node.classCount[1]) + 1.0;
  float p = wrong / nl;         // Laplace estimate of error rate
  // pessimistic (upper confidence bound) error count at this node
  return n * (p + zalpha2 * std::sqrt(p * (1.0 - p) / n));
}
Pessimistic Pruning Example
Penalty methods for Neural
Networks
Weight Decay

\[
J_i(W) \;=\; \frac{1}{2}\,(\hat{y}_i - y_i)^2 \;+\; \lambda \sum_j w_j^2
\]

Weight Elimination

\[
J_i(W) \;=\; \frac{1}{2}\,(\hat{y}_i - y_i)^2 \;+\; \lambda \sum_j \frac{w_j^2 / w_0^2}{1 + w_j^2 / w_0^2}
\]

w0 large encourages many small weights
w0 small encourages a few large weights
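
For training, each penalty contributes a term to the gradient; a C++ sketch with the derivatives worked out from the formulas above (λ and w0 are whatever values the experimenter picks):

#include <cstddef>
#include <vector>

// Weight decay: d/dw_j [ lambda * w_j^2 ] = 2 * lambda * w_j
void AddWeightDecayGrad(std::vector<double>& grad,
                        const std::vector<double>& w, double lambda)
{
    for (std::size_t j = 0; j < w.size(); ++j)
        grad[j] += 2.0 * lambda * w[j];
}

// Weight elimination: d/dw_j [ lambda * (w_j^2/w0^2) / (1 + w_j^2/w0^2) ]
//                   = lambda * 2 * w_j * w0^2 / (w0^2 + w_j^2)^2
void AddWeightElimGrad(std::vector<double>& grad,
                       const std::vector<double>& w,
                       double lambda, double w0)
{
    for (std::size_t j = 0; j < w.size(); ++j) {
        double denom = w0 * w0 + w[j] * w[j];
        grad[j] += lambda * 2.0 * w[j] * (w0 * w0) / (denom * denom);
    }
}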
Weight Elimination

This essentially counts the number of large weights. Once they are
large enough, their penalty does not change
Neural Network Pruning Methods:
Optimal Brain Damage
(LeCun, Denker, Solla, 1990)

Taylor series expansion of the squared error:

\[
\Delta J(W) \;=\; \sum_j g_j\,\Delta w_j
\;+\; \frac{1}{2}\sum_j h_{jj}\,(\Delta w_j)^2
\;+\; \frac{1}{2}\sum_{j \ne k} h_{jk}\,\Delta w_j\,\Delta w_k
\;+\; O(\|\Delta w\|^3)
\]

where

\[
g_j = \frac{\partial J(W)}{\partial w_j}
\qquad \text{and} \qquad
h_{jk} = \frac{\partial^2 J(W)}{\partial w_j\,\partial w_k}
\]

At a local minimum, gj = 0. Assume the off-diagonal terms hjk = 0. Then

\[
\Delta J(W) \;=\; \frac{1}{2}\sum_j h_{jj}\,(\Delta w_j)^2
\]

If we set wj = 0, the error will change by hjj wj²/2.
Optimal Brain Damage Procedure
1. Choose a reasonable network architecture
2. Train the network until a reasonable solution is
obtained
3. Compute the second derivatives hjj for each
weight wj
4. Compute the saliencies for each weight: hjj wj²/2
5. Sort the weights by saliency and delete some
low-saliency weights
6. Repeat from step 2
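
A sketch of steps 3–5 in C++, assuming the diagonal Hessian entries hjj have already been computed (e.g., by the backpropagation-style pass described in the paper):

#include <algorithm>
#include <cstddef>
#include <vector>

// Given weights and the diagonal Hessian terms h_jj, zero out the
// numDelete weights with the smallest saliency s_j = h_jj * w_j^2 / 2.
void PruneBySaliency(std::vector<double>& w,
                     const std::vector<double>& hDiag,
                     std::size_t numDelete)
{
    std::vector<std::size_t> order(w.size());
    for (std::size_t j = 0; j < w.size(); ++j) order[j] = j;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) {
                  double sa = hDiag[a] * w[a] * w[a] / 2.0;  // saliency of a
                  double sb = hDiag[b] * w[b] * w[b] / 2.0;  // saliency of b
                  return sa < sb;
              });
    for (std::size_t r = 0; r < numDelete && r < order.size(); ++r)
        w[order[r]] = 0.0;   // delete (zero) the low-saliency weights
}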
OBD Results
On an OCR problem, they started with a highly-
constrained and sparsely connected network
with 2,578 weights, trained on 9300 training
examples. They were able to delete more than
1000 weights without hurting training or test
error.
Optimal Brain Surgeon attempts to do this
before reaching a local minimum
Experimental evidence is mixed about whether
this reduces overfitting, but it does reduce the
computational cost of using the neural network.
Penalty Methods for Support Vector
Machines
Our basic SVM tried to fit the training data perfectly
(possibly by using kernels). However, this will quickly
lead to overfitting.
Recall the margin-based bound. With probability 1 – δ, a
linear separator with unit weight vector and margin γ on
training data lying in a ball of radius R will have an error
rate on new data points bounded by

\[
\epsilon \;\le\; \frac{C}{m}\left[\, \frac{R^2 + \|\xi\|^2}{\gamma^2}\,\log^2 m \;+\; \log\frac{1}{\delta} \,\right]
\]

for some constant C, where ξ is the margin slack vector such that
ξi = max{0, γ − yi g(xi)}
Preventing SVM Overfitting
Maximize margin γ
Minimize slack vector ||ξ||
Minimize R

The reciprocal of the margin acts as a penalty to prevent overfitting.
Functional Margin versus
Geometric Margin
Functional margin: γf = yi (w · xi + b)
Geometric margin: γg = γf / ||w||
The margin bound applies only to the
geometric margin γg
The functional margin can be made
arbitrarily large by rescaling the weight
vector, but the geometric margin is
invariant to scaling
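
A one-line check of this invariance, rescaling (w, b) by any c > 0:

\[
\gamma_f \to c\,\gamma_f,
\qquad
\gamma_g = \frac{\gamma_f}{\|w\|} \to \frac{c\,\gamma_f}{c\,\|w\|} = \gamma_g
\]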
Intermission: Geometry of Lines
Consider the line w · x + b = 0, where w is a vector of unit
length (||w|| = 1). Then the minimum distance from the line
to the origin is |b|.
Geometry of a Margin
If a point x+ is a distance γ away from the
line, then it lies on the line w · x + b = γ
The Geometric Margin is the
Inverse of ||w||
Lemma: γg = 1/||w||
Proof:
– Let w be an arbitrary weight vector such that the
positive point x+ has a functional margin of 1. Then
w · x+ + b = 1
– Now normalize this equation by dividing by ||w||:

\[
\frac{w}{\|w\|} \cdot x^{+} \;+\; \frac{b}{\|w\|} \;=\; \frac{1}{\|w\|} \;=\; \gamma_g
\]
– Implication: We can hold the functional margin at 1
and minimize the norm of the weight vector
Support Vector Machine Quadratic
Program
Find w
Minimize ||w||2
Subject to
yi · (w · xi + b) ≥ 1
This requires every training example to
have a functional margin of at least 1 and
then maximizes the geometric margin.
However, it still requires perfectly fitting the data.
Handling Non-Separable Data:
Introduce Margin Slack Variables
Find: w, ξ
Minimize: ||w||2 + C||ξ||2
Subject to:
yi · (w · xi + b) + ξi ≥ 1
– ξi is positive only if example xi does not have a
functional margin of at least 1
– ||ξ||2 measures how well the SVM fits the training data
– ||w||2 is the penalty term
– C is the tradeoff parameter that determines the
relative weight of the penalty compared to the fit to
the data
Kernel Trick Form of SVM
To apply the Kernel Trick, we need to
reformulate the SVM quadratic program so
that it only involves dot products between
training examples. This can be done by
an operation called the Lagrange Dual
Lagrange Dual Problem
Find αi
Maximize
∑i αi – ½ ∑i ∑j yi yj αi αj [xi · xj + δij/C]
Subject to
∑i yi αi = 0
αi ≥ 0
where δij = 1 if i = j and 0 otherwise.
Kernel Trick Form
Find αi
Maximize
∑i αi – ½ ∑i ∑j yi yj αi αj [K(xi,xj) + δij/C]
Subject to
∑i yi αi = 0
αi ≥ 0
where δij = 1 if i = j and 0 otherwise.
Resulting Classifier
The resulting classifier is
f(x) = ∑j yj αj K(xj, x) + b
where b is chosen by finding an i with
αi > 0 and solving
yi f(xi) = 1 – αi/C
for b
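
A C++ sketch of this classifier (the RBF kernel and the stored support-vector arrays are illustrative assumptions; any kernel K would do):

#include <cmath>
#include <cstddef>
#include <vector>

// Example kernel: RBF, K(x, z) = exp(-gamma * ||x - z||^2). Illustrative only.
double KernelRBF(const std::vector<double>& x,
                 const std::vector<double>& z, double gamma)
{
    double d2 = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k)
        d2 += (x[k] - z[k]) * (x[k] - z[k]);
    return std::exp(-gamma * d2);
}

// f(x) = sum_j y_j * alpha_j * K(x_j, x) + b
double Classify(const std::vector<std::vector<double>>& sv,  // support vectors x_j
                const std::vector<double>& y,                // labels y_j (+1/-1)
                const std::vector<double>& alpha,            // dual weights
                double b, double gamma,
                const std::vector<double>& x)
{
    double f = b;
    for (std::size_t j = 0; j < sv.size(); ++j)
        f += y[j] * alpha[j] * KernelRBF(sv[j], x, gamma);
    return f;   // sign(f) is the predicted class
}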
Variations on the SVM Problem:
Variation 1: Use L1 norm of ξ
This is the “official” SVM, which was
originally published by Cortes and Vapnik
Find: w, ξ
Minimize: ||w||2 + C||ξ||
Subject to:
yi · (w · xi + b) + ξi ≥ 1
Dual Form of L1 SVM
Find αi
Maximize
∑i αi – ½ ∑i ∑j yi yj αi αj K(xi,xj)
Subject to
∑i yi αi = 0
C ≥ αi ≥ 0
Variation 2:
Linear Programming SVMs
Use the L1 norm for w too, writing w = u – v with u, v ≥ 0
Find u, v, ξ
Minimize ∑j uj + ∑j vj + C ∑i ξi
Subject to
yi · ((u – v) · xi + b) + ξi ≥ 1
u, v, ξ ≥ 0
The kernel form of this is
Find αi, ξ
Minimize ∑i αi + C ∑i ξi
Subject to
∑j αj yi yj K(xi,xj) + ξi ≥ 1
αj ≥ 0, ξi ≥ 0
Setting the Value of C
We see that the full SVM algorithm requires
choosing the value of C, which controls the
tradeoff between fitting the data and obtaining a
large margin.
To set C, we could train an SVM with different
values of C and plug the resulting C, γ, and ξ
into the margin bounds theorem to choose the C
that minimizes the bound on ε.
In practice, this does not work well, and we must
rely on holdout methods (next lecture).
