The Problem of Overfitting: Overfitting With Linear Regression
To recap, if we have too many features then the learned hypothesis may give a cost function of exactly zero
But this tries too hard to fit the training set
Fails to provide a general solution - unable to generalize (apply to new examples)
Overfitting with logistic regression
Addressing overfitting
The addition in blue (reconstructed below) is a modification of our cost function to help penalize θ3 and θ4
So here we end up with θ3 and θ4 being close to zero (because the penalty constants are massive)
So we're basically left with a quadratic function
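A rough reconstruction of the modified cost function from the lecture (the penalty constant 1000 is the lecture's illustrative choice of a massive number):

min_θ (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))^2 + 1000·(θ3)^2 + 1000·(θ4)^2

Minimizing this forces θ3 and θ4 towards zero, because any non-trivial value gets multiplied by a huge constant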
Regularization
Small values for parameters correspond to a simpler hypothesis (you effectively get rid of some of the terms)
A simpler hypothesis is less prone to overfitting
Another example
Have 100 features x1, x2, ..., x100
Unlike the polynomial example, we don't know which ones are the high order terms
How do we pick which parameters to shrink?
With regularization, take cost function and modify it to shrink all the parameters
Add a term at the end
This regularization term shrinks every parameter
By convention you don't penalize θ0 - minimization is from θ1 onwards
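Reconstructed from the lecture, the regularized linear regression cost function is:

J(θ) = (1/2m) [ Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))^2 + λ Σ_{j=1}^{n} θ_j^2 ]

λ is the regularization parameter, controlling the trade-off between fitting the training set well and keeping the parameters small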
Previously, gradient descent would repeatedly update the parameters θj, where j = 0,1,2...n
simultaneously
Shown below
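Reconstructed from the lecture, the regularized update rules are:

θ_0 := θ_0 - α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_0^(i)
θ_j := θ_j - α [ (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i) + (λ/m) θ_j ]    (j = 1, 2, ..., n)

The θ_j rule can be rearranged as θ_j := θ_j(1 - αλ/m) - α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) x_j^(i)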
The term (1 - αλ/m) is usually a number slightly less than 1, so each iteration shrinks θj a little before applying the usual (unregularized) gradient descent update
Regularization with logistic regression
We saw earlier that logistic regression can be prone to overfitting with lots of features
The logistic regression cost function is as follows:
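Reconstructed from the lecture, the regularized logistic regression cost function is:

J(θ) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} θ_j^2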
Again, to modify the algorithm we simply need to modify the update rule for θ1 onwards
Looks cosmetically the same as linear regression, except obviously the hypothesis is very different (here h_θ(x) = 1/(1 + e^(-θ^T x)))
Use fminunc
Pass it an @costFunction argument
Minimizes in an optimized manner using the cost function
jVal
Need code to compute J(θ)
Need to include regularization term
Gradient
Needs to be the partial derivative of J(θ) with respect to θi
Adding the appropriate regularization term here is also necessary (see the sketch below)
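A minimal Octave sketch of such a cost function for regularized logistic regression - the function name and the anonymous function used to bind X, y and lambda are illustrative assumptions, not code from the lecture:

function [jVal, gradient] = costFunction(theta, X, y, lambda)
  % X = design matrix (first column all 1s), y = labels, theta = parameter vector
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                       % sigmoid hypothesis
  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
         + (lambda / (2*m)) * sum(theta(2:end) .^ 2);   % don't penalize theta(1), i.e. theta_0
  gradient = (1/m) * (X' * (h - y));
  gradient(2:end) = gradient(2:end) + (lambda/m) * theta(2:end);
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(size(X, 2), 1);
[optTheta, jVal] = fminunc(@(t) costFunction(t, X, y, lambda), initialTheta, options);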
08: Neural Networks - Representation
Neural networks (NNs) were originally motivated by looking at machines which replicate
the brain's functionality
Looked at here as a machine learning technique
Origins
To build learning systems, why not mimic the brain?
Used a lot in the 80s and 90s
Popularity diminished in late 90s
Recent major resurgence
NNs are computationally expensive, so only recently have large-scale neural networks become computationally feasible
Brain
Does loads of crazy things
Hypothesis is that the brain has a single learning algorithm
Evidence for hypothesis
Auditory cortex --> takes sound signals
If you cut the wiring from the ear to the auditory cortex
Re-route optic nerve to the auditory cortex
Auditory cortex learns to see
Somatosensory cortex (touch processing)
If you re-route the optic nerve to the somatosensory cortex then it learns to see
With different tissue learning to see, maybe they all learn in the same way
Brain learns by itself how to learn
Other examples
Seeing with your tongue
Brainport
Grayscale camera on head
Run wire to array of electrodes on tongue
Pulses onto tongue represent image signal
Lets people see with their tongue
Human echolocation
Blind people being trained in schools to interpret sound and echo
Lets them move around
Haptic belt direction sense
Belt which buzzes towards north
Gives you a sense of direction
Brain can process and learn from data from any source
Model representation I
How do we represent neural networks (NNs)?
Neural networks were developed as a way to simulate networks of neurones
What does a neurone look like?
For example, Ɵ_13^(1) means:
1 - we're mapping to node 1 in layer l+1 (here, layer 2)
3 - we're mapping from node 3 in layer l
(1) - we're mapping from layer 1
Model representation II
Here we'll look at how to carry out the computation efficiently through a vectorized implementation. We'll also consider why NNs are good and how we can use them to learn complex non-linear things
We can vectorize the computation of the neural network as follows, in two steps:
z^(2) = Ɵ^(1)x
i.e. Ɵ^(1) is the matrix defined above
x is the feature vector
z^(2) is a 3x1 vector
a^(2) = g(z^(2))
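A minimal Octave sketch of this vectorized step; the Ɵ^(1) values and feature vector here are placeholders, not learned weights:

Theta1 = rand(3, 4);          % hypothetical weights mapping layer 1 --> layer 2 (incl. bias column)
x = [1; 0.5; -1; 2];          % feature vector with the bias unit x0 = 1 prepended
z2 = Theta1 * x;              % z^(2) = Ɵ^(1)x, a 3x1 vector
a2 = 1 ./ (1 + exp(-z2));     % a^(2) = g(z^(2)), the sigmoid applied element-wise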
Example on the right shows a simplified version of the more complex problem we're dealing
with (on the left)
We want to learn a non-linear decision boundary to separate the positive and negative
examples
y = x1 XNOR x2, i.e. NOT (x1 XOR x2)
Positive examples when both are true or both are false
Let's start with something a little more straight forward...
Don't worry about how we're determining the weights (Ɵ values) for now - just get
a flavor of how NNs work
Can we get a one-unit neural network to compute this logical AND function? (probably...)
Add a bias unit
Add some weights for the networks
What are weights?
Weights are the parameter values which multiply into the input nodes (i.e. Ɵ)
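The lecture's weights for this AND network are Ɵ = [-30, 20, 20]. A quick Octave check over all four inputs (the loop itself is just an illustrative sketch):

g = @(z) 1 ./ (1 + exp(-z));        % sigmoid
theta = [-30; 20; 20];              % bias weight, then the weights on x1 and x2
for x1 = 0:1
  for x2 = 0:1
    h = g(theta' * [1; x1; x2]);    % g(-30 + 20*x1 + 20*x2)
    printf('x1 = %d, x2 = %d --> h = %.4f\n', x1, x2, h);
  end
end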
So, as we can see, when we evaluate each of the four possible inputs, only (1,1) gives a positive output
Negation is achieved by putting a large negative weight in front of the variable you want to negate
Simplez!
Multiclass classification
Multiclass classification is, unsurprisingly, when you distinguish between more than two
categories (i.e. more than 1 or 0)
With the handwritten digit recognition problem there are 10 possible categories (0-9)
How do you do that?
Done using an extension of one vs. all classification
Recognizing pedestrian, car, motorbike or truck
Build a neural network with four output units
Output a vector of four numbers
1 is 0/1 pedestrian
2 is 0/1 car
3 is 0/1 motorcycle
4 is 0/1 truck
When the image is a pedestrian the output should be [1,0,0,0], and so on
Just like one vs. all described earlier
Here we have four logistic regression classifiers
09: Neural Networks - Learning
So here
L = 4 (the number of layers)
s1 = 3 (units in layer 1)
s2 = 5
s3 = 5
s4 = 4
For neural networks our cost function is a generalization of this equation above, so instead of one output we generate K outputs
There are basically two halves to the neural network logistic regression cost function
First half
Second half
This is a massive regularization summation term, which I'm not going to walk through, but it's a
fairly straightforward triple nested summation
This is also called a weight decay term
As before, the lambda value determines the relative importance of the two halves
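Reconstructed from the lecture, the full cost function (first half = the logistic cost summed over the K outputs, second half = the regularization/weight decay term) is:

J(Ɵ) = -(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} [ y_k^(i) log((h_Ɵ(x^(i)))_k) + (1 - y_k^(i)) log(1 - (h_Ɵ(x^(i)))_k) ]
       + (λ/2m) Σ_{l=1}^{L-1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_(l+1)} (Ɵ_ji^(l))^2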
Remember that the partial derivative term we calculate above is a REAL number (not a vector or a matrix)
Ɵ is the set of parameters
Ɵ^(1) is the matrix of weights which define the function mapping from layer 1 to layer 2
Ɵ_10^(1) is the real number parameter which you multiply the bias unit (i.e. 1) by for the input into the first unit in the second layer
Ɵ_11^(1) is the real number parameter which you multiply the first (real) unit by for the input into the first unit in the second layer
Ɵ_21^(1) is the real number parameter which you multiply the first (real) unit by for the input into the second unit in the second layer
As discussed, for Ɵ_ij^(l)
i here represents the unit in layer l+1 you're mapping to (destination node)
j is the unit in layer l you're mapping from (origin node)
l is the layer you're mapping from (to layer l+1) (origin layer)
NB
The terms destination node, origin node and origin layer are terms I've made up!
So - this partial derivative term is
The partial derivative of J(Ɵ), a function of the 3-way indexed set of parameters, with respect to one real number (which is one of the values in that set)
Gradient computation
One training example
Imagine we just have a single pair (x, y) as our entire training set
How would we deal with this example?
The forward propagation algorithm operates as follows
Layer 1
a^(1) = x
z^(2) = Ɵ^(1)a^(1)
Layer 2
a^(2) = g(z^(2)) (add the bias unit a_0^(2))
z^(3) = Ɵ^(2)a^(2)
Layer 3
a^(3) = g(z^(3)) (add a_0^(3))
z^(4) = Ɵ^(3)a^(3)
Output
a^(4) = h_Ɵ(x) = g(z^(4))
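A minimal Octave sketch of this forward propagation for the [3-5-5-4] network discussed below; the Theta matrices are random placeholders rather than learned weights:

g = @(z) 1 ./ (1 + exp(-z));                                    % sigmoid
Theta1 = rand(5, 4); Theta2 = rand(5, 6); Theta3 = rand(4, 6);  % incl. bias columns
x = rand(3, 1);                        % one training example with 3 features
a1 = [1; x];                           % add the bias unit a_0^(1)
z2 = Theta1 * a1;  a2 = [1; g(z2)];    % layer 2 (add a_0^(2))
z3 = Theta2 * a2;  a3 = [1; g(z3)];    % layer 3 (add a_0^(3))
z4 = Theta3 * a3;  a4 = g(z4);         % output layer: h_Ɵ(x), a 4x1 vector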
And if we take a second to consider the vector dimensionality (with our example above [3-5-5-4])
Ɵ^(3) is a matrix which is [4 x 5] (if we don't include the bias term; [4 x 6] if we do)
(Ɵ^(3))^T is therefore a [5 x 4] matrix
δ^(4) is a 4x1 vector
So when we multiply a [5 x 4] matrix with a [4 x 1] vector we get a [5 x 1] vector
Which, lo and behold, is the same dimensionality as the a^(3) vector, meaning we can run our pairwise multiplication
Why do we do this?
We do all this to get all the δ terms, and we want the δ terms because through a very complicated derivation you can use δ to get the partial derivative of J(Ɵ) with respect to the individual parameters (if you ignore regularization, or regularization is 0, which we deal with later)
∂J(Ɵ)/∂Ɵ_ij^(l) = a_j^(l) δ_i^(l+1)
By doing back propagation and computing the delta terms you can then compute the partial derivative terms
We need the partial derivatives to minimize the cost function!
i.e. for each example in the training set (dealing with each example as a pair (x, y))
Set a^(1) (activation of input layer) = x^(i)
Perform forward propagation to compute a^(l) for each layer (l = 1, 2, ... L)
i.e. run forward propagation
Then, use the output label for the specific example we're looking at to calculate δ^(L), where δ^(L) = a^(L) - y^(i)
So we initially calculate the delta value for the output layer
Then, using back propagation, we move back through the network from layer L-1 down to layer 2 (there is no δ^(1) because the input has no associated error)
Finally, use Δ to accumulate the partial derivative terms
Note here, for the subscripts of Δ_ij^(l)
l = layer
j = node in that layer
i = the error of the affected node in the target layer
You can vectorize the Δ expression too, as Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T
Finally
After executing the body of the loop, exit the for loop and compute
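Reconstructed from the lecture, that final computation is (D then holds the partial derivatives):

D_ij^(l) := (1/m) Δ_ij^(l) + λƟ_ij^(l)    if j ≠ 0
D_ij^(l) := (1/m) Δ_ij^(l)                if j = 0 (bias terms are not regularized)

∂J(Ɵ)/∂Ɵ_ij^(l) = D_ij^(l)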
The sigmoid function applied to the z values gives the activation values
Below we show exactly how the z value is calculated for an example
Back propagation
This function cycles over each example, so the cost for one example really boils down to this
Which, we can think of as a sigmoidal version of the squared difference (check out the derivation if you don't believe me)
So, basically saying, "how well is the network doing on example i "?
We can think about a δ term on a unit as the "error" of cost for the activation value associated with a unit
More formally (don't worry about this...), δ_j^(l) is the partial derivative of the cost for the example with respect to the unit's weighted input: δ_j^(l) = ∂cost(i)/∂z_j^(l)
Looking at another example to see how we actually calculate the delta value:
So, in effect,
Back propagation calculates the δ, and those δ values are the weighted sum of the next layer's delta values, weighted by
the parameter associated with the links
Forward propagation calculates the activation (a) values, moving forward through the network
Depending on how you implement it, you may also compute the delta values for the bias units
However, these aren't actually used, so it's a bit inefficient, but not a lot more!
Example
Use the thetaVec = [ Theta1(:); Theta2(:); Theta3(:)]; notation to unroll the matrices into a long vector
To go back you use
Theta1 = reshape(thetaVec(1:110), 10, 11)
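A fuller Octave sketch, assuming the lecture's example dimensions (Theta1 and Theta2 are 10x11, Theta3 is 1x11, so thetaVec is 231x1):

Theta1 = ones(10, 11); Theta2 = 2 * ones(10, 11); Theta3 = 3 * ones(1, 11);  % placeholder values
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];    % unroll into one long vector
Theta1 = reshape(thetaVec(1:110), 10, 11);       % roll back up into matrices
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);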
Gradient checking
Backpropagation has a lot of details, small bugs can be present and ruin it :-(
This may mean it looks like J(Ɵ) is decreasing, but in reality it may not be decreasing by as much as it should
So using a numeric method to check the gradient can help diagnose a bug
Gradient checking helps make sure an implementation is working correctly
Example
Have a function J(Ɵ)
Estimate derivative of function at point Ɵ (where Ɵ is a real number)
How?
Numerically
Compute J(Ɵ + ε)
Compute J(Ɵ - ε)
Join them by a straight line
Use the slope of that line as an approximation to the derivative
So, in Octave we use the following code to numerically compute the derivatives
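Reconstructed from the lecture (EPSILON is a small constant, e.g. 1e-4, and J is assumed to be a function handle computing the cost for an unrolled parameter vector theta):

EPSILON = 1e-4;
n = length(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus = theta;                % reset to theta on each loop
  thetaPlus(i) = thetaPlus(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end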
So on each loop thetaPlus = theta except for thetaPlus(i)
Resets thetaPlus on each loop
Create a vector of partial derivative approximations
Using the vector of gradients from backprop (DVec)
Check that gradApprox is basically equal to DVec
Gives confidence that the backprop implementation is correct
Implementation note
Implement back propagation to compute DVec
Implement numerical gradient checking to compute gradApprox
Check they're basically the same (up to a few decimal places)
Before using the code for learning turn off gradient checking
Why?
The gradApprox stuff is very computationally expensive
In contrast backprop is much more efficient (just more fiddly)
Random initialization
Pick random small initial values for all the theta values
If you start them at zero (which does work for linear regression) then the algorithm fails - all activation values for each layer are the same
So choose random values!
Between 0 and 1, then scale by epsilon (where epsilon is a small constant)
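A minimal Octave sketch following the lecture's approach (the 10x11 dimensions and the INIT_EPSILON value are illustrative):

INIT_EPSILON = 0.12;
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;   % each entry in [-epsilon, epsilon]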
for i = 1:m {
Forward propagation on (x^(i), y^(i)) --> get activation (a) terms
Back propagation on (x^(i), y^(i)) --> get delta (δ) terms
Compute Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T
}
Notes on implementation
Usually done with a for loop over training examples (for forward and back propagation)
Can be done without a for loop, but this is a much more complicated way of doing things
Be careful
2.5) Use gradient checking to compare the partial derivatives computed using the above algorithm and a numerical estimation of the gradient of J(Ɵ)
Disable the gradient checking code for when you actually run it
2.6) Use gradient descent or an advanced optimization method with back propagation to try to minimize J(Ɵ) as a function of
parameters Ɵ
Here J(Ɵ) is non-convex
Can be susceptible to local minima
In practice this is not usually a huge problem
Can't guarantee we find the global optimum, but it should find a good local optimum at least
e.g. the plot above pretends the data only has two features so it's easy to display what's going on
Our minimum here represents a hypothesis output which is pretty close to y
If you took one of the peaks, the hypothesis output is far from y
Gradient descent will start from some random point and move downhill
Back propagation calculates gradient down that hill