Lecture 4: Multi-Layer Perceptrons
Kevin Swingler
kms@cs.stir.ac.uk
A single-layer perceptron can only form linear decision boundaries, so it cannot solve linearly inseparable problems. The proposed solution was to use a more complex network, one able to generate more complex decision boundaries: the Multi-Layer Perceptron.
[Figure: a Multi-Layer Perceptron with an input layer $i$ (units $X_1, X_2, X_3, \ldots, X_i$), a hidden layer $j$ (units $O_1, \ldots, O_j$) and an output layer $k$ (units $Y_1, Y_2, \ldots, Y_k$), with weights $w_{ij}$ between input and hidden layers and $w_{jk}$ between hidden and output layers.]

Each hidden unit computes

$O_j = f\left(\sum_i w_{ij} X_i\right)$

and each output unit computes

$Y_k = f\left(\sum_j w_{jk} O_j\right)$
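As an illustration of these two equations, here is a minimal forward-pass sketch in NumPy (the layer sizes, random weights, and function names are my own choices, not part of the lecture):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid activation, f(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w_ij, w_jk):
    """Forward pass through a one-hidden-layer MLP.

    x    : input vector X_i
    w_ij : weights from input unit i to hidden unit j, shape (inputs, hidden)
    w_jk : weights from hidden unit j to output unit k, shape (hidden, outputs)
    """
    o = sigmoid(x @ w_ij)   # hidden activations O_j = f(sum_i w_ij * X_i)
    y = sigmoid(o @ w_jk)   # output activations Y_k = f(sum_j w_jk * O_j)
    return o, y

# Example with 3 inputs, 2 hidden units, 2 outputs (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
w_ij = rng.normal(scale=0.5, size=(3, 2))
w_jk = rng.normal(scale=0.5, size=(2, 2))
o, y = forward(np.array([0.2, 0.7, -0.1]), w_ij, w_jk)
```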
Can We Use a Generalized Form of the PLR/Delta Rule to Train the MLP?
Recall the PLR/Delta rule: adjust the neuron's weights to reduce the error at its output:

$w_{new} = w_{old} + \eta\,\delta\,x$

where

$\delta = y_{target} - y$
Main problem: how do we adjust the weights in the hidden layer so that they reduce the error at the output layer, when there is no specified target response for the hidden layer? Solution: replace the Perceptron's non-linear (discrete threshold) activation function with a differentiable one, which allows us to derive a Generalized Delta Rule for training the MLP.
[Figure: the discrete threshold activation function (output +1 or -1) alongside the smooth sigmoid function that replaces it.]
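As an illustration of why the sigmoid helps, a short sketch of the function and its derivative (the function names are mine; the key point is that the derivative can be written in terms of the function's own output):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, f(a) = 1 / (1 + e^(-a)): a smooth, differentiable
    replacement for the discrete threshold."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_derivative(a):
    """f'(a) = f(a) * (1 - f(a)); this product form is the source of the
    o(1 - o) and y(1 - y) factors in the update rules that follow."""
    s = sigmoid(a)
    return s * (1.0 - s)
```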
Performing gradient descent on the output error $E$ gives the weight change

$\Delta w = w_{new} - w_{old} = -\eta\,\frac{\partial E}{\partial w} = \eta\,\delta\,x$
So, the weight change from the input layer unit i to hidden layer unit j is:
$\Delta w_{ij} = \eta\,\delta_j\,x_i$

where

$\delta_j = o_j(1 - o_j)\sum_k w_{jk}\,\delta_k$
The weight change from the hidden layer unit j to the output layer unit k is:
$\Delta w_{jk} = \eta\,\delta_k\,o_j$

where

$\delta_k = (y_{target,k} - y_k)\,y_k(1 - y_k)$
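Putting the two update rules together, here is a minimal sketch of one backpropagation step for a single training pattern (the learning rate, variable names, and layer shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, y_target, w_ij, w_jk, eta=0.1):
    """One Generalized Delta Rule update for a single training pattern."""
    # Forward pass
    o = sigmoid(x @ w_ij)          # hidden outputs o_j
    y = sigmoid(o @ w_jk)          # network outputs y_k

    # Output layer: delta_k = (y_target,k - y_k) * y_k * (1 - y_k)
    delta_k = (y_target - y) * y * (1.0 - y)

    # Hidden layer: delta_j = o_j * (1 - o_j) * sum_k w_jk * delta_k
    delta_j = o * (1.0 - o) * (w_jk @ delta_k)

    # Weight changes: dw_jk = eta * delta_k * o_j  and  dw_ij = eta * delta_j * x_i
    w_jk = w_jk + eta * np.outer(o, delta_k)
    w_ij = w_ij + eta * np.outer(x, delta_j)
    return w_ij, w_jk
```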
[Figure: the cost plotted against a single weight $w_i$, showing a local minimum, the global minimum, and the ideal weight value.]
Local Minima
Cost functions can quite easily have more than one minimum:
If we start off in the vicinity of a local minimum, we may end up at that local minimum rather than the global minimum. Starting from a range of different initial weight sets increases our chances of finding the global minimum. Any variation from true gradient descent will also increase our chances of stepping into the deeper valley.
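To make the restart idea concrete, here is a toy sketch (the one-dimensional cost function and all constants are invented purely for illustration): gradient descent is run from several random starting weights and the best final weight is kept.

```python
import numpy as np

def cost(w):
    """Toy cost with one shallow local minimum and one deeper global minimum."""
    return 0.05 * w**4 - 0.5 * w**2 + 0.2 * w

def grad(w):
    """Derivative of the toy cost."""
    return 0.2 * w**3 - 1.0 * w + 0.2

def descend(w, eta=0.05, steps=500):
    """Plain gradient descent on a single weight."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Several random starting weights: some runs settle in the shallow local
# minimum, others reach the deeper global one; keep the best final weight.
rng = np.random.default_rng(1)
starts = rng.uniform(-4, 4, size=10)
finals = [descend(w0) for w0 in starts]
best = min(finals, key=cost)
```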
Too few hidden units will generally leave high training and generalisation errors due to under-fitting. Too many hidden units will result in low training errors, but will make training unnecessarily slow and will result in poor generalisation unless some other technique (such as regularisation) is used to prevent over-fitting. Virtually all rules of thumb you hear about are actually nonsense. A sensible strategy is to try a range of numbers of hidden units and see which works best.
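A minimal sketch of that strategy, using scikit-learn's MLPClassifier and a synthetic dataset (the library, the data, and the particular range of sizes are my assumptions, not part of the lecture):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Try a range of hidden layer sizes and keep the one that generalises best
results = {}
for n_hidden in (2, 4, 8, 16, 32):
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=2000,
                        random_state=0)
    net.fit(X_train, y_train)
    results[n_hidden] = net.score(X_val, y_val)   # validation accuracy

best_size = max(results, key=results.get)
```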
Dept. of Computing Science & Math 19
In practice, it is often quicker to just use the same rates for all the weights and thresholds, rather than spending time trying to work out appropriate differences. A very powerful approach is to use evolutionary strategies to determine good learning rates.
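As a rough, hedged illustration of that evolutionary idea, here is a (1+1)-style loop that mutates the learning rate and keeps the mutation only when validation accuracy does not get worse (again using scikit-learn and synthetic data as stand-ins; none of this is prescribed by the lecture):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def val_score(eta):
    """Train a small MLP with learning rate eta, return validation accuracy."""
    net = MLPClassifier(hidden_layer_sizes=(8,), learning_rate_init=eta,
                        max_iter=500, random_state=0)
    net.fit(X_tr, y_tr)
    return net.score(X_val, y_val)

# (1+1)-style search: mutate the rate, keep the child if it is no worse
rng = np.random.default_rng(0)
eta, best = 0.1, val_score(0.1)
for _ in range(20):
    child = eta * np.exp(rng.normal(scale=0.5))   # multiplicative mutation
    score = val_score(child)
    if score >= best:
        eta, best = child, score
```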