Neural Network Module 2 Notes
Module 2
Supervised Learning
Contents
Perceptron Learning and Non-Separable Sets
α-Least Mean Square Learning
MSE Error Surface
Steepest Descent Search
μ-LMS Approximation to Gradient Descent
Application of LMS to Noise Cancelling
Multi-layered Network Architecture
Backpropagation Learning Algorithm
Practical Considerations of the BP Algorithm
∙ Limitations to the performance of a neural network (or for that matter any machine learning technique in
general) can be partly attributed to the quality of data employed for training.
∙ Real-world data sets are generally incomplete: they may lack values for some features or may be missing certain features altogether.
∙ Data is also often noisy, containing errors or outliers, and may be inconsistent.
∙ Alternatively, the data set might have many more features than are necessary for solving the classification
problem at hand, especially in application domains such as bioinformatics where features (such as
microarray gene expression values) can run into the thousands, whereas only a handful of them usually turn
out to be essential.
∙ Further, feature values may have scales that are different from one another by orders of magnitude.
∙ It is therefore necessary to perform some kind of pre-processing on the data prior to using it for training and
testing purposes (a minimal feature-scaling sketch follows this list).
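The following is a minimal sketch (not from the notes) of one common pre-processing step, z-score standardization, which addresses the scale problem mentioned above; the array values and the epsilon guard are illustrative assumptions.

```python
# Illustrative sketch: rescale each feature (column) to zero mean and unit
# variance so that features differing by orders of magnitude contribute comparably.
import numpy as np

def standardize(X, eps=1e-12):
    """Return X with each column shifted to zero mean and scaled to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)   # eps guards against constant features

# Example: two features whose scales differ by orders of magnitude.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])
print(standardize(X))
```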
∙ If the learning rates are small enough, then the algorithm converges to the closest local minimum.
∙ Very small learning rates can lead to long training times.
∙ In addition, if the network learning is non-uniform and the learning is stopped before the network is trained to an error
minimum, some weights will have reached their final 'optimal' values while others may not have. In such a situation, the
network might perform well on some patterns and poorly on others.
∙ If the error function can be approximated by a quadratic, then the following observations can be made (illustrated in the sketch that follows this list):
o An optimal learning rate will reach the error minimum in a single learning step.
o Rates that are lower will take longer to converge to the same solution.
o Rates that are larger but less than twice the optimal learning rate will converge to the error minimum but only after
much oscillation.
o Learning rates that are larger than twice the optimal value will diverge from the solution.
∙ There are a number of algorithms that attempt to adjust this learning rate somewhat optimally in order to speed up
conventional backpropagation.
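To make the four observations above concrete, here is a minimal numerical sketch (not from the notes) of gradient descent on a one-dimensional quadratic error E(w) = ½λw², for which the optimal learning rate is 1/λ; the curvature value and step counts are assumptions chosen for illustration.

```python
# Illustrative sketch: gradient descent w <- w - eta * dE/dw on E(w) = 0.5*lam*w**2.
# The update is w <- (1 - eta*lam) * w, so eta_opt = 1/lam reaches the minimum in one step.
lam = 2.0                  # assumed curvature of the quadratic
eta_opt = 1.0 / lam

def descend(eta, w=1.0, steps=8):
    """Return the trajectory of w under repeated gradient steps with rate eta."""
    traj = [w]
    for _ in range(steps):
        w = w - eta * lam * w       # gradient of E(w) is lam * w
        traj.append(w)
    return traj

print(descend(eta_opt))         # optimal rate: minimum reached in a single step
print(descend(0.5 * eta_opt))   # smaller rate: converges, but more slowly
print(descend(1.5 * eta_opt))   # between eta_opt and 2*eta_opt: oscillates yet converges
print(descend(2.5 * eta_opt))   # beyond 2*eta_opt: diverges from the solution
```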
Weight Decay:
▪ Over-fitted networks with a high degree of curvature are likely to contain weights with unusually large
magnitudes.
▪ Weight decay is a simple regularization technique which penalizes large weight magnitudes by adding to
the error function a penalty term that grows with the magnitudes of the weights.
▪ For example, the sum of the squares of the weights (including the biases) of the entire network can be
multiplied by a decay constant which decides the extent to which the penalty term affects the error function.
▪ With this, the learning process tends to favour weights of lower magnitude and thus helps keep the
operating range of neuron activations in the linear regime of the sigmoid (a minimal sketch follows).
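Here is a minimal sketch (not from the notes) of how the weight-decay penalty enters the error function and the gradient step; the decay constant and learning rate values are assumptions.

```python
# Illustrative sketch: weight decay adds (decay/2) * sum(w**2) to the error,
# so every weight also receives a shrinkage term eta * decay * w in the update.
import numpy as np

def penalized_error(data_error, weights, decay=1e-3):
    """Original error plus the weight-decay penalty over all weights (and biases)."""
    return data_error + 0.5 * decay * sum(np.sum(w ** 2) for w in weights)

def decayed_update(weights, grads, eta=0.1, decay=1e-3):
    """Gradient step on the penalized error; the penalty's gradient is decay * w."""
    return [w - eta * (g + decay * w) for w, g in zip(weights, grads)]

# With a zero data-error gradient, the weights simply shrink toward zero.
W = [np.array([[0.5, -2.0], [3.0, 0.1]])]
G = [np.zeros_like(W[0])]
print(decayed_update(W, G))
```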
∙ Both the generalization and approximation ability of a feedforward neural network are closely related to the
architecture of the network (which determines the number of weights or free parameters in the network) and
the size of the training set.
∙ It is possible to have a situation where there are too many connections in the network and too few training
examples.
∙ In such a situation, the network might "memorize" the training examples only too well, and may fail to
generalize properly because the number of training examples is insufficient to appropriately pin down all
the connection values in the network.
∙ In such a case, the network becomes over-trained and loses its ability to generalize or interpolate correctly.
∙ The real problem is to find a network architecture that is capable of approximation and generalization
simultaneously.
∙ Although the backpropagation algorithm can be applied to any number of hidden layers, a three-layered
network (i.e., one with a single hidden layer) can approximate any continuous function.
∙ The problem with multi-layered networks using a single hidden layer is that the neurons tend to interact with
each other globally.
∙ Such interactions can make it difficult to generate approximations of arbitrary accuracy.
∙ For a network with two hidden layers, the curve-fitting process becomes easier.
∙ The reason for this is that the first hidden layer extracts local features of the function.
∙ This is done in much the same way as binary threshold neurons partition the input space into regions.
∙ Neurons in the first hidden layer learn the local features that characterize specific regions of the input space.
∙ Global features are extracted in the second hidden layer.
∙ Once again, in a way similar to the multi-layered TLN network, neurons in the second hidden layer combine
the outputs of neurons in the first hidden layer, which facilitates the learning of global features (a minimal
forward-pass sketch follows).
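As a concrete illustration of the two-hidden-layer architecture described above, here is a minimal forward-pass sketch (not from the notes); the layer sizes and random weights are arbitrary assumptions.

```python
# Illustrative sketch: forward pass of a feedforward network with two hidden
# layers and sigmoid activations.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 4, 8, 6, 1              # assumed layer sizes

# Randomly initialized weights and zero biases, purely for illustration.
W1, b1 = rng.normal(size=(n_h1, n_in)), np.zeros(n_h1)
W2, b2 = rng.normal(size=(n_h2, n_h1)), np.zeros(n_h2)
W3, b3 = rng.normal(size=(n_out, n_h2)), np.zeros(n_out)

def forward(x):
    h1 = sigmoid(W1 @ x + b1)    # first hidden layer: local feature responses
    h2 = sigmoid(W2 @ h1 + b2)   # second hidden layer: combines local features globally
    return sigmoid(W3 @ h2 + b3) # output layer

print(forward(rng.normal(size=n_in)))
```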