Module-3
Artificial Neural Networks
3.1 Introduction
Artificial neural networks (ANNs) provide a general, practical method for learning real-
valued, discrete-valued, and vector-valued functions from examples. Algorithms such as
back propagation use gradient descent to tune network parameters to best fit a training
set of input-output pairs. ANN learning is robust to errors in the training data and has
been successfully applied to problems such as interpreting visual scenes, speech
recognition, and learning robot control strategies.
For certain types of problems, such as learning to interpret complex real-world sensor
data, artificial neural networks are among the most effective learning methods currently
known. For example, the Back-Propagation algorithm has proven surprisingly
successful in many practical problems such as learning to recognize handwritten
characters, learning to recognize spoken words, and learning to recognize faces.
Biological Motivation
The study of artificial neural networks (ANNs) has been inspired in part by the
observation that biological learning systems are built of very complex webs of
interconnected neurons. In rough analogy, artificial neural networks are built out of a
densely interconnected set of simple units, where each unit takes a number of real-valued
inputs (possibly the outputs of other units) and produces a single real-valued output
(which may become the input to many other units). To develop a feel for this analogy, let
us consider a few facts from neurobiology. The human brain, for example, is estimated to
contain a densely interconnected network of approximately 10^11 neurons, each
connected, on average, to 10^4 others. Neuron activity is typically excited or inhibited
through connections to other neurons. The fastest neuron switching times are known to
be on the order of 10^-3 seconds, quite slow compared to computer switching speeds of
10^-10 seconds.
Yet humans are able to make surprisingly complex decisions, surprisingly quickly. For
example, it requires approximately 10^-1 seconds to visually recognize your mother.
Notice the sequence of neuron firings that can take place during this 10^-1 second interval
cannot possibly be longer than a few hundred steps, given the switching speed of single
neurons. This observation has led many to speculate that the information-processing
abilities of biological neural systems must follow from highly parallel processes operating
on representations that are distributed over many neurons. One motivation for ANN
systems is to capture this kind of highly parallel computation based on distributed
representations. Most ANN software runs on sequential machines emulating distributed
processes, although faster versions of the algorithms have also been implemented on
highly parallel machines and on specialized hardware designed specifically for ANN
applications.
To illustrate, consider the ALVINN system, which uses a learned artificial neural network
to steer an autonomous vehicle driving at normal speeds on public highways. The input to
the neural network is a 30 x 32 grid of pixel intensities obtained from a forward-pointed
camera mounted on the vehicle. The
network output is the direction in which the vehicle is steered. The ANN is trained to
mimic the observed steering commands of a human driving the vehicle for
approximately 5 minutes. ALVINN has used its learned networks to successfully drive
at speeds up to 70 miles per hour and for distances of 90 miles on public highways
(driving in the left lane of a divided public highway, with other vehicles present).
Neural network learning to steer an autonomous vehicle. The ALVINN system uses
BACKPROPAGATION to learn to steer an autonomous vehicle (photo at top) driving at speeds up to 70
miles per hour. The diagram on the left shows how the image of a forward-mounted camera is mapped
to 960 neural network inputs, which are fed forward to 4 hidden units, connected to 30 output units.
Network outputs encode the commanded steering direction. The figure on the right shows weight
values for one of the hidden units in this network. The 30 x 32 weights into the hidden unit are
displayed in the large matrix, with white blocks indicating positive and black indicating negative
weights. The weights from this hidden unit to the 30 output units are depicted by the smaller
rectangular block directly above the large block. As can be seen from these output weights, activation
of this particular hidden unit encourages a turn toward the left.
The figure above illustrates the neural network representation used in one version of the
ALVINN system, and illustrates the kind of representation typical of many ANN systems.
The network is shown on the left side of the figure, with the input camera image depicted
below it. Each node (i.e., circle) in the network diagram corresponds to the output of a
single network unit, and the lines entering the node from below are its inputs. As can be
seen, there are four units that receive inputs directly from all of the 30 x 32 pixels in the
image. These are called "hidden" units because their output is available only within
the network and is not available as part of the global network output. Each of these
four hidden units computes a single real-valued output based on a weighted combination
of its 960 inputs. These hidden unit outputs are then used as inputs to a second layer of
30 "output" units. Each output unit corresponds to a particular steering direction, and the
output values of these units determine which steering direction is recommended most
strongly.
The diagrams on the right side of the figure depict the learned weight values associated
with one of the four hidden units in this ANN. The large matrix of black and white boxes
on the lower right depicts the weights from the 30 x 32-pixel inputs into the hidden unit.
Here, a white box indicates a positive weight, a black box a negative weight, and the size
of the box indicates the weight magnitude.
The smaller rectangular diagram directly above the large matrix shows the weights from
this hidden unit to each of the 30 output units.
The network structure of ALVINN is typical of many ANNs. Here the individual units are
interconnected in layers that form a directed acyclic graph. In general, ANNs can be
graphs with many types of structures-acyclic or cyclic, directed or undirected. Here we
will focus on the most common and practical ANN approaches, which are based on the
BackPropagation algorithm. The BACKPROPAGATION algorithm assumes the
network is a fixed structure that corresponds to a directed graph, possibly
containing cycles. Learning corresponds to choosing a weight value for each edge in the
graph. Although certain types of cycles are allowed, the vast majority of practical
applications involve acyclic feed-forward networks, similar to the network structure
used by ALVINN.
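To make the data flow concrete, here is a short Python sketch of one forward pass through a feedforward network with the layer sizes described above (960 pixel inputs, 4 hidden sigmoid units, 30 output units). The random weights and variable names are illustrative assumptions; ALVINN's actual weights are learned with Backpropagation.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

rng = np.random.default_rng(0)
image = rng.random(960)                                  # 30 x 32 pixel intensities, flattened

# Weight matrices; the extra column in each holds the w0 bias weight.
w_hidden = rng.normal(scale=0.05, size=(4, 961))         # 4 hidden units, 960 inputs + bias
w_output = rng.normal(scale=0.05, size=(30, 5))          # 30 output units, 4 hidden inputs + bias

hidden = sigmoid(w_hidden @ np.append(1.0, image))       # outputs of the 4 hidden units
outputs = sigmoid(w_output @ np.append(1.0, hidden))     # 30 steering-direction outputs
steering = int(np.argmax(outputs))                       # most strongly recommended direction
print(steering)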
Appropriate Problems for Neural Network Learning
ANN learning is well suited to problems in which the training data corresponds to noisy,
complex sensor data, such as inputs from cameras and microphones. Back-Propagation is
appropriate for problems with the following characteristics:
• The target function output may be discrete-valued, real-valued, or a vector of
several real- or discrete-valued attributes. In the ALVINN system, for example, the
output is a vector of 30 values, each a number between 0 and 1 corresponding to the
confidence in predicting the corresponding steering direction. We can also train a
single network to output both the steering command and suggested acceleration,
simply by concatenating the vectors that encode these two output predictions.
• The training examples may contain errors. ANN learning methods are quite
robust to noise in the training data.
• Long training times are acceptable. Network training algorithms typically
require longer training times than, say, decision tree learning algorithms. Training
times can range from a few seconds to many hours, depending on factors such as
the number of weights in the network, the number of training examples
considered, and the settings of various learning algorithm parameters.
• Fast evaluation of the learned target function may be required. Although ANN
learning times are relatively long, evaluating the learned network, in order to
apply it to a subsequent instance, is typically very fast. For example, ALVINN
applies its neural network several times per second to continually update its
steering command as the vehicle drives forward.
• The ability of humans to understand the learned target function is not
important. The weights learned by neural networks are often difficult for humans
to interpret. Learned neural networks are less easily communicated to humans
than learned rules.
3.4 Perceptrons
One type of ANN system is based on a unit called a perceptron, illustrated in the figure
below. A perceptron takes a vector of real-valued inputs, calculates a linear combination
of these inputs, then outputs 1 if the result is greater than some threshold and -1
otherwise. More precisely, given inputs x1 through xn, the output o(x1, ..., xn) computed
by the perceptron is

o(x1, ..., xn) = 1 if w0 + w1x1 + w2x2 + ... + wnxn > 0, and -1 otherwise

where each wi is a real-valued constant, or weight, that determines the contribution of
input xi to the perceptron output. Notice the quantity (-w0) is a threshold that the
weighted combination of inputs w1x1 + ... + wnxn must surpass for the perceptron to
output a 1. To simplify notation, we imagine an additional constant input x0 = 1, which
allows the condition above to be written as Σi=0..n wi xi > 0, or in vector form as w · x > 0.
Learning a perceptron involves choosing values for the weights w0, ..., wn. Therefore, the
space H of candidate hypotheses considered in perceptron learning is the set of all
possible real-valued weight vectors.
The decision surface represented by a two-input perceptron. (a) A set of training examples and the
decision surface of a perceptron that classifies them correctly. (b) A set of training examples that is
not linearly separable (i.e., that cannot be correctly classified by any straight line). x1 and x2 are the
perceptron inputs. Positive examples are indicated by "+", negative by "-".
The equation for this decision hyperplane is w · x = 0. Of course, some sets of positive
and negative examples cannot be separated by any hyperplane. Those that can be
separated are called linearly separable sets of examples.
A single perceptron can be used to represent many boolean functions. For example, if we
assume boolean values of 1 (true) and -1 (false), then one way to use a two-input
perceptron to implement the AND function is to set the weights w0 = -0.8 and w1 = w2 =
0.5. This perceptron can be made to represent the OR function instead by altering the
threshold weight to w0 = 0.3, so that the weighted sum exceeds the threshold whenever at
least one input is true. In fact, AND and OR can be viewed as special cases of m-of-n
functions: that is, functions where at least m of the n inputs to the perceptron must be
true. The OR function corresponds to m = 1 and the AND function to m = n. Any m-of-n
function is easily represented using a perceptron by setting all input weights to the same
value (e.g., 0.5) and then setting the threshold w0 accordingly.
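As a quick check of these weight settings, the following minimal Python sketch (illustrative only) evaluates a two-input perceptron with the AND and OR weights given above, using 1 for true and -1 for false:

def perceptron(weights, inputs):
    # weights[0] is the threshold weight w0; remaining weights pair with the inputs
    s = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if s > 0 else -1

AND_WEIGHTS = (-0.8, 0.5, 0.5)   # outputs +1 only when both inputs are +1
OR_WEIGHTS = (0.3, 0.5, 0.5)     # outputs +1 when at least one input is +1

for x in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(x, perceptron(AND_WEIGHTS, x), perceptron(OR_WEIGHTS, x))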
Perceptrons can represent all of the primitive boolean functions AND, OR, NAND (¬AND),
and NOR (¬OR). Unfortunately, however, some boolean functions cannot be represented
by a single perceptron, such as the XOR function, whose value is 1 if and only if x1 ≠ x2.
Note the set of linearly nonseparable training examples shown in above Figure (b)
corresponds to this XOR function.
The ability of perceptrons to represent AND, OR, NAND, and NOR is important because
every boolean function can be represented by some network of interconnected units
based on these primitives. In fact, every boolean function can be represented by some
network of perceptrons only two levels deep, in which the inputs are fed to multiple units,
and the outputs of these units are then input to a second, final stage.
One way is to represent the Boolean function in disjunctive normal form (i.e., as the
disjunction (OR) of a set of conjunctions (ANDs) of the inputs and their negations). Note
that the input to an AND perceptron can be negated simply by changing the sign of the
corresponding input weight.
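As a sketch of this two-level construction, XOR can be written in disjunctive normal form as (x1 AND ¬x2) OR (¬x1 AND x2); negating an input amounts to flipping the sign of its weight. The weight values below reuse the AND and OR settings discussed earlier and are illustrative only:

def perceptron(weights, inputs):
    s = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if s > 0 else -1

def xor_network(x1, x2):
    h1 = perceptron((-0.8, 0.5, -0.5), (x1, x2))   # x1 AND (NOT x2): sign of w2 flipped
    h2 = perceptron((-0.8, -0.5, 0.5), (x1, x2))   # (NOT x1) AND x2: sign of w1 flipped
    return perceptron((0.3, 0.5, 0.5), (h1, h2))   # OR of the two hidden units

for x in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
    print(x, xor_network(*x))                      # +1 exactly when x1 != x2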
The Perceptron Training Rule
One way to learn an acceptable weight vector is to begin with random weights, then
iteratively apply the perceptron to each training example, modifying the weights
whenever the example is misclassified. This process is repeated, iterating through the
training examples as many times as needed, until the perceptron classifies all training
examples correctly. Weights are modified at each step according to the perceptron
training rule, which revises the weight wi associated with input xi according to the rule

wi ← wi + Δwi
where
Δwi = η (t - o) xi

Here t is the target output for the current training example, o is the output generated by
the perceptron, and η is a positive constant called the learning rate. The role of the
learning rate is to moderate the degree to which weights are changed at each step. It is
usually set to some small value (e.g., 0.1) and is sometimes made to decay as the number
of weight-tuning iterations increases.
Why should this update rule converge toward successful weight values? To get an
intuitive feel, consider some specific cases. Suppose the training example is correctly
classified already by the perceptron. In this case, (t - o) is zero, making ∆wi zero, so that
no weights are updated. Suppose the perceptron outputs a -1, when the target output is
+ 1. To make the perceptron output a + 1 instead of - 1 in this case, the weights must be
altered to increase the value of w · x. For example, if xi > 0, then increasing wi will bring the
perceptron closer to correctly classifying this example. Notice the training rule will
increase wi in this case, because (t - o), η, and xi are all positive. For example, if xi = 0.8,
η = 0.1, t = 1, and o = -1, then the weight update will be Δwi = η(t - o)xi = 0.1(1 - (-1))0.8
= 0.16. On the other hand, if t = -1 and o = 1, then weights associated with positive xi will
be decreased rather than increased.
In fact, the above learning procedure can be proven to converge within a finite number
of applications of the perceptron training rule to a weight vector that correctly classifies
all training examples, provided the training examples are linearly separable and
provided a sufficiently small η is used. If the data are not linearly separable, convergence
is not assured.
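The following Python sketch applies the perceptron training rule to a small linearly separable training set (the OR function). The data, starting weights, and learning rate are illustrative assumptions:

def predict(weights, x):
    s = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if s > 0 else -1

# Linearly separable training set: the OR function with 1/-1 encoding
examples = [((1, 1), 1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]
eta = 0.1
weights = [0.0, 0.0, 0.0]                        # w0, w1, w2

for _ in range(100):                             # repeat until every example is classified correctly
    errors = 0
    for x, t in examples:
        o = predict(weights, x)
        if o != t:
            errors += 1
            weights[0] += eta * (t - o)          # bias input x0 = 1
            for i, xi in enumerate(x, start=1):
                weights[i] += eta * (t - o) * xi # delta wi = eta (t - o) xi
    if errors == 0:
        break

print(weights)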
Gradient Descent and the Delta Rule
Although the perceptron rule finds a successful weight vector when the training examples
are linearly separable, it can fail to converge if the examples are not linearly separable. A
second training rule, called the delta rule, is designed to overcome this difficulty. The key
idea behind the delta rule is to use gradient descent to search the hypothesis space of
possible weight vectors to find the weights that best fit the training examples. The delta
rule is best understood by considering the task of training an unthresholded perceptron,
that is, a linear unit whose output o is given by

o(x) = w · x

Thus, a linear unit corresponds to the first stage of a perceptron, without the threshold.
In order to derive a weight learning rule for linear units, let us begin by specifying a
measure for the training error of a hypothesis (weight vector), relative to the training
examples. Although there are many ways to define this error, one common measure that
will turn out to be especially convenient is

E(w) ≡ ½ Σd∈D (td - od)²
where D is the set of training examples, td is the target output for training example d, and
od is the output of the linear unit for training example d.
By this definition, E(w) is simply half the squared difference between the target output
td and the linear unit output od, summed over all training examples. Here we characterize
E as a function of w, because the linear unit output o depends on this weight vector. Of
course E also depends on the particular set of training examples, but we assume these
are fixed during training, so we do not bother to write E as an explicit function of these.
Error of different hypotheses. For a linear unit with two weights, the hypothesis space H is the w0, w1 plane. The
vertical axis indicates the error of the corresponding weight vector hypothesis, relative to a fixed set of training
examples. The arrow shows the negated gradient at one particular point, indicating the direction in the w0, w1
plane producing steepest descent along the error surface.
The error surface shown in the figure thus summarizes the desirability of every weight
vector in the hypothesis space (we desire a hypothesis with minimum error). Given the
way in which we chose to define E, for linear units this error surface must always be
parabolic with a single global minimum. The specific parabola will depend, of course, on
the particular set of training examples.
Gradient descent search determines a weight vector that minimizes E by starting with an
arbitrary initial weight vector, then repeatedly modifying it in small steps. At each step,
the weight vector is altered in the direction that produces the steepest descent along the
error surface depicted in above Figure. This process continues until the global minimum
error is reached.
The direction of steepest descent can be found by computing the derivative of E with
respect to each component of the vector w. This vector derivative is called the gradient of
E with respect to w, written ∇E(w):

∇E(w) ≡ [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]     (1)

Notice ∇E(w) is itself a vector, whose components are the partial derivatives of E with
respect to each of the wi. When interpreted as a vector in weight space, the gradient
specifies the direction that produces the steepest increase in E. The negative of this
vector therefore gives the direction of steepest decrease.
For example, the arrow in above Figure shows the negated gradient for a
particular point in the w0, w1 plane.
Since the gradient specifies the direction of steepest increase of E, the training rule for
gradient descent is

w ← w + Δw
where
Δw = -η ∇E(w)

Here η is a positive constant called the learning rate, which determines the step size in
the gradient descent search.
The negative sign is present because we want to move the weight vector in the direction
that decreases E. This training rule can also be written in its component form

wi ← wi + Δwi
where
Δwi = -η ∂E/∂wi     (2)
which makes it clear that steepest descent is achieved by altering each component wi of
w in proportion to ∂E/∂wi. These partial derivatives can be obtained by differentiating E
from the definition above, which yields

∂E/∂wi = Σd∈D (td - od)(-xid)     (3)

where xid denotes the single input component xi for training example d. We now have an
equation that gives ∂E/∂wi in terms of the linear unit inputs xid, outputs od, and target
values td associated with the training examples.
Substituting Equation (3) into Equation (2) yields the weight update rule for gradient
descent

Δwi = η Σd∈D (td - od) xid

Standard gradient descent computes this sum over all training examples before updating
the weights. A common variation, known as incremental or stochastic gradient descent,
instead approximates this by updating the weights after each individual training example,
according to the delta rule

Δwi = η (t - o) xi
where t, o, and xi are the target value, unit output, and ith input for the training example
in question. To modify the gradient descent algorithm in the table above to implement this
stochastic approximation, Equation (T4.2) is simply deleted and Equation (T4.1)
replaced by wi ← wi + η (t - o) xi. One way to view this stochastic gradient descent is
to consider a distinct error function Ed(w) for each individual training example d, as
follows

Ed(w) = ½ (td - od)²
where td and od are the target value and the unit output value for training example d.
Stochastic gradient descent iterates over the training examples d in D, at each iteration
altering the weights according to the gradient with respect to Ed(w). The sequence of
these weight updates, when iterated over all training examples, provides a reasonable
approximation to descending the gradient with respect to our original error function E(w).
By making the value of η (the gradient descent step size) sufficiently small, stochastic
gradient descent can be made to approximate true gradient descent arbitrarily closely.
The key differences between standard gradient descent and stochastic gradient
descent are:
• In standard gradient descent, the error is summed over all examples before
updating weights, whereas in stochastic gradient descent weights are updated
upon examining each training example.
• Summing over multiple examples in standard gradient descent requires more
computation per weight update step. On the other hand, because it uses the true
gradient, standard gradient descent is often used with a larger step size per weight
update than stochastic gradient descent.
• In cases where there are multiple local minima with respect to E(w), stochastic
gradient descent can sometimes avoid falling into these local minima because it
uses the various gradients ∇Ed(w), rather than ∇E(w), to guide its search.
Both stochastic and standard gradient descent methods are commonly used in practice.
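To make the contrast concrete, here is a small sketch of both schemes for a single linear unit on a synthetic data set; the data, learning rate, and iteration counts are made-up illustrations rather than recommended settings:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # 50 training examples, 3 inputs (bias omitted for brevity)
t = X @ np.array([2.0, -1.0, 0.5])        # targets from an exactly linear target function
eta = 0.01

# Standard (batch) gradient descent: sum the gradient over all examples, then update once.
w = np.zeros(3)
for _ in range(200):
    o = X @ w
    w -= eta * -(X.T @ (t - o))           # dE/dwi = sum_d (td - od)(-xid)

# Stochastic (incremental) gradient descent: delta-rule update after each example.
w_sgd = np.zeros(3)
for _ in range(200):
    for xd, td in zip(X, t):
        w_sgd += eta * (td - w_sgd @ xd) * xd   # wi <- wi + eta (t - o) xi

print(w, w_sgd)                           # both approach the true weights [2, -1, 0.5]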
Remarks
We have considered two similar algorithms for iteratively learning perceptron weights.
The key difference between these algorithms is that the perceptron training rule updates
weights based on the error in the thresholded perceptron output, whereas the delta rule
updates weights based on the error in the unthresholded linear combination of inputs.
The difference between these two training rules is reflected in different convergence
properties. The perceptron training rule converges after a finite number of iterations to
a hypothesis that perfectly classifies the training data, provided the training examples
are linearly separable. The delta rule converges only asymptotically toward the
minimum error hypothesis, possibly requiring unbounded time, but converges
regardless of whether the training data are linearly separable.
A third possible algorithm for learning the weight vector is linear programming.
Linear programming is a general, efficient method for solving sets of linear inequalities.
Notice each training example corresponds to an inequality of the form w · x > 0 or
w · x ≤ 0, and the solution of this set of inequalities is the desired weight vector.
Unfortunately, this approach yields a
solution only when the training examples are linearly separable; however, Duda and Hart
(1973, p. 168) suggest a more subtle formulation that accommodates the nonseparable
case. In any case, the approach of linear programming does not scale to training
multilayer networks, which is our primary concern. In contrast, the gradient descent
approach, on which the delta rule is based, can be easily extended to multilayer networks.
(i.e., "hid," "had," "head," "hood," etc.). The input speech signal is represented by two
numerical parameters obtained from a spectral analysis of the sound, allowing us to
easily visualize the decision surface over the two-dimensional instance space. As shown
in the figure, it is possible for the multilayer network to represent highly nonlinear
decision surfaces that are much more expressive than the linear decision surfaces of
single units shown earlier.
This section discusses how to learn such multilayer networks using a gradient descent
algorithm similar to that discussed in the previous section.
Decision regions of a multilayer feedforward network. The network shown here was trained to
recognize 1 of 10 vowel sounds occurring in the context "h_d" (e.g., "had," "hid"). The network input
consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network
outputs correspond to the 10 possible vowel sounds. The network prediction is the output whose value
is highest. The plot on the right illustrates the highly nonlinear decision surface represented by the
learned network. Points shown on the plot are test examples distinct from the examples used to train
the network.
What type of unit shall we use as the basis for constructing multilayer networks? At first,
we might be tempted to choose the linear units discussed in the previous section, for
which we have already derived a gradient descent learning rule. However, multiple layers
of cascaded linear units still produce only linear functions, and we prefer networks
capable of representing highly nonlinear functions. The perceptron unit is another
possible choice, but its discontinuous threshold makes it undifferentiable and hence
unsuitable for gradient descent. What we need is a unit whose output is a nonlinear
function of its inputs, but whose output is also a differentiable function of its inputs. One
solution is the sigmoid unit, a unit very much like a perceptron, but based on a
smoothed, differentiable threshold function.
The sigmoid unit is illustrated in the figure below. Like the perceptron, the sigmoid unit first
computes a linear combination of its inputs, then applies a threshold to the result.
In the case of the sigmoid unit, however, the threshold output is a continuous function of
its input. More precisely, the sigmoid unit computes its output o as

o = σ(w · x)
where
σ(y) = 1 / (1 + e^(-y))

σ is often called the sigmoid function or, alternatively, the logistic function. Note
its output ranges between 0 and 1, increasing monotonically with its input. Because it
maps a very large input domain to a small range of outputs, it is often referred to as the
squashing function of the unit. The sigmoid function has the useful property that its
derivative is easily expressed in terms of its output; in particular,

dσ(y)/dy = σ(y) (1 - σ(y))
As we shall see, the gradient descent learning rule makes use of this derivative. Other
differentiable functions with easily calculated derivatives are sometimes used in place of
σ. For example, the term e^(-y) in the sigmoid function definition is sometimes replaced by
e^(-ky), where k is some positive constant that determines the steepness of the threshold.
The function tanh is also sometimes used in place of the sigmoid function.
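A short sketch of the sigmoid unit and the derivative property just mentioned; the steepness constant k and the tanh call illustrate the alternatives discussed above:

import numpy as np

def sigmoid(y, k=1.0):
    # Logistic "squashing" function; k controls the steepness of the threshold
    return 1.0 / (1.0 + np.exp(-k * y))

def sigmoid_derivative(y):
    s = sigmoid(y)
    return s * (1.0 - s)                              # d sigma/dy = sigma(y)(1 - sigma(y))

def sigmoid_unit(w, x):
    # w[0] is the bias weight w0; the output sigma(w . x) lies in (0, 1)
    return sigmoid(w[0] + np.dot(w[1:], x))

print(sigmoid_unit(np.array([-0.3, 0.5, 0.5]), np.array([1.0, 0.0])))
print(sigmoid_derivative(0.2), np.tanh(0.2))          # tanh is a common alternative squashing function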
The Backpropagation Algorithm
The Backpropagation algorithm learns the weights for a multilayer network, given a
network with a fixed set of units and interconnections. It employs gradient descent to
attempt to minimize the squared error between the network output values and the target
values for these outputs. Because we are considering networks with multiple output units
rather than single units as before, we begin by redefining E to sum the errors over all of
the network output units

E(w) ≡ ½ Σd∈D Σk∈outputs (tkd - okd)²

where outputs is the set of output units in the network, and tkd and okd are the target and
output values associated with the kth output unit and training example d.
The learning problem faced by backpropagation is to search a large hypothesis space
defined by all possible weight values for all the units in the network.
The situation can be visualized in terms of an error surface similar to the parabolic error
surface shown earlier for linear units.
The error in that diagram is replaced by our new definition of E, and the other dimensions
of the space correspond now to all of the weights associated with all of the units in the
network. As in the case of training a single unit, gradient descent can be used to attempt
to find a hypothesis to minimize E.
One major difference in the case of multilayer networks is that the error surface can have
multiple local minima, in contrast to the single-minimum parabolic error surface shown
in previous Figure. Unfortunately, this means that gradient descent is guaranteed only to
converge toward some local minimum, and not necessarily the global minimum error.
Despite this obstacle, in practice Back Propagation has been found to produce excellent
results in many real-world applications.
The stochastic gradient descent version of the Back-Propagation algorithm for feedforward networks
containing two layers of sigmoid units.
The Back-Propagation algorithm is presented in the table above. The algorithm as described
here applies to layered feedforward networks containing two layers of sigmoid units,
with units at each layer connected to all units from the preceding layer. This is the
incremental, or stochastic, gradient descent version of Backpropagation. The notation
used here is the same as that used in earlier sections, with the following extensions:
• An index (e.g., an integer) is assigned to each node in the network, where a "node"
is either an input to the network or the output of some unit in the network.
• xji denotes the input from node i to unit j, and wji denotes the corresponding
weight.
• δn denotes the error term associated with unit n. It plays a role analogous to the
quantity (t - o) in our earlier discussion of the delta training rule. As we shall see
later, δn = -∂Ed/∂netn, where netn is the weighted sum of inputs to unit n.
Notice the algorithm in the table above begins by constructing a network with the desired
number of hidden and output units and initializing all network weights to small random
values. Given this fixed network structure, the main loop of the algorithm then repeatedly
iterates over the training examples. For each training example, it applies the network to
the example, calculates the error of the network output for this example, computes the
gradient with respect to the error on this example, then updates all weights in the
network. This gradient descent step is iterated (often thousands of times, using the same
training examples multiple times) until the network performs acceptably well.
The gradient descent weight-update rule (Equation T4.5 in the table above) is similar to
the delta training rule. Like the delta rule, it updates each weight in proportion to the
learning rate η, the input value xji to which the weight is applied, and the error in the
output of the unit. The only difference is that the error (t - o) in the delta rule is replaced
by a more complex error term δj. The exact form of δj follows from the derivation of the
weight-tuning rule given in the section on the derivation of the Backpropagation rule.
To understand it intuitively, first consider how δk is computed for each network output
unit k. δk is simply the familiar (tk - ok) from the delta rule, multiplied by the factor ok(1
- ok), which is the derivative of the sigmoid squashing function. The δh value for each
hidden unit h has a similar form (Equation T4.4 in the algorithm). However, since
training examples provide target values tk only for network outputs, no target values are
directly available to indicate the error of hidden units' values.
Instead, the error term for hidden unit h is calculated by summing the error terms δk of
each output unit influenced by h, weighting each δk by wkh, the weight from
hidden unit h to output unit k. This weight characterizes the degree to which hidden unit
h is "responsible for" the error in output unit k.
The algorithm in the table above updates weights incrementally, following the presentation
of each training example. This corresponds to a stochastic approximation to gradient
descent. To obtain the true gradient of E, one would sum the δj xji values over all training
examples before altering weight values.
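The following Python sketch implements one version of this stochastic update for a network with a single layer of hidden sigmoid units and a layer of sigmoid output units. The bias handling, layer sizes, learning rate, and the XOR-style toy data are illustrative assumptions rather than part of the algorithm table itself:

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_update(x, t, w_hidden, w_output, eta=0.5):
    # One stochastic gradient descent step for a two-layer sigmoid network.
    x = np.append(1.0, x)                               # constant bias input x0 = 1
    h = sigmoid(w_hidden @ x)                           # hidden unit outputs
    hb = np.append(1.0, h)                              # bias input for the output layer
    o = sigmoid(w_output @ hb)                          # network outputs

    delta_o = o * (1.0 - o) * (t - o)                   # delta_k = ok(1 - ok)(tk - ok)
    delta_h = h * (1.0 - h) * (w_output[:, 1:].T @ delta_o)   # delta_h = oh(1 - oh) sum_k wkh delta_k

    w_output += eta * np.outer(delta_o, hb)             # wji <- wji + eta * delta_j * xji
    w_hidden += eta * np.outer(delta_h, x)

rng = np.random.default_rng(1)
w_hidden = rng.normal(scale=0.5, size=(3, 3))           # 3 hidden units, 2 inputs + bias
w_output = rng.normal(scale=0.5, size=(1, 4))           # 1 output unit, 3 hidden inputs + bias
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]   # XOR targets

for _ in range(10000):                                  # iterate over the examples many times
    for x, t in data:
        backprop_update(np.asarray(x, float), np.asarray(t, float), w_hidden, w_output)

for x, _ in data:
    h = sigmoid(w_hidden @ np.append(1.0, x))
    print(x, sigmoid(w_output @ np.append(1.0, h)))     # outputs typically approach 0, 1, 1, 0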
The weight-update loop in BACKPROPAGATION may be iterated thousands of times in a
typical application. A variety of termination conditions can be used to halt the procedure.
One may choose to halt after a fixed number of iterations through the loop, or once the
error on the training examples falls below some threshold, or once the error on a separate
validation set of examples meets some criterion.
The choice of termination criterion is an important one, because too few iterations can
fail to reduce error sufficiently, and too many can lead to overfitting the training data.
Adding Momentum
Because Back Propagation is such a widely used algorithm, many variations have been
developed. Perhaps the most common is to alter the weight-update rule in the algorithm
by making the weight update on the nth iteration depend partially on the update that
occurred during the (n - 1)th iteration, as follows:

Δwji(n) = η δj xji + α Δwji(n - 1)

Here Δwji(n) is the weight update performed during the nth iteration through the main
loop of the algorithm, and 0 ≤ α < 1 is a constant called the momentum. Notice the first
term on the right of this equation is just the weight-update rule of Equation (T4.5) in the
Back-Propagation algorithm. The second term on the right is new and is called the
momentum term. To see the effect of this momentum term, consider that the gradient
descent search trajectory is analogous to that of a (momentumless) ball rolling down the
error surface. The effect of α is to add momentum that tends to keep the ball rolling
in the same direction from one iteration to the next. This can sometimes have the effect
of keeping the ball rolling through small local minima in the error surface, or along flat
regions in the surface where the ball would stop if there were no momentum. It also has
the effect of gradually increasing the step size of the search in regions where the gradient
is unchanging, thereby speeding convergence.
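A minimal sketch of the momentum-modified update (variable names are illustrative; compare it with the plain rule Δwji(n) = η δj xji):

import numpy as np

def momentum_update(weights, prev_update, delta, x, eta=0.05, alpha=0.9):
    # dw(n) = eta * delta_j * x_ji + alpha * dw(n - 1)
    update = eta * delta * x + alpha * prev_update
    weights += update
    return update                                  # becomes dw(n-1) on the next iteration

w = np.zeros(3)
prev = np.zeros(3)
# Two consecutive iterations with made-up error terms and inputs:
for delta, x in [(0.4, np.array([1.0, 0.5, -1.0])), (0.1, np.array([1.0, -0.2, 0.3]))]:
    prev = momentum_update(w, prev, delta, x)
print(w)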
Learning in Arbitrary Acyclic Networks
The Backpropagation algorithm in the table above applies to two-layer networks, but it is
easily generalized to feedforward networks of arbitrary depth. The only change required
is to the procedure for computing δ values: the δ value for a unit r in layer m is computed
from the δ values at the next deeper layer m + 1 according to

δr = or (1 - or) Σs∈layer m+1 wsr δs

Notice this is identical to Step 3 in the Backpropagation algorithm in the table, so all
we are really saying here is that this step may be repeated for any number of hidden
layers in the network.
It is equally straightforward to generalize the algorithm to any directed acyclic graph,
regardless of whether the network units are arranged in uniform layers as we have
assumed up to now. In the case that they are not, the rule for calculating δ for any
internal unit (i.e., any unit that is not an output) is

δr = or (1 - or) Σs∈Downstream(r) wsr δs

where Downstream(r) is the set of units immediately downstream from unit r in the
network: that is, all units whose inputs include the output of unit r.
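The sketch below shows one way this rule might be applied to an arbitrary acyclic network, computing δr for each internal unit from the δ values of the units downstream of it; the data structures and names are illustrative assumptions:

def internal_deltas(internal_units, downstream, weights, outputs, delta):
    # internal_units: ids in an order such that every unit's downstream deltas are computed first
    # downstream[r]: ids of units whose inputs include the output of unit r
    # weights[(s, r)]: weight on the connection from unit r into unit s
    # outputs[r]: output value o_r from the forward pass; delta: dict of already-known error terms
    for r in internal_units:
        delta[r] = outputs[r] * (1.0 - outputs[r]) * sum(
            weights[(s, r)] * delta[s] for s in downstream[r])
    return delta

# Example: hidden unit "h" feeds two output units whose deltas are already known.
deltas = internal_deltas(
    internal_units=["h"],
    downstream={"h": ["k1", "k2"]},
    weights={("k1", "h"): 0.7, ("k2", "h"): -0.2},
    outputs={"h": 0.6},
    delta={"k1": 0.05, "k2": -0.01})
print(deltas["h"])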
Derivation of the Backpropagation Rule
This section presents the derivation of the Backpropagation weight-tuning rule. The
specific problem we address here is deriving the stochastic gradient descent rule
implemented by the Backpropagation algorithm. Recall from previous sections that
stochastic gradient descent involves iterating through the training examples one at a
time, for each training example d descending the gradient of the error Ed with respect to
this single example. In other words, for each training example d every weight wji is
updated by adding to it Δwji, where

Δwji = -η ∂Ed/∂wji

and Ed is the error on training example d, summed over all output units in the network

Ed(w) ≡ ½ Σk∈outputs (tk - ok)²

Here outputs is the set of output units in the network, tk is the target value of unit k for
training example d, and ok is the output of unit k given training example d.
Given the above equation, our remaining task is to derive a convenient expression for
∂Ed/∂wji. Notice that weight wji can influence the rest of the network only through netj,
the weighted sum of inputs into unit j, so by the chain rule ∂Ed/∂wji = (∂Ed/∂netj) xji.
We consider two cases in turn: the case where unit j is an output unit for the network,
and the case where j is an internal unit.
Case 1: training rule for output unit weights. Since netj can influence the network only
through the output oj of unit j, we can invoke the chain rule again to write

∂Ed/∂netj = (∂Ed/∂oj)(∂oj/∂netj)     (1)

Consider just the first term in Equation (1), ∂Ed/∂oj = ∂/∂oj [½ Σk∈outputs (tk - ok)²].
The derivatives ∂(tk - ok)²/∂oj will be zero for all output units k except when k = j. We
therefore drop the summation over output units and simply set k = j, giving

∂Ed/∂oj = ∂/∂oj [½ (tj - oj)²] = -(tj - oj)     (2)
Next consider the second term in Equation (1). Since oj = σ(netj), the derivative
∂oj/∂netj is just the derivative of the sigmoid function, which we have already noted is equal
to σ(netj)(1 - σ(netj)). Therefore,

∂oj/∂netj = oj (1 - oj)     (3)
Substituting expressions (2) and (3) into (1), we obtain

∂Ed/∂netj = -(tj - oj) oj (1 - oj)

and combining this with the equations above, we have the stochastic gradient descent
rule for output units

Δwji = -η ∂Ed/∂wji = η (tj - oj) oj (1 - oj) xji

Case 2: training rule for hidden unit weights. In the case where j is an internal (hidden)
unit of the network, a similar derivation, which sums the influence of netj over all units in
Downstream(j), yields

δj = oj (1 - oj) Σk∈Downstream(j) δk wkj
and
Δwji = η δj xji

which is precisely the general rule given in the section on arbitrary acyclic networks for
updating internal unit weights in arbitrary acyclic directed graphs.
Reference: Tom M. Mitchell, Machine Learning, India Edition 2013, McGraw Hill Education.