CSD311: Artificial Intelligence

Neural networks I
- Neural networks are non-linear function approximators that learn the approximation from training data, that is, a set of input-output pairs.
- The basic neural network consists of 3 layers of units (usually called neurons) with interconnections between successive layers. There are no connections between units in the same layer or between the first and third layers. The three layers are called (1) input, (2) hidden and (3) output. See the figure later for details.
- Each connection has a weight $w_{ij}$ and the input to a unit, say $j$, is a linear combination $\sum_{i \in (\text{units connected to } j)} w_{ij} x_i + w_{0j}$, where $x_i$ is the output of unit $i$ and $w_{0j}$ is a bias term (like a threshold).
Neural networks II
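Figure: A single unit $j$. The inputs $x_1, \dots, x_i, \dots, x_d$ arrive with weights $w_{1j}, \dots, w_{ij}, \dots, w_{dj}$, together with a bias input 1 with weight $w_{0j}$. The unit forms $\sum_{i=1}^{d} w_{ij} x_i + w_{0j}$ and outputs $f_j\left(\sum_{i=1}^{d} w_{ij} x_i + w_{0j}\right)$.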
- The output from a unit $j$ is a non-linear function $f_j$ (called the activation function) of the input. That is, the output of unit $j$, $y_j$, is:
$$y_j = f_j(\mathrm{net}_j) = f_j\left(\sum_{i=1}^{d} w_{ij} x_i + w_{0j}\right)$$
Notationally, we represent the net input to unit $j$ as $\mathrm{net}_j = \sum_{i=0}^{d} w_{ij} x_i$, where $x_0 = 1$ for the bias, and the output of $j$ is $y_j = f_j(\mathrm{net}_j)$.
- Typical activation functions are: sigmoid ($\frac{1}{1+e^{-x}}$), tanh ($\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$), and ReLU (rectified linear unit, $\max(0, x)$), which is widely used in deep neural nets.
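The three activation functions and the output of a single unit can be sketched in a few lines of Python. This is only an illustration: the function names and the convention that `weights[0]` holds the bias weight $w_{0j}$ are assumptions made here, not part of the slides.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^(-x)). May overflow for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent (e^x - e^(-x)) / (e^x + e^(-x))."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit max(0, x)."""
    return max(0.0, x)

def unit_output(weights, inputs, f):
    """Output y_j = f(net_j) of one unit.

    Assumed layout: weights[0] is the bias weight w_0j and
    weights[1:] pair up with the inputs x_1 .. x_d.
    """
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return f(net)

# net_j = 0.5 + 1.0 * 2.0 + (-1.0) * 3.0 = -0.5
y = unit_output([0.5, 1.0, -1.0], [2.0, 3.0], sigmoid)
```

Swapping `sigmoid` for `relu` in the last line gives an output of 0, because the net input is negative.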
Diagram of neural network with one hidden layer
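Figure: Three columns of units labelled Input, Hidden and Output. The input units take $x_1, x_2, \dots, x_i, \dots, x_{d-1}, x_d$, the hidden units are numbered $1, \dots, j, \dots, m$, and the output units $1, \dots, k, \dots, K$ produce $z_1, \dots, z_k, \dots, z_K$. Weights are labelled by the units they connect, for example $w_{11}$ and $w_{dm}$ between the input and hidden layers, and $w_{1K}$, $w_{jK}$ and $w_{mK}$ between the hidden and output layers.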
Note: Each neuron in a layer is connected to all neurons in the next layer. This is not shown to avoid clutter. The above diagram will be used in the derivation later.
Feedforward operation
- In feedforward operation the output of neuron $j$ is:
$$o_j = f_j\left(\sum_{i \in (\text{all inputs to } j)} w_{ij} o_i + w_{0j}\right)$$
where $o_i$ is the output of neuron $i$ in the previous layer which feeds into $j$, and $f_j$ is the activation function of neuron $j$.
- The outputs $z_1$ to $z_K$ are combined/interpreted suitably to get either a class label or a regressed value. For regression the output layer normally has just one neuron.
- Notation: We use $y_j$ for the output of neuron $j$ in the hidden layer and $z_k$ for the output of neuron $k$ in the output layer. Let the number of neurons in the input layer be $d$, in the hidden layer $m$ and in the output layer $K$. Also, subscripts $i$, $j$ and $k$ will be used for the input, hidden and output layers respectively.
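A minimal sketch of the feedforward pass for a network with $d$ inputs, $m$ hidden units and $K$ outputs follows. The data layout (one list of weights per unit, bias weight first), the use of the sigmoid in both layers, and the example weights are assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(W, inputs, f):
    """Outputs of one layer. W[j] holds the weights into unit j,
    with W[j][0] the bias weight (assumed layout)."""
    outputs = []
    for w in W:
        net = w[0] + sum(wi * oi for wi, oi in zip(w[1:], inputs))
        outputs.append(f(net))
    return outputs

def feedforward(W_hidden, W_output, x, f=sigmoid):
    """Return (y, z): the m hidden outputs and the K network outputs."""
    y = layer_forward(W_hidden, x, f)
    z = layer_forward(W_output, y, f)
    return y, z

# Toy network with d = 2, m = 2, K = 1 and hand-picked weights.
W_hidden = [[0.0, 1.0, 1.0], [0.0, 1.0, -1.0]]
W_output = [[0.0, 1.0, -1.0]]
y, z = feedforward(W_hidden, W_output, [1.0, 1.0])

# For classification, one common reading of the outputs is "largest z_k wins".
predicted_class = max(range(len(z)), key=lambda k: z[k])
```

With a single output neuron, as in regression, `z[0]` would be read directly as the predicted value.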
Learning via backpropagation
- Learning in a 3NN (a three-layer network with one hidden layer) happens by adjusting the weights by gradient descent on a loss function (e.g. squared loss) as input patterns are fed one by one to the network.
- In online learning an update is done after each input vector is presented. In the batch version the individual vector updates are accumulated and the update is done after all the input patterns have been fed to the network. The online version is much more commonly used. Since the advent of deep networks, mini-batches are almost always used because data sets are very large.
- The adjustment happens layer by layer, starting with the weights between the hidden layer and the output layer, with the error being backpropagated to adjust the next layer of weights (those between the input layer and the hidden layer).
- At the start all weights are randomly initialized to small values (say between −1 and 1, avoiding 0).
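The initialization and the online, mini-batch and batch schedules can be sketched as follows. The helper names `init_weights` and `batches` are hypothetical, and drawing uniformly from $[-1, 1]$ with a redraw on an exact 0 is just one way to read "small values avoiding 0".

```python
import random

def init_weights(n_units, n_inputs, rng, low=-1.0, high=1.0):
    """Weights for n_units units, each with a bias weight plus n_inputs weights.

    Values are drawn uniformly from [low, high]; an exact 0 is redrawn.
    """
    def draw():
        w = rng.uniform(low, high)
        while w == 0.0:
            w = rng.uniform(low, high)
        return w
    return [[draw() for _ in range(n_inputs + 1)] for _ in range(n_units)]

def batches(patterns, batch_size, rng):
    """Yield the patterns of one epoch in random order, batch_size at a time."""
    order = list(patterns)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

rng = random.Random(0)
W_hidden = init_weights(n_units=3, n_inputs=2, rng=rng)

data = list(range(10))                        # stand-in for 10 training patterns
online = list(batches(data, 1, rng))          # one update per pattern
mini = list(batches(data, 4, rng))            # batches of sizes 4, 4 and 2
full = list(batches(data, len(data), rng))    # one update per epoch
```

Seen this way, the three schedules differ only in how many patterns contribute to each weight update.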
- For the output layer define the loss function as:
$$E(w) = \frac{1}{2}\sum_{k=0}^{K} (t_k - z_k)^2$$
where $t_k$ is the desired output from neuron $k$ in the output layer and $z_k$ is the actual output after a feedforward step. The gradient descent weight update is $\Delta w = -\eta \frac{\partial E}{\partial w}$, where $\eta$ is the learning parameter.
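- Define $\mathrm{net}_k = \sum_{j=0}^{m} w_{jk} y_j$, where $m$ is the number of units in the previous (hidden) layer connected to unit $k$ and $y_0 = 1$ for the bias. Then $z_k = f_k(\mathrm{net}_k)$ and $E(w) = \frac{1}{2}\sum_{k=0}^{K} (t_k - f_k(\mathrm{net}_k))^2$. For a single weight $w_{jk}$ we have:
$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial \mathrm{net}_k}\,\frac{\partial \mathrm{net}_k}{\partial w_{jk}}$$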
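- $\frac{\partial E}{\partial \mathrm{net}_k} = -(t_k - z_k) f_k'(\mathrm{net}_k)$ and $\frac{\partial \mathrm{net}_k}{\partial w_{jk}} = y_j$. So, the update for $w_{jk}$ is:
$$\Delta w_{jk} = \eta(\text{epoch})\,(t_k - z_k)\, f_k'(\mathrm{net}_k)\, y_j$$
Assuming $f_k$ is differentiable we can get $f'$ and consequently calculate $\Delta w_{jk}$.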
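The output-layer update can be written out directly from the formula for $\Delta w_{jk}$. The sketch below assumes sigmoid output units, for which $f'(\mathrm{net}) = f(\mathrm{net})(1 - f(\mathrm{net}))$, and takes `eta` as a plain number; the source's $\eta(\text{epoch})$ suggests the value may change from epoch to epoch, which is not modelled here. The function name and weight layout are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(net):
    """Derivative of the sigmoid: f'(net) = f(net) * (1 - f(net))."""
    s = sigmoid(net)
    return s * (1.0 - s)

def output_layer_update(W_output, y, t, eta):
    """Return (delta_W, deltas) for the hidden-to-output weights.

    Assumed layout: W_output[k][0] is the bias weight w_0k and
    W_output[k][j] is w_jk for hidden unit j >= 1.
    delta_W[k][j] = eta * (t_k - z_k) * f'(net_k) * y_j
    """
    y_ext = [1.0] + list(y)              # y_0 = 1 feeds the bias weight
    delta_W, deltas = [], []
    for k, w in enumerate(W_output):
        net_k = sum(w_jk * y_j for w_jk, y_j in zip(w, y_ext))
        z_k = sigmoid(net_k)
        delta_k = (t[k] - z_k) * sigmoid_prime(net_k)
        deltas.append(delta_k)
        delta_W.append([eta * delta_k * y_j for y_j in y_ext])
    return delta_W, deltas

# One output unit with all-zero weights: net = 0, z = 0.5, f'(net) = 0.25.
W_output = [[0.0, 0.0, 0.0]]
dW, deltas = output_layer_update(W_output, y=[1.0, 0.5], t=[1.0], eta=0.5)
```

The per-unit term $(t_k - z_k) f_k'(\mathrm{net}_k)$ is returned separately because the same quantity appears again in the update for the first layer of weights.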
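- To find the update $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$ for the first layer of weights the calculation is more involved. We now want to find $\frac{\partial E}{\partial w_{ij}}$. For neuron $j$ in the hidden layer we have $y_j = f_j(\mathrm{net}_j)$ and $\mathrm{net}_j = \sum_{i=0}^{d} w_{ij} x_i$. Also, by the chain rule:
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial \mathrm{net}_j}\,\frac{\partial \mathrm{net}_j}{\partial w_{ij}}$$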
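- $\frac{\partial \mathrm{net}_j}{\partial w_{ij}} = x_i$ and $\frac{\partial y_j}{\partial \mathrm{net}_j} = f_j'(\mathrm{net}_j)$
$$
\begin{aligned}
\frac{\partial E}{\partial y_j} &= \frac{\partial \left[\frac{1}{2}\sum_{k=0}^{K} (t_k - z_k)^2\right]}{\partial y_j} \\
&= -\sum_{k=0}^{K} (t_k - z_k)\,\frac{\partial z_k}{\partial y_j} \\
&= -\sum_{k=0}^{K} (t_k - z_k)\,\frac{\partial z_k}{\partial \mathrm{net}_k}\,\frac{\partial \mathrm{net}_k}{\partial y_j} \\
&= -\sum_{k=0}^{K} (t_k - z_k)\, f_k'(\mathrm{net}_k)\, w_{jk}
\end{aligned}
$$
So,
$$\Delta w_{ij} = \eta\, f_j'(\mathrm{net}_j)\, x_i \sum_{k=0}^{K} (t_k - z_k)\, f_k'(\mathrm{net}_k)\, w_{jk}$$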
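One way to gain confidence in the two derived gradients is to compare them with a finite-difference estimate of $\partial E / \partial w$. The sketch below assumes sigmoid units in both layers and a bias weight stored at index 0 of each unit's weight list; the function names and the numeric values are made up for the check.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(W1, W2, x):
    """W1[j] and W2[k] hold one unit's weights, index 0 being the bias weight."""
    x_ext = [1.0] + list(x)                                   # x_0 = 1
    y_ext = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, x_ext))) for row in W1]
    z = [sigmoid(sum(w * v for w, v in zip(row, y_ext))) for row in W2]
    return x_ext, y_ext, z

def loss(W1, W2, x, t):
    z = forward(W1, W2, x)[2]
    return 0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, z))

def backprop_gradients(W1, W2, x, t):
    """dE/dW1 and dE/dW2 from the derived formulas (sigmoid: f' = f(1 - f))."""
    x_ext, y_ext, z = forward(W1, W2, x)
    delta_k = [(tk - zk) * zk * (1.0 - zk) for tk, zk in zip(t, z)]
    grad_W2 = [[-dk * yj for yj in y_ext] for dk in delta_k]
    grad_W1 = []
    for j in range(len(W1)):
        yj = y_ext[j + 1]
        # sum over output units of (t_k - z_k) f_k'(net_k) w_jk
        back = sum(delta_k[k] * W2[k][j + 1] for k in range(len(W2)))
        grad_W1.append([-yj * (1.0 - yj) * xi * back for xi in x_ext])
    return grad_W1, grad_W2

def numeric_gradient(W1, W2, x, t, W, r, c, eps=1e-6):
    """Central-difference estimate of dE/dW[r][c]; W must be W1 or W2 itself."""
    old = W[r][c]
    W[r][c] = old + eps
    e_plus = loss(W1, W2, x, t)
    W[r][c] = old - eps
    e_minus = loss(W1, W2, x, t)
    W[r][c] = old
    return (e_plus - e_minus) / (2.0 * eps)

W1 = [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.6]]     # d = 2 inputs, m = 2 hidden units
W2 = [[0.05, -0.7, 0.3]]                       # K = 1 output unit
x, t = [1.0, 0.5], [1.0]
grad_W1, grad_W2 = backprop_gradients(W1, W2, x, t)
check_W1 = numeric_gradient(W1, W2, x, t, W1, 0, 1)
check_W2 = numeric_gradient(W1, W2, x, t, W2, 0, 2)
```

The weight updates are then $\Delta w = -\eta$ times these gradients, which reproduces the expressions for $\Delta w_{jk}$ and $\Delta w_{ij}$.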
Training algorithm for 3NN
Algorithm 0.20: Train3NN(wi, wj, η, stopFn)
comment: Online version. wi, wj are the first and second layers of weights; L is the learning data set, restored in full at the start of each pass of the outer loop.

wi, wj ← initialize randomly
while (stopFn is false)
  do while (L ≠ ∅)
       do x ← choose randomly from L without replacement
          Run network in feedforward mode
          wj ← update second layer of weights using update eqn
          wi ← update first layer of weights using update eqn
return (wi, wj)
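A runnable sketch of the online training loop is given below. It is not the course's reference implementation: it assumes sigmoid units throughout, uses a fixed number of epochs as the stopping function, and trains on a toy AND data set invented for the example. The hidden-layer terms are computed before the second-layer weights are changed, so that both updates use the weights from the same feedforward step.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(W1, W2, x):
    x_ext = [1.0] + list(x)                                   # x_0 = 1 (bias)
    y_ext = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, x_ext))) for row in W1]
    z = [sigmoid(sum(w * v for w, v in zip(row, y_ext))) for row in W2]
    return x_ext, y_ext, z

def total_error(W1, W2, data):
    """Sum of the squared loss E over all patterns."""
    return sum(0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, forward(W1, W2, x)[2]))
               for x, t in data)

def train_3nn(data, m, eta, max_epochs, rng):
    d, K = len(data[0][0]), len(data[0][1])
    W1 = [[rng.uniform(-1, 1) for _ in range(d + 1)] for _ in range(m)]
    W2 = [[rng.uniform(-1, 1) for _ in range(m + 1)] for _ in range(K)]
    history = [total_error(W1, W2, data)]
    for _ in range(max_epochs):              # stopping rule: epoch limit only
        L = list(data)
        rng.shuffle(L)                       # random order, without replacement
        for x, t in L:
            x_ext, y_ext, z = forward(W1, W2, x)
            delta_k = [(tk - zk) * zk * (1 - zk) for tk, zk in zip(t, z)]
            # hidden-layer terms, using W2 as it was during the forward pass
            delta_j = [y_ext[j + 1] * (1 - y_ext[j + 1]) *
                       sum(delta_k[k] * W2[k][j + 1] for k in range(K))
                       for j in range(m)]
            for k in range(K):               # second layer of weights
                for j in range(m + 1):
                    W2[k][j] += eta * delta_k[k] * y_ext[j]
            for j in range(m):               # first layer of weights
                for i in range(d + 1):
                    W1[j][i] += eta * delta_j[j] * x_ext[i]
        history.append(total_error(W1, W2, data))
    return W1, W2, history

# Toy data (assumption): the logical AND of two binary inputs.
and_data = [([0, 0], [0]), ([0, 1], [0]), ([1, 0], [0]), ([1, 1], [1])]
W1, W2, history = train_3nn(and_data, m=2, eta=0.5, max_epochs=2000,
                            rng=random.Random(0))
```

`history` holds the total error before training and after every epoch, which is the quantity one would plot against epochs when choosing a stopping point.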
Stopping criteria
An epoch is one exposure of each item in the learning data set. Training
is normally counted in number of epochs. Stopping of training can be
done based on the following:
- Use a validation set. Typical error behaviour is shown below:
Figure: Error versus epochs
- Use a threshold for $E$.
- Use a threshold on the number of epochs.
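The three criteria can be combined into a single stopping function, sketched below. The name `make_stop_fn` and the `patience` parameter (how many epochs to wait after the best validation error before stopping) are assumptions introduced for this example; the slides only say that a validation set is used.

```python
def make_stop_fn(max_epochs, error_threshold, patience):
    """Build stop_fn(train_errors, val_errors) -> bool.

    Both arguments are lists with one error value per completed epoch.
    Training stops when any of the three criteria is met.
    """
    def stop_fn(train_errors, val_errors):
        # Criterion: threshold on the number of epochs.
        if len(train_errors) >= max_epochs:
            return True
        # Criterion: threshold on the training error E.
        if train_errors and train_errors[-1] <= error_threshold:
            return True
        # Criterion: validation error has not improved for `patience` epochs.
        if len(val_errors) > patience:
            best_epoch = val_errors.index(min(val_errors))
            if len(val_errors) - 1 - best_epoch >= patience:
                return True
        return False
    return stop_fn

stop = make_stop_fn(max_epochs=100, error_threshold=0.01, patience=3)
keep_going = stop([0.5, 0.4], [0.6, 0.5])                       # nothing triggered
overfitting = stop([0.5, 0.4, 0.3, 0.3, 0.3],
                   [0.6, 0.4, 0.45, 0.5, 0.55])                 # validation error rising
```

In the second call the validation error was lowest at the second epoch and has risen for three epochs since, so the function signals a stop even though the training error is still above its threshold.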
