Neural Networks
Professor: Dan Roth Scribe: C. Cheng, C. Cervantes
Overview
• Introduction
• Learning rules
• Over-training prevention
• Input-output coding
• Auto-associative networks
• Convolutional Neural Networks
• Recurrent Neural Networks
1 Introduction
Neural networks can be thought of as a robust approach to approximating real-valued, discrete-valued, and vector-valued target functions. They are particularly effective for complex, hard-to-interpret input data, and have had a lot of recent success in handwritten character recognition, speech recognition, object recognition, and some NLP problems.
We can write a neural network as a function
$$NN : X \rightarrow Y,$$
where $X$ can be a continuous space $[0,1]^n$ or a discrete space $\{0,1\}^n$, and $Y = [0,1]$ or $\{0,1\}$, correspondingly. In this way, it can be thought of as a classifier, but it can also be used to approximate other real-valued functions.
Neural networks themselves were named after – and inspired by – biological systems. However, there is actually very little connection between this architecture and anything we know (though we don't know a lot) about a real neural system. In essence, a neural network is a machine learning algorithm with a specific architecture.
We are currently on the rising part of a wave of interest in neural network architectures, after a long downtime from the mid-nineties, for multiple reasons. The wave came back in the last five years or so because of better computer architecture (GPUs, parallelism) and a lot more data than before. Only tiny algorithmic changes have been made since the late eighties for the optical character recognition (OCR) problem; the recent change has been driven by the architecture.
Interestingly, one emerging perspective on neural networks is that of intermediate representations. In the past, neural networks were thought of as one of a family of function approximators (perceptron, boosting, decision trees, etc.). Now, there is a belief that the hidden layers – that is, the intermediate neural network representations generated during learning – may be meaningful. Ideas are being developed on the value of these intermediate representations for transfer learning, etc.
In a linear function, we are interested in the basic unit $o_i = w \cdot x$: the dot product of the weights and the input, which gives an output. In neural networks, however, we want to introduce non-linearity to increase expressivity. If all units were linear, stacking them together would still be a linear function, and we would gain no expressivity.

One way to add this non-linearity is to take the sign of the dot product, such that $o_i = \mathrm{sgn}(w \cdot x)$. However, this unit would not be differentiable and thus would be inappropriate for gradient descent.
In neural networks, we must propagate error from the top of the network to the
bottom. To do so using gradient descent, we must use threshold units that are
differentiable.
The basic operation is the linear sum. The net input to a unit $j$ is defined as $net_j = \sum_i w_{ij} x_i$, and the output of the unit is given by
$$o_j = \frac{1}{1 + \exp(-(net_j - T_j))}.$$

Figure 2: A differentiable threshold unit
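As a concrete illustration, here is a minimal sketch of such a unit in Python; the function name, the example weights, and the threshold value are our own choices, not from the lecture.

```python
import numpy as np

def sigmoid_unit(x, w, T):
    """Differentiable threshold unit: weighted sum, then a sigmoid."""
    net = np.dot(w, x)                        # net_j = sum_i w_ij * x_i
    return 1.0 / (1.0 + np.exp(-(net - T)))   # o_j = 1 / (1 + exp(-(net_j - T_j)))

# Example: two inputs, threshold T = 0.5 (values made up for illustration)
x = np.array([1.0, 0.0])
w = np.array([0.8, 0.3])
print(sigmoid_unit(x, w, T=0.5))  # a smooth value strictly between 0 and 1
```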
In 1943, McCulloch and Pitts showed that linear threshold units were expressive, could be used to compute logical functions, and – by properly setting weights – could be used to build basic logic gates:
• AND: $w_{ij} = T_j/n$
• OR: $w_{ij} = T_j$
• NOT: use a negative weight
Given these basic gates, arbitrary logic circuits, finite-state machines, and computers can be built. Also, since DNF and CNF are universal representations, any Boolean function can be specified using a two-layer network (with negation).
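To make these gate settings concrete, here is a small sketch in Python; we assume a unit that fires exactly when its net input reaches the threshold $T_j$, and the inputs and threshold value are our own.

```python
import numpy as np

def ltu(x, w, T):
    """Linear threshold unit: fires (1) iff the weighted sum reaches T."""
    return 1 if np.dot(w, x) >= T else 0

n, T = 2, 1.0
for bits in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array(bits, dtype=float)
    and_out = ltu(x, np.full(n, T / n), T)       # AND: w_ij = T_j / n
    or_out  = ltu(x, np.full(n, T), T)           # OR:  w_ij = T_j
    not_out = ltu(x[:1], np.array([-T]), 0.0)    # NOT: negative weight on x_1
    print(bits, and_out, or_out, not_out)
```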
Learning came just a bit later. In 1949, Hebb suggested that if two units are both active (firing), then the weights between them should increase:
$$w_{ij} \leftarrow w_{ij} + R\, o_i o_j,$$
where $R$ is a constant called the learning rate, acting upon the product of the activations.
Following that, in 1959, Rosenblatt suggested that when a target output value
is provided for a single neuron with fixed input, it can incrementally change
weights and learn to produce the output using the perceptron learning rule.
This led to the perceptron learning algorithm, which is really the basic learning
machinery that we use in machine learning.
1.3 Learning Rules
If the neuron sees $x_i$ as input, and the output it produces is $o_j$, given the target output $t_j$ for the output unit, the perceptron learning algorithm updates weights according to
$$w_{ij} \leftarrow w_{ij} + R(t_j - o_j)x_i.$$
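A minimal sketch of this update in Python (the learning rate and example values are our own):

```python
import numpy as np

def perceptron_update(w, x, t, o, R=0.1):
    """One application of the rule  w_ij <- w_ij + R (t_j - o_j) x_i."""
    return w + R * (t - o) * x

w = np.zeros(3)
x = np.array([1.0, 0.5, -1.0])
w = perceptron_update(w, x, t=1, o=0)  # target 1, predicted 0: move toward x
print(w)                               # [ 0.1   0.05 -0.1 ]
```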
2.1 Intuition
So far, everything we have done has involved modifying the feature space; we start with a space in which a linear function cannot express the target, and move to a representation over which it can be expressed linearly.

One question that can be asked is: can the learning algorithm learn this expressive representation directly?
Decision trees are one family of algorithms that do this, and multi-layer neural networks are another. By stacking several layers of threshold elements, where each layer uses the output of the previous layer as input, a multi-layer neural network can overcome the expressivity limitation of a single threshold element.
2.2 Learning
It is easy to learn the top layer of a network, as it is just a linear unit; given the feedback (truth) at the top layer, the weights of the layer below can be updated using either the perceptron update rule or gradient descent, depending on which loss function is applied.
The intuition for why downstream weights can be updated in this way is given by the chain rule: if $y$ is a function of $x$, and $z$ is a function of $y$, then $z$ is a function of $x$. To differentiate $z$ with respect to $x$, then, we must also differentiate the intermediate function, such that
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x}.$$
In the case of neural networks, all the activation units are differentiable, as is the output of the network. Thus, if we can define an error function (e.g., sum of squares) that is a differentiable function of the output, we can evaluate the derivatives of this error with respect to the weights, and find weights that minimize the error using gradient descent or other methods. This method of propagating error from the top layers to the lower layers is called backpropagation.
3 Backpropagation
The backpropagation algorithm learns the weights for a multi-layer network, given a network with a fixed set of units and interconnections. The error function used here is the squared error (LMS). Other error functions could work, but here the learning rules are developed according to LMS. Since there can be multiple output units, we define the error as the sum over all the network output units:
$$Err(w) = \frac{1}{2} \sum_{d \in D} \sum_{k \in K} (t_{kd} - o_{kd})^2,$$
where $D$ is the set of training examples and $K$ is the set of output units.
This is used to derive the (global) learning rule, which performs gradient descent in the weight space in an attempt to minimize the error function:
$$\Delta w_{ij} = -R \frac{\partial E}{\partial w_{ij}}.$$
Here $t_k$ is the target output value of unit $k$ for training example $d$, and $o_k$ is the output of unit $k$ given training example $d$.
Notice that $w_{ij}$ can influence the output only through $net_j$, such that
$$net_j = \sum_i w_{ij} x_{ij},$$
where $x_{ij}$ is the $i$th input to unit $j$ (thus $x_{ij}$ comes from the previous layer of unit $j$).
Therefore, we can use the chain rule to write
$$\frac{\partial E_d}{\partial w_{ij}} = \frac{\partial E_d}{\partial net_j}\frac{\partial net_j}{\partial w_{ij}} = \frac{\partial E_d}{\partial net_j}\, x_{ij}.$$
Now our task is to derive a convenient expression for $\frac{\partial E_d}{\partial net_j}$. We consider two cases: the case where unit $j$ is an output unit in the network, and the case where unit $j$ is a hidden unit.
Just as $w_{ij}$ can influence the rest of the network only through $net_j$, $net_j$ can influence the network only through $o_j$. Therefore, we can again invoke the chain rule to write
$$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j}\frac{\partial o_j}{\partial net_j}.$$
Recall that $E_d = \frac{1}{2}\sum_{k \in K}(t_k - o_k)^2$, thus
$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j}\,\frac{1}{2}\sum_{k \in K}(t_k - o_k)^2.$$
The derivatives $\frac{\partial}{\partial o_j}(t_k - o_k)^2$ will be zero for all output units $k$ except when $k = j$. Therefore,
$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j}\,\frac{1}{2}(t_j - o_j)^2 = (t_j - o_j)\frac{\partial (t_j - o_j)}{\partial o_j} = -(t_j - o_j).$$
Next, consider the sigmoid function $y = \frac{1}{1 + \exp(-(x - T))}$; its derivative w.r.t. $x$ is given by
$$\frac{\partial y}{\partial x} = \frac{\exp(-(x - T))}{(1 + \exp(-(x - T)))^2} = y(1 - y).$$
Since $o_j = \frac{1}{1 + \exp(-(net_j - T_j))}$ is a sigmoid function, we have
$$\frac{\partial o_j}{\partial net_j} = o_j(1 - o_j).$$
Then, we have
$$\frac{\partial E_d}{\partial net_j} = -(t_j - o_j)\,o_j(1 - o_j).$$
Hence, the learning rule for the weights of the output units can be written as
$$\Delta w_{ij} = -R\frac{\partial E_d}{\partial w_{ij}} = R(t_j - o_j)\,o_j(1 - o_j)\,x_{ij},$$
or
$$\Delta w_{ij} = R\,\delta_j x_{ij},$$
where $\delta_j = (t_j - o_j)\,o_j(1 - o_j)$ depends on the output and its feedback.
Now that we know how to update the output layer, we need to figure out how to propagate the error to the hidden units. In the case where $j$ is a hidden unit in the network, the derivation of the training rule for $w_{ij}$ must take into account the indirect ways in which $w_{ij}$ can influence the network outputs and thus $E_d$. For this reason, we will find it useful to refer to the set of all units immediately downstream of unit $j$ in the network (i.e., all units whose direct inputs include the output of unit $j$). We denote this set of units by $downstream(j)$.
Notice that $net_j$ can influence the network outputs only through the units in $downstream(j)$. Therefore, we have
$$\frac{\partial E_d}{\partial net_j} = \sum_{k \in downstream(j)} \frac{\partial E_d}{\partial net_k}\frac{\partial net_k}{\partial net_j}
= \sum_{k \in downstream(j)} -\delta_k \frac{\partial net_k}{\partial net_j}
= \sum_{k \in downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j}\frac{\partial o_j}{\partial net_j}
= \sum_{k \in downstream(j)} -\delta_k\, w_{jk}\, o_j(1 - o_j).$$
Using $\delta_j$ to denote $-\frac{\partial E_d}{\partial net_j}$, we have
$$\delta_j = o_j(1 - o_j) \sum_{k \in downstream(j)} \delta_k w_{jk},$$
and hence the learning rule for the weights of the hidden units can be written as
$$\Delta w_{ij} = -R\frac{\partial E_d}{\partial w_{ij}} = -R\frac{\partial E_d}{\partial net_j}\, x_{ij} = R\,\delta_j x_{ij}.$$
Now let us summarize what we have done. The algorithm is described for three layers, but exactly the same procedure works for $k$ layers.

First, create a fully connected three-layer network and initialize the weights. Then, go example by example until all examples produce the correct output within $\epsilon$ (or some other criterion is met).
For each example in the training set:

1. Compute the network output $o_k$ for this example.
2. For each output unit $k$, compute the error term $\delta_k = (t_k - o_k)\,o_k(1 - o_k)$.
3. For each hidden unit $j$, compute the error term $\delta_j = o_j(1 - o_j)\sum_{k \in downstream(j)} \delta_k w_{jk}$.
4. Compute the weight updates $\Delta w_{ij} = R\,\delta_j x_{ij}$ and update the network weights: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$.
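The following is a minimal sketch of this whole procedure in Python for a network with one hidden layer; the layer sizes, learning rate, number of epochs, and the toy XOR data are our choices, and the thresholds $T_j$ are simulated with an extra constant input per layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    h = sigmoid(np.append(x, 1.0) @ W1)   # hidden layer (constant 1 input plays the role of -T_j)
    o = sigmoid(np.append(h, 1.0) @ W2)   # output layer
    return h, o

rng = np.random.default_rng(0)
W1 = rng.uniform(-0.5, 0.5, (3, 4))      # 2 inputs + bias -> 4 hidden units
W2 = rng.uniform(-0.5, 0.5, (5, 1))      # 4 hidden + bias -> 1 output
R = 0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)          # XOR targets

for epoch in range(10000):
    for x, t in zip(X, T):
        h, o = forward(x, W1, W2)
        delta_k = (t - o) * o * (1 - o)                  # output-unit error terms
        delta_j = h * (1 - h) * (W2[:-1] @ delta_k)      # hidden-unit error terms
        W2 += R * np.outer(np.append(h, 1.0), delta_k)   # w_ij <- w_ij + R delta x_ij
        W1 += R * np.outer(np.append(x, 1.0), delta_j)

for x in X:
    print(x, np.round(forward(x, W1, W2)[1], 2))         # should approach 0, 1, 1, 0
```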
Figure 7: More hidden layers
4 Training
It is important to remember that there is no guarantee of convergence: the algorithm may oscillate or get stuck in a local minimum. In practice, many large networks can be trained on large amounts of data, requiring many hours of computation time.
As in all gradient-driven algorithms, an important question is the termination criterion: a fixed number of epochs, a threshold on the training set error, no further decrease in error, increased error on a validation set, etc.
To avoid local minima, one useful technique is to run several trials with different random initial weights and combine them with majority voting.
Running too many epochs may over-train the network and result in over-fitting (improved results on the training set, but decreased performance on the test set).
We have talked about some standard techniques to avoid over-training, and some of them you have experimented with.

Keeping a held-out validation set and testing accuracy after every epoch works. You can also maintain the weights of the best-performing network on the validation set and return them when performance decreases significantly. To avoid losing training data to validation, use k-fold cross-validation to determine the average number of epochs that optimizes validation performance, then train on the full data set for this many epochs to produce the final results.
4.2 Network Architecture
Apart from parameters and tuning methodologies, there is also the question of what the architecture of the network should be: how many hidden layers, and in what arrangement.
Since it was known that a single hidden layer is enough to approximate any
function, it used to be the case that few hidden layers were used. However,
using too few hidden units might prevent the system from adequately fitting
the data and learning the concept, while using too many hidden units leads to
over-fitting.
Various modern systems train very deep networks, which is not a simple issue
because the gradients are going to be smaller and smaller as you go down. This
vanishing gradient problem is difficult to deal with.
As with the number of layers, there is no theory prescribing the size of the layers themselves, and cross-validation is one way to approximate the number of hidden units in a layer.
Another approach to preventing over-fitting is weight decay: all weights are multiplied by some fraction in (0, 1) after every epoch. In this way, smaller weights and less complex hypotheses are encouraged. Equivalently, we can change the error function to include a term for the sum of the squares of the weights in the network. These are general techniques that can be applied to other algorithms.
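A sketch of what the multiplicative version looks like in code; the decay factor and weight shapes are our own illustrative choices.

```python
import numpy as np

R, decay = 0.1, 0.999   # learning rate and per-epoch decay factor (our choices)

# Multiplicative weight decay after each epoch:  W <- decay * W.
# Equivalently, adding (lambda/2) * sum(W**2) to the error function
# contributes an extra  -R * lambda * W  term to each gradient update.
W = np.random.default_rng(0).normal(size=(4, 3))
for epoch in range(100):
    # ... gradient updates for this epoch would go here ...
    W *= decay
```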
Previously, we discussed that having weights of value zero simplifies the learned hypothesis function, which should reduce overfitting. Dropout training simulates the notion of zero weights by eliminating some of the hidden units while training. In dropout training, each time an example is read, some hidden units are removed with probability $p$, and the network is trained and propagates error as if those weights were not there.

Experiments showed that if the dropout scheme is used, a more robust result can be obtained.
Though dropout training was introduced in the context of neural networks, it can be applied to all learning algorithms; rather than changing the architecture of the network, dropout can be thought of as a change in the input.
Given a set of examples, a learning algorithm will determine which features are
important. The weights for those important features will then be large and
thus dominate the prediction. In the scheme of dropout learning, some features
are randomly dropped as examples are read, forcing the learning algorithm to
attend to all features.
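Here is a hedged sketch of such a dropout mask in Python; the rescaling by $1/(1-p)$ is the common "inverted dropout" variant, an assumption on our part rather than something specified in the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, train=True):
    """Zero each unit with probability p during training; rescale the
    survivors so the expected activation matches test time (inverted dropout)."""
    if not train:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.array([0.2, 0.9, 0.4, 0.7])
print(dropout(h, p=0.5))   # some activations zeroed, the rest scaled up
```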
In practice, it turns out that this idea is effective and often much stronger than
other known regularizers.
5.2 Hidden Layer Representation
When an auto-associative network is trained in this way, the hidden layer becomes a binary encoding of the eight numbers; learning, here, is a compression mechanism.
Stated more generally, given examples $x$, an auto-associative network learns to produce $x$ as output, where a hidden layer is of lower dimensionality. The learned representation is thus more compact, and auto-associative networks can be chained, as shown in Figure 10. In such a network, the reconstruction layer is dropped after optimization and a new layer is added. These kinds of tricks have been found to be useful for computer vision; more generally, people have found that the final layer of a network can be transferred: using the final layer of one network can be helpful for slightly different problems. This is something interesting that has not been completely explored yet.
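As a sketch of the 8-3-8 setup described above (the weight scales and names are ours; training would use the same backpropagation as before, with each input serving as its own target):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Auto-associative setup: the target equals the input (T = X).
X = np.eye(8)                                # eight one-hot inputs (the 8-3-8 example)
rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.5, size=(8, 3))   # encoder: 8 -> 3
W_dec = rng.normal(scale=0.5, size=(3, 8))   # decoder (reconstruction layer): 3 -> 8

H = sigmoid(X @ W_enc)                       # compact hidden codes, one per input
X_hat = sigmoid(H @ W_dec)                   # reconstruction; train so X_hat ~ X
# After training, drop W_dec and feed H into the next stacked layer.
```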
Figure 10: Stacking Auto-encoder
6 Receptive Fields
Consider the problem of encoding input into mathematical models. Humans
have sensory elements – eyes and ears – to encode information from the environ-
ment. Neural networks also require eyes and ears – things to encode information
– in order to process images and sentences. This is a big challenge and there
are different ways to handle this, for different tasks and different types of data.
However, no ideal, one-size-fits-all solution exists.
In neural network jargon, the input connections can be referred to as receptive fields. This term is borrowed from biology, where it refers to the portion of the sensory space in which a stimulus will trigger a given neuron to fire. For example, in the auditory system, receptive fields can correspond to volumes in auditory space. However, designing proper receptive fields for the input neurons is a significant challenge.
Consider using a neural network to predict whether an image contains a face. Receptive fields should provide expressive features from the raw image data, converting the image to inputs that the neural network can use. One approach would be to design a filter that tells how "edgy" the picture is, and give that value to the neural network. With this encoding, the whole picture, no matter how big it is, is converted to a single real-valued signal. Although it might not be an ideal approach to detecting faces, it is a very good starting point.
Another idea is to give every pixel of the input image to each unit in the input layer. This will work even when images have different sizes. However, the problem is that this network does not have any understanding of the local connections between pixels (spatial correlations are lost).
Figure 12: (a) All image information given to all units; (b) image divided into blocks
Rather than giving all the image data to all units in the input layer, we could create small blocks within the image, with each unit responsible for a certain block. As shown in Figure 12b above, the blocks are disjoint. Inside each block, we still have the problem of losing spatial correlations. Another issue is that when we move from block to block, the smoothness of moving from pixel to pixel is lost. Therefore, this approach is also not ideal.
7 Convolutional Layers
These days, people commonly create filters to capture different patterns in the input space. For images, these filters are matrices. Each filter scans over the image and creates a different output. For example, a filter can be designed to be sensitive to sharp corners. Using filters, not only are the spatial correlations preserved, but desired properties of the image can also be extracted. This idea can be generalized to other problems – such as text – and it lies at the heart of convolutional neural networks.
7.1 Convolutional Operator
The convolution operator is defined as
$$(x * h)(t) = \int_{-\infty}^{\infty} x(\tau)\,h(t - \tau)\,d\tau \qquad\text{and}\qquad (x * h)[n] = \sum_{m=-\infty}^{\infty} x[m]\,h[n - m]$$
for the continuous and discrete cases, respectively. In the definition above, $x$ and $h$ are both functions of $t$ (or $n$). Let us say $x$ is the input, and $h$ is the filter. The convolution of $x$ and $h$ is just an integration of the product of $x$ and a flipped $h$.
Convolution is very similar to cross-correlation, except that in convolution one of the functions is flipped. In two dimensions, the idea is the same: flip one matrix and slide it over the other matrix. In the example below, the image is convolved with the 'sharpen' kernel matrix. First, flip the kernel both vertically and horizontally. Then, starting from one corner of the image, multiply this matrix element-wise with the matrices representing blocks of pixels in the image, sum the products up, and put the result in another image. Keep doing this for all 3-by-3 blocks over the whole image. To deal with the boundaries, we can either start within the boundaries or pad zero values around the image. The result is going to be another picture, sharper than the previous one. Likewise, we can design filters for other purposes.
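A minimal sketch of this procedure in Python; the loop-based implementation, the random test image, and the function name are our own.

```python
import numpy as np

def convolve2d(image, kernel):
    """2-D convolution: flip the kernel both ways, then slide and sum
    (no padding here, so the output shrinks by kernel_size - 1)."""
    k = np.flipud(np.fliplr(kernel))   # the flip that distinguishes convolution
    kh, kw = k.shape                   # from cross-correlation
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)   # a common sharpen kernel
image = np.random.default_rng(0).random((8, 8))
print(convolve2d(image, sharpen).shape)           # (6, 6): boundaries not padded
```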
In practice, the Fast Fourier Transform (FFT) is applied to compute convolutions. For $n$ inputs, the complexity of the convolution operator is $O(n \log n)$. In two dimensions, each convolution takes $O(MN \log MN)$ time, where the size of the input is $M \times N$.
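A short sketch of the FFT route using numpy; the padding sizes are chosen to produce the 'full' convolution, and the test arrays are made up.

```python
import numpy as np
from numpy.fft import fft2, ifft2

def fft_convolve2d(image, kernel):
    """Full 2-D convolution via the convolution theorem, in O(MN log MN)."""
    H = image.shape[0] + kernel.shape[0] - 1
    W = image.shape[1] + kernel.shape[1] - 1
    return np.real(ifft2(fft2(image, (H, W)) * fft2(kernel, (H, W))))

img = np.random.default_rng(0).random((8, 8))
ker = np.ones((3, 3)) / 9.0
print(fft_convolve2d(img, ker).shape)   # (10, 10), the 'full' output size
```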
Figure 16: Example: sharpen kernel
So far, we have the inputs and the convolutional layer. The convolution of
the input (vector/matrix) with weights (vector/matrix) results in a response
vector/matrix. We can have multiple filters (four in the example shown in the
figure below) in each convolutional layer, each producing an output. If we have
multiple channels in the input – a channel for blue color and a channel for green,
for example – each channel will have a set of outputs.
The sizes of the outputs depend on the sizes of the inputs. To reduce inputs of different sizes to a fixed size, people in the community use something very simple called a pooling layer. There are different variations of pooling. For max pooling, simply take the largest value in each block. One could also take the average value of the blocks, or any other combination of the values; a small numerical sketch of these operations follows the list below.
• Max pooling: $h[n] = \max_{i \in N(n)} \tilde{h}[i]$

• Average pooling: $h[n] = \frac{1}{n}\sum_{i \in N(n)} \tilde{h}[i]$

• L-2 pooling: $h[n] = \sqrt{\frac{1}{n}\sum_{i \in N(n)} \tilde{h}^2[i]}$
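Here is that sketch; the response vector and the block size are made up.

```python
import numpy as np

h = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])   # a response vector
blocks = h.reshape(-1, 2)                       # non-overlapping blocks N(n) of size 2

max_pool = blocks.max(axis=1)                   # [3.  8.  5.]
avg_pool = blocks.mean(axis=1)                  # [2.  5.  4.5]
l2_pool  = np.sqrt((blocks ** 2).mean(axis=1))  # root of the mean of squares
print(max_pool, avg_pool, l2_pool)
```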
Combined, the convolution and pooling operations are said to constitute a single
convolutional stack.
1. Convolve an input with a filter → produce outputs of variable sizes
2. Use pooling to shrink outputs to a single, desired size
We can then combine these stacks as often as we want; the size of the output depends on the number of features, channels, and filters, and on design choices. We can give an image as input and get a class label as the prediction. This whole thing is a convolutional network.
Remember in backpropagation we started from the error terms in the last layer,
and passed them back to the previous layers, one by one. The same procedure
from backpropagation applies here.
Consider the case of max pooling. This layer only routes the gradient to the input that had the highest value in the forward pass. Therefore, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes called the switches) so that gradient routing is efficient during backpropagation: we have $\delta = \frac{\partial E_d}{\partial y_i}$ only for the selected input, and zero for the others. Derivations for the other pooling variants are similar.
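A minimal sketch of this routing for a single pooling block; the gradient value is made up.

```python
import numpy as np

# Forward pass of one max-pooling block: keep the argmax index (the "switch").
h = np.array([1.0, 3.0, 2.0, 8.0])
switch = np.argmax(h)            # 3
out = h[switch]                  # 8.0

# Backward pass: the incoming gradient is routed only to that input.
grad_out = 0.7                   # dE/dy for this pooled output (made-up value)
grad_in = np.zeros_like(h)
grad_in[switch] = grad_out       # all other inputs receive zero gradient
print(grad_in)                   # [0.  0.  0.  0.7]
```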
To get more intuition about ConvNets, let’s look at the following example of
identifying whether there is a car in an image. In the first stage, we have
convolutions with a bunch of filters and more sets of convolutions in the following
stages. When we are training, what are we really training? We are training
the filters. Our task is to identify whether a car is in the image, but at each
stage, there are multiple filters. Usually, in the early stages, the filters are more
sensitive to more general and less detailed elements of the picture, as shown in
the figures above. In later stages, more detailed pieces are favored.
7.6 History
In the 1980s, Fukushima designed a network with the same basic structure, but it was not trained by backpropagation. The first successful application of convolutional networks was by Yann LeCun in the 1990s (LeNet), which was used to read zip codes and digits. Although deeper networks are generally more accurate, it is not clear theoretically why this is the case.
9 Recurrent Neural Networks
In the feed-forward neural network architecture there are no cycles, because error must be propagated backwards. In principle RNNs have cycles, but in practice these cycles are broken. An RNN is a digraph that has cycles, which can act as memory; the hidden states can carry information about a potentially unbounded number of previous inputs. In practice, we essentially break the cycles in an RNN by unwrapping it across time.
Consider the cyclic representation in Figure 26. Assume that there is a time delay in each cycle; unwrapping the network across time steps then yields an acyclic structure.

Training general RNNs can be hard. Here we will focus on a special family of RNNs that predict on chain-like input. Consider the task of Part-of-Speech (POS) tagging: given a sentence of words, the RNN should output a POS tag for each word in the sentence.
Figure 27: POS tagging words in a sentence
There are several issues we have to handle. First of all, there are connections between labels: for example, verbs tend to appear after adverbs. Second, some sentences are longer than others, so we have to handle variable-size inputs. Also, there is interdependence between elements of the inputs: the final decision is based on an intricate interdependence of the words on each other.
To handle the chain-like input, we can design an RNN with a chain-like struc-
ture.
As shown in Figure 28, the $x_t$'s are values obtained from the input space; they are vector representations of the words. The hidden (memory) units are another set of vectors, computed from the past memory and the current word. Each input is combined with the current hidden state, and a new hidden state is produced. Each $h_t$ contains information about the previous inputs and the previous hidden units $h_{t-1}$, $h_{t-2}$, etc. They summarize the sentence up to each time step $t$, which in this example refers to the words in the sentence.
The structure shown in the above figure is the same structure being applied
multiple times (three in the figure). It is not a three-layer stack model. Instead,
it is a fixed structure, whose output is applied again to the same structure. It
is like applying it multiple times to itself. That is a big difference from the
fully-connected feed-forward networks.
Depending on the task, prediction can be made on each word or each sentence.
That is really a design choice.
9.3 Bi-Directional RNNs
Rather than having just a one-directional structure, in which the prediction would depend only on previous context, you can have bi-directional structures like the one shown in Figure 29. Using the same idea, the model can be made more complicated still, like the stack of bi-directional networks shown in the figure below.
9.4 Training
In the POS tagging task, each word is represented as a vector of fixed size. Consider initializing each word with random weights; the representation of each word is then a set of parameters we must train. To compute a new hidden state, the input representation is multiplied by one matrix and the previous hidden state is multiplied by another, and the results are combined. This process is how we transfer from one hidden state to the next; the matrices involved are more parameters that must be trained.
Given these hidden states, we multiply them with a matrix, apply the softmax function, and produce a distribution over the output labels. This final matrix is also another set of parameters that must be learned. The parameters we have to train thus include the matrix multiplied to generate outputs, the matrix that gives the hidden state from the previous hidden state, and the matrix that gives the hidden state from the vector representations of the input values.

Figure 31: Training an RNN
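A hedged sketch of this forward computation in Python; the matrices $W_i$, $W_h$, and $W_o$ correspond to the three parameter sets just described, while their sizes, the tanh nonlinearity, and the random word vectors are our assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, h, L = 4, 5, 3       # word-vector size, hidden size, number of labels (our choices)
W_i = rng.normal(scale=0.3, size=(h, d))   # input -> hidden
W_h = rng.normal(scale=0.3, size=(h, h))   # previous hidden -> hidden
W_o = rng.normal(scale=0.3, size=(L, h))   # hidden -> output distribution

xs = [rng.normal(size=d) for _ in range(6)]   # one vector per word in the sentence
h_t = np.zeros(h)
for x_t in xs:
    h_t = np.tanh(W_h @ h_t + W_i @ x_t)      # h_t summarizes the sentence so far
    y_t = softmax(W_o @ h_t)                  # distribution over POS tags for this word
    print(np.round(y_t, 2))
```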
To actually train the RNN, we need to generalize the same ideas from back-
propagation for feed-forward networks.
As a starting point, we first compute the total output error $E(y, t) = \sum_{t=1}^{T} E_t(y_t, t_t)$, which is accumulated over time (the words across the sentence). Then, we propagate the gradients of this error at the outputs back to the parameters. The gradients w.r.t. a matrix $W$ are calculated as
$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W},$$
where
$$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t}\frac{\partial y_t}{\partial h_t}\frac{\partial h_t}{\partial h_{t-k}}\frac{\partial h_{t-k}}{\partial W}.$$
What is a little tricky here is calculating the gradient of a hidden state w.r.t. an earlier hidden state. It can actually be calculated as the product of a bunch of matrices:
$$\frac{\partial h_t}{\partial h_{t-1}} = W_h\, \mathrm{diag}[f'(W_h h_{t-1} + W_i x_t)],$$
$$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t-k+1}^{t} W_h\, \mathrm{diag}[f'(W_h h_{j-1} + W_i x_j)].$$
When $k$ is large, this product involves many terms. In such cases, the gradient $\frac{\partial E_t}{\partial W}$ can become very small or very large; these vanishing and exploding gradients make training RNNs over long-range dependencies difficult.
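As an illustration, the following sketch multiplies a chain of Jacobians of the form $W_h\,\mathrm{diag}[f'(\cdot)]$ with made-up derivative values and watches the norm collapse; all sizes, scales, and the stand-in derivatives are our choices.

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim = 5
W_h = rng.normal(scale=0.3, size=(h_dim, h_dim))

# Product of k Jacobians dh_t/dh_{t-1} = W_h diag(f'(.)); with tanh, f' <= 1.
J = np.eye(h_dim)
for k in range(1, 21):
    fprime = rng.uniform(0.0, 1.0, h_dim)   # stand-in for f'(W_h h + W_i x)
    J = J @ (W_h @ np.diag(fprime))
    if k % 5 == 0:
        print(k, np.linalg.norm(J))         # norm shrinks rapidly (vanishing);
                                            # larger weights would make it explode
```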