Introduction To Artificial Neural Networks
Neural Networks
Ahmed Guessoum
Natural Language Processing and Machine Learning
Research Group
Laboratory for Research in Artificial Intelligence
Université des Sciences et de la Technologie
Houari Boumediene
Hebb’s Rule:
If an input of a neuron is repeatedly and persistently
causing the neuron to fire, then a metabolic change
happens in the synapse of that particular input to
reduce its resistance
• Human Brain
– Number of neurons: ~100 billion (10^11)
– Connections per neuron: ~10-100 thousand (10^4 - 10^5)
– Neuron switching time: ~0.001 (10^-3) second
– Scene recognition time: ~0.1 second
– 100 serial inference steps don't seem sufficient!
⇒ Massively parallel computation
[Figure: perceptron unit with inputs x0 = 1, x1, …, xn, weights w0, w1, …, wn, and summed activation Σ wi xi]

$$o(x_1, x_2, \ldots, x_n) = \begin{cases} 1, & \text{if } \sum_{i=0}^{n} w_i x_i \geq 0 \\ -1, & \text{otherwise} \end{cases}$$

Vector notation:

$$o(\mathbf{x}) = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x}) = \begin{cases} 1, & \text{if } \mathbf{w} \cdot \mathbf{x} \geq 0 \\ -1, & \text{otherwise} \end{cases}$$
[Figure: Example A and Example B: positive (+) and negative (−) examples plotted in the (x1, x2) plane]
• Perceptron: Can Represent Some Useful Functions (AND, OR, NAND, NOR)
– LTU emulation of logic gates (McCulloch and Pitts, 1943)
– e.g., what weights represent g(x1, x2) = AND(x1, x2)? OR(x1, x2)? NOT(x)?
  (decision rule sgn(w0 + w1·x1 + w2·x2); AND: w0 = -0.8, w1 = w2 = 0.5; OR: w0 = -0.3, w1 = w2 = 0.5)
• Some Functions are Not Representable
– e.g., not linearly separable
– Solution: use networks of perceptrons (LTUs)
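The gate weights above can be checked directly. Below is a minimal sketch (function names are ours, not from the slides) of an LTU with the stated weights; outputs use 0/1 rather than the perceptron's -1/+1, for readability.

```python
# Sketch: a linear threshold unit (LTU) emulating logic gates with the
# slide's weights: AND uses w0 = -0.8, OR uses w0 = -0.3 (w1 = w2 = 0.5).
def ltu(w0, w1, w2, x1, x2):
    """Fire (1) when w0 + w1*x1 + w2*x2 >= 0, else 0."""
    return 1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0

def AND(x1, x2):
    return ltu(-0.8, 0.5, 0.5, x1, x2)

def OR(x1, x2):
    return ltu(-0.3, 0.5, 0.5, x1, x2)

def NOT(x):
    # One-input LTU; the weights w0 = 0.3, w1 = -0.5 are our illustrative pick.
    return 1 if 0.3 - 0.5 * x >= 0 else 0
```

XOR, by contrast, cannot be written as a single LTU, which is what motivates the networks of perceptrons mentioned above.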
24/06/2018 AMLSS Ahmed Guessoum – Intro. to Neural Networks
Learning Rules for Perceptrons
• Learning Rule (aka Training Rule)
– Not specific to supervised learning
– Idea: Gradual building/update of a model
• Hebbian Learning Rule (Hebb, 1949)
– Idea: if two units are both active (“firing”),
weights between them should increase
– wij ← wij + r oi oj
where r is a learning rate constant
– Supported by neuropsychological evidence
• Perceptron Learning Rule (Rosenblatt, 1959)
– Idea: when a target output value is provided for a single
neuron with fixed input, it can incrementally update weights to
learn to produce the output
– Assume binary (boolean-valued) input/output units; single LTU
– wi ← wi + Δwi
  Δwi = r (t − o) xi
where t = c(x) is target output value, o is perceptron output,
r is small learning rate constant (e.g., 0.1)
– Convergence proven for D linearly separable and r small
enough
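The training rule above can be sketched as a small loop; the data set (learning OR with ±1 targets), the learning rate r = 0.1, and the function names are illustrative choices. Since OR is linearly separable, convergence is guaranteed.

```python
# Sketch of the perceptron training rule: w_i <- w_i + r(t - o)x_i,
# with targets in {-1, +1}.
def sgn(z):
    return 1 if z >= 0 else -1

def train_perceptron(data, r=0.1, epochs=100):
    """data: list of (x, t) pairs with x a tuple (x1..xn); x0 = 1 added here."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)           # w[0] is the bias weight w0
    for _ in range(epochs):
        converged = True
        for x, t in data:
            xs = (1,) + tuple(x)  # prepend x0 = 1
            o = sgn(sum(wi * xi for wi, xi in zip(w, xs)))
            if o != t:
                converged = False
                w = [wi + r * (t - o) * xi for wi, xi in zip(w, xs)]
        if converged:             # no mistakes in a full pass: done
            break
    return w

# OR is linearly separable, so the rule converges:
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(data)
```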
• Perceptron Learnability
– Recall: can only learn h ∈ H, i.e., linearly separable (LS) functions
– Minsky and Papert, 1969: demonstrated representational
limitations
• e.g., parity (n-attribute XOR: x1 ⊕ x2 ⊕ … ⊕ xn)
• e.g., symmetry, connectedness in visual pattern
recognition
• Influential book Perceptrons discouraged ANN research
for ~10 years
– NB: “Can we transform learning problems into LS ones?”
Linear Separators
• Functional Definition
– f(x) = 1 if w1x1 + w2x2 + … + wnxn ≥ θ, 0 otherwise
– θ: threshold value

[Figure: positive (+) and negative (−) examples in the (x1, x2) plane: one configuration separable by a line, one not]

• Linearly Separable Functions
– e.g., disjunctions: c(x) = x1′ ∨ x2′ ∨ … ∨ xm′
• Non-Linearly Separable Functions
– e.g., parity (XOR)
• Linear unit:
$$o(\mathbf{x}) = net(\mathbf{x}) = \sum_{i=0}^{n} w_i x_i$$
– Objective: find "best fit" to D
• Approximation Algorithm
– Quantitative objective: minimize error over training data set D
– Error function: sum squared error (SSE)
$$E(\mathbf{w}) = Error_D(\mathbf{w}) = \frac{1}{2} \sum_{x \in D} \big(t(x) - o(x)\big)^2$$
• How to Minimize?
– Simple optimization
– Move in the direction of steepest descent (the negative gradient) in weight-error space
  • Computed by finding the tangent
  • i.e., partial derivatives (of E) with respect to the weights (wi)

$$\Delta \mathbf{w} = -r\, \nabla E(\mathbf{w}), \qquad \Delta w_i = -r\, \frac{\partial E}{\partial w_i}$$

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{x \in D} \big(t(x) - o(x)\big)^2 = \frac{1}{2} \sum_{x \in D} \frac{\partial}{\partial w_i} \big(t(x) - o(x)\big)^2$$

$$= \frac{1}{2} \sum_{x \in D} 2\,\big(t(x) - o(x)\big)\,\frac{\partial}{\partial w_i}\big(t(x) - o(x)\big) = \sum_{x \in D} \big(t(x) - o(x)\big)\,\frac{\partial}{\partial w_i}\big(t(x) - \mathbf{w} \cdot \mathbf{x}\big)$$

$$\frac{\partial E}{\partial w_i} = \sum_{x \in D} \big(t(x) - o(x)\big)\,(-x_i)$$
Gradient Descent:
Algorithm using Delta/LMS Rule
• Algorithm Gradient-Descent (D, r)
– Each training example is a pair of the form <x, t(x)>, where x: input vector; t(x): target value; r: learning rate
– Initialize all weights wi to (small) random values
– UNTIL the termination condition is met, DO
    Initialize each Δwi to zero
    FOR each instance <x, t(x)> in D, DO
        Input x into the unit and compute output o
        FOR each linear unit weight wi, DO
            Δwi ← Δwi + r (t − o) xi
    FOR each linear unit weight wi, DO
        wi ← wi + Δwi
– RETURN final w
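The algorithm above can be rendered as a minimal sketch for a single linear unit; the toy data set (a noiseless linear target t = 1 + 2x), the learning rate, and the epoch count are illustrative choices.

```python
# Minimal rendition of Gradient-Descent(D, r) for one linear unit o(x) = w . x,
# accumulating the Delta-w over the whole batch before updating the weights.
import random

def gradient_descent(D, r=0.05, epochs=500):
    n = len(D[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        dw = [0.0] * (n + 1)                 # initialize each Delta-w_i to zero
        for x, t in D:
            xs = (1,) + tuple(x)             # x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))
            dw = [dwi + r * (t - o) * xi for dwi, xi in zip(dw, xs)]
        w = [wi + dwi for wi, dwi in zip(w, dw)]   # w_i <- w_i + Delta-w_i
    return w

D = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]  # target t = 1 + 2x
w = gradient_descent(D)
```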
– Incremental (stochastic) variant: update the weights after each training example, using the per-example error Ed(w) = ½ (t(x) − o(x))²
– Incremental gradient descent can approximate batch gradient descent arbitrarily closely if r is made small enough
• Gradient Descent
– Converges to a weight vector with minimal error, regardless of whether D is linearly separable, provided a sufficiently small learning rate is used
– Difficulties:
    Convergence to a local minimum can be slow
    No guarantee of finding the global minimum
• Stochastic Gradient Descent intended to alleviate
these difficulties
• Differences
– In Standard GD, error summed over D before updating W
– In Standard GD, more computation per weight update step
(but larger step size per weight update)
– Stochastic GD can sometimes avoid falling into local
minima
• Both commonly used in practice
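For contrast with the batch algorithm, a sketch of the stochastic (incremental) variant on the same kind of toy data: the weights are updated after every example rather than once per pass over D.

```python
# Stochastic GD for one linear unit: the update w_i <- w_i + r(t - o)x_i is
# applied immediately after each example. Data set and rate are illustrative.
import random

def stochastic_gd(D, r=0.05, epochs=500):
    n = len(D[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in D:
            xs = (1,) + tuple(x)             # x0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xs))
            # per-example update, no batch accumulation
            w = [wi + r * (t - o) * xi for wi, xi in zip(w, xs)]
    return w

D = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]  # target t = 1 + 2x
w = stochastic_gd(D)
```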
– Recall: the perceptron's activation function is sgn(w · x)
• Multi-Layer Networks
– A specific type: Multi-Layer Perceptrons (MLPs)
– Definition: a multi-layer feedforward network is composed of an input layer, one or more hidden layers, and an output layer
– Only hidden and output layers contain perceptrons (threshold or nonlinear units)
– A network of 2 or more layers can represent functions a single unit cannot (e.g., non-linearly separable ones)

[Figure: feedforward network with input layer x1, x2, x3, hidden layer h1 … h4, and weights u11, …]

• Sigmoid unit: same structure as the perceptron (inputs x0 = 1, x1, …, xn with weights w0, …, wn), but with activation
$$o(\mathbf{x}) = \sigma(net), \qquad net = \sum_{i=0}^{n} w_i x_i = \mathbf{w} \cdot \mathbf{x}$$
• Sigmoid Activation Function
– Linear threshold gate activation function: sgn(w · x)
– Nonlinear activation (aka transfer, squashing) function: a generalization of sgn
– The sigmoid function:
$$\sigma(net) = \frac{1}{1 + e^{-net}}$$
– Can derive gradient rules to train
  • One sigmoid unit
  • Multi-layer, feedforward networks of sigmoid units (using backpropagation)
• Hyperbolic Tangent Activation Function
$$\sigma(net) = \frac{\sinh(net)}{\cosh(net)} = \frac{e^{net} - e^{-net}}{e^{net} + e^{-net}}$$
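The two squashing functions above, assuming the standard definitions σ(net) = 1/(1 + e^(−net)) and tanh(net) = sinh(net)/cosh(net), together with the derivative identity that backpropagation relies on:

```python
import math

def sigmoid(net):
    # sigma(net) = 1 / (1 + e^-net)
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_deriv(net):
    # Identity used by backpropagation: sigma'(net) = sigma(net)(1 - sigma(net))
    o = sigmoid(net)
    return o * (1.0 - o)

def tanh_act(net):
    # sinh(net)/cosh(net), identical to math.tanh(net)
    return math.sinh(net) / math.cosh(net)
```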
Error Gradient for a Sigmoid Unit
• Recall: Gradient of Error Function
$$\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$
• Gradient of Sigmoid Activation Function
$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{\langle x, t(x)\rangle \in D} \big(t(x) - o(x)\big)^2 = \frac{1}{2} \sum_{\langle x, t(x)\rangle \in D} \frac{\partial}{\partial w_i} \big(t(x) - o(x)\big)^2$$
$$= \frac{1}{2} \sum_{\langle x, t(x)\rangle \in D} 2\,\big(t(x) - o(x)\big)\,\frac{\partial}{\partial w_i}\big(t(x) - o(x)\big) = \sum_{\langle x, t(x)\rangle \in D} \big(t(x) - o(x)\big)\left(-\frac{\partial o(x)}{\partial w_i}\right)$$
$$= -\sum_{\langle x, t(x)\rangle \in D} \big(t(x) - o(x)\big)\,\frac{\partial o(x)}{\partial net(x)} \cdot \frac{\partial net(x)}{\partial w_i}$$
• But we know:
$$\frac{\partial o(x)}{\partial net(x)} = \frac{\partial \sigma(net)}{\partial net(x)} = o(x)\,\big(1 - o(x)\big), \qquad \frac{\partial net(x)}{\partial w_i} = \frac{\partial (\mathbf{w} \cdot \mathbf{x})}{\partial w_i} = x_i$$
So:
$$\frac{\partial E}{\partial w_i} = -\sum_{\langle x, t(x)\rangle \in D} \big(t(x) - o(x)\big)\, o(x)\,\big(1 - o(x)\big)\, x_i$$
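The derived gradient can be sanity-checked numerically. The sketch below compares the analytic formula ∂E/∂wi = −Σ (t − o) o (1 − o) xi against a central-difference estimate on a made-up two-example data set (the data and weights are illustrative).

```python
# Numerical check of dE/dw_i for a single sigmoid unit.
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def error(w, D):
    # E(w) = 1/2 sum_(x,t) (t - sigmoid(w . x))^2
    return 0.5 * sum((t - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))) ** 2
                     for x, t in D)

def analytic_grad(w, D, i):
    # dE/dw_i = - sum_(x,t) (t - o) o (1 - o) x_i
    g = 0.0
    for x, t in D:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        g += -(t - o) * o * (1.0 - o) * x[i]
    return g

def numeric_grad(w, D, i, eps=1e-6):
    # Central difference: (E(w + eps e_i) - E(w - eps e_i)) / (2 eps)
    wp = list(w); wp[i] += eps
    wm = list(w); wm[i] -= eps
    return (error(wp, D) - error(wm, D)) / (2 * eps)

# Made-up data: x includes x0 = 1 explicitly; targets in [0, 1]
D = [((1.0, 0.5, -0.3), 1.0), ((1.0, -1.2, 0.8), 0.0)]
w = [0.1, -0.2, 0.3]
```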
The Backpropagation Algorithm
• Intuitive Idea: Distribute the Blame for Error to the Previous Layers
• Algorithm Train-by-Backprop (D, r)
– Each training example is a pair of the form <x, t(x)>, where x: input vector; t(x): target
vector; r :learning rate
– Initialize all weights wi to (small) random values
– UNTIL the termination condition is met, DO
FOR each <x, t(x)> in D, DO
    Input the instance x to the unit and compute the output o(x) = σ(net(x))
    FOR each output unit k, DO (calculate its error term)
        δk ← ok(x) (1 − ok(x)) (tk(x) − ok(x))
    FOR each hidden unit j, DO
        δj ← hj(x) (1 − hj(x)) Σk∈outputs vj,k δk
    Update each weight w = ui,j or w = vj,k with w ← w + r δ a, where a is the input carried by that weight:
        ui,j ← ui,j + r δj xi
        vj,k ← vj,k + r δk hj

[Figure: two-layer feedforward network with input layer x1, x2, x3, hidden layer h1 … h4 (weights u11, …), and output layer o1, o2 (weights v42, …)]
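The procedure above can be sketched for one hidden layer of sigmoid units. The network sizes, the XOR training set, the learning rate, epoch count, and seed are all illustrative choices, not values from the slides.

```python
# Minimal sketch of Train-by-Backprop for a single hidden layer.
import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(u, v, x):
    """Forward pass; returns (inputs with bias, hidden with bias, outputs)."""
    xs = [1.0] + list(x)
    h = [sigmoid(sum(wi * xi for wi, xi in zip(row, xs))) for row in u]
    hs = [1.0] + h
    o = [sigmoid(sum(wi * hi for wi, hi in zip(row, hs))) for row in v]
    return xs, hs, o

def train_backprop(D, n_in, n_hidden, n_out, r=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    u = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    v = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in D:
            xs, hs, o = forward(u, v, x)
            # output-unit blame: delta_k = o_k(1 - o_k)(t_k - o_k)
            dk = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # hidden-unit blame: delta_j = h_j(1 - h_j) sum_k v_jk delta_k
            dj = [hs[j + 1] * (1 - hs[j + 1]) *
                  sum(v[k][j + 1] * dk[k] for k in range(n_out))
                  for j in range(n_hidden)]
            # weight updates: w <- w + r * delta * a
            for k in range(n_out):
                v[k] = [wk + r * dk[k] * a for wk, a in zip(v[k], hs)]
            for j in range(n_hidden):
                u[j] = [wj + r * dj[j] * a for wj, a in zip(u[j], xs)]
    return u, v

def sse(u, v, D):
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in D
                     for tk, ok in zip(t, forward(u, v, x)[2]))

# XOR: the classic non-linearly-separable example
D = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
```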
• Backprop in Practice
– Local optimization often works well (can run multiple times)
– A weight momentum α is often included:
$$\Delta w_{start\text{-}layer,\,end\text{-}layer}(n) = r\, \delta_{end\text{-}layer}\, a_{end\text{-}layer} + \alpha\, \Delta w_{start\text{-}layer,\,end\text{-}layer}(n-1)$$
– Minimizes error over training examples
Generalization to subsequent instances?
– Training often very slow: thousands of iterations over D (epochs)
– Inference (applying network after training) typically very fast
• Classification
• Control
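The momentum term mentioned above can be sketched as follows; the values r = 0.1 and α = 0.9 are illustrative defaults, not prescribed by the slides.

```python
# Momentum: each weight step adds a fraction alpha of the previous step,
# smoothing the trajectory and helping to coast through flat regions.
def momentum_step(w, grad, velocity, r=0.1, alpha=0.9):
    """Delta_w(n) = -r * grad + alpha * Delta_w(n-1); returns (new_w, new_velocity)."""
    v_new = [-r * g + alpha * v for g, v in zip(grad, velocity)]
    return [wi + vi for wi, vi in zip(w, v_new)], v_new
```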
When to Consider Neural Networks
• Input: High-Dimensional and Discrete or Real-Valued
– e.g., raw sensor input
– Conversion of symbolic data to quantitative (numerical) representations possible
• Output: Discrete or Real Vector-Valued
– e.g., low-level control policy for a robot actuator
– Similar qualitative/quantitative (symbolic/numerical) conversions may apply
• Data: Possibly Noisy
• Target Function: Unknown Form
• Result: Human Readability Less Important Than Performance
– Performance measured purely in terms of accuracy and efficiency
– Readability: ability to explain inferences made using model; similar criteria
• Examples
– Speech phoneme recognition
– Image classification
– Financial prediction
[Figure: learned weight visualizations: a hidden-to-output weight map and an input-to-hidden weight map, each shown for one hidden unit]
• Solution Approaches
– Prevention: attribute subset selection
– Avoidance
• Hold out cross-validation (CV) set or split k ways (when to
stop?)
• Weight decay: decrease each weight by some factor on each
epoch
– Detection/recovery: random restarts, addition and deletion
of units
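The weight-decay idea above, in sketch form (the decay factor 0.001 is an illustrative value):

```python
# Weight decay: shrink every weight by a small factor on each epoch,
# penalizing large weights and so discouraging overfitting.
def decay_weights(w, decay=0.001):
    return [wi * (1.0 - decay) for wi in w]
```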
[Figure: 30 x 32 input units]
• Input Encoding:
– How to encode an image?
  • Extract edges, regions of uniform intensity, other local features?
  • Problem: a variable number of features ⇒ a variable number of input units
– Choice: encode the image as 30 x 32 pixel intensity values (summary/means of the original 120 x 128) ⇒ computational demands manageable
  • This is crucial in the case of ALVINN (autonomous driving)
• Output Encoding:
– ANN to output 1 of 4 values
• Option 1: a single unit (with output values e.g. 0.2, 0.4, 0.6, 0.8)
• Option 2: 1-of-n output encoding (the better option)
– Note: Instead of 0 and 1 values, 0.1 and 0.9 are used (sigmoid
units cannot output 0 and 1 given finite weights)
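The 1-of-n encoding with 0.1/0.9 targets can be sketched as below (helper names are ours):

```python
# 1-of-n output encoding with 0.1/0.9 targets, since sigmoid units
# cannot output exactly 0 or 1 with finite weights.
def one_of_n_target(klass, n):
    """Target vector for class index `klass` out of n classes."""
    return [0.9 if i == klass else 0.1 for i in range(n)]

def decode(outputs):
    """Predicted class = index of the largest output."""
    return max(range(len(outputs)), key=lambda i: outputs[i])
```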
• GD with momentum
• Resilient backpropagation
Resilient Backpropagation:
Problem with sigmoid functions: their slopes approach zero as the input gets large
⇒ the gradient can have a very small magnitude
⇒ small changes in weights and biases, even though the weights and biases are far from their optimal values.
• Resilient backpropagation attempts to eliminate
harmful effects of the magnitudes of the gradient
• Magnitude of the gradient has no effect on the
weight update
• Only the sign of the gradient is used to determine the direction of the weight update.
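A sign-only update in this spirit can be sketched as below; the constants η+ = 1.2 and η− = 0.5 are the commonly cited Rprop defaults, and the function shape is our illustrative simplification.

```python
# Sign-only weight update in the spirit of resilient backpropagation (Rprop):
# each weight keeps its own step size, which grows while the gradient keeps
# its sign and shrinks on a sign flip; the gradient's magnitude is never used.
def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    w, step, prev = list(w), list(step), list(prev_grad)
    for i, g in enumerate(grad):
        if g * prev[i] > 0:          # same sign as last time: accelerate
            step[i] = min(step[i] * eta_plus, step_max)
        elif g * prev[i] < 0:        # sign flip: overshot, so back off
            step[i] = max(step[i] * eta_minus, step_min)
        if g > 0:                    # move opposite to the gradient's sign
            w[i] -= step[i]
        elif g < 0:
            w[i] += step[i]
        prev[i] = g
    return w, step, prev
```

Classic Rprop additionally reverts the previous step on a sign change; the simplified version here only rescales the step size.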
Conjugate Gradient Algorithms
• Fletcher-Reeves Update
• Polak-Ribiére Update
• Powell-Beale Restarts
• Scaled Conjugate Gradient
• Advantages
– Adapt to unknown situations
– Robustness: fault tolerance due to network
redundancy
– Autonomous learning and generalization
• Disadvantages
– Complexity of finding the “right” network
structure
– “Black box”
• Hybrid Approaches
– Incorporating knowledge and analytical learning into ANNs
• Knowledge-based neural networks
• Explanation-based neural networks
• Combining uncertain reasoning and ANN learning and
inference
• Probabilistic ANNs
• Bayesian networks