Federico Marini
Dept. Chemistry, University of Rome “La Sapienza”, Italy
Artificial neural networks (ANNs)
Historical background
ANN papers published: 1982-2002
(Bar chart: the number of ANN papers published per biennium rises from 1 in 1982 to 4916 in 2001-2002.)
Strengths of a Neural Network
The roughest approach to NNs
NNs in a nutshell
• From a computational point of view, ANNs carry out a non-linear functional mapping between an input and an output space:
$y = f(\mathbf{x})$
• y can be:
• an n-dimensional (usually 2D) vector of coordinates (mapping)
• a multi-dimensional vector of responses (regression)
• a binary vector of class memberships (classification)
• This functional relation is expressed in an implicit way.
Opening the black box
$y = f\left(\sum_i w_i x_i + w_0\right)$
Artificial neurons
Nonlinear generalization of the McCulloch-Pitts neuron:
$y = f(\mathbf{x}, \mathbf{w})$
y is the neuron’s output, x is the vector of inputs,
and w is the vector of synaptic weights.
Examples:
• sigmoidal neuron: $y = \dfrac{1}{1 + e^{-\mathbf{w}^T\mathbf{x} - a}}$
• Gaussian neuron: $y = e^{-\dfrac{\|\mathbf{x} - \mathbf{w}\|^2}{2a^2}}$
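A minimal NumPy sketch of these two neuron types (function and variable names are illustrative):

```python
import numpy as np

def sigmoidal_neuron(x, w, a=0.0):
    # Logistic neuron: y = 1 / (1 + exp(-(w.x + a)))
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + a)))

def gaussian_neuron(x, w, a=1.0):
    # Gaussian neuron: y = exp(-||x - w||^2 / (2 a^2))
    return np.exp(-np.sum((x - w) ** 2) / (2.0 * a ** 2))

x = np.array([0.2, -0.5, 1.0])
w = np.array([0.4, 0.1, -0.3])
print(sigmoidal_neuron(x, w), gaussian_neuron(x, w))
```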
Artificial neurons: $y = f(\mathbf{x}, \mathbf{w})$
(Diagram: the inputs x1, ..., x4 are multiplied by the weights w1, ..., w4 and summed together with a bias w0 attached to a constant input +1, giving s = w^T x; the activation function then yields the output y = f(s). A 3D surface plot shows the neuron output y as a function of two inputs x1 and x2.)
Activation functions
• Hard threshold
• Piecewise linear
• Sigmoid
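A minimal NumPy sketch of these three activation functions (the slope and saturation range of the piecewise-linear unit are not specified on the slide, so the version below is one common choice):

```python
import numpy as np

def hard_threshold(s):
    # Step function: 1 if the net input is non-negative, 0 otherwise
    return np.where(s >= 0, 1.0, 0.0)

def piecewise_linear(s):
    # Linear between -1 and 1, saturating at 0 and 1 outside that range
    return np.clip(0.5 * (s + 1.0), 0.0, 1.0)

def sigmoid(s):
    # Smooth, differentiable squashing function with values in (0, 1)
    return 1.0 / (1.0 + np.exp(-s))
```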
From the neuron to the net
• Just as a neuron can be thought of as a nonlinear function of its inputs, a network represents the composition of the nonlinear functions of two or more neurons.
• The way the different units are connected among each
other governs the way the different functions they
describe are weighted and combined to produce the
overall output.
• This pattern of interconnection among the neurons is
called the network “architecture”, and can be
conveniently represented on a graph: neurons operating
on the same input variables are organized in layers, while
the weights that modulate the combination of the
nonlinear functions are represented as lines connecting
units in different layers.
Artificial neural networks
(Diagram of a feed-forward neural network, with the input neurons on the left and the output neuron on the right.)
Artificial (multilayer feed-forward) NNs
(Diagram: input layer, hidden layer, and output layer.)
$y_j = g\left(\sum_{k=1}^{N_h} w_{jk} h_k + w_{j0}\right) = g\left(\sum_{k=1}^{N_h} w_{jk}\, f\left(\sum_{i=1}^{N_i} w_{ki} x_i + w_{k0}\right) + w_{j0}\right)$
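Written as code, this two-layer composition is just two matrix products separated by a nonlinearity; a minimal NumPy sketch (tanh hidden units and an identity output function g are illustrative choices, since the slide leaves f and g generic):

```python
import numpy as np

def mlp_forward(x, W_hid, w0_hid, W_out, w0_out, f=np.tanh, g=lambda s: s):
    # h_k = f(sum_i w_ki x_i + w_k0);  y_j = g(sum_k w_jk h_k + w_j0)
    h = f(W_hid @ x + w0_hid)          # hidden-layer activations
    return g(W_out @ h + w0_out)       # network outputs

rng = np.random.default_rng(0)
Ni, Nh, No = 4, 3, 2                   # inputs, hidden neurons, outputs
x = rng.normal(size=Ni)
y = mlp_forward(x,
                rng.normal(size=(Nh, Ni)), rng.normal(size=Nh),
                rng.normal(size=(No, Nh)), rng.normal(size=No))
```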
Artificial neural networks
(Panels: each hidden unit computes the weighted sum s = sum(w^T x) of the two inputs and squashes it as h = tanh(s); the output unit combines the hidden activations as y = tanh(w2^T h). The surfaces show the functions computed at each stage over the (x1, x2) plane.)
How does a neural network learn?
Multilayer feed-forward networks
Multilayer feed-forward NNs 2
(Diagram: input, hidden, and output layers.) The network computes as many functions of the input variables as there are output neurons.
• Neurons are organized in three kinds of layers: input, hidden and output.
• The output neurons are the neurons that perform the final computation, i.e., whose
outputs are the outputs of the network, while the other neurons, which perform
intermediate computations, are termed hidden neurons.
• The units of the input layer just pass the inputs as variables to the hidden neurons,
without doing any processing on them.
• Each output is a nonlinear function (computed by the corresponding output
neuron) of the nonlinear functions computed by the hidden neurons.
$y_j = g\left(\sum_{k=1}^{N_h} w_{jk} h_k + w_{j0}\right) = g\left(\sum_{k=1}^{N_h} w_{jk}\, f\left(\sum_{i=1}^{N_i} w_{ki} x_i + w_{k0}\right) + w_{j0}\right)$
“Training” the net
• A feed-forward network with a single hidden layer can
approximate with arbitrary accuracy any bounded and
sufficiently regular function in a finite region of variable
space.
• The procedure whereby the parameters of the network are estimated in order to approximate such a function is called training of the ANN.
• Usually the nonlinear relationship between dependent and independent variables is not known analytically; only a finite number of numerical values of the function are available, obtained through measurements performed on a physical, chemical, biological, etc. process. The task assigned to the network is therefore to approximate the regression function of the available data.
Supervised training
• This kind of training is referred to as “supervised” since the
function that the network should implement is known in some
or all points: a “teacher” provides “examples” of values of the
inputs and of the corresponding values of the output.
• The goal of the training algorithm is to find the best set of model parameters given the data, i.e., the numerical values of the network weights which minimize a cost function representing the distance between the predictions of the model and the measured values.
Supervised training
• When this cost function is the squared error of the residuals:
$w_{ji}(t) = w_{ji}(t-1) - \eta\,\dfrac{\partial E}{\partial w_{ji}}$
More on backpropagation
Weight update rule
Output neurons
Hidden neurons
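In their standard form (squared-error cost, learning rate η, net inputs s, hidden activations h and activation functions f and g as defined above), these read:

$\Delta w_{jk} = \eta\,\delta_j\,h_k, \qquad \delta_j = (t_j - y_j)\,g'(s_j) \quad \text{(output neurons)}$
$\Delta w_{ki} = \eta\,\delta_k\,x_i, \qquad \delta_k = f'(s_k)\,\sum_j w_{jk}\,\delta_j \quad \text{(hidden neurons)}$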
Second order methods
• Backpropagation is rather simple and easy to implement but can suffer from severe convergence problems.
• One way of coping with this is momentum.
• Another is to use second-order methods.
• Given an initial estimate of the weights, the error is Taylor-expanded up to the second order (the expansion and the resulting update are written out after this list).
• Problems:
– Hessian matrix H should be calculated and stored (computationally and
memory intensive).
– Hessian matrix should be nonsingular (not guaranteed, quite often H is
not full rank).
• Solutions:
– Quasi-second order methods:
• Gauss-Newton
• Levenberg-Marquardt
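In its standard form, with g the gradient and H the Hessian of E evaluated at the current weights w(t), the second-order expansion and the resulting Newton step are:

$E(\mathbf{w}) \approx E\bigl(\mathbf{w}(t)\bigr) + \mathbf{g}^{T}\Delta\mathbf{w} + \tfrac{1}{2}\,\Delta\mathbf{w}^{T}\mathbf{H}\,\Delta\mathbf{w}, \qquad \Delta\mathbf{w} = -\mathbf{H}^{-1}\mathbf{g}$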
Levenberg-Marquardt method
• In Gauss-Newton, the Hessian matrix is approximated by:
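In the usual notation, with J the Jacobian of the residual vector e and λ a damping parameter, the Gauss-Newton approximation and the corresponding updates are:

$\mathbf{H} \approx \mathbf{J}^{T}\mathbf{J}, \qquad \Delta\mathbf{w}_{GN} = -\bigl(\mathbf{J}^{T}\mathbf{J}\bigr)^{-1}\mathbf{J}^{T}\mathbf{e}, \qquad \Delta\mathbf{w}_{LM} = -\bigl(\mathbf{J}^{T}\mathbf{J} + \lambda\,\mathbf{I}\bigr)^{-1}\mathbf{J}^{T}\mathbf{e}$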
Specifically for classification
• When ANNs are used for classification:
– Error criterion is cross-entropy instead of RMSE
$E_{CE} = \sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\ln\dfrac{y_{ij}}{\hat{y}_{ij}}$
(N objects, M classes; y_{ij} are the observed and ŷ_{ij} the predicted class memberships)
Generalization vs specialization
Generalization vs. specialization 2
• Overtraining:
– Too many examples: the ANN memorizes the examples instead of the general idea
• Generalization vs. specialization trade-off:
# hidden nodes & training samples
RADIAL BASIS FUNCTIONS - NN
The RBF-NN
• Differently from MLP, RBF-NN performs classification and
regression based on similarity with examples from the training
set
• Its basic unit is the Gaussian (RBF) neuron:
$y = e^{-\dfrac{\|\mathbf{x}_i - \mathbf{c}_k\|^2}{2\sigma^2}} = e^{-\beta\,\|\mathbf{x}_i - \mathbf{c}_k\|^2}$
(c_k is the center of the kth RBF)
The RBF-NN for classification
$y_j = \sum_{k=1}^{n_{RBF}} w_{jk}\, e^{-\beta\,\|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2}$
(Diagram: the input x_i feeds the Gaussian hidden units, whose outputs are combined by the weights w_11, ..., w_ck into the class outputs y_1, ..., y_c.)
The centers and widths of the RBFs, as well as the weights w_jk, need to be optimized.
Training the RBF-NN
• There are many different training algorithms for RBF-NN
• Apart from backpropagation, the most common are Orthogonal
least squares and the following:
1. Select the RBF centers by k-means clustering (possibly,
applying clustering separately by category)
2. Calculate the RBF width as the mean cluster distance to
the centroid:
$\sigma_k = \dfrac{1}{m}\sum_{i=1}^{m}\|\mathbf{x}_i - \boldsymbol{\mu}_k\| \qquad \text{and} \qquad \beta_k = \dfrac{1}{2\sigma_k^2}$
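A minimal sketch of this recipe (scikit-learn's KMeans is assumed for step 1; solving for the output weights by linear least squares is one common choice and is not prescribed by the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rbf(X, Y, n_rbf=10):
    # 1. RBF centers by k-means clustering
    km = KMeans(n_clusters=n_rbf, n_init=10).fit(X)
    centers = km.cluster_centers_
    # 2. sigma_k = mean distance of the cluster's members to its centroid
    sigmas = np.array([np.linalg.norm(X[km.labels_ == k] - centers[k], axis=1).mean()
                       for k in range(n_rbf)])
    sigmas = np.maximum(sigmas, 1e-8)          # guard against single-member clusters
    betas = 1.0 / (2.0 * sigmas ** 2)
    # 3. output weights by least squares on the matrix of RBF activations
    G = np.exp(-betas * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    W, *_ = np.linalg.lstsq(G, Y, rcond=None)
    return centers, betas, W

def predict_rbf(X, centers, betas, W):
    G = np.exp(-betas * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    return G @ W
```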
The RBF-NN for regression
$y = \sum_{k=1}^{n_{RBF}} w_k\, e^{-\beta\,\|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2}$
(Diagram: the input x_i feeds the Gaussian hidden units, whose outputs are combined by the weights w_1, ..., w_k into the single response y.)
The centers and widths of the RBFs, as well as the weights w_k, need to be optimized.
Training the RBF-NN for regression
• The training algorithms for regression are the same as for
classification, with two main differences:
1. The width of the RBF is normally selected as equal for all nodes and based on cross-validation instead of on clustering.
2. The output of the function is often scaled (normalized):
$y = \dfrac{\sum_{k=1}^{n_{RBF}} w_k\, e^{-\beta\,\|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2}}{\sum_{k=1}^{n_{RBF}} e^{-\beta\,\|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2}}$
Self organizing maps
(Kohonen architectures)
Relative distance measure
The objects Xs are adjusted according to an absolute distance measure d(Xs, Wj) to positions of the prespecified points Wj, distributed in a topologically predefined scheme. In the case of Kohonen neural networks, the objects Xs and the arbitrarily distributed points Wj are first assigned to each other; next, the points Wj with associated closest objects {Xs} are pooled together to predefined positions in a 2-D plane.
(Diagram: the objects Xs and the points Wj in the three-dimensional space (x1, x2, x3), and the corresponding points W'j on the map.)
The final result does not depend much on the distances between objects, but rather on the
distances between the objects Xs and the pivot points Wj.
Kohonen self organizing maps (SOMs)
• The implicit functional relation that we want to approximate is a
nonlinear mapping from an Ni-dimensional input space to a low-
dimensional (usually 2D) discrete coordinate space (the map).
• Since there is no desired response to be obtained, training
occurs by self-organization, i.e. a Kohonen network adapts itself
so that similar input objects are associated with topologically close neurons.
• In a self-organizing map, the target space used in Kohonen
mapping is a two-dimensional array of neurons (the Kohonen
layer or top-map), fully connected to the input layer, onto which
the samples are mapped.
• Introducing the preservation of topology results in specifying for
each node in the Kohonen layer a defined number of neurons as
nearest neighbors, second-nearest neighbors and so on.
Kohonen self organizing maps 2
The most important feature of the Kohonen neural network is the topological order in which the neurons are combined together into the network.
(Diagram: the same inputs x1-x4 feeding a one-dimensional row of neurons y1-y7 and a two-dimensional Kohonen layer.)
Defining the neighborhood
The neighborhood of a neuron is usually considered to be square or hexagonal
which means that each neuron has 8 or 6 nearest neighbors respectively.
(Diagrams: maps of the topological distance (from 0 to 5) around a central neuron for a square neighbourhood, in which each neuron has 8 nearest neighbours, and for a hexagonal neighbourhood, in which it has 6.)
Cyclic and toroid conditions
The cyclic and the toroid boundary conditions: the edge on one side is linked to the edge on the opposite side.
(Diagrams: a one-dimensional row of neurons y1, ..., yn closed into a ring, and a rectangular plane of neurons closed into a torus. With toroid conditions, the first neighbours of the neurons on edges a and c are the neurons on edges b and d, respectively.)
Cyclic and toroid conditions 2
The consideration of the toroidal conditions in Kohonen neural
networks can sometimes lead to much clearer results.
Learning procedure in Kohonen networks
Learning in the Kohonen neural network is iterative, i.e., a set of objects {Xs} is sent through the network several times. After the pass of one object through the network, the weights, which at the beginning of learning are randomised, are changed. The learning procedure of a single pass consists of three steps:
Kohonen networks in practice
Step one: selection of the “excited”, “central”, or “responding” neuron. In the large majority of cases the selection of the neuron is made according to the smallest-distance criterion:
$d(X_s, W_j) = \sum_{i=1}^{m} (x_{si} - w_{ji})^2 \qquad \text{for all } j = 1, \dots, N_{net}$
(Diagram: two input objects X1 = (x11, x12, x13, x14) and X2 = (x21, x22, x23, x24) are presented to the network; for each, the neuron with the smallest distance becomes the “excited” or “selected” neuron.)
Kohonen networks in practice 2
Step two: correction of weights of the selected neuron We
Because after each pass only one neuron is selected, the learning procedure is called the “winner-takes-all” strategy.
$\Delta W_e = \eta\,(X_s - W_e^{old})$
(Diagram: the correction moves the excited neuron towards the object, so that d(W_e^{new}, X_s) < d(W_e, X_s).)
The corrections ΔWe drive the weights w_ei of the excited neuron W_e = (w_e1, w_e2, ..., w_ei, ..., w_em) closer to the variables x_si of the object X_s = (x_s1, x_s2, ..., x_si, ..., x_sm) that has excited it.
Kohonen networks in practice 3
Step three: correction of the weights of the neurons Wj surrounding the selected neuron We:
$\Delta w_{ji} = \eta\, a(d_j)\,\bigl(x_{si} - w_{ji}^{old}\bigr)$
with, for example, the triangular neighbourhood function
$a(d_j) = 1 - \dfrac{d_j}{d_{max} + 1}$, i.e. $\Delta w_{ji} = \eta\left(1 - \dfrac{d_j}{d_{max} + 1}\right)\bigl(x_{si} - w_{ji}^{old}\bigr)$
For the selected neuron a(dj) = 1, because dj = 0; for the neurons separated by dmax from the selected neuron We, the value a(dmax) is the smallest possible correction.
The maximum topological distance dmax shrinks during training, e.g. $d_{max} = N_{net}\left(1 - \dfrac{n_{epoch}}{n_{tot}}\right)$:
– at the beginning of learning (nepoch = 1), dmax = Nnet and the correction covers the entire network;
– at the end of learning (nepoch = ntot), dmax = 0 and the correction is limited to the selected neuron We.
(Figure: various neighbourhood functions a(dj) plotted against the topological distance.)
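Putting the three steps together, a minimal NumPy sketch of the training loop (the square map, the triangular neighbourhood function and the linearly shrinking schedules are one concrete choice; the exact schedules on the slides may differ):

```python
import numpy as np

def train_som(X, n_rows=10, n_cols=10, n_epochs=50, eta0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n_vars = X.shape[1]
    W = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_rows, n_cols, n_vars))
    grid = np.stack(np.meshgrid(np.arange(n_rows), np.arange(n_cols),
                                indexing="ij"), axis=-1)   # (row, col) of each neuron
    for epoch in range(n_epochs):
        frac = 1.0 - epoch / n_epochs              # shrinks from 1 towards 0
        d_max = frac * max(n_rows, n_cols)         # shrinking neighbourhood
        eta = eta0 * frac + 0.01                   # shrinking learning rate
        for x in X[rng.permutation(len(X))]:
            # Step 1: the "excited" neuron has the smallest distance to x
            dist2 = ((W - x) ** 2).sum(axis=-1)
            winner = np.unravel_index(np.argmin(dist2), dist2.shape)
            # topological distance of every neuron from the winner
            d_top = np.abs(grid - np.array(winner)).max(axis=-1)
            # Steps 2-3: triangular neighbourhood a(d) = 1 - d / (d_max + 1)
            a = np.clip(1.0 - d_top / (d_max + 1.0), 0.0, None)
            W += eta * a[..., None] * (x - W)
    return W
```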
Weight maps
Correction of the weights in the neurons Wj is applied to all levels of weights. Because the weights wji in the neurons of the Kohonen network are aligned into levels according to the order of the input variables xi, each level of weights is influenced by only one variable and thus forms a map of weight values, or weight map.
(Diagram: the input object Xs with variables x1-x8 enters the Kohonen network; the weights are organised in eight levels, one per input variable, and each level (e.g. weight level 2 or weight level 6) can be displayed as a map of the individual weights.)
Weight maps 2
(Figures: the weight map of variable x2, the normalised binder concentration, with values ranging from 0.03 to 0.99 across the map; and the top map showing, for each cell, the overall quality of the input object Xs, a recipe for a paint-coat. Only three of the variables are shown: solvent x1, binder x2 and pigment x3. The labels A, B and C refer to excellent, passable and below-standard quality, respectively.)
The most important feature of the Kohonen network is the fact that all weights in all neurons, regardless of whether they were excited during the training or not, bear valuable information.
Counterpropagation networks
By the addition of an identical layer of weights, to which the targets are input, the Kohonen network is transformed into the counterpropagation network. The additional layer (the output or Grossberg layer) has the same number and the same layout of neurons as the first (Kohonen) layer; however, the neurons of the two layers have different numbers of weights.
(Diagram: the input X = (x1, ..., x8) selects the winning neuron We in the Kohonen layer; the target T = (t1, ..., t4) is fed into the Grossberg (output) layer at the same position.)
Kohonen layer update: $\Delta w_{ji}^{K} = \eta\, a(d_j)\,\bigl(x_{si} - w_{ji}^{K,old}\bigr)$
Grossberg (output) layer update: $\Delta w_{ji}^{G} = \eta\, a(d_j)\,\bigl(t_{si} - w_{ji}^{G,old}\bigr)$
During the training, the target values T associated with each object X are input into the output layer in exactly the same manner as the objects X are input into the Kohonen layer. During the retrieval (prediction), the weights $w_{ei}^{output}$ in the output (Grossberg) layer selected by the winning neuron of the input object X (in the Kohonen layer) are used as the predictions.
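During retrieval the counterpropagation network is essentially a look-up table; a minimal sketch (W_K and W_G are illustrative names for the trained Kohonen and Grossberg weight arrays, organised as map rows × map columns × number of weights):

```python
import numpy as np

def cpn_predict(x, W_K, W_G):
    # The winner is selected in the Kohonen layer from the input x ...
    dist2 = ((W_K - x) ** 2).sum(axis=-1)
    winner = np.unravel_index(np.argmin(dist2), dist2.shape)
    # ... and the Grossberg weights at the same map position are the prediction
    return W_G[winner]
```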
Counterpropagation networks 2
Each counterpropagation network can be used as a direct and as an inverse “model”.
(Diagram: in the direct model the object X is input to the Kohonen layer and the prediction of the target is read from the Grossberg layer; in the inverse model a target T is input to the Grossberg layer and the corresponding X is read from the Kohonen layer.)
Supervised Kohonen architectures
• Counterpropagation networks are semi-supervised architectures, as the value of the Y vector doesn’t drive the selection of the winning neuron and, as a consequence,
the direction of the training.
• There are other proposed architectures that are truly supervised and that can be used to build classification models:
• Supervised Kohonen networks
• XY-fused network
• Bidirectional Kohonen networks
Supervised Kohonen networks
• X and Y variables are concatenated to train the network as in the standard Kohonen architecture.
• After training, the two blocks are separated and prediction occurs as in counterpropagation.
XY-fused networks
• The winning neuron is decided by considering a similarity function which is a weighted sum of the similarity in the X space and in the Y space.
• The parameter α starts with a high value and then decreases, so that at the end only similarity in the Y space contributes, while at the beginning the X space is
dominant.
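A common formulation of the fused similarity, with S(·,·) a similarity (or distance) measure in the corresponding space and α(t) decreasing during training, is:

$S_{fused}(i) = \alpha(t)\,S(\mathbf{x}_s, \mathbf{w}^{X}_{i}) + \bigl(1 - \alpha(t)\bigr)\,S(\mathbf{y}_s, \mathbf{w}^{Y}_{i})$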
Bidirectional Kohonen networks
• The concept of BDK networks is similar to that of XY-fused, combining similarity in both spaces to update the weights.
• In BDK, however, X and Y weights are updated in an alternate fashion:
• The parameter α starts with a high value and then decreases, so that in the beginning similarity in the Y space governs the selection of the winning neuron in the X space, while similarity in the X space determines the winning neuron in the Y space.
The fortunes and misfortunes of NNs
• Despite their increasing popularity up to the beginning of the 2000s, interest in neural networks then seemed to vanish:
– Curse of dimensionality
– Inefficient learning algorithms
– Lack of interpretability of the models
– Performances heavily dependent on the choice of data representation (or
features) on which they are applied.
• Much of the actual effort in deploying machine learning algorithms goes into the design of preprocessing pipelines and data transformations that result in a representation of the data that can support effective machine learning.
Representation learning
• Learning representations of the data that make it easier to
extract useful information when building classifiers or other
predictors
– Captures the posterior distribution of the underlying explanatory
factors for the observed input.
– Is useful as input to a supervised predictor.
• DEEP LEARNING:
– composition of multiple non-linear transformations, with the goal of
yielding more abstract – and ultimately more useful – representations
• Fundamental questions:
– What makes one representation better than another?
– Given an example, how should we compute its representation, i.e.
perform feature extraction?
– What are appropriate objectives for learning good representations?
Representation learning: “chemometric” concepts
• Representations → convenient to express many general (not task-
specific) priors that are likely to be useful for a learning machine to
solve AI-tasks.
• The revival experienced by NNs in recent years has much to do with such priors:
– Their absence was one of the main reasons for the vanishing interest towards NNs in the 2000s
– They share much in common with essential ideas which make chemometric
representations useful and versatile
– Possibility of mutual benefit between the disciplines
• Some of these will be briefly discussed in the following
The general priors of representation learning
• Smoothness:
– 𝑥1 ≈ 𝑥2 ⇒ 𝑓(𝑥1 ) ≈ 𝑓(𝑥2 )
– Insufficient to circumvent the curse of dimensionality
• Multiple explanatory factors:
– Data distribution generated by different underlying factors
– What one learns about one factor generalizes in many configurations
of the other factors
– The objective is to recover or at least disentangle these underlying
factors of variation
• Hierarchy:
– The features that are useful for describing the world around us can be
defined in terms of other features, in a hierarchy
– More abstract concepts higher in the hierarchy are defined in terms of
less abstract ones → Deep learning
The general priors of representation learning - 2
• Semi-supervised learning:
– A subset of the factors explaining X’s distribution explain much of Y,
given X.
– Representations that are useful for P(X) tend to be useful when learning
P(Y|X)
– Sharing of statistical strength between the unsupervised and supervised
learning tasks
• Manifolds:
– Probability mass concentrates near regions that have a much smaller
dimensionality than the original space where the data lives.
• Sparsity:
– For any given x, only a small fraction of the possible factors are relevant.
– Features that are often zero or insensitive to small variations of x.
– Priors on latent variables (peaked at 0), or by a nonlinearity whose value
is often flat at 0 (e.g., ReLU)
Building deep representations
And two corresponding architectures
Convolutional Neural Networks
• Designed to process data that come in the form
of multiple arrays:
– a colour image composed of three 2D arrays
containing pixel intensities in the three colour channels.
– 1D for signals and sequences
• There are four key ideas behind ConvNets that take advantage of
the properties of natural signals:
– local connections
– shared weights
– pooling
– the use of many layers.
Convolutional Neural Networks
• The architecture is structured as a series of stages.
• The first few stages are composed of two types of layers:
convolutional layers and pooling layers.
– Units in a convolutional layer are organized in feature maps, within which
each unit is connected to local patches in the feature maps of the previous
layer through a set of weights called a filter bank.
The convolutional concept
• Convolution extracts features from the input image (data)
• Preserves the spatial relationship
between pixels
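A minimal sketch of the operation on a single 2D array with a single filter ("valid" convolution, no padding or stride handling; as in most CNN implementations the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the filter over the image and take the weighted sum at each position
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```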
And for multiple-layered inputs
• Filtering proceeds in parallel across the depth of the image
Going nonlinear
• A nonlinear operation is added on top of the convolution.
• Usually it is carried out by means of a Rectified Linear Unit (ReLU), but one could also use a sigmoid or a hyperbolic tangent.
• ReLU is an element-wise operation (applied per pixel to the activation maps) and replaces all negative pixel values in the feature map by zero:
ℎ = max(0, 𝑠)
Compressing (Pooling)
• Spatial Pooling reduces the dimensionality of each feature map but retains the
most important information.
• Can be of different types: Max, Average, Sum etc.
• Define a spatial neighborhood (for example, a 2×2 window) and:
– take the largest element from the rectified feature map within that window (Max pooling)
– Take the average (Average Pooling) or sum of all elements in that window
– Max Pooling has been shown to work better.
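A minimal sketch of non-overlapping max pooling on a single feature map (a 2×2 window by default):

```python
import numpy as np

def max_pool(feature_map, size=2):
    # Keep the largest value in each non-overlapping size x size window
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    trimmed = feature_map[:H2 * size, :W2 * size]
    return trimmed.reshape(H2, size, W2, size).max(axis=(1, 3))
```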
Compressing (Pooling) - 2
• Pooling is applied separately to each feature map
• The function of Pooling is to progressively reduce the spatial size of the input
representation:
– makes the input representations (feature dimension) smaller and more manageable
– reduces the number of parameters and computations in the network, therefore,
controlling overfitting
– makes the network invariant to small transformations, distortions and translations in the input
image (a small distortion in input will not change the output of Pooling – since we take the
maximum / average value in a local neighborhood).
– gives an almost scale-invariant representation of our image (the exact term is "equivariant"), which helps detect objects in an image no matter where they are located.
Wrapping up
Generative Topographic Mapping (GTM)
Eric Latrille, LBE, INRAE (France)
Once the fitting is done, each item from the data space is projected to a 2D
latent grid of K nodes.
Bishop CM, Svensén M, and Williams CKI (1998) GTM: The Generative
Topographic Mapping. Neural Comput 10:215–234.
https://doi.org/10.1162/089976698300017953
GTM is a probabilistic extension of SOM where log-likelihood is utilized as an
objective function.
The manifold used to bind a data point t* in the data space and its projection
x* in the latent space is described by a set of M Radial Basis Function
centers (RBF; Gaussian functions are generally used).