Statistical Mechanics of On-line Learning

Michael Biehl (1), Nestor Caticha (2), and Peter Riegler (3)

(1) University of Groningen, Inst. of Mathematics and Computing Science, P.O. Box 407, 9700 AK Groningen, The Netherlands
(2) Instituto de Física, Universidade de São Paulo, CP 66318, CEP 05315-970, São Paulo, SP, Brazil
(3) Fachhochschule Braunschweig/Wolfenbüttel, Fachbereich Informatik, Salzdahlumer Str. 46/48, 38302 Wolfenbüttel, Germany

Abstract. We introduce and discuss the application of statistical physics concepts in the context of on-line machine learning processes. The consideration of typical properties of very large systems allows one to perform averages over the randomness contained in the sequence of training data. It yields an exact mathematical description of the training dynamics in model scenarios. We present the basic concepts and results of the approach in terms of several examples, including the learning of linearly separable rules, the training of multilayer neural networks, and Learning Vector Quantization.

1 Introduction
The perception that Statistical Mechanics is an inference theory has opened the possibility of applying its methods to several areas outside the traditional realms of Physics. This explains the interest that part of the Statistical Mechanics community has taken, during the last decades, in machine learning and optimization, and the application of several techniques from the Statistical Mechanics of disordered systems to areas of Computer Science.
As an inference theory, Statistical Mechanics is a Bayesian theory. Bayesian methods typically demand extensive computational resources, which probably explains why methods that were used by Laplace were almost forgotten in the following century. The current availability of ubiquitous and powerful computational resources has promoted the diffusion of these methods in statistics. For example, posterior averages can now be rapidly estimated by using efficient Markov Chain Monte Carlo methods, which started to be developed to attack problems in Statistical Mechanics in the middle of the last century. Of course, the drive to develop such techniques (starting with [1]) was due to the total impossibility of applying classical statistical methods to thermodynamic problems. Several other techniques of Statistical Mechanics have also found their way into statistics. In this paper we review some applications of Statistical Mechanics to artificial machine learning.
Learning is the change induced by the arrival of information. We are interested in learning from examples. Different scenarios arise if examples are considered in batches or just one at a time. The latter case is the on-line learning scenario, and the aim of this contribution is to present the characterization of on-line learning using methods of Statistical Mechanics. In section 2 we consider simple linearly separable classification problems, just to introduce the idea. The dynamics of learning is described by a set of stochastic difference equations which show the evolution, as information arrives, of quantities that characterize the problem and that in Statistical Mechanics are called order parameters. Taking the limit of large dimensions and scaling the number of examples in an appropriate manner, the difference equations simplify into deterministic ordinary differential equations. Numerical solutions of these ODE then give the learning curves for a specific model of the available information, such as the distribution of examples.
While this large-dimension limit may seem artificial, it must be stressed that it is most sensible in the context of thermostatistics, where Statistical Mechanics is applied to obtain thermodynamic properties of physical systems. There the dimension, which is the number of degrees of freedom, is of the order of Avogadro's number (≈ 10^23). Simulations, however, have to be carried out for much smaller systems, and this prompted the study of finite-size corrections to expectations of experimentally relevant quantities; these corrections depend on several factors but typically go to zero algebraically with the dimension. If the interest lies in intensive quantities such as temperature, pressure or chemical potential, the corrections are negligible. If one is interested in extensive quantities, such as energy, entropy or magnetization, one has to deal with their densities, such as the energy per degree of freedom; again, the errors due to the infinite-size limit are negligible. In this limit, central limit theorems apply, resulting in deterministic predictions, and the theory is in the realm of thermodynamics. Thus this limit is known as the thermodynamic limit (TL). For inference problems we can use the same type of theory to control finite-size effects in the reverse direction: we calculate in the easier, deterministic infinite limit and control the error made by taking the limit. We mention this and give references, but will not deal with this problem except by noticing that it is theoretically understood and that simulations of stylized models, necessarily done in finite dimensions, agree well with the theoretical results in the TL. The reader might consider the thermodynamic limit as the analogue of the limit of infinite sequences in Shannon's channel theorem.
Statistical Mechanics (see e.g. [2,3]) had its origins in the second half of the XIX century, in an attempt, mainly due to Maxwell, Boltzmann and Gibbs, to deduce the macroscopic thermodynamic laws from the microscopic dynamical properties of atoms and molecules. Its success in building a unified theoretical framework that applied to a large variety of experimental setups was one of the great achievements of Physics in the XIX century. A measure of its success can be seen in its role in the discovery of Quantum Mechanics. Late in the XIX century, when its application to a problem where the microscopic laws involved electromagnetism, i.e. the problem of Black Body radiation, showed irreconcilable results with experiment, Max Planck showed that the problem lay with the classical laws of electromagnetism and not with Statistical Mechanics. This started the revolution that led to Quantum Mechanics.
This work is organized as follows. In section 2 we introduce the method and the main ideas in a simple model. Section 3 looks into the extension of on-line methods to the richer case of two-layered networks, which include universal approximators. In section 4 we present the latest advances in the area, which deal with the theoretical characterization of clustering methods such as Learning Vector Quantization (LVQ).
This paper is far from giving a complete overview of this successful approach to machine learning. Our intention is to illustrate the basic concepts in terms of selected examples, mainly from our own work. Also, the references are by no means complete and serve merely as a starting point for the interested reader.

2 On-line learning in Classifiers: Linearly separable case

We consider a supervised classification problem, where vectors ξ ∈ R^N have to be classified into one of two classes, which we label by +1 and −1, respectively. These vectors are drawn independently from a probability distribution P_0(ξ). The available information is in the form of example pairs of vector and label: (ξ^μ, σ_B^μ), μ = 1, 2, .... The scenario of on-line learning is defined by the fact that we take into account one pair at a time, which permits us to identify μ as a time index. We also restrict our attention to the simple case where each example is used once to induce some change in our machine and is then discarded. While this seems quite inefficient, since recycling examples to extract more information can indeed be useful, it permits the development of a simple theory due to the assumption of independence of the examples. The recycling of examples can also be treated [4], but it needs a repertoire of techniques that is beyond the scope of this review. For many simple cases on-line learning will be seen to be quite efficient.
As a measure of the efficiency of the learning algorithm, we will concentrate on the generalization error, which is the probability of making a classification error on a new, statistically independent example ξ^{μ+1}. If any generalization is at all possible, there must of course be an underlying rule generating the example labels, which is either deterministic,

    σ_B = f_B(ξ),   (1)

or described by the conditional probability P(σ_B | f_B(ξ)), depending on a transfer function f_B parameterized by a set of K unknown parameters B. At this point we take B to be fixed in time, a constraint that can be relaxed and still be studied within the theory, see [5,6,7,8].
Learning is the compression of information from the set of example pairs into a set of M weights J ∈ R^M, and our machine classifies according to

    σ_J = g_J(ξ).   (2)

The generalization error is

    e_G(μ) = ∫ dP_0(ξ) ∏_{ν=1,…,μ} ∫ dP_0(ξ^ν) Σ_{σ^ν} P(σ^ν | f_B(ξ^ν)) Θ(−σ_J(ξ) σ_B(ξ))
           = ⟨ Θ(−σ_J(ξ) σ_B(ξ)) ⟩_{{ξ^ν, σ^ν}_{ν=1,…,μ}, ξ},   (3)

where the step function Θ(x) = 1 for x > 0 and zero otherwise. As it stands, it is impossible to obtain results other than of a general nature; to obtain sharp results we have to specify a model.
The transfer functions f_B and g_J specify the architectures of the rule and of the classifier, while P(σ | f_B(ξ)) models possible noise in the specification of the supervision label. The simplest case that can be studied is where both f_B and g_J are linearly separable classifiers of the same dimension, K = M = N:

    σ_J = sign(J · ξ),  σ_B = sign(B · ξ).   (4)

As simple and artificial as it may be, the study of this special case serves several purposes and is a stepping stone into more realistic scenarios.
An interesting feature of Statistical Mechanics is that it points out the relevant order parameters of a problem. In physics, these typically turn out to be the quantities of experimental interest.
Without any loss of generality we can take all example vectors to be normalized as ξ · ξ = N and the rule vector as B · B = 1. For J, however, which is a dynamical quantity that evolves under the learning algorithm still to be specified, we leave the length free and call J · J = Q. Define the fields

    h = J · ξ,  b = B · ξ.   (5)
To advance further we choose a model for the distribution of examples P_0(ξ), and the natural starting point is a uniform distribution over the N-dimensional sphere. Different choices, modeling specific situations, are of course possible. Under these assumptions, since the fields (5) are sums of many random variables, for large N, h and b are correlated Gaussian variables, completely characterized by

    ⟨h⟩ = ⟨b⟩ = 0,  ⟨h²⟩ = Q,  ⟨b²⟩ = 1,  ⟨hb⟩ = J · B = R.   (6)

It is useful to introduce the overlap ρ = R/√Q between the rule and machine parameter vectors. In terms of the normalized field ĥ = h/√Q, the joint distribution is given by

    P(ĥ, b) = 1/(2π √(1 − ρ²)) · exp( − (ĥ² − 2ρ ĥ b + b²) / (2 (1 − ρ²)) ).   (7)

The correlation is the overlap ρ, which is related to the angle θ between J and B via θ = cos⁻¹ ρ; it follows that |ρ| ≤ 1. It is geometrically intuitive and also easy to prove that the probability of making an error on an independent example, the generalization error, is θ/π:

    e_G = (1/π) cos⁻¹ ρ.   (8)
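Equation (8) is easy to check numerically: sampling the correlated Gaussian pair of Eq. (7) and counting sign disagreements between the fields reproduces cos⁻¹(ρ)/π. A minimal sketch in Python (the value ρ = 0.8 is an arbitrary illustration of ours):

    import numpy as np

    rho = 0.8                                  # example overlap, |rho| <= 1
    rng = np.random.default_rng(0)
    b = rng.standard_normal(1_000_000)
    # normalized field h/sqrt(Q): jointly Gaussian with b, <hb> = rho, cf. Eq. (7)
    h = rho * b + np.sqrt(1 - rho**2) * rng.standard_normal(b.size)

    e_mc = np.mean(np.sign(h) != np.sign(b))   # frequency of sigma_J != sigma_B
    e_th = np.arccos(rho) / np.pi              # Eq. (8)
    print(e_mc, e_th)                          # both approximately 0.205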
The strategy now is to introduce a learning algorithm, i.e. to define the change that the inclusion of a new example causes in J, calculate the change in the overlap ρ, and then obtain the learning curve for the generalization error. We will consider learning algorithms of the form

    J^{μ+1} = J^μ + (F/N) ξ^{μ+1},   (9)

where F, called the modulation function, should depend on the supervised information, the label σ_B^{μ+1}. It may very well depend on some additional information carried by hyperparameters or on J^μ itself. It is F that defines the learning algorithm. We consider the case where F is a scalar function, but it could differ for different components of ξ. Projecting (9) onto B and onto J we obtain, respectively,

    R^{μ+1} = R^μ + (F/N) b^{μ+1},   (10)

    Q^{μ+1} = Q^μ + 2 (F/N) h^{μ+1} + F²/N,   (11)
which describe the learning dynamics. We can also write an equivalent equation for the overlap ρ, valid for large N and for examples on the hypersphere:

    ρ^{μ+1} = J^{μ+1} · B / √(Q^{μ+1})
            = ρ^μ ( 1 − F h^{μ+1}/(N Q) − (1/(2N)) (F/√Q)² ) + F b^{μ+1}/(N √Q),   (12)

    ρ^{μ+1} − ρ^μ = (F/(N√Q)) ( b^{μ+1} − ρ h^{μ+1}/√Q ) − ρ F²/(2NQ).   (13)

Since at each time step a random vector ξ is drawn from the distribution P_0, equations (12) and (13) are stochastic difference equations. We now take the thermodynamic limit N → ∞ and average over the test example ξ. Note that each example induces a change of the order parameters of order 1/N. Hence, one expects the need of order N many examples to create a change of order 1. This prompts the introduction of α = lim_{N→∞} μ/N, which by measuring the number of examples measures time. The order parameters and their changes behave very differently in this limit. It can be shown (see [9]) that order parameters such as ρ, R, Q self-average: their fluctuations tend to zero in this limit, see Fig. 3 for an example. On the other hand, the rescaled change N(ρ^{μ+1} − ρ^μ) has fluctuations of order one, and we look at its average over the test vector:

    dρ/dα = N ⟨ ρ^{μ+1} − ρ^μ ⟩_{h,b,σ};   (14)

the pairs (Q, dQ/dα) and (R, dR/dα) behave in a similar way. We average over the fields h, b and over the labels σ, using P(σ|b). This leads to a coupled system of ordinary differential equations, which for a particular form of F were introduced in [10]:

    dR/dα = Σ_σ ∫ dh db P(h, b) P(σ|b) F b = ⟨F b⟩,   (15)

    dQ/dα = Σ_σ ∫ dh db P(h, b) P(σ|b) [ 2 F h + F² ] = ⟨2 F h + F²⟩,   (16)

where the angular brackets stand for the average over the fields and the label. Since the generalization error is directly related to ρ, it will be useful to look at the equivalent set of equations for ρ and for the length √Q of the weight vector:

    dρ/dα = Σ_σ ∫ dh db P(h, b) P(σ|b) [ (F/√Q) ( b − ρ h/√Q ) − ρ F²/(2Q) ],   (17)

    d√Q/dα = Σ_σ ∫ dh db P(h, b) P(σ|b) (1/√Q) [ F h + F²/2 ].   (18)

We took the average with respect to the two fields h and b as if they stood on symmetrical grounds, but they don't: it is reasonable to assume knowledge of h and σ, but not of b. Making this explicit,

    dρ/dα = Σ_σ ∫ dh P(h) P(σ|h) [ (F/√Q) ⟨ b − ρ h/√Q ⟩_{b|h,σ} − ρ F²/(2Q) ],   (19)

    d√Q/dα = Σ_σ ∫ dh P(h) P(σ|h) (1/√Q) [ F h + F²/2 ].   (20)

Call

    F* = (√Q/ρ) ⟨ b − ρ h/√Q ⟩_{b|h,σ},   (21)

where the average is over the unavailable quantities. Equation (19) can then be written as

    dρ/dα = (ρ/Q) Σ_σ ∫ dh P(h) P(σ|h) [ F F* − F²/2 ].   (22)

This notation makes it natural to ask for an interpretation of the meaning of F*. The differential equations above describe the dynamics for any modulation function. We can ask ([11]) whether there is a modulation function that is optimal in the sense of maximizing the information gain per example, which can be stated as a variational problem:

    δ/δF ( dρ/dα ) = 0.   (23)

This problem can be solved for a few architectures, including some networks with hidden layers [12,16], although the optimization becomes much more difficult. Within the class of algorithms we have considered, the performance bound is given by the modulation function F = F* of Eq. (21).
The resulting optimal algorithm has several features that are useful in considering practical algorithms, such as an optimal annealing schedule of the learning rate and, in the presence of noise, adaptive cutoffs for surprising examples.
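To make the machinery concrete, consider the simplest modulation function, Hebbian learning with F = η σ_B. The averages in (15, 16) are then elementary, ⟨F b⟩ = η √(2/π) and ⟨F h⟩ = η R √(2/π), and for the initial conditions R(0) = Q(0) = 0 the ODEs integrate to ρ(α) = √(α/(α + π/2)), independently of η. The following sketch (Python; N = 2000 and η = 1 are illustrative choices of ours) simulates the microscopic update (9) at finite N and compares the measured generalization error with this ODE prediction:

    import numpy as np

    def hebb_learning_curve(N=2000, alpha_max=20, eta=1.0, seed=0):
        """Monte Carlo sketch of on-line Hebbian learning, F = eta * sigma_B."""
        rng = np.random.default_rng(seed)
        B = rng.standard_normal(N)
        B /= np.linalg.norm(B)                  # teacher normalization B.B = 1
        J = np.zeros(N)                         # tabula rasa: R(0) = Q(0) = 0
        curve = []
        for mu in range(1, alpha_max * N + 1):
            xi = rng.standard_normal(N)         # |xi|^2 ~ N, as on the sphere
            sigma_B = np.sign(B @ xi)           # noise-free label, Eq. (4)
            J += (eta / N) * sigma_B * xi       # update (9) with F = eta * sigma_B
            if mu % N == 0:                     # record once per unit of alpha
                rho = (J @ B) / np.linalg.norm(J)
                curve.append((mu / N, np.arccos(rho) / np.pi))
        return curve

    for alpha, eg in hebb_learning_curve():
        eg_ode = np.arccos(np.sqrt(alpha / (alpha + np.pi / 2))) / np.pi
        print(f"alpha={alpha:5.1f}  e_G(sim)={eg:.4f}  e_G(ODE)={eg_ode:.4f}")

The agreement improves with N, in line with the self-averaging property mentioned above.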
As an example we discuss learning of a linearly separable rule (4) in some detail. There, an example will be misclassified if σ_J σ_B < 0 or, equivalently, if h σ_B < 0. We will refer to the latter quantity as the aligned field: it basically measures the correctness (h σ_B > 0) or wrongness (h σ_B < 0) of the current classification.
Fig. 1 depicts the optimal modulation function for inferring a linearly separable rule from linearly separable data. Most interesting is the dependence on ρ = cos(π e_g): for ρ = 0 (e_g = 1/2) the modulation function does not take into account the correctness of the current examples; it is constant, which corresponds to Hebbian learning. As ρ increases, however, the magnitude of the modulation function for already correctly classified examples decreases with increasing aligned field. For misclassified examples, on the other hand, the update becomes the larger the smaller the aligned field is. In the limit ρ → 1 (e_g → 0) the optimal modulation function approaches the Adatron algorithm ([13]), where only misclassified examples trigger an update, which is proportional to the aligned field. In addition, for the optimal modulation function ρ(α) = √(Q(α)), i.e. the order parameter Q can be used to estimate ρ.
Now imagine that the label of linearly separable data is noisy, i.e. it is flipped with a certain probability. In this case it would be very dangerous to follow an Adatron-like algorithm and perform a large update whenever h σ_B < 0, since the apparent misclassification might be due to a corrupted label. The optimal modulation function for that case perceives this danger and introduces a sort of cutoff with respect to the aligned field: no considerable update is performed if the aligned field becomes too large in magnitude. Fig. 1 shows the general behavior; [14] gives further details and an extension to other scenarios of noisy but otherwise linearly separable data sets.
These results for optimal modulation functions are, in general, better understood from a Bayesian point of view ([17,18,19]). Suppose the knowledge about the weight vector is coded in a Gaussian probability density. As a new example arrives, this probability density is used as a prior distribution. The likelihood is built out of the knowledge of the network architecture, of the noise process that may be corrupting the label, and of the example vector and its label. The new posterior is, in general, not Gaussian, and a new Gaussian is chosen to be the prior for the next learning step, in such a way as to minimize the information loss. This is done by the maximum entropy method or, equivalently, by minimizing the Kullback-Leibler divergence. It follows that the learning of one example induces a mapping from one Gaussian to another, which can be described
by update equations for the Gaussian's mean and covariance. These equations define a learning algorithm together with a tensor-like adaptive schedule for the learning rate. These algorithms are closely related to the variational algorithm defined above. The Bayesian method has the advantage of being more general; it can also be readily applied to other cases where the Gaussian prior is not adequate [19].

Fig. 1. Left: optimal modulation function F for learning a linearly separable classification of type (4), for various values of the order parameter ρ. Right: in addition to (4), the labels σ_B are subject to noise, i.e. they are flipped with probability λ. The aligned field h σ_B is a measure of the agreement of the classification of the current example with the given label. In both cases, all examples are weighted equally for ρ = 0, irrespective of the value of the aligned field h σ_B; this corresponds to Hebbian learning. In the noiseless case (λ = 0), examples with a negative aligned field receive more weight as ρ increases, while those with positive aligned field gain less weight. For λ > 0, examples with large negative aligned fields are also not taken into account for updating; these examples are most likely affected by the noise and therefore are only deceptively misclassified. The optimal weight function possesses a cutoff at negative values of h σ_B, which decreases in absolute value with increasing ρ and λ.
In the following sections we follow the developments of these techniques in other directions. We first look into networks with hidden layers, since the experience gained in studying on-line learning in these more complex architectures will be important to better understand the application of on-line learning to the problem of clustering.

3 On-line Learning in Two-Layered Networks

Classifiers such as (4) are the most fundamental learning networks. In terms of network architecture, the next generalization is networks with one layer of hidden units and a fixed hidden-to-output relation. An important example is the so-called committee machine, where the overall output is determined by a majority vote of several classifiers of type (4). For such networks the variational approach of the previous section can be readily applied [12,16].
General two-layered networks, however, have variable hidden-to-output weights and are soft classifiers, i.e. they have continuous transfer functions. They consist of a layer of N inputs, a layer of K hidden units, and a single output. The particular input-output relation is given by

    σ(ξ) = Σ_{i=1}^{K} w_i g(J_i · ξ).   (24)

Here, J_i denotes the N-dimensional weight vector of the i-th input branch and w_i the weight connecting the i-th hidden unit with the output. For a (soft) committee machine, w_i ≥ 0 for all branches i. Often, the number K of hidden weight vectors J_i is chosen as K ≪ N; in fact, most analyses specialize to K = O(1). This restriction will also be pursued here. Note that the overall output is linear in the w_i, in contrast to the outputs of the hidden layer, which in general depend nonlinearly on the weights J_i via the transfer function g.
The class of networks (24) is of importance for several reasons: firstly, they can realize more complex classification schemes. Secondly, they are commonly used in practical applications. Finally, two-layered networks with a linear output are known to be universal approximators [20].
Analogously to section 2, the network (24) is trained by a sequence of uncorrelated examples {(ξ^μ, τ^μ)} which are provided by the environment according to an unknown rule τ(ξ). As above, the example input vectors are denoted by ξ^μ, while here τ^μ is the corresponding correct rule output.
In a commonly studied scenario the rule is provided by a network of the same architecture with hidden-layer weights B_k, hidden-to-output weights v_k, and an in general different number M of hidden units:

    τ(ξ) = Σ_{k=1}^{M} v_k g(B_k · ξ).   (25)
In principle, the network (24) can implement such a function if K ≥ M.
As in (9), the change of a weight is usually taken proportional to the input of the corresponding layer in the network, i.e.

    J_i^{μ+1} = J_i^μ + (1/N) F_i ξ^{μ+1},   (26)

    w_i^{μ+1} = w_i^μ + (1/N) F_w g(J_i^μ · ξ^{μ+1}).   (27)

Again, the modulation functions F_i, F_w will in general depend on the recently provided information (ξ^μ, τ^μ) and on the current weights.
Note, however, that there is an asymmetry between the updates of J_i and w_i. The change of the former is O(1/N) due to |ξ|² = O(N). As Σ_{i=1}^{K} g²(J_i · ξ) = O(K), a change of the latter according to

    w_i^{μ+1} = w_i^μ + (1/K) F_w g(J_i^μ · ξ^{μ+1})   (28)

seems to be more reasonable. For reasons that will become clear below, we will prefer a scaling with 1/N as in (27) over a scaling with 1/K, at least for the time being.

Also note from (24, 25) that the stochastic dynamics of J_i and w_i only depends on ξ via the fields h_i = J_i · ξ and b_k = B_k · ξ, which can be viewed as a generalization of (5). As in section 2, for large N these become Gaussian variables. Here, they have zero means and correlations

    ⟨h_i h_j⟩ = J_i · J_j =: Q_ij,  ⟨b_k b_l⟩ = B_k · B_l =: T_kl,  ⟨h_i b_k⟩ = J_i · B_k =: R_ik,   (29)

where i, j = 1 … K and k, l = 1 … M.
Introducing α = μ/N as above, the discrete dynamics (26, 27) can be replaced by a set of coupled differential equations for R_ik, Q_ij, and w_i in the limit of large N. Projecting (26) onto B_k and J_j, respectively, and averaging over the randomness of ξ leads to

    dR_ik/dα = ⟨F_i b_k⟩,   (30)

    dQ_ij/dα = ⟨F_i h_j + F_j h_i + F_i F_j⟩,   (31)

where the average is now with respect to the fields {h_i} and {b_k}. Hence, the microscopic stochastic dynamics of the O(K·N) many weights J_i is replaced by the macroscopic dynamics of O(K²) many order parameters R_ik and Q_ij. Again, these order parameters are self-averaging, i.e. their fluctuations vanish as N → ∞; Fig. 3 exemplifies this for a specific dynamics.
The situation is somewhat different for the hidden-to-output weights w_i. In the transition from microscopic, stochastic dynamics to macroscopic, averaged dynamics, the hidden-layer weights J_i are compressed into order parameters which are scalar products, cf. (29). The hidden-to-output weights, however, are not compressed into new parameters of the form of scalar products. (Scalar products of the type Σ_i w_i v_i do not even exist for K ≠ M.) Scaling the update of w_i by 1/N as in (27) allows one to replace 1/N by the differential dα as N → ∞. Hence, the averaged dynamics of the hidden-to-output weights reads

    dw_i/dα = ⟨F_w g(h_i)⟩.   (32)

Note that the r.h.s. of these differential equations depends on R_ik, Q_ij via (29) and, hence, is coupled to the differential equations (30, 31) of these order parameters as well. So in total, the macroscopic description of the learning dynamics consists of the coupled set (30, 31, 32).
It might be surprising that the hidden-to-output weights w_i by themselves are appropriate for a macroscopic description while the hidden weights J_i are not. The reason for this is twofold. First, the number K of the w_i has been taken to be O(1), i.e. it does not scale with the dimension N of the inputs; therefore there is no need to compress a large number of microscopic variables into a small number of macroscopic order parameters, as for the J_i. Second, the change in w_i has been chosen to scale with 1/N. For this choice one can show that, like R_ik and Q_ij, the weights w_i are self-averaging.
For a given rule τ(ξ) to be learned, the generalization error is

    ε_g({J_i, w_i}) = ⟨ε({J_i, w_i})⟩_ξ,   (33)

where ε({J_i, w_i}) = ½ (σ − τ)² is the instantaneous error. As the outputs σ and τ depend on ξ only via the fields {h_i} and {b_k}, respectively, the average over ξ can be replaced by an average over these fields. Hence, the generalization error only depends on the order parameters R_ik, Q_ij, w_i as well as on T_kl and v_k.
In contrast to section 2, it is a difficult task to derive optimal algorithms by a variational approach, since the generalization error (33) is a function of several order parameters. Therefore, on-line dynamics in two-layer networks has mostly been studied in the setting of heuristic algorithms, in particular for backpropagation. There, the update of the weights is taken proportional to the gradient of the instantaneous error ε = ε({J_i, w_i}) = ½ (σ − τ)² with respect to the weights:

    J_i^{μ+1} = J_i^μ − (η_J/N) ∂ε/∂J_i,   (34)

    w_i^{μ+1} = w_i^μ − (η_w/N) ∂ε/∂w_i.   (35)

The parameters η_J and η_w denote learning rates which scale the size of the update along the gradient of the instantaneous error. In terms of (26, 27), backpropagation corresponds to the special choices F_i = η_J (τ − σ) w_i g′(h_i) and F_w = η_w (τ − σ). A common choice for the transfer function is g(x) = erf(x/√2). With this specific choice, the averages in the equations of motion (30, 31, 32) can be performed analytically for general K and M [22,23,24].
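As an illustration, the sketch below (Python; all parameter values are our own illustrative choices, and we fix w_i = v_i = 1 as a concrete instance of the "fixed w_i = v_i" scenario of Fig. 2) simulates the microscopic backpropagation update (34) for K = M = 2. Instead of the analytic order-parameter formulas, the generalization error (33) is simply estimated on a fixed test set; for moderate N, a plateau of the kind discussed below shows up in the printed learning curve:

    import numpy as np
    from scipy.special import erf

    g  = lambda x: erf(x / np.sqrt(2))                     # transfer function
    dg = lambda x: np.sqrt(2 / np.pi) * np.exp(-x**2 / 2)  # its derivative g'

    N, K, eta = 500, 2, 1.0
    rng = np.random.default_rng(1)

    B = np.linalg.qr(rng.standard_normal((N, K)))[0].T     # teacher rows: T_kl = delta_kl
    J = 1e-3 * rng.standard_normal((K, N))                 # nearly unspecialized student

    xi_test = rng.standard_normal((2000, N))               # test set for Eq. (33)
    tau = g(xi_test @ B.T).sum(axis=1)                     # teacher outputs, v_k = 1

    for mu in range(1, 300 * N + 1):
        xi = rng.standard_normal(N)
        h, b = J @ xi, B @ xi
        delta = g(h).sum() - g(b).sum()                    # sigma - tau on this example
        J -= (eta / N) * delta * np.outer(dg(h), xi)       # gradient step, Eq. (34)
        if mu % (15 * N) == 0:
            sigma = g(xi_test @ J.T).sum(axis=1)
            print(f"alpha={mu/N:5.0f}  eps_g ~ {0.5 * np.mean((sigma - tau)**2):.5f}")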
Independent of the particular choice of learning algorithm, a general problem in two-layered networks is caused by the inherent permutation symmetry: the i-th input branch of the adaptive network (24) does not necessarily specialize on the i-th branch of the network (25). Without loss of generality, however, one can relabel the dynamical variables as if this were indeed the case. Nevertheless, this permutation symmetry turns out to be a dominant feature, because it leads to a deceleration of learning due to plateau states in the dynamics of the generalization error. Fig. 2 gives an example.
These plateau states correspond to configurations which are very close to certain fixed points of the set of differential equations (30, 31, 32) for the order parameters. In the simplest case, the vectors J_i are almost identical during the plateau phase: apart from small deviations, they have the same scalar product with each vector B_k. These fixed points are repulsive, so small fluctuations will eventually cause a specialization of each J_i towards a distinct B_k, which then leads to minimum generalization error. If there are several such repulsive fixed points, there can even be cascades of plateaus. The lengths of the plateaus can be shown to depend crucially on the choice of initial conditions as well as on the dimension N of the inputs [25].
For backpropagation, the differential equations (30, 31, 32) can easily be used to determine those learning rates η_J and η_w which lead to the fastest decrease of the generalization error. This is most interesting for the learning rate η_w of the hidden-to-output weights, as it turns out that the decrease of ε_g(α) is largest for η_w → ∞.

Fig. 2. Time evolution of on-line backpropagation in the K = M = 2 learning scenario (24, 25, 34) with T_nm = δ_nm and fixed w_i = v_i. Left: generalization error ε_g(α). Right: order parameters R_in (full curves) and Q_ik (dotted curves). The plateaus in both graphs are due to the internal permutation symmetry of (25) w.r.t. the summation index.
Obviously, this divergence of η_w indicates that one should have chosen a different scaling for the change of the weights w_i, namely a scaling with 1/K as in (28), as opposed to (27). For such a scaling, however, the weights w_i will no longer be self-averaging, see Fig. 3. Hence, equations (30, 31, 32) fail to provide a macroscopic description of the learning dynamics in this case. This by no means signifies that they are inapplicable, however: the optimal choice η_w → ∞ simply indicates that the dynamics of the hidden-to-output weights w_i takes place on a much faster time scale than the one on which the self-averaging quantities R_ik and Q_ij change.
An appropriate method to deal with such situations is known as adiabatic elimination. It relies on the assumption that the mean value of the fast variable has a stable equilibrium value at each macroscopic time α. One obtains this equilibrium value from the zero of the r.h.s. of (32) with respect to w_i, i.e. by investigating the case dw_i/dα = 0. The equilibrium values for w_i are thus obtained as functions of R_ik and Q_ij and can be used to eliminate any appearance of w_i in (30, 31). See [21] for details.
The variational approach discussed in Sec. 2 has also been applied to the analysis of multilayered networks, in particular the soft committee machine [16,26]. Due to the larger number of order parameters, the treatment is much more involved than for the simple perceptron network. The investigations show that, in principle, it is possible to drastically reduce the length of the plateau states discussed above by using appropriate training prescriptions.
Fig. 3. Finite-size analysis of the order parameters R (circles), Q (squares), and w (diamonds; right panel) in the dynamics (34, 35) for the special case K = M = 1. Shown are the observed standard deviations as functions of 1/√N for a fixed value of α. Each point depicts an average taken over 100 simulation runs. As can be seen, R and Q become self-averaging in the thermodynamic limit N → ∞, i.e. their fluctuations vanish in this limit. In contrast, the fluctuations of w remain finite if one optimizes the learning rate η_w, which leads to the divergence η_w → ∞.

4 Dynamics of prototype-based learning


In all training scenarios discussed above, the consideration of isotropic, i.i.d. input data already yields non-trivial insights. The key information is contained in the training labels and, for modeling purposes, we have assumed that they are provided by a teacher network.
In practical situations one would clearly expect the presence of structures in input space, e.g. in the form of clusters which are more or less correlated with the target function. Here we briefly discuss how the theoretical framework has been extended in this direction. We will mainly address supervised learning schemes which detect or make use of structures in input space. Unsupervised learning from unlabeled, structured data has been treated along the very same lines but will not be discussed in detail here; we refer to, for instance, [27,28] for overviews and [34,35,38,43,45] for example results in this context.
We will focus on prototype-based supervised learning schemes which take label information into account. The popular Learning Vector Quantization algorithm [32] and many of its variants follow the lines of competitive learning; however, the aim is to obtain prototypes as typical representatives of their classes which parameterize a distance-based classification scheme.
LVQ algorithms can be treated within the same framework as above. The analysis requires only slight modifications due to the assumption of a non-trivial input density.
Several possibilities to model anisotropy in input space have been considered in the literature, a prominent example being unimodal Gaussians with distinct principal axes [30,29,34,35]. Here, we focus on another simple but non-trivial model density: we assume that feature vectors are generated independently according to a mixture of isotropic Gaussians,

    P(ξ) = Σ_{σ=±1} p_σ P(ξ|σ) with P(ξ|σ) = (2π v_σ)^{−N/2} exp( − (ξ − ℓ B_σ)² / (2 v_σ) ).   (36)

The conditional densities P(ξ | σ = ±1) correspond to clusters with variances v_σ centered at ℓ B_σ. For convenience, we assume that the vectors B_σ are orthonormal: B_σ² = 1 and B_+ · B_− = 0. The first condition only sets the scale on which the parameter ℓ controls the cluster offset. The orthogonality condition fixes the position of the cluster centers with respect to the origin in feature space, which could be chosen arbitrarily. Similar densities have been considered in, for instance, [31,37,38,39,40].

Fig. 4. Data generated according to the density (36) in N = 200 dimensions with example parameters p_− = 0.6, p_+ = 0.4, v_− = 0.64, and v_+ = 1.44. The open (filled) circles represent 160 (240) vectors from clusters centered about the orthonormal vectors B_+ (B_−) with ℓ = 1, respectively. The left panel displays the projections b_± = B_± · ξ of the data; diamonds mark the positions of the cluster centers. The right panel shows projections h_± = w_± · ξ of the same data onto a randomly chosen pair of orthogonal unit vectors w_±.
In the context of supervised Learning Vector Quantization, discussed below, we will assume that the target classification coincides with the cluster membership label σ of each vector ξ. Due to the significant overlap of the clusters, this task is obviously not linearly separable.
Note that linearly separable rules defined for bimodal input data similar to (36) have been studied in [36]. While the transient learning curves of the perceptron can differ significantly from the technically simpler case of isotropic inputs, the main results concerning the (α → ∞) asymptotic behavior persist.
Learning Vector Quantization (LVQ) is a particularly intuitive and powerful family of algorithms which has been applied to a variety of practical problems [41]. LVQ identifies prototypes, i.e. typical representatives of the classes in feature space, which then parameterize a distance-based classification scheme.
Competitive learning schemes have been suggested in which a set of prototype vectors is updated from a sequence of example data. We will restrict ourselves to the simplest non-trivial case of two prototypes w_+, w_− ∈ R^N and data generated according to a bimodal distribution of type (36).
Most frequently, a nearest-prototype scheme is implemented: for classification of a given input ξ, the distances

    d_s(ξ) = (ξ − w_s)²,  s = ±1,   (37)

are determined, and ξ is assigned to class +1 if d_+(ξ) ≤ d_−(ξ) and to class −1 otherwise. The (squared) Euclidean distance (37) appears to be a natural choice. In practical situations, however, it can lead to inferior performance, and the identification of an appropriate distance or similarity measure is one of the key issues in applications of LVQ.
A simple two-prototype system as described above parameterizes only a linearly separable classifier. However, we will consider learning of a non-separable rule, where non-trivial effects of the prototype dynamics can be studied already in this simple setting. Extensions to more complex models with several prototypes, i.e. piecewise-linear decision boundaries, and to multi-modal input densities are possible but non-trivial; see [45] for a recent example.
Generically, LVQ algorithms perform updates of the form

    w_s^{μ+1} = w_s^μ + (η/N) f(d_+, d_−, s, σ) (ξ − w_s^μ).   (38)

Hence, prototypes are either moved towards or away from the current input. Here, the modulation function f controls the sign and, together with an explicit learning rate η, the magnitude of the update.
So-called Winner-Takes-All (WTA) schemes update only the prototype which is currently closest to the presented input vector. A prominent example of supervised WTA learning is Kohonen's original formulation, termed LVQ1 [32,33]. In our model scenario it corresponds to Eq. (38) with

    f(d_+, d_−, s, σ) = Θ(d_{−s} − d_{+s}) s σ.   (39)

The Heaviside function singles out the winning prototype, and the product s σ = +1 (−1) if the labels of prototype and example coincide (disagree).
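Before turning to the formal analysis, the prescription (38, 39) is easily simulated directly. The sketch below (Python; the parameter values mimic those of Fig. 5, left panel, and the test-set estimate of ε_g merely stands in for the analytic result derived later):

    import numpy as np

    N, eta, ell = 200, 1.0, 1.2                  # illustrative values, cf. Fig. 5
    p_plus, v_plus, v_minus = 0.8, 1.0, 1.0
    rng = np.random.default_rng(0)
    B = {+1: np.eye(N)[0], -1: np.eye(N)[1]}     # orthonormal cluster centers B_sigma

    def draw(n):
        """n labeled examples from the mixture density (36)."""
        sig = np.where(rng.random(n) < p_plus, 1, -1)
        centers = np.where((sig == 1)[:, None], B[+1], B[-1])
        v = np.where(sig == 1, v_plus, v_minus)
        return sig, ell * centers + np.sqrt(v)[:, None] * rng.standard_normal((n, N))

    w = {s: rng.standard_normal(N) for s in (+1, -1)}
    for s in w:                                  # squared length Q_hat = 1e-4, cf. Eq. (44)
        w[s] *= 1e-2 / np.linalg.norm(w[s])

    sig_t, xi_t = draw(4000)                     # test set for estimating eps_g
    for mu in range(1, 200 * N + 1):
        sig, xi = draw(1)
        sig, xi = int(sig[0]), xi[0]
        s = +1 if np.sum((xi - w[+1])**2) <= np.sum((xi - w[-1])**2) else -1  # winner
        w[s] += (eta / N) * (s * sig) * (xi - w[s])   # LVQ1 update, Eqs. (38, 39)
        if mu % (40 * N) == 0:
            closer = ((xi_t - w[+1])**2).sum(axis=1) <= ((xi_t - w[-1])**2).sum(axis=1)
            print(f"alpha={mu/N:5.0f}  eps_g ~ {np.mean(np.where(closer, 1, -1) != sig_t):.3f}")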
For the formal analysis of the training dynamics, we can proceed in complete analogy to the previously studied cases of perceptron and layered neural networks. A natural choice of order parameters are the self- and cross-overlaps of the involved N-dimensional vectors:

    R_{sσ} = w_s · B_σ and Q_{st} = w_s · w_t with σ, s, t ∈ {−1, +1}.   (40)

While these definitions are formally identical with Eq. (29), the role of the reference vectors B_σ is not that of teacher vectors here.
Following the by now familiar lines, we obtain a set of coupled ODE of the form

    dR_{Sσ}/dα = η ( ⟨b_σ f_S⟩ − R_{Sσ} ⟨f_S⟩ ),

    dQ_{ST}/dα = η ( ⟨h_S f_T + h_T f_S⟩ − Q_{ST} ⟨f_S + f_T⟩ ) + η² Σ_{σ=±1} v_σ p_σ ⟨f_S f_T⟩_σ.   (41)

Here, averages ⟨…⟩ over the full density P(ξ), Eq. (36), have to be evaluated as appropriate sums over conditional averages ⟨…⟩_σ corresponding to ξ drawn from cluster σ:

    ⟨…⟩ = p_+ ⟨…⟩_+ + p_− ⟨…⟩_−.
For a large class of LVQ modulation functions, the actual input ξ appears on the right-hand side of Eq. (41) only through its length and the projections

    h_s = w_s · ξ and b_σ = B_σ · ξ,   (42)

where we omitted the index μ but implicitly assume that the input is uncorrelated with the current prototypes w_s. Note that also Heaviside terms as in Eq. (39) do not depend on ξ explicitly; for example,

    Θ(d_− − d_+) = Θ( +2(h_+ − h_−) − Q_{++} + Q_{−−} ).
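The identity is exact, as one checks by expanding the squared distances; a quick numerical verification (Python, arbitrary vectors):

    import numpy as np

    rng = np.random.default_rng(0)
    xi, wp, wm = rng.standard_normal((3, 50))           # input and prototypes w_+, w_-
    lhs = np.sum((xi - wm)**2) - np.sum((xi - wp)**2)   # d_- - d_+
    rhs = 2 * (xi @ wp - xi @ wm) - wp @ wp + wm @ wm   # 2(h_+ - h_-) - Q_++ + Q_--
    print(np.isclose(lhs, rhs))                         # True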

When performing the average over the actual example, we first exploit the fact that

    lim_{N→∞} ⟨ξ²⟩/N = v_+ p_+ + v_− p_−

in the thermodynamic limit. Furthermore, the joint Gaussian density P(h_+, h_−, b_+, b_−) can be expressed as a sum of contributions from the clusters. The respective conditional densities are fully specified by the first and second moments

    ⟨h_s⟩_σ = ℓ R_{sσ},  ⟨b_τ⟩_σ = ℓ δ_{τσ},
    ⟨h_s h_t⟩_σ − ⟨h_s⟩_σ ⟨h_t⟩_σ = v_σ Q_{st},
    ⟨h_s b_τ⟩_σ − ⟨h_s⟩_σ ⟨b_τ⟩_σ = v_σ R_{sτ},
    ⟨b_ρ b_τ⟩_σ − ⟨b_ρ⟩_σ ⟨b_τ⟩_σ = v_σ δ_{ρτ},   (43)

where s, t, σ, ρ, τ ∈ {+1, −1} and δ_{..} is the Kronecker delta. Hence, the density of the h_± and b_± is given in terms of the model parameters ℓ, p_σ, v_σ and the above-defined set of order parameters at the previous time step.
After working out the system of ODE for a specific modulation function, it can be integrated, at least numerically. Here we consider prototypes that are initialized as independent random vectors of squared length Q̂, with no prior knowledge about the cluster positions. In terms of order parameters, this implies in our simple model

    Q_{++}(0) = Q_{−−}(0) = Q̂,  Q_{+−}(0) = R_{Sσ}(0) = 0 for all S, σ.   (44)
As in any supervised scenario, the success of learning is to be quantified in terms of the generalization error. Here we have to consider the contributions of misclassified data from the clusters σ = 1 and σ = −1 separately:

    ε_g = p_+ ε_+ + p_− ε_− with ε_σ = ⟨ Θ(d_{+σ} − d_{−σ}) ⟩_σ.   (45)

Exploiting the central limit theorem in the same fashion as above, one obtains for the above contributions ε_σ:

    ε_σ = Φ( [ Q_{σσ} − Q_{−σ−σ} − 2 ℓ (R_{σσ} − R_{−σσ}) ] / [ 2 √v_σ √(Q_{++} − 2Q_{+−} + Q_{−−}) ] ),   (46)

where Φ(z) = ∫_{−∞}^{z} dx e^{−x²/2}/√(2π). By inserting {R_{Sσ}(α), Q_{ST}(α)} we obtain the learning curve ε_g(α), i.e. the typical generalization error after on-line training with αN random examples. Here, we once more exploit the fact that the order parameters and, thus, also ε_g are self-averaging, non-fluctuating quantities in the thermodynamic limit N → ∞.
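Equations (45, 46) turn any order-parameter configuration directly into a generalization error. A small sketch (Python; scipy's norm.cdf plays the role of Φ, and the dictionaries keyed by ±1 are our own bookkeeping convention):

    from math import sqrt
    from scipy.stats import norm

    def eps_g(R, Q, ell, p_plus, v_plus, v_minus):
        """Evaluate Eqs. (45, 46); R[(s, sigma)] = w_s.B_sigma, Q[(s, t)] = w_s.w_t."""
        var = Q[(+1, +1)] - 2 * Q[(+1, -1)] + Q[(-1, -1)]
        eps = {}
        for sigma, v in ((+1, v_plus), (-1, v_minus)):
            num = (Q[(sigma, sigma)] - Q[(-sigma, -sigma)]
                   - 2 * ell * (R[(sigma, sigma)] - R[(-sigma, sigma)]))
            eps[sigma] = norm.cdf(num / (2 * sqrt(v * var)))   # Phi(...), Eq. (46)
        return p_plus * eps[+1] + (1 - p_plus) * eps[-1]       # Eq. (45)

    # sanity check: prototypes sitting exactly on the cluster centers, w_s = ell * B_s
    ell = 1.0
    R = {(+1, +1): ell, (+1, -1): 0.0, (-1, -1): ell, (-1, +1): 0.0}
    Q = {(+1, +1): ell**2, (+1, -1): 0.0, (-1, -1): ell**2}
    print(eps_g(R, Q, ell, 0.5, 1.0, 1.0))   # Phi(-ell/sqrt(2)) ~ 0.2398

For equal priors and unit variances this reproduces the error of the symmetric linear boundary between two unit Gaussians at distance ℓ√2, as it should.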
As an example, we consider the dynamics of LVQ1, cf. Eq. (39). Fig. 5 (left panel) displays the learning curves obtained for a particular setting of the model parameters and different choices of the learning rate η. Initially, a large learning rate is favorable, whereas smaller values of η facilitate better generalization behavior at large α. One can argue that, as in stochastic gradient descent procedures, the best asymptotic ε_g will be achieved for small learning rates η → 0. In this limit, we can omit the terms quadratic in η from the differential equations and integrate them directly in the rescaled time (ηα). The asymptotic, stationary result for (ηα) → ∞ then corresponds to the best achievable performance of the algorithm in the given model settings. Figure 5 (right panel) displays, among others, an example result for LVQ1.
With the formalism outlined above it is readily possible to compare different algorithms within the same model situation. This concerns, for instance, the detailed prototype dynamics and the sensitivity to initial conditions. Here, we restrict ourselves to three example algorithms and very briefly discuss their essential properties and the achievable generalization ability:

LVQ1: This basic prescription was already defined above as the original WTA algorithm with modulation function

    f_s = Θ(d_{−s} − d_{+s}) s σ.

LVQ+/-: Several modifications have been suggested in the literature, aiming at better generalization behavior. The term LVQ+/- is used here to describe one basic variant which updates two vectors at a time: the closest among all prototypes which belong to the same class as the example (a) and the closest one among those that represent a different class (b). The so-called correct winner (a) is moved towards the data, while the wrong winner (b) is pushed even farther away. In our simple model scenario this amounts to the modulation function

    f_s = s σ = ±1.

LFM: The so-called learning-from-mistakes (LFM) scheme performs an update of the LVQ+/- type, but only for inputs which are misclassified before the learning step:

    f_s = Θ(d_{+σ} − d_{−σ}) s σ.

The prescription can be obtained as a limiting case of various algorithms suggested in the literature; see [42] for a detailed discussion. Here we emphasize the similarity with the familiar perceptron training discussed in the first sections.

Fig. 5. LVQ learning curves and comparison of algorithms. Left panel: ε_g(α) for ℓ = 1.2, p_+ = 0.8, and v_+ = v_− = 1. Prototypes were initialized as in Eq. (44) with Q̂ = 10⁻⁴. From bottom to top at α = 200, the graphs correspond to learning rates η = 0.2, 1.0, and 2.0, respectively. Right panel: achievable generalization error as a function of p_+ = 1 − p_− in the model with ℓ = 1, v_+ = 0.25, and v_− = 0.81. Initialization of all training schemes as specified for the left panel. The solid line marks the result for LVQ1 in the limits η → 0 and α → ∞, the dashed line corresponds to an idealized early-stopping procedure applied to LVQ+/-, and the chain line represents the asymptotic outcome of LFM training. In addition, the dotted curve represents the best possible linear decision boundary constructed from the input density.

Learning curves and the asymptotic behavior of LVQ1 are exemplified in Fig. 5. As an interesting feature one notes a non-monotonicity of ε_g(α) for small learning rates; it corresponds to an over-shooting effect observed in the dynamics of the prototypes when approaching their stationary positions [42]. It is important to note that this very basic, heuristic prescription achieves excellent performance: typically, the asymptotic result is quite close to the optimal ε_g given by the best linear decision boundary constructed from the input density (36).
The naive application of LVQ+/- training results in a strong instability: in all settings with p_+ ≠ p_−, the prototype representing the weaker class is pushed away in the majority of training steps. Consequently, a divergent behavior is observed and, for α → ∞, the classifier assigns all data to the stronger cluster, with the trivial result ε_g = min{p_+, p_−}. Several measures have been suggested to control the instability. In Kohonen's LVQ2.1 and other modifications, example data are only accepted for update when ξ falls into a window close to the current decision boundary. Another intuitive approach is based on the observation that ε_g(α) generically displays a pronounced minimum before the generalization behavior deteriorates. In our formal model, it is possible to work out the location of the minimum analytically and thus determine the performance of an idealized early-stopping method. The corresponding result is displayed in Fig. 5 (right panel, dashed line) and appears to compete well with LVQ1. However, it is important to note that the quality of the minimum in ε_g(α) strongly depends on the initial conditions. Furthermore, in a practical situation, successful early stopping would require the application of costly validation schemes.
Finally, we briefly discuss the LFM prescription. A peculiar feature of LFM is that the stationary position of the prototypes depends strongly on the initial configuration [42]; the asymptotic decision boundary, on the contrary, is well defined. In the LFM prescription the emphasis is on the classification; the aspect of Vector Quantization (representation of the clusters) is essentially disregarded. While at first sight clear and attractive, LFM yields a far from optimal performance in the limit α → ∞. Note that already in perceptron training, as discussed in the first sections, a naive learning-from-mistakes strategy is bound to fail miserably in general cases of noisy data or unlearnable rules.
The above considerations concern only the simplest LVQ training scenarios
and algorithms. Several directions in which to extend the formalism are obviously
interesting.
We only mention that unsupervised prototype-based learning has been treated in complete analogy to the above [43,44]. Technically, it reduces to the consideration of modulation functions which do not depend on the cluster or class label. The basic competitive WTA Vector Quantization training would be represented, for instance, by the modulation function f_s = Θ(d_{−s} − d_{+s}), which always moves the winning prototype closer to the data. The training prescription can be interpreted as a stochastic gradient descent of a cost function, the so-called quantization error [44]. The exchange and permutation symmetry of prototypes in unsupervised training results in interesting effects which resemble the plateaus discussed for multilayered neural networks, cf. section 3.
The consideration of a larger number of Gaussians contributing to the input density is relatively simple. Thus, it is possible to model more complex data structures and study their effect on the training dynamics. The treatment of more than two prototypes is also conceptually simple but constitutes a technical challenge: obviously, the number of order parameters increases, and, in addition, the r.h.s. of the ODE cannot in general be evaluated analytically but involves numerical integrals. Note that the dynamics of several prototypes representing the same class of data strongly resembles the behavior of unsupervised Vector Quantization. First results along these lines have been obtained recently, see for instance [45].
The variational optimization, as discussed in detail for the perceptron, should give insights into the essential features of robust and successful LVQ schemes. Due to the more complex input density, however, the analysis proves quite involved and has not yet been completed.
A highly relevant extension of LVQ is that of relevance learning. Here, the idea is to replace the simple-minded Euclidean metric by an adaptive measure. An important example is a weighted distance of the form

    d(w, ξ) = Σ_{j=1}^{N} λ_j² (w_j − ξ_j)² with Σ_{j=1}^{N} λ_j² = 1,

where the normalized factors λ_j are called relevances, as they measure the importance of dimension j in the classification. Relevance LVQ (RLVQ) and related schemes update these factors according to a heuristic or gradient-based prescription in the course of training [46,47]. More complex schemes employ a full matrix of relevances in a generalized quadratic measure, or consider local measures attached to the prototypes [48].
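A minimal sketch of this weighted measure (Python; the function name and the explicit re-normalization are our own choices):

    import numpy as np

    def relevance_distance(w, xi, lam):
        """d(w, xi) = sum_j lam_j^2 (w_j - xi_j)^2 with sum_j lam_j^2 = 1."""
        lam = lam / np.linalg.norm(lam)        # enforce the normalization constraint
        return float(np.sum((lam * (w - xi)) ** 2))

    # uniform relevances recover the Euclidean distance up to the factor 1/N
    w, xi = np.zeros(4), np.ones(4)
    print(relevance_distance(w, xi, np.ones(4)))   # 1.0 = |w - xi|^2 / N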
The analysis of the corresponding training dynamics constitutes another challenge in the theory of on-line learning. The description has to go beyond the techniques discussed in this paper, as the relevances define a time-dependent linear transformation of feature space.

5 Summary and Outlook


The statistical physics approach to learning has allowed for the analytical treatment of the learning dynamics in a large variety of adaptive systems. The consideration of typical properties of large systems in specific model situations complements other approaches and contributes to the theoretical understanding of adaptive systems.
Here, we have highlighted only selected topics as an introduction to this line of research. The approach was presented, first, in terms of the perceptron network. Despite its simplicity, this framework led to highly non-trivial insights and facilitated the development of the method; for instance, the variational approach to optimal training was developed in this context. Gradient-based training in multilayered networks constitutes an important example for the analysis of more complex architectures. Here, non-trivial effects such as quasistationary plateau states can be observed and investigated systematically. Finally, a recent application of the theoretical framework concerns prototype-based training in so-called Learning Vector Quantization.
Several interesting questions and results have not been discussed at all or could be mentioned only very briefly: the study of finite system sizes, on-line learning in networks with discrete weights, unsupervised learning and clustering, training from correlated data or from a fixed pool of examples, and query strategies, to name only a few. We can only refer to the list of references; in particular, [27] and [28] may serve as a starting point for the interested reader.
Due to the conceptual simplicity of the approach and its applicability in a wide range of contexts, it will certainly continue to facilitate a better theoretical understanding of learning systems in general. Current challenges include the treatment of non-trivial input distributions, the dynamics of learning in more complex network architectures, the optimization of algorithms and their practical implementation in such systems, and the investigation of component-wise updates as, for instance, in relevance LVQ.

References
1. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 21, 1087 (1953)
2. Huang, K.: Statistical Mechanics. Wiley and Sons, New York (1987)
3. Jaynes, E.T.: Probability Theory: The Logic of Science. Bretthorst, G.L. (ed.),
Cambridge University Press, Cambridge, UK (2003)
4. Mace, C.W.H., Coolen, T.: Dynamics of Supervised Learning with Restricted
Training Sets. Statistics and Computing 8, 55-88 (1998)
5. Biehl, M., Schwarze, H.: On-line learning of a time-dependent rule. Europhys. Lett. 20, 733-738 (1992)
6. Biehl, M., Schwarze, H.: Learning drifting concepts with neural networks. Journal
of Physics A: Math. Gen. 26, 2651-2665 (1993)
7. Kinouchi, O., Caticha, N.: Lower bounds on generalization errors for drifting rules.
J. Phys. A: Math. Gen.26, 6161-6171 (1993)
8. Vicente, R, Kinouchi, O., Caticha, N.: Statistical Mechanics of Online Learning of
Drifting Concepts: A Variational Approach. Machine Learning 32, 179-201 (1998)
9. Reents, G., Urbanczik, R.: Self-averaging and on-line learning. Phys. Rev. Lett.
80, 5445-5448 (1998)
10. Kinzel, W., Rujan, P.: Improving a network generalization ability by selecting
examples. Europhys. Lett. 13, 2878 (1990)
11. Kinouchi, O., Caticha, N.: Optimal generalization in perceptrons. J. Phys. A: Math.
Gen. 25, 6243-6250 (1992)
12. Copelli, M., Caticha, N.: On-line learning in the committee machine. J. Phys. A:
Math. Gen. 28, 1615-1625 (1995)
13. Biehl, M., Riegler, P.: On-line Learning with a Perceptron. Europhys. Lett. 78:
525-530 (1994)
14. Biehl, M., Riegler, P., Stechert, M.: Learning from Noisy Data: An Exactly Solvable
Model. Phys. Rev. E 76, R4624-R4627 (1995)
15. Copelli, M., Eichhorn, R., Kinouchi, O., Biehl, M., Simonetti, R., Riegler, P.,
Caticha, N.: Noise robustness in multilayer neural networks. Europhys. Lett. 37,
427-432 (1995)
16. Vicente, R., Caticha, N.: Functional optimization of online algorithms in multilayer
neural networks. J. Phys. A: Math. Gen. 30, L599-L605 (1997)
17. Opper, M.: A Bayesian approach to on-line learning. In: [27], pp. 363-378 (1998)
18. Opper, M., Winther, O.: A mean field approach to Bayes learning in feed-forward
neural networks. Phys. Rev. Lett. 76, 1964-1967 (1996)

19. Solla, S.A., Winther, O.: Optimal perceptron learning: an online Bayesian ap-
proach. In: [27], pp. 379-398 (1998)
20. Cybenko, G.V.: Approximation by superposition of a sigmoidal function. Math. of
Control, Signals and Systems 2, 303-314 (1989)
21. Endres, D., Riegler, P.: Adaptive systems on different time scales. J. Phys. A: Math. Gen. 32, 8655-8663 (1999)
22. Biehl, M., Schwarze, H.: Learning by on-line gradient descent. J. Phys A: Math.
Gen. 28, 643 (1995)
23. Saad, D., Solla, S.A.: Exact solution for on-line learning in multilayer neural net-
works. Phys. Rev. Lett. 74, 4337-4340 (1995)
24. Saad, D., Solla, S.A.: Online learning in soft committee machines. Phys. Rev. E
52, 4225-4243 (1995)
25. Biehl, M., Riegler, P., Wöhler, C.: Transient dynamics of on-line learning in two-layered neural networks. J. Phys. A: Math. Gen. 29, 4769 (1996)
26. Saad, D, Rattray, M.: Globally optimal parameters for on-line learning in multilayer
neural networks. Phys. Rev. Lett. 79, 2578 (1997)
27. Saad, D. (ed.): On-line learning in neural networks. Cambridge University Press,
Cambridge, UK (1998)
28. Engel, A., Van den Broeck, C.: The Statistical Mechanics of Learning. Cambridge
University Press, Cambridge, UK (2001)
29. Schlösser, E., Saad, D., Biehl, M.: Optimisation of on-line Principal Component Analysis. J. Phys. A: Math. Gen. 32, 4061 (1999)
30. Biehl, M., Schlösser, E.: The dynamics of on-line Principal Component Analysis. J. Phys. A: Math. Gen. 31, L97 (1998)
31. Biehl, M., Mietzner, A.: Statistical mechanics of unsupervised learning. Europhys.
Lett. 27, 421-426 (1993)
32. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1997)
33. Kohonen, T.: Learning vector quantization. In: Arbib, M.A. (ed.) The Handbook
of Brain Theory and Neural Networks., pp. 537-540. MIT Press, Cambridge, MA
(1995)
34. Van den Broeck, C., Reimann, P.: Unsupervised Learning by Examples: On-line
Versus Off-line. Phys. Rev. Lett. 76, 2188-2191, (1996)
35. Reimann, P, Van den Broeck, C, Bex, G.J.: A Gaussian Scenario for Unsupervised
Learning. J. Phys. A: Math. Gen. 29, 3521-3533 (1996)
36. Riegler, P., Biehl, M., Solla, S.A., Marangi, C.: On-line learning from clustered
input examples. In: Marinaro, M., Tagliaferri, R. (eds.) Neural Nets WIRN Vietri-
95, Proc. of the 7th Italian Workshop on Neural Nets, pp. 87-92. World Scientific,
Singapore (1996)
37. Marangi, C., Biehl, M., Solla, S.A.: Supervised learning from clustered input ex-
amples. Europhys. Lett. 30, 117-122 (1995)
38. Biehl, M.: An exactly solvable model of unsupervised learning. Europhysics Lett.
25, 391-396 (1994)
39. Meir, R.: Empirical risk minimization versus maximum-likelihood estimation: a
case study. Neural Computation 7, 144-157 (1995)
40. Barkai, N., Seung, H.S., Sompolinsky, H.: Scaling laws in learning of classification tasks. Phys. Rev. Lett. 70, 3167-3170 (1993)
41. Neural Networks Research Centre. Bibliography on the self-organizing maps
(SOM) and learning vector quantization (LVQ). Helsinki University of Technology,
available on-line: http://liinwww.ira.uka.de/bibliography/Neural/SOM.LVQ.html
(2002)

42. Biehl, M, Ghosh, A., Hammer, B.: Dynamics and generalization ability of LVQ
algorithms. J. Machine Learning Research 8, 323-360 (2007)
43. Biehl, M., Freking, A., Reents, G.: Dynamics of on-line competitive learning.
Europhysics Letters 38, 73-78 (1997)
44. Biehl, M., Ghosh, A., Hammer, B.: Learning Vector Quantization: The Dynamics
of Winner-Takes-All algorithms. Neurocomputing 69, 660-670 (2006)
45. Witeolar, A., Biehl, M., Ghosh, A., Hammer, B.: Learning Dynamics of Neural
Gas and Vector Quantization. Neurocomputing 71, 1210-1219 (2008)
46. Bojer, T., Hammer, B., Schunk, D., Tluk von Toschanowitz, K.: Relevance deter-
mination in learning vector quantization. In: Verleysen, M. (ed.) European Sym-
posium on Artificial Neural Networks ESANN 2001, pp. 271-276. D-facto publica-
tions, Belgium (2001)
47. Hammer, B., Villmann, T.: Generalized relevance learning vector quantization.
Neural Networks 15, 1059-1068 (2002)
48. Schneider, P, Biehl, M., Hammer, B.: Relevance Matrices in Learning Vector Quan-
tization In: Verleysen, M. (ed.) European Symposium on Artificial Neural Networks
ESANN 2007, pp. 37-43, d-side publishing, Belgium (2007)
