Statistical Mechanics of On-Line Learning
1 Introduction
The perception that Statistical Mechanics is an inference theory has opened
the possibility of applying its methods to several areas outside the traditional
realms of Physics. This explains the interest of part of the Statistical Mechanics
community during the last decades in machine learning and optimization and the
application of several techniques of Statistical Mechanics of disordered systems
in areas of Computer Science.
As an inference theory, Statistical Mechanics is a Bayesian theory. Bayesian
methods typically demand extended computational resources, which probably
explains why methods that were used by Laplace were almost forgotten in the
following century. The current availability of ubiquitous and seemingly powerful
computational resources has promoted the diffusion of these methods in
statistics. For example, posterior averages can now be rapidly estimated by using
efficient Markov Chain Monte Carlo methods, which started to be developed
to attack problems in Statistical Mechanics in the middle of the last century. Of
course, the drive to develop such techniques (starting with [1]) was due to the total
impossibility of applying classical statistics methods to thermodynamic
problems. Several other techniques of Statistical Mechanics have also found their
way into statistics. In this paper we review some applications of Statistical
Mechanics to artificial machine learning.
Learning is the change induced by the arrival of information. We are interested
in learning from examples. Different scenarios arise if examples are considered
in batches or just one at a time. The latter case is the on-line learning
scenario, and the aim of this contribution is to present the characterization of
on-line learning using methods of Statistical Mechanics. In section 2 we will
consider simple linearly separable classification problems, just to introduce the idea.
The dynamics of learning is described by a set of stochastic difference equations
which show the evolution, as information arrives, of quantities that characterize
the problem and which in Statistical Mechanics are called order parameters. Looking
at the limit of large dimensions and scaling the number of examples in a correct
manner, the difference equations simplify into deterministic ordinary differential
equations. Numerical solutions of the ODE then give, for the specific model
of the available information such as the distribution of examples, the learning
curves.
While this large dimension limit may seem artificial, it must be stressed that
it is most sensible in the context of thermostatistics, where Statistical Mechanics
is applied to obtain thermodynamic properties of physical systems. There the
dimension, which is the number of degrees of freedom, is of the order of Avogadro's
number (≈ 10²³). Simulations, however, have to be carried out for much smaller
systems, and this prompted the study of finite size corrections of expectations of
experimentally relevant quantities, which depend on several factors, but typically
go to zero algebraically with the dimension. If the interest lies in intensive
quantities such as temperature, pressure or chemical potential, then corrections
are negligible. If one is interested in extensive quantities, such as energy, entropy
or magnetization, one has to deal with their densities, such as the energy per degree
of freedom. Again the errors due to the infinite size limit are negligible. In this
limit central limit theorems apply, resulting in deterministic predictions, and
the theory is in the realm of thermodynamics. Thus this limit is known as the
thermodynamic limit (TL). For inference problems we can use this type of theory
to control finite size effects in the reverse order: we can calculate in the easier
deterministic infinite limit and control the error made by taking the limit. We
mention this and give references, but will not deal with this problem except by
noticing that it is theoretically understood and that simulations of stylized models,
necessarily done in finite dimensions, agree well with the theoretical results
in the TL. The reader might consider the thermodynamic limit as the analog
of the limit of infinite sequences in Shannon's channel theorem.
Statistical Mechanics (see e.g. [2,3]) had its origins in the second half of the
XIX century, in an attempt, mainly due to Maxwell, Boltzmann and Gibbs, to
deduce the macroscopic thermodynamic laws from the microscopic dynamical
properties of atoms and molecules. Its success in building a unified theoretical
framework applicable to a large variety of experimental setups was one of
the great achievements of Physics in the XIX century. A measure of this success
can be seen from its role in the discovery of Quantum Mechanics. Late in the
XIX century, its application to a problem where the microscopic laws
involved electromagnetism, i.e. the problem of Black Body radiation, gave
results irreconcilable with experiment; Max Planck showed that the problem lay
with the classical laws of electromagnetism and not with Statistical Mechanics.
This started the revolution that led to Quantum Mechanics.
σ_B = f_B(ξ),   (1)

σ_J = g_J(ξ),   (2)

e_G(α) = ∫ dP₀(ξ) ∏_{μ=1}^{P} ∫ dP₀(ξ^μ) Σ_{σ_B^μ} P(σ_B^μ | f_B(ξ^μ)) Θ(−σ_J(ξ) σ_B(ξ)),   (3)

where the step function Θ(x) = 1 for x > 0 and zero otherwise. As this stands
it is impossible to obtain results other than of a general nature. To obtain sharp
results we have to specify a model.
The transfer functions f_B and g_J specify the architectures of the rule and of
the classifier, while P(σ_B | f_B(ξ)) models possible noise in the specification of
the supervision label. The simplest case that can be studied is where both f_B
and g_J are linearly separable classifiers of the same dimension, K = M = N:

σ_B = sign(B·ξ),   σ_J = sign(J·ξ).   (4)

As simple and artificial as it may be, the study of this special case serves several
purposes and is a stepping stone towards more realistic scenarios.
An interesting feature of Statistical Mechanics is that it points out the
relevant order parameters of a problem. In physics, these turn out to be
the objects of experimental interest.
Without any loss we can take all the vectors ξ and B to be normalized as ξ·ξ = N
and B·B = 1. For J, however, which is a dynamical quantity that evolves under
the learning algorithm still to be specified, we leave the norm free and call J·J = Q.
Define the fields

h = J·ξ,   b = B·ξ.   (5)
To advance further we choose a model for the distribution of examples P₀(ξ),
and the natural starting point is a uniform distribution over the N-dimensional
sphere. Different choices to model specific situations are of course
possible. Under these assumptions, since the scalar products of (4) are sums
of random variables, for large N the fields h and b are correlated Gaussian variables,
completely characterized by
⟨h⟩ = ⟨b⟩ = 0,   ⟨h²⟩ = Q,   ⟨b²⟩ = 1,   ⟨hb⟩ = J·B = R.   (6)
It is useful to introduce the overlap ρ = R/√Q between the rule and machine
parameter vectors. The joint distribution is given by

P(h, b) = 1/(2π√(Q(1−ρ²))) exp{ −[h²/Q − 2ρhb/√Q + b²] / (2(1−ρ²)) }.   (7)

The correlation is the overlap ρ, which is related to the angle θ between J and
B by θ = cos⁻¹ρ; it follows that |ρ| ≤ 1. It is geometrically intuitive, and also easy
to show, that the generalization error, i.e. the probability of disagreement on a
random new example, is given by e_g = θ/π = (cos⁻¹ρ)/π.
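As a quick numerical check of this relation (all sizes and values below are
illustrative assumptions), one can estimate the disagreement probability of two
fixed unit vectors with overlap ρ by Monte Carlo and compare it with cos⁻¹(ρ)/π:

    import numpy as np

    # Monte Carlo check of e_g = arccos(rho)/pi for two linearly
    # separable classifiers whose weight vectors have overlap rho
    rng = np.random.default_rng(0)
    N, P, rho = 500, 200_000, 0.6

    B = np.zeros(N); B[0] = 1.0        # rule vector, B.B = 1
    J = np.zeros(N); J[0] = rho        # student vector with J.B = rho
    J[1] = np.sqrt(1.0 - rho**2)       # ... and J.J = 1

    xi = rng.standard_normal((P, N))   # random isotropic examples
    disagree = np.mean(np.sign(xi @ J) != np.sign(xi @ B))
    print(disagree, np.arccos(rho) / np.pi)   # both approximately 0.295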
Since at each time step a random vector ξ is drawn from the distribution P₀,
equations (12) and (13) are stochastic difference equations. We now take the
thermodynamic limit N → ∞ and average over the test example ξ. Note that
each example induces a change of the order parameters of order 1/N. Hence,
one expects the need of order N many examples to create a change of order 1.
This prompts the introduction of α = lim_{N→∞} μ/N which, by measuring the
number of examples, measures time. The behaviors of ρ and of its change Δρ
are very different in this limit. It can be shown (see [9]) that order parameters
such as ρ, R, Q self-average: their fluctuations tend to zero in this limit, see
Fig. 3 for an example. On the other hand, NΔρ has fluctuations of order one,
and we look at its average over the test vector:

dρ/dα = ⟨NΔρ⟩_{h,b,σ}.   (14)
The pairs (Q, NΔQ) and (R, NΔR) behave in a similar way. We average
over the fields h, b and over the labels σ, using P(σ|b). This leads to the coupled
system of ordinary differential equations, which for a particular form of F were
introduced in [10]:
dR/dα = Σ_σ ∫ dh db P(h, b) P(σ|b) F σ b = ⟨F σ b⟩,   (15)

dQ/dα = Σ_σ ∫ dh db P(h, b) P(σ|b) (2F σ h + F²) = ⟨2F σ h + F²⟩,   (16)
where the angular brackets stand for the average over the fields and the label. Since
the generalization error is directly related to ρ, it will be useful to look at the
equivalent set of equations for ρ and for the length √Q of the weight vector:
dρ/dα = Σ_σ ∫ dh db P(h, b) P(σ|b) [ (Fσ/√Q)(b − ρh/√Q) − ρF²/(2Q) ],   (17)

d√Q/dα = Σ_σ ∫ dh db P(h, b) P(σ|b) (1/√Q)(F σ h + F²/2).   (18)
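To see how such equations produce learning curves in practice, here is a minimal
sketch (initial conditions are illustrative assumptions) that integrates (15), (16)
for the simplest choice of modulation function, pure Hebbian learning F ≡ 1 with
noiseless labels σ = sign(b); for this choice the Gaussian averages reduce to
⟨σb⟩ = √(2/π) and ⟨σh⟩ = R√(2/π):

    import numpy as np

    c = np.sqrt(2.0 / np.pi)
    R, Q = 0.0, 1.0                      # assumed initial overlaps
    da = 0.01

    for step in range(int(200 / da)):
        R += da * c                      # dR/dalpha = <F sigma b> = sqrt(2/pi)
        Q += da * (2.0 * R * c + 1.0)    # dQ/dalpha = <2 F sigma h + F^2>

    rho = R / np.sqrt(Q)
    print(np.arccos(rho) / np.pi)        # e_g at alpha = 200; decays ~ alpha**-0.5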
We took the average with respect to the two fields h and b as if they stood on
symmetrical grounds, but they don't. It is reasonable to assume knowledge of h
and σ but not of b. Making this explicit,

dρ/dα = Σ_σ ∫ dh P(h) P(σ|h) [ (Fσ/√Q)(⟨b⟩_{b|h,σ} − ρh/√Q) − ρF²/(2Q) ],   (19)

d√Q/dα = Σ_σ ∫ dh P(h) P(σ|h) (1/√Q)(F σ h + F²/2).   (20)
Call

F* = (√Q/ρ) σ ⟨b − ρh/√Q⟩_{b|h,σ},   (21)

where the average is over the unavailable quantities. Equation (19) can then be
written as

dρ/dα = Σ_σ ∫ dh P(h) P(σ|h) (ρ/Q) [ F*F − F²/2 ].   (22)
This problem can be solved for a few architectures, including some networks with
hidden layers [12,16], although the optimization becomes much more difficult.
Within the class of algorithms we have considered, the performance bound is
given by the modulation function F* of Eq. (21).
The resulting optimal algorithm has several features that are useful in considering
practical algorithms, such as the optimal annealing schedule of the learning
rate and, in the presence of noise, adaptive cutoffs for surprising examples.
As an example we discuss the learning of a linearly separable rule (4) in some
detail. There, an example will be misclassified if Θ(−σ_J σ_B) > 0 or, equivalently,
if hσ_B < 0. We will refer to the latter quantity as the aligned field. It basically
measures the correctness (hσ_B > 0) or wrongness (hσ_B < 0) of the current
classification.
Fig. 1 depicts the optimal modulation function for inferring a linearly separable
rule from linearly separable data. Most interesting is the dependence on
ρ = cos(π e_g): for ρ = 0 (e_g = 1/2) the modulation function does not take into
account the correctness of the current examples. The modulation function is
constant, which corresponds to Hebbian learning. As ρ increases, however, for
already correctly classified examples the magnitude of the modulation function
decreases with increasing aligned field. For misclassified examples, on the other
hand, the update becomes larger the smaller the aligned field is. In the limit ρ → 1
(e_g → 0) the optimal modulation function approaches the Adatron algorithm
([13]), where only misclassified examples trigger an update, which is proportional
to the aligned field. In addition to that, for the optimal modulation function
ρ(α) = √(Q(α)), i.e. the order parameter Q can be used to estimate ρ.
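In terms of the aligned field, these limiting cases can be summarized by the
following schematic modulation functions (simple textbook forms that stand in
for the exact optimal F* of Eq. (21)):

    import numpy as np

    def F_hebb(a):
        # rho = 0: every example enters with the same weight
        return np.ones_like(a)

    def F_perceptron(a):
        # unit update for misclassified examples (aligned field a < 0) only
        return (a < 0).astype(float)

    def F_adatron(a):
        # rho -> 1 limit: update proportional to the aligned field on mistakes
        return np.where(a < 0, -a, 0.0)

    a = np.linspace(-3.0, 3.0, 7)   # a = h * sigma_B, the aligned field
    print(F_hebb(a), F_perceptron(a), F_adatron(a), sep="\n")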
Now imagine that the label of linearly separable data is noisy, i.e. it is
flipped with a certain probability λ. In this case it would be very dangerous to
follow an Adatron-like algorithm and perform a large update if hσ_B < 0, since
the seeming misclassification might be due to a corrupted label. The optimal
modulation function for that case perceives this danger and introduces a sort
of cutoff w.r.t. the aligned field. Hence, no considerable update is performed
if the aligned field becomes too large in magnitude. Fig. 1 shows the general
behavior. [14] gives further details and an extension to other scenarios of noisy
but otherwise linearly separable data sets.
These results for optimal modulation functions are, in general, better understood
from a Bayesian point of view ([17,18,19]). Suppose the knowledge about
the weight vector is coded in a Gaussian probability density. As a new example
arrives, this probability density is used as a prior distribution. The likelihood is
built out of the knowledge of the network architecture, of the noise process that
may be corrupting the label and of the example vector and its label. The new
posterior is not, in general, Gaussian, and a new Gaussian is chosen, to be the
prior for the next learning step, in such a way as to minimize the information
loss. This is done by the maximum entropy method or equivalently minimizing
the Kullback-Leibler divergence. It follows that the learning of one example in-
duces a mapping from a Gaussian to another Gaussian, which can be described
by update equations of the Gaussian's mean and covariance. These equations
define a learning algorithm together with a tensor-like adaptive schedule for the
learning rate. These algorithms are closely related to the variational algorithm
defined above. The Bayesian method has the advantage of being more general.
It can also be readily applied to other cases where the Gaussian prior is not
adequate [19].

Fig. 1. Left: Optimal modulation function F for learning a linearly separable
classification of type (4), for various values of the order parameter ρ. Right: in
addition to (4), the labels σ_B are subject to noise, i.e. they are flipped with
probability λ. The aligned field hσ_B is a measure of the agreement of the
classification of the current example with the given label. In both cases, all
examples are weighted equally for ρ = 0, irrespective of the value of the aligned
field hσ_B; this corresponds to Hebbian learning. In the noiseless case (λ = 0),
examples with a negative aligned field receive more weight as ρ increases, while
those with positive aligned field gain less weight. For λ > 0, examples with large
negative aligned fields are also not taken into account for updating: these examples
are most likely affected by the noise and therefore are only deceptively misclassified.
The optimal weight function possesses a cutoff at negative values of hσ_B, which
decreases in absolute value with increasing ρ and λ.
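A minimal sketch of one such Gaussian-projection step (assumed density filtering)
may look as follows; the probit likelihood Φ(σ w·ξ/β) with smoothing parameter β
is an assumption standing in for the model-dependent likelihood discussed above:

    import numpy as np
    from scipy.stats import norm

    def adf_step(m, C, xi, sigma, beta=0.1):
        # prior N(m, C) on the weights; observe label sigma for input xi
        Cxi = C @ xi
        s = np.sqrt(xi @ Cxi + beta**2)    # predictive std of w.xi
        z = sigma * (m @ xi) / s
        r = norm.pdf(z) / norm.cdf(z)      # moment-matching factor
        m_new = m + sigma * (r / s) * Cxi  # updated mean
        C_new = C - (r * (z + r) / s**2) * np.outer(Cxi, Cxi)  # updated covariance
        return m_new, C_new

    # usage: track the posterior while labeled examples arrive one at a time
    rng = np.random.default_rng(1)
    N = 50
    m, C = np.zeros(N), np.eye(N)
    w_rule = rng.standard_normal(N)
    for _ in range(200):
        xi = rng.standard_normal(N) / np.sqrt(N)
        m, C = adf_step(m, C, xi, np.sign(w_rule @ xi))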
In the following sections we follow the developments of these techniques in
other directions. We first look into networks with hidden layers, since the expe-
rience gained in studying on-line learning in these more complex architectures
will be important to better understand the application of on-line learning to the
problem of clustering.
Classifiers such as (4) are the most fundamental learning networks. In terms
of network architecture, the next generalization is networks with one layer of
hidden units and a fixed hidden-to-output relation. An important example is
the so-called committee machine, where the overall output is determined by a
majority vote of several classifiers of type (4). For such networks the variational
approach of the previous section can be readily applied [12,16].
General two-layered networks, however, have variable hidden-to-output weights
and are soft classifiers, i.e. they have continuous transfer functions. They consist
of K branches whose outputs are combined linearly:

σ(ξ) = Σ_{i=1}^{K} w_i g(J_i·ξ).   (24)

Here, J_i denotes the N-dimensional weight vector of the i-th input branch and
wi the weight connecting the i-th hidden unit with the output. For a (soft)
committee machine, w_i ≥ 0 for all branches i. Often, the number K of hidden
weight vectors J_i is chosen such that K ≪ N. In fact, most analyses specialize to
K = O(1). This restriction will also be pursued here. Note that the overall
output is linear in wi , in contrast to the outputs of the hidden layer which in
general depend nonlinearly on the weights Ji via the transfer function g.
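A sketch of such a network and of a plain on-line gradient step on the quadratic
error may look as follows; the choice g(x) = erf(x/√2), the learning rates, and the
1/N scalings are assumptions in line with the conventions discussed below:

    import numpy as np
    from scipy.special import erf

    def scm_output(J, w, xi):
        # network (24): sigma(xi) = sum_i w_i g(J_i . xi), J of shape (K, N)
        h = J @ xi
        return w @ erf(h / np.sqrt(2.0)), h

    def online_step(J, w, xi, tau, eta_J=0.5, eta_w=0.5):
        out, h = scm_output(J, w, xi)
        delta = tau - out                       # error on current example
        g = erf(h / np.sqrt(2.0))
        gprime = np.sqrt(2.0 / np.pi) * np.exp(-h**2 / 2.0)
        N = xi.size
        J += (eta_J / N) * np.outer(delta * w * gprime, xi)  # hidden weights
        w += (eta_w / N) * delta * g                         # hidden-to-output
        return J, w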
The class of networks (24) is of importance for several reasons: Firstly, they
can realize more complex classification schemes. Secondly, they are commonly
used in practical applications. Finally, two-layered networks with a linear output
are known to be universal approximators [20].
Analogously to section 2, the network (24) is trained by a sequence of uncorrelated
examples {(ξ^μ, τ^μ)} which are provided by an unknown rule τ(ξ) by the
environment. As above, the example input vectors are denoted by ξ^μ, while τ^μ
is the corresponding correct rule output.
In a commonly studied scenario the rule is provided by a network of the same
architecture with hidden layer weights B i , hidden-to-output weights vi , and an
in general different number of hidden units M :
τ(ξ) = Σ_{k=1}^{M} v_k g(B_k·ξ).   (25)
The weights are adapted on-line, by generic updates of the form

J_i^{μ+1} = J_i^μ + (1/N) F_i ξ^μ,   (26)      w_i^{μ+1} = w_i^μ + (1/N) F_w g(h_i^μ),   (27)

with modulation functions F_i and F_w that depend only on available quantities.
Also note from (24, 25) that the stochastic dynamics of J_i and w_i only depends
on the fields h_i = J_i·ξ, b_k = B_k·ξ, which can be viewed as a generalization
of (5). As in section 2, for large N these become Gaussian variables. Here, they
have zero means and correlations

⟨h_i h_j⟩ = Q_ij,   ⟨h_i b_k⟩ = R_ik,   ⟨b_k b_l⟩ = T_kl,   (28)

with the order parameters

Q_ij = J_i·J_j,   R_ik = J_i·B_k,   T_kl = B_k·B_l,   (29)

where i, j = 1 . . . K and k, l = 1 . . . M.
Introducing α = μ/N as above, the discrete dynamics (26, 27) can be replaced
by a set of coupled differential equations for R_ik, Q_ij, and w_i in the limit of
large N. Projecting (26) onto B_k and J_j, respectively, and averaging over the
randomness of ξ leads to

dR_ik/dα = ⟨F_i b_k⟩,   (30)

dQ_ij/dα = ⟨F_i h_j + F_j h_i + F_i F_j⟩,   (31)
where the average is now with respect to the fields {hi } and {bk }. Hence, the
microscopic stochastic dynamics of O(K·N) many weights J_i is replaced by the
macroscopic dynamics of O(K²) many order parameters R_ik and Q_ij, respectively.
Again, these order parameters are self-averaging, i.e. their fluctuations
vanish as N → ∞. Fig. 3 exemplifies this for a specific dynamics.
The situation is somewhat different for the hidden-to-output weights wi . In
the transition from microscopic, stochastic dynamics to macroscopic, averaged
dynamics the hidden-layer weights Ji are compressed to order parameters which
are scalar products, cf. (29). The hidden-to-output weights, however, are not
compressed into new parameters of the form of scalar products. (Scalar products
of the type Σ_i w_i v_i do not even exist for K ≠ M.) Scaling the update of w_i by
1/N as in (27) allows one to replace 1/N by the differential dα as N → ∞. Hence,
the averaged dynamics of the hidden-to-output weights reads

dw_i/dα = ⟨F_w g(h_i)⟩.   (32)
Note that the r.h.s. of these differential equations depends on R_ik, Q_ij via (29)
and, hence, is coupled to the differential equations (30, 31) of these order
parameters as well. So in total, the macroscopic description of the learning dynamics
consists of the coupled set (30, 31, 32).
It might be surprising that the hidden-to-output weights wi by themselves
are appropriate for a macroscopic description while the hidden weights Ji are
not. The reason for this is twofold. First, the number K of wi had been taken to
be O(1), i.e. it does not scale with the dimension of inputs N . Therefore, there
is no need to compress a large number of microscopic variables into a small
number of macroscopic order parameters as for the Ji . Second, the change in wi
had been chosen to scale with 1/N . For this choice one can show that like Rik
and Qij the weights wi are self-averaging.
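The self-averaging property is easy to probe in simulations. The following sketch,
in the spirit of Fig. 3 but for the simpler Hebbian perceptron of section 2 (all
parameter values are illustrative), measures the run-to-run standard deviations of
R and Q at fixed α and increasing N:

    import numpy as np

    rng = np.random.default_rng(2)
    alpha, runs = 2.0, 50

    for N in (50, 100, 200, 400, 800):
        Rs, Qs = [], []
        for _ in range(runs):
            B = rng.standard_normal(N); B /= np.linalg.norm(B)
            J = np.zeros(N)
            for _ in range(int(alpha * N)):
                xi = rng.standard_normal(N)
                J += np.sign(B @ xi) * xi / N   # Hebbian update, F = 1
            Rs.append(J @ B); Qs.append(J @ J)
        print(N, np.std(Rs), np.std(Qs))        # fluctuations shrink with N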
Fig. 3. Finite size analysis of the order parameters R, Q, and w (right panel)
in the dynamics (34, 35) for the special case K = M = 1. Shown are the observed
standard deviations as a function of the system size for a fixed value of α. Each
point depicts an average taken over 100 simulation runs. As can be seen, R and Q
become self-averaging in the thermodynamic limit N → ∞, i.e. their fluctuations
vanish in this limit. In contrast to that, the fluctuations of w remain finite if one
optimizes the learning rate η_w, which leads to a divergence of η_w.
Fig. 4. Data as generated according to the density (36) in N = 200 dimensions with
example parameters p₋ = 0.6, p₊ = 0.4, v₋ = 0.64, and v₊ = 1.44. The open (filled)
circles represent 160 (240) vectors from clusters centered about the orthonormal
vectors B₊ (B₋) with ℓ = 1, respectively. The left panel displays the projections
b± = B±·ξ of the data; diamonds mark the positions of the cluster centers. The
right panel shows the projections h± = w±·ξ of the same data onto a randomly
chosen pair of orthogonal unit vectors w±.
[41]. LVQ identifies prototypes, i.e. typical representatives of the classes in feature
space, which then parameterize a distance-based classification scheme.
Competitive learning schemes have been suggested in which a set of prototype
vectors is updated from a sequence of example data. We will restrict ourselves
to the simplest non-trivial case of two prototypes w₊, w₋ ∈ IR^N and data
generated according to a bi-modal distribution of type (36).
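A sketch of such a data source, using the Fig. 4 parameters and assuming that
(36) is a mixture of two isotropic Gaussians of variance v_σ centered at ℓB_σ with
prior weights p_σ:

    import numpy as np

    rng = np.random.default_rng(3)
    N, ell = 200, 1.0
    p = {+1: 0.4, -1: 0.6}     # cluster prior weights
    v = {+1: 1.44, -1: 0.64}   # cluster variances
    B = {+1: np.zeros(N), -1: np.zeros(N)}
    B[+1][0] = 1.0; B[-1][1] = 1.0          # orthonormal cluster centers

    def draw_example():
        sigma = rng.choice([+1, -1], p=[p[+1], p[-1]])
        xi = ell * B[sigma] + np.sqrt(v[sigma]) * rng.standard_normal(N)
        return xi, sigma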
Most frequently a nearest prototype scheme is implemented: for the classification
of a given input ξ, the distances

d_s(ξ) = (ξ − w_s)²,   s = ±1,   (37)

are evaluated and ξ is assigned to the class of the closest prototype. In the
corresponding training prescription only the winning prototype, i.e. the closer of
the two, is updated for each example:

w_s^{μ+1} = w_s^μ + (η/N) f_s (ξ^μ − w_s^μ),   (38)

with the modulation function

f_s = Θ(d_{−s} − d_{+s}) sσ.   (39)

The Heaviside function singles out the winning prototype, and the product
sσ = +1 (−1) if the labels of prototype and example coincide (disagree).
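Continuing the sketch above, a single nearest-prototype training step of this
type, cf. Eqs. (38, 39), can be written as follows (the learning rate is illustrative):

    def lvq1_step(w, xi, sigma, eta=1.0):
        # distances (37) decide the winner; only the winner is updated
        d = {s: np.sum((xi - w[s])**2) for s in (+1, -1)}
        s_win = +1 if d[+1] < d[-1] else -1
        # move towards xi if labels agree (s*sigma = +1), away otherwise
        w[s_win] += (eta / xi.size) * (s_win * sigma) * (xi - w[s_win])
        return w

    w = {+1: np.zeros(N), -1: np.zeros(N)}
    for _ in range(1000):
        xi, sigma = draw_example()
        w = lvq1_step(w, xi, sigma)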
For the formal analysis of the training dynamics we can proceed in complete
analogy to the previously studied cases of perceptron and layered neural networks.
A natural choice of order parameters are the self- and cross-overlaps of
the involved N-dimensional vectors:

R_{Sσ} = w_S·B_σ  and  Q_{ST} = w_S·w_T  with  S, T, σ ∈ {+1, −1}.   (40)

While these definitions are formally identical with Eq. (29), the role of the
reference vectors B_σ is not that of teacher vectors here.
Following the by now familiar lines we obtain a set of coupled ODE of the
form

dR_{Sσ}/dα = η ( ⟨b_σ f_S⟩ − R_{Sσ} ⟨f_S⟩ ),

dQ_{ST}/dα = η ( ⟨h_S f_T + h_T f_S⟩ − Q_{ST} ⟨f_S + f_T⟩ ) + η² Σ_{σ=±1} v_σ p_σ ⟨f_S f_T⟩_σ.   (41)
Here, averages ⟨· · ·⟩ over the full density P(ξ), Eq. (36), have to be evaluated
as appropriate sums over conditional averages ⟨· · ·⟩_σ corresponding to ξ drawn
from cluster σ:

⟨· · ·⟩ = p₊ ⟨· · ·⟩₊ + p₋ ⟨· · ·⟩₋.
For a large class of LVQ modulation functions the actual input ξ appears
on the right hand side of Eq. (41) only through its length and through the projections

h_s = w_s·ξ  and  b_σ = B_σ·ξ,   (42)

where we omitted indices μ but implicitly assume that the input is uncorrelated
with the current prototypes w_s. Note that also Heaviside terms as in Eq. (39)
do not depend on ξ explicitly; for example,

Θ(d₋ − d₊) = Θ( +2(h₊ − h₋) − Q₊₊ + Q₋₋ ).
When performing the average over the actual example ξ^μ we first exploit the fact
that

lim_{N→∞} ξ²/N = v₊ p₊ + v₋ p₋

for all input vectors in the thermodynamic limit. Furthermore, the joint Gaussian
density P(h₊, h₋, b₊, b₋) can be expressed as a sum over contributions from
the clusters. The respective conditional densities are fully specified by their first
and second moments.
Exploiting the central limit theorem in the same fashion as above, one obtains
for the cluster contributions ε_σ to the generalization error ε_g = p₊ε₊ + p₋ε₋:

ε_σ = Φ( [ Q_{σσ} − Q_{−σ−σ} − 2ℓ(R_{σσ} − R_{−σσ}) ] / [ 2√(v_σ) √(Q₊₊ − 2Q₊₋ + Q₋₋) ] ),   (46)

where Φ(z) = ∫_{−∞}^{z} dx e^{−x²/2}/√(2π).
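Evaluating this (reconstructed) expression for given order parameters is immediate;
the dictionary-based bookkeeping below is purely illustrative:

    import numpy as np
    from scipy.stats import norm

    def eps_sigma(sig, R, Q, v, ell=1.0):
        # R[(S, sigma)] = w_S . B_sigma and Q[(S, T)] = w_S . w_T
        num = Q[(sig, sig)] - Q[(-sig, -sig)] \
              - 2.0 * ell * (R[(sig, sig)] - R[(-sig, sig)])
        den = 2.0 * np.sqrt(v[sig]) \
              * np.sqrt(Q[(1, 1)] - 2.0 * Q[(1, -1)] + Q[(-1, -1)])
        return norm.cdf(num / den)   # Phi as defined above

    def e_g(R, Q, p, v):
        return sum(p[s] * eps_sigma(s, R, Q, v) for s in (+1, -1))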
By inserting {R_{Sσ}(α), Q_{ST}(α)} we obtain the learning curve ε_g(α), i.e. the
typical generalization error after on-line training with αN random examples.
Here we once more exploit the fact that the order parameters and, thus, also ε_g
are self-averaging, non-fluctuating quantities in the thermodynamic limit
N → ∞.
As an example, we consider the dynamics of LVQ1, cf. Eq. (39). Fig. 5 (left
panel) displays the learning curves as obtained for a particular setting of the
model parameters and different choices of the learning rate η. Initially, a large
learning rate is favorable, whereas smaller values of η facilitate better generalization
behavior at large α. One can argue that, as in stochastic gradient
descent procedures, the best asymptotic ε_g will be achieved for small learning
rates η → 0. In this limit we can omit terms quadratic in η from the differential
equations and integrate them directly in the rescaled time ηα. The asymptotic,
stationary result for ηα → ∞ then corresponds to the best achievable performance
of the algorithm in the given model settings. Figure 5 (right panel)
displays, among others, an example result for LVQ1.
With the formalism outlined above it is readily possible to compare different
algorithms within the same model situation. This concerns, for instance,
the detailed prototype dynamics and the sensitivity to initial conditions. Here we
restrict ourselves to three example algorithms and very briefly discuss their essential
properties and the achievable generalization ability:

LVQ1: This basic prescription was already defined above as the original
WTA algorithm with modulation function f_s = Θ(d_{−s} − d_{+s}) sσ.
Fig. 5. Left: learning curves ε_g(α) of LVQ1 for a particular setting of the model
parameters and different learning rates η. Right: achievable generalization error
as a function of the prior weight p₊.
LVQ+/-: Here, both prototypes are updated for every example: the prototype
carrying the correct label is moved towards the input, while the other one is pushed
farther away. In our simple model scenario this amounts to the modulation
function f_s = sσ = ±1.
LFM: The so-called learning from mistakes (LFM) scheme performs an
update of the LVQ+/- type, but only for inputs which are misclassified
before the learning step: f_s = Θ(d_σ − d_{−σ}) sσ.
the same class of data strongly resembles the behavior of unsupervised Vector
Quantization. First results along these lines have been obtained recently; see for
instance [45].
The variational optimization, as discussed for the perceptron in detail, should
give insights into the essential features of robust and successful LVQ schemes.
Due to the more complex input density, however, the analysis proves quite in-
volved and has not yet been completed.
A highly relevant extension of LVQ is that of relevance learning. Here the
idea is to replace the simple-minded Euclidean metric by an adaptive measure.
An important example is a weighted distance of the form

d(w, ξ) = Σ_{j=1}^{N} λ_j² (w_j − ξ_j)²   with   Σ_{j=1}^{N} λ_j² = 1,

where the normalized factors λ_j are called relevances, as they measure the
importance of dimension j in the classification. Relevance LVQ (RLVQ) and related
schemes update these factors according to a heuristic or gradient based scheme in
the course of training [46,47]. More complex schemes employ a full matrix of rel-
evances in a generalized quadratic measure or consider local measures attached
to the prototypes [48].
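A schematic sketch of the weighted distance, together with a heuristic relevance
update in the spirit of such schemes (a stand-in illustration, not the exact
prescriptions of [46,47]):

    import numpy as np

    def weighted_dist(w, xi, lam2):
        # lam2 holds the squared relevances lambda_j^2, normalized to sum 1
        return np.sum(lam2 * (w - xi)**2)

    def relevance_update(lam2, w_win, xi, match, eps=0.01):
        # shift relevance away from (towards) strongly deviating dimensions
        # when the winning prototype is correct (wrong), then renormalize
        contrib = (w_win - xi)**2
        lam2 = np.clip(lam2 + (-eps if match else eps) * contrib, 0.0, None)
        return lam2 / lam2.sum()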
The analysis of the corresponding training dynamics constitutes another chal-
lenge in the theory of on-line learning. The description has to go beyond the
techniques discussed in this paper, as the relevances define a time-dependent
linear transformation of feature space.
training from correlated data or from a fixed pool of examples, query strategies,
to name only a few. We can only refer to the list of references, in particular [27]
and [28] may serve as a starting point for the interested reader.
Due to the conceptual simplicity of the approach and its applicability in a
wide range of contexts it will certainly continue to facilitate better theoretical
understanding of learning systems in general. Current challenges include the
treatment of non-trivial input distributions, the dynamics of learning in more
complex network architectures, the optimization of algorithms and their practical
implementation in such systems, or the investigation of component-wise
updates as, for instance, in relevance LVQ.
References
1. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equa-
tions of State calculations by fast computing machines. J. Chem. Phys. 21, 1087
(1953)
2. Huang, K.: Statistical Mechanics. Wiley and Sons, New York (1987)
3. Jaynes, E.T.: Probability Theory: The Logic of Science. Bretthorst, G.L. (ed.),
Cambridge University Press, Cambridge, UK (2003)
4. Mace, C.W.H., Coolen, T.: Dynamics of Supervised Learning with Restricted
Training Sets. Statistics and Computing 8, 55-88 (1998)
5. Biehl, M., Schwarze, H.: On-line learning of a time-dependent rule. Europhys.
Lett. 20, 733-738 (1992)
6. Biehl, M., Schwarze, H.: Learning drifting concepts with neural networks. Journal
of Physics A: Math. Gen. 26, 2651-2665 (1993)
7. Kinouchi, O., Caticha, N.: Lower bounds on generalization errors for drifting rules.
J. Phys. A: Math. Gen. 26, 6161-6171 (1993)
8. Vicente, R., Kinouchi, O., Caticha, N.: Statistical Mechanics of Online Learning of
Drifting Concepts: A Variational Approach. Machine Learning 32, 179-201 (1998)
9. Reents, G., Urbanczik, R.: Self-averaging and on-line learning. Phys. Rev. Lett.
80, 5445-5448 (1998)
10. Kinzel, W., Rujan, P.: Improving a network generalization ability by selecting
examples. Europhys. Lett. 13, 473 (1990)
11. Kinouchi, O., Caticha, N.: Optimal generalization in perceptrons. J. Phys. A: Math.
Gen. 25, 6243-6250 (1992)
12. Copelli, M., Caticha, N.: On-line learning in the committee machine. J. Phys. A:
Math. Gen. 28, 1615-1625 (1995)
13. Biehl, M., Riegler, P.: On-line Learning with a Perceptron. Europhys. Lett. 28,
525-530 (1994)
14. Biehl, M., Riegler, P., Stechert, M.: Learning from Noisy Data: An Exactly Solvable
Model. Phys. Rev. E 52, R4624-R4627 (1995)
15. Copelli, M., Eichhorn, R., Kinouchi, O., Biehl, M., Simonetti, R., Riegler, P.,
Caticha, N.: Noise robustness in multilayer neural networks. Europhys. Lett. 37,
427-432 (1995)
16. Vicente, R., Caticha, N.: Functional optimization of online algorithms in multilayer
neural networks. J. Phys. A: Math. Gen. 30, L599-L605 (1997)
17. Opper, M.: A Bayesian approach to on-line learning. In: [27], pp. 363-378 (1998)
18. Opper, M., Winther, O.: A mean field approach to Bayes learning in feed-forward
neural networks. Phys. Rev. Lett. 76, 1964-1967 (1996)
19. Solla, S.A., Winther, O.: Optimal perceptron learning: an online Bayesian ap-
proach. In: [27], pp. 379-398 (1998)
20. Cybenko, G.V.: Approximation by superposition of a sigmoidal function. Math. of
Control, Signals and Systems 2, 303-314 (1989)
21. Endres, D., Riegler, P.: Adaptive systems on different time scales. J. Phys. A:
Math. Gen. 32, 8655-8663 (1999)
22. Biehl, M., Schwarze, H.: Learning by on-line gradient descent. J. Phys A: Math.
Gen. 28, 643 (1995)
23. Saad, D., Solla, S.A.: Exact solution for on-line learning in multilayer neural net-
works. Phys. Rev. Lett. 74, 4337-4340 (1995)
24. Saad, D., Solla, S.A.: Online learning in soft committee machines. Phys. Rev. E
52, 4225-4243 (1995)
25. Biehl, M., Riegler, P., Wöhler, C.: Transient Dynamics of On-line Learning in two-
layered neural networks. J. Phys. A: Math. Gen. 29, 4769 (1996)
26. Saad, D., Rattray, M.: Globally optimal parameters for on-line learning in multilayer
neural networks. Phys. Rev. Lett. 79, 2578 (1997)
27. Saad, D. (ed.): On-line learning in neural networks. Cambridge University Press,
Cambridge, UK (1998)
28. Engel, A., Van den Broeck, C.: The Statistical Mechanics of Learning. Cambridge
University Press, Cambridge, UK (2001)
29. Schlösser, E., Saad, D., Biehl, M.: Optimisation of on-line Principal Component
Analysis. J. Phys. A: Math. Gen. 32, 4061 (1999)
30. Biehl, M., Schlösser, E.: The dynamics of on-line Principal Component Analysis.
J. Phys. A: Math. Gen. 31, L97 (1998)
31. Biehl, M., Mietzner, A.: Statistical mechanics of unsupervised learning. Europhys.
Lett. 27, 421-426 (1993)
32. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (1997)
33. Kohonen, T.: Learning vector quantization. In: Arbib, M.A. (ed.) The Handbook
of Brain Theory and Neural Networks., pp. 537-540. MIT Press, Cambridge, MA
(1995)
34. Van den Broeck, C., Reimann, P.: Unsupervised Learning by Examples: On-line
Versus Off-line. Phys. Rev. Lett. 76, 2188-2191, (1996)
35. Reimann, P., Van den Broeck, C., Bex, G.J.: A Gaussian Scenario for Unsupervised
Learning. J. Phys. A: Math. Gen. 29, 3521-3533 (1996)
36. Riegler, P., Biehl, M., Solla, S.A., Marangi, C.: On-line learning from clustered
input examples. In: Marinaro, M., Tagliaferri, R. (eds.) Neural Nets WIRN Vietri-
95, Proc. of the 7th Italian Workshop on Neural Nets, pp. 87-92. World Scientific,
Singapore (1996)
37. Marangi, C., Biehl, M., Solla, S.A.: Supervised learning from clustered input ex-
amples. Europhys. Lett. 30, 117-122 (1995)
38. Biehl, M.: An exactly solvable model of unsupervised learning. Europhysics Lett.
25, 391-396 (1994)
39. Meir, R.: Empirical risk minimization versus maximum-likelihood estimation: a
case study. Neural Computation 7, 144-157 (1995)
40. Barkai, N., Seung, H.S., Sompolinsky, H.: Scaling laws in learning of classification
tasks. Phys. Rev. Lett. 70, 3167-3170 (1993)
41. Neural Networks Research Centre. Bibliography on the self-organizing maps
(SOM) and learning vector quantization (LVQ). Helsinki University of Technology,
available on-line: http://liinwww.ira.uka.de/bibliography/Neural/SOM.LVQ.html
(2002)
42. Biehl, M, Ghosh, A., Hammer, B.: Dynamics and generalization ability of LVQ
algorithms. J. Machine Learning Research 8, 323-360 (2007)
43. Biehl, M., Freking, A., Reents, G.: Dynamics of on-line competitive learning.
Europhysics Letters 38, 73-78 (1997)
44. Biehl, M., Ghosh, A., Hammer, B.: Learning Vector Quantization: The Dynamics
of Winner-Takes-All algorithms. Neurocomputing 69, 660-670 (2006)
45. Witoelar, A., Biehl, M., Ghosh, A., Hammer, B.: Learning Dynamics of Neural
Gas and Vector Quantization. Neurocomputing 71, 1210-1219 (2008)
46. Bojer, T., Hammer, B., Schunk, D., Tluk von Toschanowitz, K.: Relevance deter-
mination in learning vector quantization. In: Verleysen, M. (ed.) European Sym-
posium on Artificial Neural Networks ESANN 2001, pp. 271-276. D-facto publica-
tions, Belgium (2001)
47. Hammer, B., Villmann, T.: Generalized relevance learning vector quantization.
Neural Networks 15, 1059-1068 (2002)
48. Schneider, P., Biehl, M., Hammer, B.: Relevance Matrices in Learning Vector
Quantization. In: Verleysen, M. (ed.) European Symposium on Artificial Neural
Networks ESANN 2007, pp. 37-43. d-side publishing, Belgium (2007)