arXiv:cond-mat/9501047v1 12 Jan 1995
GENERALIZED SIMULATED ANNEALING
Constantino TSALLIS1,2 and Daniel A. STARIOLO1
1 - Centro Brasileiro de Pesquisas Físicas
Rua Xavier Sigaud, 150
22290-180 – Rio de Janeiro – RJ, Brazil
2 - Department of Physics and Astronomy and
Center for Fundamental Materials Research
Michigan State University
East Lansing, Michigan 48824-1116, USA
Abstract
We propose a new stochastic algorithm (generalized simulated annealing) for computationally finding the global minimum of a given
(not necessarily convex) energy/cost function defined in a continuous
D-dimensional space. This algorithm recovers, as particular cases,
the so-called classical (“Boltzmann machine”) and fast (“Cauchy machine”) simulated annealings, and can be quicker than both.
Key-words: Simulated annealing; Nonconvex optimization; Gradient
descent; Generalized Statistical Mechanics.
1 INTRODUCTION
The central step of an enormous variety of problems (in Physics, Chemistry,
Statistics, Neural Networks, Engineering, Economics) is the minimization of
an appropriate energy/cost function defined in a D-dimensional continuous
space (~x ∈ R^D). If the energy is convex (single minimum), any gradient
descent method easily solves the problem. But if the energy is nonconvex
(multiple extrema) the solution requires more sophisticated methods, since a
gradient descent procedure could easily trap the system in a local minimum
(instead of one of the global minima we are looking for). This sophistication
must necessarily involve possible “hill climbings” (for detrapping from local
minima), and can be heavily computer-time-consuming. Consequently, various algorithmic strategies have been developed over the years for making
this important problem increasingly tractable. One of the generically most
efficient (hence popular) methods is simulated annealing, to which this paper is dedicated. In this technique, one or more artificial temperatures are
introduced and gradually cooled, in complete analogy with the well known
annealing technique frequently used in metallurgy for making a molten metal
reach its crystalline state (global minimum of the thermodynamic energy). This artificial temperature (or set of temperatures) acts as a source
of stochasticity, extremely convenient for eventually detrapping from local
minima. Near the end of the process, the system hopefully is inside the
attractive basin of the global minimum (or in one of the global minima, if
more than one exists, i.e., if there is degeneracy), the temperature is practically zero, and the procedure asymptotically becomes a gradient descent one.
The challenge is to cool the temperature as quickly as possible while still
guaranteeing that no permanent trapping in any local minimum occurs.
More precisely, we search for the quickest annealing (i.e., in some
sense approaching a quenching) which keeps the probability of ending in
a global minimum equal to one. The first nontrivial solution along
this line was provided in 1983 by Kirkpatrick et al [1] for classical systems,
and was extended in 1986 by Ceperley and Alder [2] for quantum systems.
It strictly follows quasi-equilibrium Boltzmann-Gibbs statistics. The system
“tries” to visit, according to a visiting distribution assumed to be Gaussian
(i.e., a local search distribution), the neighborhood of its current state ~x.
The jump is always accepted if it is downhill (of the energy/cost function);
if it is hill climbing, it might be accepted according to an acceptance probability assumed to be the canonical-ensemble Boltzmann-Gibbs one. Geman
and Geman [3] showed, for the classical case, that a necessary and sufficient condition for having probability one of ending in a global minimum is
that the temperature decreases logarithmically with time. This algorithm is
sometimes referred to as classical simulated annealing (CSA) or Boltzmann
machine. We easily recognize that if, instead of decreasing, the temperature
were kept fixed, this procedure would be precisely the well-known Metropolis
et al. [4] one for simulating thermostatistical equilibrium.
The next interesting step along the present line was Szu’s 1987 proposal [5]
of using a Cauchy-Lorentz visiting distribution, instead of the Gaussian one.
This is a semi-local search distribution: the jumps are frequently local, but
can occasionally be quite long (in fact, this is a Lévy-flight-like distribution).
The acceptance algorithm remains the same as before. As Szu and Hartley
showed, the cooling can now be much faster (the temperature is now allowed
to decrease like the inverse of time), which makes the entire procedure considerably
more efficient. This algorithm is referred to as fast simulated annealing (FSA)
or Cauchy machine.
The goal of the present work is to generalize both annealings within a
unified picture inspired by the recently proposed Generalized Statistical Mechanics [6, 7] (see also [8, 9]), with the supplementary bonus of providing an
algorithm which is even quicker than Szu's. In Section 2, we briefly
review this generalized thermostatistics, describe the optimization algorithm
and prove that, if the cooling rate is appropriate, the probability of ending in a global minimum equals one. In Section 3, we numerically discuss a
simple D = 1 example. Finally, we conclude in Section 4.
2 GENERALIZED SIMULATED ANNEALING (GSA)
Inspired by multifractals, one of us proposed [6] a generalized entropy S_q as
follows

S_q = k [1 − ∑_i p_i^q] / (q − 1)    (q ∈ R)    (1)
where {p_i} are the probabilities of the microscopic configurations and k is
a conventional positive constant. In the q → 1 limit, S_q recovers the well-known Shannon expression −k_B ∑_i p_i ln p_i. Optimization of this entropy for
the canonical ensemble yields

p_i = [1 − β(1 − q)E_i]^{1/(1−q)} / Z_q    (2)

with

Z_q ≡ ∑_i [1 − β(1 − q)E_i]^{1/(1−q)}    (3)

where β ≡ 1/kT is a Lagrange parameter, and {E_i} is the energy spectrum.
We immediately verify that, in the q → 1 limit, we recover Boltzmann-Gibbs
statistics, namely p_i = exp(−βE_i)/Z_1 with Z_1 ≡ ∑_i exp(−βE_i).
Let us now focus on the acceptance probability P_{q_A}(~x_t → ~x_{t+1}), where t is
the discrete time (t = 1, 2, 3, · · ·) corresponding to the computer iterations.
For the Boltzmann machine (q_A = 1) we have, for example, the Metropolis
algorithm [4]:

P_1(~x_t → ~x_{t+1}) = { 1                                       if E(~x_{t+1}) < E(~x_t)
                       { exp{[E(~x_t) − E(~x_{t+1})]/T_1^A(t)}   if E(~x_{t+1}) ≥ E(~x_t)    (4)
where T1A (t) is the qA = 1 acceptance temperature at time t (k = 1 from
now on). We see that T1A (t) = +0 implies P1 = 1 if E(~xt+1 ) < E(~xt ), and
P1 = 0 if E(~xt+1 ) ≥ E(~xt ). These limits are important for the asymptotic
stabilization of the annealing process, and we will require them to be satisfied
by the generalized form. Eq.(4) satisfies detailed balance. A generalization
of Eq.(4) that also satisfies the detailed balance condition, and asymptotically generates states distributed according to Eq.(2), must involve the ratio
of terms of the form 1 − β(1 − q)E. Nevertheless, we could not find a generalization along this line which satisfies the T = 0 limits mentioned above.
Instead, we worked with a form that generalizes Eq.(4), satisfies the limits at
T = 0, and goes to an equilibrium distribution, although generically different
from that of Eq.(2). This generalized acceptance probability reads:
P_{q_A}(~x_t → ~x_{t+1}) = { 1                                                                       if E(~x_{t+1}) < E(~x_t)
                           { 1 / [1 + (q_A − 1)(E(~x_{t+1}) − E(~x_t))/T_{q_A}^A(t)]^{1/(q_A − 1)}   if E(~x_{t+1}) ≥ E(~x_t)    (5)
Although it is possible to work under generic conditions, for simplicity we
shall assume here that E(~x) ≥ 0 (∀~x). Moreover, we shall assume that
qA ≥ 1, so TqAA (t) can decrease down to zero without any type of singularities.
Within these hypotheses, PqA ∈ [0, 1] (∀qA ), and, for TqAA (t) decreasing from
infinity to zero, PqA monotonically varies from 1 to 0 if E(~xt+1 ) ≥ E(~xt ), and
equals 1 whenever E(~xt+1 ) < E(~xt ).
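In code, Eq. (5) is straightforward; the sketch below (our own, with a hypothetical function name) also includes the q_A → 1 Metropolis limit of Eq. (4) for comparison:

```python
import math

def p_accept(delta_e, temp, qa):
    """Acceptance probability of Eq. (5); delta_e = E(x_new) - E(x_old)."""
    if delta_e < 0.0:
        return 1.0                       # downhill moves are always accepted
    if qa == 1.0:                        # q_A -> 1 limit: Boltzmann factor, Eq. (4)
        return math.exp(-delta_e / temp)
    base = 1.0 + (qa - 1.0) * delta_e / temp
    return base ** (-1.0 / (qa - 1.0)) if base > 0.0 else 0.0
```

As stated above, for qa ≥ 1 and a fixed uphill delta_e, the returned probability decreases monotonically from 1 to 0 as the temperature is lowered from infinity to zero.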
We can now focus on the ~x_t → ~x_{t+1} isotropic visiting distribution g_{q_V}(Δx_t),
where Δx_t ≡ (~x_{t+1} − ~x_t). It satisfies

Ω_D ∫_0^∞ dρ ρ^{D−1} g_{q_V}(ρ) = 1    (6)

where Ω_D ≡ D π^{D/2}/Γ(D/2 + 1) is the D-dimensional complete solid angle.
For the Boltzmann machine (q_V = 1) we have [1, 5]

g_1(Δx_t) ∝ exp[−(Δx_t)²/T_1^V(t)]    (7)

where T_1^V(t) is the q_V = 1 visiting temperature at time t. Using condition (6)
we obtain

g_1(Δx_t) = exp[−(Δx_t)²/T_1^V(t)] / [π T_1^V(t)]^{D/2}    (8)
For the Cauchy machine (q_V = 2) we have [5]

g_2(Δx_t) ∝ T_2^V(t) / {[T_2^V(t)]² + (Δx_t)²}^{(D+1)/2}    (9)

where T_2^V(t) is the q_V = 2 visiting temperature at time t. The functional
form of Eq. (9) is the D-dimensional Fourier transform of exp{−T_2^V(t)|~y|}
(see [5]). Using condition (6) we obtain

g_2(Δx_t) = [Γ((D+1)/2) / π^{(D+1)/2}] T_2^V(t) / {[T_2^V(t)]² + (Δx_t)²}^{(D+1)/2}    (10)
Within the present scheme, a natural proposal for unifying (8) and (10) is

g_{q_V}(Δx_t) = c [T_{q_V}^V(t)]^d / {[T_{q_V}^V(t)]^e + (q_V − 1) b (Δx_t)²}^{a/(q_V−1)}    (11)

where a, b, c, d and e are (q_V, D)-dependent pure numbers to be determined.
Using condition (6) and recalling that Δx_t may carry dimensions (e.g., [length]),
we immediately establish that

d = e [2a − D(q_V − 1)] / [2(q_V − 1)]    (∀q_V, ∀D)    (12)
To further decrease the number of independent pure numbers to be determined, let us address a central point, namely the fact that the method has to
guarantee that, in the t → ∞ limit, the system is at a global minimum.
For this to occur (see [5] and references therein) the states must be visited
“infinitely often in time (iot)”, which indeed occurs if

∑_{t=t_0}^∞ g_{q_V}(Δx_{t_0}) diverges

for fixed Δx_{t_0} with t_0 ≫ 1. Under these conditions we have that

∑_{t=t_0}^∞ g_{q_V}(Δx_{t_0}) ∝ ∑_{t=t_0}^∞ [T_{q_V}^V(t)]^d    (13)
We know [5] that, for arbitrary D, T_1^V(t) = T_1^V(1) ln 2/ln(1 + t) and T_2^V(t) =
T_2^V(1)/t, which are conveniently unified with

T_{q_V}^V(t) = T_{q_V}(1) [2^{q_V−1} − 1] / [(1 + t)^{q_V−1} − 1]    (14)

∼ T_{q_V}(1) [2^{q_V−1} − 1] / t^{q_V−1}    (t → ∞)    (14′)

Replacing (14′) into Eq. (13) we obtain

∑_{t=t_0}^∞ g_{q_V}(Δx_{t_0}) ∝ ∑_{t=t_0}^∞ 1/t^{(q_V−1)d}    (15)
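The unified schedule of Eq. (14) is simple to implement; a minimal sketch (the function name `t_visit` is ours), with the q_V → 1 logarithmic limit handled explicitly:

```python
import math

def t_visit(t, t1, qv):
    """Visiting temperature at step t from Eq. (14), with T(1) = t1."""
    if abs(qv - 1.0) < 1e-12:            # q_V -> 1 limit: logarithmic cooling (CSA)
        return t1 * math.log(2.0) / math.log(1.0 + t)
    return t1 * (2.0 ** (qv - 1.0) - 1.0) / ((1.0 + t) ** (qv - 1.0) - 1.0)
```

For qv = 2 this reduces to t1/t (FSA), and t_visit(1, t1, qv) = t1 for any qv, so the schedule always starts from the initial temperature.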
For arbitrary D and q_V = 1, 2 one has [5] (q_V − 1)d = 1. We assume, for
simplicity, that the same holds ∀q_V, hence

d = 1/(q_V − 1)    (∀q_V, ∀D)    (16)

consequently the series (15) is the harmonic one, hence diverges (logarithmically) as desired. Using Eqs. (12) and (16) in (11) we obtain
g_{q_V}(Δx_t) = c [T_{q_V}^V(t)]^{−D/[2a−D(q_V−1)]} / {1 + (q_V − 1) b (Δx_t)²/[T_{q_V}^V(t)]^{2/[2a−D(q_V−1)]}}^{a/(q_V−1)}    (17)
For q_V = 1, Eq. (17) must recover Eq. (8), hence b = 1 and a = 1 (for
arbitrary D). For q_V = 2, Eq. (17) must recover Eq. (10), hence b = 1 and
a = (D + 1)/2 (for arbitrary D). For simplicity we assume

b = 1    (∀q_V, ∀D)    (18)
Finally, condition (6) univocally determines the normalizing pure number c
as a function of the rest of the free parameters. Using this and Eq. (18) in
Eq. (17) yields

g_{q_V}(Δx_t) = [(q_V − 1)/π]^{D/2} [Γ(a/(q_V−1)) / Γ(a/(q_V−1) − D/2)] [T_{q_V}^V(t)]^{−D/[2a−D(q_V−1)]} / {1 + (q_V − 1)(Δx_t)²/[T_{q_V}^V(t)]^{2/[2a−D(q_V−1)]}}^{a/(q_V−1)}    (19)
where only one undetermined pure number (namely a(qV , D)) is now left.
It satisfies, as already mentioned, a(1, D) = 1 and a(2, D) = (D + 1)/2.
Although more general forms are possible, we shall adopt the simplest q_V-dependence, namely a linear interpolation, hence

a = 1 + [(D − 1)/2] (q_V − 1)    (∀q_V, ∀D)    (20)
Replacing this into Eq. (19) we obtain our final visiting distribution

g_{q_V}(Δx_t) = [(q_V − 1)/π]^{D/2} [Γ(1/(q_V−1) + (D−1)/2) / Γ(1/(q_V−1) − 1/2)] [T_{q_V}^V(t)]^{−D/(3−q_V)} / {1 + (q_V − 1)(Δx_t)²/[T_{q_V}^V(t)]^{2/(3−q_V)}}^{1/(q_V−1) + (D−1)/2}    (∀q_V, ∀D)    (21)

The second moment of this distribution diverges for q_V ≥ 5/3, and the
distribution becomes non-normalizable for q_V ≥ 3.
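For D = 1 and 1 < q_V < 3, Eq. (21) is, up to rescaling, a Student-t density with ν = (3 − q_V)/(q_V − 1) degrees of freedom, which incidentally gives a convenient way to draw the jumps as scale · standard_t(ν). This correspondence is our own observation, and the function names below are ours:

```python
import math
import numpy as np

def g_qv_1d(dx, temp, qv):
    """Visiting distribution of Eq. (21) for D = 1 (valid for 1 < qv < 3)."""
    s2 = temp ** (2.0 / (3.0 - qv))                  # [T_qV(t)]^{2/(3-qv)}
    pref = (math.sqrt((qv - 1.0) / math.pi)
            * math.gamma(1.0 / (qv - 1.0))
            / math.gamma(1.0 / (qv - 1.0) - 0.5)) / math.sqrt(s2)
    return pref * (1.0 + (qv - 1.0) * dx ** 2 / s2) ** (-1.0 / (qv - 1.0))

def t_params(temp, qv):
    """Equivalent Student-t degrees of freedom and scale for D = 1 sampling."""
    nu = (3.0 - qv) / (qv - 1.0)
    scale = temp ** (1.0 / (3.0 - qv)) / math.sqrt(3.0 - qv)
    return nu, scale
```

For qv = 2 this recovers the D = 1 Cauchy distribution of Eq. (10), and a jump can then be drawn as scale * rng.standard_t(nu) with a NumPy generator.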
There is no particular reason for T_{q_V}^V to be equal to T_{q_A}^A but, following
[5], we shall use here the simplest choice, i.e., T_{q_A}^A(t) = T_{q_V}^V(t), ∀t (given by
Eq. (14)). We can now summarize the whole algorithm for finding a global
minimum of a given energy/cost function E(~x):

(i) Fix (q_A, q_V). Start, at t = 1, with an arbitrary value ~x_1 and a high
enough value for T_{q_V}(1) (say, about 2 times the height of the highest
expected “hill” of E(~x)), and calculate E(~x_1);
(ii) Then randomly generate ~xt+1 from ~xt according to Eq. (21) to determine
the size of the jump ∆xt , and isotropically determine its direction;
(iii) Then calculate E(~xt+1 ):
If E(~xt+1 ) < E(~xt ), replace ~xt by ~xt+1 ;
If E(~xt+1 ) ≥ E(~xt ), draw a random number r ∈ [0, 1]: if r > PqA given
by Eq. (5) with TqAA (t) = TqVV (t), retain ~xt ; otherwise, replace ~xt by
~xt+1 ;
(iv) Calculate the new temperature TqVV using Eq. (14) and go back to (ii)
until the minimum of E(~x) is reached within the desired precision.
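Putting steps (i)-(iv) together for D = 1, here is a minimal self-contained sketch (our own illustrative code, not the authors' implementation; drawing the Eq. (21) jump through its Student-t equivalent, and all function names, are assumptions of ours):

```python
import numpy as np

def gsa_1d(energy, x1, qv=2.5, qa=1.1, t1=100.0, n_steps=20000, seed=0):
    """One GSA run for a one-dimensional energy/cost function (steps i-iv)."""
    rng = np.random.default_rng(seed)
    x, e = x1, energy(x1)
    x_best, e_best = x, e
    nu = (3.0 - qv) / (qv - 1.0)         # Student-t dof equivalent to Eq. (21), D = 1
    for t in range(1, n_steps + 1):
        # step (iv): visiting temperature from Eq. (14), with T_qA = T_qV
        temp = t1 * (2.0 ** (qv - 1.0) - 1.0) / ((1.0 + t) ** (qv - 1.0) - 1.0)
        # step (ii): jump drawn from the visiting distribution, Eq. (21)
        scale = temp ** (1.0 / (3.0 - qv)) / np.sqrt(3.0 - qv)
        x_new = x + scale * rng.standard_t(nu)
        de = energy(x_new) - e
        # step (iii): downhill always accepted; uphill with the Eq. (5) probability
        if de < 0.0:
            accept = True
        else:
            base = 1.0 + (qa - 1.0) * de / temp
            accept = base > 0.0 and rng.random() <= base ** (-1.0 / (qa - 1.0))
        if accept:
            x, e = x_new, e + de
            if e < e_best:
                x_best, e_best = x, e
    return x_best, e_best
```

Run on the quartic energy of Section 3 with x1 = 2 and T_{q_V}(1) = 100 as in the paper, such a run typically ends near the global minimum at x ≈ −2.9035.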
3 A SIMPLE D=1 ILLUSTRATION
In this Section we numerically treat a simple D = 1 example with a double
purpose: on one hand to exhibit how the procedure works and, on the other,
to find for which pair (qV , qA ) the algorithm is the quickest. (We recall that
(qV , qA ) = (1, 1) corresponds to CSA and (2,1) to FSA).
We choose the same example treated in [5], namely

E(x) = x^4 − 16x^2 + 5x + E_0    (22)

where we have introduced the additive constant E_0 ≃ 78.3323 so that E(x) ≥
0, ∀x, thus satisfying the convention adopted below Eq. (5); see Fig. 1. As
initial conditions for all of our runs we used x1 = 2 and TqV = 100. In Fig.
2 we can see typical runs for qA = 1.1 and different values of qV . Clearly
the case qV = 2.5 is much faster and more precise than the classical and fast annealings (qV = 1.1 ≃ 1 and 2, respectively). To study the (qV , qA ) influence on
the quickness of the algorithm we adopted, once and for all, an arbitrary
convergence criterion. For each (qV , qA ) pair we evaluate the mean value
of xt in intervals of 100 time steps. Whenever two successive intervals presented mean values whose difference was smaller than a precision ε = 10^{−3}, we
stopped the run. We then evaluated the total iteration time τ and repeated
the whole annealing procedure 10 times. Finally, we compute the average
total calculation time < τ >. The (qV , qA ) dependence of the average < τ >
is presented in Fig. 3 for typical values of qA . Fig. 3 indicates that machines
with qA = 1.1 and qV = 2.9 are typically 5 times faster than the Cauchy machine [5], which is in turn about 5 times faster than the Boltzmann machine
[1, 3, 5]. We have done our simulations on a 486 DX microcomputer with
a clock of 50 MHz. In this machine each one of the 10 solutions demanded,
approximately, 1 minute and 20 seconds of CPU time for qV = 2, and only
20 seconds for qV = 2.5. Finally, in Fig. 4 we present the dependence of
< τ > on qA for qV = 2; for this case the Cauchy machine [5] performs
best. These results indicate that the quickest algorithm occurs for
qA = 1 and qV = 3.
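The stopping rule described above (means over successive 100-step intervals) can be sketched as follows; this is our reading of the criterion, with hypothetical names:

```python
def converged(xs, window=100, eps=1e-3):
    """True when the means of the last two successive windows differ by < eps."""
    if len(xs) < 2 * window:
        return False                     # not enough history yet
    m_prev = sum(xs[-2 * window:-window]) / window
    m_last = sum(xs[-window:]) / window
    return abs(m_last - m_prev) < eps
```

In a run, xs would be the sequence of visited positions x_t, checked every 100 iterations.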
4 CONCLUSION
Inspired by a recent generalization of Boltzmann-Gibbs statistical mechanics,
we have heuristically developed a generalized simulated annealing (characterized by the parameters (qV , qA )) which unifies the so-called classical (Boltzmann machine; qV = qA = 1) and fast (Cauchy machine; qV /2 = qA = 1)
ones. This computational method is based on stochastic dynamics (which
asymptotically becomes, as time runs to infinity, a gradient descent method),
and enables, with probability one, the identification of a global minimum of
any (sufficiently nonsingular) given energy/cost function which depends on a
continuous D-dimensional variable ~x. As the discrete time t increases, it
might happen that ~xt provisionally stabilizes at a given value before eventually
abandoning it and running towards the global minimum. This temporary residence
can be used, as bonus of the present method, to identify some of the local
minima. If sufficiently many computational runs are done by starting at random initial positions {~x1 }, this property could in principle be used to identify
all the local minima as well as all the global ones.
For simplicity, we have mainly discussed herein the restricted region qV ≥
1 and qA ≥ 1 (with E(~x) ≥ 0, ∀~x), and have identified the (qV , qA ) ≃ (2.9, 1)
machines as the best performing ones in practical terms. This algorithm
has been illustrated herein with a simple two-minima D = 1 energy function,
and has already been successfully used [10] for recovering the global energy
minima (with respect to the dihedral angle) of a variety of simple molecules
(e.g., CH3OH, H2O2, C2H6). It would be very interesting to test the present
generalized algorithm with many-body systems presenting a great number
of minima (spin-glass-like frustrated systems, the traveling salesman problem,
neural networks, complex economic systems).
We acknowledge N. Caticha for drawing our attention to Szu's algorithm [5], as well as K.C. Mundim and A.M.C. de Souza for very useful
discussions. At the time of submission of this paper we learned that
this algorithm (informally communicated by us to a few people) has already
been successfully implemented for the Traveling Salesman Problem [11, 12],
for fitting curves [13, 14], and for a problem of genetic mutations [15].
Caption for figures
Fig. 1: D = 1 energy function E(x) = x^4 − 16x^2 + 5x + E0 vs. x for E0 =
78.3323; global minimum: x∗1 = −2.90353 and E(x∗1 ) = 0; maximum:
x∗2 = 0.156731 and E(x∗2 ) = 78.7235; local minimum: x∗3 = 2.74680 and
E(x∗3 ) = 28.2735.
Fig. 2: Typical runs of the GSA algorithm xt vs. t (annealing time) for
initial conditions x1 = 2, TqV (1) = 100. All four runs correspond to
qA = 1.1; a) qV = 1.1, b) qV = 1.5, c) qV = 2, d) qV = 2.5. Notice the
different scales for the ordinates.
Fig. 3: Average total calculation time < τ > vs. qV for two typical values
of qA (△: qA = 1.5; ⊙: qA = 1.1).
Fig. 4: Average total calculation time < τ > vs. qA for qV = 2.
References
[1] S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, Science 220, 671 (13 May
1983).
[2] D. Ceperley and B. Alder, Science 231, 555 (7 February 1986).
[3] S. Geman and D. Geman, IEEE Trans. Patt. Anal. Mach. Int. PAMI-6
(No. 6), 721 (1984).
[4] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E.
Teller, J. Chem. Phys. 21, 1087 (1953).
[5] H. Szu and R. Hartley, Phys. Lett. A 122, 157 (1987).
[6] C. Tsallis, J. Stat. Phys. 52, 479 (1988).
[7] E.M.F. Curado and C. Tsallis, J. Phys. A 24, L69 (1991); Corrigenda:
J. Phys. A 24, 3187 (1991) and 25, 1019 (1992).
[8] C. Tsallis, “Extensive Versus Nonextensive Physics”, in “New Trends
in Magnetism, Magnetic Materials and Their Applications”, eds. J.L.
Morán-Lopez and J.M. Sánchez (Plenum Press, New York, 1994), 451.
[9] C. Tsallis, “Some Comments on Boltzmann-Gibbs Statistical Mechanics”, in “Chaos, Solitons and Fractals”, ed. G. Marshall (Pergamon
Press, Oxford, 1994), in press.
[10] K.C. Mundim and C. Tsallis, “Geometry Optimization and Conformational Analysis Through Generalized Simulated Annealing”, preprint
(1994).
[11] T.J.P. Penna, “The Traveling Salesman Problem and Tsallis Entropy”,
to appear in Phys. Rev. E, 1995.
[12] P. Serra, private communication (1994).
[13] T.J.P. Penna, “Fitting Curves by Simulated Annealing”, preprint (1994).
[14] T.J.P. Penna, P.M.C. de Oliveira, J.C. Sartorelli, W.M. Gonçalves and
R.D. Pinto, “Long Range Anticorrelations: Is the Healthy Heart a Leaky
Faucet?”, preprint (1994).
[15] C. Miron, “Optimisation par Recuit Simulé Généralisé”, Report, École
Normale Supérieure de Lyon, France (1994).