arXiv:cond-mat/9501047v1 12 Jan 1995
GENERALIZED SIMULATED ANNEALING
Constantino TSALLIS1,2 and Daniel A. STARIOLO1
1 - Centro Brasileiro de Pesquisas Físicas
Rua Xavier Sigaud, 150
22290-180 – Rio de Janeiro – RJ, Brazil
2 - Department of Physics and Astronomy and
Center for Fundamental Materials Research
Michigan State University
East Lansing, Michigan 48824-1116, USA
Abstract
We propose a new stochastic algorithm (generalized simulated annealing) for computationally finding the global minimum of a given
(not necessarily convex) energy/cost function defined in a continuous
D-dimensional space. This algorithm recovers, as particular cases,
the so-called classical (“Boltzmann machine”) and fast (“Cauchy machine”) simulated annealings, and can be quicker than both.
Key-words: Simulated annealing; Nonconvex optimization; Gradient
descent; Generalized Statistical Mechanics.
1 INTRODUCTION
The central step of an enormous variety of problems (in Physics, Chemistry,
Statistics, Neural Networks, Engineering, Economics) is the minimization of
an appropriate energy/cost function defined in a D-dimensional continuous
space (~x ∈ R^D). If the energy is convex (single minimum), any gradient
descent method easily solves the problem. But if the energy is nonconvex
(multiple extrema) the solution requires more sophisticated methods, since a
gradient descent procedure could easily trap the system in a local minimum
(instead of one of the global minima we are looking for). This sophistication
must necessarily involve possible “hill climbings” (for detrapping from local
minima), and can be heavily computer-time-consuming. Consequently, various algorithmic strategies have been developed over the years for making
this important problem increasingly tractable. One of the generically most
efficient (hence popular) methods is simulated annealing, to which this paper is dedicated. In this technique, one or more artificial temperatures are
introduced and gradually cooled, in complete analogy with the well known
annealing technique frequently used in metallurgy for making a molten metal
reach its crystalline state (global minimum of the thermodynamic energy). This artificial temperature (or set of temperatures) acts as a source
of stochasticity, extremely convenient for eventually detrapping from local
minima. Near the end of the process, the system hopefully is inside the
attractive basin of the global minimum (or in one of the global minima, if
more than one exists, i.e., if there is degeneracy), the temperature is practically zero, and the procedure asymptotically becomes a gradient descent one.
The challenge is to cool the temperature as quickly as possible while still
guaranteeing that no permanent trapping in any local minimum occurs.
More precisely, we search for the quickest annealing (i.e., in some
sense approaching a quenching) which keeps the probability of ending in
a global minimum equal to one. The first nontrivial solution along
this line was provided in 1983 by Kirkpatrick et al [1] for classical systems,
and was extended in 1986 by Ceperley and Alder [2] for quantum systems.
It strictly follows quasi-equilibrium Boltzmann-Gibbs statistics. The system
“tries” to visit, according to a visiting distribution assumed to be Gaussian
(i.e., a local search distribution), the neighborhood of its current state ~x.
The jump is always accepted if it is downhill (of the energy/cost function);
if it is hill climbing, it might be accepted according to an acceptance probability assumed to be the canonical-ensemble Boltzmann-Gibbs one. Geman
and Geman [3] showed, for the classical case, that a necessary and sufficient condition for having probability one of ending in a global minimum is
that the temperature decreases logarithmically with time. This algorithm is
sometimes referred to as classical simulated annealing (CSA) or Boltzmann
machine. We easily recognize that if, instead of decreasing, the temperature
were kept fixed, this procedure would be precisely the well-known Metropolis
et al. [4] one for simulating thermostatistical equilibrium.
The next interesting step along the present line was Szu’s 1987 proposal [5]
of using a Cauchy-Lorentz visiting distribution, instead of the Gaussian one.
This is a semi-local search distribution: the jumps are frequently local, but
can occasionally be quite long (in fact, this is a Lévy-flight-like distribution).
The acceptance algorithm remains the same as before. As Szu and Hartley
showed, the cooling can now be much faster (the temperature is now allowed
to decrease like the inverse of time), which makes the entire procedure considerably
more efficient. This algorithm is referred to as fast simulated annealing (FSA)
or Cauchy machine.
The goal of the present work is to generalize both annealings within a
unified picture inspired by the recently proposed Generalized Statistical Mechanics [6, 7] (see also [8, 9]), with the supplementary bonus of providing an
algorithm which is even quicker than Szu's. In Section 2, we briefly
review this generalized thermostatistics, describe the optimization algorithm
and prove that, if the cooling rate is appropriate, the probability of ending in a global minimum equals one. In Section 3, we numerically discuss a
simple D = 1 example. Finally, we conclude in Section 4.
2 GENERALIZED SIMULATED ANNEALING (GSA)
Inspired by multifractals, one of us proposed [6] a generalized entropy S_q as
follows

S_q = k [1 − ∑_i p_i^q] / (q − 1)    (q ∈ R)    (1)
where {p_i} are the probabilities of the microscopic configurations and k is
a conventional positive constant. In the q → 1 limit, S_q recovers the well-known Shannon expression −k_B ∑_i p_i ln p_i. Optimization of this entropy for
the canonical ensemble yields

p_i = [1 − β(1 − q)E_i]^{1/(1−q)} / Z_q    (2)

with

Z_q ≡ ∑_i [1 − β(1 − q)E_i]^{1/(1−q)}    (3)

where β ≡ 1/kT is a Lagrange parameter, and {E_i} is the energy spectrum.
We immediately verify that, in the q → 1 limit, we recover Boltzmann-Gibbs
statistics, namely p_i = exp(−βE_i)/Z_1 with Z_1 ≡ ∑_i exp(−βE_i).
Let us now focus on the acceptance probability P_{q_A}(~x_t → ~x_{t+1}), where t is
the discrete time (t = 1, 2, 3, · · ·) corresponding to the computer iterations.
For the Boltzmann machine (q_A = 1) we have, for example, the Metropolis
algorithm [4]:

P_1(~x_t → ~x_{t+1}) = { 1                                       if E(~x_{t+1}) < E(~x_t)
                       { exp{[E(~x_t) − E(~x_{t+1})]/T_1^A(t)}   if E(~x_{t+1}) ≥ E(~x_t)    (4)
where T1A (t) is the qA = 1 acceptance temperature at time t (k = 1 from
now on). We see that T1A (t) = +0 implies P1 = 1 if E(~xt+1 ) < E(~xt ), and
P1 = 0 if E(~xt+1 ) ≥ E(~xt ). These limits are important for the asymptotic
stabilization of the annealing process, and we will require them to be satisfied
by the generalized form. Eq.(4) satisfies detailed balance. A generalization
of Eq.(4) that also satisfies the detailed balance condition, and asymptotically generates states distributed according to Eq.(2), must involve the ratio
of terms of the form 1 − β(1 − q)E. Nevertheless, we could not find a generalization along this line which satisfies the T = 0 limits mentioned above.
Instead, we worked with a form that generalizes Eq.(4), satisfies the limits at
T = 0, and goes to an equilibrium distribution, although generically different
from that of Eq.(2). This generalized acceptance probability reads:
P_{q_A}(~x_t → ~x_{t+1}) = { 1                                                                       if E(~x_{t+1}) < E(~x_t)
                           { 1 / [1 + (q_A − 1)(E(~x_{t+1}) − E(~x_t))/T_{q_A}^A(t)]^{1/(q_A − 1)}   if E(~x_{t+1}) ≥ E(~x_t)    (5)
Although it is possible to work under generic conditions, for simplicity we
shall assume here that E(~x) ≥ 0 (∀~x). Moreover, we shall assume that
qA ≥ 1, so TqAA (t) can decrease down to zero without any type of singularities.
Within these hypotheses, PqA ∈ [0, 1] (∀qA ), and, for TqAA (t) decreasing from
infinity to zero, PqA monotonically varies from 1 to 0 if E(~xt+1 ) ≥ E(~xt ), and
equals 1 whenever E(~xt+1 ) < E(~xt ).
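In code, Eq. (5) is straightforward; the sketch below (our own, with a hypothetical function name) also includes the q_A → 1 Metropolis limit of Eq. (4) for comparison:

```python
import math

def p_accept(delta_e, temp, qa):
    """Acceptance probability of Eq. (5); delta_e = E(x_new) - E(x_old)."""
    if delta_e < 0.0:
        return 1.0                       # downhill moves are always accepted
    if qa == 1.0:                        # q_A -> 1 limit: Boltzmann factor, Eq. (4)
        return math.exp(-delta_e / temp)
    base = 1.0 + (qa - 1.0) * delta_e / temp
    return base ** (-1.0 / (qa - 1.0)) if base > 0.0 else 0.0
```

As stated above, for qa ≥ 1 and a fixed uphill delta_e, the returned probability decreases monotonically from 1 to 0 as the temperature is lowered from infinity to zero.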
We can now focus on the ~x_t → ~x_{t+1} isotropic visiting distribution g_{q_V}(Δx_t),
where Δx_t ≡ (~x_{t+1} − ~x_t). It satisfies

Ω_D ∫_0^∞ dρ ρ^{D−1} g_{q_V}(ρ) = 1    (6)

where Ω_D ≡ D π^{D/2}/Γ(D/2 + 1) is the D-dimensional complete solid angle.
For the Boltzmann machine (q_V = 1) we have [1, 5]

g_1(Δx_t) ∝ exp[−(Δx_t)²/T_1^V(t)]    (7)

where T_1^V(t) is the q_V = 1 visiting temperature at time t. Using condition (6)
we obtain

g_1(Δx_t) = exp[−(Δx_t)²/T_1^V(t)] / [π T_1^V(t)]^{D/2}    (8)
For the Cauchy machine (q_V = 2) we have [5]

g_2(Δx_t) ∝ T_2^V(t) / {[T_2^V(t)]² + (Δx_t)²}^{(D+1)/2}    (9)

where T_2^V(t) is the q_V = 2 visiting temperature at time t. The functional
form of Eq. (9) is the D-dimensional Fourier transform of exp{−T_2^V(t)|~y|}
(see [5]). Using condition (6) we obtain

g_2(Δx_t) = [Γ((D+1)/2) / π^{(D+1)/2}] T_2^V(t) / {[T_2^V(t)]² + (Δx_t)²}^{(D+1)/2}    (10)
Within the present scheme, a natural proposal for unifying (8) and (10) is

g_{q_V}(Δx_t) = c [T_{q_V}^V(t)]^d / {[T_{q_V}^V(t)]^e + (q_V − 1) b (Δx_t)²}^{a/(q_V−1)}    (11)

where a, b, c, d and e are (q_V, D)-dependent pure numbers to be determined.
Using condition (6) and recalling that Δx_t may carry dimensions (e.g., [length]),
we immediately establish that

d = e [2a − D(q_V − 1)] / [2(q_V − 1)]    (∀q_V, ∀D)    (12)
To further decrease the number of independent pure numbers to be determined, let us address a central point, namely the fact that the method has to
guarantee that, in the t → ∞ limit, the system is at a global minimum.
For this to occur (see [5] and references therein) the states must be visited
“infinitely often in time (iot)”, which indeed occurs if

∑_{t=t_0}^∞ g_{q_V}(Δx_{t_0}) diverges

for fixed Δx_{t_0} with t_0 ≫ 1. Under these conditions we have that

∑_{t=t_0}^∞ g_{q_V}(Δx_{t_0}) ∝ ∑_{t=t_0}^∞ [T_{q_V}^V(t)]^d    (13)
We know [5] that, for arbitrary D, T_1^V(t) = T_1^V(1) ln 2/ln(1 + t) and T_2^V(t) =
T_2^V(1)/t, which are conveniently unified with

T_{q_V}^V(t) = T_{q_V}(1) [2^{q_V−1} − 1] / [(1 + t)^{q_V−1} − 1]    (14)

∼ T_{q_V}(1) [2^{q_V−1} − 1] / t^{q_V−1}    (t → ∞)    (14′)

Replacing (14′) into Eq. (13) we obtain

∑_{t=t_0}^∞ g_{q_V}(Δx_{t_0}) ∝ ∑_{t=t_0}^∞ 1/t^{(q_V−1)d}    (15)
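The unified schedule of Eq. (14) is simple to implement; a minimal sketch (the function name `t_visit` is ours), with the q_V → 1 logarithmic limit handled explicitly:

```python
import math

def t_visit(t, t1, qv):
    """Visiting temperature at step t from Eq. (14), with T(1) = t1."""
    if abs(qv - 1.0) < 1e-12:            # q_V -> 1 limit: logarithmic cooling (CSA)
        return t1 * math.log(2.0) / math.log(1.0 + t)
    return t1 * (2.0 ** (qv - 1.0) - 1.0) / ((1.0 + t) ** (qv - 1.0) - 1.0)
```

For qv = 2 this reduces to t1/t (FSA), and t_visit(1, t1, qv) = t1 for any qv, so the schedule always starts from the initial temperature.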
For arbitrary D and q_V = 1, 2 one has [5] (q_V − 1)d = 1. We assume, for
simplicity, that the same holds ∀q_V, hence

d = 1/(q_V − 1)    (∀q_V, ∀D)    (16)

consequently the series (15) is the harmonic one, hence diverges (logarithmically) as desired. Using Eqs. (12) and (16) in (11) we obtain
g_{q_V}(Δx_t) = c [T_{q_V}^V(t)]^{−D/[2a−D(q_V−1)]} / {1 + (q_V − 1) b (Δx_t)²/[T_{q_V}^V(t)]^{2/[2a−D(q_V−1)]}}^{a/(q_V−1)}    (17)
For q_V = 1, Eq. (17) must recover Eq. (8), hence b = 1 and a = 1 (for
arbitrary D). For q_V = 2, Eq. (17) must recover Eq. (10), hence b = 1 and
a = (D + 1)/2 (for arbitrary D). For simplicity we assume

b = 1    (∀q_V, ∀D)    (18)
Finally, condition (6) univocally determines the normalizing pure number c
as a function of the rest of the free parameters. Using this and Eq. (18) in
Eq. (17) yields

g_{q_V}(Δx_t) = [(q_V − 1)/π]^{D/2} [Γ(a/(q_V−1)) / Γ(a/(q_V−1) − D/2)] [T_{q_V}^V(t)]^{−D/[2a−D(q_V−1)]} / {1 + (q_V − 1)(Δx_t)²/[T_{q_V}^V(t)]^{2/[2a−D(q_V−1)]}}^{a/(q_V−1)}    (19)
where only one undetermined pure number (namely a(qV , D)) is now left.
It satisfies, as already mentioned, a(1, D) = 1 and a(2, D) = (D + 1)/2.
Although more general forms are possible, we shall adopt the simplest q_V-dependence, namely a linear interpolation, hence

a = 1 + [(D − 1)/2] (q_V − 1)    (∀q_V, ∀D)    (20)
Replacing this into Eq. (19) we obtain our final visiting distribution

g_{q_V}(Δx_t) = [(q_V − 1)/π]^{D/2} [Γ(1/(q_V−1) + (D−1)/2) / Γ(1/(q_V−1) − 1/2)] [T_{q_V}^V(t)]^{−D/(3−q_V)} / {1 + (q_V − 1)(Δx_t)²/[T_{q_V}^V(t)]^{2/(3−q_V)}}^{1/(q_V−1) + (D−1)/2}    (∀q_V, ∀D)    (21)

The second moment of this distribution diverges for q_V ≥ 5/3, and the
distribution becomes non-normalizable for q_V ≥ 3.
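For D = 1 and 1 < q_V < 3, Eq. (21) is, up to rescaling, a Student-t density with ν = (3 − q_V)/(q_V − 1) degrees of freedom, which incidentally gives a convenient way to draw the jumps as scale · standard_t(ν). This correspondence is our own observation, and the function names below are ours:

```python
import math
import numpy as np

def g_qv_1d(dx, temp, qv):
    """Visiting distribution of Eq. (21) for D = 1 (valid for 1 < qv < 3)."""
    s2 = temp ** (2.0 / (3.0 - qv))                  # [T_qV(t)]^{2/(3-qv)}
    pref = (math.sqrt((qv - 1.0) / math.pi)
            * math.gamma(1.0 / (qv - 1.0))
            / math.gamma(1.0 / (qv - 1.0) - 0.5)) / math.sqrt(s2)
    return pref * (1.0 + (qv - 1.0) * dx ** 2 / s2) ** (-1.0 / (qv - 1.0))

def t_params(temp, qv):
    """Equivalent Student-t degrees of freedom and scale for D = 1 sampling."""
    nu = (3.0 - qv) / (qv - 1.0)
    scale = temp ** (1.0 / (3.0 - qv)) / math.sqrt(3.0 - qv)
    return nu, scale
```

For qv = 2 this recovers the D = 1 Cauchy distribution of Eq. (10), and a jump can then be drawn as scale * rng.standard_t(nu) with a NumPy generator.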
There is no particular reason for T_{q_V}^V to be equal to T_{q_A}^A but, following
[5], we shall use here the simplest choice, i.e., T_{q_A}^A(t) = T_{q_V}^V(t), ∀t (given by
Eq. (14)). We can now summarize the whole algorithm for finding a global
minimum of a given energy/cost function E(~x):

(i) Fix (q_A, q_V). Start, at t = 1, with an arbitrary value ~x_1 and a high
enough value for T_{q_V}(1) (say, about 2 times the height of the highest
expected “hill” of E(~x)), and calculate E(~x_1);
(ii) Then randomly generate ~xt+1 from ~xt according to Eq. (21) to determine
the size of the jump ∆xt , and isotropically determine its direction;
(iii) Then calculate E(~xt+1 ):
If E(~xt+1 ) < E(~xt ), replace ~xt by ~xt+1 ;
If E(~xt+1 ) ≥ E(~xt ), draw a random number r ∈ [0, 1]: if r > PqA given
by Eq. (5) with TqAA (t) = TqVV (t), retain ~xt ; otherwise, replace ~xt by
~xt+1 ;
(iv) Calculate the new temperature TqVV using Eq. (14) and go back to (ii)
until the minimum of E(~x) is reached within the desired precision.
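Putting steps (i)-(iv) together for D = 1, here is a minimal self-contained sketch (our own illustrative code, not the authors' implementation; drawing the Eq. (21) jump through its Student-t equivalent, and all function names, are assumptions of ours):

```python
import numpy as np

def gsa_1d(energy, x1, qv=2.5, qa=1.1, t1=100.0, n_steps=20000, seed=0):
    """One GSA run for a one-dimensional energy/cost function (steps i-iv)."""
    rng = np.random.default_rng(seed)
    x, e = x1, energy(x1)
    x_best, e_best = x, e
    nu = (3.0 - qv) / (qv - 1.0)         # Student-t dof equivalent to Eq. (21), D = 1
    for t in range(1, n_steps + 1):
        # step (iv): visiting temperature from Eq. (14), with T_qA = T_qV
        temp = t1 * (2.0 ** (qv - 1.0) - 1.0) / ((1.0 + t) ** (qv - 1.0) - 1.0)
        # step (ii): jump drawn from the visiting distribution, Eq. (21)
        scale = temp ** (1.0 / (3.0 - qv)) / np.sqrt(3.0 - qv)
        x_new = x + scale * rng.standard_t(nu)
        de = energy(x_new) - e
        # step (iii): downhill always accepted; uphill with the Eq. (5) probability
        if de < 0.0:
            accept = True
        else:
            base = 1.0 + (qa - 1.0) * de / temp
            accept = base > 0.0 and rng.random() <= base ** (-1.0 / (qa - 1.0))
        if accept:
            x, e = x_new, e + de
            if e < e_best:
                x_best, e_best = x, e
    return x_best, e_best
```

Run on the quartic energy of Section 3 with x1 = 2 and T_{q_V}(1) = 100 as in the paper, such a run typically ends near the global minimum at x ≈ −2.9035.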
3 A SIMPLE D=1 ILLUSTRATION
In this Section we numerically treat a simple D = 1 example with a double
purpose: on one hand to exhibit how the procedure works and, on the other,
to find for which pair (qV , qA ) the algorithm is the quickest. (We recall that
(qV , qA ) = (1, 1) corresponds to CSA and (2,1) to FSA).
We choose the same example treated in [5], namely

E(x) = x^4 − 16x^2 + 5x + E_0    (22)

where we have introduced the additive constant E_0 ≃ 78.3323 so that E(x) ≥
0, ∀x, thus satisfying the convention adopted below Eq. (5); see Fig. 1. As
initial conditions for all of our runs we used x1 = 2 and TqV = 100. In Fig.
2 we can see typical runs for qA = 1.1 and different values of qV . Clearly
the case qV = 2.5 is much faster and more precise than the classical and fast annealings (qV = 1.1 ≃ 1 and 2, respectively). To study the (qV , qA ) influence on
the quickness of the algorithm we adopted, once and for all, an arbitrary
convergence criterion. For each (qV , qA ) pair we evaluate the mean value
of xt in intervals of 100 time steps. Whenever two successive intervals presented mean values whose difference was smaller than a precision ε = 10^{−3}, we
stopped the run. We then evaluated the total iteration time τ and repeated
the whole annealing procedure 10 times. Finally, we compute the average
total calculation time < τ >. The (qV , qA ) dependence of the average < τ >
is presented in Fig. 3 for typical values of qA . Fig. 3 indicates that machines
with qA = 1.1 and qV = 2.9 are typically 5 times faster than the Cauchy machine [5], which is in turn about 5 times faster than the Boltzmann machine
[1, 3, 5]. We have done our simulations on a 486 DX microcomputer with
a clock of 50 MHz. In this machine each one of the 10 solutions demanded,
approximately, 1 minute and 20 seconds of CPU time for qV = 2, and only
20 seconds for qV = 2.5. Finally, in Fig. 4 we present the dependence of
< τ > on qA for qV = 2; for this case the Cauchy machine [5] performs
best. These results indicate that the quickest algorithm occurs for
qA = 1 and qV = 3.
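The stopping rule described above (means over successive 100-step intervals) can be sketched as follows; this is our reading of the criterion, with hypothetical names:

```python
def converged(xs, window=100, eps=1e-3):
    """True when the means of the last two successive windows differ by < eps."""
    if len(xs) < 2 * window:
        return False                     # not enough history yet
    m_prev = sum(xs[-2 * window:-window]) / window
    m_last = sum(xs[-window:]) / window
    return abs(m_last - m_prev) < eps
```

In a run, xs would be the sequence of visited positions x_t, checked every 100 iterations.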
4 CONCLUSION
Inspired by a recent generalization of Boltzmann-Gibbs statistical mechanics,
we have heuristically developed a generalized simulated annealing (characterized by the parameters (qV , qA )) which unifies the so-called classical (Boltzmann machine; qV = qA = 1) and fast (Cauchy machine; qV /2 = qA = 1)
ones. This computational method is based on stochastic dynamics (which
asymptotically becomes, as time runs to infinity, a gradient descent method),
and enables, with probability one, the identification of a global minimum of
any (sufficiently nonsingular) given energy/cost function which depends on a
continuous D-dimensional variable ~x. As the discrete time t increases, it
might happen that ~xt provisionally stabilizes at a given value before eventually
abandoning it and running towards the global minimum. This temporary residence
can be used, as bonus of the present method, to identify some of the local
minima. If sufficiently many computational runs are done by starting at random initial positions {~x1 }, this property could in principle be used to identify
all the local minima as well as all the global ones.
For simplicity, we have mainly discussed herein the restricted region qV ≥
1 and qA ≥ 1 (with E(~x) ≥ 0, ∀~x), and have identified the (qV , qA ) ≃ (2.9, 1)
machines as the best performing ones in practical terms. This algorithm
has been illustrated herein with a simple two-minima D = 1 energy function,
and has already been successfully used [10] for recovering the global energy
minima (with respect to the dihedral angle) of a variety of simple molecules
(e.g., CH3OH, H2O2, C2H6). It would be very interesting to test the present
generalized algorithm with many-body systems presenting a great number
of minima (spin-glass-like frustrated systems, the traveling salesman problem,
neural networks, complex economic systems).
We acknowledge N. Caticha for drawing our attention to Szu's algorithm [5], as well as K.C. Mundim and A.M.C. de Souza for very useful
discussions. At the time of submission of this paper we learned that
this algorithm (informally communicated by us to a few people) has already
been successfully implemented for the Traveling Salesman Problem [11, 12],
for fitting curves [13, 14], and for a problem of genetic mutations [15].
Caption for figures
Fig. 1: D = 1 energy function E(x) = x^4 − 16x^2 + 5x + E0 vs. x for E0 =
78.3323; global minimum: x∗1 = −2.90353 and E(x∗1 ) = 0; maximum:
x∗2 = 0.156731 and E(x∗2 ) = 78.7235; local minimum: x∗3 = 2.74680 and
E(x∗3 ) = 28.2735.
Fig. 2: Typical runs of the GSA algorithm xt vs. t (annealing time) for
initial conditions x1 = 2, TqV (1) = 100. All four runs correspond to
qA = 1.1; a) qV = 1.1, b) qV = 1.5, c) qV = 2, d) qV = 2.5. Notice the
different scales for the ordinates.
Fig. 3: Average total calculation time < τ > vs. qV for two typical values
of qA (△: qA = 1.5; ⊙: qA = 1.1).
Fig. 4: Average total calculation time < τ > vs. qA for qV = 2.
References
[1] S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, Science 220, 671 (13 May
1983).
[2] D. Ceperley and B. Alder, Science 231, 555 (7 February 1986).
[3] S. Geman and D. Geman, IEEE Trans. Patt. Anal. Mach. Int. PAMI-6
(No. 6), 721 (1984).
[4] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E.
Teller, J. Chem. Phys. 21, 1087 (1953).
[5] H. Szu and R. Hartley, Phys. Lett. A 122, 157 (1987).
[6] C. Tsallis, J. Stat. Phys. 52, 479 (1988).
[7] E.M.F. Curado and C. Tsallis, J. Phys. A 24, L69 (1991); Corrigenda:
J. Phys. A 24, 3187 (1991) and 25, 1019 (1992).
[8] C. Tsallis, “Extensive Versus Nonextensive Physics”, in “New Trends
in Magnetism, Magnetic Materials and Their Applications”, eds. J.L.
Morán-Lopez and J.M. Sánchez (Plenum Press, New York, 1994), 451.
[9] C. Tsallis, “Some Comments on Boltzmann-Gibbs Statistical Mechanics”, in “Chaos, Solitons and Fractals”, ed. G. Marshall (Pergamon
Press, Oxford, 1994), in press.
[10] K.C. Mundim and C. Tsallis, “Geometry Optimization and Conformational Analysis Through Generalized Simulated Annealing”, preprint
(1994).
[11] T.J.P. Penna, “The Traveling Salesman Problem and Tsallis Entropy”,
to appear in Phys. Rev. E, 1995.
[12] P. Serra, private communication (1994).
[13] T.J.P. Penna, “Fitting Curves by Simulated Annealing”, preprint (1994).
[14] T.J.P. Penna, P.M.C. de Oliveira, J.C. Sartorelli, W.M. Gonçalves and
R.D. Pinto, “Long Range Anticorrelations: Is the Healthy Heart a Leaky
Faucet?”, preprint (1994).
[15] C. Miron, “Optimisation par Recuit Simulé Généralisé”, Report, École
Normale Supérieure de Lyon, France (1994).