CMA-ES With MATLAB Code
Nikolaus Hansen
August 31, 2007
Contents
Nomenclature
0 Preliminaries
0.1 Eigendecomposition of a Positive Definite Matrix
0.2 The Multivariate Normal Distribution
0.3 Randomized Black Box Optimization
0.4 Hessian and Covariance Matrices
4 Step-Size Control
5 Discussion
Nomenclature
We adopt the usual vector notation, where bold letters, v, are column vectors, capital bold
letters, A, are matrices, and a transpose is denoted by $v^T$. A list of used abbreviations and
symbols is given in alphabetical order.
Abbreviations
CMA Covariance Matrix Adaptation
EMNA Estimation of Multivariate Normal Algorithm
ES Evolution Strategy
(µ/µ{I,W} , λ)-ES, Evolution Strategy with µ parents, with recombination of all µ parents,
either Intermediate or Weighted, and λ offspring.
RHS Right Hand Side.
Greek symbols
λ ≥ 2, population size, sample size, number of offspring, see (5).
µ ≤ λ parent number, number of selected search points in the population, see (6).
µcov , parameter for weighting between rank-one and rank-µ update, see (27).
$\mu_{\mathrm{eff}} = \left(\sum_{i=1}^{\mu} w_i^2\right)^{-1}$, the variance effective selection mass, see (8).
σ (g) ∈ R+ , step-size.
Latin symbols
B ∈ Rn×n , an orthogonal matrix. Columns of B are eigenvectors of C with unit length and
correspond to the diagonal elements of D.
C (g) ∈ Rn×n , covariance matrix at generation g.
cii , diagonal elements of C.
cc ≤ 1, learning rate for cumulation for the rank-one update of the covariance matrix, see
(22) and (42), and Table 1.
ccov ≤ 1, learning rate for the covariance matrix update, see (14), (26), (27), and (43), and
Table 1.
cσ < 1, learning rate for the cumulation for the step-size control, see (28) and (40), and
Table 1.
D ∈ Rn×n , a diagonal matrix. The diagonal elements of D are square roots of eigenvalues of
C and correspond to the respective columns of B.
$d_i > 0$, diagonal elements of diagonal matrix D, $d_i^2$ are eigenvalues of C.
dσ ≈ 1, damping parameter for step-size update, see (29), (34), and (41).
E Expectation value
$f : \mathbb{R}^n \to \mathbb{R},\ x \mapsto f(x)$, objective function (fitness function) to be minimized.
$f_{\mathrm{sphere}} : \mathbb{R}^n \to \mathbb{R},\ x \mapsto \|x\|^2 = x^T x = \sum_{i=1}^{n} x_i^2$.
g ∈ N0 , generation counter, iteration number.
I ∈ Rn×n , Identity matrix, unity matrix.
m(g) ∈ Rn , mean value of the search distribution at generation g.
n ∈ N, search space dimension, see f .
N (0, I), multivariate normal distribution with zero mean and unity covariance matrix. A
vector distributed according to N (0, I) has independent, (0, 1)-normally distributed
components.
N (m, C) ∼ m + N (0, C), multivariate normal distribution with mean m ∈ Rn and
covariance matrix C ∈ Rn×n . The matrix C is symmetric and positive definite.
p ∈ Rn , evolution path, a sequence of successive (normalized) steps, the strategy takes over
a number of generations.
wi , where i = 1, . . . , µ, recombination weights, see (6).
$x_k^{(g+1)} \in \mathbb{R}^n$, k-th offspring/individual from generation g + 1. We also refer to $x^{(g+1)}$ as
search point, or object parameters/variables; commonly used synonyms are candidate
solution, or design variables.
$x_{i:\lambda}^{(g+1)}$, i-th best individual out of $x_1^{(g+1)}, \dots, x_\lambda^{(g+1)}$, see (5). The index $i:\lambda$ denotes the
index of the i-th ranked individual and $f(x_{1:\lambda}^{(g+1)}) \le f(x_{2:\lambda}^{(g+1)}) \le \dots \le f(x_{\lambda:\lambda}^{(g+1)})$,
where f is the objective function to be minimized.
$y_k^{(g+1)} = (x_k^{(g+1)} - m^{(g)})/\sigma^{(g)}$, corresponding to $x_k = m + \sigma y_k$.
0 Preliminaries
This tutorial introduces the CMA Evolution Strategy (ES), where CMA stands for Covariance
Matrix Adaptation. The CMA-ES is a stochastic method for real-parameter (continuous do-
main) optimization of non-linear, non-convex functions (see also Section 0.3 below).1 We try
to motivate and derive the algorithm from intuitive concepts and from requirements of non-
linear, non-convex search in continuous domain. In order to refer to the described algorithm,
also cite [9]. For finding a concise algorithm description go directly to Appendix A. The
respective Matlab source code is given in Appendix B.
Before we start to introduce the algorithm in Sect. 1, a few required fundamentals are
summed up.
1 While CMA variants for multi-objective optimization and elitistic variants have been proposed, this tutorial is
solely dedicated to single objective optimization and to non-elitistic truncation selection, also referred to as comma-
selection.
0.1 Eigendecomposition of a Positive Definite Matrix
A symmetric, positive definite matrix, $C \in \mathbb{R}^{n \times n}$, is characterized in that for all $x \in \mathbb{R}^n \setminus \{0\}$
holds $x^T C x > 0$. The matrix C has an orthonormal basis of eigenvectors, $B = [b_1, \dots, b_n]$,
with corresponding eigenvalues, $d_1^2, \dots, d_n^2 > 0$.
That means for each $b_i$ holds
$$ C b_i = d_i^2 b_i . \tag{1} $$
The important message from (1) is that eigenvectors are not rotated by C. This feature
uniquely distinguishes eigenvectors.
Because we assume the orthogonal eigenvectors to be of unit length,
$b_i^T b_j = \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$, and $B^T B = I$ (obviously this means
$B^{-1} = B^T$, and it follows $B B^T = I$). A basis of eigenvectors is practical, because
for any $v \in \mathbb{R}^n$ we can find coefficients $\alpha_i$, such that $v = \sum_i \alpha_i b_i$, and then we have
$C v = \sum_i d_i^2 \alpha_i b_i$.
The matrix decomposition (2), $C = B D^2 B^T$, is unique, apart from signs of columns of B and permutations
of columns in B and $D^2$ respectively, given all eigenvalues are different (see footnote 2).
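Given the eigendecomposition (2), the inverse $C^{-1}$ can be computed via
$$
\begin{aligned}
C^{-1} &= \left(B D^2 B^T\right)^{-1} \\
       &= \left(B^T\right)^{-1} D^{-2} B^{-1} \\
       &= B D^{-2} B^T \\
       &= B \, \mathrm{diag}\!\left(\frac{1}{d_1^2}, \dots, \frac{1}{d_n^2}\right) B^T .
\end{aligned}
$$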
From (2) we naturally define
$$ C^{1/2} = B D B^T \tag{3} $$
and therefore
$$ C^{-1/2} = B D^{-1} B^T = B \, \mathrm{diag}\!\left(\frac{1}{d_1}, \dots, \frac{1}{d_n}\right) B^T . $$
2 Given m eigenvalues are equal, any orthonormal basis of their m-dimensional subspace can be used as column
[Figure 1, panels from left to right: $\mathcal{N}(0, \sigma^2 I)$, $\mathcal{N}(0, D^2)$, $\mathcal{N}(0, C)$]
Figure 1: Six ellipsoids, depicting one-σ lines of equal density of six different normal distributions,
where $\sigma \in \mathbb{R}_+$, D is a diagonal matrix, and C is a positive definite full covariance
matrix. Thin lines depict exemplary objective function contour lines
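$$
\begin{aligned}
\mathcal{N}(m, C) &\sim m + \mathcal{N}(0, C) \\
                  &\sim m + C^{1/2} \mathcal{N}(0, I) \\
                  &\sim m + B D \underbrace{B^T \mathcal{N}(0, I)}_{\sim\, \mathcal{N}(0, I)} \\
                  &\sim m + B \underbrace{D \, \mathcal{N}(0, I)}_{\sim\, \mathcal{N}(0, D^2)} ,
\end{aligned}
\tag{4}
$$
where “∼” denotes equality in distribution, and $C^{1/2} = B D B^T$. The last row can be well
interpreted, from right to left
$\mathcal{N}(0, I)$ produces a spherical (isotropic) distribution as in Fig. 1, left.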
Initialize distribution parameters θ^(0)
For generation g = 0, 1, 2, . . .
    Sample λ independent points from distribution P(x | θ^(g)) → x_1, . . . , x_λ
    Evaluate the sample x_1, . . . , x_λ on f
    Update parameters θ^(g+1) = F_θ(θ^(g), (x_1, f(x_1)), . . . , (x_λ, f(x_λ)))
    break, if termination criterion met
Equation (4) is useful to compute N (m, C) distributed vectors, because N (0, I) is a vector
of independent (0, 1)-normally distributed numbers that can easily be realized on a computer.
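The sampling recipe in (4) can be sketched in a few lines. The following is a minimal two-dimensional illustration in Python using only the standard library, not the tutorial's MATLAB code: the function name `sample_mvn_2d`, the rotation angle used to build an orthogonal B, and the numerical values are assumptions chosen for the example.

```python
import math
import random

def sample_mvn_2d(m, angle, d, rng):
    """Draw one sample from N(m, C) with C = B D^2 B^T, following (4).

    B is taken to be the rotation by `angle` (an orthogonal matrix) and
    D = diag(d); both are illustrative choices for this sketch.
    """
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]    # z ~ N(0, I)
    s = [d[0] * z[0], d[1] * z[1]]                     # D z ~ N(0, D^2)
    c, sn = math.cos(angle), math.sin(angle)
    y = [c * s[0] - sn * s[1], sn * s[0] + c * s[1]]   # B D z ~ N(0, C)
    return [m[0] + y[0], m[1] + y[1]]                  # m + B D z ~ N(m, C)

rng = random.Random(1)
m = [1.0, -2.0]
samples = [sample_mvn_2d(m, math.pi / 6, [3.0, 0.5], rng) for _ in range(20000)]
mean = [sum(x[k] for x in samples) / len(samples) for k in range(2)]
```

With this many samples the empirical `mean` should lie close to `m`. Only independent standard normal numbers are drawn; scaling by D and rotating by B do the rest.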
$$ f : \mathbb{R}^n \to \mathbb{R}, \qquad x \mapsto f(x) . $$
The objective is to find one or more search points (candidate solutions), x ∈ Rn , with a func-
tion value, f (x), as small as possible. We do not state the objective of searching for a global
optimum, as this is often neither feasible nor relevant in practice. Black box optimization
refers to the situation, where function values of evaluated search points are the only accessible
information on f (footnote 3). The search points to be evaluated can be freely chosen. We define the
search costs as the number of executed function evaluations, in other words the amount of
information we needed to acquire from f (footnote 4). Any performance measure must consider the search
costs together with the achieved objective function value (footnote 5).
A randomized black box search algorithm is outlined in Fig. 2. In the CMA Evolution
Strategy the search distribution, P , is a multivariate normal distribution. Given all variances
and covariances, the normal distribution has the largest entropy of all distributions in Rn .
3 Knowledge about the underlying optimization problem might well enter the composition of f and the chosen
problem encoding.
4 Also f is sometimes denoted as cost function, but it should not be confused with the search costs.
5 A performance measure can be obtained from a number of trials as, for example, the mean number of function
evaluations to reach a given function value, or the median best function value obtained after a given number of
function evaluations.
Furthermore, coordinate directions are not distinguished in any way. Both properties make the normal
distribution a particularly attractive candidate for randomized search.
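The generic loop of Fig. 2 can be written as a small template. The sketch below is an illustration in Python under stated assumptions, not the CMA-ES itself: the names `black_box_search`, `sample`, and `update` are hypothetical, and the instance plugged in uses an isotropic normal distribution with a fixed standard deviation whose only parameter θ is the mean.

```python
import random

def black_box_search(f, theta, sample, update, lam, generations, rng):
    """Generic loop of Fig. 2: sample lam points, evaluate on f, update theta."""
    for _ in range(generations):
        points = [sample(theta, rng) for _ in range(lam)]
        evaluated = [(x, f(x)) for x in points]
        theta = update(theta, evaluated)
    return theta

# Hypothetical instance: theta is a mean vector; SIGMA is never adapted.
SIGMA, MU = 0.3, 3

def sample(m, rng):
    return [mi + SIGMA * rng.gauss(0.0, 1.0) for mi in m]

def update(m, evaluated):
    # Keep the MU best points (smallest f) and average them.
    best = sorted(evaluated, key=lambda pair: pair[1])[:MU]
    return [sum(x[k] for x, _ in best) / MU for k in range(len(m))]

def f_sphere(x):
    return sum(xi * xi for xi in x)

start = [3.0, 3.0]
result = black_box_search(f_sphere, start, sample, update,
                          lam=10, generations=60, rng=random.Random(0))
```

Only function values enter `update`, which is what makes the loop a black box method. Because `SIGMA` is fixed here, the mean stalls at a distance of roughly that scale from the optimum.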
Randomized search algorithms are regarded to be robust in a rugged search landscape,
which can comprise discontinuities, (sharp) ridges, or local optima. The covariance matrix
adaptation (CMA) in particular is designed to tackle, additionally, ill-conditioned and non-
separable6 problems.
6 An n-dimensional separable problem can be solved by solving n 1-dimensional problems separately, which is a
where
∼ denotes the same distribution on the left and right side.
$\mathcal{N}(0, C^{(g)})$ is a multivariate normal distribution with zero mean and covariance
matrix $C^{(g)}$, see Sect. 0.2. It holds $m^{(g)} + \sigma^{(g)} \mathcal{N}(0, C^{(g)}) \sim \mathcal{N}\!\left(m^{(g)}, (\sigma^{(g)})^2 C^{(g)}\right)$.
$x_k^{(g+1)} \in \mathbb{R}^n$, k-th offspring (individual, search point) from generation g + 1.
$m^{(g)} \in \mathbb{R}^n$, mean value of the search distribution at generation g.
$\sigma^{(g)} \in \mathbb{R}_+$, “overall” standard deviation, step-size, at generation g.
$C^{(g)} \in \mathbb{R}^{n \times n}$, covariance matrix at generation g. Up to the scalar factor ${\sigma^{(g)}}^2$, $C^{(g)}$ is the
covariance matrix of the search distribution.
λ ≥ 2, population size, sample size, number of offspring.
To define the complete iteration step, the remaining question is, how to calculate m(g+1) ,
C (g+1) , and σ (g+1) for the next generation g + 1. The next three sections will answer
these questions, respectively. An algorithm summary with all parameter settings and MAT-
LAB source code is given in Appendix A and B, respectively.
$$ m^{(g+1)} = \sum_{i=1}^{\mu} w_i \, x_{i:\lambda}^{(g+1)} \tag{6} $$
$$ \sum_{i=1}^{\mu} w_i = 1, \qquad w_1 \ge w_2 \ge \dots \ge w_\mu > 0 \tag{7} $$
where
µ ≤ λ is the parent population size, i.e. the number of selected points.
$w_{i=1\dots\mu} \in \mathbb{R}_+$, positive weight coefficients for recombination. For $w_{i=1\dots\mu} = 1/\mu$, Equation
(6) calculates the mean value of µ selected points.
$x_{i:\lambda}^{(g+1)}$, i-th best individual out of $x_1^{(g+1)}, \dots, x_\lambda^{(g+1)}$ from (5). The index $i:\lambda$ denotes the
index of the i-th ranked individual and $f(x_{1:\lambda}^{(g+1)}) \le f(x_{2:\lambda}^{(g+1)}) \le \dots \le f(x_{\lambda:\lambda}^{(g+1)})$,
where f is the objective function to be minimized.
Equation (6) implements truncation selection by choosing µ < λ out of λ offspring points.
Assigning different weights wi must also be interpreted as a selection mechanism. Equation
(6) implements weighted intermediate recombination by taking µ > 1 individuals into account
for a weighted average.
The measure
$$ \mu_{\mathrm{eff}} = \left( \sum_{i=1}^{\mu} w_i^2 \right)^{-1} \tag{8} $$
will be repeatedly used in the following and can be paraphrased as variance effective selection
mass. From the definition of wi in (7) we derive 1 ≤ µeff ≤ µ, and µeff = µ for equal
recombination weights, i.e. wi = 1/µ for all i = 1 . . . µ. Usually, µeff ≈ λ/4 indicates a
reasonable setting of wi . A typical setting could be wi ∝ µ − i + 1, and µ ≈ λ/2.
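As a small worked example of (7) and (8), the following Python sketch builds the weights $w_i \propto \mu - i + 1$ mentioned above and evaluates $\mu_{\mathrm{eff}}$. The function names and the choice λ = 10 are assumptions made for illustration.

```python
def recombination_weights(mu):
    """Weights w_i proportional to mu - i + 1 for i = 1..mu, normalized to sum to one."""
    raw = [mu - i + 1 for i in range(1, mu + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def mu_eff(weights):
    """Variance effective selection mass, Equation (8)."""
    return 1.0 / sum(w * w for w in weights)

lam = 10
mu = lam // 2                               # mu close to lambda / 2
w = recombination_weights(mu)               # decreasing, positive, sums to one
mu_eff_weighted = mu_eff(w)                 # lies between 1 and mu
mu_eff_equal = mu_eff([1.0 / mu] * mu)      # equals mu for equal weights
```

For µ = 5 the raw weights are 5, 4, 3, 2, 1, so $\mu_{\mathrm{eff}} = 225/55 \approx 4.09$, which lies within the bounds $1 \le \mu_{\mathrm{eff}} \le \mu$ derived from (7).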
The empirical covariance matrix $C_{\mathrm{emp}}^{(g+1)}$ is an unbiased estimator of $C^{(g)}$: assuming the
$x_i^{(g+1)}$, i = 1 . . . λ, to be random variables (rather than a realized sample), we have that
$\mathrm{E}\!\left[ C_{\mathrm{emp}}^{(g+1)} \,\middle|\, C^{(g)} \right] = C^{(g)}$. Consider now a slightly different approach to get an estimator for
$C^{(g)}$.
$$ C_\lambda^{(g+1)} = \frac{1}{\lambda} \sum_{i=1}^{\lambda} \left( x_i^{(g+1)} - m^{(g)} \right) \left( x_i^{(g+1)} - m^{(g)} \right)^T \tag{10} $$
11 To re-estimate the covariance matrix, C, from a N (0, I) distributed sample such that cond(C) < 10 a sample
Also the matrix $C_\lambda^{(g+1)}$ is an unbiased estimator of $C^{(g)}$. The remarkable difference between
(9) and (10) is the reference mean value. For $C_{\mathrm{emp}}^{(g+1)}$ it is the mean of the actually realized
sample. For $C_\lambda^{(g+1)}$ it is the true mean value, $m^{(g)}$, of the sampled distribution (see (5)).
Therefore, the estimators $C_{\mathrm{emp}}^{(g+1)}$ and $C_\lambda^{(g+1)}$ can be interpreted differently: while $C_{\mathrm{emp}}^{(g+1)}$
estimates the distribution variance within the sampled points, $C_\lambda^{(g+1)}$ estimates variances of
sampled steps, $x_i^{(g+1)} - m^{(g)}$.
A minor difference between (9) and (10) is the different normalizations $\frac{1}{\lambda - 1}$ versus $\frac{1}{\lambda}$,
necessary to get an unbiased estimator in both cases. In (9) one degree of freedom is
already taken by the inner summand. In order to get a maximum likelihood estimator in
both cases $\frac{1}{\lambda}$ must be used.
Equation (10) re-estimates the original covariance matrix. To “estimate” a “better” co-
variance matrix (10) is modified and the same, weighted selection mechanism as in (6) is used
[9].
$$ C_\mu^{(g+1)} = \sum_{i=1}^{\mu} w_i \left( x_{i:\lambda}^{(g+1)} - m^{(g)} \right) \left( x_{i:\lambda}^{(g+1)} - m^{(g)} \right)^T \tag{11} $$
i=1
The matrix $C_\mu^{(g+1)}$ is an estimator for the distribution of selected steps, just as $C_\lambda^{(g+1)}$ is an
estimator of the original distribution of steps before selection. Sampling from $C_\mu^{(g+1)}$ tends to
reproduce selected, i.e. successful steps, giving a justification for what a “better” covariance
matrix means.
Following [8], we compare (11) with the Estimation of Multivariate Normal Algorithm
EMNAglobal [15, 16]. The covariance matrix in EMNAglobal reads, similar to (9),
$$ C_{\mathrm{EMNA}_{global}}^{(g+1)} = \frac{1}{\mu} \sum_{i=1}^{\mu} \left( x_{i:\lambda}^{(g+1)} - m^{(g+1)} \right) \left( x_{i:\lambda}^{(g+1)} - m^{(g+1)} \right)^T , \tag{12} $$
where $m^{(g+1)} = \frac{1}{\mu} \sum_{i=1}^{\mu} x_{i:\lambda}^{(g+1)}$. Similarly, applying the so-called Cross-Entropy method
to continuous domain optimization [18] yields the covariance matrix $\frac{\mu}{\mu - 1} C_{\mathrm{EMNA}_{global}}^{(g+1)}$,
i.e. the unbiased empirical covariance matrix of the µ best points. In both cases the subtle,
but most important difference to (11) is, again, the choice of the reference mean value (footnote 12).
Equation (12) estimates the variance within the selected population while (11) estimates
selected steps. Equation (12) always yields smaller variances than (11), because its reference
mean value is the minimizer for the variances. Moreover, in most conceivable
selection situations (12) decreases the variances compared to $C^{(g)}$.
Figure 3 demonstrates the estimation results on a linear objective function for λ = 150,
µ = 50, and wi = 1/µ. Equation (11) increases the expected variance in direction of
the gradient (where the selection takes place, here the diagonal), given ordinary settings
for parent number µ and recombination weights w1 , . . . , wµ . Equation (12) always de-
creases the variance in gradient direction! Therefore, (12) is highly susceptible to prema-
ture convergence, in particular with small parent populations, where the population cannot
be expected to bracket the optimum at any time. However, for large values of µ in large
populations with large initial variances, the impact of the different reference mean value
can become marginal.
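The effect of the reference mean value can be checked numerically. The sketch below, in Python with the standard library only, mimics the setting of Figure 3 (λ = 150, µ = 50, $w_i = 1/\mu$, linear objective) and compares the variance along the gradient direction estimated by (11) and by (12). The helper `variance_along` and the random seed are assumptions of this example.

```python
import math
import random

def variance_along(points, ref, direction):
    """Mean squared projection of (x - ref) onto a unit direction."""
    return sum(((x[0] - ref[0]) * direction[0] + (x[1] - ref[1]) * direction[1]) ** 2
               for x in points) / len(points)

rng = random.Random(3)
lam, mu = 150, 50
m_old = [0.0, 0.0]                                    # true mean of the sample
sample = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(lam)]

def f_linear(x):
    return -(x[0] + x[1])                             # to be minimized

selected = sorted(sample, key=f_linear)[:mu]          # the mu best points
m_new = [sum(x[k] for x in selected) / mu for k in range(2)]

u = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]      # gradient direction
var_c_mu = variance_along(selected, m_old, u)         # u^T C_mu u, Equation (11)
var_emna = variance_along(selected, m_new, u)         # u^T C_EMNA u, Equation (12)
shift = (m_new[0] - m_old[0]) * u[0] + (m_new[1] - m_old[1]) * u[1]
```

With equal weights, `var_c_mu` equals `var_emna + shift**2` exactly, so (12) can never report a larger variance than (11). The gap is the squared displacement of the mean in that direction.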
12 Taking a weighted sum, $\sum_{i=1}^{\mu} w_i \dots$, instead of the mean, $\frac{1}{\mu} \sum_{i=1}^{\mu} \dots$, is an appealing, but less important,
difference.
[Figure 3, panel rows: $C_\mu^{(g+1)}$ (above), $C_{\mathrm{EMNA}_{global}}^{(g+1)}$ (below)]
Figure 3: Estimation of the covariance matrix on $f_{\mathrm{linear}}(x) = -\sum_{i=1}^{2} x_i$ to be minimized.
Contour lines (dotted) indicate that the strategy should move toward the upper right corner.
Above: estimation of $C_\mu^{(g+1)}$ according to (11), where $w_i = 1/\mu$. Below: estimation of
$C_{\mathrm{EMNA}_{global}}^{(g+1)}$ according to (12). Left: sample of λ = 150 $\mathcal{N}(0, I)$ distributed points. Middle:
the µ = 50 selected points (dots) determining the entries for the estimation equation (solid
straight lines). Right: search distribution of the next generation (solid ellipsoids). Given
$w_i = 1/\mu$, estimation via $C_\mu^{(g+1)}$ increases the expected variance in gradient direction for all
µ < λ/2, while estimation via $C_{\mathrm{EMNA}_{global}}^{(g+1)}$ decreases this variance for any µ < λ
In order to ensure with (5), (6), and (11), that $C_\mu^{(g+1)}$ is a reliable estimator, the variance
effective selection mass $\mu_{\mathrm{eff}}$ (cf. (8)) must be large enough: getting condition numbers (cf.
Sect. 0.4) smaller than ten for $C_\mu^{(g)}$ on $f_{\mathrm{sphere}}(x) = \sum_{i=1}^{n} x_i^2$, requires $\mu_{\mathrm{eff}} \approx 10n$. The next
step is to circumvent this restriction on $\mu_{\mathrm{eff}}$.
3.2 Rank-µ-Update
To achieve fast search (opposite to more robust or more global search), e.g. competitive per-
formance on fsphere , the population size λ must be small. Because µeff ≈ λ/4 also µeff must
be small and we may assume, e.g., µeff ≤ 1 + ln n. Then, it is not possible to get a reliable
estimator for a good covariance matrix from (11). As a remedy, information from previous
generations is used additionally. For example, after a sufficient number of generations, the
mean of the estimated covariance matrices from all generations,
$$ C^{(g+1)} = \frac{1}{g+1} \sum_{i=0}^{g} \frac{1}{{\sigma^{(i)}}^2} C_\mu^{(i+1)} \tag{13} $$
becomes a reliable estimator for the selected steps. To make $C_\mu^{(g)}$ from different generations
comparable, the different $\sigma^{(i)}$ are incorporated. (Assuming $\sigma^{(i)} = 1$, (13) resembles the
covariance matrix from the Estimation of Multivariate Normal Algorithm $\mathrm{EMNA}_i$ [16].)
In (13), all generation steps have the same weight. To assign recent generations a higher
weight, exponential smoothing is introduced. Choosing C (0) = I to be the unity matrix and a
learning rate 0 < ccov ≤ 1, then C (g+1) reads
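$$
\begin{aligned}
C^{(g+1)} &= (1 - c_{\mathrm{cov}}) C^{(g)} + c_{\mathrm{cov}} \frac{1}{{\sigma^{(g)}}^2} C_\mu^{(g+1)} \\
          &= (1 - c_{\mathrm{cov}}) C^{(g)} + c_{\mathrm{cov}} \sum_{i=1}^{\mu} w_i \, y_{i:\lambda}^{(g+1)} {y_{i:\lambda}^{(g+1)}}^T
\end{aligned}
\tag{14}
$$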
where
$c_{\mathrm{cov}} \le 1$, learning rate for updating the covariance matrix. For $c_{\mathrm{cov}} = 1$, no prior information
is retained and $C^{(g+1)} = \frac{1}{{\sigma^{(g)}}^2} C_\mu^{(g+1)}$. For $c_{\mathrm{cov}} = 0$, no learning takes place and
$C^{(g+1)} = C^{(0)}$. Here, $c_{\mathrm{cov}} \approx \min(1, \mu_{\mathrm{eff}}/n^2)$ is a reasonable choice.
$y_{i:\lambda}^{(g+1)} = (x_{i:\lambda}^{(g+1)} - m^{(g)})/\sigma^{(g)}$.
This covariance matrix update is called rank-µ-update [10], because the sum of outer products
in (14) is of rank min(µ, n) (with probability one). Note that this sum can even consist of a
single term, if µ = 1.
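A direct transcription of the second line of (14) may help to see what the update does to the matrix entries. The following Python sketch uses plain nested lists; the function name `rank_mu_update` and the example values are assumptions for illustration only.

```python
def rank_mu_update(C, ys, weights, c_cov):
    """Rank-mu update (14): C <- (1 - c_cov) C + c_cov * sum_i w_i y_i y_i^T.

    C is an n x n matrix as nested lists, ys holds the mu selected steps
    y_{i:lambda}, and weights holds the recombination weights w_i.
    """
    n = len(C)
    new_C = [[(1.0 - c_cov) * C[r][c] for c in range(n)] for r in range(n)]
    for w, y in zip(weights, ys):
        for r in range(n):
            for c in range(n):
                new_C[r][c] += c_cov * w * y[r] * y[c]   # weighted outer product
    return new_C

identity = [[1.0, 0.0], [0.0, 1.0]]
steps = [[1.0, 0.0], [0.0, 2.0]]        # two selected steps (illustrative values)
updated = rank_mu_update(identity, steps, [0.5, 0.5], c_cov=0.5)
unchanged = rank_mu_update(identity, steps, [0.5, 0.5], c_cov=0.0)
```

In this example the long step along the second axis raises the corresponding variance to 1.5, while the first axis shrinks to 0.75. With `c_cov=0.0` the matrix is returned unchanged, as stated above.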
The number 1/ccov is the backward time horizon that contains roughly 63% of the overall
weight.
Because (14) expands to the weighted sum
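$$ C^{(g+1)} = (1 - c_{\mathrm{cov}})^{g+1} C^{(0)} + c_{\mathrm{cov}} \sum_{i=0}^{g} (1 - c_{\mathrm{cov}})^{g-i} \frac{1}{{\sigma^{(i)}}^2} C_\mu^{(i+1)} , \tag{15} $$
the backward time horizon, ∆g, where about 63% of the overall weight is summed up, is
defined by
$$ c_{\mathrm{cov}} \sum_{i=g+1-\Delta g}^{g} (1 - c_{\mathrm{cov}})^{g-i} \approx 0.63 \approx 1 - \frac{1}{e} . \tag{16} $$
Resolving the sum yields
$$ (1 - c_{\mathrm{cov}})^{\Delta g} \approx \frac{1}{e} , \tag{17} $$
and resolving for ∆g, using the Taylor approximation for ln, yields
$$ \Delta g \approx \frac{1}{c_{\mathrm{cov}}} . \tag{18} $$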
That is, approximately 37% of the information in C (g+1) is older than 1/ccov generations,
and, according to (17), the original weight is reduced by a factor of 0.37 after approxi-
mately 1/ccov generations.13
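The approximation in (16) to (18) is easy to check numerically. The short Python sketch below sums the weights of the most recent $\Delta g = 1/c_{\mathrm{cov}}$ generations for a few learning rates. The function name and the chosen rates are assumptions of this example.

```python
import math

def weight_within_horizon(c_cov, delta_g):
    """Left-hand side of (16): weights c_cov (1 - c_cov)^k of the last delta_g generations."""
    return sum(c_cov * (1.0 - c_cov) ** k for k in range(delta_g))

horizons = {}
for c_cov in (0.2, 0.05, 0.01):
    delta_g = round(1.0 / c_cov)                      # Equation (18)
    horizons[c_cov] = (delta_g, weight_within_horizon(c_cov, delta_g))

target = 1.0 - 1.0 / math.e                           # about 0.63
```

The summed weight equals $1 - (1 - c_{\mathrm{cov}})^{\Delta g}$ and approaches `target` as $c_{\mathrm{cov}}$ becomes small, which is the regime in which the Taylor approximation of ln is accurate.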
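13 This can be shown more easily, because $(1 - c_{\mathrm{cov}})^g = \exp\!\left(\ln (1 - c_{\mathrm{cov}})^g\right) = \exp(g \ln(1 - c_{\mathrm{cov}})) \approx \exp(-g c_{\mathrm{cov}})$
for small $c_{\mathrm{cov}}$, and for $g \approx 1/c_{\mathrm{cov}}$ we get immediately $(1 - c_{\mathrm{cov}})^g \approx \exp(-1)$.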
The choice of ccov is crucial. Small values lead to slow learning, too large values lead to
a failure, because the covariance matrix degenerates. Fortunately, a good setting seems to
be largely independent of the function to be optimized (footnote 14). A first order approximation for a
good choice is $c_{\mathrm{cov}} \approx \mu_{\mathrm{eff}}/n^2$. Therefore, the characteristic time horizon for (14) is roughly
$n^2/\mu_{\mathrm{eff}}$.
Even for the learning rate ccov = 1, adapting the covariance matrix cannot be accom-
plished within one generation. The effect of the original sample distribution does not vanish
until a sufficient number of generations. Assuming fixed search costs (number of function
evaluations), a small population size λ allows a larger number of generations and therefore
usually leads to a faster adaptation of the covariance matrix.
3.3 Rank-One-Update
In Section 3.1 we estimated the complete covariance matrix from scratch, using all selected
steps from a single generation. We now take precisely the opposite viewpoint. We will re-
peatedly update the covariance matrix in the generation sequence using a single selected step
only. First, this perspective will give a justification of the adaptation rule (14). Second, we
will introduce the so-called evolution path that is finally used for a rank-one update of the
covariance matrix.
[Figure 4, panels from left to right: $\mathcal{N}(0, C^{(0)})$, $\mathcal{N}(0, C^{(1)})$, $\mathcal{N}(0, C^{(2)})$]
Figure 4: Change of the distribution according to the covariance matrix update (20). Left:
vectors $e_1$ and $e_2$, and $C^{(0)} = I = e_1 e_1^T + e_2 e_2^T$. Middle: vectors $0.91\, e_1$, $0.91\, e_2$, and
$0.41\, y_1$ (the coefficients deduce from $c_{\mathrm{cov}} = 0.17$), and $C^{(1)} = (1 - c_{\mathrm{cov}}) I + c_{\mathrm{cov}}\, y_1 y_1^T$,
where $y_1 = (-0.59, -2.2)^T$. The distribution ellipsoid is elongated into the direction of $y_1$, and
therefore increases the likelihood of $y_1$. Right: $C^{(2)} = (1 - c_{\mathrm{cov}}) C^{(1)} + c_{\mathrm{cov}}\, y_2 y_2^T$, where
$y_2 = (0.97, 1.5)^T$.
Considering (19) and a slight simplification of (14), we try to gain insight into the adaptation
rule for the covariance matrix. Let the sum in (14) consist of a single summand only (e.g.
µ = 1), and let $y_{g+1} = \frac{x_{1:\lambda}^{(g+1)} - m^{(g)}}{\sigma^{(g)}}$. Then, the rank-one update for the covariance matrix
reads
$$ C^{(g+1)} = (1 - c_{\mathrm{cov}}) C^{(g)} + c_{\mathrm{cov}} \, y_{g+1} y_{g+1}^T . \tag{20} $$
The right summand is of rank one and adds the maximum likelihood term for y g+1 into the
covariance matrix C (g) . Therefore the probability to generate y g+1 in the next generation
increases.
An example of the first two iteration steps of (20) is shown in Figure 4. The distribution
N(0, C (1) ) tends to reproduce y 1 with a larger probability than the initial distribution N(0, I);
the distribution N(0, C (2) ) tends to reproduce y 2 with a larger probability than N(0, C (1) ),
and so forth. When y 1 , . . . , y g denote the formerly selected, favorable steps, N(0, C (g) )
tends to reproduce these steps. The process leads to an alignment of the search distribution
N(0, C (g) ) to the distribution of the selected steps. If both distributions become alike, as
under random selection, in expectation no further change of the covariance matrix takes place
[6].
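The numbers quoted in the caption of Figure 4 can be reproduced with a few lines of Python. The sketch below applies the rank-one update in the form given in that caption. The function name `rank_one_update` is an assumption, and the helper quantity `variance_along_y1` is introduced only for this illustration.

```python
import math

def rank_one_update(C, y, c_cov):
    """C <- (1 - c_cov) C + c_cov y y^T for a matrix given as nested lists."""
    n = len(C)
    return [[(1.0 - c_cov) * C[r][c] + c_cov * y[r] * y[c] for c in range(n)]
            for r in range(n)]

c_cov = 0.17
C0 = [[1.0, 0.0], [0.0, 1.0]]
y1 = [-0.59, -2.2]
y2 = [0.97, 1.5]

C1 = rank_one_update(C0, y1, c_cov)
C2 = rank_one_update(C1, y2, c_cov)

# Coefficients of the vectors drawn in the middle panel of Figure 4.
axis_scale = math.sqrt(1.0 - c_cov)      # about 0.91
step_scale = math.sqrt(c_cov)            # about 0.41

# Variance of N(0, C1) along the normalized direction of y1.
norm_sq = y1[0] ** 2 + y1[1] ** 2
variance_along_y1 = sum(y1[r] * C1[r][c] * y1[c]
                        for r in range(2) for c in range(2)) / norm_sq
```

The variance along $y_1$ grows from 1 under $C^{(0)} = I$ to $0.83 + 0.17\,\|y_1\|^2 \approx 1.71$ under $C^{(1)}$, which is the elongation visible in the figure.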
We call a sequence of successive steps, the strategy takes over a number of generations,
an evolution path. An evolution path can be expressed by a sum of consecutive steps. This
summation is referred to as cumulation. To construct an evolution path, the step-size σ is
disregarded. For example, an evolution path of three steps of the distribution mean m can be
constructed by the sum
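$$ \frac{m^{(g+1)} - m^{(g)}}{\sigma^{(g)}} + \frac{m^{(g)} - m^{(g-1)}}{\sigma^{(g-1)}} + \frac{m^{(g-1)} - m^{(g-2)}}{\sigma^{(g-2)}} . \tag{21} $$
In practice, to construct the evolution path, $p_c \in \mathbb{R}^n$, we use exponential smoothing as in (14),
and start with $p_c^{(0)} = 0$ (footnote 15).
$$ p_c^{(g+1)} = (1 - c_c)\, p_c^{(g)} + \sqrt{c_c (2 - c_c) \mu_{\mathrm{eff}}} \; \frac{m^{(g+1)} - m^{(g)}}{\sigma^{(g)}} \tag{22} $$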
where
$p_c^{(g)} \in \mathbb{R}^n$, evolution path at generation g.
$c_c \le 1$. Again, $1/c_c$ is the backward time horizon of the evolution path $p_c$ that contains
roughly 63% of the overall weight (compare derivation of (18)). A time horizon between
$\sqrt{n}$ and n is reasonable.
The factor $\sqrt{c_c (2 - c_c) \mu_{\mathrm{eff}}}$ is a normalization constant for $p_c$. For $c_c = 1$ and $\mu_{\mathrm{eff}} = 1$, the
factor reduces to one, and $p_c^{(g+1)} = (x_{1:\lambda}^{(g+1)} - m^{(g)})/\sigma^{(g)}$.
The factor $\sqrt{c_c (2 - c_c) \mu_{\mathrm{eff}}}$ is chosen, such that
The (rank-one) update of the covariance matrix $C^{(g)}$ via the evolution path $p_c^{(g+1)}$ reads [11]
$$ C^{(g+1)} = (1 - c_{\mathrm{cov}}) C^{(g)} + c_{\mathrm{cov}} \, p_c^{(g+1)} {p_c^{(g+1)}}^T . \tag{26} $$
An empirically validated choice for the learning rate in (26) is $c_{\mathrm{cov}} \approx 2/n^2$. For $c_c = 1$ and
µ = 1, Equations (26), (20), and (14) are identical.
Using the evolution path for the update of C is a significant improvement of (14) for
small µeff , because correlations between consecutive steps are exploited. The leading signs of
steps, and the dependencies between consecutive steps play a significant role for the resulting
evolution path $p_c^{(g+1)}$. For $c_c \approx 3/n$ the number of function evaluations needed to adapt a
nearly optimal covariance matrix on cigar-like objective functions becomes O(n).
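To see how cumulation according to (22) reacts to the signs of consecutive steps, consider the following one-dimensional Python sketch. The helper names, the values $c_c = 0.5$ and $\mu_{\mathrm{eff}} = 1$, and the two artificial step sequences are assumptions chosen for illustration.

```python
import math

def update_evolution_path(p_c, m_new, m_old, sigma, c_c, mu_eff):
    """One step of (22): exponential smoothing of normalized mean shifts."""
    factor = math.sqrt(c_c * (2.0 - c_c) * mu_eff)
    return [(1.0 - c_c) * p + factor * (a - b) / sigma
            for p, a, b in zip(p_c, m_new, m_old)]

def path_after(shifts, c_c=0.5, mu_eff=1.0, sigma=1.0):
    """Accumulate one-dimensional mean shifts, starting from p_c = 0."""
    p_c, m = [0.0], [0.0]
    for shift in shifts:
        m_next = [m[0] + shift]
        p_c = update_evolution_path(p_c, m_next, m, sigma, c_c, mu_eff)
        m = m_next
    return p_c

same_direction = path_after([1.0] * 20)        # steps reinforce each other
alternating = path_after([1.0, -1.0] * 10)     # steps cancel each other out

# For mu_eff = 1 the squared coefficients of (22) sum to one.
coefficient_sums = [(1.0 - c) ** 2 + c * (2.0 - c) for c in (0.1, 0.5, 1.0)]
```

Steps of equal length produce a path of length about 1.73 when they point in the same direction, but only about 0.58 when their signs alternate. A rank-one update with $y y^T$ from a single step cannot see this difference, because the sign cancels in the outer product.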
As a last step, we combine (14) and (26).
15 In the final algorithm (22) is still slightly modified, compare (42).
3.4 Combining Rank-µ-Update and Cumulation
The final CMA update of the covariance matrix combines (14) and (26), where µcov deter-
mines their relative weighting.
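$$
C^{(g+1)} = (1 - c_{\mathrm{cov}}) C^{(g)}
+ \frac{c_{\mathrm{cov}}}{\mu_{\mathrm{cov}}} \underbrace{p_c^{(g+1)} {p_c^{(g+1)}}^T}_{\text{rank-one update}}
+ c_{\mathrm{cov}} \left( 1 - \frac{1}{\mu_{\mathrm{cov}}} \right)
\times \underbrace{\sum_{i=1}^{\mu} w_i \, y_{i:\lambda}^{(g+1)} {y_{i:\lambda}^{(g+1)}}^T}_{\text{rank-}\mu\text{ update}}
\tag{27}
$$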
where
µcov ≥ 1. Choosing µcov = µeff is most appropriate.
$c_{\mathrm{cov}} \approx \min(\mu_{\mathrm{cov}}, \mu_{\mathrm{eff}}, n^2)/n^2$.
$y_{i:\lambda}^{(g+1)} = (x_{i:\lambda}^{(g+1)} - m^{(g)})/\sigma^{(g)}$.
Equation (27) reduces to (14) for µcov → ∞ and to (26) for µcov = 1. The equation combines
the advantages of (14) and (26). On the one hand, the information within the population of
one generation is used efficiently by the rank-µ update. On the other hand, information of
correlations between generations is exploited by using the evolution path for the rank-one
update. The former is important in large populations, the latter is in particular important in
small populations.
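The combined update (27) and its two limit cases can be sketched as follows. This is a plain-Python illustration with nested lists, not the MATLAB code of Appendix B. The function name `cma_covariance_update` and all numerical values are assumptions.

```python
def cma_covariance_update(C, p_c, ys, weights, c_cov, mu_cov):
    """Equation (27): weighted combination of rank-one and rank-mu update."""
    n = len(C)
    a_keep = 1.0 - c_cov                       # weight of the old matrix
    a_one = c_cov / mu_cov                     # weight of p_c p_c^T
    a_mu = c_cov * (1.0 - 1.0 / mu_cov)        # weight of the rank-mu sum
    new_C = [[a_keep * C[r][c] + a_one * p_c[r] * p_c[c] for c in range(n)]
             for r in range(n)]
    for w, y in zip(weights, ys):
        for r in range(n):
            for c in range(n):
                new_C[r][c] += a_mu * w * y[r] * y[c]
    return new_C

C = [[1.0, 0.0], [0.0, 1.0]]
p_c = [1.0, 1.0]
ys = [[2.0, 0.0], [0.0, 1.0]]
w = [0.5, 0.5]

combined = cma_covariance_update(C, p_c, ys, w, c_cov=0.2, mu_cov=2.0)
rank_one_only = cma_covariance_update(C, p_c, ys, w, c_cov=0.2, mu_cov=1.0)
nearly_rank_mu = cma_covariance_update(C, p_c, ys, w, c_cov=0.2, mu_cov=1e12)
```

The three coefficients always sum to one. With `mu_cov=1.0` the rank-µ term vanishes and (26) results, while a very large `mu_cov` suppresses the evolution path term and approaches (14).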
4 Step-Size Control
The covariance matrix adaptation, introduced in the last section, does not explicitly control the
“overall scale” of the distribution, the step-size. The covariance matrix adaptation increases
the scale only in one direction for each selected step, and it decreases the scale only implicitly
by fading out old information via the factor 1−ccov . Less informally, we can state two specific
reasons to introduce a step-size control in addition to the adaptation rule (27) for C (g) .
1. The optimal overall step length cannot be well approximated by (27), in particular if
µeff is chosen larger than one.
For example, on $f_{\mathrm{sphere}}(x) = \sum_{i=1}^{n} x_i^2$, the optimal step-size σ equals approximately
$\mu \sqrt{f_{\mathrm{sphere}}(x)}/n$, given $C^{(g)} \approx I$ and $\mu_{\mathrm{eff}} = \mu \ll n$ [3, 17]. This dependency
on µ cannot be realized by (14), and is also not well approximated by (27).
2. The largest reliable learning rate for the covariance matrix update in (27) is too slow to
achieve competitive change rates for the overall step length.
Figure 5: Three evolution paths of respectively six steps from different selection situations
(idealized). The lengths of the single steps are all comparable. The length of the evolution
paths (sum of steps) is remarkably different and is exploited for step-size control
To control the step-size σ (g) we utilize an evolution path, i.e. a sum of successive steps (see
Sect. 3.3). The method can be applied independently of the covariance matrix update and is
denoted as cumulative path length control, cumulative step-size control, or cumulative step
length adaptation (CSA). The length of an evolution path is exploited, based on the following
reasoning (compare also Fig. 5).
• Whenever the evolution path is short, single steps cancel each other out (Fig. 5, left).
Loosely speaking, they are anti-correlated. If steps annihilate each other, the step-size
should be decreased.
• Whenever the evolution path is long, the single steps are pointing to similar directions
(Fig. 5, right). Loosely speaking, they are correlated. Because the steps are similar, the
same distance can be covered by fewer but longer steps into the same directions. In the
limit case, where consecutive steps have identical direction, they can be replaced by an
enlarged single step. Consequently, the step-size should be increased.
• Subsuming, in the desired situation the steps are (approximately) perpendicular in ex-
pectation and therefore uncorrelated (Fig. 5, middle).
To decide whether the evolution path is “long” or “short”, we compare the length of the path
with its expected length under random selection.16 Under random selection consecutive steps
16 Random selection means that the index $i:\lambda$ (compare (6)) is independent of the value of $x_{i:\lambda}^{(g+1)}$ for all i =
1, . . . , λ, e.g. $i:\lambda = i$.
are independent and therefore uncorrelated (we just realized that “uncorrelated” steps are the
desired situation). If selection biases the evolution path to be longer than expected, σ is increased,
and, vice versa, if selection biases the evolution path to be shorter than expected, σ is
decreased. In the ideal situation, selection does not bias the length of the evolution path and
the length equals its expected length under random selection.
In practice, to construct the evolution path, pσ , the same techniques as in (22) are applied.
In contrast to (22), a conjugate evolution path is constructed, because the expected length
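of the evolution path $p_c$ from (22) depends on its direction (compare (23)). Initialized with
$p_\sigma^{(0)} = 0$, the conjugate evolution path reads
$$ p_\sigma^{(g+1)} = (1 - c_\sigma)\, p_\sigma^{(g)} + \sqrt{c_\sigma (2 - c_\sigma) \mu_{\mathrm{eff}}} \; {C^{(g)}}^{-\frac{1}{2}} \, \frac{m^{(g+1)} - m^{(g)}}{\sigma^{(g)}} \tag{28} $$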
where
$p_\sigma^{(g)} \in \mathbb{R}^n$ is the conjugate evolution path at generation g.
$c_\sigma < 1$. Again, $1/c_\sigma$ is the backward time horizon of the evolution path (compare (18)). For
small $\mu_{\mathrm{eff}}$, a time horizon between $\sqrt{n}$ and n is reasonable.
$\sqrt{c_\sigma (2 - c_\sigma) \mu_{\mathrm{eff}}}$ is a normalization constant, see (22).
${C^{(g)}}^{-\frac{1}{2}} \overset{\mathrm{def}}{=} B^{(g)} {D^{(g)}}^{-1} {B^{(g)}}^T$, where $C^{(g)} = B^{(g)} {D^{(g)}}^{2} {B^{(g)}}^T$ is an eigendecomposition
of $C^{(g)}$, where $B^{(g)}$ is an orthonormal basis of eigenvectors, and the diagonal
elements of the diagonal matrix $D^{(g)}$ are square roots of the corresponding positive
eigenvalues (cf. Sect. 0.1).
For $C^{(g)} = I$, we have ${C^{(g)}}^{-\frac{1}{2}} = I$ and (28) replicates (22). The transformation ${C^{(g)}}^{-\frac{1}{2}}$
re-scales the step $m^{(g+1)} - m^{(g)}$ within the coordinate system given by $B^{(g)}$.
The single factors of the transformation ${C^{(g)}}^{-\frac{1}{2}} = B^{(g)} {D^{(g)}}^{-1} {B^{(g)}}^T$ can be explained
as follows (from right to left):
${B^{(g)}}^T$ rotates the space such that the columns of $B^{(g)}$, i.e. the principal axes of the
distribution $\mathcal{N}(0, C^{(g)})$, rotate into the coordinate axes. Elements of the resulting
vector relate to projections onto the corresponding eigenvectors.
${D^{(g)}}^{-1}$ applies a (re-)scaling such that all axes become equally sized.
$B^{(g)}$ rotates the result back into the original coordinate system. This last transformation
ensures that the principal axes of the distribution are not rotated by the overall
transformation and directions of consecutive steps are comparable.
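The three factors can be followed in a small two-dimensional example. The Python sketch below builds C from an assumed rotation B and assumed axis lengths d, forms $C^{-1/2} = B D^{-1} B^T$, and applies it. All names and values are illustrative.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def diag(values):
    return [[values[0], 0.0], [0.0, values[1]]]

def matvec(A, v):
    return [A[i][0] * v[0] + A[i][1] * v[1] for i in range(2)]

angle = 0.7                                             # assumed orientation
B = [[math.cos(angle), -math.sin(angle)],
     [math.sin(angle), math.cos(angle)]]                # orthonormal eigenvectors
d = [3.0, 0.5]                                          # square roots of eigenvalues

C = matmul(matmul(B, diag([di ** 2 for di in d])), transpose(B))
C_inv_sqrt = matmul(matmul(B, diag([1.0 / di for di in d])), transpose(B))

# Applying C^{-1/2} on both sides of C gives the identity.
whitened = matmul(matmul(C_inv_sqrt, C), C_inv_sqrt)

# A step of length d[0] along the first principal axis b_1 ...
b1 = [B[0][0], B[1][0]]
step = [d[0] * b1[0], d[0] * b1[1]]
# ... is re-scaled to unit length without changing its direction.
rescaled = matvec(C_inv_sqrt, step)
```

Because the final multiplication by B rotates back, `rescaled` still points along `b1`. Without it, the step would be expressed in the eigenvector coordinate system and steps from generations with different $B^{(g)}$ could not be added meaningfully.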
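Consequently, the transformation ${C^{(g)}}^{-\frac{1}{2}}$ makes the expected length of $p_\sigma^{(g+1)}$ independent of
its direction, and for any sequence of realized covariance matrices $C^{(g)}_{g=0,1,2,\dots}$ we have under
random selection $p_\sigma^{(g+1)} \sim \mathcal{N}(0, I)$, given $p_\sigma^{(0)} \sim \mathcal{N}(0, I)$ [6].
To update $\sigma^{(g)}$, we “compare” $\|p_\sigma^{(g+1)}\|$ with its expected length $\mathrm{E}\|\mathcal{N}(0, I)\|$, that is
$$ \ln \sigma^{(g+1)} = \ln \sigma^{(g)} + \frac{c_\sigma}{d_\sigma \, \mathrm{E}\|\mathcal{N}(0, I)\|} \left( \|p_\sigma^{(g+1)}\| - \mathrm{E}\|\mathcal{N}(0, I)\| \right) , \tag{29} $$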
where
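$d_\sigma \approx 1$, damping parameter, scales the change magnitude of $\ln \sigma^{(g)}$. The factor $c_\sigma / d_\sigma / \mathrm{E}\|\mathcal{N}(0, I)\|$
is based on in-depth investigations of the algorithm [6].
$\mathrm{E}\|\mathcal{N}(0, I)\| = \sqrt{2}\, \Gamma\!\left(\frac{n+1}{2}\right) / \Gamma\!\left(\frac{n}{2}\right) \approx \sqrt{n} + O(1/n)$, expectation of the Euclidean norm of a
$\mathcal{N}(0, I)$ distributed random vector.
For $\|p_\sigma^{(g+1)}\| = \mathrm{E}\|\mathcal{N}(0, I)\|$ the second summand in (29) is zero, and $\sigma^{(g)}$ is unchanged,
while $\sigma^{(g)}$ is increased for $\|p_\sigma^{(g+1)}\| > \mathrm{E}\|\mathcal{N}(0, I)\|$, and $\sigma^{(g)}$ is decreased for $\|p_\sigma^{(g+1)}\| < \mathrm{E}\|\mathcal{N}(0, I)\|$.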
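The three cases can be checked with a short Python sketch of the update, written in the multiplicative form obtained by exponentiating (29). The function names and the values of $c_\sigma$ and $d_\sigma$ are assumptions. The log-gamma function is used to evaluate the ratio of gamma functions, a choice made here to avoid overflow for large n.

```python
import math

def expected_norm(n):
    """E||N(0, I)|| = sqrt(2) * Gamma((n + 1) / 2) / Gamma(n / 2)."""
    return math.sqrt(2.0) * math.exp(math.lgamma((n + 1) / 2.0) - math.lgamma(n / 2.0))

def update_sigma(sigma, p_sigma, c_sigma, d_sigma):
    """Exponentiated form of (29): compare ||p_sigma|| with its expected length."""
    length = math.sqrt(sum(p * p for p in p_sigma))
    ratio = length / expected_norm(len(p_sigma))
    return sigma * math.exp(c_sigma / d_sigma * (ratio - 1.0))

n = 3
e_norm = expected_norm(n)
c_sigma, d_sigma = 0.3, 1.0                      # illustrative values only

unchanged = update_sigma(1.0, [e_norm, 0.0, 0.0], c_sigma, d_sigma)
increased = update_sigma(1.0, [2.0 * e_norm, 0.0, 0.0], c_sigma, d_sigma)
decreased = update_sigma(1.0, [0.5 * e_norm, 0.0, 0.0], c_sigma, d_sigma)
```

For n = 1 and n = 2 the formula reduces to the known values $\sqrt{2/\pi}$ and $\sqrt{\pi/2}$, which gives a simple sanity check of `expected_norm`.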
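Alternatively, we might use the squared norm $\|p_\sigma^{(g+1)}\|^2$ in (29) and compare with its
expected value n [2]. In this case (29) would read
$$
\begin{aligned}
\ln \sigma^{(g+1)} &= \ln \sigma^{(g)} + \frac{c_\sigma}{d_\sigma \, 2n} \left( \|p_\sigma^{(g+1)}\|^2 - n \right) \\
                   &= \ln \sigma^{(g)} + \frac{c_\sigma}{2 d_\sigma} \left( \frac{\|p_\sigma^{(g+1)}\|^2}{n} - 1 \right) .
\end{aligned}
\tag{30}
$$
This update will presumably lead to faster step-size increments and slower step-size decrements.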
The step-size change is unbiased on the log scale, because E ln σ (g+1) σ (g) = ln σ (g)
(g+1)
for pσ ∼ N (0, I). The role of unbiasedness is discussed in Sect. 5. Equations (28)
−1
and (29) cause successive steps of the distribution mean m(g) to be approximately C (g) -
conjugate.
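In order to show that successive steps are approximately ${\mathbf{C}^{(g)}}^{-1}$-conjugate, we first remark
that (28) and (29) adapt $\sigma$ such that the length of $\mathbf{p}_\sigma^{(g+1)}$ equals approximately
$E\|\mathcal{N}(\mathbf{0},\mathbf{I})\|$. Starting from $(E\|\mathcal{N}(\mathbf{0},\mathbf{I})\|)^2 \approx \|\mathbf{p}_\sigma^{(g+1)}\|^2 = {\mathbf{p}_\sigma^{(g+1)}}^T \mathbf{p}_\sigma^{(g+1)} = \mathrm{RHS}^T \mathrm{RHS}$ of (28)
and assuming that the expected squared length of ${\mathbf{C}^{(g)}}^{-\frac{1}{2}} (\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)})$
is unchanged by selection (unlike its direction) we get
$${\mathbf{p}_\sigma^{(g)}}^T\, {\mathbf{C}^{(g)}}^{-\frac{1}{2}} \left( \mathbf{m}^{(g+1)} - \mathbf{m}^{(g)} \right) \approx 0 , \tag{31}$$
and
$$\left( {\mathbf{C}^{(g)}}^{\frac{1}{2}} \mathbf{p}_\sigma^{(g)} \right)^T {\mathbf{C}^{(g)}}^{-1} \left( \mathbf{m}^{(g+1)} - \mathbf{m}^{(g)} \right) \approx 0 . \tag{32}$$
Given $1/c_{\mathrm{cov}} \gg 1$ and (31) we assume also ${\mathbf{p}_\sigma^{(g-1)}}^T\, {\mathbf{C}^{(g)}}^{-\frac{1}{2}} (\mathbf{m}^{(g+1)} - \mathbf{m}^{(g)}) \approx 0$ and derive
$$\left( \mathbf{m}^{(g)} - \mathbf{m}^{(g-1)} \right)^T {\mathbf{C}^{(g)}}^{-1} \left( \mathbf{m}^{(g+1)} - \mathbf{m}^{(g)} \right) \approx 0 . \tag{33}$$
That is, the steps taken by the distribution mean become approximately ${\mathbf{C}^{(g)}}^{-1}$-conjugate.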
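The passage from (31) to (32) uses only the symmetry of ${\mathbf{C}^{(g)}}^{\frac{1}{2}}$. Spelled out as a brief sketch:

```latex
\left( {\mathbf{C}^{(g)}}^{\frac{1}{2}} \mathbf{p}_\sigma^{(g)} \right)^{T}
  {\mathbf{C}^{(g)}}^{-1} \left( \mathbf{m}^{(g+1)} - \mathbf{m}^{(g)} \right)
= {\mathbf{p}_\sigma^{(g)}}^{T}\, {\mathbf{C}^{(g)}}^{\frac{1}{2}}\, {\mathbf{C}^{(g)}}^{-1}
  \left( \mathbf{m}^{(g+1)} - \mathbf{m}^{(g)} \right)
= {\mathbf{p}_\sigma^{(g)}}^{T}\, {\mathbf{C}^{(g)}}^{-\frac{1}{2}}
  \left( \mathbf{m}^{(g+1)} - \mathbf{m}^{(g)} \right)
\approx 0
```

Read this way, (32) states that the vector ${\mathbf{C}^{(g)}}^{\frac{1}{2}} \mathbf{p}_\sigma^{(g)}$ is approximately ${\mathbf{C}^{(g)}}^{-1}$-conjugate to the next step of the mean.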
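Because $\sigma^{(g)} > 0$, (29) is equivalent to
$$\sigma^{(g+1)} = \sigma^{(g)} \exp\left( \frac{c_\sigma}{d_\sigma} \left( \frac{\|\mathbf{p}_\sigma^{(g+1)}\|}{E\|\mathcal{N}(\mathbf{0},\mathbf{I})\|} - 1 \right) \right) \tag{34}$$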
The length of the evolution path is an intuitive and empirically well validated goodness
measure for the overall step length. For µeff > 1 it is the best measure to our knowledge.
Nevertheless, it fails to adapt nearly optimal step-sizes on very noisy objective functions [4].
5 Discussion
The CMA-ES is an attractive option for non-linear optimization, if “classical” search meth-
ods, e.g. quasi-Newton methods (BFGS) and/or conjugate gradient methods, fail due to a
non-convex or rugged search landscape (e.g. sharp bends, discontinuities, outliers, noise, and
local optima). Learning the covariance matrix in the CMA-ES is analogous to learning the
inverse Hessian matrix in a quasi-Newton method. In the end, any convex-quadratic (ellip-
soid) objective function is transformed into the spherical function fsphere . This can improve
the performance on ill-conditioned and/or non-separable problems by orders of magnitude.
The CMA-ES overcomes typical problems that are often associated with evolutionary al-
gorithms.
1. Poor performance on badly scaled and/or highly non-separable objective functions.
Equation (27) adapts the search distribution to badly scaled and non-separable prob-
lems.
2. The inherent need to use large population sizes. A typical reason for the failure of
population based search algorithms, though one that is intricate to diagnose, is the
degeneration of the population into a subspace. This is usually prevented by non-adaptive
components in the algorithm and/or by a large population size (considerably larger than the
problem dimension). In the CMA-ES, the population size can be freely chosen, because the
learning rate $c_{\mathrm{cov}}$ in (27) prevents the degeneration even for small population sizes, e.g.
λ = 9. Small population sizes usually lead to faster convergence, large population sizes
help to avoid local optima.
3. Premature convergence of the population. Step-size control in (34) prevents the
population from converging prematurely. It does not prevent the search from ending up in a
local optimum.
Therefore, the CMA-ES is highly competitive on a considerable number of test functions
[6, 9, 10, 12, 13] and was successfully applied to real world problems.17 Finally we discuss a
few basic design principles that were applied in the previous sections.
17 For a list of published applications see http://www.bionik.tu-berlin.de/user/niko/cmaapplications.pdf
Change rates We refer to a change rate as the expected parameter change per sampled
search point, given a certain selection situation. To achieve competitive performance on a
wide range of objective functions, the possible change rates of the adaptive parameters need
to be adjusted carefully. The CMA-ES separately controls change rates for the mean value of
the distribution, m, the covariance matrix, C, and the step-size, σ.
• The change rate for the mean value m, given a fixed sample distribution, is determined
by the parent number and the recombination weights. The larger µeff , the smaller the
possible change rate of m. The same holds for most evolutionary algorithms.
• The change rate of the covariance matrix C is explicitly controlled by the learning rate
ccov and therefore detached from parent number and population size. The learning rate
reflects the model complexity. In evolutionary algorithms, the explicit control of change
rates of covariances, independent from the population size, is a rare feature.
• The change rate of the step-size σ is explicitly controlled by the damping parameter dσ
and is in particular independent from the change rate of C. The time constant 1/cσ ≤ n
ensures a sufficiently fast change of the overall step length in particular with small
population sizes.
Invariance should be a fundamental design criterion for any search algorithm. Together with
the ability to efficiently adapt the invariance governing parameters, invariance is a key to
competitive performance.
18 Special acknowledgments to Iván Santibáñez-Koref for first pointing this out to me.
Stationarity An important design criterion for a stochastic search procedure is unbiasedness
of variations of object and strategy parameters [5, 13]. Consider random selection, e.g. the
objective function f (x) = rand to be independent of x. The population mean is unbiased if its
expected value remains unchanged in the next generation, that is $E\left[\mathbf{m}^{(g+1)} \mid \mathbf{m}^{(g)}\right] = \mathbf{m}^{(g)}$.
For the population mean, stationarity under random selection is a rather intuitive concept.
In the CMA-ES, stationarity is respected for all parameters in the basic equation (5). The
distribution mean m, the covariance matrix C, and ln σ are unbiased. Unbiasedness of ln σ
does not imply that σ is unbiased. Actually, under random selection, $E\left[\sigma^{(g+1)} \mid \sigma^{(g)}\right] > \sigma^{(g)}$,
compare (29).19
For distribution variances (or step-sizes) a bias toward increase or decrease entails the
danger of divergence or premature convergence, respectively, whenever the selection pressure
is low. Nevertheless, on noisy problems a properly controlled bias towards increase might be
appropriate even on the log scale.
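The difference between unbiasedness of ln σ and of σ can be illustrated with a small Monte Carlo sketch under random selection, where the path is drawn as $\mathbf{p}_\sigma \sim \mathcal{N}(\mathbf{0},\mathbf{I})$. The dimension, the values of cs and damps, the sample count and the seed are arbitrary assumptions, and chiN is only an approximation of $E\|\mathcal{N}(\mathbf{0},\mathbf{I})\|$, so the mean on the log scale is close to zero rather than exactly zero.

```matlab
% Monte Carlo sketch: under random selection ps ~ N(0,I) (all values assumed).
rng(1);                                       % fixed seed for reproducibility
N = 10; cs = 0.4; damps = 1.4;                % illustrative parameters
chiN = sqrt(N)*(1 - 1/(4*N) + 1/(21*N^2));    % approximates E||N(0,I)||
nsamples = 1e5;

normps  = sqrt(sum(randn(N, nsamples).^2, 1));   % ||ps|| for each sample
dlnsig  = (cs/damps) * (normps/chiN - 1);        % change of ln(sigma), Eq. (29)
meanlog = mean(dlnsig);                          % close to 0: ln(sigma) is unbiased
meanfac = mean(exp(dlnsig));                     % above 1: sigma is biased upward

fprintf('mean change of ln(sigma): %.2e\n', meanlog);
fprintf('mean factor on sigma:     %.5f\n', meanfac);
```

The upward bias of σ follows from Jensen's inequality, because exp is convex and the change of ln σ has zero mean but non-zero variance.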
Acknowledgments
The author wishes to gratefully thank Anne Auger, Christian Igel, Stefan Kern, and Fabrice
Marchal for the many valuable comments on the manuscript.
References
[1] Auger A, Hansen N. A restart CMA evolution strategy with increasing population size.
In Proceedings of the IEEE Congress on Evolutionary Computation, 2005.
[2] Arnold DV, Beyer HG. Performance analysis of evolutionary optimization with cumu-
lative step length adaptation. IEEE Transactions on Automatic Control, 49(4):617–622,
2004.
[3] Beyer HG. The Theory of Evolution Strategies. Springer, Berlin, 2001.
[4] Beyer HG, Arnold DV. Qualms regarding the optimality of cumulative path length con-
trol in CSA/CMA-evolution strategies. Evolutionary Computation, 11(1):19–28, 2003.
[5] Beyer HG, Deb K. On self-adaptive features in real-parameter evolutionary algorithms.
IEEE Transactions on Evolutionary Computation, 5(3):250–270, 2001.
[6] Hansen N. Verallgemeinerte individuelle Schrittweitenregelung in der Evolutionsstrate-
gie. Mensch und Buch Verlag, Berlin, 1998.
[7] Hansen N. Invariance, self-adaptation and correlated mutations in evolution strategies.
In Schoenauer M, Deb K, Rudolph G, Yao X, Lutton E, Merelo JJ, Schwefel HP, editors,
Parallel Problem Solving from Nature - PPSN VI, pages 355–364. Springer, 2000.
19 Alternatively, if (34) were designed to be unbiased for $\sigma^{(g+1)}$, this would imply that
$E\left[\ln \sigma^{(g+1)} \mid \sigma^{(g)}\right] < \ln \sigma^{(g)}$, in our opinion a less desirable variant.
[8] Hansen N. The CMA evolution strategy: a comparing review. In Lozano JA, Larrañaga P,
Inza I, and Bengoetxea E, editors, Towards a new evolutionary computation. Advances
on estimation of distribution algorithms, pages 75–102. Springer, 2006.
[9] Hansen N, Kern S. Evaluating the CMA evolution strategy on multimodal test functions.
In Xin Yao et al., editors, Parallel Problem Solving from Nature - PPSN VIII, pages
282–291. Springer, 2004.
[10] Hansen N, Müller SD, Koumoutsakos P. Reducing the time complexity of the deran-
domized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary
Computation, 11(1):1–18, 2003.
[11] Hansen N, Ostermeier A. Adapting arbitrary normal mutation distributions in evolution
strategies: The covariance matrix adaptation. In Proceedings of the 1996 IEEE Confer-
ence on Evolutionary Computation (ICEC ’96), pages 312–317, 1996.
[12] Hansen N, Ostermeier A. Convergence properties of evolution strategies with the deran-
domized covariance matrix adaptation: The (µ/µI , λ)-CMA-ES. In Proceedings of the
5th European Congress on Intelligent Techniques and Soft Computing, pages 650–654,
1997.
[13] Hansen N, Ostermeier A. Completely derandomized self-adaptation in evolution strate-
gies. Evolutionary Computation, 9(2):159–195, 2001.
[14] Kern S, Müller SD, Hansen N, Büche D, Ocenasek J, Koumoutsakos P. Learning proba-
bility distributions in continuous evolutionary algorithms – a comparative review. Natu-
ral Computing, 3:77–112, 2004.
[15] Larrañaga P. A review on estimation of distribution algorithms. In P. Larrañaga and J. A.
Lozano, editors, Estimation of Distribution Algorithms, pages 80–90. Kluwer Academic
Publishers, 2002.
[16] Larrañaga P, Lozano JA, Bengoetxea E. Estimation of distribution algorithms based
on multivariate normal and Gaussian networks. Technical report, Dept. of Computer
Science and Artificial Intelligence, University of the Basque Country, 2001. KZAA-IK-
1-01.
[17] Rechenberg I. Evolutionsstrategie ’94. Frommann-Holzboog, Stuttgart, Germany, 1994.
[18] Rubinstein RY, Kroese DP. The Cross-Entropy Method: a unified approach to combinatorial
optimization, Monte-Carlo simulation, and machine learning. Springer, 2004.
Set parameters
Set parameters λ, µ, wi=1...µ , cσ , dσ , cc , µcov , and ccov to their default values according
to Table 1.
Initialization
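$$\mathbf{z}_k \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \tag{35}$$
$$\mathbf{y}_k = \mathbf{B}\mathbf{D}\mathbf{z}_k \sim \mathcal{N}(\mathbf{0},\mathbf{C}) \tag{36}$$
$$\mathbf{x}_k = \mathbf{m} + \sigma \mathbf{y}_k \sim \mathcal{N}\left(\mathbf{m}, \sigma^2 \mathbf{C}\right) \tag{37}$$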
Step-size control
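$$\mathbf{p}_\sigma \leftarrow (1 - c_\sigma)\, \mathbf{p}_\sigma + \sqrt{c_\sigma (2 - c_\sigma)\, \mu_{\mathrm{eff}}}\; \mathbf{C}^{-\frac{1}{2}} \langle \mathbf{y} \rangle_w \tag{40}$$
$$\sigma \leftarrow \sigma \times \exp\left( \frac{c_\sigma}{d_\sigma} \left( \frac{\|\mathbf{p}_\sigma\|}{E\|\mathcal{N}(\mathbf{0},\mathbf{I})\|} - 1 \right) \right) \tag{41}$$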
1 The optimum should presumably be within the initial cube $\mathbf{m} \pm 3\sigma(1, \dots, 1)^T$. If the optimum is
expected to be in the initial search interval $[a, b]^n$ we may choose the initial search point, m, uniformly
randomly in $[a, b]^n$, and $\sigma = 0.3(b - a)$. Different search intervals $\Delta s_i$ for different variables can be
reflected by a different initialization of C, in that the diagonal elements of C obey $c_{ii} = (\Delta s_i)^2$. Note
that the $\Delta s_i$ should not disagree by several orders of magnitude. Otherwise a scaling of the variables
should be applied.
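$\mathbf{C}^{-\frac{1}{2}} \overset{\mathrm{def}}{=} \mathbf{B}\mathbf{D}^{-1}\mathbf{B}^T$, see B, D above. The matrix D can be inverted by inverting its
diagonal elements. From the definitions we find that $\mathbf{C}^{-\frac{1}{2}} \langle \mathbf{y} \rangle_w = \mathbf{B} \langle \mathbf{z} \rangle_w$ with
$\langle \mathbf{z} \rangle_w = \sum_{i=1}^{\mu} w_i\, \mathbf{z}_{i:\lambda}$.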
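$E\|\mathcal{N}(\mathbf{0},\mathbf{I})\| = \sqrt{2}\, \Gamma(\tfrac{n+1}{2}) / \Gamma(\tfrac{n}{2}) \approx \sqrt{n} \left( 1 - \frac{1}{4n} + \frac{1}{21 n^2} \right)$.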
$h_\sigma = \begin{cases} 1 & \text{if } \dfrac{\|\mathbf{p}_\sigma\|}{\sqrt{1 - (1 - c_\sigma)^{2(g+1)}}} < \left( 1.4 + \dfrac{2}{n+1} \right) E\|\mathcal{N}(\mathbf{0},\mathbf{I})\| \\ 0 & \text{otherwise} \end{cases}$, where $g$ is the generation
number. The Heaviside function $h_\sigma$ stalls the update of $\mathbf{p}_c$ in (42) if $\|\mathbf{p}_\sigma\|$ is large.
This prevents a too fast increase of axes of C in a linear surrounding, i.e. when the
step-size is far too small. This is useful when the initial step-size is chosen far too small
or when the objective function changes in time.
$\delta(h_\sigma) = (1 - h_\sigma)\, c_c (2 - c_c) \le 1$ is of minor relevance. In the (unusual) case of $h_\sigma = 0$, it
substitutes for the second summand from (42) in (43).
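Two of the identities above are easy to confirm numerically. The sketch below checks that $\mathbf{C}^{-\frac{1}{2}} \mathbf{B}\mathbf{D}\mathbf{z} = \mathbf{B}\mathbf{z}$ and compares the approximation of $E\|\mathcal{N}(\mathbf{0},\mathbf{I})\|$ with the exact Gamma-function expression. The test matrix, the dimension and the seed are arbitrary assumptions chosen only for the check.

```matlab
% Numerical check of two identities used above (test data are assumed).
rng(2);                                  % fixed seed for reproducibility
n = 10;
A = randn(n);
C = A*A' + n*eye(n);                     % arbitrary symmetric positive definite matrix
C = (C + C')/2;                          % remove rounding asymmetry
[B, D] = eig(C);                         % C = B*D*B', eigenvalues on diag(D)
D = diag(sqrt(diag(D)));                 % D now holds standard deviations

z = randn(n, 1);
y = B*D*z;                               % Eq. (36)
invsqrtC = B * diag(1./diag(D)) * B';    % C^(-1/2) = B*D^(-1)*B'
err1 = norm(invsqrtC*y - B*z);           % should be at rounding level

chiExact  = sqrt(2) * exp(gammaln((n+1)/2) - gammaln(n/2));
chiApprox = sqrt(n) * (1 - 1/(4*n) + 1/(21*n^2));
err2 = abs(chiExact - chiApprox) / chiExact;   % relative error of the approximation
fprintf('err1 = %.1e, err2 = %.1e\n', err1, err2);
```

Using gammaln instead of gamma is a design choice that keeps the exact expression usable for large n, where the Gamma function itself overflows.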
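Table 1: Default Strategy Parameters, where $\mu_{\mathrm{eff}} = \frac{1}{\sum_{i=1}^{\mu} w_i^2} \ge 1$ and $\sum_{i=1}^{\mu} w_i = 1$, taken
from [9], where $w_i' = \ln(\mu + 1) - \ln i$ was used
Selection and Recombination:
$$\lambda = 4 + \lfloor 3 \ln n \rfloor, \qquad \mu = \lceil \mu' \rceil, \qquad \mu' = \frac{\lambda - 1}{2} \tag{44}$$
$$w_i = \frac{w_i'}{\sum_{j=1}^{\mu} w_j'}, \qquad w_i' = \ln(\mu' + 1) - \ln i \quad \text{for } i = 1, \dots, \mu, \tag{45}$$
Step-size control:
$$c_\sigma = \frac{\mu_{\mathrm{eff}} + 2}{n + \mu_{\mathrm{eff}} + 3}, \qquad d_\sigma = 1 + 2 \max\left( 0, \sqrt{\frac{\mu_{\mathrm{eff}} - 1}{n + 1}} - 1 \right) + c_\sigma \tag{46}$$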
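As a sketch of how (44) to (46) translate into numbers, the following evaluates the defaults for n = 10. The dimension is an arbitrary choice for illustration, and the variable names are borrowed from the MATLAB source code below.

```matlab
% Default strategy parameters from (44)-(46) for an assumed dimension n = 10.
n       = 10;
lambda  = 4 + floor(3*log(n));                % Eq. (44): population size
mup     = (lambda - 1)/2;                     % mu'
mu      = ceil(mup);                          % number of parents
weights = log(mup + 1) - log(1:mu)';          % Eq. (45): w_i' before normalization
weights = weights / sum(weights);             % normalize so that sum(weights) == 1
mueff   = 1 / sum(weights.^2);                % variance effective selection mass
cs      = (mueff + 2) / (n + mueff + 3);      % Eq. (46): c_sigma
damps   = 1 + 2*max(0, sqrt((mueff - 1)/(n + 1)) - 1) + cs;   % Eq. (46): d_sigma
fprintf('lambda=%d mu=%d mueff=%.3f cs=%.3f damps=%.3f\n', ...
        lambda, mu, mueff, cs, damps);
```

For this dimension the max term in the damping is zero, so damps equals 1 + cs; the square-root term only contributes once mueff exceeds n + 2.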
creased population size [1]) or to reconsidering the encoding and/or objective function
formulation. We recommend the following termination criteria [1] that are mostly re-
lated to numerical stability:
• TolFun: stop if the range of the best objective function values of the last $10 + \lceil 30n/\lambda \rceil$
generations and all function values of the recent generation is below
TolFun. Choosing TolFun depends on the problem, while $10^{-12}$ is a conservative
first guess.
• TolX: stop if the standard deviation of the normal distribution is smaller than TolX in
all coordinates and $\sigma \mathbf{p}_c$ is smaller than TolX in all components. By default we
set TolX to $10^{-12}$ times the initial σ.
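One possible way to code the two criteria is sketched below. The variables fithist and arfitness and the sample values of sigma, C and pc are hypothetical stand-ins for the state of a running optimizer, and reading the coordinate-wise standard deviation as sigma*sqrt(diag(C)) is an assumption of this sketch.

```matlab
% Sketch of the TolFun and TolX checks; all state values below are made up.
N = 4; lambda = 8;
TolFun = 1e-12; sigma0 = 0.5; TolX = 1e-12 * sigma0;

sigma = 1e-14; C = eye(N); pc = 1e-3 * ones(N, 1);   % hypothetical optimizer state
nhist     = 10 + ceil(30*N/lambda);                  % number of generations to track
fithist   = 1 + 1e-14 * (1:nhist);                   % best f-value of the last nhist generations
arfitness = 1 + 1e-14 * (1:lambda);                  % f-values of the recent generation

vals = [fithist, arfitness];
stopTolFun = (max(vals) - min(vals)) < TolFun;       % range of tracked values below TolFun
stopTolX   = all(sigma * sqrt(diag(C)) < TolX) && ...
             all(sigma * abs(pc) < TolX);            % distribution has collapsed
fprintf('TolFun: %d, TolX: %d\n', stopTolFun, stopTolX);
```

In a real run fithist would be maintained as a sliding window over the generations rather than filled in at once.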
Flat fitness: In the case of equal function values for several individuals in the population, it
is feasible to increase the step-size (see lines 92–96 in the source code below). This
method can interfere with the termination criterion TolFun. In practice, observation
of a flat fitness should be rather a termination criterion and consequently lead to a re-
consideration of the objective function formulation.
Constraints: If the optimal solution in search space is not close to the infeasible region, a
simple and sufficient way to handle any type of boundaries and constraints is re-sampling
infeasible $\mathbf{x}_k$ until they become feasible. Otherwise, the handling of boundaries and
constraints in CMA-ES is subject to ongoing research, and a more general, still simple
way sets the fitness of the infeasible search point x to
$$f_{\mathrm{fitness}}(\mathbf{x}) = f_{\mathrm{offset}} + \alpha \sum_i \mathbb{1}_{c_i > 0} \times c_i(\mathbf{x})^2 \tag{49}$$
If a repair mechanism is available (e.g. in case of box boundaries) we use the re-
paired search point, xrepaired , only to evaluate its function value and to compute its
(Mahalanobis) distance to the infeasible point x. Similar to (49) we set ffitness (x) =
f (xrepaired ) + α xT C −1 xrepaired .
In either case α should be chosen such that the differences in f and the differences in
the second summand have a similar magnitude.
B MATLAB Source Code
1 function xmin=purecmaes
2 % CMA-ES: Evolution Strategy with Covariance Matrix Adaptation for
3 % nonlinear function minimization.
4 %
5 % This code is an excerpt from cmaes.m and implements the key parts
6 % of the algorithm. It is intended to be used for READING and
7 % UNDERSTANDING the basic flow and all details of the CMA *algorithm*.
8 % Computational efficiency is sometimes disregarded.
9
10 % -------------------- Initialization --------------------------------
11
12 % User defined input parameters (need to be edited)
13 strfitnessfct = 'felli'; % name of objective/fitness function
14 N = 10; % number of objective variables/problem dimension
15 xmean = rand(N,1); % objective variables initial point
16 sigma = 0.5; % coordinate wise standard deviation (step-size)
17 stopfitness = 1e-10; % stop if fitness < stopfitness (minimization)
18 stopeval = 1e3*N^2; % stop after stopeval number of function evaluations
19
20 % Strategy parameter setting: Selection
21 lambda = 4+floor(3*log(N)); % population size, offspring number
22 mup = (lambda-1)/2; % lambda=12; mu=3; weights = ones(mu,1); would be (3_I,12)-ES
23 mu = ceil(mup); % number of parents/points for recombination
24 weights = log(mup+1)-log(1:mu)'; % muXone recombination weights
25 weights = weights/sum(weights); % normalize recombination weights array
26 mueff=sum(weights)^2/sum(weights.^2); % variance-effective size of mu
27
28 % Strategy parameter setting: Adaptation
29 cc = 4/(N+4); % time constant for cumulation for covariance matrix
30 cs = (mueff+2)/(N+mueff+3); % t-const for cumulation for sigma control
31 mucov = mueff; % size of mu used for calculating learning rate ccov
32 ccov = (1/mucov) * 2/(N+1.4)^2 + (1-1/mucov) * ... % learning rate for
33 ((2*mueff-1)/((N+2)^2+2*mueff)); % covariance matrix
34 damps = 1 + 2*max(0, sqrt((mueff-1)/(N+1))-1) + cs; % damping for sigma
35
36 % Initialize dynamic (internal) strategy parameters and constants
37 pc = zeros(N,1); ps = zeros(N,1); % evolution paths for C and sigma
38 B = eye(N); % B defines the coordinate system
39 D = eye(N); % diagonal matrix D defines the scaling
40 C = B*D*(B*D)'; % covariance matrix
41 eigeneval = 0; % B and D updated at counteval == 0
42 chiN=N^0.5*(1-1/(4*N)+1/(21*N^2)); % expectation of
43 % ||N(0,I)|| == norm(randn(N,1))
44
45 % -------------------- Generation Loop --------------------------------
46
47 counteval = 0; % the next 40 lines contain the 20 lines of interesting code
48 while counteval < stopeval
49
50 % Generate and evaluate lambda offspring
51 for k=1:lambda,
52 arz(:,k) = randn(N,1); % standard normally distributed vector
53 arx(:,k) = xmean + sigma * (B*D * arz(:,k)); % add mutation % Eq. 37
54 arfitness(k) = feval(strfitnessfct, arx(:,k)); % objective function call
55 counteval = counteval+1;
56 end
57
58 % Sort by fitness and compute weighted mean into xmean
59 [arfitness, arindex] = sort(arfitness); % minimization
60 xmean = arx(:,arindex(1:mu))*weights; % recombination % Eq. 39
61 zmean = arz(:,arindex(1:mu))*weights; % == D^-1*B'*(xmean-xold)/sigma
62
63 % Cumulation: Update evolution paths
64 ps = (1-cs)*ps + (sqrt(cs*(2-cs)*mueff)) * (B * zmean); % Eq. 40
65 hsig = norm(ps)/sqrt(1-(1-cs)^(2*counteval/lambda))/chiN < 1.4+2/(N+1);
66 pc = (1-cc)*pc + hsig * sqrt(cc*(2-cc)*mueff) * (B*D*zmean); % Eq. 42
67
68 % Adapt covariance matrix C
69 C = (1-ccov) * C ... % regard old matrix % Eq. 43
70 + ccov * (1/mucov) * (pc*pc' ... % plus rank one update
71 + (1-hsig) * cc*(2-cc) * C) ...
72 + ccov * (1-1/mucov) ... % plus rank mu update
73 * (B*D*arz(:,arindex(1:mu))) ...
74 * diag(weights) * (B*D*arz(:,arindex(1:mu)))';
75
76 % Adapt step-size sigma
77 sigma = sigma * exp((cs/damps)*(norm(ps)/chiN - 1)); % Eq. 41
78
79 % Update B and D from C
80 if counteval - eigeneval > lambda/ccov/N/10 % to achieve O(N^2)
81 eigeneval = counteval;
82 C=triu(C)+triu(C,1)'; % enforce symmetry
83 [B,D] = eig(C); % eigen decomposition, B==normalized eigenvectors
84 D = diag(sqrt(diag(D))); % D contains standard deviations now
85 end
86
87 % Break, if fitness is good enough
88 if arfitness(1) <= stopfitness
89 break;
90 end
91
92 % Escape flat fitness
93 if arfitness(1) == arfitness(ceil(0.7*lambda))
94 sigma = sigma * exp(0.2+cs/damps);
95 disp('warning: flat fitness, consider reformulating the objective');
96 end
97
98 disp([num2str(counteval) ': ' num2str(arfitness(1))]);
99
100 end % while, end generation loop
101
102 % -------------------- Ending Message ---------------------------------
103
104 disp([num2str(counteval) ': ' num2str(arfitness(1))]);
105 xmin = arx(:, arindex(1)); % Return best point of last generation.
106 % Notice that xmean is expected to be even
107 % better.
108
109 % ---------------------------------------------------------------
110 function f=felli(x)
111 N = size(x,1); if N < 2 error('dimension must be greater one'); end
112 f=1e6.^((0:N-1)/(N-1)) * x.^2; % condition number 1e6