An Analytical Framework For A Consensus-Based Global Optimization Method
Abstract. In this paper we provide an analytical framework for investigating the efficiency of a
consensus-based model for tackling global optimization problems. This work justifies the optimization
algorithm in the mean-field sense showing the convergence to the global minimizer for a large class
of functions. Theoretical results on consensus estimates are then illustrated by numerical simulations
where variants of the method including nonlinear diffusion are introduced.
1. Introduction
Over the last decades, individual-based models (IBMs) have been widely used in the investigation
of complex systems that manifest self-organization or collective behavior. Examples of such complex
systems include swarming behavior, crowd dynamics, opinion formation, and synchronization, among
many others, arising in the fields of mathematical biology, ecology and social dynamics; see for
instance [5, 6, 13, 14, 18, 23, 28, 29, 34] and the references therein.
In the field of global optimization, IBMs may be found in a class of metaheuristics (e.g., evolutionary
algorithms [1, 4, 32] and swarm intelligence [20, 26]). They play an increasing role in the design of
fast algorithms that provide sufficiently good solutions to hard optimization problems, such as the
traveling salesman problem, which is known to be NP-hard. Metaheuristics, in general, may be considered
as high-level concepts for exploring search spaces using different strategies, chosen in such a way
that a dynamic balance is achieved between the exploitation of the accumulated search experience and
the exploration of the search space [9]. Notable metaheuristics for global optimization
include, for example, the Ant Colony Optimization, Genetic Algorithms, Particle Swarm Optimization
and Simulated Annealing [24, 25], all of which are stochastic in nature [7]. Although they have stood the
test of time, the majority of metaheuristic methods lack a rigorous mathematical justification of their
efficacy; a central question in the field is whether a given metaheuristic is capable of finding an
optimal solution when provided with sufficient information. Due to the stochastic nature of
metaheuristics, answers to this question are nontrivial, and they are always probabilistic.
Recently, the use of opinion dynamics and consensus formation in global optimization has been
introduced in [31], where the authors showed substantial numerical and partial analytical evidence of
its applicability to solving multi-dimensional optimization problems of the form

min_{x∈Ω} f(x),   Ω ⊂ Rd a domain,
for a given cost function f ∈ C(Ω) that achieves its global minimum at a unique point in Ω. Without
loss of generality, we may assume f to be positive and defined on the whole Rd by extending it outside
Ω without changing its global minimum.
Throughout the manuscript, we will use the notation f̲ = inf f, f̄ = sup f, and

x∗ = arg min f,   f∗ = f(x∗).

As we assume a unique global minimizer, it holds that f∗ = f̲.
The optimization algorithm involves the use of multiple agents located within the domain Ω to
dynamically establish a consensual opinion amongst themselves in finding the global minimizer to the
minimization problem, while taking into consideration the opinion of all active agents. First order
models for consensus have been studied within the mathematical community interested in granular
materials and swarming that lead to aggregation-diffusion and kinetic equations, which have nontrivial
stationary states or flock solutions (cf. [11, 16, 17, 12], and the references therein). They are also
common tools in control engineering to establish consensus in graphs (cf. [30, 37]).
In order to achieve the goal of optimizing a given function f (x), we consider an interacting stochastic
system of N ∈ N agents with position Xti ∈ Rd , described by the system of stochastic differential
equations
(1a)   dXti = −λ(Xti − mt) dt + σ |Xti − mt| dWti,   i = 1, . . . , N,

(1b)   mt = Σ_{i=1}^N Xti ωfα(Xti) / Σ_{j=1}^N ωfα(Xtj),
with λ, σ > 0, where ωfα is a weight, which we take as ωfα (x) = exp(−αf (x)) for some appropriately
chosen α > 0. Notice that (1) resembles a geometric Brownian motion, which drifts towards mt ∈ Rd .
This system is a simplified version of the algorithm introduced in [31], while keeping the essential
ingredients and mathematical difficulties. The first term in (1a) imposes a global relaxation towards
a position determined by the behavior of the normalized moment given by mt , while the diffusion
term tries to concentrate the agents again around mt. In fact, agents whose positions differ
significantly from mt are diffused more, and hence explore a larger portion of the landscape of the
graph of f(x), while agents closer to mt diffuse much less. The normalized moment mt is expected
to dynamically approach the global minimum of the function f , at least when α is large enough, see
below. This idea is also used in simulated annealing algorithms.[24, 25] The well-posedness of this
system will be thoroughly investigated in Section 2.
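For readers who want to experiment, the following sketch (ours, not the authors' reference implementation) discretizes (1a)-(1b) with a plain Euler-Maruyama step; the parameter values, the uniform initialization and the stabilizing shift inside the exponential weights are illustrative choices rather than choices made in the paper.

```python
import numpy as np

def weighted_mean(X, f, alpha):
    """m_t from (1b): sum_i X^i w(X^i) / sum_j w(X^j) with w(x) = exp(-alpha f(x)).
    The weights are shifted by min_i f(X^i); the shift cancels in the ratio but avoids underflow."""
    fx = f(X)
    w = np.exp(-alpha * (fx - fx.min()))
    return (X * w[:, None]).sum(axis=0) / w.sum()

def simulate(f, N=100, d=2, alpha=30.0, lam=1.0, sigma=0.8,
             dt=1e-2, steps=2000, box=3.0, seed=0):
    """Euler-Maruyama discretization of the particle system (1a)-(1b)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-box, box, size=(N, d))              # initial agent positions
    for _ in range(steps):
        m = weighted_mean(X, f, alpha)
        dist = np.linalg.norm(X - m, axis=1, keepdims=True)
        dW = rng.normal(scale=np.sqrt(dt), size=(N, d))
        X = X - lam * (X - m) * dt + sigma * dist * dW   # drift towards m_t, diffusion ~ |X^i - m_t|
    return X, weighted_mean(X, f, alpha)

# e.g. simulate(lambda Y: np.sum(Y**2, axis=1))[1] should land near the minimizer (0, 0)
```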
For the solution Xt1 , . . . , XtN , N ∈ N of the particle system (1), we can consider its empirical
measure given by
ρtN = (1/N) Σ_{i=1}^N δ_{Xti},
where δx is the Dirac measure at x ∈ Rd. Observe that mt may be re-written in terms of ρtN, i.e.,

mt = (1/‖ωfα‖L1(ρtN)) ∫ x ωfα(x) dρtN(x) =: mf[ρtN].
As N → ∞, each agent is formally expected to behave like an independent copy X̄t of the nonlinear McKean–Vlasov process

(2a)   dX̄t = −λ(X̄t − mf[ρt]) dt + σ |X̄t − mf[ρt]| dW̄t,
(2b)   ρt = law(X̄t),

subject to the initial condition law(X̄0) = ρ0, where mf[ρt] = (1/‖ωfα‖L1(ρt)) ∫ x ωfα(x) dρt(x). We call
ηtα := ωfα ρt / ‖ωfα‖L1(ρt) the α-weighted measure. The measure ρt = law(X̄t) ∈ P(Rd) is a Borel probability
measure, which describes the evolution of the one-particle distribution resulting from the mean-field limit.
The (infinitesimal) generator corresponding to the nonlinear process (2a) is given by
Lϕ = κt Δϕ − µt · ∇ϕ,   for ϕ ∈ Cc²(Rd),

with drift and diffusion coefficients

µt = λ(x − mf[ρt]),   κt = (σ²/2) |x − mf[ρt]|²,

respectively. Therefore, the Fokker–Planck equation associated to (2) reads

(3)   ∂t ρt = Δ(κt ρt) + ∇ · (µt ρt),   lim_{t→0} ρt = ρ0,
where ρt ∈ P(Rd ) for t ≥ 0 satisfies (3) in the weak sense. Notice that the Fokker–Planck equation
(3) is a nonlocal, nonlinear degenerate drift-diffusion equation, which makes its analysis a nontrivial
task. Its well-posedness will be the topic of Section 3.
We recall from [31] (cf. [19]) that, for any ρ ∈ P(Rd), ωfα ρ satisfies the well-known Laplace principle:

lim_{α→∞} ( −(1/α) log ∫ e^{−αf} dρ ) = inf_{x ∈ supp(ρ)} f(x) > 0.
Therefore, if f attains its minimum at a single point x∗ ∈ supp(ρ), then the α-weighted measure
ηtα ∈ P(Rd ) assigns most of its mass to a small region around x∗ and hence it approximates a Dirac
distribution δx∗ at x∗ ∈ Rd for α ≫ 1. Consequently, the first moment of ηtα, given by mf[ρ],
provides a good estimate of x∗ = arg min f . Using this fact, we proceed to give justifications for the
applicability of the microscopic system (1) as a tool for solving global optimization problems, via its
mean-field counterpart.
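The following toy computation (our own, not taken from the paper) illustrates this mechanism for a hypothetical one-dimensional objective with several local minima and global minimizer x∗ = 1: the first moment of the α-weighted measure, estimated from samples of a uniform ρ, moves towards x∗ as α grows.

```python
import numpy as np

def mf(samples, f, alpha):
    """First moment of the alpha-weighted measure eta^alpha built from samples of rho."""
    fx = f(samples)
    w = np.exp(-alpha * (fx - fx.min()))     # shift cancels in the ratio; avoids underflow
    return np.sum(samples * w) / np.sum(w)

# Toy objective: global minimizer x* = 1 (where f = 0.5), surrounded by local minima.
f = lambda x: (x - 1.0) ** 2 + 0.5 * np.sin(8.0 * (x - 1.0)) ** 2 + 0.5
rng = np.random.default_rng(1)
samples = rng.uniform(-3.0, 5.0, size=200_000)   # samples of a measure rho with x* in its support

for alpha in (1.0, 10.0, 100.0, 1000.0):
    print(alpha, mf(samples, f, alpha))          # drifts towards x* = 1 as alpha grows
```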
Our main results in Section 4 show that, under mild assumptions on the regularity of the objective function
f, namely f ∈ W2,∞(Rd), one obtains uniform consensus as the limiting measure (t → ∞) corresponding to
(3), i.e.,
ρt −→ δx̃ as t → ∞,
for some x̃ ∈ Rd possibly depending on the initial density ρ0. It is also shown that this convergence
happens exponentially in time. Moreover, under the same assumptions on f, the point of consensus x̃
may be made arbitrarily close to x∗ = arg min f by choosing α ≫ 1 sufficiently large, which is the main
goal of global optimization. Our regularity assumptions allow for complicated landscapes of objective
functions with arbitrarily many local minimizers but a well-defined unique global minimum, see for
instance the Ackley function, a well-known benchmark for global optimization problems [3], used in [31]
and depicted in Figure 1. To the best of our knowledge, this work shows for the first time the
convergence of an agent-based stochastic scheme for global optimization with mild assumptions on the
regularity of the cost function. We conclude the paper with an extension of the Fokker–Planck equation
(3) to include nonlinear diffusion of porous medium type and provide numerical evidence for consensus
formation in the one dimensional case. For this reason, we introduce an equivalent formulation of the
mean-field equation in terms of the pseudo-inverse distribution χt (η) = inf{x ∈ R | ρt ((−∞, x]) > η}.
We also compare the microscopic approximation corresponding to the porous medium type Fokker–
Planck equation with the original consensus-based microscopic system (1) and the proposed algorithm
in [31], showcasing the exponential decay rate of the error in suitable transport distances towards the
global minimizer.
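For an empirical measure, the pseudo-inverse distribution just mentioned is nothing but the quantile function of the samples; the following minimal helper (ours, for illustration only) evaluates it directly from the definition χt(η) = inf{x ∈ R | ρt((−∞, x]) > η}.

```python
import numpy as np

def pseudo_inverse(samples, etas):
    """chi(eta) = inf{x : F(x) > eta} for the empirical measure of `samples` (its quantile function)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    cdf = np.arange(1, n + 1) / n                       # F(x_(k)) = k/n at the ordered samples
    idx = np.searchsorted(cdf, etas, side="right")      # smallest k with F(x_(k)) > eta
    return x[np.clip(idx, 0, n - 1)]

# e.g. quantiles of a standard normal sample at eta = 0, 0.2, 0.4, 0.6, 0.8:
# pseudo_inverse(np.random.default_rng(0).normal(size=10_000), np.linspace(0, 1, 5, endpoint=False))
```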
where ck = α ‖∇f‖L∞(Bk) exp( α ‖f − f̲‖L∞(Bk) ) and Bk = {x ∈ Rd : |x| ≤ k}.
Proof. Let X, X̂ ∈ RNd with |X|, |X̂| ≤ k for some k ≥ 0 and let i ∈ {1, . . . , N} be arbitrary. Then

FNi(X) − FNi(X̂) = Σ_{j≠i} (Xi − Xj) ωfα(Xj) / Σ_j ωfα(Xj) − Σ_{j≠i} (X̂i − X̂j) ωfα(X̂j) / Σ_j ωfα(X̂j) =: Σ_{ℓ=1}^3 Iℓ,
Due to Lemma 2.1, we may invoke standard existence results of strong solutions for (4).[21]
Theorem 2.1. For each N ∈ N, the stochastic differential equation (4) has a unique strong solution
{Xt(N) : t ≥ 0} for any initial condition X0(N) satisfying E|X0(N)|² < ∞.
Proof. As mentioned above, we make use of a standard result on existence of a unique strong solution.
To this end, we show the existence of a constant bN > 0, such that
(5)   −2λ X · FN(X) + σ² trace(MN MN⊤)(X) ≤ bN |X|².
Indeed, since the following inequalities hold:

−Xi · FNi(X) = −Xi · Σ_{j≠i} (Xi − Xj) ωfα(Xj) / Σ_j ωfα(Xj) ≤ −|Xi|² + |Xi||X|,

|FNi(X)|² = | Σ_{j≠i} (Xi − Xj) ωfα(Xj) / Σ_j ωfα(Xj) |² ≤ 2( |Xi|² + |X|² ),

we conclude that

−2λ X · FN(X) + σ² trace(MN MN⊤)(X) = Σ_i [ −2λ Xi · FNi(X) + dσ² |FNi(X)|² ]
   ≤ Σ_i [ 2λ( −|Xi|² + |Xi||X| ) + 2dσ²( |Xi|² + |X|² ) ]
   ≤ 2( λ√N + 2dσ² N ) |X|² =: bN |X|².
Along with the local Lipschitz continuity and linear growth of FN and MN , we obtain the assertion
by applying Theorem 3.1 of [21].
Remark 2.1. In fact, the estimate (5) yields a uniform bound on the second moment of Xt . Indeed,
by application of the Itô formula, we obtain
(d/dt) E|Xt(N)|² = −2λ E[ Xt(N) · FN(Xt(N)) ] + σ² E[ trace(MN MN⊤)(Xt(N)) ] ≤ bN E|Xt(N)|².

Therefore, the Gronwall inequality yields

E|Xt(N)|² ≤ e^{bN t} E|X0(N)|²   for all t ≥ 0,
i.e., the solution exists globally in time for each fixed N ∈ N.
Unfortunately, for the mean-field limit (N → ∞) we lose control of the previous bound, since
bN → ∞ as N → ∞. Therefore, we will need finer moment estimates on X(N), which we establish
at the end of the following section (cf. Lemma 3.4).
where Π(µ, µ̂) denotes the collection of all Borel probability measures on Rd × Rd with marginals µ
and µ̂ on the first and second factors respectively. The set Π(µ, µ̂) is also known as the set of all
couplings of µ and µ̂. Equivalently, the Wasserstein distance may be defined by
W2²(µ, µ̂) = inf E[ |Z − Ẑ|² ],
where the infimum is taken over all joint distributions of the random variables Z and Ẑ with marginals
µ and µ̂ respectively. It is well-known that W2 defines a metric on P2 (Rd ). Since Rd is a separa-
ble complete metric space and p a positive number, the metric space (Pp(Rd), Wp) is separable and
complete [10]; in particular, this yields that (P2(Rd), W2) is Polish. We remind the reader that a Polish space
is a separable completely metrizable topological space. With each point µ ∈ P2(Rd) and every ε > 0
we associate an ε-ball Uε(µ) = {ν ∈ P2(Rd) : W2(µ, ν) < ε}. The ε-balls form the basis of a topology,
called the topology of the metric space (P2 (Rd ), W2 ). If this topology agrees with a given topology O
on P2 (Rd ), we say that the metric W2 is compatible with the topology O. If for a topological space
(X, O) there exists a compatible metric, we say (X, O) is metrizable or that the metric metrizes the
topological space (X, O). For Rd and p ∈ [1, ∞) it is well-known that the Wasserstein distance Wp
is compatible with the weak topology in Pp (Rd ). Moreover, convergence in Wp implies convergence
of the first p moments (see Chapter 6 of [36] for more details). Altogether, W2 metrizes the weak
convergence in P2 (Rd ) and convergence in W2 implies convergence of the first two moments. Moreover,
P2 (Rd ) with the weak topology is Polish.
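Since the numerical section later reports errors as 2-Wasserstein distances to δx∗, we record how W2 is evaluated in the two situations that occur there; the helpers below are our own sketch, assuming empirical measures with equally many, uniformly weighted atoms.

```python
import numpy as np

def w2_to_dirac(X, x_star):
    """W_2 between the empirical measure of the rows of X and the Dirac mass at x_star."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    return np.sqrt(np.mean(np.sum((X - np.asarray(x_star, dtype=float)) ** 2, axis=1)))

def w2_1d(x, y):
    """W_2 between two one-dimensional empirical measures with equally many, uniformly
    weighted atoms; the optimal coupling matches the sorted samples (monotone rearrangement)."""
    return np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))
```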
We split the results of this section into two parts, based on additional assumptions on f . We
begin our investigation with the easier of the two, which also provides the means to prove the other.
Throughout this section, we assume that f satisfies the following assumptions:
Assumption 3.1. (1) The cost function f : Rd → R is bounded from below with f̲ := inf f.
(2) There exist constants Lf and cu > 0 such that

(A1)   |f(x) − f(y)| ≤ Lf (|x| + |y|) |x − y|   for all x, y ∈ Rd,
       f(x) − f̲ ≤ cu (1 + |x|²)   for all x ∈ Rd.
3.1. Bounded cost functions. In addition to Assumption 3.1 we consider cost functions f that are
bounded from above. In particular, f has the upper bound f̄ := sup f.
Lemma 3.1. Let f satisfy Assumption 3.1 and let µ ∈ P(Rd) satisfy ∫ |x|² dµ ≤ K. Then

e^{−αf̲} / ‖ωfα‖L1(µ) ≤ exp( αcu(1 + K) ) =: cK.
Proof. The proof follows from the Jensen inequality, which gives

e^{−α ∫ (f − f̲) dµ} ≤ ∫ e^{−α(f − f̲)} dµ.
A simple rearrangement of the previous inequality and using (A1) yields the required estimate.
Lemma 3.2. Let f satisfy Assumption 3.1 and let µ, µ̂ ∈ P2(Rd) satisfy

∫ |x|⁴ dµ, ∫ |x̂|⁴ dµ̂ ≤ K.

Then there exists a constant c0 > 0, depending only on α, Lf, cu and K, such that |mf[µ] − mf[µ̂]| ≤ c0 W2(µ, µ̂).
Under the assumption (A1) on f and Lemma 3.1, the terms may be estimated by

|h(x) − h(x̂)| ≤ cK |x − x̂| + cK αLf |x| (|x| + |x̂|) |x − x̂| + cK² αLf |x̂| ∫∫ (|x| + |x̂|) |x − x̂| dπ
   ≤ cK ( 1 + αLf (1 + cK) pK ) ( ∫∫ |x − x̂|² dπ )^{1/2},
where we applied the Hölder inequality in the second line and pK is a polynomial in K. Finally,
optimizing over all couplings π concludes the proof.
Proof of Theorem 3.1. Step 1: For some given u ∈ C([0, T ], Rd ), we may uniquely solve the SDE
(6) dYt = −λ(Yt − ut )dt + σ|Yt − ut |dWt , law(Y0 ) = ρ0 ,
for some fixed initial measure ρ0 ∈ P4 (Rd ), which induces νt = law(Yt ). Since Y ∈ C([0, T ], Rd ), we
obtain ν ∈ C([0, T ], P2 (Rd )), which satisfies the following Fokker–Planck equation
(7)   (d/dt) ∫ ϕ dνt = ∫ [ (σ²/2) |x − ut|² Δϕ − λ (x − ut) · ∇ϕ ] dνt,
for all ϕ ∈ Cb²(Rd). Since mf[ν] ∈ C([0, T], Rd), this provides the self-mapping property of the map

T : C([0, T], Rd) → C([0, T], Rd),   u ↦ T u = mf[ν],

which we will show to be compact.
Step 2: Since ρ0 ∈ P4 (Rd ), standard theory of SDEs (see e.g. Chapter 7 of [2]) provides a fourth-
order moment estimate for solutions to (6) of the form
E|Yt |4 ≤ (1 + E|Y0 |4 )ect ,
for some constant c > 0. In particular, sup_{t∈[0,T]} ∫ |x|⁴ dνt ≤ K for some K < ∞. On the other hand,
where γ = λ − dσ² and cλ = (dσ² + |γ|)(1 + e^{α(f̄ − f̲)}). From Gronwall's inequality we easily deduce

∫ |x|² dρt ≤ ( ∫ |x|² dρ0 ) e^{cλ t},
and consequently also an estimate for ‖u‖∞ via (8). In particular, there is a constant q > 0 for which
‖u‖∞ < q. We conclude the proof by applying the Leray–Schauder fixed point theorem [22, Chapter 11],
which provides a fixed point u of the mapping T and thereby a solution of (2) (respectively (3)).
Step 4: As for uniqueness, we first note that a fixed point u of T satisfies ‖u‖∞ < q. Hence,
the fourth-order moment estimate provided in Step 2 holds and sup_{t∈[0,T]} ∫ |x|⁴ dρt ≤ K < ∞. Now
suppose we have two fixed points u and û with

‖u‖∞, ‖û‖∞ < q,   sup_{t∈[0,T]} ∫ |x|⁴ dρt, sup_{t∈[0,T]} ∫ |x|⁴ dρ̂t ≤ K,
and their corresponding processes Yt , Ŷt satisfying (6) respectively. Then taking the difference zt :=
Yt − Ŷt for the same Brownian path gives
zt = z0 − λ ∫_0^t zs ds + λ ∫_0^t (us − ûs) ds + σ ∫_0^t ( |Ys − us| − |Ŷs − ûs| ) dWs.
Squaring on both sides, taking the expectation and applying the Itô isometry yields

E|zt|² ≤ 2E|z0|² + 8(λ²t + σ²) ∫_0^t E|zs|² ds + 4λ²t ∫_0^t |mf[ρs] − mf[ρ̂s]|² ds.

Since Lemma 3.2 provides the estimate

(9)   |mf[ρt] − mf[ρ̂t]| ≤ c0 W2(ρt, ρ̂t) ≤ c0 ( E|zt|² )^{1/2},

we further obtain

E|zt|² ≤ 2E|z0|² + 4( (2 + c0²)λ²t + 2σ² ) ∫_0^t E|zs|² ds.
Therefore, applying Gronwall’s inequality and using the fact that E|z0 |2 = 0 yields E|zt |2 = 0 for all
t ∈ [0, T]. In particular, ‖u − û‖∞ = 0, i.e., u ≡ û due to (9).
3.2. Cost functions with quadratic growth at infinity. In this subsection, we allow for cost
functions that have quadratic growth at infinity. More precisely, we suppose the following:
There exist constants M > 0 and cl > 0 such that

(A2)   f(x) − f̲ ≥ cl |x|²   for |x| ≥ M.
We provide a similar result as in the boundedness case under the assumption (A2).
Theorem 3.2. Let f satisfy Assumption 3.1 and quadratic growth (A2), and ρ0 ∈ P4 (Rd ). Then
there exists a unique nonlinear process X̄ ∈ C([0, T ], Rd ), T > 0, satisfying
dX̄t = −λ(X̄t − mf [ρt ]) dt + σ|X̄t − mf [ρt ]|dWt , ρt = law(X̄t ),
in the strong sense, and ρ ∈ C([0, T ], P2 (Rd )) satisfies the corresponding Fokker–Planck equation (3)
(in the weak sense) with limt→0 ρt = ρ0 ∈ P2 (Rd ).
Proof. The proof is a slight modification of Step 3 in the proof of Theorem 3.1. Since Steps 1, 2
and 4 remain the same, we only show Step 3.
Step 3: Let u ∈ C([0, T ], Rd ) satisfy u = τ T u for τ ∈ [0, 1], i.e., there exists ρ ∈ C([0, T ], P2 (Rd ))
satisfying (7) such that u = τ mf [ρ]. Due to Lemma 3.3 below, we have that
(10)   |ut|² = τ² |mf[ρt]|² ≤ τ² ∫ |x|² dηtα ≤ τ² ( b1 + b2 ∫ |x|² dρt ),
for the constants b1 and b2 given in (11). Therefore, a similar computation of the second moment
estimate as above gives
(d/dt) ∫ |x|² dρt ≤ (dσ² − 2λ + |γ|) ∫ |x|² dρt + τ² (dσ² + |γ|) ( b1 + b2 ∫ |x|² dρt )
   ≤ (dσ² + |γ|) b1 + (dσ² + |γ|)(1 + b2) ∫ |x|² dρt,
for all i ∈ N. From estimate (12) and the choice ℓ = cl 2^{i−2} R0², we further obtain

(13)   ∫ |x|² dηα ≤ R0² + R0² Σ_{i=1}^∞ (2^i − 1) ηα({ f − f∗ ≥ cl 2^{i−1} R0² })
            ≤ R0² + R0² Σ_{i=1}^∞ (2^i − 1) e^{−α cl R0² 2^{i−2}} µ({ f − f∗ ≥ cl 2^{i−1} R0² }) / µ({ f − f∗ < cl 2^{i−2} R0² }).
We are now left with estimating the last term on the right.
The Markov inequality and (A1) yield

µ({f − f∗ ≥ k}) ≤ (1/k) ∫ |f − f∗| dµ ≤ (cu/k) ∫ (1 + |x|²) dµ =: (cu/k) m,

for any k > 0. Furthermore, since

µ({f − f∗ < k}) = 1 − µ({f − f∗ ≥ k}) ≥ 1 − (cu/k) m,

we have that

µ({f − f∗ ≥ k}) / µ({f − f∗ < k/2}) ≤ (1/2) m / ( k/(2cu) − m ).
Choosing k = cl 2^{i−1} R0², we obtain

µ({f − f∗ ≥ cl 2^{i−1} R0²}) / µ({f − f∗ < cl 2^{i−2} R0²}) ≤ (1/2) m / ( (cl/cu) 2^{i−2} R0² − m ).
Lemma 3.4. Let f satisfy Assumption 3.1 and either (1) boundedness, or (2) quadratic growth at
infinity (A2), and ρ0 ∈ P2p(Rd), p ≥ 1. Further, let X(N), N ∈ N, be the solution of the particle
system (1) with ρ0⊗N-distributed initial data X0(N), and let ρN be the corresponding empirical measure.
Then there exists a constant K > 0, independent of N, such that

(14)   sup_{t∈[0,T]} E[ ∫ |x|^{2p} dρtN ],   sup_{t∈[0,T]} E|mtN|^{2p} ≤ K,
Following the strategy given in Section 3, we obtain for cost functions f satisfying Assumption 3.1
and either boundedness or (A2), the estimate
(17)   |msN|² ≤ ∫ |x|² dηsα,N ≤ c1 + c2 ∫ |x|² dρsN,
Inserting this into (16) and applying the Gronwall inequality provides a constant Kp > 0, independent
of N, such that sup_{t∈[0,T]} E[ ∫ |x|^{2p} dρtN ] ≤ Kp holds, and consequently also

sup_{t∈[0,T]} E|mtN|^{2p} ≤ 2^{p−1} ( c1^p + c2^p Kp ),
which concludes the proof of the estimates in (14) by choosing K sufficiently large. The other two
estimates follow easily from (17) and by applying the Gronwall inequality to (15), respectively.
Remark 3.2. Despite having the estimates provided in Lemma 3.4, we are unable to prove a mean-
field limit result for the interacting particle system (1) towards the nonlinear process (2) by means of
classical tools (see e.g. [33]). At this moment, we only postulate its corresponding mean-field equation
(3), which the numerical simulations indicate to be the correct limiting description.
Remark 4.1. If f were bounded, then an obvious way to obtain concentration is to use the estimate
‖ωfα‖L1(ρt) ≥ e^{−αf̄} to obtain

(d/dt) V(ρt) ≤ −( 2λ − dσ² e^{α(f̄ − f̲)} ) V(ρt).

However, since f̄ > f̲, we have that e^{α(f̄ − f̲)} → ∞ as α → ∞, and we would have to choose λ ≫ 1
sufficiently large in order to obtain concentration. In the next subsection, we will see that this is not
desirable, since α has to be chosen large to obtain a good approximation of the global minimizer.
In order to understand how ‖ωfα‖L1(ρt) evolves, we study its evolution given by

(d/dt) ‖ωfα‖L1(ρt) = αλ ∫ ( ∇f(x) − ∇f(mf[ρt]) ) · ( x − mf[ρt] ) ωfα dρt
   + (σ²/2) ∫ ( α²|∇f(x)|² − α Δf(x) ) |x − mf[ρt]|² ωfα dρt =: I1 + I2,

where we used the fact that

∫ ( x − mf[ρt] ) ωfα dρt = 0

for the first term.
Lemma 4.1. Under Assumption 4.1 for f and α ≥ c1, we have

(20)   (d/dt) ‖ωfα‖²L1(ρt) ≥ −b1 V(ρt),

with b1 = b1(α, λ, σ) = 2α e^{−2αf̲} (c0 σ² + 2λ cf).
Note that, from Assumption 4.1, we see that b1 → 0 as α → ∞.
Proof. From the assumptions on f, we obtain the following estimates:

I1 ≥ −αλ cf e^{−αf̲} ∫ |x − mf[ρt]|² dρt,

I2 ≥ (σ²/2) α(α − c1) ∫ |∇f(x)|² |x − mf[ρt]|² ωfα dρt − (σ²/2) α c0 e^{−αf̲} ∫ |x − mf[ρt]|² dρt.

When α ≥ c1, the first term in the estimate for I2 is non-negative. Hence, we obtain

(d/dt) ‖ωfα‖L1(ρt) ≥ −α e^{−αf̲} (c0 σ²/2 + λ cf) ∫ |x − mf[ρt]|² dρt ≥ −α e^{−2αf̲} (c0 σ² + 2λ cf) V(ρt) / ‖ωfα‖L1(ρt),

where the second inequality follows from (18). We obtain (20) from the fact that

(1/2) (d/dt) ‖ωfα‖²L1(ρt) = ‖ωfα‖L1(ρt) (d/dt) ‖ωfα‖L1(ρt) ≥ −α e^{−2αf̲} (c0 σ² + 2λ cf) V(ρt),
thereby concluding the proof.
We now have the ingredients to show the concentration of ρt . In particular, we show that the
estimates (19) and (20) provide the means to identify assumptions on the parameters α, λ and σ, for
which we obtain the convergence V (ρt ) → 0 as t → ∞ at an exponential rate.
Theorem 4.1. Let f satisfy Assumption 4.1. Furthermore, let the parameters α, λ and σ satisfy

b1 < 3/4,   2λ b0² − K − 2dσ² b0 e^{−αf̲} ≥ 0,

with K = V(ρ0) and b0 = ‖ωfα‖L1(ρ0). Then V(ρt) ≤ V(ρ0) e^{−qt} with

q := 2( λ − (dσ²/b0) e^{−αf̲} ) ≥ K/b0².
Remark 4.2. Since b1 → 0 and e^{−αf̲} → 0 as α → ∞, the set of parameters for which the above
conditions are satisfied is non-empty.
Proof of Theorem 4.1. Let γ := K/b0². Choose any ε > 0 and set

Tε := { t > 0 : V(ρs) e^{γs} < Kε for s ∈ [0, t) },

where Kε := K + ε. Then, by continuity of V(ρt), we get Tε ≠ ∅. Set T∗ε := sup Tε, and we claim that
T∗ε = ∞. Suppose not, i.e., T∗ε < ∞. Then this yields

lim_{t↗T∗ε} V(ρt) e^{γt} = Kε.
On the other hand, it follows from (20) that for t < T∗ε, we have

‖ωfα‖²L1(ρt) ≥ b0² − b1 ∫_0^t V(ρs) ds > b0² − b1 Kε ∫_0^t e^{−γs} ds ≥ b0² ( 1 − b1 Kε/K ) + (b1 Kε/γ) e^{−γT∗ε}.
Since b1 < 3/4 and Kε → K as ε → 0, there exists ε0 > 0 such that 1 − b1 Kε/K ≥ 1/4 for 0 < ε ≤ ε0.
Thus we obtain that for t < T∗ε with 0 < ε ≤ ε0,

‖ωfα‖L1(ρt) ≥ b0/2 + ξε   for some ξε > 0.
Inserting this into (19) gives

(d/dt) V(ρt) ≤ −2( λ − (dσ²/(b0 + 2ξε)) e^{−αf̲} ) V(ρt).

Applying Gronwall's inequality yields

V(ρt) e^{qε t} ≤ K,   where qε := 2( λ − (dσ²/(b0 + 2ξε)) e^{−αf̲} ),

for t < T∗ε with 0 < ε ≤ ε0. On the other hand, taking the limit t ↗ T∗ε in the above inequality gives

Kε = lim_{t↗T∗ε} V(ρt) e^{γt} ≤ lim_{t↗T∗ε} V(ρt) e^{qε t} ≤ K < Kε,
where we used the second condition, which gives qε > γ for all 0 < ε ≤ ε0 . This is a contradiction and
implies that T∗ε = ∞. Finally, by taking the limit ε → 0, we complete the proof.
For the second part of the statement, we first observe that the expectation of ρt satisfies
(21)   (d/dt) E(ρt) = (d/dt) ∫ x dρt = −λ ∫ ( x − mf[ρt] ) dρt.
Taking the absolute value of the equation above and then integrating in time yields

∫_0^t | (d/ds) E(ρs) | ds ≤ λ ∫_0^t ∫ |x − mf[ρs]| dρs ds ≤ 4(λ/b0) e^{−αf̲} V(ρ0) ∫_0^t e^{−qs} ds
   = (4λ/(q b0)) e^{−αf̲} V(ρ0) (1 − e^{−qt}) ≤ (4λ/(q b0)) e^{−αf̲} V(ρ0),

where we used the fact that ‖ωfα‖L1(ρt) ≥ b0/2 and the concentration estimate V(ρt) ≤ V(ρ0) e^{−qt}.
The previous estimate tells us that dE(ρt)/dt ∈ L1(0, ∞) and thus provides the existence of a point
x̃ ∈ Rd, possibly depending on ρ0, such that

x̃ = E(ρ0) + ∫_0^∞ (d/dt) E(ρt) dt = lim_{t→∞} E(ρt).

Moreover, since |E(ρt) − mf[ρt]| → 0 as t → ∞, we obtain lim_{t→∞} mf[ρt] = x̃.
Remark 4.3. A very important takeaway from the conditions provided in Theorem 4.1 is the fact that
λ and σ may be chosen independently of α. Indeed, if
(22) 2λb20 − K − 2dσ 2 b0 ≥ 0,
then the second condition of Theorem 4.1 is trivially satisfied since f ≥ 0 by assumption. Therefore,
when (22) is satisfied, consensus is achieved for arbitrarily large α satisfying b1 ≤ 3/4.
4.2. Approximate global minimizer. While the previous results provided a sufficient condition for
uniform consensus to occur, we will argue further that the point of consensus x̃ ∈ Rd may be made
arbitrarily close to the global minimizer x∗ of f , for f ∈ C 2 (Rd ) satisfying Assumption 4.1.
In the following, we assume further that f attains a unique global minimum at x∗ ∈ supp(ρ0 ).
Theorem 4.2. Let f satisfy Assumption 4.1. For any given 0 < ε0 ≪ 1 arbitrarily small, there exist
some α0 ≫ 1 and appropriate parameters (λ, σ) such that uniform consensus is obtained at a point
x̃ ∈ Bε0(x∗). More precisely, we have that ρt → δx̃ as t → ∞, with x̃ ∈ Bε0(x∗).
For the proof of Theorem 4.2, we will make use of the following auxiliary result.
Lemma 4.2. Let f satisfy Assumption 4.1 and let ρt satisfy

(1 − b(α)) ‖ωfα‖²L1(ρ0) ≤ ‖ωfα‖²L1(ρt)   for all t ≥ 0,

with 0 ≤ b(α) ≤ 1 and b(α) → 0 as α → ∞. Furthermore, assume that V(ρt) → 0 and E(ρt) → x̃
as t → ∞. Then for any given 0 < ε0 ≪ 1 arbitrarily small, there exists some α0 ≫ 1 such that
x̃ ∈ Bε0(x∗).
Proof. From the assumptions V(ρt) → 0 and E(ρt) → x̃ as t → ∞, we deduce

∫ e^{−αf} dρt − ∫_{Bδ(x̃)} e^{−αf} dρt = ∫_{{|x−x̃| ≥ δ}} e^{−αf} dρt ≤ ρt({|x − x̃| ≥ δ})
   ≤ (1/δ²) ∫_{{|x−x̃| ≥ δ}} |x − x̃|² dρt ≤ (2/δ²) ( V(ρt) + |E(ρt) − x̃|² ),

for any δ > 0. Furthermore, from Chebyshev's inequality, we know that ρt ⇀ δx̃ narrowly as t → ∞.
Thus, since f is continuous and bounded on Bδ(x̃), we may pass to the limit in t to obtain

lim_{t→∞} ∫_{Bδ(x̃)} e^{−αf} dρt = e^{−αf(x̃)}.
and consequently,

‖ωfα‖²L1(ρt) ≥ ‖ωfα‖²L1(ρ0) (1 − b1(α)).
We conclude the proof by choosing α ≥ α0 with α0 obtained from Lemma 4.2.
5.2. Porous media version of the evolution equation. One very common application of the
pseudo-inverse distribution χt is to study the behavior of the support supp(ρt ) of the corresponding
probability measure ρt . This is especially interesting when ρt has compact support. Unfortunately,
we do not have that in the present case due to the diffusion, which causes ρt to have full support in
R. This naturally leads to the idea of increasing the power of ρt in the diffusion term, inspired by the
porous media equation.[15] The evolution equation for ρt then becomes
(24)   ∂t ρt + ∂x(µt ρt) = ∂xx(κt ρt^p),

with porous medium exponent p ≥ 1. Notice that the previous model is included here for p = 1. The
derivation of the evolution equation for χt corresponding to this equation may be done analogously,
which leads to

(25)   ∂t χt + µt = −∂η( κt (∂η χt)^{−p} ).

Further investigation of the diffusion term results in

∂η( κt (∂η χt)^{−p} ) = ∂η( κt ρt^p ) = ρt^p ∂η κt + κt ∂η(ρt^p) = ρt^p ∂η κt + p κt ρt^{p−1} ∂η ρt
in (η, t) variables. For p > 1 we can do the following formal computations. Due to mass conservation
of ρt we assume a no-flux condition for (24), which in (x, t) variables reads

∂η( κt (∂η χt)^{−p} )(Ft(x)) = [ ρt^p ∂x κt + p κt ρt^{p−1} ∂x ρt ] ∂η χt(Ft(x)) = µt ρt ∂η χt(Ft(x)) = µt,

on the boundary points of supp(ρt). Therefore, restricting (25) onto the boundary points yields

(26)   ∂t χt(η) = −2 µt(η) = −2λ( χt(η) − mf[χt] )   for η ∈ {0, 1}.
As mf [χt ] is contained in the interior of supp(ρt ) by definition, µt is negative at the left boundary
point η = 0 and positive at the right boundary point η = 1. Hence, (26) implies the shrinking of
supp(ρt). In particular, one expects the concentration of χt at mf as t → ∞. We will check this
behavior numerically in the next subsection.
5.3. Discretization of the evolution equation for χt . To investigate the behavior of the pseudo-
inverse χt numerically, we use an implicit finite difference scheme. Following the ideas in [8] we denote
the discretized version of χt by χik , where the spatial discretization is indexed by k and the temporal
discretization by i. The spatial and temporal step sizes are denoted by h and τ , respectively. A
straightforward discretization of the general equation (25) yields
(27)   (χk^{i+1} − χk^i)/τ = −( κ(χ_{k+1}^{i+1}, mf^i) / (χ_{k+1}^{i+1} − χk^{i+1})^p − κ(χk^{i+1}, mf^i) / (χk^{i+1} − χ_{k−1}^{i+1})^p ) h^{p−1} + λ( mf^i − χk^{i+1} ),
for η ∈ (0, 1), where mf^i = mf[χ^i]. At the boundary points η = 0, 1 the expressions

(χk^i − χ_{k−1}^i)^{−p}   and   (χ_{k+1}^i − χk^i)^{−p}

are set to zero, respectively. As stopping criterion for the iteration procedure we use

‖χ^{i+1} − χ^i‖L2(0,1) < tol.
Since we expect the density ρt to concentrate close to the minimizer x∗ ∈ Rd of the cost function
f , the pseudo-inverse χt should converge towards the constant function with value x∗ ∈ Rd . This
causes problems in the computation of the fractions appearing in (27). Our workaround is to evaluate
the fractions up to a tolerance and set them artificially to zero if the denominator is too small. The
scheme is tested with the well-known Ackley benchmark function for global optimization problems
(see e.g. [3]) shown in Figure 1.
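To make the implicit time-stepping concrete, here is a sketch (our own, not the authors' code) that assembles the residual of (27) and hands it to a generic nonlinear solver. The grid size, the way mf[χ] is evaluated via the change of variables ∫ g dρt = ∫_0^1 g(χt(η)) dη, and the regularization threshold are illustrative choices rather than the ones used for the reported figures.

```python
import numpy as np
from scipy.optimize import fsolve

# Illustrative parameters (the text reports tau = 2.5e-3, alpha = 30, sigma = 0.8, tol = 1e-6).
lam, sigma, alpha, p = 1.0, 0.8, 30.0, 2
K, tau, reg, tol = 200, 2.5e-3, 1e-8, 1e-6       # grid points, time step, denominator cutoff, tolerance
h = 1.0 / K                                       # spatial step on (0, 1)

def ackley(x):                                    # 1d Ackley benchmark (cf. [3]); global minimizer x* = 0
    return -20.0 * np.exp(-0.2 * np.abs(x)) - np.exp(np.cos(2.0 * np.pi * x)) + 20.0 + np.e

def kappa(x, m):                                  # kappa(x, m) = (sigma^2/2) |x - m|^2
    return 0.5 * sigma ** 2 * (x - m) ** 2

def mf(chi):                                      # m_f[chi]: weighted mean in the eta-variables
    w = np.exp(-alpha * (ackley(chi) - ackley(chi).min()))
    return np.sum(chi * w) / np.sum(w)

def diffusion(chi, m):
    """Discrete -h^{p-1} d_eta( kappa (d_eta chi)^{-p} ); boundary terms are set to zero."""
    F = np.zeros(K)                               # F[k] ~ kappa(chi_k, m) / (chi_k - chi_{k-1})^p
    d = chi[1:] - chi[:-1]
    ok = d > reg                                  # regularization: drop fractions with tiny denominators
    F[1:][ok] = kappa(chi[1:], m)[ok] / d[ok] ** p
    G = np.append(F[1:], 0.0)                     # G[k] ~ kappa(chi_{k+1}, m) / (chi_{k+1} - chi_k)^p
    return -h ** (p - 1) * (G - F)

def residual(chi_new, chi_old, m):
    """Residual of the implicit scheme (27); a root chi_new gives the next time level."""
    return (chi_new - chi_old) / tau - diffusion(chi_new, m) - lam * (m - chi_new)

chi = np.linspace(-2.0, 4.0, K)                   # pseudo-inverse of the uniform density on [-2, 4]
for _ in range(2000):
    m = mf(chi)                                   # weighted mean from the previous iterate
    chi_new = fsolve(residual, chi, args=(chi, m))
    done = np.sqrt(h) * np.linalg.norm(chi_new - chi) < tol   # discrete L2(0,1) stopping criterion
    chi = chi_new
    if done:
        break
print("consensus point ~", chi.mean())            # should lie near the global minimizer 0
```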
5.4. Particle approximation. To compare the results of the extension p > 1 to the scheme in [31],
we are interested in a particle scheme corresponding to the evolution equation for p = 2. Note that
in contrast to the pseudo-inverse distribution case, we are not restricted to one dimension here. We
derive a numerical scheme by rewriting (24) as
∂t ρt = −∇x · (µt ρt) + Δ(κt ρt²) = ∇x · [ −µt ρt + ρt ( ∇x(κt ρt) + κt ∇x ρt ) ].

The terms ∇x(κt ρt) and ∇x ρt are mollified in the spirit of [27] with the help of a mollifier ϕε,

∇x(κt ρt) ≈ ∇x ϕε ∗ (κt ρt)   and   ∇x ρt ≈ ∇x ϕε ∗ ρt.

Altogether this yields the approximate deterministic microscopic system

(28)   Ẋti = −λ(Xti − mt) + (σ/N) Σ_{j=1}^N ∇x ϕε(Xti − Xtj) [ |Xtj − mt|^{2p} + |Xti − mt|^{2p} ],

for i = 1, . . . , N, using the notation from the Introduction.
Remark 5.1. Note that the scheme (28) is deterministic in contrast to the scheme (1) for p = 1.
Unfortunately, it is not trivial to extend the particle scheme to p > 2.
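A minimal sketch of one explicit Euler step of (28) is given below; it is our own illustration, it replaces the compactly supported mollifier of the next subsection by a Gaussian purely for brevity, and all parameter names and values are assumptions rather than the authors' choices.

```python
import numpy as np

def grad_phi_gauss(Z, eps):
    """Gradient of a Gaussian mollifier phi_eps; a stand-in for the compactly supported
    bump used in the text, chosen here only to keep the sketch short."""
    d = Z.shape[-1]
    phi = np.exp(-np.sum(Z ** 2, axis=-1, keepdims=True) / (2 * eps ** 2))
    phi /= (2 * np.pi * eps ** 2) ** (d / 2)
    return -Z / eps ** 2 * phi

def step_porous(X, f, alpha, lam, sigma, eps, p, dt):
    """One explicit Euler step of the deterministic particle scheme (28)."""
    N = X.shape[0]
    fx = f(X)
    w = np.exp(-alpha * (fx - fx.min()))
    m = (X * w[:, None]).sum(axis=0) / w.sum()          # weighted mean m_t from (1b)
    r = np.sum((X - m) ** 2, axis=1) ** p               # |X^i - m_t|^{2p}
    Z = X[:, None, :] - X[None, :, :]                   # pairwise differences X^i - X^j
    G = grad_phi_gauss(Z, eps)                          # grad phi_eps(X^i - X^j)
    interaction = (G * (r[None, :, None] + r[:, None, None])).sum(axis=1)
    return X + dt * (-lam * (X - m) + (sigma / N) * interaction)

# Hypothetical usage on the 2d sphere function (minimizer at the origin):
# X = np.random.default_rng(0).uniform(-3, 3, (500, 2))
# for _ in range(2000):
#     X = step_porous(X, lambda Y: np.sum(Y**2, axis=1), 30.0, 1.0, 0.8, 0.5, 2, 2.5e-3)
```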
5.5. Numerical results. In the following, numerical results corresponding to the above discretiza-
tions are reported. We use 200 grid points for the spatial discretization of χt and 500 particles for the
particle approximation schemes. Further parameters are fixed as
τ = 2.5 · 10−3 , α = 30, σ = 0.8, tol = 10−6 .
The mollifier is chosen to be ϕε(x) = ε^{−d} ϕ(x/ε), where

ϕ(x) = (1/Zd) exp( 1/(|x|² − 1) )  if |x| < 1,   and   ϕ(x) = 0  otherwise,

with normalizing constant Zd.
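In one dimension the mollifier and its normalizing constant can be evaluated directly; the snippet below (ours, assuming d = 1) computes Z1 by quadrature.

```python
import numpy as np
from scipy.integrate import quad

def bump(x):
    """Unnormalized bump exp(1/(x^2 - 1)) on (-1, 1), zero outside (scalar argument)."""
    return float(np.exp(1.0 / (x * x - 1.0))) if abs(x) < 1.0 else 0.0

Z1, _ = quad(bump, -1.0, 1.0)        # normalizing constant Z_d in dimension d = 1

def phi_eps(x, eps):
    """Rescaled mollifier phi_eps(x) = eps^{-1} phi(x/eps) in one dimension."""
    y = np.atleast_1d(np.asarray(x, dtype=float)) / eps
    out = np.zeros_like(y)
    inside = np.abs(y) < 1.0
    out[inside] = np.exp(1.0 / (y[inside] ** 2 - 1.0))
    return out / (Z1 * eps)
```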
Figure 2 shows the progression of χt over time. On the left side the case p = 1 is depicted. The tails
mentioned in the discussion of (24) can be seen near the boundary. On the right side the diffusion
exponent is p = 2; in this case no tails occur, as expected.
In [31], the following scheme with an approximate Heaviside function was proposed:

dXti = −λ(Xti − mt) Hε( f(Xti) − f(mt) ) dt + √2 σ |Xti − mt| dWti,
where mt is as given in (1b). Initially, the Heaviside function was included to ensure that the particles
do not concentrate abruptly. This is essential in cases where the weight parameter α > 0 is chosen
too small, thereby yielding a rough approximation of the minimizer at the start of the simulation. In
fact, the presence of the Heaviside function prevents particles that attain function values smaller than
the function value at the weighted average, i.e., f(Xti) < f(mt), from drifting towards mt. In those cases,
only the diffusion part is active.
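A hedged sketch of one Euler-Maruyama step of this Heaviside-regularized scheme (our own discretization, using the smooth Hε introduced further below) could look as follows; the parameter eps and the stabilizing shift of the weights are illustrative assumptions.

```python
import numpy as np
from scipy.special import erf

def step_heaviside(X, f, alpha, lam, sigma, dt, rng, eps=1e-2):
    """One Euler-Maruyama step of the Heaviside-regularized scheme recalled from [31]
    (our own discretization choices; eps and all parameters are illustrative)."""
    fx = f(X)
    w = np.exp(-alpha * (fx - fx.min()))                   # exponential weights, shifted for stability
    m = (X * w[:, None]).sum(axis=0) / w.sum()             # weighted mean m_t from (1b)
    gate = 0.5 * (1.0 + erf((fx - f(m[None, :])) / eps))   # smooth Heaviside H_eps
    dist = np.linalg.norm(X - m, axis=1, keepdims=True)
    dW = rng.normal(scale=np.sqrt(dt), size=X.shape)
    return X - lam * (X - m) * gate[:, None] * dt + np.sqrt(2.0) * sigma * dist * dW
```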
An analogous particle scheme for the porous medium equation with p = 2 reads

Ẋti = −λ(Xti − mt) Hε( f(Xti) − f(mt) ) + (σ/N) Σ_{j=1}^N ∇x ϕε(Xti − Xtj) [ |Xtj − mt|^{2p} + |Xti − mt|^{2p} ].
For both schemes a smooth approximation of the Heaviside function of the form

Hε(x) = ( 1 + erf(x/ε) ) / 2
is used. We compare the results with and without the Heaviside function in Figure 3. In these
simulations, we see the damping effect of the Heaviside function; the simulations without the Heaviside
function converge faster. Due to the large value of α, the minimizer is approximated well by mf, and
thus the concentration happens very close to the actual minimum of the objective function.
The graphs show the L2 -distance of Xt (left) and χt (right) to the known minimizer x∗ or equiv-
alently the 2-Wasserstein distance between the solutions of the mean-field equation and the particle
scheme to the global consensus at δx∗ . The schemes with nonlinear diffusion p = 2 converge faster than
their corresponding schemes with linear diffusion. Nevertheless, for practical applications with a large
number of particles, the scheme with linear diffusion is more reasonable due to shorter computation
times. In fact, in each iteration of the scheme (28) the convolution of all particles has to be computed.
The error of the simulation for χt is smaller than the one for Xt at equal times. The linear graphs
with respect to the logarithmic scaling of the y-axis in Figure 3 indicate the exponential convergence
shown in the theoretical section (cf. Theorem 4.1). The stochasticity influencing the schemes for p = 1
can be observed in the graphs in Figure 3 (left).

Figure 3. Ackley benchmark in one dimension. L2 error of the solution with respect to the minimizer x∗ or, equivalently, the 2-Wasserstein distance between the solution and δx∗ for the different benchmarks. Left: Particle scheme. The stochastic influence is visible for p = 1, as the graphs rely on data of one realization. Right: Pseudo-inverse distribution.
In contrast to the pseudo-inverse distribution function, which is only available in one dimension,
the particle scheme can be easily generalized to higher dimensions. We conclude the manuscript with
some numerical results obtained with the particle scheme in two dimensions applied to the Ackley
benchmark. In Figure 4 (left) we see the surface and contour plot of the Ackley function in two
dimensions. In Figure 4 (right), the convergence of the different particle schemes is illustrated. For
the stochastic scheme with p = 1 the data is averaged over 100 Monte Carlo simulations. The graphs
are in good agreement with the corresponding graphs of the pseudo-inverse distribution function in
one dimension. Due to the averaging, the stochastic influence that can be seen in Figure 3 (left)
disappears.

Figure 4. Ackley benchmark function in two dimensions. Left: Surface and contour plot of the function in two dimensions. Right: Convergence of the particle schemes. For the stochastic cases with p = 1, the average of 100 Monte Carlo simulations is shown.
Acknowledgments
JAC was partially supported by the Royal Society through a Wolfson Research Merit Award and by
EPSRC grant number EP/P031587/1. Y-PC was supported by the Alexander von Humboldt Foundation
through the Humboldt Research Fellowship for Postdoctoral Researchers. Y-PC was also supported
by NRF grants (NRF-2017R1C1B2012918 and 2017R1A4A1014735). CT was partially supported by a
'Kurzstipendium für Doktorandinnen und Doktoranden' (a short-term scholarship for doctoral students)
of the German Academic Exchange Service (DAAD).
OT is thankful to Jim Portegies for stimulating discussions.
References
[1] Emile Aarts and Jan Korst. Simulated annealing and Boltzmann machines. New York, NY; John Wiley and Sons
Inc., 1988.
[2] Ludwig Arnold. Stochastic differential equations. New York, 1974.
[3] Omid Askari-Sichani and Mahdi Jalili. Large-scale global optimization through consensus of opinions over complex
networks. Complex Adaptive Systems Modeling, 1(1):1–18, 2013.
[4] Thomas Back, David B. Fogel, and Zbigniew Michalewicz. Handbook of evolutionary computation. IOP Publishing
Ltd., 1997.
[5] Nicola Bellomo, Abdelghani Bellouquid, and Damian Knopoff. From the microscale to collective crowd dynamics.
Multiscale Modeling & Simulation, 11(3):943–963, 2013.
[6] Andrea L. Bertozzi, Jesús Rosado, Martin B. Short, and Li Wang. Contagion shocks in one dimension. J. Stat.
Phys., 158(3):647–664, 2015.
[7] Leonora Bianchi, Marco Dorigo, Luca Maria Gambardella, and Walter J Gutjahr. A survey on metaheuristics for
stochastic combinatorial optimization. Natural Computing: an international journal, 8(2):239–287, 2009.
[8] Adrien Blanchet, Vincent Calvez, and José A. Carrillo. Convergence of the mass-transport steepest descent scheme
for the subcritical Patlak-Keller-Segel model. SIAM Journal on Numerical Analysis, 46(2):691–721, 2008.
[9] Christian Blum and Andrea Roli. Metaheuristics in combinatorial optimization: Overview and conceptual compar-
ison. ACM Computing Surveys (CSUR), 35(3):268–308, 2003.
[10] François Bolley. Separability and completeness for the Wasserstein distance. In Séminaire de probabilités XLI, pages
371–377. Springer, 2008.
[11] François Bolley, Ivan Gentil, and Arnaud Guillin. Uniform convergence to equilibrium for granular media. Arch.
Ration. Mech. Anal., 208(2):429–445, 2013.
[12] José A. Carrillo, Young-Pil Choi, and Maxime Hauray. The derivation of swarming models: mean-field limit and
Wasserstein distances. In Collective dynamics from bacteria to crowds, volume 553 of CISM Courses and Lectures,
pages 1–46. Springer, Vienna, 2014.
[13] José A. Carrillo, Massimo Fornasier, Jesús Rosado, and Giuseppe Toscani. Asymptotic flocking dynamics for the
kinetic Cucker-Smale model. SIAM Journal on Mathematical Analysis, 42(1):218–236, 2010.
[14] José A. Carrillo, Massimo Fornasier, Giuseppe Toscani, and Francesco Vecil. Particle, kinetic, and hydrodynamic
models of swarming. In Mathematical modeling of collective behavior in socio-economic and life sciences, Model.
Simul. Sci. Eng. Technol., pages 297–336. Birkhäuser Boston, Inc., Boston, MA, 2010.
[15] José A. Carrillo, Maria Pia Gualdani, and Giuseppe Toscani. Finite speed of propagation in porous media by mass
transportation methods. Comptes Rendus Mathematique, 338(10):815–818, 2004.
[16] José A. Carrillo, Yanghong Huang, and Stephan Martin. Explicit flock solutions for quasi-Morse potentials. European
Journal of Applied Mathematics, 25(05):553–578, 2014.
[17] José A Carrillo, Yanghong Huang, and Stephan Martin. Nonlinear stability of flock solutions in second-order swarm-
ing models. Nonlinear Anal. Real World Appl., 17:332–343, 2014.
[18] Felipe Cucker and Steve Smale. On the mathematics of emergence. Jpn. J. Math., 2(1):197–227, 2007.
[19] Amir Dembo and Ofer Zeitouni. Large deviations techniques and applications, volume 38. Springer Science & Business
Media, 2009.
[20] Marco Dorigo and Christian Blum. Ant colony optimization theory: A survey. Theoretical computer science,
344(2):243–278, 2005.
[21] Richard Durrett. Stochastic calculus: a practical introduction, volume 6. CRC press, 1996.
[22] David Gilbarg and Neil S Trudinger. Elliptic partial differential equations of second order. Springer, 2015.
[23] Seung-Yeal Ha and Eitan Tadmor. From particle to kinetic and hydrodynamic descriptions of flocking. Kinet. Relat.
Models, 1(3):415–435, 2008.
[24] Richard Holley and Daniel Stroock. Simulated annealing via Sobolev inequalities. Comm. Math. Phys., 115(4):553–
569, 1988.
[25] Richard A Holley, Shigeo Kusuoka, and Daniel W Stroock. Asymptotics of the spectral gap with applications to the
theory of simulated annealing. Journal of functional analysis, 83(2):333–347, 1989.
[26] James Kennedy. Particle swarm optimization. In Encyclopedia of Machine Learning, pages 760–766. Springer, 2010.
[27] Axel Klar and Sudarshan Tiwari. A multiscale meshfree method for macroscopic approximations of interacting
particle systems. Multiscale Modeling & Simulation, 12(3):1167–1192, 2014.
[28] Theodore Kolokolnikov, José A. Carrillo, Andrea Bertozzi, Razvan Fetecau, and Mark Lewis. Emergent behaviour
in multi-particle systems with non-local interactions [Editorial]. Phys. D, 260:1–4, 2013.
[29] Sebastien Motsch and Eitan Tadmor. Heterophilious dynamics enhances consensus. SIAM Rev., 56(4):577–621, 2014.
[30] Reza Olfati-Saber and Richard M. Murray. Consensus problems in networks of agents with switching topology and
time-delays. IEEE Trans. Automat. Control, 49(9):1520–1533, 2004.
[31] René Pinnau, Claudia Totzeck, Oliver Tse, and Stephan Martin. A consensus-based model for global optimization
and its mean-field limit. Math. Mod. Meth. Appl. Sci., 27(01):183–204, 2017.
[32] Colin Reeves. Genetic algorithms. Springer, 2003.
[33] Alain-Sol Sznitman. Topics in propagation of chaos. In École d'été de probabilités de Saint-Flour XIX - 1989,
pages 165–251. Springer, 1991.
[34] Giuseppe Toscani. Kinetic models of opinion formation. Commun. Math. Sci., 4(3):481–496, 2006.
[35] Cédric Villani. Topics in Optimal Transportation (Graduate Studies in Mathematics, Vol. 58). American Mathe-
matical Society, 2003.
[36] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
[37] James von Brecht, Theodore Kolokolnikov, Andrea L. Bertozzi, and Hui Sun. Swarming on random graphs. J. Stat.
Phys., 151(1-2):150–173, 2013.
(José A. Carrillo)
Department of Mathematics, Imperial College London,
London SW7 2AZ, United Kingdom
E-mail address: carrillo@imperial.ac.uk
(Young-Pil Choi)
Department of Mathematics and Institute of Applied Mathematics, Inha University,
Incheon 402-751, Republic of Korea
E-mail address: ypchoi@inha.ac.kr
(Claudia Totzeck)
Department of Mathematics, Technische Universität Kaiserslautern,
Erwin-Schrödinger-Strasse, 67663 Kaiserslautern, Germany
E-mail address: totzeck@mathematik.uni-kl.de
(Oliver Tse)
Department of Mathematics and Computer Science, Eindhoven University of Technology,
P.O. Box 513, 5600MB Eindhoven, The Netherlands
E-mail address: o.t.c.tse@tue.nl