

SIAM J. OPTIM.
© Paper submitted to the Society for Industrial and Applied Mathematics
Vol. x, No. x, pp. xxx-xxx

MESH ADAPTIVE DIRECT SEARCH ALGORITHMS
FOR CONSTRAINED OPTIMIZATION

CHARLES AUDET AND J. E. DENNIS, JR.

Abstract. This paper introduces the Mesh Adaptive Direct Search (MADS) class of algorithms
for nonlinear optimization. MADS extends the Generalized Pattern Search (GPS) class by allowing
local exploration, called polling, in a dense set of directions in the space of optimization variables. This
means that under certain hypotheses, including a weak constraint qualification due to Rockafellar,
MADS can treat constraints by the extreme barrier approach of setting the objective to infinity for
infeasible points and treating the problem as unconstrained. The main GPS convergence result is
to identify limit points where the Clarke generalized derivatives are nonnegative in a finite set of
directions, called refining directions. Although in the unconstrained case nonnegative combinations
of these directions span the whole space, the fact that there can only be finitely many GPS refining
directions limits rigorous justification of the barrier approach to finitely many constraints for GPS.
The MADS class of algorithms extends this result; the set of refining directions may even be dense in
$\mathbb{R}^n$, although we give an example where it is not.
We present an implementable instance of MADS, and we illustrate and compare it with GPS on
some test problems. We also illustrate the limitation of our results with examples.

Key words. Mesh adaptive direct search algorithms (MADS), convergence analysis, constrained
optimization, nonsmooth analysis, Clarke derivatives, hypertangent, contingent cone.

1. Introduction. We present and analyze a new Mesh Adaptive Direct Search
(MADS) class of algorithms for minimizing a nonsmooth function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$
under general constraints $x \in \Omega \subseteq \mathbb{R}^n$, $\Omega \neq \emptyset$. For the form of the algorithm given here,
the feasible region $\Omega$ may be defined through blackbox constraints given by an oracle,
such as a computer code, that returns a yes or no indicating whether or not a specified
trial point is feasible.
In the unconstrained case, where $\Omega = \mathbb{R}^n$, this new class of algorithms occupies a
position somewhere between the Generalized Pattern Search (GPS) [22, 6] algorithms
and the Coope and Price frame-based methods [10]. A key advantage of MADS
over GPS is that local exploration of the space of variables is not restricted to a
finite number of directions (called poll directions). This is the primary drawback of
GPS algorithms in our opinion, and our main motivation in defining MADS was to
overcome this restriction. MADS algorithms are frame-based methods. We propose
a less general choice of frames than the choices allowed by Coope and Price, but
they are specifically aimed at ensuring a dense set of polling directions, and they are
effective and easy to implement. We illustrate with an example algorithm that we
call LTMADS because it is based on a random lower triangular matrix.
The convergence analysis here is based on Clarke's calculus [8] for nonsmooth
functions. It evolved from our previous work on GPS [3] where we give a hierarchy of
convergence results for GPS that show the limitations inherent in the restriction to
finitely many directions. Specifically, we show there that for unconstrained optimiza-
tion, GPS produces a limit point at which the gradient is zero if the function at that

Work of the first author was supported by FCAR grant NC72792 and NSERC grant 239436-

01, and both authors were supported by AFOSR F49620-01-1-0013, The Boeing Company, Sandia
LG-4253, ExxonMobil, and the LANL Computer Science Institute (LACSI) contract 03891-99-23.
GERAD and Département de Mathématiques et de Génie Industriel, École Polytechnique
de Montréal, C.P. 6079, Succ. Centre-ville, Montréal (Québec), H3C 3A7 Canada
(Charles.Audet@gerad.ca, http://www.gerad.ca/Charles.Audet)
Computational and Applied Mathematics Department, Rice University - MS 134, 6100 Main

Street, Houston, Texas, 77005-1892 (dennis@caam.rice.edu, http://www.caam.rice.edu/dennis)



point is strictly differentiable [16], but if the function $f$ is only Lipschitz near such a
limit point, then Clarke's generalized directional derivatives [8] are provably nonneg-
ative only for a finite set of directions $D \subset \mathbb{R}^n$ whose nonnegative linear combinations
span the whole space:
$$f^\circ(\hat{x}; d) := \limsup_{y \to \hat{x},\; t \downarrow 0} \frac{f(y + td) - f(y)}{t} \;\ge\; 0 \quad \text{for all } d \in D. \tag{1.1}$$
$D$ is called the set of refining directions. This result (1.1) for GPS is not as strong as
stating that the generalized derivative is nonnegative for every direction in $\mathbb{R}^n$, i.e.,
that the limit point is a Clarke stationary point, or equivalently that $0 \in \partial f(\hat{x})$, the
generalized gradient of $f$ at $\hat{x}$ defined by
$$\partial f(\hat{x}) := \left\{ s \in \mathbb{R}^n : f^\circ(\hat{x}; v) \ge v^T s \ \text{for all } v \in \mathbb{R}^n \right\}. \tag{1.2}$$
Example F in [2] shows that indeed the GPS algorithm does not necessarily produce
a Clarke stationary point for Lipschitz functions because of the restriction to finitely
many poll directions. For the unconstrained case, this restriction can be overcome by
assuming more smoothness for $f$, e.g., strict differentiability at $\hat{x}$ [3]. Strict differen-
tiability is just the requirement that the generalized gradient is a singleton, i.e., that
$\partial f(\hat{x}) = \{\nabla f(\hat{x})\}$, in addition to the requirement that $f$ is Lipschitz near $\hat{x}$. However,
the directional dependence of GPS in the presence of even bound constraints cannot
be overcome by any amount of smoothness, by using penalty functions, or by the use
of the more flexible filter approach for handling constraints [4].
The MADS algorithms can generate a dense set of polling directions in $\mathbb{R}^n$. The
set of directions for which we can show that the Clarke generalized derivatives are
nonnegative, the refining directions, is a subset of this dense set. This does not nec-
essarily ensure Clarke stationarity, but is stronger than if the poll directions belonged
to a fixed finite set, as is the case with GPS.
Besides the advantages for the unconstrained case of a dense set of polling di-
rections, this also allows MADS to treat a wide class of nonlinear constraints by the
barrier approach. By this we mean that the algorithm is not applied directly to
$f$ but to the barrier function $f_\Omega$, defined to be equal to $f$ on $\Omega$ and $+\infty$ outside
$\Omega$. This way of rejecting infeasible points was shown to be effective for GPS with
linear constraints by Lewis and Torczon [17] if one included in the poll directions the
tangent cone generators of the feasible region at boundary points near an iterate. For
LTMADS, no special effort is needed for the barrier approach to be provably effective
with probability 1 on constraints satisfying a reasonable constraint qualification due
to Rockafellar [21] that there exists a hypertangent vector at the limit point. A key
advantage of the barrier approach is that one can avoid expensive function calls to f
whenever a constraint is violated. Indeed, the question of feasibility of a trial point
needs only a yes or no answer - the constraints do not need to be given by a known
algebraic condition.
Thus the class of algorithms presented here differs significantly from previous GPS
extensions [4, 18] to nonlinear constraints. Treating constraints as we do motivates
us to use the generalization of the Clarke derivative presented in Jahn [14], where the
evaluation of $f$ is restricted to points in the domain $\Omega$. Thus we use the following
definition of the Clarke generalized derivative at $\hat{x} \in \Omega$ in the direction $v \in \mathbb{R}^n$:
$$f^\circ(\hat{x}; v) := \limsup_{\substack{y \to \hat{x},\ y \in \Omega \\ t \downarrow 0,\ y + tv \in \Omega}} \frac{f(y + tv) - f(y)}{t}. \tag{1.3}$$
Both definitions (1.1) and (1.3) coincide when $\Omega = \mathbb{R}^n$ or when $\hat{x} \in \operatorname{int}(\Omega)$.


The main theoretical objective of the paper is to show that under appropriate
assumptions, any MADS algorithm produces a constrained Clarke stationary point,
i.e., a limit point $\hat{x} \in \Omega$ satisfying the following necessary optimality condition:
$$f^\circ(\hat{x}; v) \ge 0 \quad \text{for all } v \in T_\Omega^{Cl}(\hat{x}), \tag{1.4}$$
where $T_\Omega^{Cl}(\hat{x})$ is the Clarke tangent cone to $\Omega$ at $\hat{x}$ (see [8] or Definition 3.5).
The paper is organized as follows. Section 2 presents the theoretical MADS
algorithm class and Section 3 contains its convergence analysis. We show that (1.4)
holds, and discuss its consequences when the algorithm is applied to an unconstrained
problem, or when the set $\Omega$ is regular in the sense of Definition 3.7 or [8]. We give
some constraint qualification conditions ensuring that MADS produces a contingent
KKT stationary point. Two implementable instances are proposed and analyzed in
Section 4. Numerical experiments are conducted in Section 5 to compare MADS with
standard GPS for an unconstrained, a bound constrained, a disk constrained, and a
nasty exponentially constrained optimization problem. On an artificial example where
GPS and Nelder-Mead are well known to stagnate, we show that MADS reaches the
global optimum. We give a comparison on a parameter fitting problem in catalytic
combustion kinetics on which we know that GPS performs well [13], and we give an
example illustrating the power of being able to handle more constraints by the barrier
approach. The final example shows the value of randomly generated polling directions
for a problem with an ever narrower feasible region.
Notation. $\mathbb{R}$, $\mathbb{Z}$ and $\mathbb{N}$ respectively denote the sets of real numbers, integers, and
nonnegative integers. For $x \in \mathbb{R}^n$ and $\delta \in \mathbb{R}_+$, $B_\delta(x)$ denotes the open ball of radius
$\delta$ centered at $x$. For a matrix $D$, the notation $d \in D$ indicates that $d$ is a column of
$D$.
2. Mesh Adaptive Direct Search algorithms. Given an initial iterate $x_0 \in \Omega$,
a MADS algorithm attempts to locate a minimizer of the function $f$ over $\Omega$ by
evaluating $f$ at some trial points. The algorithm does not require the use or
approximation of derivatives of $f$. This is useful when $f$ is contaminated with noise,
or when $f$ is unavailable, does not exist, or cannot be accurately estimated, or
when there are several local optima. MADS is an iterative algorithm where at each
iteration (the iteration number is denoted by the index $k$) a finite number of trial
points are generated and their objective function values are compared with the current
incumbent value $f_\Omega(x_k)$, i.e., the best feasible objective function value found so far.
Each of these trial points lies on the current mesh, constructed from a finite fixed set
of $n_D$ directions $D \subset \mathbb{R}^n$ scaled by a mesh size parameter $\Delta^m_k \in \mathbb{R}_+$.
There are two restrictions on the set $D$. First, $D$ must be a positive spanning
set [11], i.e., nonnegative linear combinations of its elements must span $\mathbb{R}^n$. Second,
each direction $d_j \in D$ (for $j = 1, 2, \ldots, n_D$) must be the product $G z_j$ of some fixed
nonsingular generating matrix $G \in \mathbb{R}^{n \times n}$ by an integer vector $z_j \in \mathbb{Z}^n$. For conve-
nience, the set $D$ is also viewed as a real $n \times n_D$ matrix. Similarly, the matrix whose
columns are the $z_j$, for $j = 1, 2, \ldots, n_D$, is denoted by $Z$; we can therefore use matrix
multiplication to write $D = GZ$. This is all in common with GPS.
Definition 2.1. At iteration $k$, the current mesh is defined to be the following
union:
$$M_k = \bigcup_{x \in S_k} \left\{ x + \Delta^m_k D z : z \in \mathbb{N}^{n_D} \right\},$$

where Sk is the set of points where the objective function f had been evaluated by the
start of iteration k.
In the definition above, the mesh is defined to be a union of sets over $S_k$. Defining
the mesh this way ensures that all previously visited points lie on the mesh, and that
new trial points can be selected around any of them. This definition of the mesh is
identical to the one in [4] and generalizes the one in [3].
The mesh is conceptual in the sense that it is never actually constructed. In
practice, one can easily make sure that the strategy for generating trial points is such
that they all belong to the mesh. One simply has to verify in Definition 2.1 that x
belongs to Sk and that z is an integer vector. The objective of the iteration is to find
a trial mesh point with a lower objective function value than the current incumbent
value f (xk ). Such a trial point is called an improved mesh point, and the iteration
is called a successful iteration. There are no sufficient decrease requirements.
The evaluation of $f_\Omega$ at a trial point $x$ is done as follows. First, the constraints
defining $\Omega$ are tested to determine if $x$ is feasible or not. Indeed, since some of the
constraints defining $\Omega$ might be expensive or inconvenient to test, one would order
the constraints to test the easiest ones first. If $x \notin \Omega$, then $f_\Omega(x)$ is set to $+\infty$
without evaluating $f(x)$. On the other hand, if $x \in \Omega$, then $f(x)$ is evaluated. This
remark may seem obvious, but it saves computation, and it is needed in the proof of
Theorem 3.12.
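To make the extreme barrier concrete, the evaluation of $f_\Omega$ described above can be sketched in a few lines of Python. This is only an illustration of the remark, with hypothetical helpers `f` and `is_feasible` standing for the blackbox objective and the yes/no feasibility oracle.

```python
import math

def f_omega(x, f, is_feasible):
    """Extreme barrier evaluation of the objective at a trial point x.

    `f` is the (possibly expensive) objective and `is_feasible` is the yes/no
    oracle describing Omega; both are assumed to be supplied by the user.
    """
    # Test feasibility first so that the expensive objective is never
    # evaluated at infeasible trial points.
    if not is_feasible(x):
        return math.inf
    return f(x)
```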
Each iteration is divided into two steps. The first, called the search step, has
the same flexibility as in GPS. It allows evaluation of $f$ at any finite number of mesh
points. Any strategy can be used in the search step to generate a finite number of
trial mesh points. Restricting the search points to lie on the mesh is a way in which
MADS is less general than the frame methods of Coope and Price [10]. The search
is said to be empty when no trial points are considered. The drawback to the search
flexibility is that it cannot be used in the convergence analysis except to provide
counterexamples as in [2]. More discussion of search steps is given in [1, 19, 6].
When an improved mesh point is generated, then the iteration may stop, or it
may continue if the user hopes to find a better improved mesh point. In either case,
the next iteration will be initiated with a new incumbent solution $x_{k+1} \in \Omega$ with
$f_\Omega(x_{k+1}) < f_\Omega(x_k)$ and with a mesh size parameter $\Delta^m_{k+1}$ equal to or larger than $\Delta^m_k$
(the exact rules for updating this parameter are presented below, and they are the
same as for GPS). Coarsening the mesh when improvements in f are obtained can
speed convergence.
Whenever the search step fails to generate an improved mesh point, the second
step, called the poll, is invoked before terminating the iteration. The difference
between the MADS and the GPS algorithms lies in this step. For this reason, our
numerical comparisons in the sequel use empty, or very simple, search steps in order
not to confound the point we wish to make about the value of the MADS poll step.
When the iteration fails in generating an improved mesh point, then the next
iteration is initiated from any point $x_{k+1} \in S_{k+1}$ with $f_\Omega(x_{k+1}) = f_\Omega(x_k)$; though
there is usually a single such incumbent solution, and then $x_{k+1}$ is set to $x_k$. The
mesh size parameter $\Delta^m_{k+1}$ is reduced to increase the mesh resolution, and therefore
to allow the evaluation of $f$ at trial points closer to the incumbent solution.
More precisely, given a fixed rational number $\tau > 1$, and two integers $w^- \le -1$
and $w^+ \ge 0$, the mesh size parameter is updated as follows:
$$\Delta^m_{k+1} = \tau^{w_k} \Delta^m_k \quad \text{for some } w_k \in
\begin{cases} \{0, 1, \ldots, w^+\} & \text{if an improved mesh point is found} \\
\{w^-, w^- + 1, \ldots, -1\} & \text{otherwise.} \end{cases} \tag{2.1}$$
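A minimal Python sketch of update rule (2.1) follows; the caller is assumed to choose the exponent $w_k$ within the allowed range, which the sketch merely checks.

```python
def update_mesh_size(delta_m, tau, w_k, improved, w_minus, w_plus):
    """Update the mesh size parameter Delta^m_k according to rule (2.1).

    delta_m : current mesh size parameter
    tau     : fixed rational number greater than 1
    w_k     : integer exponent chosen by the caller
    improved: True if an improved mesh point was found at this iteration
    """
    # Check that the chosen exponent lies in the range allowed by (2.1).
    if improved:
        assert 0 <= w_k <= w_plus
    else:
        assert w_minus <= w_k <= -1
    return (tau ** w_k) * delta_m
```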

Everything up to this point in the section applies to both GPS and MADS. We
now present the key difference between both classes of algorithms. For MADS, we
introduce the poll size parameter $\Delta^p_k \in \mathbb{R}_+$ for iteration $k$. This new parameter
dictates the magnitude of the distance from the trial points generated by the poll
step to the current incumbent solution $x_k$. In GPS, there is a single parameter to
represent these quantities: $\Delta_k = \Delta^p_k = \Delta^m_k$. In MADS, the strategy for updating $\Delta^p_k$
must be such that $\Delta^m_k \le \Delta^p_k$ for all $k$, and moreover, it must satisfy
$$\lim_{k \in K} \Delta^m_k = 0 \ \text{ if and only if } \ \lim_{k \in K} \Delta^p_k = 0 \quad \text{for any infinite subset of indices } K. \tag{2.2}$$
An implementable updating strategy satisfying these requirements is presented in
Section 4.
We now move away from the GPS terminology, and toward that of Coope and
Price. The set of trial points considered during the poll step is called a frame (the
definition of Coope and Price is more general). The frame is constructed using a
current incumbent solution $x_k$ (called the frame center), the poll and mesh size
parameters $\Delta^p_k$ and $\Delta^m_k$, and a positive spanning matrix $D_k$.
Definition 2.2. At iteration $k$, the MADS frame is defined to be the set:
$$P_k = \{ x_k + \Delta^m_k d : d \in D_k \} \subset M_k,$$
where $D_k$ is a positive spanning set such that for each $d \in D_k$,
• $d \neq 0$ can be written as a nonnegative integer combination of the directions
  in $D$: $d = Du$ for some vector $u \in \mathbb{N}^{n_D}$ that may depend on the iteration number $k$;
• the distance from the frame center $x_k$ to a poll point $x_k + \Delta^m_k d$ is bounded by
  a constant times the poll size parameter: $\Delta^m_k \|d\| \le \Delta^p_k \max\{\|d'\| : d' \in D\}$;
• limits (as defined in Coope and Price [9]) of the normalized sets $D_k$ are pos-
  itive spanning sets.
If the poll step fails to generate an improved mesh point then the frame is called
a minimal frame, and the frame center xk is said to be a minimal frame center. At
each iteration, the columns of Dk are called the poll directions.
The algorithm is stated formally below. It is very similar to GPS, with differences
in the poll step, and in the new poll size parameter.
A general MADS algorithm
• Initialization: Let $x_0 \in \Omega$, $\Delta^m_0 \le \Delta^p_0$, $G$, $Z$, $\tau$, $w^-$ and $w^+$ satisfy the require-
  ments given above. Set the iteration counter $k \leftarrow 0$.
• Search and poll step: Perform the search and possibly the poll steps (or
  only part of them) until an improved mesh point $x_{k+1}$ is found on the mesh $M_k$
  (see Definition 2.1).
  - Optional search: Evaluate $f_\Omega$ on a finite subset of trial points on the
    mesh $M_k$.
  - Local poll: Evaluate $f_\Omega$ on the frame $P_k$ (see Definition 2.2).
• Parameter update: Update $\Delta^m_{k+1}$ according to equation (2.1), and $\Delta^p_{k+1}$ ac-
  cording to (2.2). Increase $k \leftarrow k + 1$ and go back to the search and poll step.
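The loop below is a Python sketch of the algorithm just stated, under simplifying assumptions: the search step is empty, the poll is opportunistic, $x_0$ is assumed feasible, and the mesh update follows the LTMADS instance of Section 4 ($\tau = 4$, mesh size capped at 1). The helpers `f_omega`, `generate_frame_directions` and `update_poll_size` are illustrative placeholders, not part of the paper.

```python
def mads(x0, f_omega, generate_frame_directions, update_poll_size,
         tau=4.0, delta_m=1.0, delta_p=1.0, max_iter=1000):
    """Skeleton of a MADS run with an empty search step (a sketch, not NOMAD)."""
    xk, fk = list(x0), f_omega(x0)
    for _ in range(max_iter):
        improved = False
        # Poll step: evaluate f_omega on the frame P_k = {x_k + delta_m * d : d in D_k}.
        for d in generate_frame_directions(len(xk), delta_m, delta_p):
            trial = [xi + delta_m * di for xi, di in zip(xk, d)]
            ft = f_omega(trial)
            if ft < fk:                      # improved mesh point found
                xk, fk, improved = trial, ft, True
                break                        # opportunistic strategy
        # Parameter update: coarsen on success (capped at 1), refine otherwise.
        delta_m = min(tau * delta_m, 1.0) if improved else delta_m / tau
        delta_p = update_poll_size(delta_m)
    return xk, fk
```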
The crucial distinction with GPS is that if $\Delta^m_k$ goes to zero more rapidly than
$\Delta^p_k$, then the directions in $D_k$ used to define the frame may be selected in a way so
that they are not confined to a finite set. Note that in GPS both $\Delta^m_k$ and $\Delta^p_k$ are
equal: a single parameter plays the role of the mesh and poll size parameters, and
therefore, the number of positive spanning sets that can be formed by subsets of $D$ is
constant over all iterations. For example, suppose that in $\mathbb{R}^2$ the set $D$ is composed
of the eight directions $\{(d_1, d_2)^T \neq (0, 0)^T : d_1, d_2 \in \{-1, 0, 1\}\}$. There are a total
of 14 distinct positive bases that can be constructed from $D$. Figure 2.1 illustrates
some possible frames in $\mathbb{R}^2$ for three values of $\Delta^m_k = \Delta^p_k$. Another figure in Section 4
contrasts with this one by illustrating how the new MADS algorithm may select the
directions of $D_k$ from a larger set.

Fig. 2.1. GPS (a special case of MADS): example of frames $P_k = \{x_k + \Delta^m_k d : d \in D_k\} =
\{p^1, p^2, p^3\}$ for $\Delta^m_k = \Delta^p_k \in \{1, \tfrac{1}{2}, \tfrac{1}{4}\}$. In all three figures, the mesh $M_k$ is the intersection
of all lines.

3. Convergence analysis of MADS. The convergence analysis below relies
on the assumptions that $x_0 \in \Omega$, that $f(x_0)$ is finite, and that all iterates $\{x_k\}$
produced by the MADS algorithm lie in a compact set. Future work will relax the
first assumption by incorporating the filter approach given in [4].
The section is divided into three subsections. The first recalls Torczon's [22] analysis
of the behavior of the mesh size parameter and defines refining sequences as in [3]. It
also defines the idea of a refining subsequence and a refining direction. The second
subsection recalls the definitions of the hypertangent, Clarke, and contingent cones in
addition to some results on generalized derivatives. The third contains a hierarchy of
convergence results based on local properties of the feasible region $\Omega$.

3.1. Preliminaries. Torczon [22] first showed the following result for uncon-
strained pattern search algorithms. Then Audet and Dennis [3] used the same tech-
nique for a description of GPS which is much closer to our description of MADS.
The proof of this result for MADS is identical to that of GPS. The element neces-
sary to the proof is that for any integer $N \ge 1$, the iterate $x_N$ may be written as
$x_N = x_0 + \sum_{k=0}^{N-1} \Delta^m_k D z_k$ for some vectors $z_k \in \mathbb{N}^{n_D}$. This is still true with our new
way of defining the mesh and the frame (see Definitions 2.1 and 2.2).
Proposition 3.1. The poll and mesh size parameters produced by a MADS
instance satisfy
$$\liminf_{k \to +\infty} \Delta^p_k = \liminf_{k \to +\infty} \Delta^m_k = 0.$$

Since the mesh size parameter shrinks only at minimal frames, Proposition 3.1
guarantees that there are infinitely many minimal frame centers. The following defi-
nition specifies the subsequences of iterates and limit directions we use.
Definition 3.2. A subsequence of the MADS iterates consisting of minimal
frame centers, $\{x_k\}_{k \in K}$ for some subset of indices $K$, is said to be a refining subse-
quence if $\{\Delta^p_k\}_{k \in K}$ converges to zero.
If the limit $\lim_{k \in L} \frac{d_k}{\|d_k\|}$ exists for some subset $L \subseteq K$ with poll direction $d_k \in D_k$,
and if $x_k + \Delta^m_k d_k \in \Omega$ for infinitely many $k \in L$, then this limit is said to be a refining
direction for $\hat{x}$.
It is shown in [3] that there exists at least one convergent refining subsequence.
We now present some definitions that will be used later to guarantee the existence of
refining directions.
3.2. Cones and generalized derivatives. Three different types of tangent
cones play a central role in our analysis. Their definitions, and equivalent ones, may
be found in [21, 8, 14]. After presenting them, we supply an example where the three
cones differ to illustrate some of our results.
Definition 3.3. A vector $v \in \mathbb{R}^n$ is said to be a hypertangent vector to the set
$\Omega \subseteq \mathbb{R}^n$ at the point $x \in \Omega$ if there exists a scalar $\epsilon > 0$ such that
$$y + tw \in \Omega \quad \text{for all } y \in \Omega \cap B_\epsilon(x),\ w \in B_\epsilon(v) \text{ and } 0 < t < \epsilon. \tag{3.1}$$
The set of hypertangent vectors to $\Omega$ at $x$ is called the hypertangent cone to $\Omega$ at $x$
and is denoted by $T_\Omega^H(x)$.
Since the definition of a hypertangent is rather technical and crucial to our results,
we will pause for a short discussion. The reader could easily show that if $\Omega$ is a full
dimensional polytope defined by linear constraints, then every direction from $x \in \Omega$ into
the interior of $\Omega$ is a hypertangent. That follows immediately from the following result
relating hypertangents to the constraint qualification suggested by Gould and Tolle
[12]. See also [5] for a discussion of the Gould and Tolle constraint qualification.
Theorem 3.4. Let $C : \mathbb{R}^n \to \mathbb{R}^m$ be continuously differentiable at a point
$\hat{x} \in \Omega = \{x \in \mathbb{R}^n : C(x) \le 0\}$, and let $A = \{i \in \{1, 2, \ldots, m\} : c_i(\hat{x}) = 0\}$ be the active
set at $\hat{x}$. Then $v \in \mathbb{R}^n$ is a hypertangent vector to $\Omega$ at $\hat{x}$ if and only if $\nabla c_i(\hat{x})^T v < 0$
for each $i \in A$ with $\nabla c_i(\hat{x}) \neq 0$.
Proof. Let $v$ be a hypertangent vector to $\Omega$ at $\hat{x}$. Then, there exists an $\epsilon > 0$
such that $\hat{x} + tv \in \Omega$ for any $0 < t < \epsilon$. Let $i \in A$. Continuous differentiability of $c_i$
at $\hat{x}$ implies that
$$\nabla c_i(\hat{x})^T v = \lim_{t \downarrow 0} \frac{c_i(\hat{x} + tv) - c_i(\hat{x})}{t} \le 0.$$
It only remains to show that $\nabla c_i(\hat{x})^T v \neq 0$ when $\nabla c_i(\hat{x}) \neq 0$. Suppose by way
of contradiction that $\nabla c_i(\hat{x})^T v = 0$ and $\nabla c_i(\hat{x}) \neq 0$. Since the hypertangent cone
is an open set [21], for any nonnegative $\alpha \in \mathbb{R}$ sufficiently small, $v + \alpha \nabla c_i(\hat{x})$ is a
hypertangent vector to $\Omega$ at $\hat{x}$. It follows that
$$0 \ge \nabla c_i(\hat{x})^T (v + \alpha \nabla c_i(\hat{x})) = \alpha \|\nabla c_i(\hat{x})\|_2^2 > 0,$$
which is a contradiction. Thus, $\nabla c_i(\hat{x})^T v < 0$ when $\nabla c_i(\hat{x}) \neq 0$.
To prove the converse, let $i \in A$ be such that $\nabla c_i(\hat{x}) \neq 0$ and $v \in \mathbb{R}^n$ be such
that $\|v\| = 1$ and $\nabla c_i(\hat{x})^T v < 0$. The product $\nabla c_i(y)^T w$ is a continuous function at
$(y; w) = (\hat{x}; v)$, and so there is some $\epsilon_1 > 0$ such that
$$\nabla c_i(y)^T w < 0 \quad \text{for all } y \in B_{\epsilon_1}(\hat{x}) \text{ and } w \in B_{\epsilon_1}(v). \tag{3.2}$$
Take $\epsilon = \min\{1, \frac{\epsilon_1}{3}\}$ and let $y, w$ be in $B_\epsilon(\hat{x})$ and $B_\epsilon(v)$ respectively with $y \in \Omega$, and
let $0 < t < \epsilon$. We will show that $y + tw \in \Omega$. Our construction ensures that $c_i(y) \le 0$
and $\epsilon < \epsilon_1$, and so by the mean value theorem, we have
$$c_i(y + tw) \le c_i(y + tw) - c_i(y) = \nabla c_i(y + \theta tw)^T (tw) \quad \text{for some } \theta \in [0, 1]. \tag{3.3}$$
But, $\|y + \theta tw - \hat{x}\| \le \|y - \hat{x}\| + t(\|w - v\| + \|v\|) < \epsilon + \epsilon(\epsilon + 1) \le 3\epsilon \le \epsilon_1$, thus
$y + \theta tw \in B_{\epsilon_1}(\hat{x})$, and $w \in B_\epsilon(v) \subset B_{\epsilon_1}(v)$. It follows that equation (3.2) applies
and therefore $\nabla c_i(y + \theta tw)^T w < 0$. Combining this with (3.3) and with the fact that
$t > 0$ implies that $c_i(y + tw) \le 0$. But $c_i$ was any active component function, and so
$C(y + tw) \le 0$ and thus $y + tw \in \Omega$.
Let us now present two other types of tangent cones.
Definition 3.5. A vector $v \in \mathbb{R}^n$ is said to be a Clarke tangent vector to the set
$\Omega \subseteq \mathbb{R}^n$ at the point $x$ in the closure of $\Omega$ if for every sequence $\{y_k\}$ of elements of $\Omega$
that converges to $x$ and for every sequence of positive real numbers $\{t_k\}$ converging to
zero, there exists a sequence of vectors $\{w_k\}$ converging to $v$ such that $y_k + t_k w_k \in \Omega$.
The set $T_\Omega^{Cl}(x)$ of all Clarke tangent vectors to $\Omega$ at $x$ is called the Clarke tangent
cone to $\Omega$ at $x$.
Definition 3.6. A vector $v \in \mathbb{R}^n$ is said to be a tangent vector to the set
$\Omega \subseteq \mathbb{R}^n$ at the point $x$ in the closure of $\Omega$ if there exists a sequence $\{y_k\}$ of elements
of $\Omega$ that converges to $x$ and a sequence of positive real numbers $\{\lambda_k\}$ for which
$v = \lim_k \lambda_k (y_k - x)$. The set $T_\Omega^{Co}(x)$ of all tangent vectors to $\Omega$ at $x$ is called the
contingent cone (or sequential Bouligand tangent cone) to $\Omega$ at $x$.
Definition 3.7. The set $\Omega$ is said to be regular at $x$ provided $T_\Omega^{Cl}(x) = T_\Omega^{Co}(x)$.
Any convex set is regular at each of its points [8]. Both $T_\Omega^{Co}(x)$ and $T_\Omega^{Cl}(x)$ are
closed cones, and both $T_\Omega^{Cl}(x)$ and $T_\Omega^H(x)$ are convex cones. Moreover, $T_\Omega^H(x) \subseteq
T_\Omega^{Cl}(x) \subseteq T_\Omega^{Co}(x)$. Rockafellar [21] showed that $T_\Omega^H(x) = \operatorname{int}(T_\Omega^{Cl}(x))$ whenever $T_\Omega^H(x)$
is nonempty.
Clarke [8] showed that the generalized derivative, as defined in equation (1.1),
is Lipschitz continuous with respect to the direction $v \in \mathbb{R}^n$. We use Jahn's defini-
tion (1.3) of the Clarke generalized derivative, and need the following lemma to show
that it is Lipschitz continuous with respect to $v$ on the hypertangent cone.
Lemma 3.8. Let $f$ be Lipschitz near $\hat{x} \in \Omega$ with Lipschitz constant $\lambda$. If $u$ and $v$
belong to $T_\Omega^H(\hat{x})$, then
$$f^\circ(\hat{x}; u) \le f^\circ(\hat{x}; v) + \lambda \|u - v\|.$$

Proof. Let $f$ be Lipschitz near $\hat{x}$ with Lipschitz constant $\lambda$ and let $u$ and
$v$ belong to $T_\Omega^H(\hat{x})$. Let $\epsilon > 0$ be such that $y + tw \in \Omega$ whenever $y \in \Omega \cap B_\epsilon(\hat{x})$,
$w \in B_\epsilon(u) \cup B_\epsilon(v)$ and $0 < t < \epsilon$. This can be done by taking $\epsilon$ to be the smaller of
the values for $u$ and $v$ guaranteed by the definition of a hypertangent. In particular,
if $y \in \Omega \cap B_\epsilon(\hat{x})$ and if $0 < t < \epsilon$, then both $y + tu$ and $y + tv$ belong to $\Omega$. This allows
us to go from the first to the second line of the following chain of equalities:
$$\begin{aligned}
f^\circ(\hat{x}; u) &= \limsup_{\substack{y \to \hat{x},\ y \in \Omega \\ t \downarrow 0,\ y + tu \in \Omega}} \frac{f(y+tu) - f(y)}{t} \\
&= \limsup_{\substack{y \to \hat{x},\ y \in \Omega \\ t \downarrow 0,\ y + tv \in \Omega}} \frac{f(y+tu) - f(y)}{t} \\
&= \limsup_{\substack{y \to \hat{x},\ y \in \Omega \\ t \downarrow 0,\ y + tv \in \Omega}} \left( \frac{f(y+tv) - f(y)}{t} + \frac{f(y+tu) - f(y+tv)}{t} \right) \\
&\le f^\circ(\hat{x}; v) + \limsup_{\substack{y \to \hat{x},\ y \in \Omega \\ t \downarrow 0,\ y + tv \in \Omega}} \frac{f(y+tu) - f(y+tv)}{t} \;\le\; f^\circ(\hat{x}; v) + \lambda \|u - v\|.
\end{aligned}$$

Based on the previous lemma, the next proposition shows that the Clarke gen-
eralized derivative is continuous with respect to $v$ on the Clarke tangent cone. The
result is necessary to the proofs of Theorems 3.12 and 3.13.
Proposition 3.9. Let $f$ be Lipschitz near $\hat{x} \in \Omega$. If $T_\Omega^H(\hat{x}) \neq \emptyset$ and if $v \in T_\Omega^{Cl}(\hat{x})$,
then
$$f^\circ(\hat{x}; v) = \lim_{\substack{w \to v \\ w \in T_\Omega^H(\hat{x})}} f^\circ(\hat{x}; w).$$

Proof. Let $\lambda$ be a Lipschitz constant for $f$ near $\hat{x}$ and let $\{w_k\} \subset T_\Omega^H(\hat{x})$
be a sequence of directions converging to a vector $v \in T_\Omega^{Cl}(\hat{x})$. By definition of the
hypertangent cone, let $0 < \epsilon_k < \frac{1}{k}$ be such that
$$y + tw \in \Omega \quad \text{whenever } y \in \Omega \cap B_{\epsilon_k}(\hat{x}),\ w \in B_{\epsilon_k}(w_k) \text{ and } 0 < t < \epsilon_k. \tag{3.4}$$
We first show the inequality $f^\circ(\hat{x}; v) \le \lim_k f^\circ(\hat{x}; w_k)$. Equation (3.4) implies
that
$$\begin{aligned}
f^\circ(\hat{x}; v) &= \limsup_{\substack{y \to \hat{x},\ y \in \Omega \\ t \downarrow 0,\ y + tv \in \Omega}} \frac{f(y+tv) - f(y)}{t} \\
&= \limsup_{\substack{y \to \hat{x},\ y \in \Omega \\ t \downarrow 0,\ y + tv \in \Omega \\ y + tw_k \in \Omega}} \frac{f(y+tv) - f(y)}{t} \\
&\le \limsup_{\substack{y \to \hat{x},\ y \in \Omega \\ t \downarrow 0,\ y + tw_k \in \Omega}} \left( \frac{f(y+tw_k) - f(y)}{t} - \frac{f(y+tw_k) - f(y+tv)}{t} \right) \\
&\le f^\circ(\hat{x}; w_k) + \limsup_{\substack{y \to \hat{x},\ y \in \Omega \\ t \downarrow 0,\ y + tw_k \in \Omega}} \frac{f(y+tv) - f(y+tw_k)}{t}.
\end{aligned}$$
As $k$ goes to infinity, one gets that $\frac{|f(y+tw_k) - f(y+tv)|}{t} \le \lambda \|w_k - v\|$ goes to zero. Since
$\{w_k\}$ was arbitrary in the hypertangent cone, it follows that
$$f^\circ(\hat{x}; v) \le \lim_{\substack{w \to v \\ w \in T_\Omega^H(\hat{x})}} f^\circ(\hat{x}; w).$$
Second, we show the reverse inequality: $f^\circ(\hat{x}; v) \ge \lim_k f^\circ(\hat{x}; w_k)$. Let us define
$u_k = \frac{1}{k} w_k + (1 - \frac{1}{k}) v = w_k + (1 - \frac{1}{k})(v - w_k)$. Observe that since the hypertangent
cone is a convex set, and since $v$ lies in the closure of the hypertangent cone, it
follows that $u_k \in T_\Omega^H(\hat{x})$ for every $k = 1, 2, \ldots$
We now consider the generalized directional derivative
$$f^\circ(\hat{x}; u_k) = \limsup_{\substack{y \to \hat{x},\ y \in \Omega \\ t \downarrow 0,\ y + tu_k \in \Omega}} \frac{f(y + tu_k) - f(y)}{t}.$$
The fact that $u_k \in T_\Omega^H(\hat{x})$ implies that there exist $y_k \in \Omega \cap B_{\epsilon_k}(\hat{x})$ and $0 < t_k < \frac{\epsilon_k}{k}$
such that $y_k + t_k u_k \in \Omega$ and
$$f^\circ(\hat{x}; u_k) - \epsilon_k \le \frac{f(y_k + t_k u_k) - f(y_k)}{t_k}, \tag{3.5}$$
where $\epsilon_k$ is the constant from equation (3.4). We now define the sequence $z_k = y_k + \frac{t_k}{k} w_k$
converging to $\hat{x}$, and the sequence of scalars $h_k = (1 - \frac{1}{k}) t_k > 0$ converging to zero.
Notice that
$$z_k + h_k v = y_k + t_k \left( \frac{1}{k} w_k + \left(1 - \frac{1}{k}\right) v \right) = y_k + t_k u_k,$$
and therefore
$$\begin{aligned}
f^\circ(\hat{x}; v) &= \limsup_{\substack{z \to \hat{x},\ z \in \Omega \\ h \downarrow 0,\ z + hv \in \Omega}} \frac{f(z + hv) - f(z)}{h}
\;\ge\; \lim_k \frac{f(z_k + h_k v) - f(z_k)}{h_k} \\
&= \lim_k \frac{f(y_k + t_k u_k) - f(y_k)}{(1 - \frac{1}{k}) t_k} + \frac{f(y_k) - f(y_k + \frac{t_k}{k} w_k)}{(1 - \frac{1}{k}) t_k} \\
&\ge \lim_k \frac{f^\circ(\hat{x}; u_k) - \epsilon_k}{1 - \frac{1}{k}} - \frac{\frac{\lambda}{k}\|w_k\|}{1 - \frac{1}{k}} \qquad \text{(by equation (3.5) and the Lipschitz property)} \\
&\ge \lim_k \frac{f^\circ(\hat{x}; w_k) - \lambda \|u_k - w_k\| - \epsilon_k}{1 - \frac{1}{k}} - \frac{\frac{\lambda}{k}\|w_k\|}{1 - \frac{1}{k}} \qquad \text{(by Lemma 3.8)} \\
&= \lim_k f^\circ(\hat{x}; w_k),
\end{aligned}$$
since $\|u_k - w_k\| = (1 - \frac{1}{k})\|v - w_k\| \to 0$, $\epsilon_k \to 0$, $\frac{1}{k}\|w_k\| \to 0$ and $1 - \frac{1}{k} \to 1$.

Unfortunately, the above proposition is not necessarily true when the hypertan-
gent cone is empty: we cannot show in equation (3.4) that $y + tw_k$ belongs to $\Omega$ when
$y \in \Omega$ is close to $\hat{x}$ and when $t > 0$ is small. The following example in $\mathbb{R}^2$ shows that
in this case, the Clarke generalized derivative is not necessarily upper semicontinuous
on the boundary of the Clarke tangent cone.
Example 3.10. Consider the continuous concave function in $\mathbb{R}^2$: $f(a, b) =
-\max\{0, a\}$. Moreover, define the feasible region $\Omega$ to be the union of
$$\Omega_1 = \{(a, b)^T : a \ge 0,\ b \le 0\} \quad \text{with} \quad \Omega_2 = \{(a, b)^T : b = a^2,\ a \le 0\}.$$
One can verify that
$$T_\Omega^H(0) = \emptyset, \quad T_\Omega^{Cl}(0) = \Omega_1, \quad \text{and} \quad T_\Omega^{Co}(0) = \Omega_1 \cup \{(a, 0)^T : a \le 0\},$$
and therefore $\Omega$ is not regular at the origin. We will show that $f^\circ(0; w)$ is nonnegative
for $w$ in the interior of the Clarke tangent cone but $f^\circ(0; e_1) = -1$ with $e_1 = (1, 0)^T$ on
the boundary of the Clarke tangent cone.
Let $w = (w_1, w_2)^T$ be any direction in $\operatorname{int}(T_\Omega^{Cl}(0))$. We will construct appropriate
subsequences in order to compute a valid lower bound on $f^\circ(0; w)$. Define
$$y_k = \left( -\frac{w_1}{k},\ \frac{w_1^2}{k^2} \right)^T, \quad \text{and} \quad t_k = \frac{1}{k} \quad \text{for every positive integer } k.$$
One can easily check that $y_k \in \Omega_2$ and $y_k + t_k w = \left( 0,\ \frac{w_1^2}{k^2} + \frac{w_2}{k} \right)^T \in \Omega_1$
for every $k > -\frac{w_1^2}{w_2}$. It follows that
$$f^\circ(0; w) \ge \lim_k \frac{f(y_k + t_k w) - f(y_k)}{t_k} = \lim_k \frac{0 - 0}{\frac{1}{k}} = 0.$$
In particular, $f^\circ(0; (1, -\epsilon)^T)$ is nonnegative for any $\epsilon > 0$. However, if one considers the
direction $e_1 = (1, 0)^T$ on the boundary of $T_\Omega^{Cl}(0)$ and computes the Clarke generalized
derivative, then the origin cannot be approached by points $y_k = (a_k, b_k)^T \in \Omega$ with
$b_k > 0$ and with $y_k + t_k e_1 \in \Omega$. A necessary condition for both sequences to be in $\Omega$
is that $y_k$ belongs to $\Omega_1$, where $f$ is linear, and therefore one gets that the Clarke
generalized derivative is $f^\circ(0; e_1) = -1$.
This example shows that when the hypertangent cone is empty but the interior
of the Clarke tangent cone is nonempty, it is possible that $f^\circ(\hat{x}; w)$ is nonnegative for
every $w$ in the interior of the Clarke tangent cone and jumps to a negative value on
the boundary of the tangent cone: $f^\circ(\hat{x}; e_1) < \limsup_{w \to e_1} f^\circ(\hat{x}; w)$.
We now present two types of necessary optimality conditions based on the tangent
cone definitions.
Definition 3.11. Let $f$ be Lipschitz near $\hat{x} \in \Omega$. Then, $\hat{x}$ is said to be a Clarke,
or contingent, stationary point of $f$ over $\Omega$ if $f^\circ(\hat{x}; v) \ge 0$ for every direction $v$ in the
Clarke, or contingent, cone of $\Omega$ at $\hat{x}$, respectively.
In addition, $\hat{x}$ is said to be a Clarke, or contingent, KKT stationary point of $f$
over $\Omega$ if $-\nabla f(\hat{x})$ exists and belongs to the polar of the Clarke, or contingent, cone
of $\Omega$ at $\hat{x}$, respectively.
3.3. A hierarchy of convergence results for MADS. We now present our
basic result on refining directions, from which our hierarchy of results is derived.
Theorem 3.12. Let $f$ be Lipschitz near a limit $\hat{x}$ of a refining subsequence,
and let $v \in T_\Omega^H(\hat{x})$ be a refining direction for $\hat{x}$. Then the generalized directional deriva-
tive of $f$ at $\hat{x}$ in the direction $v$ is nonnegative, i.e., $f^\circ(\hat{x}; v) \ge 0$.
Proof. Let $\{x_k\}_{k \in K}$ be a refining subsequence converging to $\hat{x}$ and let
$v = \lim_{k \in L} \frac{d_k}{\|d_k\|} \in T_\Omega^H(\hat{x})$ be a refining direction for $\hat{x}$, with $d_k \in D_k$ for every $k \in L$.
Since $f$ is Lipschitz near $\hat{x}$, Proposition 3.9 ensures that $f^\circ(\hat{x}; v) = \lim_{k \in L} f^\circ(\hat{x}; \frac{d_k}{\|d_k\|})$.
But, for any $k \in L$, one can apply the definition of the Clarke generalized derivative
with the roles of $y$ and $t$ played by $x_k$ and $\Delta^m_k \|d_k\|$, respectively. Note that this last
quantity indeed converges to zero since Definition 2.2 ensures that it is bounded above
by $\Delta^p_k \max\{\|d'\| : d' \in D\}$, where $D$ is a finite set of directions, and equation (2.2)
states that $\Delta^p_k$ goes to zero. Therefore
$$f^\circ(\hat{x}; v) = \lim_{k \in L} f^\circ\!\left(\hat{x}; \tfrac{d_k}{\|d_k\|}\right)
\ge \limsup_{k \in L} \frac{f\!\left(x_k + \Delta^m_k \|d_k\| \tfrac{d_k}{\|d_k\|}\right) - f(x_k)}{\Delta^m_k \|d_k\|}
= \limsup_{k \in L} \frac{f(x_k + \Delta^m_k d_k) - f(x_k)}{\Delta^m_k \|d_k\|} \ge 0.$$
The last inequality follows from the fact that for each $k \in L$, $x_k + \Delta^m_k d_k \in \Omega$ and
$f(x_k + \Delta^m_k d_k) = f_\Omega(x_k + \Delta^m_k d_k)$ was evaluated and compared by the algorithm to
$f_\Omega(x_k)$, but $x_k$ is a minimal frame center.
We now show that Clarke derivatives of $f$ at the limit $\hat{x}$ of minimal frame centers,
for meshes that get infinitely fine, are nonnegative for all directions in the hyper-
tangent cone. Note that even though the algorithm is applied to $f_\Omega$ instead of $f$,
the convergence results are linked to the local smoothness of $f$ and not $f_\Omega$, which
is obviously discontinuous on the boundary of $\Omega$. This is because we use (1.3) as
the definition of the Clarke generalized derivative instead of (1.1). The constraint
qualification used in these results is that the hypertangent cone is nonempty at the
feasible limit point $\hat{x}$. Further discussion on nonempty hypertangent cones is found
in Rockafellar [21].
Theorem 3.13. Let $f$ be Lipschitz near a limit $\hat{x}$ of a refining subsequence,
and assume that $T_\Omega^H(\hat{x}) \neq \emptyset$. If the set of refining directions for $\hat{x}$ is dense in $T_\Omega^H(\hat{x})$,
then $\hat{x}$ is a Clarke stationary point of $f$ on $\Omega$.
Proof. The proof follows directly from Theorem 3.12 and Proposition 3.9.
A corollary to this result is that if $f$ is strictly differentiable at $\hat{x}$, then it is a
Clarke KKT point.
Corollary 3.14. Let $f$ be strictly differentiable at a limit $\hat{x}$ of a refining
subsequence, and assume that $T_\Omega^H(\hat{x}) \neq \emptyset$. If the set of refining directions for $\hat{x}$ is
dense in $T_\Omega^H(\hat{x})$, then $\hat{x}$ is a Clarke KKT stationary point of $f$ over $\Omega$.
Proof. Strict differentiability ensures that the gradient $\nabla f(\hat{x})$ exists and that
$\nabla f(\hat{x})^T v = f^\circ(\hat{x}; v)$ for any direction $v$. It follows directly from Theorem 3.13 that
$\nabla f(\hat{x})^T v \ge 0$ for every direction $v$ in $T_\Omega^{Cl}(\hat{x})$, thus $\hat{x}$ is a Clarke KKT stationary point.
Our next two results are based on the definition of set regularity (see Defini-
tion 3.7).
Proposition 3.15. Let $f$ be Lipschitz near a limit $\hat{x}$ of a refining subse-
quence, and assume that $T_\Omega^H(\hat{x}) \neq \emptyset$. If the set of refining directions for $\hat{x}$ is dense
in $T_\Omega^H(\hat{x})$, and if $\Omega$ is regular at $\hat{x}$, then $\hat{x}$ is a contingent stationary point of $f$ over
$\Omega$.
Proof. The definition of regularity of a set ensures that $f^\circ(\hat{x}; w) \ge 0$ for all $w$ in
$T_\Omega^{Co}(\hat{x})$.
The following result is the counterpart to Corollary 3.14 for contingent station-
arity. The proof is omitted since it is essentially the same.
Corollary 3.16. Let $f$ be strictly differentiable at a limit $\hat{x}$ of a refining
subsequence, and assume that $T_\Omega^H(\hat{x}) \neq \emptyset$. If the set of refining directions for $\hat{x}$ is
dense in $T_\Omega^H(\hat{x})$, and if $\Omega$ is regular at $\hat{x}$, then $\hat{x}$ is a contingent KKT stationary point
of $f$ over $\Omega$.
Example F in [2] presents an instance of a GPS algorithm that, when applied
to a given unconstrained optimization problem, generates a single limit point $\hat{x}$ which
is not a Clarke stationary point. In fact, it is shown that $f$ is differentiable but not
strictly differentiable at $\hat{x}$ and $\nabla f(\hat{x})$ is nonzero. This unfortunate circumstance is
due to the fact that GPS uses a finite number of poll directions. MADS can use
infinitely many.
The following result shows that the algorithm ensures strong optimality conditions
for unconstrained optimization, or when $\hat{x}$ is in the interior of $\Omega$.
Theorem 3.17. Let $f$ be Lipschitz near a limit $\hat{x}$ of a refining subsequence. If
$\Omega = \mathbb{R}^n$, or if $\hat{x} \in \operatorname{int}(\Omega)$, and if the set of refining directions for $\hat{x}$ is dense in $\mathbb{R}^n$,
then $0 \in \partial f(\hat{x})$.
Proof. Let $\hat{x}$ be as in the statement of the result; then $T_\Omega^H(\hat{x}) = \mathbb{R}^n$. Combining
the previous corollary with equation (1.2) yields the result.
We have shown that MADS, an algorithm that deals only with function values and
does not evaluate or estimate derivatives, can produce a limit point $\hat{x}$ such that if the
function $f$ is Lipschitz near $\hat{x}$ and if the hypertangent cone to $\Omega$ at $\hat{x}$ is nonempty, then
$\hat{x}$ is a Clarke stationary point. The main algorithmic condition in this result is that the
set of refining directions is dense in $T_\Omega^H(\hat{x})$. In the general statement of the algorithm
we did not present a strategy that would guarantee a dense set of refining directions
in the hypertangent cone. We want to keep the algorithm framework as general as
possible. There are different strategies that could be used to generate a dense set of
poll directions. The selection of the set $D_k$ could be done in a deterministic way or
may use some randomness. We present, analyze, and test one strategy in the next
section.
4. Practical implementation: LTMADS. We now present two examples of
a stochastic implementation of the MADS algorithm. We call either variant LTMADS,
because of the underlying lower triangular basis construction, and we show that with
probability 1, the set of poll directions generated by the algorithm is dense in the
whole space, and in particular in the hypertangent cone.
4.1. Implementable instances of a MADS algorithm. Let $G = I$, the
identity matrix, and let $D = Z = [I\ -I]$, $\tau = 4$, $w^- = -1$ and $w^+ = 1$ be the fixed
algorithmic parameters. Choose $\Delta^m_0 = 1$, $\Delta^p_0 = 1$ to be the initial mesh and poll size
parameters, and define the update rules as follows:
$$\Delta^m_{k+1} = \begin{cases}
\frac{1}{4}\Delta^m_k & \text{if } x_k \text{ is a minimal frame center} \\
4\Delta^m_k & \text{if an improved mesh point is found, and if } \Delta^m_k \le \frac{1}{4} \\
\Delta^m_k & \text{otherwise.}
\end{cases}$$
A consequence of these rules is that the mesh size parameter is always a power of 4
and never exceeds 1. Thus, $\frac{1}{\sqrt{\Delta^m_k}}$ is always a nonnegative power of 2 and hence
integral.
integral. The positive spanning set Dk contains a positive basis of n+1 or 2n directions
constructed as follows. It goes without saying that including more directions from D
is allowed.
Generation of the positive basis $D_k$.
• Basis construction: Let $B$ be a lower triangular matrix where each term on
  the diagonal is either plus or minus $\frac{1}{\sqrt{\Delta^m_k}}$, and the lower components are integers
  in the open interval $\left( -\frac{1}{\sqrt{\Delta^m_k}},\ \frac{1}{\sqrt{\Delta^m_k}} \right)$ randomly chosen with equal probability.
• Permutation of lines and columns of $B$: Let $\{i_1, i_2, \ldots, i_n\}$ and
  $\{j_1, j_2, \ldots, j_n\}$ be random permutations of the set $\{1, 2, \ldots, n\}$. Set $d^q_p = B_{i_p, j_q}$
  for each $p$ and $q$ in $\{1, 2, \ldots, n\}$.
• Completion to a positive basis:
  - A minimal positive basis: $n + 1$ directions.
    Set $d^{n+1} = -\sum_{i=1}^{n} d^i$ and let $D_k = \{d^1, d^2, \ldots, d^{n+1}\}$.
    Set the poll size parameter to $\Delta^p_k = n \sqrt{\Delta^m_k} \ge \Delta^m_k$.
  - A maximal positive basis: $2n$ directions.
    Set $d^{n+i} = -d^i$ for $i = 1, 2, \ldots, n$ and let $D_k = \{d^1, d^2, \ldots, d^{2n}\}$.
    Set the poll size parameter to $\Delta^p_k = \sqrt{\Delta^m_k} \ge \Delta^m_k$.

Some comments are in order. Since MADS is allowed to be opportunistic and end a
poll step as soon as a better point is found, we want the order of the poll directions
we generate to be random, as are the directions themselves. Thus, the purpose of the
second step is to permute the lines so that the line with $n - 1$ zeroes is not always
the first in $D_k$, and permute the columns so that the dense column is not always the
first in $D_k$. The name LTMADS is based on the lower triangular matrix at the heart
of the construction of the frames.
The following result shows that the frames generated by the LTMADS algorithm
satisfy the conditions of Definition 2.2.
Proposition 4.1. At each iteration $k$, the procedure above yields a $D_k$ and a
MADS frame $P_k$ such that
$$P_k = \{ x_k + \Delta^m_k d : d \in D_k \} \subset M_k,$$
where $M_k$ is given by Definition 2.1 and $D_k$ is a positive spanning set such that for
each $d \in D_k$,
• $d \neq 0$ can be written as a nonnegative integer combination of the directions
  in $D = [I\ -I]$: $d = Du$ for some vector $u \in \mathbb{N}^{n_D}$ that may depend on the
  iteration number $k$;
• the distance from the frame center $x_k$ to a poll point $x_k + \Delta^m_k d$ is bounded by a
  constant times the poll size parameter: $\Delta^m_k \|d\| \le \Delta^p_k \max\{\|d'\| : d' \in D\}$;
• limits (as defined in Coope and Price [9]) of the normalized sets $D_k$ are pos-
  itive spanning sets.
Proof. The first $n$ columns of $D_k$ form a basis of $\mathbb{R}^n$ because they are obtained by
permuting rows and columns of the lower triangular matrix $B$, which is nonsingular
because it has nonzero terms on the diagonal. Moreover, taking the last direction
to be the negative of the sum of the others leads to a minimal positive basis, and
combining the first $n$ columns of $D_k$ with their negatives gives a maximal positive
basis [11].
Again by construction, $D_k$ has all integral entries in the interval $\left[ -\frac{1}{\sqrt{\Delta^m_k}},\ \frac{1}{\sqrt{\Delta^m_k}} \right]$,
and so clearly each column $d$ of $D_k$ can be written as a nonnegative integer combina-
tion of the columns of $D = [I\ -I]$. Hence, the frame defined by $D_k$ is on the mesh
$M_k$.
Now the $\ell_\infty$ distance from the frame center to any poll point is $\|\Delta^m_k d\|_\infty =
\Delta^m_k \|d\|_\infty$. There are two cases. If the maximal positive basis construction is used,
then $\Delta^m_k \|d\|_\infty = \sqrt{\Delta^m_k} = \Delta^p_k$. If the minimal positive basis construction is used, then
$\Delta^m_k \|d\|_\infty \le n \sqrt{\Delta^m_k} = \Delta^p_k$. The proof of the second bullet follows by noticing that
$\max\{\|d'\|_\infty : d' \in [I\ -I]\} = 1$.
The frame can be rewritten in the equivalent form $\{x_k + \sqrt{\Delta^m_k}\, v : v \in \mathbb{V}\}$ where
$\mathbb{V}$ is a set whose columns are the same as those of $B$ after permutation and multiplied
by $\sqrt{\Delta^m_k}$. Coope and Price [9] show that a sufficient condition for the third bullet
to hold is that each element of $|\mathbb{V}|$ is bounded above and that $|\det(\mathbb{V})|$ is bounded
below by positive constants that are independent of $k$. This is trivial to show with
our construction. Indeed, each entry of $\mathbb{V}$ lies between $-1$ and $1$ and every term on
the diagonal is $\pm 1$. $B$ is a triangular matrix, and therefore $|\det(\mathbb{V})| = 1$.

Figure 4.1 illustrates some possible poll sets in $\mathbb{R}^2$ for three values of $\Delta^m_k$ and
$\Delta^p_k$ with minimal positive bases. With standard GPS, the frame would have to be
chosen among the eight neighboring mesh points. With the new algorithm, the frame
may be chosen among the mesh points lying inside the square with the dark con-
tour. One can see that as $\Delta^m_k$ and $\Delta^p_k$ go to zero, the number of candidates for poll
points increases rapidly. For the example illustrated in the figure, the sets of direc-
tions $D_k$ are respectively $\{(1, 0)^T, (0, 1)^T, (1, 1)^T\}$, $\{(2, 1)^T, (1, 2)^T, (2, 2)^T\}$
and $\{(4, 1)^T, (3, 4)^T, (4, 4)^T\}$. In the rightmost figure, there are a total of 104
distinct possible frames that MADS may choose from.
Fig. 4.1. Example of frames $P_k = \{x_k + \Delta^m_k d : d \in D_k\} = \{p^1, p^2, p^3\}$ for different values of
$\Delta^m_k$ and $\Delta^p_k$: $(\Delta^m_k, \Delta^p_k) = (1, 2)$, $(\tfrac{1}{4}, 1)$, and $(\tfrac{1}{16}, \tfrac{1}{2})$. In all three figures, the mesh $M_k$ is
the intersection of all lines.

In addition to an opportunistic strategy, i.e., terminating a poll step as soon
as an improved mesh point is detected, a standard trick we use in GPS to improve
the convergence speed consists in promoting a successful poll direction to the top of
the list of directions for the next poll step. We call this dynamic ordering of the
polling directions. This strategy cannot be directly implemented in MADS since
at a successful iteration $k - 1$, the poll size parameter is increased, and therefore a
step of $\Delta^p_k$ in the successful direction will often be outside the mesh. The way we
mimic GPS dynamic ordering in MADS is that when the previous iteration succeeded
in finding an improved mesh point, we execute a simple one point dynamic search
in the next iteration as follows. Suppose that $f_\Omega(x_k) < f_\Omega(x_{k-1})$ and that $d$ is
the direction for which $x_k = x_{k-1} + \Delta^m_{k-1} d$. Then, the trial point produced by the
search step is $t = x_{k-1} + 4\Delta^m_{k-1} d$. Note that with this construction, if $\Delta^m_{k-1} < 1$,
then $t = x_{k-1} + \Delta^m_k d$ and otherwise, $t = x_{k-1} + 4\Delta^m_k d$. In both cases $t$ lies on the
current mesh $M_k$. If this search finds a better point, then we go on to the next
iteration, but if not, then we proceed to the poll step. The reader will see in the
numerical results below that this seems to be a good strategy.
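For completeness, the one point dynamic search described above amounts to the following sketch; the previous incumbent, successful direction and mesh size are assumed to be available to the caller.

```python
def dynamic_search_point(x_prev, d, delta_m_prev):
    """One-point dynamic search after a successful poll in direction d.

    x_prev is the previous incumbent x_{k-1}, d the successful direction, and
    delta_m_prev the mesh size at iteration k-1; the trial point
    x_{k-1} + 4 * delta_m_prev * d lies on the current mesh M_k.
    """
    return [xi + 4.0 * delta_m_prev * di for xi, di in zip(x_prev, d)]
```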
4.2. Convergence analysis. The convergence results in Section 3.3 are based
on the assumption that the set of refining directions for the limit of a refining sequence
is dense in the hypertangent cone at that limit. The following result shows that the
above instances of LTMADS generate a dense set of poll directions, and therefore
the convergence results based on the local smoothness of the objective function $f$ and
on the local topology of the feasible region $\Omega$ can be applied.
Theorem 4.2. Let $\hat{x}$ be the limit of a refining subsequence produced by either
instance of LTMADS. Then the set of poll directions for the subsequence converging
to $\hat{x}$ is dense in $T_\Omega^H(\hat{x})$ with probability one.
Proof. Let $\hat{x}$ be the limit of a refining subsequence $\{x_k\}_{k \in K}$ produced by one of
the above instances of LTMADS (either with the minimal or maximal positive basis).
Consider the sequence of positive bases $\{D_k\}_{k \in K}$. Each one of these bases is generated
independently.
We use the notation $P[E]$ to denote the probability that the event $E$ occurs. Let $v$ be a
direction in $\mathbb{R}^n$ with $\|v\| = 1$ such that $P[|v_j| = 1] \ge \frac{1}{n}$ and $P[v_j = 1 \mid |v_j| = 1] =
P[v_j = -1 \mid |v_j| = 1] = \frac{1}{2}$. We will find a lower bound on the probability that a
normalized direction in $D_k$ is arbitrarily close to the vector $v$.
normalized direction in Dk is arbitrarily close to the vector v.
Let $k$ be an index of $K$. Recall that in the generation of the positive basis
$D_k$, the column $d^{i_1}$ is such that $|d^{i_1}_{j_1}| = \frac{1}{\sqrt{\Delta^m_k}}$, and the other components of $d^{i_1}$ are
random integers in the open interval $\left( -\frac{1}{\sqrt{\Delta^m_k}},\ \frac{1}{\sqrt{\Delta^m_k}} \right)$. Set $u = \frac{d^{i_1}}{\|d^{i_1}\|}$. It follows by
construction that $u = d^{i_1} \sqrt{\Delta^m_k}$ and $|u_{j_1}| = 1$. We will now show, for any $0 < \epsilon < 1$,
that the probability that $\|u - v\| < \epsilon$ is bounded below by some nonnegative
number independent of $k$, as $k \in K$ goes to infinity. Let us estimate the probability
that $|u_j - v_j| < \epsilon$ for each $j$. For $j = j_1$ we have
$$\begin{aligned}
P[|u_{j_1} - v_{j_1}| < \epsilon] &\ge P[u_{j_1} = v_{j_1} = 1] + P[u_{j_1} = v_{j_1} = -1] \\
&= P[u_{j_1} = 1]\, P[v_{j_1} = 1] + P[u_{j_1} = -1]\, P[v_{j_1} = -1] \\
&\ge \frac{1}{2} \cdot \frac{1}{2n} + \frac{1}{2} \cdot \frac{1}{2n} = \frac{1}{2n}.
\end{aligned}$$
For $j \in \{j_2, j_3, \ldots, j_n\}$ we have
$$P[|u_j - v_j| < \epsilon] = P[v_j - \epsilon < u_j < v_j + \epsilon]
= P\!\left[ \frac{v_j - \epsilon}{\sqrt{\Delta^m_k}} < d^{i_1}_j < \frac{v_j + \epsilon}{\sqrt{\Delta^m_k}} \right].$$
We will use the fact that the number of integers in the interval $\left( \frac{v_j - \epsilon}{\sqrt{\Delta^m_k}},\ \frac{v_j + \epsilon}{\sqrt{\Delta^m_k}} \right) \cap
\left( -\frac{1}{\sqrt{\Delta^m_k}},\ \frac{1}{\sqrt{\Delta^m_k}} \right)$ is bounded below by the value $\frac{\epsilon}{\sqrt{\Delta^m_k}} - 1$. Now, since the bases $D_k$
are independently generated, and since $d^{i_1}_j$ is an integer randomly chosen with equal
probability among the $\frac{2}{\sqrt{\Delta^m_k}} - 1$ integers in the interval $\left( -\frac{1}{\sqrt{\Delta^m_k}},\ \frac{1}{\sqrt{\Delta^m_k}} \right)$, it follows
that
$$P[|u_j - v_j| < \epsilon] > \frac{\frac{\epsilon}{\sqrt{\Delta^m_k}} - 1}{\frac{2}{\sqrt{\Delta^m_k}} - 1} \ge \frac{\epsilon - \sqrt{\Delta^m_k}}{2}.$$
Recall that $\hat{x}$ is the limit of a refining subsequence, and so there exists an integer $\hat{k}$
such that $\sqrt{\Delta^m_k} \le \frac{\epsilon}{2}$ whenever $k \in K$ with $k \ge \hat{k}$, and so
$$P[|u_j - v_j| < \epsilon] \ge \frac{\epsilon}{4} \quad \text{for any } k \in K \text{ with } k \ge \hat{k}.$$
It follows that
$$P[\|u - v\| < \epsilon] = \prod_{j=1}^{n} P[|u_j - v_j| < \epsilon] \ge \frac{1}{2n} \left( \frac{\epsilon}{4} \right)^{n-1}
\quad \text{for any } k \in K \text{ with } k \ge \hat{k}.$$
We have shown, when $k$ is sufficiently large, that $P[\|u - v\| < \epsilon]$ is larger than
a strictly positive constant, and is independent of $\Delta^m_k$. Thus, there will be a poll
direction in $D_k$ for some $k \in K$ arbitrarily close to any direction $v \in \mathbb{R}^n$, and in
particular to any direction $v \in T_\Omega^H(\hat{x})$.
The proof of the previous result shows that the set of directions consisting of the
$d^{i_1}$ directions over all iterations is dense in $\mathbb{R}^n$. Nevertheless, we require the algorithm
to use a positive spanning set at each iteration instead of a single poll direction. This
ensures that any limit of a refining subsequence is the limit of minimal frame centers
on meshes that get infinitely fine. At this limit point, the set of refining directions
is generated from the set of poll directions, which is dense in LTMADS and finite in
GPS. Therefore with both MADS and GPS, the set of directions for which the Clarke
generalized derivatives are nonnegative positively spans the whole space. However,
only LTMADS allows the possibility for the set of refining directions to be dense.
5. Numerical results. In this section, we look at 4 test problems. The first
problem is unconstrained, but GPS is well-known to stagnate if it is given an unsuit-
able set of directions. MADS has no problem converging quickly to a global optimizer.
The second case is a bound constrained chemical engineering problem where GPS is
known to behave well enough to justify publication of the results [13]. Still, on the
whole, MADS does better. The third case is a simple nonlinearly constrained prob-
lem where GPS and our filter version of GPS are both known to converge short of an
optimizer. As the theory given here predicts, MADS has no difficulty.
The last example is such that the feasible region gets narrow very quickly. This is
meant to be a test for any derivative-free feasible point algorithm - like GPS or MADS
with the extreme barrier approach to constraints. MADS does better than GPS with
the filter or the barrier, both of which stagnate due to the limitation of finitely many
poll directions. MADS stops making progress when the mesh size gets smaller than
the precision of the arithmetic. This is not to say that in exact arithmetic MADS
would have converged to $-\infty$, the optimizer. Most likely, a search procedure would
allow even more progress in all the algorithms, but that is not the point. The point
is to test the poll procedure.
Of course, even when one tries to choose carefully, 4 examples are not conclusive
evidence. However, we believe that these numerical results coupled with the more
powerful theory for MADS make a good case for MADS versus GPS.
5.1. An unconstrained problem where GPS does poorly. Consider the
unconstrained optimization problem in $\mathbb{R}^2$ presented in [15] where GPS algorithms
are known to converge to non-stationary points:
$$f(x) = \left(1 - e^{-\|x\|^2}\right) \max\left\{ \|x - c\|^2,\ \|x - d\|^2 \right\},$$
where $c = -d = (30, 80)^T$. Figure 5.1 shows level sets of this function. It can be shown
that $f$ is locally Lipschitz and strictly differentiable at its global minimizer $(0, 0)^T$.
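For reference, a direct Python transcription of this test function, as reconstructed above, is given below; the point $c$ is taken as $(30, 80)^T$ and $d = -c$.

```python
import numpy as np

def f_test(x, c=np.array([30.0, 80.0])):
    """Nonsmooth test function of Section 5.1: a smooth factor that vanishes at
    the origin times the max of two squared distances, with c = -d = (30, 80)^T."""
    x = np.asarray(x, dtype=float)
    d = -c
    return (1.0 - np.exp(-np.dot(x, x))) * max(np.dot(x - c, x - c),
                                               np.dot(x - d, x - d))
```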
The GPS and MADS runs are initiated at $x_0 = (3.3, 1.2)^T$. The gradient
of $f$ exists and is non-zero at that point, and therefore both GPS and MADS will
move away from it. Since there is some randomness involved in the MADS instances
described in Section 4.1, we ran it a total of 5 times, to see how it compares to our
standard NOMAD implementation of GPS. Figure 5.2 shows a log plot of the progress
of the objective function value for each set of runs. The runs were stopped when
a minimal frame with poll size parameter less than $10^{-10}$ was detected. For GPS,
the maximal $2n$ positive basis refers to the set of positive and negative coordinate
directions, and the two minimal $n + 1$ positive bases are $\{(1, 0)^T, (0, 1)^T, (-1, -1)^T\}$
and $\{(1, 0)^T, (-0.5, 0.866025)^T, (-0.5, -0.866025)^T\}$.
Fig. 5.1. Level sets of $f$.

Fig. 5.2. Progression of the objective function value vs. the number of evaluations (panels: basic
and dynamic $2n$ directions, basic and dynamic $n+1$ directions; GPS vs. MADS, 5 runs).

Every GPS run converged to the nearby point $(3.2, 1.2)^T$, where $f$ is not differen-
tiable. As proved in [3], the Clarke generalized directional derivatives are nonnegative
for the directions in $D$ at that point, but it is not an optimizer. One can see by looking
at the level sets of $f$ that no descent direction at $(3.2, 1.2)^T$ can be generated by the GPS algo-
rithm. However, all MADS runs eventually generated good directions and converged
to the origin, the global optimal solution. Figure 5.1 suggests that even if randomness
appears in these instances of MADS, the behavior of the algorithm is very stable in
converging quickly to the origin.
5.2. A test problem where GPS does well. The above was one of our moti-
vational examples for MADS, and so we next tried a test case where GPS was known
to behave so well that the results merited a publication. Hayes et al. [13] describe
a method for evaluating the kinetic constants in a rate expression for catalytic com-
bustion applications using experimental light-off curves. The method uses a transient
one-dimensional single channel monolith finite element reactor model to simulate re-
actor performance. The objective is to find the values of four parameters in a way such
that the model estimates as closely as possible (in a weighted least square sense) an
experimental conversion rate. This is a bound constrained nonsmooth optimization
problem in $\mathbb{R}^4_+$, where the objective function measures the error between experimental
data and values predicted by the model.
For the three sets of experimental data analyzed in [13], we ran the instances of
GPS and MADS discussed above. The algorithms terminate whenever a minimal
frame center with poll size parameter equal to $2^{-6}$ is detected, or whenever 500
function evaluations are performed, whichever comes first. Figures 5.3, 5.4, and 5.5
show the progression of the objective function value versus the number of evaluations
for each data set.

Fig. 5.3. Data set 1: progression of the objective function value vs. the number of evaluations.
Fig. 5.4. Data set 2: progression of the objective function value vs. the number of evaluations.

Fig. 5.5. Data set 3: progression of the objective function value vs. the number of evaluations.
The plots suggest that the objective function value without the dynamic search
procedure decreases more steadily with GPS than with MADS. This is because GPS
uses a fixed set of poll directions that we know to be excellent for this problem. By
allowing more directions, MADS eventually generates a steep descent direction, and
the dynamic runs capitalize on this by evaluating $f$ further in that direction, thus
sharply reducing the objective function value in a few evaluations. In general, if the
number of function evaluations is limited to a fixed number, then it appears that
MADS gives better results than GPS. For all three data sets, the dynamic runs are
preferable to the basic runs. It also appears that MADS runs with the maximal $2n$ basis
perform better than the minimal $n + 1$ runs. In each of the three data sets, the best
overall solution was always produced by MADS with the dynamic $2n$ directions.
The quality of the best solutions produced by GPS and MADS can be visualized
in Figure 5.6, where the differences between the experimental and predicted conversions
are plotted versus time. A perfect model with perfectly tuned parameters would have
had a difference of zero everywhere. The superiority of the solution produced by
MADS versus GPS is mostly visible for the third data set.

[Figure: three panels ("Data set 1", "Data set 2", "Data set 3") plotting the difference between predicted and experimental conversion (%) against elapsed time (sec) for the best GPS and MADS solutions.]

Fig. 5.6. Conversion rate error versus time.



5.3. A quadratically constrained problem. The third example shows again the difficulty caused by being restricted to a finite number of polling directions. This is a problem with a linear objective and a disk constraint, surely the simplest nonlinearly constrained problem imaginable.

    min_{x=(a,b)^T}   a + b
    s.t.              a^2 + b^2 ≤ 6.
The starting point is the origin, and the same stopping criterion as for the first example was used, ∆^p_k < 10^{-10}, with the following results. For GPS we always used D_k = D = [I, -I] with dynamic ordering. The GPS filter method of [4] reached the point x* = (-1.42, -1.99)^T with f(x*) = -3.417. The GPS using the barrier approach to the constraint got to x* = (-1.41, -2)^T with f(x*) = -3.414. All 5 runs of the MADS method of the previous section ended with f(x*) = -3.464, which is the global optimum, though the arguments varied between -1.715 and -1.749 across the runs. The progression of the algorithms is illustrated in Figure 5.7.
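As a concrete illustration of how the barrier approach treats this example, the following sketch wraps the disk constraint into the objective by returning +∞ at infeasible points. It is a minimal Python rendering of the setup, not the NOMAD code, and the function name is ours.

```python
import math

def barrier_objective(x):
    """Extreme barrier objective for: min a + b  s.t.  a^2 + b^2 <= 6."""
    a, b = x
    if a * a + b * b > 6.0:        # infeasible point: reject it outright
        return math.inf
    return a + b

# Sanity check at the global minimizer a = b = -sqrt(3):
# the optimal value is -2*sqrt(3) ~ -3.464, the value reached by the MADS runs.
print(barrier_objective((-math.sqrt(3), -math.sqrt(3))))   # ~ -3.4641
```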

[Figure: single panel ("dynamic 2n directions") plotting f against the number of function evaluations for the GPS filter, the GPS barrier, and MADS (5 runs).]

Fig. 5.7. Progression of the objective function value vs. the number of evaluations on an easy nonlinear problem.

Thus the GPS filter method did slightly better than the GPS barrier, and MADS
solved the problem easily, as the theory given here predicts.
5.4. A nasty constrained problem for both GPS and MADS. This last example does not satisfy the hypotheses of any GPS or MADS theorems because the optimizer is at infinity. But it is intended to see how well the various algorithms track a feasible region that gets narrow quickly. For this reason, the objective is not meant to be an issue; it is linear.

    min_{x=(a,b)^T}   a
    s.t.              e^a ≤ b ≤ 2e^a.
The starting point is (0, 1)^T, and the stopping criterion is activated when ∆^m_k < 10^{-323}, i.e., when the mesh size parameter drops below the smallest positive representable
number. We admit that this is excessive, but we wanted to run the algorithms to
their limits. The same strategies as in Section 5.3 are used.
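The same extreme barrier device applies here. The sketch below, under our reading of the constraint as e^a ≤ b ≤ 2e^a, shows why feasibility becomes hard to hit: the feasible values of b form a strip of width e^a, which shrinks exponentially as a decreases. The function name is ours, not part of any implementation.

```python
import math

def barrier_objective_narrow(x):
    """Extreme barrier objective for: min a  s.t.  e^a <= b <= 2*e^a."""
    a, b = x
    if not (math.exp(a) <= b <= 2.0 * math.exp(a)):   # outside the narrowing strip
        return math.inf
    return a

# The starting point (0, 1) is feasible since e^0 = 1 <= 1 <= 2;
# at a = -20 the strip of feasible b values is only e^-20 ~ 2.1e-9 wide.
print(barrier_objective_narrow((0.0, 1.0)))   # -> 0.0
```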
The progression of the algorithms is illustrated in Figure 5.8. GPS with both the barrier and the filter approaches to constraints converged quickly to points where the standard 2n basis does not contain a feasible descent direction. The filter GPS approach did better than the GPS barrier approach because it is allowed to become infeasible.

[Figure: single panel ("dynamic 2n directions") plotting f against the number of function evaluations (×10^4) for the GPS filter, the GPS barrier, and MADS (5 runs).]

Fig. 5.8. Progression of the objective function value vs. the number of evaluations on a difficult nonlinear problem.

All 5 runs of the LTMADS method of the previous section ended at roughly the same value of a, which is all one can ask. The fact that LTMADS uses a dense set of poll directions explains why it does better.
The feasible region is very narrow, and therefore it becomes quite improbable that the MADS poll directions generate a feasible point; when such a feasible point is generated, it is always very close to the frame center since the mesh and poll size parameters are very small.
Even though the algorithm instances failed to solve this problem to optimality and converged to points that are not Clarke stationary points, the GPS and MADS convergence theory is not violated. In all cases, there is a set of directions that positively spans R^2 such that each direction either has a nonnegative Clarke generalized derivative or is an infeasible direction.

6. Discussion. GPS is a valuable algorithm, but the application of nonsmooth analysis techniques in [3] showed its limitations due to the finite choice of directions. MADS removes the restriction of GPS to finitely many poll directions. We
have long felt that this was the major impediment to stronger proofs of optimality for
GPS limit points (and better behavior), and in this paper we do find more satisfying
optimality conditions for MADS in addition to opening new possibilities for handling
nonlinear constraints.
We gave a stochastic version of MADS, LTMADS, which performed well, espe-
cially for a first implementation. We expect others will find more, and perhaps better,
implementations. Our priority is to explore MADS for constraints.

We think that the work here is readily applied to choosing templates for implicit
filtering [7], another very successful algorithm for nasty nonlinear problems.
Finally, we wish to thank Gilles Couture for coding NOMAD, the C++ implementation of MADS and GPS, and to acknowledge useful discussions with Andrew Booker and Mark Abramson.

REFERENCES

[1] Abramson, M. A., Audet, C., and Dennis, J. E., Jr. (2003), Generalized pattern searches with derivative information, Technical Report TR02-10, Department of Computational and Applied Mathematics, Rice University, Houston, Texas. To appear in Mathematical Programming, Series B.
[2] Audet, C. (2002), Convergence results for pattern search algorithms are tight, Les Cahiers du GERAD G-2002-56, Montréal. To appear in Optimization and Engineering.
[3] Audet, C. and Dennis, J. E., Jr. (2003), Analysis of generalized pattern searches, SIAM Journal on Optimization, Vol. 13, 889-903.
[4] Audet, C. and Dennis, J. E., Jr. (2000), A pattern search filter method for nonlinear programming without derivatives, Technical Report TR00-09, Department of Computational and Applied Mathematics, Rice University, Houston, Texas. To appear in SIAM Journal on Optimization.
[5] Avriel, M. (1976), Nonlinear Programming: Analysis and Methods, Prentice-Hall, Englewood Cliffs, NJ.
[6] Booker, A. J., Dennis, J. E., Jr., Frank, P. D., Serafini, D. B., Torczon, V., and Trosset, M. W. (1999), A rigorous framework for optimization of expensive functions by surrogates, Structural Optimization, Vol. 17, No. 1, 1-13.
[7] Choi, T. D. and Kelley, C. T. (1999), Superlinear convergence and implicit filtering, SIAM Journal on Optimization, Vol. 10, No. 4, 1149-1162.
[8] Clarke, F. H. (1990), Optimization and Nonsmooth Analysis, SIAM Classics in Applied Mathematics, Vol. 5, Philadelphia.
[9] Coope, I. D. and Price, C. J. (2000), Frame-based methods for unconstrained optimization, Journal of Optimization Theory and Applications, Vol. 107, 261-274.
[10] Coope, I. D. and Price, C. J. (2001), On the convergence of grid-based methods for unconstrained optimization, SIAM Journal on Optimization, Vol. 11, 859-869.
[11] Davis, C. (1954), Theory of positive linear dependence, American Journal of Mathematics, Vol. 76, 733-746.
[12] Gould, F. J. and Tolle, J. W. (1972), Geometry of optimality conditions and constraint qualifications, Mathematical Programming, Vol. 2, 1-18.
[13] Hayes, R. E., Bertrand, F. H., Audet, C., and Kolaczkowski, S. T. (2002), Catalytic combustion kinetics: Using a direct search algorithm to evaluate kinetic parameters from light-off curves, Les Cahiers du GERAD G-2002-20, Montréal. To appear in The Canadian Journal of Chemical Engineering.
[14] Jahn, J. (1994), Introduction to the Theory of Nonlinear Optimization, Springer, Berlin.
[15] Kolda, T. G., Lewis, R. M., and Torczon, V. (2003), Optimization by direct search: New perspectives on some classical and modern methods, SIAM Review, Vol. 45, No. 3, 385-482.
[16] Leach, E. B. (1961), A note on inverse function theorems, Proceedings of the American Mathematical Society, Vol. 12, 694-697.
[17] Lewis, R. M. and Torczon, V. (1996), Rank ordering and positive bases in pattern search algorithms, ICASE Technical Report 96-71.
[18] Lewis, R. M. and Torczon, V. (2002), A globally convergent augmented Lagrangian pattern search algorithm for optimization with general constraints and simple bounds, SIAM Journal on Optimization, Vol. 12, No. 4, 1075-1089.
[19] McKay, M. D., Conover, W. J., and Beckman, R. J. (1979), A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics, Vol. 21, 239-245.
[20] Rockafellar, R. T. (1970), Convex Analysis, Princeton University Press, Princeton, NJ.
[21] Rockafellar, R. T. (1980), Generalized directional derivatives and subgradients of nonconvex functions, Canadian Journal of Mathematics, Vol. 32, 157-180.
[22] Torczon, V. (1997), On the convergence of pattern search algorithms, SIAM Journal on Optimization, Vol. 7, No. 1, 1-25.
