LECTURES ON MODERN CONVEX OPTIMIZATION

Aharon Ben-Tal† and Arkadi Nemirovski∗

† The William Davidson Faculty of Industrial Engineering & Management,
Technion – Israel Institute of Technology, abental@ie.technion.ac.il
http://ie.technion.ac.il/Home/Users/morbt0.html

∗ H. Milton Stewart School of Industrial & Systems Engineering,
Georgia Institute of Technology, nemirovs@isye.gatech.edu
http://www.isye.gatech.edu/faculty-staff/profile.php?entry=an63
Preface
Γ being a given set of pairs (i, j) of indices i, j. This is a fundamental combinatorial problem of computing the
stability number of a graph; the corresponding “covering story” is as follows:
Assume that we are given n letters which can be sent through a telecommunication channel, say,
n = 256 usual bytes. When passing through the channel, an input letter can be corrupted by errors;
as a result, two distinct input letters can produce the same output and thus cannot necessarily be
distinguished at the receiving end. Let Γ be the set of "dangerous pairs of letters" – pairs (i, j) of
distinct letters i, j which can be converted by the channel into the same output. If we are interested
in error-free transmission, we should restrict the set S of letters we actually use to be independent
– such that no pair (i, j) with i, j ∈ S belongs to Γ. And in order to utilize the capacity of
the channel as fully as possible, we are interested in using a maximal – with the maximum possible
number of letters – independent sub-alphabet. It turns out that minus the optimal value in (A) is
exactly the cardinality of such a maximal independent sub-alphabet.
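For concreteness, problem (A) referred to in the story admits the following form, consistent with the description above (a reconstruction; the binary variables xj mark the letters included in the sub-alphabet):

    (A):   min_x { −∑_{j=1}^n xj :  xi xj = 0 ∀(i, j) ∈ Γ;  xj² − xj = 0, j = 1, ..., n }.

Indeed, the constraints xj² = xj force every xj to be 0 or 1, and the constraints xi xj = 0, (i, j) ∈ Γ, forbid simultaneous use of both letters of a dangerous pair; thus the feasible points are exactly the characteristic vectors of independent sub-alphabets, and minus the optimal value is the maximal cardinality of such a sub-alphabet.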
where λmin(A) denotes the minimum eigenvalue of a symmetric matrix A. This problem, (B), is responsible
for the design of a truss (a mechanical construction of thin elastic bars linked with each other, like an
electric mast, a bridge or the Eiffel Tower) capable of withstanding, as well as possible, each of k given loads.
When looking at the analytical forms of (A) and (B), it seems that the first problem is easier than the second:
the constraints in (A) are simple explicit quadratic equations, while the constraints in (B) involve much more
complicated functions of the design variables – the eigenvalues of certain matrices depending on the design vector.
The truth, however, is that the first problem is, in a sense, "as difficult as an optimization problem can be", and
the worst-case computational effort to solve this problem within absolute inaccuracy 0.5 by all known optimization
methods is about 2^n operations; for n = 256 (just 256 design variables corresponding to the "alphabet of bytes"),
the quantity 2^n ≈ 10^77, for all practical purposes, is the same as +∞. In contrast to this, the second problem is
quite "computationally tractable". E.g., for k = 6 (6 loads of interest) and m = 100 (100 degrees of freedom of
the construction) the problem has about 600 variables (twice the design dimension of the "byte" version of (A));
nevertheless, it can be reliably solved within 6 accuracy digits in a couple of minutes. The dramatic difference in
computational effort required to solve (A) and (B) finally comes from the fact that (A) is a non-convex optimization
problem, while (B) is convex.
Note that realizing what is easy and what is difficult in Optimization is, aside from its theoretical
importance, extremely important methodologically. Indeed, mathematical models of real world
situations in any case are incomplete and therefore are flexible to some extent. When you know in
advance what you can process efficiently, you perhaps can use this flexibility to build a tractable
(in our context – a convex) model. "Traditional" Optimization did not pay much attention
to complexity and focused on easy-to-analyze, purely asymptotical "rate of convergence" results.
From this viewpoint, the most desirable property of f and gi is smoothness (plus, perhaps,
certain “nondegeneracy” at the optimal solution), and not their convexity; choosing between
the above problems (A) and (B), a “traditional” optimizer would, perhaps, prefer the first of
them. We suspect that a non-negligible part of “applied failures” of Mathematical Programming
came from the traditional (we would say, heavily misleading) “order of preferences” in model-
building. Surprisingly, some advanced users (primarily in Control) have realized the crucial role
of convexity much earlier than some members of the Optimization community. Here is a real
story. About 7 years ago, we were working on a certain Convex Optimization method, and one of
us sent an e-mail to the people maintaining CUTE (a benchmark of test problems for constrained
continuous optimization) requesting the list of convex programs from their collection. The
answer was: “We do not care which of our problems are convex, and this be a lesson for those
developing Convex Optimization techniques.” In their opinion, the question is stupid; in our
opinion, they are obsolete. Who is right, this we do not know...
♠ Discovery of interior-point polynomial time methods for "well-structured" generic convex
programs and thorough investigation of these programs.
By itself, the "efficient solvability" of generic convex programs is a theoretical rather than
a practical phenomenon. Indeed, assume that all we know about (P) is that the program is
convex, its objective is called f, the constraints are called gi, and that we can compute f and gi,
along with their derivatives, at any given point at the cost of M arithmetic operations. In this
case the computational effort for finding an ε-solution turns out to be at least O(1) n M ln(1/ε).
Note that this is a lower complexity bound, and the best known so far upper bound is much
worse: O(1) n (n³ + M) ln(1/ε). Although the bounds grow "moderately" – polynomially – with
the design dimension n of the program and the required number ln(1/ε) of accuracy digits, from
the practical viewpoint the upper bound becomes prohibitively large already for n like 1000.
This is in striking contrast with Linear Programming, where one can solve routinely problems
with tens and hundreds of thousands of variables and constraints. The reasons for this huge
difference come from the fact that
When solving an LP program, our a priori knowledge goes far beyond the fact that the
objective is called f, the constraints are called gi, that they are convex and that we can
compute their values and derivatives at any given point. In LP, we know in advance
what the analytical structure of f and gi is, and we heavily exploit this knowledge
when processing the problem. In fact, all successful LP methods never compute
the values and the derivatives of f and gi – they do something completely different.
One of the most important recent developments in Optimization is realizing the simple fact
that a jump from linear f and gi ’s to “completely structureless” convex f and gi ’s is too long: in-
between these two extremes, there are many interesting and important generic convex programs.
These “in-between” programs, although non-linear, still possess nice analytical structure, and
one can use this structure to develop dedicated optimization methods, the methods which turn
out to be incomparably more efficient than those exploiting solely the convexity of the program.
The aforementioned “dedicated methods” are Interior Point polynomial time algorithms,
and the most important “well-structured” generic convex optimization programs are those of
Linear, Conic Quadratic and Semidefinite Programming; the last two entities simply did not
exist as established research subjects just 15 years ago. In our opinion, the discovery of Interior
Point methods and of non-linear “well-structured” generic convex programs, along with the
subsequent progress in these novel research areas, is one of the most impressive achievements in
Mathematical Programming.
♠ We have outlined the most revolutionary, in our appreciation, changes in the theoretical
core of Mathematical Programming in the last 15-20 years. During this period, we have witnessed
perhaps less dramatic, but still quite important progress in the methodological and application-
related areas as well. The major novelty here is a certain shift from the applications traditional
for Operations Research – those in Industrial Engineering (production planning, etc.) – to applications in
“genuine” Engineering. We believe it is completely fair to say that the theory and methods
of Convex Optimization, especially those of Semidefinite Programming, have become a kind
of new paradigm in Control and are becoming more and more frequently used in Mechanical
Engineering, Design of Structures, Medical Imaging, etc.
The aim of the course is to outline some of the novel research areas which have arisen in
Optimization during the past decade or so. We intend to focus solely on Convex Programming,
specifically, on
• Conic Programming, with emphasis on the most important particular cases – those of
Linear, Conic Quadratic and Semidefinite Programming (LP, CQP and SDP, respectively).
Here the focus will be on
Acknowledgements. The first four lectures of the five comprising the core of the course are
based upon the book
Ben-Tal, A., Nemirovski, A., Lectures on Modern Convex Optimization: Analysis, Algo-
rithms, Engineering Applications, MPS-SIAM Series on Optimization, SIAM, Philadelphia,
2001.
We are greatly indebted to our colleagues, primarily to Yuri Nesterov, Stephen Boyd, Claude
Lemaréchal and Kees Roos, who over the years have significantly influenced our understanding
of the subject expressed in this course. Needless to say, we are the only persons responsible for
the drawbacks in what follows.
The Lecture Notes were "renovated" in Fall 2011, Summer 2013 and Summer 2015. The most
important added material is that in Sections 1.3 (applications of LP in Compressive Sensing
and for synthesis of linear controllers in linear dynamic systems), 3.6 (Lagrangian approximation
of chance constraints), and 5.3, 5.5 (advanced First Order methods for large-scale optimization).
I have also added "operational exercises" (Exercises 1.3.1, 1.3.2, 2.6.1, 5.3.1); in contrast to
"regular exercises," where the task usually is to prove something, an operational exercise requires
creating software and numerical experimentation to implement a broadly outlined approach and
as such offers much room for the reader's initiative. The operational exercises are typed in blue.
Arkadi Nemirovski,
August 2015.
Contents
4 Polynomial Time Interior Point algorithms for LP, CQP and SDP
  4.1 Complexity of Convex Programming
    4.1.1 Combinatorial Complexity Theory
    4.1.2 Complexity in Continuous Optimization
    4.1.3 Computational tractability of convex optimization problems
    4.1.4 What is inside Theorem 4.1.1: Black-box represented convex programs and the Ellipsoid method
    4.1.5 Difficult continuous optimization problems
  4.2 Interior Point Polynomial Time Methods for LP, CQP and SDP
    4.2.1 Motivation
    4.2.2 Interior Point methods
    4.2.3 But...
  4.3 Interior point methods for LP, CQP, and SDP: building blocks
    4.3.1 Canonical cones and canonical barriers
    4.3.2 Elementary properties of canonical barriers
  4.4 Primal-dual pair of problems and primal-dual central path
    4.4.1 The problem(s)
    4.4.2 The central path(s)
  4.5 Tracing the central path
    4.5.1 The path-following scheme
    4.5.2 Speed of path-tracing
    4.5.3 The primal and the dual path-following methods
    4.5.4 The SDP case
  4.6 Complexity bounds for LP, CQP, SDP
    4.6.1 Complexity of LP_b
    4.6.2 Complexity of CQP_b
    4.6.3 Complexity of SDP_b
  4.7 Concluding remarks

Bibliography
• The space of all n-dimensional vectors is denoted Rn , the set of all m × n matrices is
denoted Rm×n or Mm×n , and the set of symmetric n × n matrices is denoted Sn . By
default, all vectors and matrices are real.
x = [x1; ...; xn]

(pay attention to the semicolon ";"). For example, the column vector with coordinates 1, 2, 3 is written as [1; 2; 3].
More generally,
— if A1 , ..., Am are matrices with the same number of columns, we write [A1 ; ...; Am ] to
denote the matrix which is obtained when writing A2 beneath A1 , A3 beneath A2 , and so
on.
— if A1 , ..., Am are matrices with the same number of rows, then [A1 , ..., Am ] stands for
the matrix which is obtained when writing A2 to the right of A1 , A3 to the right of A2 ,
and so on.
Examples:

• A1 = [1, 2, 3; 4, 5, 6], A2 = [7, 8, 9]  ⇒  [A1; A2] = [1, 2, 3; 4, 5, 6; 7, 8, 9]

• A1 = [1, 2; 3, 4], A2 = [7; 8]  ⇒  [A1, A2] = [1, 2, 7; 3, 4, 8]

• [1, 2, 3, 4] = [1; 2; 3; 4]^T

• [[1, 2; 3, 4], [5, 6; 7, 8]] = [1, 2, 5, 6; 3, 4, 7, 8]
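For readers experimenting numerically, this "semicolon/comma" notation corresponds to vertical and horizontal stacking of arrays; a small illustration (ours) in Python/NumPy:

    import numpy as np

    A1 = np.array([[1, 2, 3],
                   [4, 5, 6]])
    A2 = np.array([[7, 8, 9]])
    print(np.vstack([A1, A2]))   # [A1; A2]: A2 written beneath A1

    B1 = np.array([[1, 2],
                   [3, 4]])
    B2 = np.array([[7],
                   [8]])
    print(np.hstack([B1, B2]))   # [B1, B2]: B2 written to the right of B1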
O(1)’s. Below O(1)’s denote properly selected positive absolute constants. We write f ≤
O(1)g, where f and g are nonnegative functions of some parameters, to express the fact that
for properly selected positive absolute constant C the inequality f ≤ Cg holds true in the entire
range of the parameters, and we write f = O(1)g when both f ≤ O(1)g and g ≤ O(1)f .
Lecture 1. From Linear to Conic Programming

min_x { c^T x : Ax − b ≥ 0 }    (LP)

where
• x ∈ Rn is the design vector
• c ∈ Rn is a given vector of coefficients of the objective function cT x
• A is a given m × n constraint matrix, and b ∈ Rm is a given right hand side of the
constraints.
(LP) is called
– feasible, if its feasible set
F = {x : Ax − b ≥ 0}
is nonempty; a point x ∈ F is called a feasible solution to (LP);
– bounded below, if it is either infeasible, or its objective cT x is bounded below on F.
For a feasible and bounded below problem (LP), the quantity

c* ≡ inf_{x: Ax−b≥0} c^T x

is called the optimal value of the problem. For an infeasible problem, we set c* = +∞,
while for a feasible problem which is not bounded below we set c* = −∞.
(LP) is called solvable, if it is feasible, bounded below and the optimal value is attained, i.e.,
there exists x ∈ F with cT x = c∗ . An x of this type is called an optimal solution to (LP).
A priori it is unclear whether a feasible and bounded below LP program is solvable: why should
the infimum be achieved? It turns out, however, that a feasible and bounded below program
(LP) always is solvable. This nice fact (we shall establish it later) is specific for LP. Indeed, a
very simple nonlinear optimization program

min_x { 1/x : x ≥ 1 }

is feasible and bounded below, but it is not solvable.
how to find a systematic way to bound from below its optimal value c*?
Why this is an important question, and how the answer helps one to deal with LP, will be seen
in the sequel. For the time being, let us just believe that the question is worthy of the effort.
A trivial answer to the posed question is: solve (LP) and look at the optimal value.
There is, however, a smarter and much more instructive way to answer it. Just to
get an idea of this way, let us look at the following example:
min_x { x1 + x2 + ... + x2002 :
        x1 + 2x2 + ... + 2001x2001 + 2002x2002 − 1 ≥ 0,
        2002x1 + 2001x2 + ... + 2x2001 + x2002 − 100 ≥ 0,
        ..... }

We claim that the optimal value in the problem is ≥ 101/2003. How could one certify this bound?
This is immediate: add the first two constraints to get the inequality

2003(x1 + x2 + ... + x2002) − 101 ≥ 0,

and divide the resulting inequality by 2003. LP duality is nothing but a straightforward gener-
alization of this simple trick.
(??) How to certify that (S) has, or does not have, a solution.
Imagine that you are very smart and know the correct answer to (?); how could you convince
somebody that your answer is correct? What could be an “evident for everybody” certificate of
the validity of your answer?
If your claim is that (S) is solvable, a certificate could be just to point out a solution x∗ to
(S). Given this certificate, one can substitute x∗ into the system and check whether x∗ indeed
is a solution.
Assume now that your claim is that (S) has no solutions. What could be a "simple certificate"
of this claim? How could one certify a negative statement? This is a highly nontrivial problem,
and not just in mathematics; consider, for example, criminal law: how should someone accused of a
murder prove his innocence? The "real life" answer to the question "how to certify a negative statement"
is discouraging: such a statement normally cannot be certified (this is where the rule "a person
is presumed innocent until proven guilty" comes from). In mathematics, however, the situation
is different: in some cases there exist “simple certificates” of negative statements. E.g., in order
to certify that (S) has no solutions, it suffices to demonstrate that a consequence of (S) is a
contradictory inequality such as
−1 ≥ 0.
For example, assume that λi, i = 1, ..., m, are nonnegative weights. Combining the inequalities from
(S) with these weights, we come to the inequality

∑_{i=1}^m λi fi(x)  Ω  0    (Cons(λ))
where Ω is either ">" (this is the case when the weight of at least one strict inequality from
(S) is positive), or "≥" (otherwise). Since the resulting inequality, due to its origin, is a
consequence of the system (S), i.e., it is satisfied by every solution to (S), it follows that if
(Cons(λ)) has no solutions at all, we can be sure that (S) has no solutions. Whenever this is the
case, we may treat the corresponding vector λ as a "simple certificate" of the fact that (S) is
infeasible.
Let us look at what the outlined approach means when (S) is comprised of linear inequal-
ities:

(S):  { ai^T x Ωi bi, i = 1, ..., m },   where each Ωi is either ">" or "≥".
Here the "combined inequality" is linear as well:

(Cons(λ)):  ( ∑_{i=1}^m λi ai )^T x  Ω  ∑_{i=1}^m λi bi
(Ω is ">" whenever λi > 0 for at least one i with Ωi = ">", and Ω is "≥" otherwise). Now,
when can a linear inequality

d^T x Ω e

be contradictory? Of course, this can happen only when d = 0. Whether the inequality is
contradictory in this case depends on the relation Ω: if Ω = ">", then the inequality is
contradictory if and only if e ≥ 0, and if Ω = "≥", it is contradictory if and only if e > 0. We
have established the following simple result:
Proposition 1.2.1 Consider a system of linear inequalities

(S):  ai^T x Ωi bi, i = 1, ..., m,   where Ωi = ">" for i ≤ ms and Ωi = "≥" for i > ms,

with n-dimensional vector of unknowns x. Let us associate with (S) two systems of linear
inequalities and equations with m-dimensional vector of unknowns λ:

TI:   (a)   λ ≥ 0;
      (b)   ∑_{i=1}^m λi ai = 0;
      (cI)  ∑_{i=1}^m λi bi ≥ 0;
      (dI)  ∑_{i=1}^{ms} λi > 0.

TII:  (a)   λ ≥ 0;
      (b)   ∑_{i=1}^m λi ai = 0;
      (cII) ∑_{i=1}^m λi bi > 0.

Assume that at least one of the systems TI, TII is solvable. Then the system (S) is infeasible.
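Checking whether a given weight vector λ is such a certificate is purely mechanical; here is a minimal Python/NumPy sketch (the helper and its name are ours, with ms denoting the number of strict inequalities, assumed to come first):

    import numpy as np

    def certifies_infeasibility(lam, A, b, ms, tol=1e-9):
        # Check whether lam solves T_I or T_II for the system
        # (S): a_i^T x > b_i (i <= ms), a_i^T x >= b_i (i > ms),
        # where the rows of A are the a_i^T. Returns 'T_I', 'T_II' or None.
        lam = np.asarray(lam, dtype=float)
        if (lam < -tol).any():                   # (a): lambda >= 0
            return None
        if np.abs(A.T @ lam).max() > tol:        # (b): sum_i lam_i a_i = 0
            return None
        s = float(lam @ b)                       # sum_i lam_i b_i
        if s > tol:                              # (c_II): certificate of type T_II
            return 'T_II'
        if s >= -tol and lam[:ms].sum() > tol:   # (c_I) + (d_I): type T_I
            return 'T_I'
        return None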
Proposition 1.2.1 says that in some cases it is easy to certify infeasibility of a linear system of
inequalities: a “simple certificate” is a solution to another system of linear inequalities. Note,
however, that the existence of a certificate of this latter type is to the moment only a sufficient,
but not a necessary, condition for the infeasibility of (S). A fundamental result in the theory of
linear inequalities is that the sufficient condition in question is in fact also necessary:
Theorem 1.2.1 [General Theorem on Alternative] In the notation from Proposition 1.2.1, sys-
tem (S) has no solutions if and only if either TI , or TII , or both these systems, are solvable.
There are numerous proofs of the Theorem on Alternative; to my taste, the most instructive one is to
reduce the Theorem to its particular case – the Homogeneous Farkas Lemma:
[Homogeneous Farkas Lemma] A homogeneous nonstrict linear inequality

a^T x ≤ 0

is a consequence of a system of homogeneous nonstrict linear inequalities

ai^T x ≤ 0, i = 1, ..., m,

if and only if it can be obtained from the system by taking a weighted sum with nonnegative
weights:

(a) ai^T x ≤ 0, i = 1, ..., m  ⇒  a^T x ≤ 0
                     ⇕                             (1.2.1)
(b) ∃λi ≥ 0 :  a = ∑_i λi ai.
The reduction of GTA to HFL is easy. As for HFL itself, there are, essentially, two ways to prove the
statement:
• The “quick and dirty” one based on separation arguments (see Section B.2.6 and/or Exercise B.14),
which is as follows:
1. First, we demonstrate that if A is a nonempty closed convex set in Rn and a is a point from
Rn\A, then a can be strongly separated from A by a linear form: there exists x ∈ Rn such
that

x^T a < inf_{b∈A} x^T b.    (1.2.2)

To this end it suffices to verify that
(a) the problem min_b { ||a − b||_2 : b ∈ A } has a solution b*;
(b) setting x = b* − a, one ensures (1.2.2).
Both (a) and (b) are immediate.
2. Second, we demonstrate that the set

A = { b : ∃λ ≥ 0 : b = ∑_{i=1}^m λi ai }

– the cone spanned by the vectors a1, ..., am – is convex (which is immediate) and closed (the
proof of this crucial fact also is not difficult).
3. Combining the above facts, we immediately see that
— either a ∈ A, i.e., (1.2.1.b) holds,
— or there exists x such that x^T a < inf_{λ≥0} x^T ∑_i λi ai.
The latter inf is finite if and only if x^T ai ≥ 0 for all i, and in this case the inf is 0, so that
the "or" statement says exactly that there exists x with ai^T x ≥ 0 for all i and a^T x < 0, or,
which is the same, that (1.2.1.a) does not hold.
Thus, among the statements (1.2.1.a) and the negation of (1.2.1.b) at least one (and, as is
immediately seen, at most one as well) always is valid, which is exactly the equivalence
(1.2.1).
• “Advanced” proofs based purely on Linear Algebra facts (see Section B.2.5.A). The advantage of
these purely Linear Algebra proofs is that they, in contrast to the outlined separation-based proof,
do not use the completeness of Rn as a metric space and thus work when we pass from systems
with real coefficients and unknowns to systems with rational (or algebraic) coefficients. As a result,
an advanced proof allows one to establish the Theorem on Alternative for the case when the coefficients
and unknowns in (S), TI , TII are restricted to belong to a given “real field” (e.g., are rational).
We formulate here explicitly two very useful principles following from the Theorem on Al-
ternative:

A. A system of linear inequalities

ai^T x Ωi bi, i = 1, ..., m,

has no solutions if and only if one can combine the inequalities of the system in
a linear fashion (i.e., multiplying the inequalities by nonnegative weights, adding
the results and passing, if necessary, from an inequality a^T x > b to the inequality
a^T x ≥ b) to get a contradictory inequality, namely, either the inequality 0^T x ≥ 1, or
the inequality 0^T x > 0.
B. A linear inequality

a0^T x Ω0 b0
The inequality (e) is clearly a consequence of (a) – (d). However, if we extend the system of
inequalities (a) – (d) by all "trivial" (i.e., identically true) linear and quadratic inequalities
with 2 variables, like 0 > −1, u² + v² ≥ 0, u² + 2uv + v² ≥ 0, u² − uv + v² ≥ 0, etc.,
and ask whether (e) can be derived in a linear fashion from the inequalities of the extended
system, the answer will be negative. Thus, Principle A fails to be true already for quadratic
inequalities (which is a great sorrow – otherwise there would be no difficult problems at all!)
We are about to use the Theorem on Alternative to obtain the basic results of the LP duality
theory.
The motivation for constructing the problem dual to the LP program

c* = min_x { c^T x : Ax − b ≥ 0 },   A = [a1^T; a2^T; ...; am^T] ∈ Rm×n    (LP)
is the desire to generate, in a systematic way, lower bounds on the optimal value c∗ of (LP).
An evident way to bound from below a given function f(x) on the domain given by the system of
inequalities
gi (x) ≥ bi , i = 1, ..., m, (1.2.3)
is offered by what is called the Lagrange duality and is as follows:
Lagrange Duality:
• Let us look at all inequalities which can be obtained from (1.2.3) by linear aggre-
gation, i.e., at the inequalities of the form
∑_i yi gi(x) ≥ ∑_i yi bi    (1.2.4)
with the “aggregation weights” yi ≥ 0. Note that the inequality (1.2.4), due to its
origin, is valid on the entire set X of solutions of (1.2.3).
• Depending on the choice of aggregation weights, it may happen that the left hand
side in (1.2.4) is ≤ f(x) for all x ∈ Rn. Whenever this is the case, the right hand side
∑_i yi bi of (1.2.4) is a lower bound on f on X.
Indeed, on X the quantity ∑_i yi bi is a lower bound on ∑_i yi gi(x), and for the y in question
the latter function of x is everywhere ≤ f(x).
It follows that
• The optimal value in the problem

max_y { ∑_i yi bi :  (a) y ≥ 0;  (b) ∑_i yi gi(x) ≤ f(x) ∀x ∈ Rn }    (1.2.5)

is a lower bound on the values of f on the set of solutions to the system (1.2.3).
Let us look at what happens with the Lagrange duality when f and gi are homogeneous linear
functions: f = c^T x, gi(x) = ai^T x. In this case, the requirement (1.2.5.b) merely says that
c = ∑_i yi ai (or, which is the same, A^T y = c, due to the origin of A). Thus, problem (1.2.5)
becomes the Linear Programming problem

max_y { b^T y : A^T y = c, y ≥ 0 },    (LP*)

called the problem dual to (LP). Observe that a real a is a lower bound on the optimal value c*
of (LP) if and only if c^T x ≥ a whenever Ax ≥ b, i.e., if and only if the system of linear inequalities

(Sa):  −c^T x > −a,  Ax ≥ b
has no solutions. We know by the Theorem on Alternative that the latter fact means that some
other system of linear inequalities (more exactly, at least one of a certain pair of systems) does
have a solution. More precisely,
(*) (Sa) has no solutions if and only if at least one of the following two systems with
m + 1 unknowns:

TI:   (a)   λ = (λ0, λ1, ..., λm) ≥ 0;
      (b)   −λ0 c + ∑_{i=1}^m λi ai = 0;
      (cI)  −λ0 a + ∑_{i=1}^m λi bi ≥ 0;
      (dI)  λ0 > 0,

or

TII:  (a)   λ = (λ0, λ1, ..., λm) ≥ 0;
      (b)   −λ0 c + ∑_{i=1}^m λi ai = 0;
      (cII) −λ0 a + ∑_{i=1}^m λi bi > 0

– has a solution.
Now assume that (LP) is feasible. We claim that under this assumption (Sa ) has no solutions
if and only if TI has a solution.
The implication "TI has a solution ⇒ (Sa) has no solutions" is readily given by the above
remarks. To verify the inverse implication, assume that (Sa) has no solutions and the system
Ax ≥ b has a solution, and let us prove that then TI has a solution. If TI has no solution, then
by (*) TII has a solution and, moreover, λ0 = 0 for (every) solution to TII (since a solution
to the latter system with λ0 > 0 solves TI as well). But the fact that TII has a solution λ
with λ0 = 0 is independent of the values of a and c; if this were the case, it would
mean, by the same Theorem on Alternative, that, e.g., the following instance of (Sa):

0^T x > −1,  Ax ≥ b

has no solutions. The latter means that the system Ax ≥ b has no solutions – a contradiction
to the assumption that (LP) is feasible.
Now, if TI has a solution, then it has a solution with λ0 = 1 as well (to see this, pass from
a solution λ to λ/λ0; this construction is well-defined, since λ0 > 0 for every solution
to TI). Now, an (m + 1)-dimensional vector λ = (1, y) is a solution to TI if and only if the
m-dimensional vector y solves the system of linear inequalities and equations

(D):   y ≥ 0;
       A^T y ≡ ∑_{i=1}^m yi ai = c;
       b^T y ≥ a.
Summarizing our observations, we come to the following result.
Proposition 1.2.2 Assume that system (D) associated with the LP program (LP) has a solution
(y, a). Then a is a lower bound on the optimal value in (LP). Vice versa, if (LP) is feasible and
a is a lower bound on the optimal value of (LP), then a can be extended by a properly chosen
m-dimensional vector y to a solution to (D).
We see that the entity responsible for lower bounds on the optimal value of (LP) is the system
(D): every solution to the latter system induces a bound of this type, and in the case when
(LP) is feasible, all lower bounds can be obtained from solutions to (D). Now note that if
(y, a) is a solution to (D), then the pair (y, b^T y) also is a solution to the same system, and the
lower bound b^T y on c* is not worse than the lower bound a. Thus, as far as lower bounds on
c* are concerned, we lose nothing by restricting ourselves to the solutions (y, a) of (D) with
a = b^T y; the best lower bound on c* given by (D) is therefore the optimal value of the problem
max_y { b^T y : A^T y = c, y ≥ 0 }, which is nothing but the problem (LP*) dual to (LP). Note
that (LP*) is itself a Linear Programming program.
Proposition 1.2.3 Whenever y is a feasible solution to (LP∗ ), the corresponding value of the
dual objective bT y is a lower bound on the optimal value c∗ in (LP). If (LP) is feasible, then for
every a ≤ c∗ there exists a feasible solution y of (LP∗ ) with bT y ≥ a.
Theorem 1.2.2 [Duality Theorem in Linear Programming] Consider an LP program (LP) along
with its dual (LP*). Then
1) The duality is symmetric: the problem dual to the dual is equivalent to the primal;
2) The value of the dual objective at every dual feasible solution is ≤ the value of the primal
objective at every primal feasible solution;
3) The following 5 properties are equivalent to each other:
(i) the primal is feasible and bounded below;
(ii) the dual is feasible and bounded above;
(iii) the primal is solvable;
(iv) the dual is solvable;
(v) both the primal and the dual are feasible.
Whenever (i) ≡ (ii) ≡ (iii) ≡ (iv) ≡ (v) is the case, the optimal values of the primal and the dual
problems are equal to each other.
Proof. 1) is quite straightforward: writing the dual problem (LP∗ ) in our standard form, we
get
min_y { −b^T y :  [Im; A^T; −A^T] y − [0; c; −c] ≥ 0 },
where Im is the m-dimensional unit matrix. Applying the duality transformation to the latter
problem, we come to the problem

max_{ξ,η,ζ} { 0^T ξ + c^T η + (−c)^T ζ :  ξ ≥ 0, η ≥ 0, ζ ≥ 0, ξ + Aη − Aζ = −b },

which is clearly equivalent to (LP).
(i)⇒(iv): If the primal is feasible and bounded below, its optimal value c∗ (which
of course is a lower bound on itself) can, by Proposition 1.2.3, be (non-strictly)
majorized by a quantity bT y ∗ , where y ∗ is a feasible solution to (LP∗ ). In the
situation in question, of course, b^T y* = c* (by the already proved item 2)); on the other
hand, in view of the same Proposition 1.2.3, the optimal value in the dual is ≤ c∗ . We
conclude that the optimal value in the dual is attained and is equal to the optimal
value in the primal.
(iv)⇒(ii): evident;
(ii)⇒(iii): This implication, in view of the primal-dual symmetry, follows from the
implication (i)⇒(iv).
(iii)⇒(i): evident.
We have seen that (i)≡(ii)≡(iii)≡(iv) and that the first (and consequently each) of
these 4 equivalent properties implies that the optimal value in the primal problem
is equal to the optimal value in the dual one. All that remains is to prove the
equivalence between (i)–(iv), on one hand, and (v), on the other hand. This is
immediate: (i)–(iv), of course, imply (v); vice versa, in the case of (v) the primal is
not only feasible, but also bounded below (this is an immediate consequence of the
feasibility of the dual problem, see 2)), and (i) follows.
An immediate corollary of the LP Duality Theorem is the following necessary and sufficient
optimality condition in LP:
Theorem 1.2.3 [Necessary and sufficient optimality conditions in linear programming] Con-
sider an LP program (LP) along with its dual (LP∗ ). A pair (x, y) of primal and dual feasible
solutions is comprised of optimal solutions to the respective problems if and only if

b^T y = c^T x    [zero duality gap],

and if and only if

yi [Ax − b]i = 0, i = 1, ..., m    [complementary slackness].
Indeed, the “zero duality gap” optimality condition is an immediate consequence of the fact
that the value of primal objective at every primal feasible solution is ≥ the value of the
dual objective at every dual feasible solution, while the optimal values in the primal and the
dual are equal to each other, see Theorem 1.2.2. The equivalence between the “zero duality
gap” and the “complementary slackness” optimality conditions is given by the following
computation: whenever x is primal feasible and y is dual feasible, the products yi [Ax − b]i,
i = 1, ..., m, are nonnegative, while the sum of these products is precisely the duality gap:

∑_{i=1}^m yi [Ax − b]i = y^T (Ax − b) = (A^T y)^T x − b^T y = c^T x − b^T y.
Thus, the duality gap can vanish at a primal-dual feasible pair (x, y) if and only if all products
yi [Ax − b]i for this pair are zeros.
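These relations are easy to observe numerically. Below is a small self-contained sketch (the toy data are ours) using scipy.optimize.linprog: we solve a primal program min{c^T x : Ax − b ≥ 0} and its dual max{b^T y : A^T y = c, y ≥ 0}, and then inspect the duality gap and the products yi[Ax − b]i:

    import numpy as np
    from scipy.optimize import linprog

    # Toy primal: min c^T x  s.t.  Ax - b >= 0  (data are ours)
    A = np.array([[1., 1.], [1., 0.], [0., 1.]])
    b = np.array([1., 0., 0.])
    c = np.array([1., 2.])

    # linprog uses A_ub x <= b_ub, so Ax >= b becomes -Ax <= -b; x is free
    primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)
    # Dual: max b^T y = -min(-b^T y)  s.t.  A^T y = c, y >= 0
    dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 3)

    x, y = primal.x, dual.x
    print(primal.fun, -dual.fun)   # equal optimal values: zero duality gap
    print(y * (A @ x - b))         # complementary slackness: all entries ~ 0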
y = Ax + η ∈ Rm (1.3.1)
where η is observation noise. Our goal is to recover x from the observed y. The outlined problem
is responsible for an extremely wide variety of applications and, depending on a particular
application, is studied in different "regimes." For example, in traditional Statistics x is
interpreted not as a signal, but as the vector of parameters of a "black box" which, given on
input a vector a ∈ Rn, produces output a^T x. Given a collection a1, ..., am of n-dimensional inputs
to the black box and the corresponding outputs (perhaps corrupted by noise) yi = [ai]^T x + ηi,
we want to recover the vector of parameters x; this is called the linear regression problem. In order
to represent this problem in the form of (1.3.1), one makes the row vectors [ai]^T the rows
of an m × n matrix A and sets y = [y1; ...; ym], η = [η1; ...; ηm]. The
typical regime here is m ≫ n – the number of observations is much larger than the number
of parameters to be recovered, and the challenge is to use this "observation redundancy" in
order to get rid, to the best extent possible, of the observation noise. In Compressed Sensing
the situation is the opposite: the regime of interest is m ≪ n. At first glance, this regime
seems to be completely hopeless: even with no noise (η = 0), we need to recover a solution x
to an underdetermined system of linear equations y = Ax. When the number of variables is
greater than the number of observations, the solution to the system either does not exist, or is
not unique, and in both cases our goal seems to be unreachable. This indeed is so, unless we
have at our disposal some additional information on x. In Compressed Sensing, this additional
information is that x is s-sparse — has at most a given number s of nonzero entries. Note that
in many applications we indeed can be sure that the true signal x is sparse. Consider, e.g., the
following story about signal detection:
1) For related reading, see, e.g., [16] and references therein.
There are n locations where signal transmitters could be placed, and m locations with
the receivers. The contribution of a signal of unit magnitude originating in location
j to the signal measured by receiver i is a known quantity aij , and signals originating
in different locations merely sum up in the receivers; thus, if x is the n-dimensional
vector with entries xj representing the magnitudes of signals transmitted in locations
j = 1, 2, ..., n, then the m-dimensional vector y of (noiseless) measurements of the m
receivers is y = Ax, A ∈ Rm×n. Given this vector, we intend to recover x.
Now, if the receivers are hydrophones registering noises emitted by submarines in a certain part of
the Atlantic, tentative positions of submarines being discretized with resolution 500 m, the dimension
of the vector x (the number of points in the discretization grid) will be in the range of tens of
thousands, if not tens of millions. At the same time, the total number of submarines (i.e.,
nonzero entries in x) can be safely upper-bounded by 50, if not by 20.
Sparsity dramatically changes our possibilities of recovering high-dimensional signals from their
low-dimensional linear images: given in advance that x has at most s ≪ m nonzero entries, the
possibility of exact recovery of x, at least from the noiseless observations y, becomes quite natural.
Indeed, let us try to recover x by the following "brute force" search: we inspect, one by one,
all subsets I of the index set {1, ..., n} – first the empty set, then the n singletons {1}, ..., {n}, then
the n(n − 1)/2 two-element subsets, etc. – and each time try to solve the system of linear equations

y = Ax,  xj = 0 when j ∉ I;

when arriving for the first time at a solvable system, we terminate and claim that its solution
is the true vector x. It is clear that we will terminate before all sets I of cardinality ≤ s are
inspected. It is also easy to show (do it!) that if every 2s distinct columns in A are linearly
independent (when m ≥ 2s, this indeed is the case for a matrix A in "general position" 2)),
then the procedure is correct – it indeed recovers the true vector x.
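For small n, the brute force search is a few lines of code; here is a sketch of ours (noiseless observations assumed), which also makes the combinatorial blow-up evident:

    import itertools
    import numpy as np

    def brute_force_recover(A, y, s, tol=1e-9):
        # Inspect index subsets I in order of increasing cardinality and return
        # the first x supported on I which solves y = Ax (within tolerance).
        m, n = A.shape
        for k in range(s + 1):
            for I in map(list, itertools.combinations(range(n), k)):
                x = np.zeros(n)
                if k > 0:
                    # least squares solution of y = A[:, I] x_I
                    x[I], *_ = np.linalg.lstsq(A[:, I], y, rcond=None)
                if np.linalg.norm(A @ x - y) <= tol:
                    return x
        return None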
The bad news is that the outlined procedure becomes completely impractical already for
"small" values of s and n because of the astronomically large number of linear systems we need
to process 3). A partial remedy is as follows. The outlined approach is, essentially, a particular
way to solve the optimization problem

min_x { nnz(x) : Ax = y },    (*)

where nnz(x) is the number of nonzero entries in a vector x. At the present level of our knowl-
edge, this problem looks completely intractable (in fact, we do not know algorithms solving
the problem essentially faster than the brute force search), and there are strong reasons, to be
addressed later in our course, to believe that it indeed is intractable. Well, if we do not know
how to minimize the "bad" objective nnz(x) under linear constraints, let us "approximate"
this objective with one which we do know how to minimize. The true objective is separable:
nnz(x) = ∑_{j=1}^n ξ(xj), where ξ(s) is the function on the axis equal to 0 at the origin and equal
to 1 otherwise. As a matter of fact, the separable functions which we do know how to minimize
under linear constraints are sums of convex functions of x1, ..., xn 4). The most natural candidate
for the role of a convex approximation of ξ(s) is |s|; with this approximation, (*) converts into the
ℓ1-minimization problem

min_x { ||x||_1 := ∑_{j=1}^n |xj| :  Ax = y },    (1.3.2)

which is equivalent to the LP program

min_{x,w} { ∑_{j=1}^n wj :  Ax = y,  −wj ≤ xj ≤ wj, 1 ≤ j ≤ n }.
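The LP reformulation is immediate to implement; here is a sketch of ours using scipy.optimize.linprog, with the variable vector stacked as [x; w]:

    import numpy as np
    from scipy.optimize import linprog

    def l1_recover(A, y):
        # min sum_j w_j  s.t.  Ax = y,  -w_j <= x_j <= w_j;  variables [x; w]
        m, n = A.shape
        cost = np.concatenate([np.zeros(n), np.ones(n)])
        A_eq = np.hstack([A, np.zeros((m, n))])
        I = np.eye(n)
        A_ub = np.vstack([np.hstack([I, -I]),     #  x - w <= 0
                          np.hstack([-I, -I])])   # -x - w <= 0
        res = linprog(cost, A_ub=A_ub, b_ub=np.zeros(2 * n),
                      A_eq=A_eq, b_eq=y,
                      bounds=[(None, None)] * n + [(0, None)] * n)
        return res.x[:n]

On a random Gaussian A with m substantially larger than 2s and an s-sparse x, l1_recover(A, A @ x) typically reproduces x up to solver tolerance.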
For the time being, we were focusing on the (unrealistic!) case of noiseless observations
η = 0. A realistic model is that η ≠ 0. How to proceed in this case depends on what we know
about η. In the simplest case of "unknown but small noise" one assumes that, say, the Euclidean
norm || · ||_2 of η is upper-bounded by a given "noise level" δ: ||η||_2 ≤ δ. In this case, the ℓ1
recovery usually takes the form

x̂ ∈ Argmin_w { ||w||_1 : ||Aw − y||_2 ≤ δ }.    (1.3.3)

Now we cannot hope that our recovery x̂ will be exactly equal to the true s-sparse signal x, but
perhaps we may hope that x̂ is close to x when δ is small.
Note that (1.3.3) is not an LP program anymore 5), but still is a nice convex optimization
program which can be solved to high accuracy even for reasonably large m, n.
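Program (1.3.3) is straightforward to express with a convex modeling package; here is a sketch of ours using cvxpy (assuming the package and a conic solver are available):

    import cvxpy as cp

    def l1_recover_noisy(A, y, delta):
        # x_hat in Argmin_w { ||w||_1 : ||Aw - y||_2 <= delta }   -- cf. (1.3.3)
        w = cp.Variable(A.shape[1])
        prob = cp.Problem(cp.Minimize(cp.norm1(w)),
                          [cp.norm2(A @ w - y) <= delta])
        prob.solve()
        return w.value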
In other words, for every nonzero vector z ∈ Ker A, the sum ||z||_{s,1} of the s largest magnitudes
of entries in z should be strictly less than half of the sum of magnitudes of all entries.
4) A real-valued function f(s) on the real axis is called convex if its graph, between every pair of its points, is
below the chord linking these points, or, equivalently, if f(x + λ(y − x)) ≤ f(x) + λ(f(y) − f(x)) for every x, y ∈ R
and every λ ∈ [0, 1]. For example, maxima of finitely many affine functions ai s + bi on the axis are convex. For a
more detailed treatment of convexity of functions, see Appendix C.
5) To get an LP, we should replace the Euclidean norm ||Aw − y||_2 of the residual with, say, the uniform norm
||Aw − y||_∞, which makes perfect sense when we start with coordinate-wise bounds on observation errors, which
indeed is the case in some applications.
The necessity and sufficiency of the nullspace property for s-goodness of A can be
derived "from scratch" – from the fact that s-goodness means that every s-sparse
signal x should be the unique optimal solution to the associated LP min_w { ||w||_1 :
Aw = Ax }, combined with the LP optimality conditions. Another option, which
we prefer to use here, is to guess the condition and then to prove that it indeed is
necessary and sufficient for s-goodness of A. The necessity is evident: if the nullspace
property does not take place, then there exist 0 ≠ z ∈ Ker A and an s-element subset
I of the index set {1, ..., n} such that, denoting by J the complement of I in {1, ..., n},
the vector zI obtained from z by zeroing out all entries with indexes not in I, along
with the vector zJ obtained from z by zeroing out all entries with indexes not in J,
satisfy the relation ||zI||_1 ≥ (1/2)||z||_1 = (1/2)[||zI||_1 + ||zJ||_1], that is,

||zI||_1 ≥ ||zJ||_1.

Since Az = 0, we have AzI = A[−zJ], and we conclude that the s-sparse vector zI
is not the unique optimal solution to the LP min_w { ||w||_1 : Aw = AzI }: indeed, −zJ is
a feasible solution to this program with the value of the objective at least as good as
the one at zI, on one hand, and the solution −zJ is different from zI (since otherwise
we would have zI = zJ = 0, whence z = 0, which is not the case), on the other hand.
To prove that the nullspace property is sufficient for A to be s-good is equally easy:
indeed, assume that this property does take place, let x be an s-sparse signal, so
that the indexes of nonzero entries in x are contained in an s-element subset I of
{1, ..., n}, and let us prove that if x̂ is an optimal solution to the LP (1.3.2), then
x̂ = x. Indeed, denoting by J the complement of I, setting z = x̂ − x and assuming
that z ≠ 0, we have Az = 0. Further, in the same notation as above we have

||xI||_1 − ||x̂I||_1 ≤ ||zI||_1 < ||zJ||_1 ≤ ||x̂J||_1 − ||xJ||_1

(the first and the third inequality are due to the Triangle inequality, and the second –
due to the nullspace property), whence ||x||_1 = ||xI||_1 + ||xJ||_1 < ||x̂I||_1 + ||x̂J||_1 = ||x̂||_1,
which contradicts the origin of x̂.
Denoting by hv an optimal solution to the right hand side LP, let us set

βs(A) = max_{v∈Vs} ||hv||_2,   γs(A) = max_{v∈Vs} Opt(Pv).

Observe that the maxima in question are well defined reals, since Vs is a finite set, and that the
nullspace property is nothing but the relation

γs(A) < 1/2.    (1.3.5)
1. x is "nearly s-sparse": ||x − xs||_1 ≤ ρ,

where xs is the best s-sparse approximation of x (to get this approximation, one zeros out
all but the s largest in magnitude entries in x, the ties, if any, being resolved arbitrarily);
2. y is a noisy observation of x:

y = Ax + η,  ||η||_2 ≤ δ;

3. x̂ is a μ-suboptimal and ε-feasible solution to (1.3.3), specifically,

||x̂||_1 ≤ μ + min_w { ||w||_1 : ||Aw − y||_2 ≤ δ }   and   ||Ax̂ − y||_2 ≤ ε.
4. the relation

||z||_{s,1} ≤ β ||Az||_2 + γ ||z||_1   ∀z ∈ Rn    (1.3.7)

holds true with some parameters γ < 1/2 and β < ∞ (as definitely is the case when A is s-good,
with γ = γs(A) and β = βs(A)). Then for the outlined imperfect ℓ1 recovery the following error
bound holds true:

||x̂ − x||_1 ≤ [2β(δ + ε) + μ + 2ρ] / (1 − 2γ),    (1.3.8)

i.e., the recovery error is of order of the maximum of the "imperfections" mentioned in 1) – 3).
Indeed, x is a feasible solution to the minimization problem in item 3 (since ||Ax − y||_2 = ||η||_2 ≤ δ),
so that

||x̂||_1 ≤ μ + ||x||_1.

Setting z = x̂ − x, denoting by I the set of indexes of the s largest in magnitude entries of x and
by J the complement of I, we get

||xI||_1 − ||x̂I||_1 ≥ ||x̂J||_1 − ||xJ||_1 − μ,

where the left hand side is ≤ ||zI||_1 (Triangle inequality) and ||x̂J||_1 − ||xJ||_1 ≥ ||zJ||_1 − 2||xJ||_1
(Triangle inequality again), whence

||zJ||_1 ≤ μ + ||zI||_1 + 2||xJ||_1,

so that

||z||_1 ≤ μ + 2||zI||_1 + 2||xJ||_1.    (a)

We further have

||zI||_1 ≤ β||Az||_2 + γ||z||_1,

which combines with (a) to imply that

(1 − 2γ)||z||_1 ≤ μ + 2β||Az||_2 + 2||xJ||_1 ≤ μ + 2β(δ + ε) + 2ρ

(we have used that ||Az||_2 = ||(Ax̂ − y) − (Ax − y)||_2 ≤ ε + δ and ||xJ||_1 = ||x − xs||_1 ≤ ρ),
and (1.3.8) follows.
1. For given m, n with m ≪ n (say, m/n ≤ 1/2), there exist m × n sensing matrices which
are s-good for values of s "nearly as large as m," specifically, for s ≤ O(1) m/ln(n/m) 6).
Moreover, there are natural families of matrices where this level of goodness "is a rule."
E.g., when drawing an m × n matrix at random from the Gaussian or the ±1 distributions
(i.e., filling the matrix with independent realizations of a random variable which is either

6) From now on, O(1)'s denote positive absolute constants – appropriately chosen numbers like 0.5, or 1, or
perhaps 100,000. We could, in principle, replace all O(1)'s by specific numbers; following the standard mathe-
matical practice, we do not do it, partly from laziness, partly because the particular values of these numbers in
our context are irrelevant.
Gaussian (zero mean, variance 1/m), or takes values ±1/√m with probabilities 0.5 (footnote 7)),
the result will be s-good, for the outlined value of s, with probability approaching 1 as m and
n grow. Moreover, for the indicated values of s and randomly selected matrices A, one has
βs(A) ≤ O(1)√s with probability approaching one as m, n grow.
2. The above results can be considered as good news. The bad news is that we do not
know how to check efficiently, given s and a sensing matrix A, that the matrix is s-
good. Indeed, we know that a necessary and sufficient condition for s-goodness of A is
the nullspace property (1.3.5); this, however, does not help, since the quantity γs(A) is
difficult to compute: computing it by definition requires solving 2^s C_n^s LP programs (Pv),
v ∈ Vs, which is an astronomic number already for moderate n unless s is really small, like
1 or 2. And no alternative efficient way to compute γs(A) is known.
As a matter of fact, not only do we not know how to check s-goodness efficiently; there still
is no efficient recipe allowing one to build, given m, an m × 2m matrix A which is provably
s-good for s larger than O(1)√m – a much smaller "level of goodness" than the one
(s = O(1)m) promised by theory for typical randomly generated matrices 8). The "common
life" analogy of this pitiful situation would be as follows: you know that with probability
at least 0.9, a brick in your wall is made of gold, and at the same time, you do not know
how to tell a golden brick from a usual one 9).
that is, γs(A) is the maximum of a convex function ||z||_{s,1} over the convex set {z : Az = 0, ||z||_1 ≤ 1}.
Although both the function and the set are simple, maximizing a convex function over a convex
7) Entries "of order of 1/√m" make the Euclidean norms of the columns of an m × n matrix A nearly one,
which is the normalization of A most convenient for Compressed Sensing.
8) Note that the naive algorithm "generate m × 2m matrices at random until an s-good matrix, with s as
promised by the theory, is generated" is not an efficient recipe, since we do not know how to check s-goodness
efficiently.
9) This phenomenon is met in many other situations. E.g., in 1938 Claude Shannon (1916-2001), "the father
of Information Theory," made (in his M.Sc. Thesis!) a fundamental discovery as follows. Consider a Boolean
function of n Boolean variables (i.e., both the function and the variables take values 0 and 1 only); as is easily
seen, there are 2^(2^n) functions of this type, and every one of them can be computed by a dedicated circuit comprised of
"switches" implementing just 3 basic operations AND, OR and NOT (just as computing a polynomial can be carried
out on a circuit with nodes implementing just two basic operations: addition of reals and their multiplication). The
discovery of Shannon was that every Boolean function of n variables can be computed on a circuit with no more
than C n^{-1} 2^n switches, where C is an appropriate absolute constant. Moreover, Shannon proved that "nearly all"
Boolean functions of n variables require circuits with at least c n^{-1} 2^n switches, c being another absolute constant;
"nearly all" in this context means that the fraction of "easy to compute" functions (i.e., those computable by
circuits with less than c n^{-1} 2^n switches) among all Boolean functions of n variables goes to 0 as n goes to ∞. Now,
computing Boolean functions by circuits comprised of switches was an important technical task already in 1938;
its role in today's life can hardly be overestimated – the outlined computation is nothing but what is going
on in a computer. Given this observation, it is not surprising that the Shannon discovery of 1938 was the subject
of countless refinements, extensions, modifications, etc., etc. What is still missing is a single individual example
of a "difficult to compute" Boolean function: as a matter of fact, all multivariate Boolean functions f(x1, ..., xn)
people have managed to describe explicitly are computable by circuits with a number of switches just linear in n!
set typically is difficult. The only notable exception here is the case of maximizing a convex
function f over a convex set X given as the convex hull of a finite set: X = Conv{v 1 , ..., v N }. In
this case, a maximizer of f on the finite set {v^1, ..., v^N} (this maximizer can be found by brute
force computation of the values of f at the points v^i) is a maximizer of f over the entire X (check it
yourself or see Section C.5).
Given that the nullspace property “as it is” is difficult to check, we can look for “the second
best thing” — efficiently computable upper and lower bounds on the “goodness” s∗ (A) of A
(i.e., on the largest s for which A is s-good).
Let us start with efficient lower bounding of s*(A), that is, with efficiently verifiable suffi-
cient conditions for s-goodness. One way to derive such a condition is to specify an efficiently
computable upper bound γ̂s(A) on γs(A). With such a bound at our disposal, the efficiently
verifiable condition γ̂s(A) < 1/2 clearly will be a sufficient condition for the validity of (1.3.5).
The question is how to find an efficiently computable upper bound on γs(A), and here is
one of the options:
γs(A) = max_z { max_{v∈Vs} v^T z :  Az = 0, ||z||_1 ≤ 1 }
⇒ ∀H ∈ Rm×n:  γs(A) = max_z { max_{v∈Vs} v^T [I − H^T A] z :  Az = 0, ||z||_1 ≤ 1 }
              ≤ max_z { max_{v∈Vs} v^T [I − H^T A] z :  ||z||_1 ≤ 1 }
              = max_{z∈Z} ||[I − H^T A] z||_{s,1},    Z = {z : ||z||_1 ≤ 1}.
We see that whatever the "design parameter" H ∈ Rm×n, the quantity γs(A) does not exceed
the maximum of the convex function ||[I − H^T A]z||_{s,1} over the unit ℓ1-ball Z. But the latter
set is perfectly well suited for maximizing convex functions: it is the convex hull of a small set
(just 2n points, the ± basic orths), so that the maximum over Z is max_j ||Colj[I − H^T A]||_{s,1},
Colj[B] being the j-th column of B. We end up with the bound

γs(A) ≤ γ̂s(A) := min_{H∈Rm×n} Ψ(H),   Ψ(H) := max_{1≤j≤n} ||Colj[I − H^T A]||_{s,1}.

The function Ψ(H) is efficiently computable and convex; this is why its minimization can be
carried out efficiently. Thus, γ̂s(A) is an efficiently computable upper bound on γs(A).
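Evaluating Ψ at a given H is elementary, since ||·||_{s,1} of a vector is just the sum of its s largest magnitudes; here is a minimal Python/NumPy sketch of ours:

    import numpy as np

    def norm_s1(z, s):
        # ||z||_{s,1}: the sum of the s largest magnitudes of entries of z
        return np.sort(np.abs(z))[-s:].sum()

    def Psi(H, A, s):
        # Psi(H) = max_j ||Col_j[I - H^T A]||_{s,1}; for any H this
        # upper-bounds gamma_s(A), and gamma_hat_s(A) = min_H Psi(H)
        B = np.eye(A.shape[1]) - H.T @ A
        return max(norm_s1(B[:, j], s) for j in range(B.shape[1]))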
Some instructive remarks are in order.
1. The trick which led us to γ̂s(A) is applicable to bounding from above the maximum of a
convex function f over a set X of the form {x ∈ Conv{v^1, ..., v^N} : Ax = 0} (i.e., over
the intersection of an "easy for convex maximization" domain and a linear subspace). The
trick is merely to note that if A is m × n, then for every H ∈ Rm×n one has

max_x { f(x) : x ∈ Conv{v^1, ..., v^N}, Ax = 0 } ≤ max_{1≤i≤N} f([I − H^T A]v^i).    (!)

Indeed, a feasible solution x to the left hand side optimization problem can be represented
as a convex combination ∑_i λi v^i, and since Ax = 0, we have also x = ∑_i λi [I − H^T A]v^i;
since f is convex, we have therefore f(x) ≤ max_i f([I − H^T A]v^i), and (!) follows. Since (!)
takes place for every H, we arrive at

max_x { f(x) : x ∈ Conv{v^1, ..., v^N}, Ax = 0 } ≤ γ̂ := min_{H∈Rm×n} max_{1≤i≤N} f([I − H^T A]v^i),

with the right hand side efficiently computable whenever f is so.
2. The efficiently computable upper bound γ̂s(A) is polyhedrally representable – it is the
optimal value in an explicit LP program. To derive this program, we start with an important
by itself polyhedral representation of the function ||z||_{s,1}:

Lemma 1.3.1 For every z ∈ Rn and every integer s ≤ n, one has

||z||_{s,1} = min_{t,w} { st + ∑_{i=1}^n wi :  |zi| ≤ t + wi, 1 ≤ i ≤ n;  t ≥ 0, w ≥ 0 }.    (1.3.11)

Proof. One way to get (1.3.11) is to note that ||z||_{s,1} = max_{v∈Vs} v^T z = max_{v∈Conv(Vs)} v^T z and
to verify that the convex hull of the set Vs is exactly the polytope V̂s = {v ∈ Rn :
|vi| ≤ 1 ∀i, ∑_i |vi| ≤ s} (or, which is the same, to verify that the vertices of the latter polytope
are exactly the vectors from Vs). With this verification at our disposal, we get

||z||_{s,1} = max_v { v^T z :  |vi| ≤ 1 ∀i,  ∑_i |vi| ≤ s };

for every (t, w) feasible for the right hand side problem in (1.3.11) and every such v we have
v^T z ≤ ∑_i |vi||zi| ≤ ∑_i |vi|(t + wi) ≤ st + ∑_i wi, so that the left hand side in (1.3.11) is ≤ the
right hand side. On the other hand, let |z_{i1}| ≥ |z_{i2}| ≥ ... ≥ |z_{is}| be the s largest magnitudes of
entries in z (so that i1, ..., is are distinct from each other), and let t = |z_{is}|, wi = max[|zi| − t, 0]. It
is immediately seen that (t, w) is feasible for the right hand side problem in (1.3.11) and
that st + ∑_i wi = ∑_{j=1}^s |z_{ij}| = ||z||_{s,1}. Thus, the right hand side in (1.3.11) is ≤ the left
hand side. □
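Representation (1.3.11) can be sanity-checked numerically by comparing the direct computation of ||z||_{s,1} with the optimal value of the right hand side LP; here is a sketch of ours using scipy.optimize.linprog, with the variables stacked as [t; w]:

    import numpy as np
    from scipy.optimize import linprog

    def norm_s1_lp(z, s):
        # optimal value of min { s*t + sum_i w_i : |z_i| <= t + w_i, t, w >= 0 }
        n = len(z)
        cost = np.concatenate([[s], np.ones(n)])
        A_ub = np.hstack([-np.ones((n, 1)), -np.eye(n)])  # -t - w_i <= -|z_i|
        res = linprog(cost, A_ub=A_ub, b_ub=-np.abs(z), bounds=(0, None))
        return res.fun

    z = np.array([3., -1., 4., 1., -5.])
    print(norm_s1_lp(z, 2), np.sort(np.abs(z))[-2:].sum())  # both print 9.0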
Lemma 1.3.1 straightforwardly leads to the following polyhedral representation of γ̂s(A):

γ̂s(A) = min_{H,t,w,τ} { τ :  |[I − H^T A]_{ij}| ≤ tj + wij ∀i, j;  tj ≥ 0, wij ≥ 0;  s tj + ∑_{i=1}^n wij ≤ τ, 1 ≤ j ≤ n }.
3. The quantity γ̂1(A) is exactly equal to γ1(A), rather than being just an upper bound on the
latter quantity.
Indeed, ||z||_{1,1} = ||z||_∞, so that γ̂1(A) = min_H max_j ||Colj[I − H^T A]||_∞, and the minimization
in H can be carried out column by column, the j-th column being an optimal solution to the LP

(Pj):  min_h ||ej − A^T h||_∞,

where ej are the standard basic orths in Rn. Denoting by h^j optimal solutions to these LPs and
setting H = [h^1, ..., h^n], we get γ̂1(A) = max_j Opt(Pj). On the other hand, by LP duality,
Opt(Pj) = max_z { ej^T z : Az = 0, ||z||_1 ≤ 1 }, whence max_j Opt(Pj) = max_z { ||z||_∞ : Az = 0,
||z||_1 ≤ 1 } = γ1(A); since the opposite inequality γ1(A) ≤ γ̂1(A) definitely holds true, we conclude that

γ̂1(A) = γ1(A).
Observe that an optimal solution H to the latter problem can thus be found column by column,
with the j-th column h^j of H being an optimal solution to the LP (Pj); this is in nice contrast
with computing γ̂s(A) for s > 1, where we should solve a single LP with O(n²) variables
and constraints, which is typically much more time consuming than solving O(n) LPs with
O(n) variables and constraints each, as is the case when computing γ̂1(A).
Observe also that if p, q are positive integers, then for every vector z one has ||z||_{pq,1} ≤
q||z||_{p,1}, and in particular ||z||_{s,1} ≤ s||z||_{1,1} = s||z||_∞. It follows that if H is such that
γ̂p(A) = max_j ||Colj[I − H^T A]||_{p,1}, then γ̂pq(A) ≤ q max_j ||Colj[I − H^T A]||_{p,1} = q γ̂p(A). In
particular,

γ̂s(A) ≤ s γ̂1(A),

meaning that the easy-to-verify condition

γ̂1(A) < 1/(2s)

is sufficient for the validity of the condition γ̂s(A) < 1/2, and thus for the s-goodness of A.
4. Assume that A and s are such that the s-goodness of A can be certified via our verifiable
sufficient condition, that is, we can point out an m × n matrix H such that

γ := max_j ||Colj[I − H^T A]||_{s,1} < 1/2.

Now, for every n × n matrix B, any norm || · || on Rn and every vector z ∈ Rn we clearly
have

||Bz|| ≤ max_j ||Colj[B]|| · ||z||_1

(why?). Therefore, from the definition of γ, for every vector z we have ||[I − H^T A]z||_{s,1} ≤
γ||z||_1, so that

||z||_{s,1} ≤ ||H^T Az||_{s,1} + ||[I − H^T A]z||_{s,1} ≤ s max_j ||Colj[H]||_2 ||Az||_2 + γ||z||_1,

meaning that H certifies not only the s-goodness of A, but also an inequality of the form
(1.3.7), and thus the associated error bound (1.3.8) for imperfect ℓ1 recovery.
is called the mutual incoherence of A; the smaller this quantity, the closer the columns of A
are to mutual orthogonality.
(a) Prove that γ1(A) = γ̂1(A) ≤ μ(A)/(μ(A) + 1) and that whenever s < (μ(A) + 1)/(2μ(A)),
the relation (1.3.7) is satisfied with

γ = s μ(A)/(μ(A) + 1) < 1/2,   β = s/(μ(A) + 1).

Hint: Look at what happens with the verifiable sufficient condition for s-goodness when
H = A/(μ(A) + 1).
(b) Let A be a randomly selected m × n matrix, 1 ≪ m ≤ n, with independent entries taking
values ±1/√m with probabilities 0.5. Verify that for a properly chosen absolute
constant C we have μ(A) ≤ C√(ln(n)/m). Derive from this observation that the above
verifiable sufficient condition for s-goodness, for properly selected m × n matrices, can
certify their s-goodness with s as large as O(√(m/ln(n))).
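Part (b) is easy to explore empirically; in the sketch below (ours), μ(A) is computed as the maximal absolute inner product of distinct columns (the columns of A in this model are unit vectors, so no normalization is needed):

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 256, 1024
    A = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

    G = np.abs(A.T @ A)                       # |Col_i^T Col_j|; diagonal is 1
    np.fill_diagonal(G, 0.0)
    print(G.max(), np.sqrt(np.log(n) / m))    # mu(A) is of order sqrt(ln(n)/m)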
3. Upper-bounding the goodness level. In order to demonstrate that A is not s-good for a
given s, it suffices to point out a vector z ∈ Ker A\{0} such that ||z||_1 ≤ 1 and ||z||_{s,1} ≥ 1/2.
The simplest way to attempt to achieve this goal is as follows: Start with some
v^0 ∈ Vs and solve the LP max_z { [v^0]^T z : Az = 0, ||z||_1 ≤ 1 }, thus getting z^1 such that
||z^1||_{s,1} ≥ [v^0]^T z^1 and Az^1 = 0, ||z^1||_1 ≤ 1. Now choose v^1 ∈ Vs such that [v^1]^T z^1 = ||z^1||_{s,1},
and solve the LP program max_z { [v^1]^T z : Az = 0, ||z||_1 ≤ 1 }, thus getting a new z = z^2,
then define v^2 ∈ Vs such that [v^2]^T z^2 = ||z^2||_{s,1}, solve the new LP, and so on. In short,
the outlined process is an attempt to lower-bound γs(A) = max_{v∈Vs} max_{z: Az=0, ||z||_1≤1} v^T z
by switching
feature vector and the label; given this sample, we want to build a classifier – a function f (x)
on the space of feature vectors x taking values ±1 – which we intend to use to predict, given
the value of a new feature vector, the value of the corresponding label. In our example this
setup reads: we are given medical records containing both the results of medical tests and the
diagnoses of N patients; given this data, we want to learn how to predict the diagnosis given
the results of the tests taken from a new patient.
The simplest predictors we can think about are just the "linear" ones, which look as follows. We
fix an affine form z^T x + b of a feature vector, choose a positive threshold γ, and say that if the
value of the form at a feature vector x is "well positive" – is ≥ γ – then the proposed label for
x is +1; similarly, if the value of the form at x is "well negative" – is ≤ −γ – then the proposed
label will be −1. In the "gray area" −γ < z^T x + b < γ we decline to classify. Noting that the
actual value of the threshold is of no importance (to compensate a change of the threshold by
a certain factor, it suffices to multiply both z and b by this factor, without affecting the resulting
classification), we from now on normalize the situation by setting the threshold to the value 1.
Now, we have explained how a linear classifier works, but where does one take it from? An intu-
itively appealing idea is to use the training sample in order to "train" our potential classifier –
to choose z and b in a way which ensures correct classification of the examples in the sample.
This amounts to solving the system of linear inequalities
y i (ziT xi + b) ≥ 1 ∀i = 1, ..., N.
The latter system, of course, is not necessarily feasible: it may well happen that it is impossible to separate the positive and the negative examples in the training sample by a stripe between two parallel hyperplanes. To handle this possibility, we allow for classification errors and minimize a weighted sum of $\|z\|_2$ and the total penalty for these errors. Since the absence of a classification penalty at an example $(x^i, y^i)$ is equivalent to the validity of the inequality $y^i(z^Tx^i + b) \ge 1$, the most natural penalty for misclassification of the example is $\max[1 - y^i(z^Tx^i + b), 0]$. With this in mind, the problem of building "the best on the training sample" classifier becomes the optimization problem
$$\min_{z,b}\left\{\|z\|_2 + \lambda\sum_{i=1}^N \max[1 - y^i(z^Tx^i + b), 0]\right\}, \tag{1.3.13}$$
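For illustration, (1.3.13) can be fed to a generic convex-optimization modeling tool essentially verbatim. Here is a minimal sketch assuming the CVXPY package and synthetic data; the names `X`, `y`, `lam` are ours, not the text's.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 5
X = rng.standard_normal((N, n))
y = np.sign(X @ rng.standard_normal(n) + 0.3 * rng.standard_normal(N))  # labels +-1

z, b = cp.Variable(n), cp.Variable()
lam = 1.0
# hinge penalty max[1 - y^i (z^T x^i + b), 0], summed over the sample
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ z + b)))
prob = cp.Problem(cp.Minimize(cp.norm(z, 2) + lam * hinge))
prob.solve()
print("optimal value of (1.3.13):", prob.value)
```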
where $\lambda > 0$ is responsible for the "compromise" between the width of the stripe (∗) and the "separation quality" of this stripe; how to choose the value of this parameter is an additional story we do not touch here. Note that the outlined approach to building classifiers is the most basic and most simplistic version of what in Machine Learning is called "Support Vector Machines."
Now, (1.3.13) is not an LO program: we know how to get rid of the nonlinearities $\max[1 - y^i(z^Tx^i + b), 0]$ by adding slack variables and linear constraints, but we cannot get rid of the nonlinearity brought by the term $\|z\|_2$. Well, there are situations in Machine Learning where it makes sense to get rid of this term by "brute force," specifically, by replacing $\|\cdot\|_2$ with $\|\cdot\|_1$. The rationale behind this "brute force" action is as follows. The dimension $n$ of the feature vectors can be large. In our medical example, it could be in the range of tens, which perhaps is "not large;" but think about digitalized images of handwritten letters, where we want to distinguish between handwritten letters "A" and "B;" here the dimension of $x$ can well be in the range of thousands, if not millions. Now, it would be highly desirable to design a good
classifier with a sparse vector of weights $z$, and there are several reasons for this desire. First, intuition says that a classifier which is good on the training sample and takes into account just 3 of the features should be more "robust" than a classifier which ensures equally good classification of the training examples, but uses for this purpose 10,000 features; we have all reasons to believe that the first classifier indeed "goes to the point," while the second one adjusts itself to random properties of the training sample which are irrelevant for the "true classification." Second, to have a good classifier which uses a small number of features is definitely better than to have an equally good classifier which uses a large number of them (in our medical example: the "predictive power" being equal, we definitely would prefer predicting the diagnosis via the results of 3 tests to predicting via the results of 20 tests). Finally, if it is possible to classify well via a small number of features, we hopefully have good chances to understand the mechanism of the dependencies between these measured features and the feature whose presence/absence we intend to predict; it usually is much easier to understand the interaction between 2-3 features than between 2,000-3,000 of them. Now, the SVMs (1.3.12), (1.3.13) are not well suited for carrying out the outlined feature selection task, since minimizing the $\|z\|_2$ norm under constraints on $z$ (this is what explicitly goes on in (1.3.12) and implicitly goes on in (1.3.13)$^{13}$) typically results in a "spread" optimal solution, with many small nonzero components. In view of our "Compressed Sensing" discussion, we could expect that minimizing the $\ell_1$-norm of $z$ will result in a "better concentrated" optimal solution, which leads us to what is called the "LO Support Vector Machine." Here the classifier is given by the solution of the $\|\cdot\|_1$-analogy of (1.3.13), specifically, the optimization problem
$$\min_{z,b}\left\{\|z\|_1 + \lambda\sum_{i=1}^N \max[1 - y^i(z^Tx^i + b), 0]\right\}. \tag{1.3.14}$$
$^{13}$To understand the latter claim, take an optimal solution $(z_*, b_*)$ to (1.3.13), set $\Lambda = \sum_{i=1}^N \max[1 - y^i(z_*^Tx^i + b_*), 0]$ and note that $(z_*, b_*)$ solves the optimization problem
$$\min_{z,b}\left\{\|z\|_2 : \sum_{i=1}^N \max[1 - y^i(z^Tx^i + b), 0] \le \Lambda\right\}$$
(why?).
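After the standard linearization mentioned above ($z = p - q$ with $p, q \ge 0$, so that $\|z\|_1 \le \mathbf{1}^Tp + \mathbf{1}^Tq$, plus hinge slacks $\xi_i$), (1.3.14) becomes an explicit LP. Here is a sketch assuming SciPy's `linprog`; the data and all names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm(X, y, lam=1.0):
    """LP form of (1.3.14): minimize 1'p + 1'q + lam * 1'xi over [p, q, xi, b]
    subject to -y_i*(x_i'(p - q) + b) - xi_i <= -1, and p, q, xi >= 0, b free."""
    N, n = X.shape
    c = np.concatenate([np.ones(2 * n), lam * np.ones(N), [0.0]])
    Yx = -y[:, None] * X                              # rows: -y_i x_i'
    A_ub = np.hstack([Yx, -Yx, -np.eye(N), -y[:, None]])
    b_ub = -np.ones(N)
    bounds = [(0, None)] * (2 * n + N) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n] - res.x[n:2 * n], res.x[-1]      # (z, b)

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 30))
y = np.sign(X[:, :3] @ np.array([1.0, -2.0, 1.5]) + 0.1 * rng.standard_normal(80))
z, b = l1_svm(X, y, lam=0.5)
print("nonzeros in z:", np.sum(np.abs(z) > 1e-6))     # expect a sparse weight vector
```

On data of this kind the $\ell_1$ objective typically keeps only the few truly relevant features, illustrating the "feature selection" effect discussed above.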
Concluding remarks. A reader could ask what the purpose is of training the classifier on a set of examples whose labels we know from the very beginning, and why a classifier which classifies well on the training set should be good at new examples. Well,
intuition says that if a simple rule with a relatively small number of “tuning parameters” (as it is
the case with a sparse linear classifier) recovers well the labels in examples from a large enough
sample, this classifier should have learned something essential about the dependency between
feature vectors and labels, and thus should be able to classify well new examples. Machine
Learning theory offers a solid probabilistic framework in which “our intuition is right”, so that
under assumptions (not too restrictive) imposed by this framework it is possible to establish
quantitative links between the size of the training sample, the behavior of the classifier on this
sample (quantified by the $\|\cdot\|_2$ or $\|\cdot\|_1$ norm of the resulting $z$ and the value of the penalty
for misclassification), and the predictive power of the classifier, quantified by the probability of
misclassification of a new example; roughly speaking, good behavior of a linear classifier achieved
at a large training sample ensures low probability of misclassifying a new example.
$$\begin{array}{ll}
x_0 = z & \hbox{[initial condition]}\\
x_{t+1} = A_tx_t + B_tu_t + R_td_t & \hbox{[state equations]}\\
y_t = C_tx_t + D_td_t & \hbox{[outputs]}
\end{array} \tag{1.3.16}$$
here the $U_t$ are arbitrary everywhere defined functions of their arguments taking values in $\mathbf{R}^{n_u}$. The plant augmented by the controller is called a closed loop system; its behavior clearly depends on the initial state and the external disturbances only.
Affine control. The simplest (and extremely widely used) form of control law is affine control, where the $u_t$ are affine functions of the outputs:
$$u_t = \xi_t + \sum_{\tau=0}^{t}\Xi_{t\tau}y_\tau. \tag{1.3.18}$$
Design specifications and the Analysis problem. The entities of primary interest in control are the states and the controls; we can arrange states and controls into a long vector, the state-control trajectory
$$w^N = [x_N; x_{N-1}; ...; x_1; u_{N-1}; ...; u_0];$$
here $N$ is the time horizon on which we are interested in the system's behavior. With affine control law (1.3.18), this trajectory is an affine function of $z$ and $d^N$:
$$w^N = w^N(z, d^N) = \omega^N + \Omega^N[z; d^N],$$
with the vector $\omega^N$ and the matrix $\Omega^N$ readily given by the matrices $A_t, ..., D_t$, $0 \le t < N$, from the description of the open loop system and by the collection $\vec\xi^N = \{\xi_t, \Xi_{t\tau} : 0 \le \tau \le t < N\}$ of the parameters of the affine control law (1.3.18).
Imagine that the "desired behaviour" of the closed loop system on the time horizon in question is given by a system of linear inequalities
$$Bw^N \le b \tag{1.3.19}$$
which should be satisfied by the state-control trajectory provided that the initial state $z$ and the disturbances $d^N$ vary in their "normal ranges" $Z$ and $D^N$, respectively; this is a pretty general form of design specifications. The fact that $w^N$ depends affinely on $[z; d^N]$ makes it easy to solve the Analysis problem: to check whether a given control law (1.3.18) ensures the validity of the design specifications (1.3.19). Indeed, to this end we should check whether the functions $[Bw^N - b]_i$, $1 \le i \le I$ ($I$ is the number of linear inequalities in (1.3.19)) remain nonpositive whenever $d^N \in D^N$, $z \in Z$; for a given control law, the functions in question are explicitly given affine functions $\phi_i(z, d^N)$ of $[z; d^N]$, so that what we need to verify is that
$$\max_{[z;d^N]}\left\{\phi_i(z, d^N) : [z; d^N] \in ZD^N := Z\times D^N\right\} \le 0, \quad i = 1, ..., I.$$
Whenever DN and Z are explicitly given convex sets, the latter problems are convex and thus
easy to solve. Moreover, if $ZD^N$ is given by a polyhedral representation
$$ZD^N := D^N\times Z = \left\{[z; d^N] : \exists v : P[z; d^N] + Qv \le r\right\} \tag{1.3.20}$$
(this is a pretty flexible and general enough way to describe typical ranges of disturbances and initial states), the analysis problem reduces to a bunch of explicit LPs
$$\max_{z,d^N,v}\left\{\phi_i([z; d^N]) : P[z; d^N] + Qv \le r\right\}, \quad 1 \le i \le I;$$
the answer in the analysis problem is positive if and only if the optimal values in all these LPs are nonpositive.
Synthesis problem. As we have seen, with affine control and affine design specifications, it is easy to check whether a given control law meets the specifications. The basic problem in linear control is, however, somewhat different: usually we need to build an affine control law which meets the design specifications (or to detect that no such control law exists). And here we run into a difficult problem: while the state-control trajectory for a given affine control law is an easy-to-describe affine function of $[z; d^N]$, its dependence on the collection $\vec\xi^N$ of the parameters of the control law is highly nonlinear, even for a time-invariant system (the matrices $A_t, ..., D_t$ independent of $t$) and a control law as simple as time-invariant linear feedback: $u_t = Ky_t$. Indeed, due to the dynamic nature of the system, powers of the matrix $K$ will be present in the expressions for states and controls. This highly nonlinear dependence of states and controls on the parameters of the control law makes it impossible to optimize efficiently with respect to these parameters and thus makes the synthesis problem extremely difficult.
The situation, however, is far from being hopeless. We are about to demonstrate that one
can re-parameterize affine control law in such a way that with the new parameterization both
Analysis and Synthesis problems become tractable.
$$\begin{array}{ll}
\hbox{System:} & x_0 = z,\quad x_{t+1} = A_tx_t + B_tu_t + R_td_t,\quad y_t = C_tx_t + D_td_t\\
\hbox{Model:} & \widehat{x}_0 = 0,\quad \widehat{x}_{t+1} = A_t\widehat{x}_t + B_tu_t,\quad \widehat{y}_t = C_t\widehat{x}_t\\
\hbox{Controller:} & u_t = U_t(y_0, ..., y_t) \qquad (!)
\end{array} \tag{1.3.21}$$
Assuming that we know the matrices $A_t, ..., D_t$, we can run the model in an on-line fashion, so that at instant $t$, when the control $u_t$ should be specified, we have at our disposal both the actual outputs $y_0, ..., y_t$ and the model outputs $\widehat{y}_0, ..., \widehat{y}_t$, and thus have at our disposal the purified outputs
$$v_\tau = y_\tau - \widehat{y}_\tau, \quad 0 \le \tau \le t.$$
Now let us ask ourselves what will happen if we, instead of building the controls ut on the basis
of actual outputs yτ , 0 ≤ τ ≤ t, pass to controls ut built on the basis of purified outputs vτ ,
0 ≤ τ ≤ t, i.e., replace the control law (!) with control law of the form
ut = Vt (v0 , ..., vt ) (!!)
It is easily seen that nothing will happen:

For every control law $\{U_t(y_0, ..., y_t)\}_{t=0}^\infty$ of the form (!) there exists a control law $\{V_t(v_0, ..., v_t)\}_{t=0}^\infty$ of the form (!!) (and vice versa, for every control law of the form (!!) there exists a control law of the form (!)) such that the dependencies of the actual states, outputs and controls on the disturbances and the initial state for both control laws in question are exactly the same. Moreover, the above "equivalence claim" remains valid when we restrict the controls (!), (!!) to be affine in their arguments.
The bottom line is that every behavior of the closed loop system which can be obtained with affine non-anticipative control law (1.3.18) based on actual outputs can also be obtained with an affine non-anticipative control law
$$u_t = \eta_t + H_0^tv_0 + H_1^tv_1 + ... + H_t^tv_t, \tag{1.3.22}$$
based on purified outputs (and vice versa).
Exercise 1.1 Justify the latter conclusion.
We have said that as far as achievable behaviors of the closed loop system are concerned, we lose (and gain) nothing when passing from affine output-based control laws (1.3.18) to affine purified-output-based control laws (1.3.22). At the same time, when passing to purified outputs, we get a huge bonus:
(#) With control (1.3.22), the trajectory $w^N$ of the closed loop system turns out to be bi-affine: it is affine in $[z; d^N]$, the parameters $\vec\eta^N = \{\eta_t, H_\tau^t, 0 \le \tau \le t \le N\}$ of the control law being fixed, and is affine in the parameters $\vec\eta^N$ of the control law, $[z; d^N]$ being fixed; this bi-affinity, as we shall see in a while, is the key to the efficient solvability of the synthesis problem.
The reason for bi-affinity is as follows (after this reason is explained, verification of bi-affinity itself becomes immediate): the purified output $v_t$ is completely independent of the controls and is a known-in-advance affine function of $d^t$, $z$. Indeed, from (1.3.21) it follows that
$$v_t = C_t\underbrace{(x_t - \widehat{x}_t)}_{\delta_t} + D_td_t.$$
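This independence is easy to verify numerically: run the plant and the model side by side under two different control laws and compare the purified outputs. A sketch under assumed random time-invariant data (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nu, ny, nd, N = 4, 2, 3, 2, 8
A, B = rng.standard_normal((nx, nx)), rng.standard_normal((nx, nu))
R, C, D = (rng.standard_normal((nx, nd)), rng.standard_normal((ny, nx)),
           rng.standard_normal((ny, nd)))
z = rng.standard_normal(nx)                  # initial state
d = rng.standard_normal((N, nd))             # disturbances

def purified_outputs(controller):
    x, xh = z.copy(), np.zeros(nx)           # plant state and model state (x-hat)
    vs = []
    for t in range(N):
        y, yh = C @ x + D @ d[t], C @ xh     # actual and model outputs
        vs.append(y - yh)                    # purified output v_t = y_t - yhat_t
        u = controller(t, y)
        x = A @ x + B @ u + R @ d[t]         # plant: x_{t+1} = A x_t + B u_t + R d_t
        xh = A @ xh + B @ u                  # model: xhat_{t+1} = A xhat_t + B u_t
    return np.array(vs)

v1 = purified_outputs(lambda t, y: np.zeros(nu))              # zero control
v2 = purified_outputs(lambda t, y: rng.standard_normal(nu))   # arbitrary control
print("max |v1 - v2| =", np.abs(v1 - v2).max())  # ~0 up to rounding: v_t ignores u
```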
Tractability of the Synthesis problem. Assume that the normal range ZDN of [z; dN ] is
a nonempty and bounded set given by polyhedral representation (1.3.20), and let us prove that
in this case the design specifications (1.3.19) reduce to a system of explicit linear inequalities in
variables ~η N – the parameters of the purified-output-based affine control law used to close the
open-loop system – and appropriate slack variables. Thus, (parameters of) purified-output-based
affine control laws meeting the design specifications (1.3.19) form a polyhedrally representable,
and thus easy to work with, set.
The reasoning goes as follows. As stated by (#), the state-control trajectory $w^N$ associated with affine purified-output-based control with parameters $\vec\eta^N$ is bi-affine in $d^N$, $z$ and in $\vec\eta^N$, so that it can be represented in the form
$$w^N = w[\vec\eta^N] + W[\vec\eta^N]\,[z; d^N],$$
where the vector-valued and the matrix-valued functions $w[\cdot]$, $W[\cdot]$ are affine and are readily given by the matrices $A_t, ..., D_t$, $0 \le t \le N-1$. Plugging this representation of $w^N$ into the design specifications (1.3.19), we get a system of scalar constraints of the form
$$\alpha_i^T(\vec\eta^N)[z; d^N] \le \beta_i(\vec\eta^N), \qquad (*)$$
where the vector-valued functions $\alpha_i(\cdot)$ and the scalar functions $\beta_i(\cdot)$ are affine and readily given by the description of the open-loop system and by the data $B$, $b$ in the design specifications.
What we want from $\vec\eta^N$ is to ensure the validity of every one of the constraints (∗) for all $[z; d^N]$ from $ZD^N$, or, which is the same in view of (1.3.20), we want the optimal values in the LPs
$$\max_{[z;d^N],v}\left\{\alpha_i^T(\vec\eta^N)[z; d^N] : P[z; d^N] + Qv \le r\right\}$$
to be $\le \beta_i(\vec\eta^N)$ for $1 \le i \le I$. Now, the LPs in question are feasible; passing to their duals, what we want becomes exactly the relations
$$\min_{s_i}\left\{r^Ts_i : P^Ts_i = \alpha_i(\vec\eta^N),\ Q^Ts_i = 0,\ s_i \ge 0\right\} \le \beta_i(\vec\eta^N), \quad 1 \le i \le I.$$
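The mechanism at work here, namely that the maximum of a linear form over a polyhedrally represented set coincides with the optimal value of the dual LP, can be checked numerically. Here is a sketch assuming SciPy's `linprog` (data random and ours; the number of inequality rows is taken large enough to keep the feasible set bounded with overwhelming probability):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
nu_, nv, k = 4, 2, 30
P, Q = rng.standard_normal((k, nu_)), rng.standard_normal((k, nv))
# build r so that the set {(u, v): P u + Q v <= r} has a strictly feasible point
r = P @ rng.standard_normal(nu_) + Q @ rng.standard_normal(nv) + rng.uniform(0.5, 1.5, k)
alpha = rng.standard_normal(nu_)

# primal: max alpha'u over {(u, v): P u + Q v <= r} (linprog minimizes, so negate)
prim = linprog(np.concatenate([-alpha, np.zeros(nv)]),
               A_ub=np.hstack([P, Q]), b_ub=r,
               bounds=(None, None), method="highs")

# dual: min r's over {s >= 0 : P's = alpha, Q's = 0}
dual = linprog(r, A_eq=np.vstack([P.T, Q.T]),
               b_eq=np.concatenate([alpha, np.zeros(nv)]),
               bounds=(0, None), method="highs")

print("primal max:", -prim.fun, " dual min:", dual.fun)  # equal, by LP duality
```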
Remark 1.3.1 We can say that when passing from affine output-based control laws to affine purified-output-based ones we are all the time dealing with the same entities (affine non-anticipative control laws), but switch from one parameterization of these laws (the one by the $\vec\xi^N$-parameters) to another parameterization (the one by the $\vec\eta^N$-parameters). This re-parameterization is nonlinear, so that in principle there is no surprise that what is difficult in one of them (the Synthesis problem) is easy in the other one. This being said, note that with our re-parameterization we neither lose nor gain only as far as the entire family of linear controllers is concerned. Specific sub-families of controllers can be "simply-looking" in one of the parameterizations and extremely difficult to describe in the other one. For example, time-invariant linear feedbacks $u_t = Ky_t$ look pretty simple (just a linear subspace) in the $\vec\xi^N$-parameterization and form a highly nonlinear manifold in the $\vec\eta^N$-one. Similarly, the a.p.o.b. control $u_t = Kv_t$ looks simple in the $\vec\eta^N$-parameterization and is difficult to describe in the $\vec\xi^N$-one. We can say that there is no such thing as "the best" parameterization of affine control laws; everything depends on what our goals are. For example, the $\vec\eta^N$-parameterization is well suited for synthesis of general-type affine controllers and becomes nearly useless when a linear feedback is sought.
Modifications. We have seen that if the "normal range" $ZD^N$ of $[z; d^N]$ is given by a polyhedral representation (which we assume from now on), the parameters of "good" (i.e., meeting the design specifications (1.3.19) for all $[z; d^N] \in ZD^N$) affine purified-output-based (a.p.o.b.) control laws admit an explicit polyhedral representation: these are exactly the collections $\vec\eta^N$ which can be extended, by properly selected slacks $s_1, ..., s_I$, to a solution of the explicit system of linear inequalities (1.3.23). As a result, various synthesis problems for a.p.o.b. controllers become just explicit LPs. This is the case, e.g., for the problem of checking whether the design specifications can be met (this is just an LP feasibility problem: checking the solvability of (1.3.23)). If this is the case, we can further minimize efficiently a "good" (i.e., convex and efficiently computable) objective over the a.p.o.b. controllers meeting the design specifications; when the objective is just linear, this will be just an LP program.
In fact, our approach allows for solving efficiently other meaningful problems of a.p.o.b.
controller design, e.g., as follows. Above, we have assumed that the range of disturbances
and initial states is known, and we want to achieve “good behavior” of the closed loop sys-
tem in this range; what happens when the disturbances and the initial states go beyond their
normal range, is not our responsibility. In some cases, it makes sense to take care of the
disturbances/initial states beyond their normal range, specifically, to require the validity of de-
sign specifications (1.3.19) in the normal range of disturbances/initial states, and to allow for
controlled deterioration of these specifications when the disturbances/initial states run beyond
their normal range. This can be modeled as follows: Let us fix a norm $\|\cdot\|$ on the linear space $L^N = \{\zeta = [z; d^N] \in \mathbf{R}^{n_x}\times\mathbf{R}^{n_d}\times\cdots\times\mathbf{R}^{n_d}\}$ where the disturbances and the initial states live, and let
$$\mathrm{dist}([z; d^N], ZD^N) = \min_{\zeta\in ZD^N}\|[z; d^N] - \zeta\|$$
be the deviation of $[z; d^N]$ from the normal range of this vector. A natural extension of our previous design specifications is to require the validity of the relations
$$[Bw^N(z, d^N) - b]_i \le \alpha_i\,\mathrm{dist}([z; d^N], ZD^N), \quad 1 \le i \le I, \tag{1.3.24}$$
for all $[z; d^N] \in L^N$; here, as above, $w^N(d^N, z)$ is the state-control trajectory of the closed loop system on time horizon $N$ treated as a function of $[z; d^N]$, and $\alpha = [\alpha_1; ...; \alpha_I]$ is a nonnegative vector of sensitivities, which for the time being we consider as given to us in advance. In other words, we still want the validity of (1.3.19) in the normal range of disturbances and initial states and, on top of it, impose additional restrictions on the global behaviour of the closed loop system; specifically, we want the violations of the design specifications for $[z; d^N]$ outside of its normal range to be bounded by given multiples of the distance from $[z; d^N]$ to this range.
As before, we assume that the normal range $ZD^N$ of $[z; d^N]$ is a nonempty and bounded set given by the polyhedral representation (1.3.20). We are about to demonstrate that in this case the specifications (1.3.24) still are tractable (and even reduce to LP when $\|\cdot\|$ is a simple enough norm). Indeed, let us look at the $i$-th constraint in (1.3.24). Due to the bi-affinity of $w^N$ in $[z; d^N]$ and in the parameters $\vec\eta^N$ of the a.p.o.b. control law in question, this constraint can be written as
$$\alpha_i^T(\vec\eta^N)[z; d^N] \le \beta_i(\vec\eta^N) + \alpha_i\,\mathrm{dist}([z; d^N], ZD^N) \quad \forall [z; d^N] \in L^N, \tag{1.3.25}$$
where $\alpha_i(\cdot)$, $\beta_i(\cdot)$ are affine functions. We claim that

(&) Constraint (1.3.25) is satisfied if and only if both (a) $\alpha_i^T(\vec\eta^N)[z; d^N] \le \beta_i(\vec\eta^N)$ for all $[z; d^N] \in ZD^N$, and (b) $f_i(\vec\eta^N) := \max\{\alpha_i^T(\vec\eta^N)\zeta : \zeta \in L^N, \|\zeta\| \le 1\} \le \alpha_i$.

Postponing for the moment the verification of this claim, let us look at the consequences. We already know that (a) "admits a polyhedral representation:" we can point out a system $S_i$ of linear inequalities on $\vec\eta^N$ and a slack vector $s_i$ such that (a) takes place if and only if $\vec\eta^N$ can be extended, by a properly chosen $s_i$, to a solution of $S_i$. Further, the function $f_i(\cdot)$ clearly is convex (since $\alpha_i(\cdot)$ is affine) and efficiently computable, provided that the norm $\|\cdot\|$ is efficiently computable (indeed, in this case computing $f_i$ at a point reduces to maximizing a linear form over a convex set). Thus, not only (a), but (b) as well is a tractable constraint on $\vec\eta^N$.
Moreover, if the norm $\|\cdot\|$ is polyhedral, meaning that its unit ball admits a polyhedral representation $\{\zeta : \exists y : U\zeta + Yy \le q\}$, we have
$$f_i(\vec\eta^N) = \max_{[z;d^N],y}\left\{\alpha_i^T(\vec\eta^N)[z; d^N] : U[z; d^N] + Yy \le q\right\}
= \min_{t_i}\left\{q^Tt_i : U^Tt_i = \alpha_i(\vec\eta^N),\ Y^Tt_i = 0,\ t_i \ge 0\right\}$$
[by the LP Duality Theorem].

Thus, for a polyhedral norm $\|\cdot\|$, (a), (b) can be expressed by a system of linear inequalities in the parameters $\vec\eta^N$ of the a.p.o.b. control law we are interested in and in slack variables $s_i$, $t_i$.
Remark 1.3.2 For the time being, we treated the sensitivities $\alpha_i$ as given parameters. Our analysis, however, shows that in the "tractable reformulation" of (1.3.25) the $\alpha_i$ are just the right hand sides of some inequality constraints, meaning that the nice structure of these constraints (convexity/linearity) is preserved when treating the $\alpha_i$ as variables (restricted to be $\ge 0$) rather than given parameters. As a result, we can solve efficiently the problem of synthesizing an a.p.o.b. controller which ensures (1.3.25) and minimizes one of the $\alpha_i$'s (or some weighted sum of the $\alpha_i$'s) under additional, say, linear, constraints on the $\alpha_i$'s, and there are situations when such a goal makes perfect sense.
Clearing debts. It remains to prove (&), which is easy. Leaving the rest of the justification of (&) to the reader, let us focus on the only not completely evident part of the claim, namely, that (1.3.25) implies (b). The reasoning is as follows: take a vector $[\bar z; \bar d^N] \in L^N$ with $\|[\bar z; \bar d^N]\| \le 1$ and a vector $[\widehat z; \widehat d^N] \in ZD^N$, and let $t > 0$. (1.3.25) should hold for $[z; d^N] = [\widehat z; \widehat d^N] + t[\bar z; \bar d^N]$, meaning that
$$\alpha_i^T(\vec\eta^N)\left([\widehat z; \widehat d^N] + t[\bar z; \bar d^N]\right) \le \beta_i(\vec\eta^N) + \alpha_i\underbrace{\mathrm{dist}([z; d^N], ZD^N)}_{\le t\|[\bar z;\bar d^N]\|\le t}.$$
Dividing both sides by $t$ and passing to the limit as $t \to \infty$, we get $\alpha_i^T(\vec\eta^N)[\bar z; \bar d^N] \le \alpha_i$; and this is so for every $[\bar z; \bar d^N]$ of $\|\cdot\|$-norm not exceeding 1, which is exactly (b).
Operational Exercise 1.3.2 The goal of this exercise is to play with the outlined synthesis of
linear controllers via LP.
1. Situation and notation. From now on, we consider discrete time linear dynamic system
(1.3.16) which is time-invariant, meaning that the matrices At , ..., Dt are independent of t
(so that we write A, ..., D instead of At , ..., Dt ). As always, N is the time horizon.
Let us start with convenient notation. An affine p.o.b. control law $\vec\eta^N$ on time horizon $N$ can be represented by a pair comprised of the block-vector $\vec h = [h_0; h_1; ...; h_{N-1}]$ and the block-lower-triangular matrix
$$H = \left[\begin{array}{cccc}
H_0^0 & & &\\
H_0^1 & H_1^1 & &\\
H_0^2 & H_1^2 & H_2^2 &\\
\vdots & \vdots & & \ddots\\
H_0^{N-1} & H_1^{N-1} & \cdots & H_{N-1}^{N-1}
\end{array}\right]$$
with $n_u\times n_y$ blocks $H_\tau^t$ (from now on, when drawing a matrix, blanks represent zero entries/blocks). Further, we can stack other entities of interest into long vectors, specifically,
$$w^N := \left[\begin{array}{c}\vec x\\ \vec u\end{array}\right],\quad \vec x = [x_1; ...; x_N],\ \vec u = [u_0; ...; u_{N-1}] \qquad \hbox{[state-control trajectory]}$$
$$\vec v = [v_0; ...; v_{N-1}] \qquad \hbox{[purified outputs]}$$
In this notation, the dependencies between $[z; d^N]$, $\vec\eta^N = (\vec h, H)$, $w^N$ and $\vec v$ are given by
$$\begin{array}{l}
\vec v = \zeta2v\cdot[z; d^N],\\[4pt]
w^N := \left[\begin{array}{l}
\vec x = [u2x\cdot H\cdot \zeta2v + \zeta2x]\cdot[z; d^N] + u2x\cdot\vec h\\
\vec u = H\cdot \zeta2v\cdot[z; d^N] + \vec h
\end{array}\right.
\end{array} \tag{1.3.26}$$
2. Task 1:
(a) Verify (1.3.26).
(b) Verify that the system closed by an a.p.o.b. control law ~η N = (~h, H) satisfies the
specifications
BwN ≤ b
whenever [z; dN ] belongs to a nonempty set given by polyhedral representation
{[z; dN ] : ∃u : P [z; dN ] + Qu ≤ r}
if and only if there exists an entrywise nonnegative matrix $\Lambda$ such that
$$\Lambda P = B\cdot\left[\begin{array}{c}u2x\cdot H\cdot\zeta2v + \zeta2x\\ H\cdot\zeta2v\end{array}\right],\qquad \Lambda Q = 0,\qquad \Lambda r + B\cdot\left[\begin{array}{c}u2x\cdot\vec h\\ \vec h\end{array}\right] \le b.$$
(c) Assume that at time $t = 0$ the system is at a given state, and what we want is to have it at time $N$ in another given state. Due to linearity, we lose nothing when assuming that the initial state is 0 and the target state is a given vector $x_*$. The design specifications are as follows:
• If there are no disturbances (i.e., the initial state is exactly 0, and all the $d$'s are zeros), we want the trajectory (let us call it the nominal one) $w^N = w^N_{\rm nom}$ to have $x_N$ exactly equal to $x_*$. In other words, we say that the nominal range of $[z; d^N]$ is just the origin, and when $[z; d^N]$ is in this range, the state at time $N$ should be $x_*$.
• When $[z; d^N]$ deviates from 0, the trajectory $w^N$ will deviate from the nominal trajectory $w^N_{\rm nom}$, and we want the scaled uniform norm $\|{\rm Diag}\{\beta\}(w^N - w^N_{\rm nom})\|_\infty$ not to exceed a multiple, with the smallest possible factor, of the scaled uniform norm $\|{\rm Diag}\{\gamma\}[z; d^N]\|_\infty$ of the deviation of $[z; d^N]$ from its normal range (the origin). Here $\beta$ and $\gamma$ are positive weight vectors of appropriate sizes.
Prove that the outlined design specifications reduce to the following LP in the variables $\vec\eta^N = (\vec h, H)$ and an additional matrix variable $\Lambda$ of appropriate size:
$$\min_{\vec\eta^N=(\vec h,H),\,\Lambda,\,\alpha}\left\{\alpha:\ \begin{array}{ll}
(a) & H \hbox{ is block lower triangular}\\
(b) & \hbox{the last } n_x \hbox{ entries in } u2x\cdot\vec h \hbox{ form the vector } x_*\\
(c) & \Lambda_{ij}\ge 0\ \forall i,j\\
(d) & \Lambda\left[\begin{array}{r}{\rm Diag}\{\gamma\}\\ -{\rm Diag}\{\gamma\}\end{array}\right] = \left[\begin{array}{r}u2x\cdot H\cdot\zeta2v+\zeta2x\\ -u2x\cdot H\cdot\zeta2v-\zeta2x\end{array}\right]\\
(e) & \Lambda\cdot{\bf 1}\le\alpha\left[\begin{array}{c}\beta\\ \beta\end{array}\right]
\end{array}\right\} \tag{1.3.27}$$
Boeing 747 is flying horizontally along the X-axis in the XZ plane with the
velocity 774 ft/sec at the altitude 40000 ft and has 100 sec to change the altitude
to 41000 ft. You should plan this maneuver.
The details are as follows.$^{14}$ First, you should keep in mind that in the description to follow the state variables are deviations of the "physical" state variables (like velocity, altitude, etc.) from their values at the "physical" state of the aircraft at the initial time instant, and not the physical states themselves; similarly for the controls: they are deviations of the actual controls from some "background" controls. The linear dynamics approximation is accurate enough, provided that these deviations are not too large. This linear dynamics is as follows:
• State variables: ∆h: deviation in altitude, ft; ∆u: deviation in velocity along the
aircraft’s X-axis; ∆v: deviation in velocity orthogonal to the aircraft’s axis, ft/sec
(positive is down); ∆θ: deviation in angle θ between the aircraft’s axis and the X-axis
(positive is up), crad (hundredths of radian); q: angular velocity of the aircraft (pitch
rate), crad/sec.
• Disturbances: uw : wind velocity along the aircraft’s axis, ft/sec; vw : wind velocity
orthogonal to the aircraft’s axis, ft/sec. We assume that there is no side wind.
Assuming that $\Delta\theta$ stays small, we will not distinguish between $\Delta u$ ($\Delta v$) and the deviations of the aircraft's velocities along the X- (respectively, Z-) axes, and similarly for $u_w$ and $v_w$.
• Controls: δe : elevator angle deviation (positive is down); δt : thrust deviation.
• Outputs: velocity deviation ∆u and climb rate ḣ = −∆v + 7.74θ.
$^{14}$Up to notation, the data below are taken from Lecture 14 of Prof. Stephen Boyd's EE263 Lecture Notes, see http://www.stanford.edu/~boyd/ee263/lectures/aircraft.pdf
• Linearized dynamics:
$$\frac{d}{ds}\underbrace{\left[\begin{array}{c}\Delta h\\ \Delta u\\ \Delta v\\ q\\ \Delta\theta\end{array}\right]}_{\chi(s)} = \underbrace{\left[\begin{array}{ccccc}
0 & 0 & -1 & 0 & 0\\
0 & -.003 & .039 & 0 & -.322\\
0 & -.065 & -.319 & 7.74 & 0\\
0 & .020 & -.101 & -.429 & 0\\
0 & 0 & 0 & 1 & 0
\end{array}\right]}_{A}\chi(s) + \underbrace{\left[\begin{array}{cc}
0 & 0\\ .01 & 1\\ -.18 & -.04\\ -1.16 & .598\\ 0 & 0
\end{array}\right]}_{B}\underbrace{\left[\begin{array}{c}\delta_e\\ \delta_t\end{array}\right]}_{\upsilon(s)} + \underbrace{\left[\begin{array}{cc}
0 & -1\\ .003 & -.039\\ .065 & .319\\ -.020 & .101\\ 0 & 0
\end{array}\right]}_{R}\underbrace{\left[\begin{array}{c}u_w\\ v_w\end{array}\right]}_{\delta(s)}$$
$$y = \underbrace{\left[\begin{array}{ccccc}0 & 1 & 0 & 0 & 0\\ 0 & 0 & -1 & 0 & 7.74\end{array}\right]}_{C}\left[\begin{array}{c}\Delta h\\ \Delta u\\ \Delta v\\ q\\ \Delta\theta\end{array}\right]$$
• From continuous to discrete time: Note that our system is a continuous time one; we denote this time by $s$ and measure it in seconds (this is why in the Boeing state dynamics the derivative with respect to time is denoted by $\frac{d}{ds}$, Boeing's state at a (continuous time) instant $s$ is denoted by $\chi(s)$, etc.). Our current goal is to approximate this system by a discrete time one, in order to use our machinery for controller
mate this system by a discrete time one, in order to use our machinery for controller
design. To this end, let us choose a “time resolution” ∆s > 0, say, ∆s = 10 sec, and
look at the physical system at continuous time instants t∆s, t = 0, 1, ..., so that the
state xt of the system at discrete time instant t is χ(t∆s), and similarly for controls,
outputs, disturbances, etc. With this approach, 100 sec-long maneuver corresponds
to the discrete time horizon N = 10.
We now should decide how to approximate the actual – continuous time – system with
a discrete time one. The simplest way to do so would be just to replace differential
equations with their finite-difference approximations:
We can, however, do better than this. Let us associate with discrete time instant t
the continuous time interval [t∆s, (t + 1)∆s), and assume (this is a quite reasonable
assumption) that the continuous time controls are kept constant on these continuous
time intervals:
t∆s ≤ s < (t + 1)∆s ⇒ υ(s) ≡ ut ; (∗)
ut will be our discrete time controls. Let us now define the matrices of the discrete
time dynamics in such a way that with our piecewise constant continuous time con-
trols, the discrete time states xt of the discrete time system will be exactly the same
as the states χ(t∆s) of the continuous time system. To this end let us note that with
the piecewise constant continuous time controls (∗), we have exact equations
$$\chi((t+1)\Delta s) = \left[\exp\{\Delta sA\}\right]\chi(t\Delta s) + \left[\int_0^{\Delta s}\exp\{rA\}B\,dr\right]u_t + \left[\int_0^{\Delta s}\exp\{rA\}R\,\delta((t+1)\Delta s - r)\,dr\right],$$
where the exponent of a (square) matrix can be defined by either one of the straightforward matrix analogies of the (equivalent to each other) definitions of the univariate exponent:
$$\exp\{B\} = \lim_{n\to\infty}(I + n^{-1}B)^n, \quad\hbox{or}\quad \exp\{B\} = \sum_{n=0}^\infty \frac{1}{n!}B^n.$$
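Numerically, $A_d = \exp\{\Delta s\,A\}$ and $B_d = \int_0^{\Delta s}\exp\{rA\}B\,dr$ can be obtained in one shot from a single exponential of an augmented block matrix (a standard identity for block-triangular exponentials). A sketch assuming SciPy, with the Boeing matrices retyped from the display above:

```python
import numpy as np
from scipy.linalg import expm

# continuous-time Boeing data (state order: Dh, Du, Dv, q, Dtheta)
A = np.array([[0, 0,     -1,     0,      0],
              [0, -.003,  .039,  0,     -.322],
              [0, -.065, -.319,  7.74,   0],
              [0,  .020, -.101, -.429,   0],
              [0,  0,      0,    1,      0]])
B = np.array([[0, 0], [.01, 1], [-.18, -.04], [-1.16, .598], [0, 0]])
ds = 10.0   # time resolution Delta-s, sec

# expm([[A, B], [0, 0]] * ds) = [[A_d, B_d], [0, I]], where
# A_d = exp{ds A} and B_d = int_0^ds exp{r A} B dr
n, m = A.shape[0], B.shape[1]
M = np.zeros((n + m, n + m))
M[:n, :n], M[:n, n:] = A, B
E = expm(M * ds)
A_d, B_d = E[:n, :n], E[:n, n:]
print(np.round(A_d, 3))
```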
If we now assume (this again makes complete sense) that continuous time outputs
are measured at continuous time instants t∆s, t = 0, 1, ..., we can augment the above
discrete time dynamics with description of discrete time outputs:
yt = Cxt .
Now imagine that the decisions on ut are made as follows: at “continuous time”
instant t∆s, after yt is measured, we immediately specify ut as an affine function of
y1 , ..., yt and keep the controls in the “physical time” interval [t∆s, (t + 1)∆s) at the
level ut . Now both the dynamics and the candidate affine controllers in continuous
time fit their discrete time counterparts, and we can pass from affine output based
controllers to affine purified output based ones, and utilize the outlined approach
for synthesis of a.p.o.b. controllers to build efficiently a controller for the “true”
continuous time system. Note that with this approach we need not bother much about how small $\Delta s$ is, since our discrete time approximation reproduces exactly the behaviour of the continuous time system along the discrete grid $t\Delta s$, $t = 0, 1, ...$. The only restriction on $\Delta s$ comes from the desire to make the design specifications, expressed in terms of the discrete time trajectory, meaningful in terms of the actual continuous time behavior, which usually is not too difficult.
The only difficulty which still remains unresolved is how to translate our a priori information on the continuous time disturbances $\delta(s)$ into information on their discrete time counterparts $e_t$. For example, when assuming that the normal range of the continuous time disturbances $\delta(\cdot)$ is given by an upper bound $\|\delta(s)\| \le \rho$ $\forall s$ on a given norm $\|\cdot\|$ of the disturbance, the "translated" information on $e_t$ is $e_t \in \rho E$, where $E = \left\{e = \int_0^{\Delta s}\exp\{rA\}R\,\delta(r)dr : \|\delta(r)\| \le 1,\ 0 \le r \le \Delta s\right\}$.$^{15}$ We then can set $d_t = e_t$, $R = I_{n_x}$ and say that the normal range of the discrete time disturbances $d_t$ is $d_t \in E$ for all $t$. A disadvantage of this approach is that for our purposes we need $E$ to admit a simple representation, which is not necessarily the case with $E$ "as it is." Here is a (slightly conservative) way to resolve this difficulty. Assume that
$^{15}$Note that this "translation" usually makes the $e_t$ vectors of the same dimension as the state vectors $x_t$, even when the dimension of the "physical disturbances" $\delta(t)$ is less than the dimension of the state vectors. This is natural: continuous time dynamics usually "spreads" the potential contribution, accumulated over a continuous time interval $[t\Delta s, (t+1)\Delta s)$, of low-dimensional time-varying disturbances $\delta(s)$ to the state $\chi((t+1)\Delta s)$ over a full-dimensional domain in the state space.
the normal range of the continuous time disturbance is given as $\|\delta(s)\|_\infty \le \rho$ $\forall s$. Let ${\rm abs}[M]$ be the matrix comprised of the moduli of the entries of a matrix $M$. When $\|\delta(s)\|_\infty \le 1$ for all $s$, we clearly have
$${\rm abs}\left[\int_0^{\Delta s}\exp\{rA\}R\,\delta(r)dr\right] \le r := \int_0^{\Delta s}{\rm abs}[\exp\{rA\}R]\cdot{\bf 1}\,dr,$$
meaning that $E \subset \{{\rm Diag}\{r\}d : \|d\|_\infty \le 1\}$. In other words, we can only increase the set of allowed disturbances $e_t$ by assuming that they are of the form $e_t = Rd_t$ with $R = {\rm Diag}\{r\}$ and $\|d_t\|_\infty \le \rho$. As a result, every controller for the discrete time system
$$x_{t+1} = Ax_t + Bu_t + Rd_t,\qquad y_t = Cx_t$$
with the just defined $A$, $B$, $R$, $C$ which meets the desired design specifications, the normal range of $d^N$ being $\{\|d_t\|_\infty \le \rho\ \forall t\}$, definitely meets the design specifications for the system ($), the normal range of the disturbances $e_t$ being $e_t \in \rho E$ for all $t$.
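The vector $r$ from the display above is a one-dimensional integral of an entrywise nonnegative matrix function and is easy to approximate, e.g., by the trapezoid rule. A sketch assuming SciPy (the Boeing $A$ and $R$ are retyped from the display above; the grid size `K` is our choice):

```python
import numpy as np
from scipy.linalg import expm

def disturbance_radii(A, R, ds, K=200):
    """Approximate r = int_0^ds abs(exp{sA} R) * 1 ds by the trapezoid rule."""
    grid = np.linspace(0.0, ds, K + 1)
    vals = np.array([np.abs(expm(s * A) @ R).sum(axis=1) for s in grid])
    h = ds / K
    return h * (0.5 * vals[0] + vals[1:-1].sum(axis=0) + 0.5 * vals[-1])

A = np.array([[0, 0, -1, 0, 0], [0, -.003, .039, 0, -.322],
              [0, -.065, -.319, 7.74, 0], [0, .020, -.101, -.429, 0],
              [0, 0, 0, 1, 0]])
R = np.array([[0, -1], [.003, -.039], [.065, .319], [-.020, .101], [0, 0]])
print(disturbance_radii(A, R, ds=10.0))  # radii of the box {Diag{r} d : ||d||_inf <= 1}
```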
(a) Apply the outlined strategy to the Boeing maneuver example. The still missing components of the setup are as follows:
(774 ft/sec ≈ 528 mph is the aircraft speed w.r.t. the air, 73.3 ft/sec = 50 mph is the steady-state tail wind).
• Specify the a.p.o.b. control law as given by an optimal solution to (1.3.27). What is the optimal value $\alpha_*$ in the problem?
• With the resulting control law, specify the nominal state-control trajectories of the
discrete and the continuous time models of the maneuver (i.e., zero initial state, no
wind); plot the corresponding XZ trajectory of the aircraft.
• Run 20 simulations of the discrete time and the continuous time models of the ma-
neuver when the initial state and the disturbances are randomly selected vectors
of uniform norm not exceeding ρ = 1 (which roughly corresponds to up to 1 mph
randomly oriented in the XZ plane deviations of the actual wind velocity from the
steady-state tail 50 mph one) and plot the resulting bunch of the aircraft’s XZ tra-
jectories. Check that the uniform deviation of the state-control trajectories from the nominal one at all times does not exceed $\alpha_*\rho$ (as it should be, due to the origin of $\alpha_*$). What is the mean, over the observed trajectories, ratio of the deviation to $\alpha_*\rho$?
• Repeat the latter experiment with ρ increased to 10 (up to 10 mph deviations in the
wind velocity).
[Figure: bunches of 20 XZ trajectories for ρ = 1 (left) and ρ = 10 (right); X-axis: mi; Z-axis: ft, deviations from 40000 ft initial altitude]
$$\begin{array}{c|cc|cc}
& \multicolumn{2}{c|}{\rho = 1} & \multicolumn{2}{c}{\rho = 10}\\
& \hbox{mean} & \hbox{max} & \hbox{mean} & \hbox{max}\\\hline
\|w^N - w^N_{\rm nom}\|_\infty/(\rho\alpha_*) & 14.2\% & 22.8\% & 12.8\% & 36.3\%
\end{array}$$
over 20 simulations, $\alpha_* = 57.18$
to its nonlinear extensions, we should expect to encounter some nonlinear components in the problem. The traditional way here is to say: "Well, in (LP) there are a linear objective function $f(x) = c^Tx$ and inequality constraints $f_i(x) \ge b_i$ with linear functions $f_i(x) = a_i^Tx$, $i = 1, ..., m$. Let us allow some/all of these functions $f, f_1, ..., f_m$ to be nonlinear." In contrast to this traditional way, we intend to keep the objective and the constraints linear, but introduce "nonlinearity" in the inequality sign $\ge$.
In the latter relation, we again meet the inequality sign $\ge$, but now it stands for the "arithmetic $\ge$" – a well-known relation between real numbers. The above "coordinate-wise" partial ordering of vectors in $\mathbf{R}^m$ satisfies a number of basic properties of the standard ordering of reals; namely, for all vectors $a, b, c, d, ... \in \mathbf{R}^m$ one has
1. Reflexivity: $a \ge a$;
2. Anti-symmetry: if both $a \ge b$ and $b \ge a$, then $a = b$;
3. Transitivity: if both $a \ge b$ and $b \ge c$, then $a \ge c$;
4. Compatibility with linear operations: (a) Homogeneity: if $a \ge b$ and $\lambda$ is a nonnegative real, then $\lambda a \ge \lambda b$; (b) Additivity: if both $a \ge b$ and $c \ge d$, then $a + c \ge b + d$.
• The standard inequality ” ≥ ” is neither the only possible, nor the only interesting way to
define the notion of a vector inequality fitting the axioms 1. – 4.
As a result,
A generic optimization problem which looks exactly like (LP), up to the fact that the inequality $\ge$ in (LP) is now replaced with an ordering which differs from the component-wise one, inherits a significant part of the properties of LP problems. Specifying the ordering of vectors properly, one can obtain from (LP) generic optimization problems covering many important applications which cannot be treated by standard LP.
So far, what was said is just a declaration. Let us look at how this declaration comes to life.
We start with clarifying the "geometry" of a "vector inequality" satisfying the axioms 1. – 4. Thus, we consider vectors from a finite-dimensional Euclidean space $E$ with an inner product $\langle\cdot,\cdot\rangle$ and assume that $E$ is equipped with a partial ordering (also called a vector inequality), denoted by $\succeq$: in other words, we specify which pairs of vectors $a, b$ from $E$ are linked by the inequality $a \succeq b$. We call the ordering "good" if it obeys the axioms 1. – 4., and we are interested in understanding what these good orderings are.
Our first observation is:
(A) A good vector inequality $\succeq$ is completely identified by the set $K = \{a \in E : a \succeq 0\}$ of $\succeq$-nonnegative vectors: namely, $a \succeq b$ if and only if $a - b \succeq 0$, i.e., if and only if $a - b \in K$.
Indeed, let $a \succeq b$. By 1. we have $-b \succeq -b$, and by 4.(b) we may add the latter inequality to the former one to get $a - b \succeq 0$. Vice versa, if $a - b \succeq 0$, then, adding to this inequality the one $b \succeq b$, we get $a \succeq b$.
The set $K$ in Observation A cannot be arbitrary. It is easy to verify that it must be a pointed cone, i.e., it must satisfy the following conditions:
1. $K$ is nonempty and closed under addition:
$$a, a' \in K \Rightarrow a + a' \in K;$$
2. $K$ is a conic set:
$$a \in K,\ \lambda \ge 0 \Rightarrow \lambda a \in K.$$
3. $K$ is pointed:
$$a \in K \hbox{ and } -a \in K \Rightarrow a = 0.$$
Geometrically: $K$ does not contain straight lines passing through the origin.
Exercise 1.3 Prove that the outlined properties of $K$ are necessary and sufficient for the vector inequality $a \succeq b \Leftrightarrow a - b \in K$ to be good.
Thus, every pointed cone K in E induces a partial ordering on E which satisfies the axioms
1. – 4. We denote this ordering by ≥K :
a ≥K b ⇔ a − b ≥K 0 ⇔ a − b ∈ K.
What is the cone responsible for the standard coordinate-wise ordering $\ge$ on $E = \mathbf{R}^m$ we have started with? The answer is clear: this is the cone comprised of vectors with nonnegative entries – the nonnegative orthant
$$\mathbf{R}^m_+ = \{x = (x_1, ..., x_m)^T \in \mathbf{R}^m : x_i \ge 0,\ i = 1, ..., m\}.$$
(Thus, in order to express the fact that a vector $a$ is greater than or equal, in the component-wise sense, to a vector $b$, we were supposed to write $a \ge_{\mathbf{R}^m_+} b$. However, we are not going to be that formal and shall use the standard shorthand notation $a \ge b$.)
The nonnegative orthant $\mathbf{R}^m_+$ is not just a pointed cone; it possesses two useful additional properties:
I. The cone is closed: if a sequence of vectors ai from the cone has a limit, the latter also
belongs to the cone.
II. The cone possesses a nonempty interior: there exists a vector such that a ball of positive
radius centered at the vector is contained in the cone.
These additional properties are very important. For example, I is responsible for the possibility to pass to the term-wise limit in an inequality:
$$a^i \ge b^i\ \forall i,\quad a^i \to a,\ b^i \to b \hbox{ as } i \to\infty \quad\Rightarrow\quad a \ge b.$$
It makes sense to restrict ourselves to good partial orderings coming from cones $K$ sharing the properties I, II. Thus,
From now on, speaking about vector inequalities ≥K , we always assume that the
underlying set K is a pointed and closed cone with a nonempty interior.
Note that the closedness of $K$ makes it possible to pass to limits in $\ge_K$-inequalities:
$$a^i \ge_K b^i,\quad a^i \to a,\ b^i \to b \hbox{ as } i \to\infty \quad\Rightarrow\quad a \ge_K b.$$
The nonemptiness of the interior of $K$ allows us to define, along with the "non-strict" inequality $a \ge_K b$, the strict inequality according to the rule
$$a >_K b \Leftrightarrow a - b \in {\rm int}\,K,$$
where ${\rm int}\,K$ is the interior of the cone $K$. E.g., the strict coordinate-wise inequality $a >_{\mathbf{R}^m_+} b$ (shorthand: $a > b$) simply says that the coordinates of $a$ are strictly greater, in the usual arithmetic sense, than the corresponding coordinates of $b$.
Examples. The partial orderings we are especially interested in are given by the following cones:
• The nonnegative orthant $\mathbf{R}^m_+$ in $\mathbf{R}^m$;
• The Lorentz (or the second-order, or, less scientifically, the ice-cream) cone
$$L^m = \left\{x = (x_1, ..., x_{m-1}, x_m)^T \in \mathbf{R}^m : x_m \ge \sqrt{\sum_{i=1}^{m-1}x_i^2}\right\};$$
• The semidefinite cone $\mathbf{S}^m_+$, which lives in the space $E = \mathbf{S}^m$ of $m\times m$ symmetric matrices (equipped with the Frobenius inner product $\langle A, B\rangle = {\rm Tr}(AB) = \sum_{i,j}A_{ij}B_{ij}$) and consists of all $m\times m$ matrices $A$ which are positive semidefinite, i.e.,
$$A = A^T;\qquad x^TAx \ge 0\ \forall x \in \mathbf{R}^m.$$
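Membership in each of these three cones is checked by an elementary computation; the following small sketch (assuming NumPy; the tolerances are our choice) makes the definitions concrete:

```python
import numpy as np

def in_orthant(x, tol=1e-9):
    return np.all(x >= -tol)

def in_lorentz(x, tol=1e-9):
    # L^m: the last coordinate dominates the Euclidean norm of the rest
    return x[-1] >= np.linalg.norm(x[:-1]) - tol

def in_psd(X, tol=1e-9):
    # S^m_+: symmetric with nonnegative eigenvalues
    return np.allclose(X, X.T) and np.linalg.eigvalsh(X).min() >= -tol

print(in_orthant(np.array([1.0, 0.0, 2.0])),
      in_lorentz(np.array([3.0, 4.0, 5.0])),       # a boundary point of L^3
      in_psd(np.array([[2.0, 1.0], [1.0, 2.0]])))
```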
We shall refer to (CP) as the conic problem associated with the cone $K$. Note that the only difference between this problem and an LP problem is that the latter deals with the particular choice $E = \mathbf{R}^m$, $K = \mathbf{R}^m_+$. With the formulation (CP), we get the possibility to cover a much wider spectrum of applications which cannot be captured by LP; we shall look at numerous examples in the sequel.
with weight vectors $\lambda \ge 0$. By its origin, an inequality of this type is a consequence of the system of constraints $Ax \ge b$ of (LP), i.e., it is satisfied at every solution to the system. Consequently, whenever we are lucky to get, as the left hand side of (Cons($\lambda$)), the expression $c^Tx$, i.e., whenever a nonnegative weight vector $\lambda$ satisfies the relation
$$A^T\lambda = c,$$
the inequality (Cons($\lambda$)) yields a lower bound $b^T\lambda$ on the optimal value in (LP). And the dual problem
$$\max\left\{b^T\lambda : \lambda \ge 0,\ A^T\lambda = c\right\}$$
was nothing but the problem of finding the best lower bound one can get in this fashion.
The same scheme can be used to develop the dual to a conic problem
$$\min\left\{c^Tx : Ax \ge_K b\right\}, \quad K \subset E. \tag{CP}$$
Here the only step which needs clarification is the following one:
(?) What are the "admissible" weight vectors $\lambda$, i.e., the vectors such that the scalar inequality
$$\langle\lambda, Ax\rangle \ge \langle\lambda, b\rangle$$
is a consequence of the vector inequality $Ax \ge_K b$?
Example 1.4.1 Consider the ordering $\ge_{L^3}$ on $E = \mathbf{R}^3$ given by the 3-dimensional ice-cream cone:
$$\left[\begin{array}{c}a_1\\ a_2\\ a_3\end{array}\right] \ge_{L^3} \left[\begin{array}{c}0\\ 0\\ 0\end{array}\right] \Leftrightarrow a_3 \ge \sqrt{a_1^2 + a_2^2}.$$
The inequality
$$\left[\begin{array}{r}-1\\ -1\\ 2\end{array}\right] \ge_{L^3} \left[\begin{array}{c}0\\ 0\\ 0\end{array}\right]$$
is valid; however, aggregating this inequality with the aid of the positive weight vector $\lambda = \left[\begin{array}{c}1\\ 1\\ 0.1\end{array}\right]$, we get the false inequality
$$-1.8 \ge 0.$$
Thus, not every nonnegative weight vector is admissible for the partial ordering ≥L3 .
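It is instructive to redo this example numerically. The sketch below (assuming NumPy) uses the standard fact that the ice-cream cone is self-dual, so that, in the language of the next paragraphs, a weight vector $\lambda$ is admissible for $\ge_{L^3}$ exactly when $\lambda$ itself lies in $L^3$:

```python
import numpy as np

a = np.array([-1.0, -1.0, 2.0])
lam = np.array([1.0, 1.0, 0.1])

in_L3 = a[2] >= np.hypot(a[0], a[1])                 # a >=_{L3} 0 holds
lam_admissible = lam[2] >= np.hypot(lam[0], lam[1])  # lam in L3 = (L3)*? no
print(in_L3, lam_admissible, lam @ a)                # True False -1.8
```

The nonnegative but non-admissible $\lambda$ indeed turns a valid conic inequality into the false scalar inequality $-1.8 \ge 0$.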
To answer the question (?) is the same as to say what the weight vectors $\lambda$ are such that
$$\forall a \ge_K 0:\quad \langle\lambda, a\rangle \ge 0. \tag{1.4.1}$$
Indeed, for such a $\lambda$ the scalar inequality $\langle\lambda, a\rangle \ge \langle\lambda, b\rangle$ is a consequence of $a \ge_K b$:
$$a \ge_K b \Leftrightarrow a - b \ge_K 0\ \hbox{[additivity of $\ge_K$]} \Rightarrow \langle\lambda, a - b\rangle \ge 0\ \hbox{[by (1.4.1)]} \Leftrightarrow \langle\lambda, a\rangle \ge \langle\lambda, b\rangle.$$
Vectors $\lambda$ satisfying (1.4.1) form the set
$$K_* = \{\lambda \in E : \langle\lambda, a\rangle \ge 0\ \forall a \in K\}.$$
The set K∗ is comprised of vectors whose inner products with all vectors from K are nonnegative.
K∗ is called the cone dual to K. The name is legitimate due to the following fact (see Section
B.2.7.B):
Theorem 1.4.1 [Properties of the dual cone] Let E be a finite-dimensional Euclidean space
with inner product h·, ·i and let K ⊂ E be a nonempty set. Then
(i) The set
$$K_* = \{\lambda \in E : \langle\lambda, a\rangle \ge 0\ \forall a \in K\}$$
is a closed cone.
(ii) If int K 6= ∅, then K∗ is pointed.
(iii) If K is a closed convex pointed cone, then int K∗ 6= ∅.
(iv) If K is a closed cone, then so is K∗ , and the cone dual to K∗ is K itself:
(K∗ )∗ = K.
From the dual cone to the problem dual to (CP). Now we are ready to derive the dual problem of a conic problem (CP). As in the case of Linear Programming, we start with the observation that whenever $x$ is a feasible solution to (CP) and $\lambda$ is an admissible weight vector, i.e., $\lambda \in K_*$, then $x$ satisfies the scalar inequality $\langle\lambda, Ax\rangle \ge \langle\lambda, b\rangle$. Consequently, whenever $\lambda \in K_*$ satisfies the relation
$$A^*\lambda = c,$$
one has
$$c^Tx = (A^*\lambda)^Tx = \langle\lambda, Ax\rangle \ge \langle b, \lambda\rangle$$
for all $x$ feasible for (CP), so that the quantity $\langle b, \lambda\rangle$ is a lower bound on the optimal value of (CP). The best bound one can get in this fashion is the optimal value in the problem
$$\max\left\{\langle b, \lambda\rangle : A^*\lambda = c,\ \lambda \ge_{K_*} 0\right\}. \tag{CD}$$
Proposition 1.4.1 [Weak Conic Duality Theorem] The optimal value of (CD) is a lower bound
on the optimal value of (CP).
c ∈ ImA∗ . (1.4.2)
$^{16)}$For a linear operator $x \mapsto Ax : \mathbf{R}^n \to E$, $A^*$ is the conjugate operator given by the identity
$$\langle\lambda, Ax\rangle = x^TA^*\lambda \quad \forall (x \in \mathbf{R}^n, \lambda \in E).$$
When representing the operators by their matrices in orthogonal bases in the argument and the range spaces, the matrix representing the conjugate operator is exactly the transpose of the matrix representing the operator itself.
In the case of (1.4.2) the primal problem (CP) can be posed equivalently as the following problem:
$$\min_y\left\{\langle d, y\rangle : y \in L,\ y \ge_K 0\right\},$$
In the case of (1.4.2) the primal problem, geometrically, is the problem of minimizing
a linear form over the intersection of the affine plane L with the cone K, and the
dual problem, similarly, is to maximize another linear form over the intersection of
the affine plane L∗ with the dual cone K∗ .
Now, what happens if the condition (1.4.2) is not satisfied? The answer is very simple: in this
case (CP) makes no sense – it is either unbounded below, or infeasible.
Indeed, assume that (1.4.2) is not satisfied. Then, by Linear Algebra, the vector $c$ is not orthogonal to the null space of $A$, so that there exists $e$ such that $Ae = 0$ and $c^Te > 0$. Now let $x$ be a feasible solution of (CP); note that all points $x - \mu e$, $\mu \ge 0$, are feasible, and $c^T(x - \mu e) \to -\infty$ as $\mu \to \infty$. Thus, when (1.4.2) is not satisfied, problem (CP), whenever feasible, is unbounded below.
From the above observation we see that if (1.4.2) is not satisfied, then we may reject (CP) from
the very beginning. Thus, from now on we assume that (1.4.2) is satisfied. In fact in what
follows we make a bit stronger assumption:
A. The mapping A is of full column rank, i.e., it has trivial null space.
Assuming that the mapping x 7→ Ax has the trivial null space (“we have eliminated
from the very beginning the redundant degrees of freedom – those not affecting the
value of Ax”), the equation
A∗ d = q
is solvable for every right hand side vector q.
In view of A, problem (CP) can be reformulated as a problem (P) of minimizing a linear objective
hd, yi over the intersection of an affine plane L and a cone K. Conversely, a problem (P) of this
latter type can be posed in the form of (CP) – to this end it suffices to represent the plane L as
the image of an affine mapping x 7→ Ax − b (i.e., to parameterize somehow the feasible plane)
and to “translate” the objective hd, yi to the space of x-variables – to set c = A∗ d, which yields
$$y = Ax - b \Rightarrow \langle d, y\rangle = c^Tx + \hbox{const}.$$
Thus, when dealing with a conic problem, we may pass from its “analytic form” (CP) to the
“geometric form” (P) and vice versa.
What are the relations between the “geometric data” of the primal and the dual problems?
We already know that the cone K∗ associated with the dual problem is dual of the cone K
associated with the primal one. What about the feasible planes L and L∗ ? The answer is
simple: they are orthogonal to each other! More exactly, the affine plane L is the translation,
by vector −b, of the linear subspace
L = ImA ≡ {y = Ax : x ∈ Rn }.
And L∗ is the translation, by any solution λ0 of the system A∗ λ = c, e.g., by the solution d to
the system, of the linear subspace
L∗ = Null(A∗ ) ≡ {λ : A∗ λ = 0}.
A well-known fact of Linear Algebra is that the linear subspaces $L$ and $L_*$ are orthogonal complements of each other:
$$L_* = L^\perp,\qquad L = (L_*)^\perp.$$
Thus, the primal problem (CP) is equivalent to the problem
(P): minimize the linear objective $\langle d, y\rangle$ over the intersection of the cone $K$ with the affine plane $\mathcal{L} = L - b$ given as a translation, by vector $-b$, of the linear subspace $L$,
and the dual problem is
$$\max_\lambda\left\{\langle b, \lambda\rangle : \lambda \in L^\perp + d,\ \lambda \ge_{K_*} 0\right\}, \tag{CD}$$
i.e., the problem of maximizing the linear objective $\langle b, \lambda\rangle$ over the intersection of the dual cone $K_*$ with the affine plane $\mathcal{L}_* = L^\perp + d$ given as a translation, by the vector $d$, of the orthogonal complement $L^\perp$ of $L$.
What we get is an extremely transparent geometric description of the primal-dual pair of conic
problems (P), (CD). Note that the duality is completely symmetric: the problem dual to (CD)
is (P)! Indeed, we know from Theorem 1.4.1 that (K∗ )∗ = K, and of course (L⊥ )⊥ = L. Switch
from maximization to minimization corresponds to the fact that the “shifting vector” in (P) is
(−b), while the “shifting vector” in (CD) is d. The geometry of the primal-dual pair (P), (CD)
$^{17)}$Recall that we have restricted ourselves to the problems satisfying the assumption A.
is illustrated by the figure below.
[Figure: geometry of the primal-dual pair (P), (CD)]
Finally, note that in the case when (CP) is an LP program (i.e., in the case when $K$ is the nonnegative orthant), the "conic dual" problem (CD) is exactly the usual LP dual; this fact immediately follows from the observation that the cone dual to $\mathbf{R}^m_+$ is $\mathbf{R}^m_+$ itself.
We have explored the geometry of a primal-dual pair of conic problems: the “geometric
data” of such a pair are given by a pair of dual to each other cones K, K∗ in E and a pair of
affine planes L = L − b, L∗ = L⊥ + d, where L is a linear subspace in E and L⊥ is its orthogonal
complement. The first problem from the pair – let it be called (P) – is to minimize $\langle d, y\rangle$ over $y \in K\cap\mathcal{L}$, and the second (CD) is to maximize $\langle b, \lambda\rangle$ over $\lambda \in K_*\cap\mathcal{L}_*$. Note that the "geometric
data” (K, K∗ , L, L∗ ) of the pair do not specify completely the problems of the pair: given L, L∗ ,
we can uniquely define L, but not the shift vectors (−b) and d: b is known up to shift by a
vector from L, and d is known up to shift by a vector from L⊥ . However, this non-uniqueness
is of absolutely no importance: replacing a chosen vector d ∈ L∗ by another vector d0 ∈ L∗ , we
pass from (P) to a new problem (P0 ) which is completely equivalent to (P): indeed, both (P)
and (P0 ) have the same feasible set, and on the (common) feasible plane L of the problems their
objectives hd, yi and hd0 , yi differ from each other by a constant:
y ∈ L = L − b, d − d0 ∈ L⊥ ⇒ hd − d0 , y + bi = 0 ⇒ hd − d0 , yi = h−(d − d0 ), bi ∀y ∈ L.
Similarly, shifting b along L, we do modify the objective in (CD), but in a trivial way – on the
feasible plane L∗ of the problem the new objective differs from the old one by a constant.
1) The duality is symmetric: the dual problem is conic, and the problem dual to dual is the
primal.
2) The value of the dual objective at every dual feasible solution $\lambda$ is $\le$ the value of the primal objective at every primal feasible solution $x$, so that the duality gap
$$c^Tx - \langle b, \lambda\rangle$$
is nonnegative at every primal-dual feasible pair $(x, \lambda)$.
Proof. 1): The result was already obtained when discussing the geometry of the primal and
the dual problems.
2): This is the Weak Conic Duality Theorem.
3): Assume that (CP) is strictly feasible and bounded below, and let c∗ be the optimal value
of the problem. We should prove that the dual is solvable with the same optimal value. Since
we already know that the optimal value of the dual is ≤ c∗ (see 2)), all we need is to point out
a dual feasible solution λ∗ with bT λ∗ ≥ c∗ .
Consider the convex set
$$M = \{y = Ax - b : x \in \mathbf{R}^n,\ c^Tx \le c_*\}.$$
We claim that (i) $M$ is nonempty, and (ii) $M$ does not intersect the interior of the cone $K$. (i) is evident (why?). To verify (ii), assume, on the contrary, that there exists a point $\bar x$, $c^T\bar x \le c_*$,
such that ȳ ≡ Ax̄ − b >K 0. Then, of course, Ax − b >K 0 for all x close enough to x̄, i.e., all
points x in a small enough neighbourhood of x̄ are also feasible for (CP). Since c 6= 0, there are
points x in this neighbourhood with cT x < cT x̄ ≤ c∗ , which is impossible, since c∗ is the optimal
value of (CP).
Now let us make use of the following basic fact (see Section B.2.6):
Theorem 1.4.3 [Separation Theorem for Convex Sets] Let $S$, $T$ be nonempty non-intersecting convex subsets of a finite-dimensional Euclidean space $E$ with inner product $\langle\cdot,\cdot\rangle$. Then $S$ and $T$ can be separated by a linear functional: there exists a nonzero vector $\lambda \in E$ such that
$$\sup_{u\in S}\langle\lambda, u\rangle \le \inf_{u\in T}\langle\lambda, u\rangle.$$
From the inequality it follows that the linear form $\langle\lambda, y\rangle$ of $y$ is bounded below on ${\rm int}\,K$. Since this interior is a conic set:
$$y \in {\rm int}\,K,\ \mu > 0 \Rightarrow \mu y \in {\rm int}\,K$$
(why?), this boundedness implies that $\langle\lambda, y\rangle \ge 0$ for all $y \in {\rm int}\,K$. Consequently, $\langle\lambda, y\rangle \ge 0$ for all $y$ from the closure of ${\rm int}\,K$, i.e., for all $y \in K$. We conclude that $\lambda \ge_{K_*} 0$, so that the inf in (1.4.4) is nonnegative. On the other hand, the infimum of a linear form over a conic set clearly cannot be positive; we conclude that the inf in (1.4.4) is 0, so that the inequality reads
$$\sup_{u\in M}\langle\lambda, u\rangle \le 0.$$
for all x from the half-space cT x ≤ c∗ . But the linear form [A∗ λ]T x can be bounded above on
the half-space if and only if the vector A∗ λ is proportional, with a nonnegative coefficient, to
the vector c:
A∗ λ = µc
for some µ ≥ 0. We claim that µ > 0. Indeed, assuming µ = 0, we get A∗ λ = 0, whence hλ, bi ≥ 0
in view of (1.4.5). It is time now to recall that (CP) is strictly feasible, i.e., Ax̄ − b >K 0 for
some x̄. Since λ ≥K∗ 0 and λ 6= 0, the product hλ, Ax̄ − bi should be strictly positive (why?),
while in fact we know that the product is −hλ, bi ≤ 0 (since A∗ λ = 0 and, as we have seen,
hλ, bi ≥ 0).
Thus, µ > 0. Setting λ∗ = µ−1 λ, we get
We see that λ∗ is feasible for (CD), the value of the dual objective at λ∗ being at least c∗ , as
required.
It remains to consider the case c = 0. Here, of course, c∗ = 0, and the existence of a dual
feasible solution with the value of the objective ≥ c∗ = 0 is evident: the required solution is
λ = 0. 3.a) is proved.
3.b): the result follows from 3.a) in view of the primal-dual symmetry.
4): Let x be primal feasible, and λ be dual feasible. Then
(!) For every primal-dual feasible pair (x, λ) the duality gap cT x − hb, λi is equal to
the inner product of the primal slack vector y = Ax − b and the dual vector λ.
Note that (!) in fact does not require "full" primal-dual feasibility: $x$ may be arbitrary (i.e., $y$ should belong to the primal feasible plane ${\rm Im}A - b$), and $\lambda$ should belong to the dual feasible plane $A^*\lambda = c$, but $y$ and $\lambda$ need not belong to the respective cones.
In view of (!) the complementary slackness holds if and only if the duality gap is zero; thus, all
we need is to prove 4.a).
The “primal residual” cT x − c∗ and the “dual residual” b∗ − hb, λi are nonnegative, provided
that x is primal feasible, and λ is dual feasible. It follows that the duality gap
is nonnegative (recall that c∗ ≥ b∗ by 2)), and it is zero if and only if c∗ = b∗ and both primal
and dual residuals are zero (i.e., x is primal optimal, and λ is dual optimal). All these arguments
hold without any assumptions of strict feasibility. We see that the condition “the duality gap
at a primal-dual feasible pair is zero” is always sufficient for primal-dual optimality of the pair;
and if c∗ = b∗ , this sufficient condition is also necessary. Since in the case of 4) we indeed have
c∗ = b∗ (this is stated by 3)), 4.a) follows.
A useful consequence of the Conic Duality Theorem is the following
Corollary 1.4.2 Assume that both (CP) and (CD) are strictly feasible. Then both problems are
solvable, the optimal values are equal to each other, and each one of the conditions 4.a), 4.b) is
necessary and sufficient for optimality of a primal-dual feasible pair.
Indeed, by the Weak Conic Duality Theorem, if one of the problems is feasible, the other is
bounded, and it remains to use the items 3) and 4) of the Conic Duality Theorem.
1.4.5.1 Refinement
We can slightly refine the Conic Duality Theorem, applying “special treatment” to scalar linear
equality and inequality constraints. Specifically, assume that our primal problem is
$$\hbox{Opt}(P) = \min_x\left\{c^Tx :\ \begin{array}{ll} A_1x - b_1 \ge 0 & (a)\\ A_ix - b_i \in \mathbf{K}_i,\ 2 \le i \le m & (b)\\ Rx = r & (c)\end{array}\right\} \tag{P}$$
where $\mathbf{K}_i$, $i = 2, 3, ..., m$, are regular cones in Euclidean spaces. Applying the same "bounding mechanism" as when deriving the conic dual (CD) of (CP), we arrive at the dual problem
$$\hbox{Opt}(D) = \max_{z,\{y_i\}_{i=1}^m}\left\{r^Tz + b_1^Ty_1 + \sum_{i=2}^m\langle b_i, y_i\rangle :\ \begin{array}{l} y_1 \ge 0,\\ y_i \in \mathbf{K}_i^*,\ 2 \le i \le m,\\ R^*z + \sum_{i=1}^m A_i^*y_i = c\end{array}\right\} \tag{D}$$
Essentially strict feasibility. Note that the structure of problem (D) is completely similar to the one of (P): the variables, let them be called $\xi$, are subject to finitely many scalar linear equalities and inequalities and, on top of it, finitely many conic inequalities $P_i\xi - p_i \in \mathbf{L}_i$, $i \in I$, where the $\mathbf{L}_i$ are regular cones in Euclidean spaces. Let us call such a conic problem essentially strictly feasible if it admits a feasible solution $\bar\xi$ at which the conic inequalities are satisfied strictly: $P_i\bar\xi - p_i \in {\rm int}\,\mathbf{L}_i$, $i \in I$. It turns out that the Conic Duality Theorem remains valid when one replaces in it "strict feasibility" with "essentially strict feasibility," which is some progress: a strictly feasible conic problem clearly is essentially strictly feasible, but not necessarily vice versa. Thus, we intend to prove
Theorem 1.4.4 [Refined Conic Duality Theorem] Consider a primal-dual pair of conic pro-
grams (P ), (D). Then
(i) [Weak Duality] One has Opt(D) ≤ Opt(P ).
(ii) [Symmetry] The duality is symmetric: (D) is a conic program, and the program dual to
(D) is (equivalent to) (P ).
(iii) [Refined Strong Duality] If one of the programs (P ), (D) is essentially strictly feasible
and bounded, then the other program is solvable, and Opt(P ) = Opt(D).
If both the programs are essentially strictly feasible, then both are solvable with equal optimal
values.
Note that the Refined Conic Duality Theorem covers the usual Linear Programming Duality
Theorem: the latter is the particular case m = 1 of the former.
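As a sanity check of this LP special case, one can generate a random primal-dual pair with strictly feasible data and verify that the optimal values match. A sketch assuming SciPy's `linprog` (all data synthetic and ours):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
m, n = 12, 4
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) - rng.uniform(0.1, 1.0, m)  # strictly feasible primal
c = A.T @ rng.uniform(0.1, 1.0, m)                         # strictly feasible dual

# primal: min c'x s.t. Ax >= b; dual: max b'lam s.t. lam >= 0, A'lam = c
prim = linprog(c, A_ub=-A, b_ub=-b, bounds=(None, None), method="highs")
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=(0, None), method="highs")
print("Opt(P) =", prim.fun, " Opt(D) =", -dual.fun)        # equal: strong duality
```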
For a proof of Theorem 1.4.4, see, e.g., [23, Section 7.1.3].
Example 1.4.2 Consider the following conic problem with 2 variables $x = (x_1, x_2)^T$ and the 3-dimensional ice-cream cone $K$:
$$\min\left\{x_1 : Ax - b \equiv \left[\begin{array}{c}x_1 - x_2\\ 1\\ x_1 + x_2\end{array}\right] \ge_{L^3} 0\right\}.$$
Example 1.4.3 Consider the following conic problem with two variables x = (x1 , x2 )T and the
3-dimensional ice-cream cone K:
$$ \min_x\left\{ x_2 : Ax - b \equiv \begin{pmatrix} x_1\\ x_2\\ x_1\end{pmatrix} \ge_{\mathbf L^3} 0\right\}. $$
The primal is solvable: its feasible set is {x : x₁ ≥ √(x₁² + x₂²)} = {x : x₂ = 0, x₁ ≥ 0}, so that the
optimal value is 0. In spite of this fact, the dual is infeasible: indeed, assuming that λ is dual
feasible, we have λ ≥_{L³} 0, which means that λ₃ ≥ √(λ₁² + λ₂²); since also λ₁ + λ₃ = 0, we come to
λ₂ = 0, which contradicts the equality λ₂ = 1.
We see that the weakness of the Conic Duality Theorem as compared to the LP Duality one
reflects pathologies which indeed may happen in the general conic case.
Ax − b ≥_K 0. (I)
We say that (I) is almost solvable if for every ε > 0 there exists b′ with ‖b − b′‖₂ ≤ ε such that the
system Ax − b′ ≥_K 0 is solvable. Consider, along with (I), the system
A*λ = 0, ⟨b, λ⟩ > 0, λ ≥_{K*} 0 (II)
in variables λ. Then:
(i) if (II) is solvable, then (I) is unsolvable;
(ii) if (I) is not almost solvable, then (II) is solvable.
Moreover,
(iii) (II) is solvable if and only if (I) is not almost solvable.
Note the difference between the simple case when ≥_K is the usual partial ordering ≥ and the
general case. In the former, one can replace “almost solvable” in (ii) by “solvable”; however, in
the general conic case “almost” is unavoidable.
Recalling the definition of the ice-cream cone L³, we can write the inequality equivalently as
$$ \sqrt 2\, x \ \ge\ \sqrt{(x+1)^2 + (x-1)^2} \ \equiv\ \sqrt{2x^2 + 2}, \qquad (i) $$
From the second of these relations, λ₃ = −(λ₁ + λ₂)/√2, so that from the first inequality we get
0 ≥ (λ₁ − λ₂)², whence λ₁ = λ₂. But then the third inequality in (ii) is impossible! We see that
here both (i) and (ii) have no solutions.
The geometry of the example is as follows. (i) asks to find a point in the intersection of
the 3D ice-cream cone and a line. This line is an asymptote of the cone (it belongs to a 2D
plane which crosses the cone in such a way that the boundary of the cross-section is a branch of
a hyperbola, and the line is one of two asymptotes of the hyperbola). Although the intersection
is empty ((i) is unsolvable), small shifts of the line make the intersection nonempty (i.e., (i) is
unsolvable and “almost solvable” at the same time). And it turns out that one cannot certify
the fact that (i) itself is unsolvable by providing a solution to (ii).
$$ \min_{x,t}\left\{ t : Ax + t\sigma - b \ge_{\mathbf K} 0\right\} \qquad \mathrm{(CP)} $$
in variables (x, t). Clearly, the problem is strictly feasible (why?). Now, if (I) is not almost solvable, then,
first, the matrix [A, σ] of the problem satisfies the full column rank condition A (otherwise the image of
the mapping (x, t) ↦ Ax + tσ − b would coincide with the image of the mapping x ↦ Ax − b, which is
not the case – the first of these images does intersect K, while the second does not). Second, the optimal
value in (CP) is strictly positive (otherwise the problem would admit feasible solutions with t close to 0,
and this would mean that (I) is almost solvable). From the Conic Duality Theorem it follows that the
dual problem of (CP)
$$ \max_\lambda\left\{ \langle b,\lambda\rangle : A^*\lambda = 0,\ \langle\sigma,\lambda\rangle = 1,\ \lambda\ge_{\mathbf K_*} 0\right\} $$
has a feasible solution with positive ⟨b, λ⟩, i.e., (II) is solvable.
It remains to prove (iii). Assume first that (I) is not almost solvable; then (II) must be solvable by
(ii). Vice versa, assume that (II) is solvable, and let λ be a solution to (II). Then λ solves also all systems
of the type (II) associated with small enough perturbations of b instead of b itself; by (i), it implies that
all inequalities obtained from (I) by small enough perturbation of b are unsolvable.
Ax ≥_K b, (V)
along with the scalar linear inequality
c^T x ≥ d. (S)
Proposition 1.4.3 (i) If (S) can be obtained from (V) and from the trivial inequality 1 ≥ 0 by
admissible aggregation, i.e., there exists a weight vector λ ≥_{K*} 0 such that
A*λ = c, ⟨λ, b⟩ ≥ d,
then (S) is a consequence of (V).
(ii) If (S) is a consequence of a strictly feasible vector inequality (V), then (S) can be obtained
from (V) and from the trivial inequality by admissible aggregation.
The difference between the case of the partial ordering ≥ and a general partial ordering ≥_K is
in the word “strictly” in (ii).
Proof of the proposition. (i) is evident (why?). To prove (ii), assume that (V) is strictly feasible and
(S) is a consequence of (V), and consider the conic problem
$$ \min_{x,t}\left\{ t : \bar A\begin{pmatrix}x\\ t\end{pmatrix} - \bar b \equiv \begin{pmatrix}Ax - b\\ d - c^Tx + t\end{pmatrix} \ge_{\bar{\mathbf K}} 0\right\},\qquad \bar{\mathbf K} = \{(x, t) : x\in\mathbf K,\ t\ge 0\}. $$
The problem is clearly strictly feasible (choose x to be a strictly feasible solution to (V) and then choose
t large enough). The fact that (S) is a consequence of (V) says exactly that the optimal value in
the problem is nonnegative. By the Conic Duality Theorem, the dual problem
$$ \max_{\lambda,\mu}\left\{ \langle b,\lambda\rangle - d\mu : A^*\lambda - \mu c = 0,\ \mu = 1,\ \begin{pmatrix}\lambda\\ \mu\end{pmatrix}\ge_{\bar{\mathbf K}_*} 0\right\} $$
has a feasible solution with the value of the objective ≥ 0. Since, as is easily seen, K̄_* = {(λ, μ) : λ ∈
K_*, μ ≥ 0}, the indicated solution satisfies the requirements
λ ≥_{K*} 0, A*λ = c, ⟨b, λ⟩ ≥ d,
i.e., it is a weight vector as required in (i).
Data of a conic problem. When asked “What are the data of an LP program min{cT x :
Ax − b ≥ 0}”, everybody will give the same answer: “the objective c, the constraint matrix A
and the right hand side vector b”. Similarly, for a conic problem
$$ \min_x\left\{ c^Tx : Ax - b \ge_{\mathbf K} 0\right\}, \qquad \mathrm{(CP)} $$
its data, by definition, is the triple (c, A, b), while the sizes of the problem – the dimension n
of x and the dimension m of K, same as the underlying cone K itself, are considered as the
structure of (CP).
Robustness. A question of primary importance is whether the properties of the program (CP)
(feasibility, solvability, etc.) are stable with respect to perturbations of the data. The reasons
which make this question important are as follows:
• In actual applications, especially those arising in Engineering, the data are normally inex-
act: their true values, even when they “exist in the nature”, are not known exactly when
the problem is processed. Consequently, the results of the processing say something definite
about the “true” problem only if these results are robust with respect to small data
perturbations, i.e., only if the properties of (CP) we have discovered are shared not only by the
particular (“nominal”) problem we were processing, but also by all problems with nearby
data.
• Even when the exact data are available, we should take into account that processing them
computationally, we unavoidably add “noise” like rounding errors (you simply cannot load
something like 1/7 into the standard computer). As a result, a real-life computational routine
can recognize only those properties of the input problem which are stable with respect to
small perturbations of the data.
Due to the above reasons, we should study not only whether a given problem (CP) is feasi-
ble/bounded/solvable, etc., but also whether these properties are robust – remain unchanged
under small data perturbations. As it turns out, the Conic Duality Theorem allows us to recognize
“robust feasibility/boundedness/solvability...”.
Let us start with introducing the relevant concepts. We say that (CP) is
• robust feasible, if all “sufficiently close” problems (i.e., those of the same structure
(n, m, K) and with data close enough to those of (CP)) are feasible;
• robust bounded below, if all sufficiently close problems are bounded below (i.e., their
objectives are bounded below on their feasible sets);
Note that a problem which is not robust feasible is not necessarily robust infeasible, since among
close problems there may be both feasible and infeasible ones (look at Example 1.4.3 – slightly shifting
and rotating the plane Im A − b, we may get whatever we want: a feasible bounded problem,
a feasible unbounded problem, an infeasible problem...). This is why we need two kinds of
definitions: one of “robust presence” of a property and one of “robust absence” of the same
property.
Let us now look at necessary and sufficient conditions for the most important robust
forms of the “solvability status”.
Proposition 1.4.4 [Robust feasibility] (CP) is robust feasible if and only if it is strictly feasible,
in which case the dual problem (D) is robust bounded above.
Proof. The statement is nearly tautological. Let us fix δ >_K 0. If (CP) is robust feasible, then for small
enough t > 0 the perturbed problem min{c^Tx : Ax − b − tδ ≥_K 0} is feasible; a feasible solution
to the perturbed problem clearly is a strictly feasible solution to (CP). The inverse implication is evident
(a strictly feasible solution to (CP) remains feasible for all problems with close enough data). It remains
to note that if all problems sufficiently close to (CP) are feasible, then their duals, by the Weak Conic
Duality Theorem, are bounded above, so that (D) is robust bounded above.
Proposition 1.4.5 [Robust infeasibility] (CP) is robust infeasible if and only if the system
⟨b, λ⟩ = 1, A*λ = 0, λ ≥_{K*} 0
is robust feasible, or, which is the same (by Proposition 1.4.4), if and only if the system
⟨b, λ⟩ = 1, A*λ = 0, λ >_{K*} 0 (1.4.6)
has a solution.
Proof. First assume that (1.4.6) is solvable, and let us prove that all problems sufficiently close to (CP)
are infeasible. Let us fix a solution λ̄ to (1.4.6). Since A is of full column rank, simple Linear Algebra
says that the systems [A′]*λ = 0 are solvable for all matrices A′ from a small enough neighbourhood U
of A; moreover, the corresponding solution λ(A′) can be chosen to satisfy λ(A) = λ̄ and to be continuous
in A′ ∈ U. Since λ(A′) is continuous and λ(A) >_{K*} 0, shrinking U appropriately we may assume that
λ(A′) >_{K*} 0 for all A′ ∈ U. Now, b^Tλ̄ = 1; by continuity reasons, there exist a neighbourhood V of b
and a neighbourhood U′ ⊂ U of A such that for all b′ ∈ V and all A′ ∈ U′ one has ⟨b′, λ(A′)⟩ > 0.
Thus, we have seen that there exist a neighbourhood U′ of A and a neighbourhood V of b, along with
a function λ(A′), A′ ∈ U′, such that
λ(A′) >_{K*} 0, [A′]*λ(A′) = 0, ⟨b′, λ(A′)⟩ > 0
for all b′ ∈ V and A′ ∈ U′. By Proposition 1.4.2.(i) it means that all the problems
min { [c′]^Tx : A′x − b′ ≥_K 0 }
with A′ ∈ U′, b′ ∈ V are infeasible, i.e., (CP) is robust infeasible.
Now assume that (CP) is robust infeasible, and let us prove that (1.4.6) is solvable. By robust
infeasibility, there exist a neighbourhood U of A and a neighbourhood V of b such that for every A′ ∈ U
and b′ ∈ V the system
A′x − b′ ≥_K 0
is not almost solvable (see Proposition 1.4.2). We conclude from Proposition 1.4.2.(ii) that for every
A′ ∈ U and b′ ∈ V there exists λ = λ(A′, b′) such that
λ(A′, b′) ≥_{K*} 0, [A′]*λ(A′, b′) = 0, ⟨b′, λ(A′, b′)⟩ > 0.
Now let us choose λ⁰ >_{K*} 0. For all small enough positive ε we have A_ε = A + εb[A*λ⁰]^T ∈ U. Let us
choose an ε with the latter property so small that ε⟨b, λ⁰⟩ > −1, and set A′ = A_ε, b′ = b. According
to the previous observation, there exists λ = λ(A′, b) such that
λ ≥_{K*} 0, [A′]*λ = 0, ⟨b, λ⟩ > 0.
Setting λ̄ = λ + ε⟨b, λ⟩λ⁰, we get λ̄ >_{K*} 0 (since λ ≥_{K*} 0, λ⁰ >_{K*} 0 and ε⟨b, λ⟩ > 0), while
A*λ̄ = [A′]*λ = 0 and ⟨b, λ̄⟩ = ⟨b, λ⟩(1 + ε⟨b, λ⁰⟩) > 0. Multiplying λ̄ by an appropriate positive
factor, we get a solution to (1.4.6).
Proposition 1.4.6 For a conic problem (CP) the following conditions are equivalent to each
other
(i) (CP) is robust feasible and robust bounded (below);
(ii) (CP) is robust solvable;
(iii) (D) is robust solvable;
(iv) (D) is robust feasible and robust bounded (above);
(v) Both (CP) and (D) are strictly feasible.
In particular, under every one of these equivalent assumptions, both (CP) and (D) are solv-
able with equal optimal values.
Proof. (i) ⇒ (v): If (CP) is robust feasible, it also is strictly feasible (Proposition 1.4.4). If, in addition,
(CP) is robust bounded below, then (D) is robust solvable (by the Conic Duality Theorem); in particular,
(D) is robust feasible and therefore strictly feasible (again Proposition 1.4.4).
(v) ⇒ (ii): The implication is given by the Conic Duality Theorem.
(ii) ⇒ (i): trivial.
We have proved that (i)≡(ii)≡(v). Due to the primal-dual symmetry, we also have proved that
(iii)≡(iv)≡(v).
1.5 Exercises
1.5.1 Around General Theorem on Alternative
Exercise 1.4 Derive the General Theorem on Alternative from the Homogeneous Farkas Lemma.
Hint: Verify that the system
(S): { a_i^Tx > b_i, i = 1, ..., m_s; a_i^Tx ≥ b_i, i = m_s + 1, ..., m }
in variables x has no solutions if and only if a certain associated homogeneous system of linear
inequalities, in x and additional scalar variables, has no solutions, and apply the Homogeneous
Farkas Lemma to the latter system in these variables.
There exist several particular cases of GTA (which in fact are equivalent to GTA); the goal of
the next exercise is to prove the corresponding statements.
Exercise 1.5 Derive the following statements from the General Theorem on Alternative:
1. [Gordan’s Theorem on Alternative] One of the inequality systems
(I) Ax < 0, x ∈ Rn ,
(II) AT y = 0, 0 6= y ≥ 0, y ∈ Rm ,
(A being an m × n matrix, x are variables in (I), y are variables in (II)) has a solution if
and only if the other one has no solutions.
2. [Inhomogeneous Farkas Lemma] A linear inequality
a^T x ≤ p (1.5.1)
is a consequence of a solvable system of linear inequalities
N x ≤ q (1.5.2)
if and only if it is a “linear consequence” of the system and the trivial inequality
0^T x ≤ 1,
i.e., if it can be obtained by taking a weighted sum, with nonnegative coefficients, of the
inequalities from the system and this trivial inequality.
Algebraically: (1.5.1) is a consequence of the solvable system (1.5.2) if and only if
a = N^T ν
for some nonnegative vector ν such that
ν^T q ≤ p.
3. [Motzkin’s Theorem on Alternative] The system
S x < 0, N x ≤ 0
in variables x has no solutions if and only if the system
S^T σ + N^T ν = 0, σ ≥ 0, ν ≥ 0, σ ≠ 0
in variables σ, ν has a solution.
Exercise 1.6 Consider the linear inequality
x + y ≤ 2
and the system of linear inequalities
x ≤ 1, −x ≤ −100.
Our inequality clearly is a consequence of the system – it is satisfied at every solution to it (simply
because there are no solutions to the system at all). According to the Inhomogeneous Farkas
Lemma, the inequality should be a linear consequence of the system and the trivial inequality
0 ≤ 1, i.e., there should exist nonnegative ν₁, ν₂ such that
$$ \begin{pmatrix}1\\ 1\end{pmatrix} = \nu_1\begin{pmatrix}1\\ 0\end{pmatrix} + \nu_2\begin{pmatrix}-1\\ 0\end{pmatrix},\qquad \nu_1 - 100\nu_2 \le 2, $$
which clearly is not the case. What is the reason for the observed “contradiction”?
{x ≥K 0 : λT x ≤ 1}
is compact.
A^{-1}(K) = {u : Au ∈ K}
is a cone in R^k.
Prove that the cone dual to A^{-1}(K) is A^TK_*, i.e., that
(A^{-1}(K))_* = A^TK_* ≡ {A^Tλ : λ ∈ K_*}.
3) [stability with respect to taking linear image] Let K be a cone in Rn and y = Ax be a linear
mapping from Rn onto RN (i.e., the image of A is the entire RN ). Assume Null(A) ∩ K = {0}.
Prove that then the set
AK = {Ax : x ∈ K}
is a cone in RN .
Prove that the cone dual to AK is
(AK)∗ = {λ ∈ RN : AT λ ∈ K∗ }.
Demonstrate by example that if in the above statement the assumption Null(A) ∩ K = {0} is
weakened to Null(A) ∩ int K = ∅, then the set A(K) may happen to be non-closed.
Hint. Look what happens when the 3D ice-cream cone is projected onto its tangent plane.
is primal-dual optimal if and only if the “primal slack” y = Ax − b and λ are complementary.
Exercise 1.11 [Nonnegative orthant] Prove that the n-dimensional nonnegative orthant Rn+ is
a cone and that it is self-dual:
(Rn+ )∗ = Rn+ .
What are complementary pairs?
Exercise 1.12 [Ice-cream cone] Prove that the n-dimensional ice-cream cone L^n is a cone and
that it is self-dual:
(L^n)_* = L^n.
Exercise 1.13 [Positive semidefinite cone] Let S^n_+ be the cone of n × n positive semidefinite
matrices in the space S^n of symmetric n × n matrices. Assume that S^n is equipped with the
Frobenius inner product
$$ \langle X, Y\rangle = \mathrm{Tr}(XY) = \sum_{i,j=1}^n X_{ij}Y_{ij}. $$
1) Prove that S^n_+ is indeed a cone.
2) Prove that S^n_+ is self-dual:
(S^n_+)_* = S^n_+,
i.e., that the Frobenius inner products of a symmetric matrix Λ with all positive semidefinite ma-
trices X of the same size are nonnegative if and only if the matrix Λ itself is positive semidefinite.
3) Prove the following characterization of the complementary pairs:
Two matrices X ∈ Sn+ , Λ ∈ (Sn+ )∗ ≡ Sn+ are complementary (i.e., hΛ, Xi = 0) if and
only if their matrix product is zero: ΛX = XΛ = 0. In particular, matrices from a
complementary pair commute and therefore share a common orthonormal eigenbasis.
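The “ΛX = XΛ = 0” characterization is easy to probe numerically; the following sketch (plain numpy, with hypothetical matrices built from a shared eigenbasis) illustrates it:

```python
import numpy as np

# Build PSD matrices X, Lam with a common orthonormal eigenbasis and
# disjoint eigenvalue "supports"; then <Lam, X> = Tr(Lam X) = 0 and,
# as the exercise asserts, the matrix product itself vanishes.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthonormal basis
X   = Q @ np.diag([2.0, 1.0, 0.0, 0.0]) @ Q.T      # PSD, rank 2
Lam = Q @ np.diag([0.0, 0.0, 3.0, 0.5]) @ Q.T      # PSD, complementary support
print(np.trace(Lam @ X))       # ~0: the pair is complementary
print(np.abs(Lam @ X).max())   # ~0: Lam X = X Lam = 0
```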
Exercise 1.14 Consider the following problem:
Given N points b₁, ..., b_N in R^n, find a point x ∈ R^n which minimizes the maximum
of its Euclidean distances to the points b₁, ..., b_N, i.e., solve the problem
$$ \min_x \max_{i=1,\dots,N} \|x - b_i\|_2 . $$
Imagine, e.g., that n = 2, b₁, ..., b_N are locations of villages and you are interested to locate
a fire station for which the worst-case distance to a possible fire is as small as possible.
1) Pose the problem as a conic quadratic one – a conic problem associated with a direct product
of ice-cream cones (a modeling sketch follows the exercise).
2) Build the dual problem.
3) What is the geometric interpretation of the dual? Are the primal and the dual strictly
feasible? Solvable? With equal optimal values? What is the meaning of the complementary
slackness?
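For item 1), one natural reformulation is min{t : ‖x − b_i‖₂ ≤ t, i = 1, ..., N} – one ice-cream constraint per point. A minimal sketch, assuming the cvxpy package and hypothetical village data:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
B = rng.uniform(0.0, 10.0, size=(5, 2))   # 5 hypothetical villages in the plane

# min { t : ||x - b_i||_2 <= t, i = 1,...,N } - a conic quadratic program
x, t = cp.Variable(2), cp.Variable()
prob = cp.Problem(cp.Minimize(t),
                  [cp.norm(x - B[i], 2) <= t for i in range(len(B))])
prob.solve()
print(x.value, t.value)  # center and radius of the smallest enclosing ball
```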
Exercise 1.15 [The weighted Steiner problem] Consider the problem as follows:
Given N points b1 , ..., bN in Rn along with positive weights ωi , i = 1, ..., N , find a
point x ∈ Rn which minimizes the weighted sum of its (Euclidean) distances to the
points b₁, ..., b_N, i.e., solve the problem
$$ \min_x \sum_{i=1}^N \omega_i\|x - b_i\|_2 . $$
Imagine, e.g., that n = 2, b1 , ..., bN are locations of N villages and you are interested to place
a telephone station for which the total cost of cables linking the station and the villages is
as small as possible. The weights can be interpreted as the per mile cost of the cables (they
may vary from village to village due to differences in populations and, consequently, in the
required capacities of the cables).
1) Pose the problem as a conic quadratic one (a modeling sketch follows the exercise).
2) Build the dual problem.
3) What is the geometric interpretation of the dual? Are the primal and the dual strictly
feasible? Solvable? With equal optimal values? What is the meaning of the complementary
slackness?
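Similarly, for item 1) here one can write min{Σ_i ω_i t_i : ‖x − b_i‖₂ ≤ t_i, i = 1, ..., N}; a sketch with hypothetical data, again assuming cvxpy:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
B = rng.uniform(0.0, 10.0, size=(5, 2))    # hypothetical villages
w = rng.uniform(1.0, 3.0, size=5)          # per-mile cable costs

# min { sum_i w_i t_i : ||x - b_i||_2 <= t_i } - one ice-cream cone per village
x, t = cp.Variable(2), cp.Variable(5)
prob = cp.Problem(cp.Minimize(w @ t),
                  [cp.norm(x - B[i], 2) <= t[i] for i in range(len(B))])
prob.solve()
print(x.value, prob.value)
```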
Exercise 1.16 Let (CP) be feasible. Then the following four properties are equivalent:
(i) the feasible set of the problem is bounded;
(ii) the set of primal slacks Y = {y : y ≥K 0, y = Ax − b} is bounded.
(iii) Im A ∩ K = {0};
(iv) the system of vector (in)equalities
AT λ = 0, λ >K∗ 0
is solvable.
Corollary. The property of (CP) to have a bounded feasible set is independent of the particular
value of b, provided that with this b (CP) is feasible!
Exercise 1.17 Let problem (CP) be feasible. Prove that the following two conditions are equiv-
alent:
(i) (CP) has bounded level sets;
(ii) The dual problem n o
max bT λ : AT λ = c, λ ≥K∗ 0
is strictly feasible.
Corollary. The property of (CP) to have bounded level sets is independent of the particular
value of b, provided that with this b (CP) is feasible!
Lecture 2
Conic Quadratic Programming
Several “generic” families of conic problems are of special interest, both from the viewpoint
of theory and applications. The cones underlying these problems are simple enough, so that
one can describe explicitly the dual cone; as a result, the general duality machinery we have
developed becomes “algorithmic”, as in the Linear Programming case. Moreover, in many cases
this “algorithmic duality machinery” allows one to understand the original model more deeply,
to convert it into equivalent forms better suited for numerical processing, etc. The relative
simplicity of the underlying cones also enables one to develop efficient computational methods
for the corresponding conic problems. The most famous example of a “nice” generic conic
problem is, doubtless, Linear Programming; however, it is not the only problem of this sort. Two
other nice generic conic problems of extreme importance are Conic Quadratic and Semidefinite
programs. We are about to consider the first of these two problems.
A conic quadratic problem is a conic problem
$$ \min_x\left\{ c^Tx : Ax - b \ge_{\mathbf K} 0\right\} \qquad \mathrm{(CP)} $$
for which the cone K is a direct product of several ice-cream cones:
$$ \mathbf K = \mathbf L^{m_1}\times\mathbf L^{m_2}\times\cdots\times\mathbf L^{m_k} = \left\{ y = \begin{pmatrix} y[1]\\ y[2]\\ \vdots\\ y[k]\end{pmatrix} : y[i]\in\mathbf L^{m_i},\ i = 1,\dots,k\right\}. \qquad (2.1.1) $$
In other words, a conic quadratic problem is an optimization problem with linear objective and
finitely many “ice-cream constraints”
Ai x − bi ≥Lmi 0, i = 1, ..., k,
where
$$ [A;b] = \begin{pmatrix} [A_1;b_1]\\ {}[A_2;b_2]\\ \vdots\\ {}[A_k;b_k]\end{pmatrix} $$
is the partition of the data matrix [A; b] corresponding to the partition of y in (2.1.1). Thus, a
conic quadratic program can be written as
$$ \min_x\left\{ c^Tx : A_ix - b_i \ge_{\mathbf L^{m_i}} 0,\ i = 1,\dots,k\right\}. \qquad (2.1.2) $$
Recalling the definition of the relation ≥_{L^m} and partitioning the data matrix [A_i; b_i] as
$$ [A_i; b_i] = \begin{pmatrix} D_i & d_i\\ p_i^T & q_i\end{pmatrix}, $$
where D_i is of size (m_i − 1) × dim x, we can write down the problem as
$$ \min_x\left\{ c^Tx : \|D_ix - d_i\|_2 \le p_i^Tx - q_i,\ i = 1,\dots,k\right\}; \qquad \mathrm{(QP)} $$
this “most explicit” form is the one we prefer to use. In this form, the D_i are matrices whose
column dimension equals the dimension of x, the d_i are vectors of dimension equal to the number
of rows of D_i, the p_i are vectors of the same dimension as x, and the q_i are reals.
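To make (QP) concrete, here is a minimal sketch (assuming the Python package cvxpy, not part of these notes) of the simplest instance: k = 1, D = I, d = 0, p = 0, q = −1, i.e., minimizing a linear form over the unit Euclidean ball, whose optimal value is −‖c‖₂:

```python
import numpy as np
import cvxpy as cp

# (QP) with the single constraint ||x - 0||_2 <= 0^T x - (-1), i.e. ||x||_2 <= 1:
# minimize c^T x over the unit ball; the optimal value must equal -||c||_2.
n = 4
c = np.arange(1.0, n + 1.0)                  # c = (1, 2, 3, 4)
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(c @ x), [cp.norm(x, 2) <= 1.0])
prob.solve()
print(prob.value, -np.linalg.norm(c))        # the two numbers agree
```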
It is immediately seen that (2.1.1) is indeed a cone, in fact a self-dual one: K_* = K.
Consequently, the problem dual to (CP) is
$$ \max_\lambda\left\{ b^T\lambda : A^T\lambda = c,\ \lambda \ge_{\mathbf K} 0\right\}. $$
Denoting λ = (λ₁; λ₂; ...; λ_k) with m_i-dimensional blocks λ_i (cf. (2.1.1)), we can write the dual problem
as
$$ \max_{\lambda_1,\dots,\lambda_k}\left\{ \sum_{i=1}^k b_i^T\lambda_i : \sum_{i=1}^k A_i^T\lambda_i = c,\ \lambda_i \ge_{\mathbf L^{m_i}} 0,\ i = 1,\dots,k\right\}. $$
Recalling the meaning of ≥_{L^{m_i}} 0 and representing λ_i = (μ_i; ν_i) with scalar last component ν_i, we
finally come to the following form of the problem dual to (QP):
$$ \max_{\mu_i,\nu_i}\left\{ \sum_{i=1}^k\left[\mu_i^Td_i + \nu_iq_i\right] : \sum_{i=1}^k\left[D_i^T\mu_i + \nu_ip_i\right] = c,\ \|\mu_i\|_2\le\nu_i,\ i = 1,\dots,k\right\}. \qquad \mathrm{(QD)} $$
The design variables in (QD) are vectors µi of the same dimensions as the vectors di and reals
νi , i = 1, ..., k.
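As a worked illustration of (QD), take the unit-ball instance k = 1, D₁ = I_n, d₁ = 0, p₁ = 0, q₁ = −1 of (QP) considered above. Then (QD) reads
$$ \max_{\mu,\nu}\left\{ -\nu \;:\; \mu = c,\ \|\mu\|_2 \le \nu \right\} = -\|c\|_2, $$
which indeed coincides with the optimal value −‖c‖₂ of the primal problem min{c^Tx : ‖x‖₂ ≤ 1} – as it should, both problems being strictly feasible.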
Since from now on we will treat (QP) and (QD) as the standard forms of a conic quadratic
problem and its dual, we now interpret for these two problems our basic Assumption A from
Lecture 1 and notions like feasibility, strict feasibility, boundedness, etc. Assumption A now
reads (why?):
There is no nonzero x which is orthogonal to all rows of all matrices Di and to all
vectors pi , i = 1, ..., k
and we always make this assumption by default. Now, among notions like feasibility, solvability,
etc., the only notion which does need an interpretation is strict feasibility, which now reads as
follows (why?):
Strict feasibility of (QP) means that there exists x̄ such that ‖D_ix̄ − d_i‖₂ < p_i^Tx̄ − q_i
for all i.
Strict feasibility of (QD) means that there exists a feasible solution {μ̄_i, ν̄_i}_{i=1}^k to the
problem such that ‖μ̄_i‖₂ < ν̄_i for all i = 1, ..., k.
[Figure: a finger pressing on the body: p^i is the contact point, v^i the unit inward normal, f^i the contact force, and F^i the friction force.]
Let v^i be the unit inward normal to the surface of the body at the point p^i where the i-th finger
touches the body, f^i be the contact force exerted by the i-th finger, and F^i be the friction force
caused by the contact. Physics (Coulomb’s law) says that the latter force is tangential to the
surface of the body:
(F i )T v i = 0 (2.2.1)
and its magnitude cannot exceed µ times the magnitude of the normal component of the contact
force, where µ is the friction coefficient:
kF i k2 ≤ µ(f i )T v i . (2.2.2)
Assume that the body is subject to additional external forces (e.g., gravity); as far as their
mechanical consequences are concerned, all these forces can be represented by a single force –
their sum – F ext along with the torque T ext – the sum of vector products of the external forces
and the points where they are applied.
In order for the body to be in static equilibrium, the total force acting on the body and the
total torque should be zero:
$$ \sum_{i=1}^N (f^i + F^i) + F^{\mathrm{ext}} = 0,\qquad \sum_{i=1}^N p^i\times(f^i + F^i) + T^{\mathrm{ext}} = 0, \qquad (2.2.3) $$
where a × b denotes the vector product of two 3D vectors a and b¹⁾.
The robot should hold a cylinder by four fingers, all acting in the vertical direction.
The external forces and torques acting on the cylinder are the gravity force F_g and an
externally applied torque T along the cylinder axis, as shown in the picture:
[Figure: a cylinder held by four vertically acting fingers with forces f₁, f₂, f₃, f₄; also shown are the gravity force F_g and the externally applied torque T along the cylinder axis.]
The magnitudes νi of the forces fi may vary in a given segment [0, Fmax ].
What can be the largest magnitude τ of the external torque T such that a stable
grasp is still possible?
Denoting by u^i the directions of the fingers, by v^i the directions of the inward normals to the
cylinder’s surface at the contact points, and by u the direction of the cylinder’s axis, we can pose
the problem as the optimization program
1) Here is the definition: if p = (p₁; p₂; p₃) and q = (q₁; q₂; q₃) are two 3D vectors, then
$$ [p,q] = \left(\mathrm{Det}\begin{pmatrix}p_2 & p_3\\ q_2 & q_3\end{pmatrix};\ \mathrm{Det}\begin{pmatrix}p_3 & p_1\\ q_3 & q_1\end{pmatrix};\ \mathrm{Det}\begin{pmatrix}p_1 & p_2\\ q_1 & q_2\end{pmatrix}\right). $$
The vector [p, q] is orthogonal to both p and q, and ‖[p, q]‖₂ = ‖p‖₂‖q‖₂ sin(p̂,q), where p̂,q is the angle between p and q.
$$ \begin{array}{lll}
\max & \tau\\
\mathrm{s.t.} & \sum_{i=1}^4 (\nu_iu^i + F^i) + F_g = 0 & [\text{total force equals } 0]\\
& \sum_{i=1}^4 p^i\times(\nu_iu^i + F^i) + \tau u = 0 & [\text{total torque equals } 0]\\
& (v^i)^TF^i = 0,\ i = 1,\dots,4 & [F^i\text{ are tangential to the surface}]\\
& \|F^i\|_2 \le [\mu (u^i)^Tv^i]\,\nu_i,\ i = 1,\dots,4 & [\text{Coulomb's constraints}]\\
& 0\le \nu_i\le F_{\max},\ i = 1,\dots,4 & [\text{bounds on }\nu_i]
\end{array} $$
in the design variables τ, ν_i, F^i, i = 1, ..., 4. This is a conic quadratic program, although not
in the standard form (QP). To convert the problem to the standard form, it suffices, e.g., to
replace all linear equalities by pairs of linear inequalities, and further to represent linear inequalities
α^Tx ≤ β as conic quadratic constraints
$$ Ax - b \equiv \begin{pmatrix} 0\\ \beta - \alpha^Tx\end{pmatrix} \ge_{\mathbf L^2} 0. $$
every X_i is the set of vectors admissible for a particular design restriction, which in many cases
is given by
X_i = {x ∈ R^n : g_i(x) ≤ 0}, (2.3.3)
where g_i(x) is the i-th constraint function²⁾.
It is well known that the objective f can always be assumed linear; otherwise we could
move the original objective to the list of constraints, passing to the equivalent problem
$$ \min_{t,x}\left\{ t : (t,x)\in\widehat X \equiv \{(x,t) : x\in X,\ t\ge f(x)\}\right\}. $$
In order to recognize that X is in one of our “catalogue” forms, one needs a kind of dictionary
where different forms of the same structure are listed. We shall build such a dictionary for
conic quadratic programs. Thus, our goal is to understand when a given set X can be
represented by conic quadratic inequalities (c.q.i.’s), i.e., by one or several constraints of the type
‖Dx − d‖₂ ≤ p^Tx − q. The word “represented” needs clarification, and here it is:
We say that a set X ⊂ R^n can be represented via conic quadratic inequalities (for
short: is CQr – Conic Quadratic representable) if there exists a system S of finitely
many vector inequalities of the form
A_j (x; u) − b_j ≥_{L^{m_j}} 0
in variables x ∈ R^n and additional variables u such that X is the projection of the solution set of S onto the
x-space, i.e., x ∈ X if and only if one can extend x to a solution (x, u) of the system
S:
x ∈ X ⇔ ∃u : A_j (x; u) − b_j ≥_{L^{m_j}} 0, j = 1, ..., N.
Every such system S is called a conic quadratic representation (for short: a CQR)
of the set X³⁾.
Consider an optimization problem
min_x { c^Tx : x ∈ X } (P)
and assume that X is CQr. Then the problem is equivalent to a conic quadratic
program. The latter program can be written down explicitly, provided that we are
given a CQR of X.
Indeed, let S be a CQR of X, and let u be the corresponding vector of additional variables. The
problem
$$ \min_{x,u}\left\{ c^Tx : (x, u)\ \text{satisfies}\ S\right\} $$
with design variables x, u is equivalent to the original problem (P), on one hand, and is a
conic quadratic program, on the other hand.
Let us call a problem of the form (P) with CQ-representable X a good problem.
How do we recognize good problems, i.e., how do we recognize CQ-representable sets? Well, how
do we recognize continuity of a given function, like f(x, y) = exp{sin(x + exp{y})}? Normally it is
not done by straightforward verification of the definition of continuity, but by using two kinds
of tools:
(A) results stating that certain simple standard functions are continuous;
(B) results stating that certain operations with functions (addition, multiplication, composition, etc.) preserve continuity.
When we see that a function is obtained from “simple” functions – those of type A – by operations
of type B (as is the case in the above example), we immediately infer that the function is
continuous.
This approach, which is common in Mathematics, is the one we are about to follow. In fact,
we need to answer two kinds of questions:
(?) What are CQ-representable sets?
(??) What are CQ-representable functions g(x), i.e., functions which possess CQ-representable
epigraphs
Epi{g} = {(x, t) ∈ R^n × R : g(x) ≤ t} ?
where αj and βj are, respectively, vector-valued and scalar affine functions of their arguments.
In order to get from this representation a CQ-representation of a level set {x : g(x) ≤ a}, it
suffices to fix in the conic quadratic inequalities kαj (x, t, u)k2 ≤ βj (x, t, u) the variable t at
the value a.
We list below our “raw materials” – simple functions and sets admitting CQR’s.
$$ x^Tx \le t \ \Leftrightarrow\ x^Tx + \frac{(t-1)^2}{4} \le \frac{(t+1)^2}{4} \ \Leftrightarrow\ \left\|\begin{pmatrix}x\\ \frac{t-1}{2}\end{pmatrix}\right\|_2 \le \frac{t+1}{2} $$
(check the second ⇔!), and the last relation is a conic quadratic inequality.
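The second equivalence is a one-line computation ((t+1)² − (t−1)² = 4t); a quick numerical spot check of the resulting CQR (plain numpy, random points – an illustration, not a proof) looks as follows:

```python
import numpy as np

# Spot check: x^T x <= t  iff  ||(x, (t-1)/2)||_2 <= (t+1)/2.
rng = np.random.default_rng(3)
for _ in range(100000):
    x = rng.standard_normal(3)
    t = 4.0 * rng.standard_normal()
    lhs = x @ x <= t
    rhs = np.hypot(np.linalg.norm(x), (t - 1.0) / 2.0) <= (t + 1.0) / 2.0
    assert lhs == rhs
```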
5. The fractional-quadratic function g(x, s) = { x^Tx/s for s > 0; 0 for s = 0, x = 0; +∞ otherwise } (x vector, s scalar) is CQr.
Indeed, with the convention that (x^Tx)/0 is 0 or +∞, depending on whether x = 0 or not, and taking
into account that ts = (t+s)²/4 − (t−s)²/4, we have:
$$ \left\{\tfrac{x^Tx}{s}\le t,\ s\ge 0\right\} \Leftrightarrow \{x^Tx\le ts,\ t\ge 0,\ s\ge 0\} \Leftrightarrow \left\{x^Tx + \tfrac{(t-s)^2}{4}\le\tfrac{(t+s)^2}{4},\ t\ge 0,\ s\ge 0\right\} \Leftrightarrow \left\|\begin{pmatrix}x\\ \frac{t-s}{2}\end{pmatrix}\right\|_2 \le \frac{t+s}{2} $$
(check the third ⇔!), and the last relation is a conic quadratic inequality.
The level sets of the CQr functions 1 – 5 provide us with a spectrum of “elementary” CQr
sets. We add to this spectrum one more set:
B. Direct product: If sets X_i ⊂ R^{n_i}, i = 1, ..., k, are CQr, then so is their direct product
X₁ × ... × X_k.
Indeed, if S_i = {‖α_j^i(x_i, u_i)‖₂ ≤ β_j^i(x_i, u_i)}_{j=1}^{N_i}, i = 1, ..., k, are CQR’s of the sets X_i, then the union over
i of these systems of inequalities, regarded as a system with design variables x = (x₁, ..., x_k) and additional
variables u = (u₁, ..., u_k), is a CQR for the direct product of X₁, ..., X_k.
D. Inverse affine image: Let X ⊂ R^n be a CQr set, and let ℓ(y) = Ay + b be an affine mapping
from R^k to R^n. Then the inverse image ℓ^{-1}(X) = {y ∈ R^k : Ay + b ∈ X} of X under the
mapping is also CQr.
Indeed, let S = {‖α_j(x, u)‖₂ ≤ β_j(x, u)}_{j=1}^N be a CQR for X. Then the system of c.q.i.’s
S′ = {‖α_j(Ay + b, u)‖₂ ≤ β_j(Ay + b, u)}_{j=1}^N, in variables y, u, is a CQR of ℓ^{-1}(X).
It should be stressed that the above statements are not just existence theorems – they are
“algorithmic”: given CQR’s of the “operands” (say, m sets X₁, ..., X_m), we may build completely
mechanically a CQR for the “result of the operation” (e.g., for the intersection ∩_{i=1}^m X_i).
E. Taking maximum: If functions g_i(x), i = 1, ..., m, are CQr, then so is their maximum
g(x) = max_{i=1,...,m} g_i(x).
Indeed, Epi{g} = ∩_i Epi{g_i}, and the intersection of finitely many CQr sets is again CQr.
F. Summation with nonnegative weights: If functions g_i(x), x ∈ R^n, are CQr, i = 1, ..., m, and
α_i are nonnegative weights, then the function g(x) = Σ_{i=1}^m α_ig_i(x) is also CQr.
Indeed, consider the set
$$ \Pi = \{(x_1,t_1;x_2,t_2;\dots;x_m,t_m;t) : x_i\in\mathbf R^n,\ t_i\in\mathbf R,\ t\in\mathbf R,\ g_i(x_i)\le t_i,\ i=1,\dots,m;\ \sum_{i=1}^m \alpha_it_i\le t\}. $$
The set is CQr. Indeed, it is the direct product of the epigraphs of the g_i intersected with the half-space
given by the linear inequality Σ_{i=1}^m α_it_i ≤ t. Now, a direct product of CQr sets is also CQr, a half-space is
CQr (it is a level set of an affine function, and such a function is CQr), and the intersection of CQr sets
is also CQr. Since Π is CQr, so is its projection on the subspace of variables x₁, x₂, ..., x_m, t, i.e., the set
$$ \{(x_1,\dots,x_m,t) : \exists t_1,\dots,t_m : g_i(x_i)\le t_i,\ i=1,\dots,m,\ \sum_{i=1}^m\alpha_it_i\le t\} = \{(x_1,\dots,x_m,t) : \sum_{i=1}^m \alpha_ig_i(x_i)\le t\}. $$
Since the latter set is CQr, so is its inverse image under the mapping
(x, t) ↦ (x, x, ..., x, t),
and this inverse image is exactly Epi{g}.
G. Direct summation: If functions g_i(x_i), x_i ∈ R^{n_i}, i = 1, ..., m, are CQr, so is their direct
sum
g(x₁, ..., x_m) = g₁(x₁) + ... + g_m(x_m).
Indeed, the functions ĝ_i(x₁, ..., x_m) = g_i(x_i) are clearly CQr – their epigraphs are inverse images of the
epigraphs of the g_i under the affine mappings (x₁, ..., x_m, t) ↦ (x_i, t). It remains to note that g = Σ_i ĝ_i.
I. Partial minimization: Let g(x) be CQr. Assume that x is partitioned into two sub-vectors,
x = (v, w), let ĝ be obtained from g by partial minimization in w:
ĝ(v) = inf_w g(v, w),
and assume that for every v the minimum in w is achieved. Then ĝ is CQr.
Indeed, under the assumption that the minimum in w is always achieved, Epi{ĝ} is the image of
Epi{g} under the projection (v, w, t) ↦ (v, t).
J. Arithmetic summation of sets. Let X_i, i = 1, ..., k, be nonempty convex sets in R^n, and let X₁ +
X₂ + ... + X_k be the arithmetic sum of these sets:
X₁ + ... + X_k = { x = x₁ + ... + x_k : x_i ∈ X_i, i = 1, ..., k }.
We claim that
If all X_i are CQr, so is their sum.
Indeed, the direct product
X = X₁ × X₂ × ... × X_k ⊂ R^{nk}
is CQr by B.; it remains to note that X₁ + ... + X_k is the image of X under the linear mapping
(x₁, ..., x_k) ↦ x₁ + ... + x_k,
and by C. the image of a CQr set under an affine mapping is also CQr.
J.1. Inf-convolution. The operation with functions related to the arithmetic summation of sets is the
inf-convolution defined as follows. Let f_i : R^n → R ∪ {∞}, i = 1, ..., k, be functions. Their inf-convolution
is the function
f(x) = inf{ f₁(x₁) + ... + f_k(x_k) : x₁ + ... + x_k = x }. (∗)
We claim that
If all the f_i are CQr, the inf-convolution is > −∞ everywhere, and for every x for which the
inf in the right hand side of (∗) is finite this infimum is achieved, then f is CQr.
Indeed, under the assumptions in question, Epi{f} = Epi{f₁} + ... + Epi{f_k}.
K. Taking conic hull of a convex set. Let X ⊂ R^n be a nonempty convex set. Its conic hull is the set
X⁺ = {(x, t) ∈ R^n × R : t > 0, t^{-1}x ∈ X}.
Geometrically: we add to the coordinates of vectors from R^n a new coordinate equal to 1,
thus getting an affine embedding of R^n in R^{n+1}. We take the image of X under this mapping – “lift”
X by one along the (n + 1)st axis – and then form the set X⁺ by taking all (open) rays emanating from
the origin and crossing the “lifted” X.
The conic hull is not closed (e.g., it does not contain the origin, which clearly is in its closure). The
closed conic hull of X is the closure of its conic hull:
$$ \widehat X^+ = \mathrm{cl}\,X^+ = \left\{(x,t)\in\mathbf R^n\times\mathbf R : \exists\{(x_i,t_i)\}_{i=1}^\infty : t_i>0,\ t_i^{-1}x_i\in X,\ t=\lim_i t_i,\ x=\lim_i x_i\right\}. $$
Note that if X is a closed convex set, then the conic hull X⁺ of X is nothing but the intersection of the
closed conic hull X̂⁺ and the open half-space {t > 0} (check!); thus, the closed conic hull of a closed
convex set X is larger than the conic hull by some part of the hyperplane {t = 0}. When X is closed
and bounded, the difference between the hulls is pretty small: X̂⁺ = X⁺ ∪ {0} (check!). Note also
that if X is a closed convex set, you can obtain it from its closed conic hull by taking the intersection
with the hyperplane {t = 1}:
x ∈ X ⇔ (x, 1) ∈ X̂⁺ ⇔ (x, 1) ∈ X⁺.
(i) If the set X is nonempty and CQr:
X = {x : ∃u : Ax + Bu + b ≥_K 0}, (2.3.4)
where K is a direct product of ice-cream cones, then the conic hull X⁺ is CQr as well:
$$ X^+ = \left\{(x,t) : \exists(u,s) : Ax + Bu + tb \ge_{\mathbf K} 0,\ \begin{pmatrix}2\\ s - t\\ s + t\end{pmatrix}\ge_{\mathbf L^3} 0\right\}. \qquad (2.3.5) $$
(ii) If the set X given by (2.3.4) is closed, then the CQr set
$$ \widetilde X^+ = \{(x,t) : \exists u : Ax + Bu + tb \ge_{\mathbf K} 0\}\ \cap\ \{(x,t) : t\ge 0\} \qquad (2.3.6) $$
is between the conic hull X⁺ and the closed conic hull X̂⁺:
X⁺ ⊂ X̃⁺ ⊂ X̂⁺.
(iii) If the CQR (2.3.4) is such that Bu ∈ K implies that Bu = 0, then X̃⁺ = X̂⁺, so that X̂⁺ is
CQr.
(i): We have
X⁺ ≡ {(x, t) : t > 0, x/t ∈ X}
= {(x, t) : ∃u : A(x/t) + Bu + b ≥_K 0, t > 0}
= {(x, t) : ∃v : Ax + Bv + tb ≥_K 0, t > 0} [v = tu]
= {(x, t) : ∃v, s : Ax + Bv + tb ≥_K 0, t, s ≥ 0, ts ≥ 1},
and we arrive at (2.3.5).
(ii): We should prove that the set X̃⁺ (which by construction is CQr) is between X⁺ and X̂⁺. The
inclusion X⁺ ⊂ X̃⁺ is readily given by (2.3.5). Next, let us prove that X̃⁺ ⊂ X̂⁺. Let us choose a point
x̄ ∈ X, so that for a properly chosen ū it holds
Ax̄ + Bū + b ≥_K 0.
Now, if (x, t) ∈ X̃⁺, then t ≥ 0 and for some u we have Ax + Bu + tb ≥_K 0; consequently, for every ε > 0,
∃u = u_ε ≡ u + εū : A(x + εx̄) + Bu_ε + (t + ε)b ≥_K 0.
Since t + ε > 0, we conclude that (x + εx̄, t + ε) ∈ X⁺; letting ε → +0, we get (x, t) ∈ cl X⁺ = X̂⁺.
Lemma 2.3.1 Let Q be a matrix such that Qv ∈ K only if Qv = 0, and let
Y = {y : ∃v : P y + Qv + r ≥_K 0}.
Then:
(i) there exists a constant C < ∞ such that
P y + Qv + r ∈ K ⇒ ‖Qv‖₂ ≤ C(1 + ‖P y + r‖₂); (2.3.7)
(ii) Y is closed.
Proof of Lemma. (i): Assume, on the contrary to what should be proved, that there exists a sequence
{y_i, v_i} such that
P y_i + Qv_i + r ∈ K, ‖Qv_i‖₂ ≥ i(1 + ‖P y_i + r‖₂). (2.3.8)
By Linear Algebra, for every b such that the linear system Qv = b is solvable, it admits a solution v such
that ‖v‖₂ ≤ C₁‖b‖₂ with C₁ < ∞ depending on Q only; therefore we can assume, in addition to (2.3.8),
that
‖v_i‖₂ ≤ C₁‖Qv_i‖₂ (2.3.9)
for all i. Now, from (2.3.8) it clearly follows that
‖Qv_i‖₂ → ∞ as i → ∞; (2.3.10)
setting
v̂_i = v_i/‖Qv_i‖₂,
we have
(a) ‖Qv̂_i‖₂ = 1 ∀i,
(b) ‖v̂_i‖₂ ≤ C₁ ∀i [by (2.3.9)],
(c) Qv̂_i + ‖Qv_i‖₂^{-1}(P y_i + r) ∈ K ∀i,
(d) ‖Qv_i‖₂^{-1}‖P y_i + r‖₂ ≤ i^{-1} → 0 as i → ∞ [by (2.3.8)].
Taking into account (b) and passing to a subsequence, we can assume that v̂_i → v̂ as i → ∞; by (c, d)
Qv̂ ∈ K, while by (a) ‖Qv̂‖₂ = 1, i.e., Qv̂ ≠ 0, which is the desired contradiction.
(ii) To prove that Y is closed, assume that yi ∈ Y and yi → y as i → ∞, and let us verify that
y ∈ Y . Indeed, since yi ∈ Y , there exist vi such that P yi + Qvi + r ∈ K. Same as above, we can assume
that (2.3.9) holds. Since yi → y, the sequence {Qvi } is bounded by (2.3.7), so that the sequence {vi }
is bounded by (2.3.9). Passing to a subsequence, we can assume that vi → v as i → ∞; passing to the
limit, as i → ∞, in the inclusion P yi + Qvi + r ∈ K, we get P y + Qv + r ∈ K, i.e., y ∈ Y .
K.1. “Projective transformation” of a CQr function. The operation with functions related to taking
the conic hull of a convex set is the “projective transformation” which converts a function f(x) : R^n →
R ∪ {∞}⁴⁾ into the function
f⁺(x, s) = s f(x/s), s > 0.
The epigraph of f⁺ is the conic hull of the epigraph of f with the origin excluded:
Epi{f⁺} = {(x, t, s) : s > 0, s^{-1}(x, t) ∈ Epi{f}}.
The closure cl Epi{f⁺} is the epigraph of a certain function, let it be denoted f̂⁺(x, s); this function is
called the projective transformation of f.
4)
Recall that “a function” for us means a proper function – one which takes a finite value at least at one point
E.g., the fractional-quadratic function from Example 5 is the projective transformation of the function
f(x) = x^Tx. Note that the function f̂⁺(x, s) does not necessarily coincide with f⁺(x, s) even in the
open half-space s > 0; this is the case if and only if the epigraph of f is closed (or, which is the same,
f is lower semicontinuous: whenever x_i → x and f(x_i) → a, we have f(x) ≤ a). We are about to
demonstrate that the projective transformation “nearly” preserves CQ-representability:
Let f be a CQr function:
Epi{f} = {(x, t) : ∃u : Ax + tp + Bu + b ≥_K 0},
where K is a direct product of ice-cream cones. Assume that the CQR is such that Bu ≥_K 0 implies that
Bu = 0. Then the projective transformation f̂⁺ of f is CQr, namely,
Epi{f̂⁺} = {(x, t, s) : s ≥ 0, ∃u : Ax + tp + Bu + sb ≥_K 0}.
L. The polar of a convex set. Let X ⊂ Rn be a convex set containing the origin. The polar of X is the
set
X∗ = y ∈ Rn : y T x ≤ 1 ∀x ∈ X .
In particular,
• the polar of the singleton {0} is the entire space;
• the polar of the entire space is the singleton {0};
• the polar of a linear subspace is its orthogonal complement (why?);
• the polar of a closed convex pointed cone K with a nonempty interior is −K∗ , minus the dual cone
(why?).
Polarity is “symmetric”: if X is a closed convex set containing the origin, then so is X_*, and the polar
taken twice recovers the original set: (X_*)_* = X.
We are about to prove that the polarity X 7→ X∗ “nearly” preserves CQ-representability:
X = {x : ∃u : Ax + Bu + b ≥K 0} , (2.3.12)
Ax̄ + B ū + b >K 0.
X∗ = y : ∃ξ : AT ξ + y = 0, B T ξ = 0, bT ξ ≤ 1, ξ ≥K 0
(2.3.13)
min −y T x : Ax + Bu + b ≥K 0 .
(Py )
x,u
2.3. WHAT CAN BE EXPRESSED VIA CONIC QUADRATIC CONSTRAINTS? 93
A vector y belongs to X_* if and only if (P_y) is bounded below and its optimal value is at least −1. Since
(P_y) is strictly feasible, from the Conic Duality Theorem it follows that these properties of (P_y) hold if
and only if the dual problem
$$ \max_\xi\left\{ -b^T\xi : A^T\xi + y = 0,\ B^T\xi = 0,\ \xi\ge_{\mathbf K} 0\right\} $$
(recall that K is self-dual) has a feasible solution with the value of the dual objective at least −1. Thus,
X_* = { y : ∃ξ : A^Tξ + y = 0, B^Tξ = 0, b^Tξ ≤ 1, ξ ≥_K 0 },
as claimed in (2.3.13). It remains to note that X_* is obtained from the CQr set K by operations preserving
CQ-representability: intersection with the CQr set {ξ : B^Tξ = 0, b^Tξ ≤ 1} and subsequent affine mapping
ξ ↦ −A^Tξ.
L.1. The Legendre transformation of a CQr function. The operation with functions related to taking
the polar of a convex set is the Legendre (or conjugate) transformation. The Legendre transformation
(≡ the conjugate) of a function f(x) : R^n → R ∪ {∞} is the function
f_*(y) = sup_x [ y^Tx − f(x) ].
In particular,
• the conjugate of a constant f(x) ≡ c is the function
f_*(y) = { −c, y = 0; +∞, y ≠ 0 };
It is worth mentioning that the Legendre transformation is symmetric: if f is a proper convex lower
semicontinuous function (i.e., Epi{f} ≠ ∅ is convex and closed), then so is f_*, and taken twice, the
Legendre transformation recovers the original function: (f_*)_* = f.
We are about to prove that the Legendre transformation “nearly” preserves CQ-representability:
Let f : R^n → R ∪ {∞} be CQr:
Epi{f} = {(x, t) : ∃u : Ax + tp + Bu + b ≥_K 0},
where K is a direct product of ice-cream cones. Assume that there exist x̄, t̄, ū such that
Ax̄ + t̄p + Bū + b >_K 0.
Then the Legendre transformation f_* of f is CQr:
Epi{f_*} = {(y, s) : ∃ξ : A^Tξ + y = 0, B^Tξ = 0, p^Tξ = 1, b^Tξ ≤ s, ξ ≥_K 0}. (2.3.14)
Indeed, we have
(y, s) ∈ Epi{f_*} ⇔ y^Tx − t ≤ s ∀(x, t) ∈ Epi{f}. (2.3.15)
Consider the conic program
$$ \min_{x,t,u}\left\{ -y^Tx + t : Ax + tp + Bu + b \ge_{\mathbf K} 0\right\}. \qquad (P_y) $$
By (2.3.15), a pair (y, s) belongs to Epi{f_*} if and only if (P_y) is bounded below with optimal value
≥ −s. Since (P_y) is strictly feasible, this is the case if and only if the dual problem
$$ \max_\xi\left\{ -b^T\xi : A^T\xi + y = 0,\ B^T\xi = 0,\ p^T\xi = 1,\ \xi\ge_{\mathbf K} 0\right\} $$
has a feasible solution with the value of the dual objective ≥ −s. Thus,
Epi{f_*} = {(y, s) : ∃ξ : A^Tξ + y = 0, B^Tξ = 0, p^Tξ = 1, b^Tξ ≤ s, ξ ≥_K 0},
as claimed in (2.3.14). It remains to note that the right hand side set in (2.3.14) is CQr (as a set
obtained from the CQr set K × R_s by operations preserving CQ-representability – intersection with the
set {ξ : B^Tξ = 0, p^Tξ = 1, b^Tξ ≤ s} and subsequent affine mapping ξ ↦ −A^Tξ).
Indeed, SuppX (·) is the conjugate of the characteristic function of X, and it remains to refer to Corollary
2.3.4.
M. Taking convex hull of several sets. The convex hull of a set Y ⊂ R^n is the smallest convex set which
contains Y:
$$ \mathrm{Conv}(Y) = \left\{ x = \sum_{i=1}^k \alpha_ix_i : k\in\mathbf N,\ x_i\in Y,\ \alpha_i\ge 0,\ \sum_i\alpha_i = 1\right\}. $$
The closed convex hull cl Conv(Y) of Y is the smallest closed convex set containing Y.
Following Yu. Nesterov, let us prove that taking convex hull “nearly” preserves CQ-representability. Let
X_i = {x ∈ R^n : ∃u^i : A_ix + B_iu^i + b_i ≥_{K_i} 0}, i = 1, ..., k,
be nonempty CQr sets. Then the CQr set
Y = { x = ξ¹ + ... + ξᵏ : ∃ t_i, η^i : t_i ≥ 0, Σ_{i=1}^k t_i = 1, A_iξ^i + B_iη^i + t_ib_i ≥_{K_i} 0, i = 1, ..., k }
is between the convex hull and the closed convex hull of the set X₁ ∪ ... ∪ X_k:
$$ \mathrm{Conv}\Big(\bigcup_{i=1}^k X_i\Big) \subset Y \subset \mathrm{cl\,Conv}\Big(\bigcup_{i=1}^k X_i\Big). $$
If, in addition, (i) all X_i are closed and bounded, or (ii) X_i = Z_i + W with closed and bounded Z_i
and closed convex W, then
$$ \mathrm{Conv}\Big(\bigcup_{i=1}^k X_i\Big) = Y = \mathrm{cl\,Conv}\Big(\bigcup_{i=1}^k X_i\Big) $$
is CQr.
First, the set Y clearly contains Conv(∪_{i=1}^k X_i). Indeed, since the sets X_i are convex, the convex hull of
their union is
$$ \left\{ x = \sum_{i=1}^k t_ix^i : x^i\in X_i,\ t_i\ge 0,\ \sum_{i=1}^k t_i = 1\right\}. $$
Now, given x^i ∈ X_i, for properly chosen u^i we have
A_ix^i + B_iu^i + b_i ≥_{K_i} 0, i = 1, ..., k.
We get
x = (t₁x¹) + ... + (t_kxᵏ) = ξ¹ + ... + ξᵏ [ξ^i = t_ix^i];
t₁, ..., t_k ≥ 0; (2.3.18)
t₁ + ... + t_k = 1;
A_iξ^i + B_iη^i + t_ib_i ≥_{K_i} 0, i = 1, ..., k [η^i = t_iu^i],
so that x ∈ Y (see the definition of Y).
To complete the proof that Y is between the convex hull and the closed convex hull of ∪_{i=1}^k X_i, it
remains to verify that if x ∈ Y then x is contained in the closed convex hull of ∪_{i=1}^k X_i. Let us somehow
choose x̄^i ∈ X_i; for properly chosen ū^i we have
A_ix̄^i + B_iū^i + b_i ≥_{K_i} 0, i = 1, ..., k. (2.3.19)
Since x ∈ Y, there exist t_i, ξ^i, η^i such that
x = ξ¹ + ... + ξᵏ,
t₁, ..., t_k ≥ 0,
t₁ + ... + t_k = 1, (2.3.20)
A_iξ^i + B_iη^i + t_ib_i ≥_{K_i} 0, i = 1, ..., k.
In view of the latter relations and (2.3.19), we have for 0 < ε < 1:
$$ A_i[(1-\varepsilon)\xi^i + \varepsilon k^{-1}\bar x^i] + B_i[(1-\varepsilon)\eta^i + \varepsilon k^{-1}\bar u^i] + [(1-\varepsilon)t_i + \varepsilon k^{-1}]b_i \ge_{\mathbf K_i} 0; $$
setting
t_{i,ε} = (1 − ε)t_i + εk^{-1};
x^i_ε = t_{i,ε}^{-1}[(1 − ε)ξ^i + εk^{-1}x̄^i];
u^i_ε = t_{i,ε}^{-1}[(1 − ε)η^i + εk^{-1}ū^i],
we get
A_ix^i_ε + B_iu^i_ε + b_i ≥_{K_i} 0 ⇒ x^i_ε ∈ X_i;
t_{1,ε}, ..., t_{k,ε} ≥ 0, t_{1,ε} + ... + t_{k,ε} = 1
⇒ x_ε ≡ Σ_{i=1}^k t_{i,ε}x^i_ε ∈ Conv(∪_{i=1}^k X_i).
On the other hand,
$$ x_\varepsilon = \sum_{i=1}^k\left[(1-\varepsilon)\xi^i + \varepsilon k^{-1}\bar x^i\right] \to x = \sum_{i=1}^k \xi^i \ \text{ as }\ \varepsilon\to +0, $$
so that x belongs to the closed convex hull of ∪_{i=1}^k X_i, as claimed.
It remains to verify that in the cases of (i), (ii) the convex hull of ∪_{i=1}^k X_i is the same as the closed
convex hull of this union. (i) is a particular case of (ii) corresponding to W = {0}, so that it suffices to
prove (ii). Assume that
x_t = Σ_{i=1}^k μ_i^t[z_i^t + p_i^t] → x as t → ∞, z_i^t ∈ Z_i, p_i^t ∈ W, μ_i^t ≥ 0, Σ_i μ_i^t = 1,
and let us prove that x belongs to the convex hull of the union of the X_i. Indeed, since the Z_i are closed and
bounded, passing to a subsequence, we may assume that
z_i^t → z_i ∈ Z_i and μ_i^t → μ_i as t → ∞.
It follows that the vectors
p_t = Σ_{i=1}^k μ_i^t p_i^t = x_t − Σ_{i=1}^k μ_i^t z_i^t
converge as t → ∞ to some vector p, and since W is closed and convex, p ∈ W. We now have
$$ x = \lim_{t\to\infty}\left[\sum_{i=1}^k \mu_i^tz_i^t + p_t\right] = \sum_{i=1}^k \mu_iz_i + p = \sum_{i=1}^k \mu_i[z_i + p], $$
so that x belongs to the convex hull of the union of the X_i (as a convex combination of points z_i + p ∈ X_i).
P. The recessive cone of a CQr set. Let X be a closed convex set. The recessive cone Rec(X) of X is
the set
Rec(X) = {h : x + th ∈ X ∀(x ∈ X, t ≥ 0)}.
It can be easily verified that Rec(X) is a closed cone, and that
Rec(X) = {h : x̄ + th ∈ X ∀t ≥ 0} ∀x̄ ∈ X,
i.e., that Rec(X) is the set of all directions h such that the ray emanating from a point of X and directed
by h is contained in X.
Proposition 2.3.6 Let X be a nonempty CQr set with CQR
X = {x ∈ Rn : ∃u : Ax + Bu + b ≥K 0},
where K is a direct product of ice-cream cones, and let the CQR be such that Bu ∈ K only if Bu = 0.
Then X is closed, and the recessive cone of X is CQr:
Rec(X) = {h : ∃v : Ah + Bv ≥K 0}. (2.3.21)
Proof. The fact that X is closed is given by Lemma 2.3.1. In order to prove (2.3.21), let us temporarily
denote by R the set in the right hand side of this relation; we should prove that R = Rec(X). The inclusion
R ⊂ Rec(X) is evident. To prove the inverse inclusion, let x̄ ∈ X and h ∈ Rec(X), so that for every
i = 1, 2, ... there exists u_i such that
A(x̄ + ih) + Bu_i + b ∈ K. (2.3.22)
By Lemma 2.3.1,
‖Bu_i‖₂ ≤ C(1 + ‖A(x̄ + ih) + b‖₂) (2.3.23)
for certain C < ∞ and all i. Besides this, we can assume w.l.o.g. that
‖u_i‖₂ ≤ C₁‖Bu_i‖₂ (2.3.24)
(cf. the proof of Lemma 2.3.1). By (2.3.23) – (2.3.24), the sequence {v_i = i^{-1}u_i} is bounded; passing to
a subsequence, we can assume that v_i → v as i → ∞. By (2.3.22), we have for all i
i^{-1}A(x̄ + ih) + Bv_i + i^{-1}b ∈ K,
whence, passing to the limit as i → ∞, Ah + Bv ∈ K. Thus, h ∈ R.
Theorem 2.3.1 Under the above setting, the superposition g is CQr with CQR
Remark 2.3.1 If part of the “inner” functions, say, f1 , ..., fk , are affine, it suffices to require the mono-
tonicity of the “outer” function f with respect to the variables yk+1 , ..., ym only. A CQR for the super-
position in this case becomes
The epigraph of a convex quadratic form x^TD^TDx + q^Tx + r is CQr:
$$ \{(x,t) : x^TD^TDx + q^Tx + r \le t\} = \left\{(x,t) : \left\|\begin{pmatrix}Dx\\ \frac{t - q^Tx - r - 1}{2}\end{pmatrix}\right\|_2 \le \frac{t - q^Tx - r + 1}{2}\right\}. \qquad (2.3.27) $$
Surprisingly, our set is just the ice-cream cone, more precisely, its inverse image under the one-to-one
linear mapping
$$ \begin{pmatrix}x\\ \sigma_1\\ \sigma_2\end{pmatrix} \mapsto \begin{pmatrix}x\\ \frac{\sigma_1-\sigma_2}{2}\\ \frac{\sigma_1+\sigma_2}{2}\end{pmatrix}. $$
9. The “half-cone” K²₊ = {(x₁, x₂, t) ∈ R³ : x₁, x₂ ≥ 0, 0 ≤ t ≤ √(x₁x₂)} is CQr.
Indeed, our set is the intersection of the cone {t² ≤ x₁x₂, x₁, x₂ ≥ 0} from the previous example and the
half-space t ≥ 0.
Here is the explicit CQR of K²₊:
$$ K^2_+ = \left\{(x_1,x_2,t) : t\ge 0,\ \left\|\begin{pmatrix}t\\ \frac{x_1-x_2}{2}\end{pmatrix}\right\|_2 \le \frac{x_1+x_2}{2}\right\}. \qquad (2.3.29) $$
10. The hypograph of the geometric mean – the set K² = {(x₁, x₂, t) ∈ R³ : x₁, x₂ ≥ 0, t ≤ √(x₁x₂)} –
is CQr.
Note the difference with the previous example – here t is not required to be nonnegative!
Here is the explicit CQR for K² (cf. Example 9):
$$ K^2 = \left\{(x_1,x_2,t) : \exists\tau : t\le\tau;\ \tau\ge 0;\ \left\|\begin{pmatrix}\tau\\ \frac{x_1-x_2}{2}\end{pmatrix}\right\|_2 \le \frac{x_1+x_2}{2}\right\}. $$
11. The hypograph of the geometric mean of 2^l variables – the set K^{2^l} = {(x₁, ..., x_{2^l}, t) ∈ R^{2^l+1} :
x_i ≥ 0, i = 1, ..., 2^l, t ≤ (x₁x₂...x_{2^l})^{1/2^l}} – is CQr. To see this and to get a CQR, it suffices to iterate the
construction of Example 10. Indeed, let us add to our initial variables a number of additional x-variables:
– let us call our 2^l original x-variables the variables of level 0 and write x_{0,i} instead of x_i. Let us add
one new variable of level 1 per every two variables of level 0. Thus, we add 2^{l−1} variables x_{1,i} of level 1.
– similarly, let us add one new variable of level 2 per every two variables of level 1, thus adding 2^{l−2}
variables x_{2,i}; then we add one new variable of level 3 per every two variables of level 2, and so on, until
level l with a single variable x_{l,1} is built.
Now let us look at the following system S of constraints:
layer 1: x_{1,i} ≤ √(x_{0,2i−1} x_{0,2i}), x_{1,i}, x_{0,2i−1}, x_{0,2i} ≥ 0, i = 1, ..., 2^{l−1}
layer 2: x_{2,i} ≤ √(x_{1,2i−1} x_{1,2i}), x_{2,i}, x_{1,2i−1}, x_{1,2i} ≥ 0, i = 1, ..., 2^{l−2}
.................
layer l: x_{l,1} ≤ √(x_{l−1,1} x_{l−1,2}), x_{l,1}, x_{l−1,1}, x_{l−1,2} ≥ 0
(∗) t ≤ x_{l,1}
The inequalities of the first layer say that the variables of the zero and the first level should be nonnegative
and every one of the variables of the first level should be ≤ the geometric mean of the corresponding pair
of our original x-variables. The inequalities of the second layer add the requirement that the variables
of the second level should be nonnegative, and every one of them should be ≤ the geometric mean of
the corresponding pair of the first level variables, etc. It is clear that if all these inequalities and (*)
are satisfied, then t is ≤ the geometric mean of x1 , ..., x2l . Vice versa, given nonnegative x1 , ..., x2l and
a real t which is ≤ the geometric mean of x1 , ..., x2l , we always can extend these data to a solution of
S. In other words, K^{2^l} is the projection of the solution set of S onto the plane of our original variables
x₁, ..., x_{2^l}, t. It remains to note that the set of solutions of S is CQr (as the intersection of CQr sets
{(v, p, q, r) ∈ R^N × R³₊ : r ≤ √(pq)}, see Example 9), so that its projection is also CQr. To get a CQR of
K^{2^l}, it suffices to replace the inequalities in S with their conic quadratic equivalents, explicitly given in
Example 9.
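A compact sketch of this “tower” construction, assuming the cvxpy package: each two-variable step s ≤ √(ab) is written as the c.q.i. ‖(a − b; 2s)‖₂ ≤ a + b together with s ≥ 0 (an equivalent rescaling of (2.3.29)):

```python
import numpy as np
import cvxpy as cp

def geomean_tower(values):
    """Layer the 2-variable geometric-mean c.q.i. to bound (x1...x_{2^l})^(1/2^l).
    Assumes len(values) is a power of two."""
    level, cons = [cp.Constant(v) for v in values], []
    while len(level) > 1:
        nxt = []
        for a, b in zip(level[0::2], level[1::2]):
            s = cp.Variable(nonneg=True)   # encodes s <= sqrt(a b), a, b >= 0
            cons.append(cp.norm(cp.hstack([a - b, 2 * s]), 2) <= a + b)
            nxt.append(s)
        level = nxt
    return level[0], cons

xs = np.array([1.0, 2.0, 4.0, 8.0])            # 2^2 fixed nonnegative numbers
top, cons = geomean_tower(xs)
t = cp.Variable()
cp.Problem(cp.Maximize(t), cons + [t <= top]).solve()
print(t.value, xs.prod() ** (1.0 / len(xs)))   # both ~ 2.8284
```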
12. The convex increasing power function x₊^{p/q} of rational degree p/q ≥ 1 is CQr.
Indeed, given positive integers p, q, p > q, let us choose the smallest integer l such that p ≤ 2^l, and
consider the CQr set
K^{2^l} = {(y₁, ..., y_{2^l}, s) ∈ R^{2^l+1}₊ : s ≤ (y₁y₂...y_{2^l})^{1/2^l}}. (2.3.30)
Setting r = 2^l − p, consider the following affine parameterization of the variables from R^{2^l+1} by two
variables ξ, t:
– s and the r first variables y_i are all equal to ξ (note that we still have 2^l − r = p ≥ q “unused” variables
y_i);
– the q next variables y_i are all equal to t;
– the remaining y_i’s, if any, are all equal to 1.
The inverse image of K^{2^l} under this mapping is CQr, and it is the set
K = {(ξ, t) ∈ R²₊ : ξ^{1−r/2^l} ≤ t^{q/2^l}} = {(ξ, t) ∈ R²₊ : t ≥ ξ^{p/q}}.
It remains to note that the epigraph of x₊^{p/q} can be obtained from the CQr set K by operations preserving
the CQr property. Specifically, the set L = {(x, ξ, t) ∈ R³ : ξ ≥ 0, ξ ≥ x, t ≥ ξ^{p/q}} is the intersection
of K × R and the half-space {(x, ξ, t) : ξ ≥ x} and thus is CQr along with K, and Epi{x₊^{p/q}} is the
projection of the CQr set L on the plane of the x, t-variables.
13. The decreasing power function g(x) = { x^{−p/q}, x > 0; +∞, x ≤ 0 } (p, q positive integers) is CQr.
Same as in Example 12, we choose the smallest integer l such that 2^l ≥ p + q, consider the CQr set
(2.3.30) and parameterize affinely the variables y_i, s by two variables (x, t) as follows:
– s and the first 2^l − p − q of the y_i’s are all equal to one;
– p of the remaining y_i’s are all equal to x, and the q last of the y_i’s are all equal to t.
It is immediately seen that the inverse image of K^{2^l} under the indicated affine mapping is the epigraph
of g.
14. The even power function g(x) = x^{2p} on the axis (p positive integer) is CQr.
Indeed, we already know that the sets P = {(x, ξ, t) ∈ R³ : x² ≤ ξ} and Q = {(x, ξ, t) ∈ R³ : 0 ≤ ξ, ξ^p ≤
t} are CQr (both sets are direct products of R and sets with already known CQR’s). It remains
to note that the epigraph of g is the projection of P ∩ Q onto the (x, t)-plane.
Example 14 along with our combination rules allows us to build a CQR for a polynomial p(x) of the
form
$$ p(x) = \sum_{l=1}^L p_lx^{2l},\qquad x\in\mathbf R, $$
with nonnegative coefficients p_l.
15. The concave monomial x₁^{π₁}...x_n^{π_n}. Let π₁ = p₁/p, ..., π_n = p_n/p be positive rational numbers
with π₁ + ... + π_n ≤ 1. The function
f(x) = −x₁^{π₁}...x_n^{π_n} : R^n₊ → R
is CQr.
The construction is similar to the one of Example 12. Let l be such that 2^l ≥ p. We recall that the set
Y = {(y₁, ..., y_{2^l}, s) : y₁, ..., y_{2^l} ≥ 0, 0 ≤ s ≤ (y₁...y_{2^l})^{1/2^l}}
is CQr, and therefore so is its inverse image under the affine mapping
(x₁, ..., x_n, s) ↦ (x₁, ..., x₁ [p₁ times], x₂, ..., x₂ [p₂ times], ..., x_n, ..., x_n [p_n times], s, ..., s [2^l − p times], 1, ..., 1 [p − p₁ − ... − p_n times], s),
i.e., the set
Z = {(x₁, ..., x_n, s) : x ≥ 0, 0 ≤ s ≤ x₁^{π₁}...x_n^{π_n}}.
Consider now the set
Z′ = {(x₁, ..., x_n, t, s) : s ≥ 0, (x₁, ..., x_n, s − t) ∈ Z},
which is the intersection of the half-space {s ≥ 0} and the inverse image of Z under the affine mapping
(x₁, ..., x_n, t, s) ↦ (x₁, ..., x_n, s − t). It remains to note that the epigraph of f is the projection of Z′ onto
the plane of the variables x₁, ..., x_n, t.
16. The convex monomial x₁^{−π₁}...x_n^{−π_n}. Let π₁, ..., π_n be positive rational numbers. The function
f(x) = x₁^{−π₁}...x_n^{−π_n} : {x ∈ R^n : x > 0} → R
is CQr.
The verification is completely similar to the one in Example 15.
17. The p-norm ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p} : R^n → R (p ≥ 1 is a rational number). We claim that
the function ‖x‖_p is CQr.
It is immediately seen that
‖x‖_p ≤ t ⇔ t ≥ 0 & ∃ v₁, ..., v_n ≥ 0 : |x_i| ≤ t^{(p−1)/p} v_i^{1/p}, i = 1, ..., n, Σ_{i=1}^n v_i ≤ t. (2.3.31)
Indeed, if the indicated v_i exist, then Σ_{i=1}^n |x_i|^p ≤ t^{p−1} Σ_{i=1}^n v_i ≤ t^p, i.e., ‖x‖_p ≤ t. Vice versa,
assume that ‖x‖_p ≤ t. If t = 0, then x = 0, and the right hand side relations in (2.3.31)
are satisfied for v_i = 0, i = 1, ..., n. If t > 0, we can satisfy these relations by setting
v_i = |x_i|^p t^{1−p}.
(2.3.31) says that the epigraph of ‖x‖_p is the projection onto the (x, t)-plane of the set of
solutions to the system of inequalities
t ≥ 0
v_i ≥ 0, i = 1, ..., n
x_i ≤ t^{(p−1)/p} v_i^{1/p}, i = 1, ..., n
−x_i ≤ t^{(p−1)/p} v_i^{1/p}, i = 1, ..., n
v₁ + ... + v_n ≤ t
Each of these inequalities defines a CQr set (in particular, for the nonlinear inequalities this
is due to Example 15). Thus, the solution set of the system is CQr (as an intersection of
finitely many CQr sets), whence its projection on the (x, t)-plane – i.e., the epigraph of ‖x‖_p
– is CQr.
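The “vice versa” choice v_i = |x_i|^p t^{1−p} is easy to verify numerically; a plain-numpy spot check of (2.3.31) at t = ‖x‖_p (an illustration, not a proof):

```python
import numpy as np

# For t = ||x||_p and v_i = |x_i|^p t^(1-p):
#   t^((p-1)/p) v_i^(1/p) = |x_i|   and   sum_i v_i = t,
# so the right hand side of (2.3.31) holds (with equalities).
rng = np.random.default_rng(4)
p = 3.0
x = rng.standard_normal(6)
t = np.linalg.norm(x, p)
v = np.abs(x) ** p * t ** (1.0 - p)
assert np.allclose(t ** ((p - 1.0) / p) * v ** (1.0 / p), np.abs(x))
assert np.isclose(v.sum(), t)
```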
17b. The function ‖x₊‖_p = (Σ_{i=1}^n max^p[x_i, 0])^{1/p} : R^n → R (p ≥ 1 a rational number) is
CQr.
Indeed,
t ≥ ‖x₊‖_p ⇔ ∃ y₁, ..., y_n : 0 ≤ y_i, x_i ≤ y_i, i = 1, ..., n, ‖y‖_p ≤ t.
Thus, the epigraph of ‖x₊‖_p is a projection of the CQr set (see Example 17) given by the
system of inequalities in the right hand side.
From the above examples it is seen that the “expressive abilities” of c.q.i.’s are indeed strong:
they allow one to handle a wide variety of quite different functions and sets.
2.4 More applications: Robust Linear Programming
In real world applications the data c, A, b of (LP) is not always known exactly; what is typically
known is a domain U in the space of data – an “uncertainty set” – which for sure contains the
“actual” (unknown) data. There are cases in reality where, in spite of this data uncertainty,
our decision x must satisfy the “actual” constraints, whether we know them or not. Assume,
e.g., that (LP) is a model of a technological process in Chemical Industry, so that entries of
x represent the amounts of different kinds of materials participating in the process. Typically
the process includes a number of decomposition-recombination stages. A model of this problem
must take care of natural balance restrictions: the amount of every material to be used at a
particular stage cannot exceed the amount of the same material yielded by the preceding stages.
In a meaningful production plan, these balance inequalities must be satisfied even though they
involve coefficients affected by unavoidable uncertainty of the exact contents of the raw materials,
of time-varying parameters of the technological devices, etc.
If indeed all we know about the data is that they belong to a given set U, but we still have
to satisfy the actual constraints, the only way to meet the requirements is to restrict ourselves
to robust feasible candidate solutions – those satisfying all possible realizations of the uncertain
constraints, i.e., vectors x such that
Ax − b ≥ 0 ∀(c, A, b) ∈ U.
In order to choose among these robust feasible solutions the best possible, we should decide how
to “aggregate” the various realizations of the objective into a single “quality characteristic”. To
be methodologically consistent, we use the same worst-case-oriented approach and take as the
objective function f(x) the maximum, over all possible realizations, of the objective c^Tx:
f(x) = sup{ c^Tx : (c, A, b) ∈ U }.
With this methodology, we can associate with our uncertain LP program (i.e., with the family
$$ \mathrm{LP}(\mathcal U) = \left\{\ \min_{x:Ax-b\ge 0} c^Tx\ :\ (c,A,b)\in\mathcal U\right\} $$
of all usual (“certain”) LP programs with the data belonging to U) its robust counterpart. In
the latter problem we seek a robust feasible solution with the smallest possible value
of the “guaranteed objective” f(x). In other words, the robust counterpart of LP(U) is the
optimization problem
$$ \min_{t,x}\left\{ t : c^Tx\le t,\ Ax - b\ge 0\ \ \forall(c,A,b)\in\mathcal U\right\}. \qquad \mathrm{(R)} $$
Note that (R) is a usual – “certain” – optimization problem, but typically it is not an LP
program: the structure of (R) depends on the geometry of the uncertainty set U and can be
very complicated.
As we shall see in a while, in many cases it is reasonable to specify the uncertainty set
U as an ellipsoid – the image of the unit Euclidean ball under an affine mapping – or, more
generally, as a CQr set. In this case the robust counterpart of an
uncertain LP problem is (equivalent to) an explicit conic quadratic program. Thus, Robust
Linear Programming with CQr uncertainty sets can be viewed as a “generic source” of conic
quadratic problems.
Let us look at the robust counterpart of an uncertain LP program
$$ \left\{\ \min_x\left\{ c^Tx : a_i^Tx - b_i\ge 0,\ i = 1,\dots,m\right\}\ :\ (c,A,b)\in\mathcal U\right\} $$
in the case of a “simple” ellipsoidal uncertainty – one where the data (a_i, b_i) of the i-th inequality
constraint
a_i^Tx − b_i ≥ 0,
and the objective c, are allowed to run independently of each other through respective ellipsoids
E_i, E. Thus, we assume that the uncertainty set is
$$ \mathcal U = \left\{(a_1,b_1;\dots;a_m,b_m;c) : \exists\left(\{u_i\}_{i=0}^m:\ u_i^Tu_i\le 1\right) : c = c_* + P_0u_0,\ \begin{pmatrix}a_i\\ b_i\end{pmatrix} = \begin{pmatrix}a_i^*\\ b_i^*\end{pmatrix} + P_iu_i,\ i = 1,\dots,m\right\}, $$
where c_*, a_i^*, b_i^* are the “nominal data” and P_iu_i, i = 0, 1, ..., m, represent the data perturbations;
the restrictions u_i^Tu_i ≤ 1 force these perturbations to vary in ellipsoids.
In order to realize that the robust counterpart of our uncertain LP problem is a conic
quadratic program, note that x is robust feasible if and only if for every i = 1, ..., m we have
$$ 0\ \le\ \min_{u_i:u_i^Tu_i\le 1}\left\{ a_i[u]^Tx - b_i[u] : \begin{pmatrix}a_i[u]\\ b_i[u]\end{pmatrix} = \begin{pmatrix}a_i^*\\ b_i^*\end{pmatrix} + P_iu_i\right\} = (a_i^*)^Tx - b_i^* + \min_{u_i:u_i^Tu_i\le 1} u_i^TP_i^T\begin{pmatrix}x\\ -1\end{pmatrix} = (a_i^*)^Tx - b_i^* - \left\|P_i^T\begin{pmatrix}x\\ -1\end{pmatrix}\right\|_2. $$
Similarly, a pair (x, t) satisfies all realizations of the inequality c^Tx ≤ t “allowed” by our ellipsoidal
uncertainty set U if and only if
c_*^Tx + ‖P₀^Tx‖₂ ≤ t.
Thus, the robust counterpart (R) becomes the conic quadratic program
$$ \min_{x,t}\left\{ t : \|P_0^Tx\|_2 \le -c_*^Tx + t;\ \ \left\|P_i^T\begin{pmatrix}x\\ -1\end{pmatrix}\right\|_2 \le [a_i^*]^Tx - b_i^*,\ i = 1,\dots,m\right\} \qquad \mathrm{(RLP)} $$
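In modeling-language form, the i-th robust constraint is just the nominal one “shrunk” by a norm term. A sketch for a single uncertain constraint a^Tx ≥ b with a = a* + Pu, u^Tu ≤ 1 (hypothetical nominal data; assumes the cvxpy package):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n = 3
a_star = np.array([1.0, 2.0, 3.0])        # nominal constraint coefficients
P = 0.1 * rng.standard_normal((n, n))     # shape of the perturbation ellipsoid
c = np.ones(n)

# Robust version of a^T x >= 1 for all a = a_star + P u, ||u||_2 <= 1:
#     a_star^T x - ||P^T x||_2 >= 1    (a conic quadratic inequality)
x = cp.Variable(n, nonneg=True)
prob = cp.Problem(cp.Minimize(c @ x),
                  [a_star @ x - cp.norm(P.T @ x, 2) >= 1.0])
prob.solve()
print(x.value, prob.value)
```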
1. The directional distribution of energy sent by the antenna can be described in terms of
the antenna’s diagram, which is a complex-valued function D(δ) of a 3D direction δ; the
energy sent in a direction δ is proportional to |D(δ)|².
104 LECTURE 2. CONIC QUADRATIC PROGRAMMING
2. When the antenna is comprised of several antenna elements with diagrams D1 (δ),..., Dk (δ),
the diagram of the antenna is just the sum of the diagrams of the elements.
In a typical Antenna Design problem, we are given several antenna elements with diagrams
D1 (δ),...,Dk (δ) and are allowed to multiply these diagrams by complex weights xi (which in
reality corresponds to modifying the output powers and shifting the phases of the elements). As
a result, we can obtain, as a diagram of the array, any function of the form
k
X
D(δ) = xi Di (δ),
i=1
and our goal is to find the weights xi which result in a diagram as close as possible, in a prescribed
sense, to a given “target diagram” D∗ (δ).
Consider an example of a planar antenna comprised of a central circle and 9 concentric
rings of the same area as the circle (Fig. 2.1.(a)) in the XY -plane (“Earth’s surface”). Let the
wavelength be λ = 50cm, and the outer radius of the outer ring be 1 m (twice the wavelength).
One can easily see that the diagram of a ring {a ≤ r ≤ b} in the plane XY (r is the distance
from a point to the origin) as a function of a 3-dimensional direction δ depends on the altitude
(the angle θ between the direction and the plane) only. The resulting function of θ turns out to
be real-valued, and its analytic expression is
Zb Z2π
1
r cos 2πrλ−1 cos(θ) cos(φ) dφ dr.
Da,b (θ) =
2
a 0
Our design problem is simplified considerably by the fact that the diagrams of our “building
blocks” and the target diagram are real-valued; thus, we need no complex numbers, and the
problem we should finally solve is
10
X
min τ : −τ ≤ D∗ (θ` ) − xj Dj (θ` ) ≤ τ, ` = 1, ..., 120 . (Nom)
τ ∈R,x∈R10
j=1
This is a simple LP program; its optimal solution x∗ results in the diagram depicted at Fig.
2.1.(c). The uniform distance between the actual and the target diagrams is ≈ 0.0621 (recall
that the target diagram varies from 0 to 1).
2.4. MORE APPLICATIONS: ROBUST LINEAR PROGRAMMING 105
0.25 1.2
0.2 1
0.15 0.8
0.1 0.6
0.05 0.4
0 0.2
−0.05 0
−0.1 −0.2
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
Now recall that our design variables are characteristics of certain physical devices. In reality,
of course, we cannot tune the devices to have precisely the optimal characteristics x∗j ; the best
we may hope for is that the actual characteristics xfct ∗
j will coincide with the desired values xj
within a small margin, say, 0.1% (this is a fairly high accuracy for a physical device):
∗
xfct
j = pj xj , 0.999 ≤ pj ≤ 1.001.
It is natural to assume that the factors pj are random with the mean value equal to 1; it is
perhaps not a great sin to assume that these factors are independent of each other.
Since the actual weights differ from their desired values x∗j , the actual (random) diagram of
our array of antennae will differ from the “nominal” one we see on Fig.2.1.(c). How large could
be the difference? Look at the picture:
1.2 8
6
1
4
0.8
2
0.6
0.4
−2
0.2
−4
0
−6
−0.2 −8
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
“Dream and reality”: the nominal (left, solid) and an actual (right, solid) diagrams
[dashed: the target diagram]
The diagram shown to the right is not even the worst case: we just have taken as pj a sample
of 10 independent numbers distributed uniformly in [0.999, 1.001] and have plotted the diagram
corresponding to xj = pj x∗j . Pay attention not only to the shape (completely opposite to what
we need), but also to the scale: the target diagram varies from 0 to 1, and the nominal diagram
(the one corresponding to the exact optimal xj ) differs from the target by no more than by
0.0621 (this is the optimal value in the “nominal” problem (Nom)). The actual diagram varies
from ≈ −8 to ≈ 8, and its uniform distance from the target is 7.79 (125 times the nominal
optimal value!). We see that our nominal optimal design is completely meaningless: it looks as
if we were trying to get the worse possible result, not the best possible one...
106 LECTURE 2. CONIC QUADRATIC PROGRAMMING
How could we get something better? Let us try to apply the Robust Counterpart approach.
To this end we take into account from the very beginning that if we want the amplification
coefficients to be certain xj , then the actual amplification coefficients will be xfct
j = pj xj , 0.999 ≤
pj ≤ 1.001, and the actual discrepancies will be
10
X
δ` (x) = D∗ (θ` ) − pj xj Dj (θ` ).
j=1
Thus, we in fact are solving an uncertain LP problem where the uncertainty affects the coeffi-
cients of the constraint matrix (those corresponding to the variables xj ): these coefficients may
vary within 0.1% margin of their nominal values.
In order to apply to our uncertain LP program the Robust Counterpart approach, we should
specify the uncertainty set U. The most straightforward way is to say that our uncertainty is “an
interval” one – every uncertain coefficient in a given inequality constraint may (independently
of all other coefficients) vary through its own uncertainty segment “nominal value ±0.1%”. This
approach, however, is too conservative: we have completely ignored the fact that our pj ’s are of
stochastic nature and are independent of each other, so that it is highly improbable that all of
them will simultaneously fluctuate in “dangerous” directions. In order to utilize the statistical
independence of perturbations, let us look what happens with a particular inequality
10
X
− τ ≤ δ` (x) ≡ D∗ (θ` ) − pj xj Dj (θ` ) ≤ τ (2.4.2)
j=1
when pj ’s are random. For a fixed x, the quantity δ` (x) is a random variable with the mean
10
δ`∗ (x) = D∗ (θ` ) −
X
xj Dj (θ` )
j=1
Thus, “a typical value” of δ` (x) differs from δ`∗ (x) by a quantity of order of σ` (x). Now let
us act as an engineer which believes that a random variable differs from its mean by at most
three times its standard deviation; since we are not obliged to be that concrete, let us choose
a “safety parameter” ω and ignore all events which result in |δ` (x) − δ`∗ (x)| > ων` (x) 5) . As
for the remaining events – those with |δ` (x) − δ`∗ (x)| ≤ ων` (x) – we take upon ourselves full
responsibility. With this approach, a “reliable deterministic version” of the uncertain constraint
(2.4.2) becomes the pair of inequalities
Replacing all uncertain inequalities in (Nom) with their “reliable deterministic versions” and
recalling the definition of δ`∗ (x) and ν` (x), we end up with the optimization problem
minimize τ
s.t.
10
kQ` xk2 ≤ [D∗ (θ` ) −
P
xj Dj (θ` )] + τ, ` = 1, ..., 120
j=1 (Rob)
10
kQ` xk2 ≤ −[D∗ (θ` ) −
P
xj Dj (θ` )] + τ, ` = 1, ..., 120
j=1
[Q` = ωκDiag(D1 (θ` ), D2 (θ` ), ..., D10 (θ` ))]
It is immediately seen that (Rob) is nothing but the robust counterpart of (Nom) corresponding
to a simple ellipsoidal uncertainty, namely, the one as follows:
(all constraints in (Nom) are of this form) affected by the uncertainty are the coeffi-
cients A`j of the left hand side, and the difference dA[`] between the vector of these
coefficients and the nominal value (D1 (θ` ), ..., D10 (θ` ))T of the vector of coefficients
belongs to the ellipsoid
Thus, the above “engineering reasoning” leading to (Rob) was nothing but a reasonable way to
specify the uncertainty ellipsoids!
where aj (x) are deterministic affine functions of the design vector x, and j are inde-
pendent random perturbations with zero means and such that |j | ≤ σj . Assume that
x satisfies the “reliable” version of (2.4.3), specifically, the deterministic constraint
q
a0 (x) − κ σ12 a21 (x) + ... + σn2 a2n (x) ≥ 0 (2.4.4)
Thus, ( )
γ 2σ2 κ2
p(κ) ≤ min exp{ − γκσ} = exp − .
γ>0 2 2
Now let us look what are the diagrams yielded by the Robust Counterpart approach – i.e.,
those given by the robust optimal solution. These diagrams are also random (neither the nominal
nor the robust solution cannot be implemented exactly!). However, it turns out that they are
incomparably closer to the target (and to each other) than the diagrams associated with the
optimal solution to the “nominal” problem. Look at a typical “robust” diagram:
1.2
0.8
0.6
0.4
0.2
−0.2
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
With the safety parameter ω = 1, the robust optimal value is 0.0817; although it is by 30%
larger than the nominal optimal value 0.0635, the robust optimal value has a definite advantage
that it indeed says something reliable about the quality of actual diagrams we can obtain when
implementing the robust optimal solution: in a sample of 40 realizations of the diagrams cor-
responding to the robust optimal solution, the uniform distances from the target were varying
from 0.0814 to 0.0830.
We have built the robust optimal solution under the assumption that the “implementation
errors” do not exceed 0.1%. What happens if in reality the errors are larger – say, 1%? It turns
out that nothing dramatic happens: now in a sample of 40 diagrams given by the “old” robust
optimal solution (affected by 10 times larger “implementation errors”) the uniform distances
from the target were varying from 0.0834 to 0.116. Imagine what will happen with the nominal
solution under the same circumstances...
The last issue to be addressed here is: why is the nominal solution so unstable? And why
with the robust counterpart approach we were able to get a solution which is incomparably
better, as far as “actual implementation” is concerned? The answer becomes clear when looking
at the nominal and the robust optimal weights:
j 1 2 3 4 5 6 7 8 9 10
xnom
j 1624.4 -14701 55383 -107247 95468 19221 -138622 144870 -69303 13311
xrob
j -0.3010 4.9638 -3.4252 -5.1488 6.8653 5.5140 5.3119 -7.4584 -8.9140 13.237
It turns out that the nominal problem is “ill-posed” – although its optimal solution is far away
from the origin, there is a “massive” set of “nearly optimal” solutions, and among the latter
ones we can choose solutions of quite moderate magnitude. Indeed, here are the optimal values
obtained when we add to the constraints of (Nom) the box constraints |xj | ≤ L, j = 1, ..., 10:
Since the “implementation inaccuracies” for a solution are the larger the larger the solution is,
there is no surprise that our “huge” nominal solution results in a very unstable actual design.
In contrast to this, the Robust Counterpart penalizes the (properly measured) magnitude of x
(look at the terms kQ` xk2 in the constraints of (Rob)) and therefore yields a much more stable
design. Note that this situation is typical for many applications: the nominal solution is on
the boundary of the nominal feasible domain, and there are “nearly optimal” solutions to the
nominal problem which are in the “deep interior” of this domain. When solving the nominal
problem, we do not take any care of a reasonable tradeoff between the “depth of feasibility”
and the optimality: any improvement in the objective is sufficient to make the solution just
marginally feasible for the nominal problem. And a solution which is only marginally feasible
in the nominal problem can easily become “very infeasible” when the data are perturbed. This
would not be the case for a “deeply interior” solution. With the Robust Counterpart approach,
we do use certain tradeoff between the “depth of feasibility” and the optimality – we are trying
to find something like the “deepest feasible nearly optimal solution”; as a result, we normally
gain a lot in stability; and if, as in our example, there are “deeply interior nearly optimal”
solutions, we do not loose that much in optimality.
Example 2: NETLIB Case Study. NETLIB is a collection of about 100 not very large LPs,
mostly of real-world origin, used as the standard benchmark for LP solvers. In the study to
110 LECTURE 2. CONIC QUADRATIC PROGRAMMING
be described, we used this collection in order to understand how “stable” are the feasibility
properties of the standard – “nominal” – optimal solutions with respect to small uncertainty in
the data. To motivate the methodology of this “Case Study”, here is the constraint # 372 of
the problem PILOT4 from NETLIB:
aT x ≡ −15.79081x826 − 8.598819x827 − 1.88789x828 − 1.362417x829 − 1.526049x830
−0.031883x849 − 28.725555x850 − 10.792065x851 − 0.19004x852 − 2.757176x853
−12.290832x854 + 717.562256x855 − 0.057865x856 − 3.785417x857 − 78.30661x858
−122.163055x859 − 6.46609x860 − 0.48371x861 − 0.615264x862 − 1.353783x863 (C)
−84.644257x864 − 122.459045x865 − 43.15593x866 − 1.712592x870 − 0.401597x871
+x880 − 0.946049x898 − 0.946049x916
≥ b ≡ 23.387405
The related nonzero coordinates in the optimal solution x∗ of the problem, as reported by CPLEX
(one of the best commercial LP solvers), are as follows:
x∗826 = 255.6112787181108 x∗827 = 6240.488912232100 x∗828 = 3624.613324098961
x∗829 = 18.20205065283259 x∗849 = 174397.0389573037 x∗870 = 14250.00176680900
x∗871 = 25910.00731692178 x∗880 = 104958.3199274139
The indicated optimal solution makes (C) an equality within machine precision.
Observe that most of the coefficients in (C) are “ugly reals” like -15.79081 or -84.644257.
We have all reasons to believe that coefficients of this type characterize certain technological
devices/processes, and as such they could hardly be known to high accuracy. It is quite natural
to assume that the “ugly coefficients” are in fact uncertain – they coincide with the “true” values
of the corresponding data within accuracy of 3-4 digits, not more. The only exception is the
coefficient 1 of x880 – it perhaps reflects the structure of the problem and is therefore exact –
“certain”.
Assuming that the uncertain entries of a are, say, 0.1%-accurate approximations of unknown
entries of the “true” vector of coefficients ã, we looked what would be the effect of this uncertainty
on the validity of the “true” constraint ãT x ≥ b at x∗ . Here is what we have found:
• The minimum (over all vectors of coefficients ã compatible with our “0.1%-uncertainty
hypothesis”) value of ãT x∗ − b, is < −104.9; in other words, the violation of the constraint can
be as large as 450% of the right hand side!
• Treating the above worst-case violation as “too pessimistic” (why should the true values of
all uncertain coefficients differ from the values indicated in (C) in the “most dangerous” way?),
consider a more realistic measure of violation. Specifically, assume that the true values of the
uncertain coefficients in (C) are obtained from the “nominal values” (those shown in (C)) by
random perturbations aj 7→ ãj = (1 + ξj )aj with independent and, say, uniformly distributed
on [−0.001, 0.001] “relative perturbations” ξj . What will be a “typical” relative violation
max[b − ãT x∗ , 0]
V = × 100%
b
of the “true” (now random) constraint ãT x ≥ b at x∗ ? The answer is nearly as bad as for the
worst scenario:
Prob{V > 0} Prob{V > 150%} Mean(V )
0.50 0.18 125%
Table 2.1. Relative violation of constraint # 372 in PILOT4
(1,000-element sample of 0.1% perturbations of the uncertain data)
2.4. MORE APPLICATIONS: ROBUST LINEAR PROGRAMMING 111
We see that quite small (just 0.1%) perturbations of “obviously uncertain” data coefficients can
make the “nominal” optimal solution x∗ heavily infeasible and thus – practically meaningless.
Inspired by this preliminary experiment, we have carried out the “diagnosis” and the “treat-
ment” phases as follows.
“Diagnosis”. Given a “perturbation level” (for which we have used the values 1%, 0.1%,
0.01%), for every one of the NETLIB problems, we have measured its “stability index” at this
perturbation level, specifically, as follows.
aT x ≤ b
of the program,
• We looked at the right hand side coefficients aj and split them into “certain” – those
which can be represented, within machine accuracy, as rational fractions p/q with
|q| ≤ 100, and “uncertain” – all the rest. Let J be the set of all uncertain coefficients
of the constraint under consideration.
• We defined the reliability index of the constraint as the quantity
rP
aT x∗ + a2j (x∗j )2 − b
j∈J
× 100% (I)
max[1, |b|]
rP
Note that the quantity a2j (x∗j )2 , as we remember from the Antenna story, is
j∈J
of order of typical difference between aT x∗ and ãT x∗ , where ã is obtained from a
by random perturbation aj 7→ ãj = pj aj of uncertain coefficients, with independent
random pj uniformly distributed in the segment [−, ]. In other words, the reliability
index is of order of typical violation (measured in percents of the right hand side)
of the constraint, as evaluated at x∗ , under independent random perturbations of
uncertain coefficients, being the relative magnitude of the perturbations.
3. We treat the nominal solution as unreliable, and the problem - as bad, the level of per-
turbations being , if the worst, over the inequality constraints, reliability index of the
constraint is worse than 5%.
The results of the Diagnosis phase of our Case Study were as follows. From the total of 90
NETLIB problems we have processed,
• in 27 problems the nominal solution turned out to be unreliable at the largest ( = 1%)
level of uncertainty;
• 19 of these 27 problems are already bad at the 0.01%-level of uncertainty, and in 13 of
these 19 problems, 0.01% perturbations of the uncertain data can make the nominal solution
more than 50%-infeasible for some of the constraints.
The details are given in Table 2.2.
112 LECTURE 2. CONIC QUADRATIC PROGRAMMING
“Treatment”. At the treatment phase of our Case Study, we used the Robust Counterpart
methodology, as outlined in Example 1, to pass from “unreliable” nominal solutions of bad
NETLIB problems to “uncertainty-immunized” robust solutions. The primary goals here were to
understand whether “treatment” is at all possible (the Robust Counterpart may happen to be
infeasible) and how “costly” it is – by which margin the robust solution is worse, in terms of
2.4. MORE APPLICATIONS: ROBUST LINEAR PROGRAMMING 113
the objective, than the nominal solution. The answers to both these questions turned out to be
quite encouraging:
• Reliable solutions do exist, except for the four cases corresponding to the highest ( = 1%)
uncertainty level (see the right column in Table 2.3).
• The price of immunization in terms of the objective value is surprisingly low: when ≤
0.1%, it never exceeds 1% and it is less than 0.1% in 13 of 23 cases. Thus, passing to the
robust solutions, we gain a lot in the ability of the solution to withstand data uncertainty,
while losing nearly nothing in optimality.
Table 2.3. Objective values for nominal and robust solutions to bad NETLIB problems
where A(ζ, u) is an affine mapping and K is a direct product of ice-cream cones. Assume,
further, that the above CQR of U is strictly feasible:
Then the robust counterpart of LP(U) is equivalent to an explicit conic quadratic problem.
Proof. Introducing an additional variable t and denoting by z = (t, x) the extended vector of
design variables, we can write down the instances of our uncertain LP in the form
n o
min dT z : αiT (ζ)z − βi (ζ) ≥ 0, i = 1, ..., m + 1 (LP[ζ])
z
are affine in the data vector ζ. The robust counterpart of our uncertain LP is the optimization
program n o
min dT z → min : αiT (ζ)z − βi (ζ) ≥ 0 ∀ζ ∈ U ∀i = 1, ..., m + 1 . (RCini )
z
Let us fix i and ask ourselves what does it mean that a vector z satisfies the infinite system of
linear inequalities
αiT (ζ)z − βi (ζ) ≥ 0 ∀ζ ∈ U. (Ci )
Clearly, a given vector z possesses this property if and only if the optimal value in the optimiza-
tion program n o
min τ : τ ≥ αiT (ζ)z − βi (ζ), ζ ∈ U
τ,ζ
is nonnegative. Recalling the definition of U, we see that the latter problem is equivalent to the
conic quadratic program
min τ : τ ≥ αiT (ζ)z − βi (ζ) ≡ [Ai ζ + ai ]T z − [bTi ζ + ci ], A(ζ, u) ≡ P ζ + Qu + r ≥K 0
τ,ζ
| {z } | {z }
αi (ζ)
βi (ζ)
(CQi [z])
in variables τ, ζ, u. Thus, z satisfies (Ci ) if and only if the optimal value in (CQi [z]) is nonneg-
ative.
Since by assumption the system of conic quadratic inequalities A(ζ, u) ≥K 0 is strictly
feasible, the conic quadratic program (CQi [z]) is strictly feasible. By the Conic Duality Theorem,
if (a) the optimal value in (CQi [z]) is nonnegative, then (b) the dual to (CQi [z]) problem admits
a feasible solution with a nonnegative value of the dual objective. By Weak Duality, (b) implies
2.4. MORE APPLICATIONS: ROBUST LINEAR PROGRAMMING 115
(a). Thus, the fact that the optimal value in (CQi [z]) is nonnegative is equivalent to the fact
that the dual problem admits a feasible solution with a nonnegative value of the dual objective:
z satisfies (Ci )
m
Opt(CQi [z]) ≥ 0
m
N
∃λ ∈ R, ξ ∈ R (N is the dimension of K):
λ[aTi z − ci ] − ξ T r ≥ 0,
λ = 1,
−λATi z + bi + P T ξ = 0,
QT ξ = 0,
λ ≥ 0,
ξ ≥K 0.
m
N :
∃ξ ∈ R
T T
ai z − ci − ξ r ≥ 0
−ATi z + bi + P T ξ = 0,
QT ξ = 0,
ξ ≥K 0.
z satisfies (Ci )
m
∃ξ ∈ RN :
aTi z − ci − ξ T r ≥ 0,
−ATi z + bi + P T ξ = 0,
QT ξ = 0,
ξ ≥K 0.
Consequently, the set of robust feasible z – those satisfying (Ci ) for all i = 1, ..., m + 1 – is CQr
(as the intersection of finitely many CQr sets), whence the robust counterpart of our uncertain
LP, being the problem of minimizing a linear objective over a CQr set, is equivalent to a conic
quadratic problem. Here is this problem:
minimize dT z
aTi z − ci − ξiT r ≥ 0,
−AT z + b + P T ξ = 0,
i i i
T ξ = 0, , i = 1, ..., m + 1
Q i
ξi ≥K 0
with design variables z, ξ1 , ..., ξm+1 . Here Ai , ai , ci , bi come from the affine functions αi (ζ) =
Ai ζ + ai and βi (ζ) = bTi ζ + ci , while P, Q, r come from the description of U:
U = {ζ : ∃u : P ζ + Qu + r ≥K 0}.
116 LECTURE 2. CONIC QUADRATIC PROGRAMMING
Remark 2.4.1 Looking at the proof of Proposition 2.4.2, we see that the assumption that
the uncertainty set U is CQr plays no crucial role. What indeed is important is that U is the
projection on the ζ-space of the solution set of a strictly feasible conic inequality associated with
certain cone K. Whenever this is the case, the above construction demonstrates that the robust
counterpart of LP(U) is a conic problem associated with the cone which is a direct product of
several cones dual to K. E.g., when the uncertainty set is polyhedral (i.e., it is given by finitely
many scalar linear inequalities: K = Rm + ), the robust counterpart of LP(U) is an explicit LP
program (and in this case we can eliminate the assumption that the conic inequality defining
U is strictly feasible (why?)). Consider, e.g., an uncertain LP with interval uncertainty in the
data:
n |cj − c∗j | ≤ j , j = 1, ..., n
o
T
min c x : Ax ≥ b : Aij ∈ [A∗ij − ij , A∗ij + ij ], i = 1, ..., m, j = 1, ..., n .
x
|bi − b∗i | ≤ δi , i = 1, ..., m
(why ?)
where K is a direct product of ice-cream cones and A is a matrix with trivial null space. The
optimal value of the problem clearly is a function of the data (c, A, b) of the problem. What can
be said about CQ-representability of this function? In general, not much: the function is not
even convex. There are, however, two modifications of our question which admit good answers.
Namely, under mild regularity assumptions
(a) With c, A fixed, the optimal value is a CQ-representable function of the right hand side
vector b;
(b) with A, b fixed, the minus optimal value is a CQ-representable function of c.
Here are the exact forms of our claims:
Proposition 2.4.3 Let c, A be fixed, and let B be a CQr set in Rdim b such that for every b ∈ B
problem (2.4.5) is strictly feasible. Then the optimal value of the problem is a CQr function on
B.
The statement is quite evident: if b is such that (2.4.5) is strictly feasible, then the optimal value
Opt(b) in the problem is either −∞, or is achieved (by Conic Duality Theorem). In both cases,
(
cT x ≤ t,
Opt(b) ≤ t ⇔ ∃x : ,
Ax − b ≥K 0
2.4. MORE APPLICATIONS: ROBUST LINEAR PROGRAMMING 117
which is, essentially, a CQR for certain function which coincides with Opt(b) on the set B of
values of b; in this CQR, b, t are the “variables of interest”, and x plays the role of the additional
variable. The CQR of the function Opt(b) with the domain B is readily given, via calculus of
CQR’s, by the representation
cT x ≤ t,
{b ∈ B&Opt(b) ≤ t} ⇔ ∃x : Ax − b ≥K 0,
b∈B
Proposition 2.4.4 Let A, b be such that (2.4.5) is strictly feasible. Then the minus optimal
value −Opt(c) of the problem is a CQr function of c.
Proof. For every c and t, the relation −Opt(c) ≤ t says exactly that (2.4.5) is below bounded
with the optimal value ≥ −t. By the Conic Duality Theorem (note that (2.4.5) is strictly
feasible!) this is the case if and only if the dual problem admits a feasible solution with the
value of the dual objective ≥ −t, so that
T
b y ≥ t,
Opt(c) ≤ t ⇔ ∃y : AT y = c,
y ≥ 0.
K
The resulting description of the epigraph of the function −Opt(c) is a CQR for this function,
with c, t playing the role of the “variables of interest” and y being the additional variable.
A careful reader could have realized that Proposition 2.4.2 is nothing but a straightforward
application of Proposition 2.4.4.
Remark 2.4.2 Same as in the case of Proposition 2.4.2, in Propositions 2.4.3, 2.4.4 the assump-
tion that uncertainty set U is CQr plays no crucial role. Thus, Propositions 2.4.3, 2.4.4 remain
valid for an arbitrary conic program, up to the fact that in this general case we should speak about
the representability of the epigraphs of Opt(b) and −Opt(c) via conic inequalities associated with
direct products of cones K, and their duals, rather than about CQ-representability. In particular,
Propositions 2.4.2, 2.4.3, 2.4.4 remain valid in the case of Semidefinite representability to be
discussed in Lecture 3.
• There are situations in dynamical decision-making when the decisions should be made at subsequent
time instants, and decision made at instant t in principle can depend on the part of uncertain data
which becomes known at this instant.
• There are situations in LP when some of the decision variables do not correspond to actual decisions;
they are artificial “analysis variables” added to the problem in order to convert it to a desired form,
say, a Linear Programming one. The analysis variables clearly may adjust themselves to the true
values of the data.
To give an example, consider the problem where we look for the best, in the discrete L1 -norm,
approximation of a given sequence b by a linear combination of given sequences aj , j = 1, ..., n, so
that the problem with no data uncertainty is
( )
T
P P
min t : |bt − atj xj | ≤ t (P)
x,t t=1 j
( m )
T
P P
min t : yt ≤ t, −yt ≤ bt − atj xj ≤ yt , 1 ≤ t ≤ T (LP)
t,x,y t=1 j
Note that (LP) is an equivalent LP reformulation of (P), and y are typical analysis variables;
whether x’s do or do not represent “actual decisions”, y’s definitely do not represent them. Now
assume that the data become uncertain. Perhaps we have reasons to require from (t, x)s to be
independent of actual data and to satisfy the constraint in (P) for all realizations of the data. This
requirement means that the variables t, x in (LP) must be data-independent, but we have absolutely
no reason to insist on data-independence of y’s: (t, x) is robust feasible for (P) if and only if (t, x),
for all realizations of the data from the uncertainty set, can be extended, by a properly chosen and
perhaps depending on the data vector y, to a feasible solution of (the corresponding realization of)
(LP). In other words, equivalence between (P) and (LP) is restricted to the case of certain data
only; when the data become uncertain, the robust counterpart of (LP) is more conservative than
the one of (P).
In order to take into account a possibility for (part of) the variables to adjust themselves to the true
values of (part of) the data, we could act as follows.
Now assume that decision variable xj is allowed to depend on part of the true data. Since the true data are
affine functions of ζ, this is the same as to assume that xj can depend on “a part” Pj ζ of the perturbation
vector, where Pj is a given matrix. The case of Pj = 0 correspond to “here and now” decisions xj –
those which should be done in advance; we shall call these decision variables non-adjustable. The case
of nonzero Pj (“adjustable decision variable”) corresponds to allowing certain dependence of xj on the
data, and the case when Pj has trivial kernel means that xj is allowed to depend on the entire true data.
Adjustable Robust Counterpart of LP. With our assumptions, a natural modification of the
Robust Optimization methodology results in the following adjustable Robust Counterpart of LP:
Pn
cj [ζ]φj (Pj ζ) ≤ t ∀ζ ∈ Z
j=1
min n t: Pn (ARC)
t,{φj (·)}j=1
φj (Pj ζ)Aj [ζ] − b[ζ] ≥ 0 ∀ζ ∈ Z
j=1
2.4. MORE APPLICATIONS: ROBUST LINEAR PROGRAMMING 119
Here cj [ζ] is j-th entry of the objective vector, and Aj [ζ] is j-th column of the constraint matrix.
It should be stressed that the variables in (ARC) corresponding to adjustable decision variables in
the original problem are not reals; they are “decision rules” – real-valued functions of the corresponding
portion Pj ζ of the data. This fact makes (ARC) infinite-dimensional optimization problem and thus
problem which is extremely difficult for numerical processing. Indeed, in general it is unclear how to
represent in a tractable way a general-type function of three (not speaking of three hundred) variables;
and how could we hope to find, in an efficient manner, optimal decision rules when we even do not know
how to write them down? Thus, in general (ARC) has no actual meaning – basically all we can do with
the problem is to write it down on paper and then look at it...
φj (Pj ζ) = µj + νjT Pj ζ.
With this approach, our new decision variables become reals µj and vectors νj , and (ARC) becomes the
following problem (called Affinely Adjustable Robust Counterpart of LP):
cj [z][µj + νjT Pj ζ] ≤ t ∀ζ ∈ Z
P
j
min n t: P T (AARC)
t,{µj ,νj }j=1 [µj + νj Pj ]Aj [ζ] − b[ζ] ≥ 0 ∀ζ ∈ Z
j
Note that the AARC is “in-between” the usual non-adjustable RC (no dependence of variables on the
true data at all) and the ARC (arbitrary dependencies of the decision variables on the allowed portions
of the true data). Note also that the only reason to restrict ourselves with affine decision rules is the
desire to end up with a “tractable” robust counterpart, and even this natural goal for the time being is
not achieved. Indeed, the constraints in (AARC) are affine in our new decision variables t, µj , νj , which
is a good news. At the same time, they are semi-infinite, same as in the case of the non-adjustable
Robust Counterpart, but, in contrast to this latter case, in general are quadratic in perturbations rather
than to be linear in them. This indeed makes a difference: as we know from Proposition 2.4.2, the
usual – non-adjustable – RC of an uncertain LP with CQr uncertainty set is equivalent to an explicit
Conic Quadratic problem and as such is computationally tractable (in fact, the latter remain true for the
case of non-adjustable RC of uncertain LP with arbitrary “computationally tractable” uncertainty set).
In contrast to this, AARC can become intractable for uncertainty sets as simple as boxes. There are,
however, good news on AARCs:
• First, there exist a generic “good case” where the AARC is tractable. This is the “fixed recourse”
case, where the coefficients of adjustable variables xj – those with Pj 6= 0 – are certain (not
affected by uncertainty). In this case, the left hand sides of the constraints in (AARC) are affine in
ζ, and thus AARC, same as the usual non-adjustable RC, is computationally tractable whenever
the perturbation set Z is so; in particular, Proposition 2.4.2 remains valid for both RC and AARC.
• Second, we shall see in Lecture 3 that even when AARC is intractable, it still admits tight, in
certain precise sense, tractable approximations.
• pi (t) is the i-th order of the period – the amount of the product to be produced during the period by
factory i and used to satisfy the demand of the period (and, perhaps, to replenish the warehouse);
• Qi - the maximal cumulative production capacity of i’th factory throughout the planning horizon.
The goal is to minimize the total production cost over all factories and the entire planning period. When
all the data are certain, the problem can be modelled by the following linear program:
min F
pi (t),v(t),F
T P
P I
s.t. ci (t)pi (t) ≤ F
t=1 i=1
0 ≤ pi (t) ≤ Pi (t), i = 1, . . . , I, t = 1, . . . , T
PT (2.4.6)
pi (t) ≤ Q(i), i = 1, . . . , I
t=1
I
P
v(t + 1) = v(t) + pi (t) − dt , t = 1, . . . , T
i=1
Vmin ≤ v(t) ≤ Vmax , t = 2, . . . , T + 1.
min F
pi (t),F
T P
P I
s.t. ci (t)pi (t) ≤ F
t=1 i=1
0 ≤ pi (t) ≤ Pi (t), i = 1, . . . , I, t = 1, . . . , T (2.4.7)
PT
pi (t) ≤ Q(i), i = 1, . . . , I
t=1
t P
P I t
P
Vmin ≤ v(1) + pi (s) − ds ≤ Vmax , t = 1, . . . , T.
s=1 i=1 s=1
Assume that the decision on supplies pi (t) is made at the beginning of period t, and that we are allowed
to make these decisions on the basis of demands dr observed at periods r ∈ It , where It is a given subset
of {1, ..., t}. Further, assume that we should specify our supply policies before the planning period starts
(“at period 0”), and that when specifying these policies, we do not know exactly the future demands; all
we know is that
dt ∈ [d∗t − θd∗t , d∗t + θd∗t ], t = 1, . . . , T, (2.4.8)
with given positive θ and positive nominal demand d∗t . We have now an uncertain LP, where the uncertain
data are the actual demands dt , the decision variables are the supplies pi (t), and these decision variables
are allowed to depend on the data {dτ : τ ∈ It } which become known when pi (t) should be specified.
Note that our uncertain LP is a “fixed recourse” one – the uncertainty affects solely the right hand side.
Thus, the AARC of the problem is computationally tractable, which is good. Let us build the AARC.
Restricting our decision-making policy with affine decision rules
X
0 r
pi (t) = πi,t + πi,t dr , (2.4.9)
r∈It
2.4. MORE APPLICATIONS: ROBUST LINEAR PROGRAMMING 121
r
where the coefficients πi,t are our new non-adjustable design variables, we get from (2.4.7) the following
s
uncertain Linear Programming problem in variables πi,t , F:
min F
π,F !
T P
I
0 r
P P
s.t. ci (t) πi,t
+ ≤F
πi,t dr
t=1 i=1
0
P r r∈It
0 ≤ πi,t + πi,t dr ≤ Pi (t), i = 1, . . . , I, t = 1, . . . , T
r∈It !
T (2.4.10)
0
P P r
πi,t + πi,t dr ≤ Q(i), i = 1, . . . , I
t=1 r∈It
!
t I t
0 r
P P P P
Vmin ≤ v(1) + πi,s + πi,s dr − ds ≤ Vmax ,
s=1 i=1 r∈Is s=1
t = 1, . . . , T
∀{dt ∈ [d∗t − θd∗t , d∗t + θd∗t ], t = 1, . . . , T },
min F
π,F !
T P
I T I
0 r
P P P P
s.t. ci (t)πi,t + ci (t)πi,t dr − F ≤ 0
t=1 i=1 r=1 i=1 t:r∈It
t
0 r
P
πi,t + πi,t dr ≤ Pi (t), i = 1, . . . , I, t = 1, . . . , T
r∈I
Pt
0 r
πi,t + πi,t dr ≥ 0, i = 1, . . . , I, t = 1, . . . , T
r∈It !
T T
0
P P P r
πi,t + πi,t dr ≤ Qi , i = 1, . . . , I (2.4.11)
t=1 r=1 t:r∈It
!
t P I t I
0 r
P P P P
πi,s + πi,s − 1 dr ≤ Vmax − v(1)
s=1 i=1 r=1 i=1 s≤t,r∈Is
t = 1, . . . , T !
t PI t I
0 r
P P P P
− πi,s − πi,s − 1 dr ≤ v(1) − Vmin
s=1 i=1 r=1 i=1 s≤t,r∈Is
t = 1, . . . , T
∀{dt ∈ [d∗t − θd∗t , d∗t + θd∗t ], t = 1, . . . , T }.
T
dt xt ≤ y, ∀dt ∈ [d∗t (1 − θ), d∗t (1 + θ)]
P
t=1
mP
d∗t (1 − θ)xt + d∗t (1 + θ)xt ≤ y
P
t:xt <0 t:xt >0
m
T T
d∗t xt + θ d∗t |xt | ≤ y,
P P
t=1 t=1
X X I
X X
r
αr ≡ ci (t)πi,t ; δir ≡ r
πi,t ; ξtr ≡ r
πi,s − 1,
t:r∈It t:r∈It i=1 s≤t,r∈Is
122 LECTURE 2. CONIC QUADRATIC PROGRAMMING
we can straightforwardly convert the AARC (2.4.11) into an equivalent LP (cf. Remark 2.4.1):
min F
π,F,α,β,γ,δ,ζ,ξ,η
I I T T T
r 0
αr d∗r + θ βr d∗r ≤ F ;
P P P P P P
ci (t)πi,t = αr , −βr ≤ αr ≤ βr , 1 ≤ r ≤ T, ci (t)πi,t +
i=1 t:r∈It P r ∗ t=1
P r ∗i=1 r=1 r=1
r r r 0
−γi,t ≤ πi,t ≤ γi,t , r ∈ It , πi,t + πi,t dr + θ γi,t dr ≤ Pi (t), 1 ≤ i ≤ I, 1 ≤ t ≤ T ;
r∈ItP r∈It
0 r ∗ r ∗ r
= δir , −ζir ≤ δir ≤ ζir , 1 ≤ i ≤ I, 1 ≤ r ≤ T,
P P
πi,t + πi,t dr − θ γi,t dr ≥ 0, πi,t
r∈It r∈It t:r∈It
T T T
0
δir d∗r ζir d∗r ≤ Qi , 1 ≤ i ≤ I;
P P P
πi,t + +θ
t=1 r=1 r=1
I
r
− ξtr = 1, −ηtr ≤ ξtr ≤ ηtr , 1 ≤ r ≤ t ≤ T,
P P
πi,s
i=1 s≤t,r∈Is
t P I t t
0
ξtr d∗r + θ ηtr d∗r ≤ Vmax − v(1), 1 ≤ t ≤ T,
P P P
πi,s +
s=1 i=1 r=1 r=1
t P I t t
0
ξtr d∗r − θ ηtr d∗r ≥ v(1) − Vmin , 1 ≤ t ≤ T.
P P P
πi,s +
s=1 i=1 r=1 r=1
(2.4.12)
An illustrative example. There are I = 3 factories producing a seasonal product, and one warehouse.
The decisions concerning production are made every two weeks, and we are planning production for 48 weeks,
thus the time horizon is T = 24 periods. The nominal demand d∗ is seasonal, reaching its maximum in winter,
specifically,
1 π (t − 1)
d∗t = 1000 1 + sin , t = 1, . . . , 24.
2 12
We assume that the uncertainty level θ is 20%, i.e., dt ∈ [0.8d∗t , 1.2d∗t ], as shown on the picture.
1800
1600
1400
1200
1000
800
600
400
0 5 10 15 20 25 30 35 40 45 50
The production costs per unit of the product depend on the factory and on time and follow the same seasonal
pattern as the demand, i.e., rise in winter and fall in summer. The production cost for a factory i at a period t
2.4. MORE APPLICATIONS: ROBUST LINEAR PROGRAMMING 123
is given by:
π (t−1)
1
ci (t) = αi 1 + 2
sin 12
, t = 1, . . . , 24.
α1 = 1
α2 = 1.5
α3 = 2
3.5
2.5
1.5
0.5
0
0 5 10 15 20 25
The maximal production capacity of each one of the factories at each two-weeks period is Pi (t) = 567 units,
and the integral production capacity of each one of the factories for a year is Qi = 13600. The inventory at the
warehouse should not be less then 500 units, and cannot exceed 2000 units.
With this data, the AARC (2.4.12) of the uncertain inventory problem is an LP, the dimensions of which vary,
depending on the “information basis” (see below), from 919 variables and 1413 constraints (empty information
basis) to 2719 variables and 3213 constraints (on-line information basis).
The experiments. In every one of the experiments, the corresponding management policy was tested
against a given number (100) of simulations; in every one of the simulations, the actual demand dt of period t
was drawn at random, according to the uniform distribution on the segment [(1 − θ)d∗t , (1 + θ)d∗t ] where θ was
the “uncertainty level” characteristic for the experiment. The demands of distinct periods were independent of
each other.
We have conducted two series of experiments:
1. The aim of the first series of experiments was to check the influence of the demand uncertainty θ on the
total production costs corresponding to the robustly adjustable management policy – the policy (2.4.9)
yielded by the optimal solution to the AARC (2.4.12). We compared this cost to the “ideal” one, i.e., the
cost we would have paid in the case when all the demands were known to us in advance and we were using
the corresponding optimal management policy as given by the optimal solution of (2.4.6).
2. The aim of the second series of experiments was to check the influence of the “information basis” allowed
for the management policy, on the resulting management cost. Specifically, in our model as described in
the previous Section, when making decisions pi (t) at time period t, we can make these decisions depending
on the demands of periods r ∈ It , where It is a given subset of the segment {1, 2, ..., t}. The larger are these
subsets, the more flexible can be our decisions, and hopefully the less are the corresponding management
costs. In order to quantify this phenomenon, we considered 4 “information bases” of the decisions:
(a) It = {1, ..., t} (the richest “on-line” information basis);
(b) It = {1, ..., t − 1} (this standard information basis seems to be the most natural “information basis”:
past is known, present and future are unknown);
(c) It = {1, ..., t − 4} (the information about the demand is received with a four-day delay);
(d) It = ∅ (i.e., no adjusting of future decisions to actual demands at all. This “information basis”
corresponds exactly to the management policy yielded by the usual RC of our uncertain LP.).
124 LECTURE 2. CONIC QUADRATIC PROGRAMMING
2. The influence of the information basis. The influence of the information basis on the performance of the
robustly adjustable management policy is displayed in the following table:
These experiments were carried out at the uncertainty level of 20%. We see that the poorer is the information
basis of our management policy, the worse are the results yielded by this policy. In particular, with 20% level
of uncertainty, there does not exist a robust non-adjustable management policy: the usual RC of our uncertain
LP is infeasible. In other words, in our illustrating example, passing from a priori decisions yielded by RC to
“adjustable” decisions yielded by AARC is indeed crucial.
An interesting question is what is the uncertainty level which still allows for a priori decisions. It turns out
that the RC is infeasible even at the 5% uncertainty level. Only at the uncertainty level as small as 2.5% the RC
becomes feasible and yields the following management costs:
RC Ideal cost
price of
Uncertainty Mean Std Mean Std
robustness
2.5% 35287 0 33842 172 4.3%
Note that even at this unrealistically small uncertainty level the price of robustness for the policy yielded by the
RC is by 4.3% larger than the ideal cost (while for the robustly adjustable management this difference is just
0.3%.
since it does not impose on the adjustable variables an “ad hoc” restriction (motivated solely by the
desire to end up with a tractable problem) to be affine functions of the uncertain data. At the same
time, the above “if applicable” is highly restrictive: the computational effort in Dynamical Programming
explodes exponentially with the dimension of the state space of the dynamical system in question. For
example, the simple Inventory problem we have considered has 4-dimensional state space (the current
amount of product in the warehouse plus remaining total capacities of the three factories), which is al-
ready computationally too demanding for accurate implementation of Dynamic Programming. The main
advantage of the AARC-based dynamical decision-making as compared with Dynamic Programming (as
well as with Multi-Stage Stochastic Programming) comes from the “built-in” computational tractabil-
ity of the approach, which prevents the “curse of dimensionality” and allows to process routinely fairly
complicated models with high-dimensional state spaces and many stages.
By the way, it is instructive to compare the AARC approach with Dynamic Programming when the
latter is applicable. For example, let us reduce the number of factories in our Inventory problem from 3
to 1, increasing the production capacity of this factory from the previous 567 to 1800 units per period,
and let us make the cumulative capacity of the factory equal to 24 × 1800, so that the restriction on
cumulative production becomes redundant. The resulting dynamical decision-making problem has just
one-dimensional state space (all which matters for the future is the current amount of product in the
warehouse). Therefore we can easily find by Dynamic Programming the “minimax optimal” inventory
management cost (minimum over arbitrary casual6) decision rules, maximum over the realizations of the
demands from the uncertainty set). With 20% uncertainty, this minimax optimal inventory management
cost turns out to be Opt∗ = 31269.69. The guarantees for the AARC-based inventory policy can be only
worse than for the minimax optimal one: we should pay a price for restricting the decision rules to be
affine in the demands. How large is this price? Computation shows that the optimal value in the AARC
is OptAARC = 31514.17, i.e., it is just by 0.8% larger than the minimax optimal cost Opt∗ . And all this
– at the uncertainty level as large as 20%! We conclude that the AARC is perhaps not as bad as one
could think...
To pose the question formally, let us say that a system of linear inequalities
P y + tp + Qu ≥ 0 (LP)
kyk2 ≤ t (CQI)
P i yi + ti pi + Qi ui ≥ 0
if is small enough, this program, for every practical purpose, is “the same” as (CQP) 7) .
{(y, t) : t ≥ φ(y)}
Theorem 2.5.1 Let n be the dimension of y in (CQI), and let 0 < < 1/2. There exists
(and can be explicitly written) a system of no more than O(1)n ln 1 linear inequalities of the
form (LP) with dim u ≤ O(1)n ln 1 which is an -approximation of (CQI). Here O(1)’s are
appropriate absolute constants.
To get an impression of the constant factors in the Theorem, look at the numbers I(n, ) of
linear inequalities and V (n, ) of additional variables u in an -approximation (LP) of the conic
quadratic inequality (CQI) with dim y = n:
is a representation of (CQI) in the sense that a collection (y10 ≡ y1 , ..., yn0 ≡ yn , y1κ ≡ t) can be
extended to a solution of (2.5.1) if and only if (y, t) solves (CQI). Moreover, let Π` (x1 , x2 , x3 , u` )
be polyhedral ` -approximations of the cone
q
3
L = {(x1 , x2 , x3 ) : x21 + x22 ≤ x3 },
Writing down this system of linear constraints as Π(y, t, u) ≥ 0, where Π is linear in its ar-
guments, y = (y10 , ..., yn0 ), t = y1κ , and u is the collection of all u`i , ` = 1, ..., κ and all yi` ,
` = 1, ..., κ − 1, we immediately conclude that Π is a polyhedral -approximation of Ln+1 with
κ
Y
1+= (1 + ` ). (2.5.3)
`=1
In view of this observation, we may focus on building polyhedral approximations of the Lorentz
cone L3 .
Note that (2.5.4) can be straightforwardly written down as a system of linear homogeneous
inequalities Π(ν) (x1 , x2 , x3 , u) ≥ 0, where u is the collection of 2(ν+1) variables ξ j , η i , j = 0, ..., ν.
q
Proposition 2.5.1 Π(ν) is a polyhedral δ(ν)-approximation of L3 = {(x1 , x2 , x3 ) : x21 + x22 ≤
x3 } with
1
δ(ν) = − 1. (2.5.5)
π
cos 2ν+1
thus ensuring (2.5.4.b), and let P j = (ξ j , η j ). The point P i is obtained from P j−1 by the
π
following construction: we rotate clockwise P j−1 by the angle φj = 2j+1 , thus getting a point
Q ; if this point is in the upper half-plane, we set P = Q , otherwise P j is the reflection
j−1 j j−1
of Qj−1 with respect to the x-axis. From this description it is clear that
(I) kP j k2 = kP j−1 k2 , so that all vectors P j are of the same Euclidean norm as P 0 , i.e., of
the norm k(x1 , x2 )k2 ;
2.5. DOES CONIC QUADRATIC PROGRAMMING EXIST? 129
(II) Since the point P 0 is in the first quadrant, the point Q0 is in the angle − π4 ≤ arg(P ) ≤ π4 ,
so that P 1 is in the angle 0 ≤ arg(P ) ≤ π4 . The latter relation, in turn, implies that Q1 is in
the angle − π8 ≤ arg(P ) ≤ π8 , whence P 2 is in the angle 0 ≤ arg(P ) ≤ π8 . Similarly, P 3 is in the
π π
angle 0 ≤ arg(P ) ≤ 16 , and so on: P j is in the angle 0 ≤ arg(P ) ≤ 2j+1 .
ν ν
By (I), ξ ≤ kP k2 = k(x1 , x2 )k2 ≤ x3 , so that the first inequality in (2.5.4.c) is satisfied.
π
By (II), P ν is in the angle 0 ≤ arg(P ) ≤ 2ν+1 , so that the second inequality in (2.5.4.c) also is
satisfied. We have extended a point from L3 to a solution to (2.5.4).
(ii): Let (x1 , x2 , x3 ) can be extended to a solution (x1 , x2 , x3 , {ξ j , η j }νj=0 ) to (2.5.4). Let
us set P j = (ξ j , η j ). From (2.5.4.a, b) it follows that all vectors P j are nonnegative. We have
k P 0 k2 ≥ k (x1 , x2 ) k2 by (2.5.4.a). Now, (2.5.4.b) says that the coordinates of P j are ≥ absolute
values of the coordinates of P j−1 taken in certain orthonormal system of coordinates, so that
kP j k2 ≥ kP j−1 k2 . Thus, kP ν k2 ≥ k(x1 , x2 )T k2 . On the other hand, by (2.5.4.c) one has
1 1
kP ν k2 ≤ π
ξν ≤ π
x3 , so that k(x1 , x2 )T k2 ≤ δ(ν)x3 , as claimed.
cos cos
2ν+1 2ν+1
Specifying in (2.5.2) the mappings Π` (·) as Π(ν` ) (·), we conclude that for every collection
of positive integers ν1 , ..., νκ one can point out a polyhedral β-approximation Πν1 ,...,νκ (y, t, u) of
Ln , n = 2κ :
(
0 `−1
ξ`,i ≥ |y2i−1 |
(a`,i ) 0 `−1
η`,i ≥ |y2i |
ξj = cos π
j+1 ξ j−1
+ sin π
j+1
j−1
η`,i
`,i 2 `,i 2
(b`,i ) , j = 1, ..., ν`
ηj π
≥ − sin 2j+1
j−1
ξ`,i + cos 2j+1π j−1
η`,i
(2.5.6)
( ν`,i`
ξ`,i `
≤ yi
(c`,i ) ν`
ν`
η`,i ≤ tg 2ν`π+1 ξ`,i
i = 1, ..., 2κ−` , ` = 1, ..., κ.
1. The dimension of the u-vector (comprised of all variables in (2.5.6) except yi = yi0 and
t = y1κ ) is
κ
X
p(n, ν1 , ..., νκ ) ≤ n + O(1) 2κ−` ν` ;
`=1
2. The image dimension of Πν1 ,...,νκ (·) (i.e., the # of linear inequalities plus twice the # of
linear equations in (2.5.6)) is
κ
X
q(n, ν1 , ..., νκ ) ≤ O(1) 2κ−` ν` ;
`=1
κ
Y 1
β = β(n; ν1 , ..., νκ ) = − 1.
π
`=1 cos 2ν` +1
130 LECTURE 2. CONIC QUADRATIC PROGRAMMING
β(ν1 , ..., νκ ) ≤ ,
p(n, ν1 , ..., νκ ) ≤ O(1)n ln 2 ,
q(n, ν1 , ..., νκ ) ≤ O(1)n ln 2 ,
as required.
2.6 Exercises
2.6.1 Around randomly perturbed linear constraints
Consider a linear constraint
aT x ≥ b [x ∈ Rn ]. (2.6.1)
We have seen that if the coefficients aj of the left hand side are subject to random perturbations:
aj = a∗j + j , (2.6.2)
where j are independent random variables with zero means taking values in segments [−σj , σj ],
then “a reliable version” of the constraint is
sX
a∗j xj
X
−ω σj2 x2j ≥ b, (2.6.3)
j j
| {z }
α(x)
where ω > 0 is a “safety parameter”. “Reliability” means that if certain x satisfies (2.6.3), then
x is “exp{−ω 2 /4}-reliable solution to (2.6.1)”, that is, the probability that x fails to satisfy
a realization of the randomly perturbed constraint (2.6.1) does not exceed exp{−ω 2 /4} (see
Proposition 2.4.1). Of course, there exists a possibility to build an “absolutely safe” version of
P ∗
(2.6.1) – (2.6.2) (an analogy of the Robust Counterpart), that is, to require that min (aj +
|j |≤σj j
j )xj ≥ b, which is exactly the inequality
a∗j xj −
X X
σj |xj | ≥ b. (2.6.4)
j j
| {z }
β(x)
Whenever x satisfies (2.6.4), x satisfies all realizations of (2.6.1), and not “all, up to exceptions
of small probability”. Since (2.6.4) ensures more guarantees than (2.6.3), it is natural to expect
from the latter inequality to be “less conservative” than the former one, that is, to expect
that the solution set of (2.6.3) is larger than the solution set of (2.6.4). Whether this indeed
is the case? The answer depends on the value of the safety parameter ω: when ω ≤ 1, the
“safety term” α(x) in (2.6.3) is, for every x, not greater than the safety term β(x) in (2.6.4),
√
so that every solution to (2.6.4) satisfies (2.6.3). When n > ω > 1, the “safety terms” in
our inequalities become “non-comparable”: depending on x, it may happen that α(x) ≤ β(x)
2.6. EXERCISES 131
√
(which is typical when ω << n), same as it may happen that α(x) > β(x). Thus, in the
√
range 1 < ω < n no one of inequalities (2.6.3), (2.6.4) is more conservative than the other
√
one. Finally, when ω ≥ n, we always have α(x) ≥ β(x) (why?), so that for “large” values
of ω (2.6.3) is even more conservative than (2.6.4). The bottom line is that (2.6.3) is not
completely satisfactory candidate to the role of “reliable version” of linear constraint (2.6.1)
affected by random perturbations (2.6.2): depending on the safety parameter, this candidate
not necessarily is less conservative than the “absolutely reliable” version (2.6.4).
The goal of the subsequent exercises is to build and to investigate an improved version of
(2.6.3).
(a) x = u + vr
(b)
P ∗
aj xj − σj |uj | − ω
P P 2 2
σ j vj ≥ b (2.6.5)
j j j
Prove that then the probability for x to violate a realization of (2.6.1) is ≤ exp{−ω 2 /4} (and is
≤ exp{−ω 2 /2} in the case of symmetrically distributed j ).
2) Verify that the requirement “x can be extended, by properly chosen u, v, to a solution of
(2.6.5)” is weaker than every one of the requirements
(a) x satisfies (2.6.3)
(b) x satisfies (2.6.4)
min aT x ≥ b (2.6.6)
a∈U
of (2.6.1) corresponding to certain choices of the uncertainty set U: (2.6.3) corresponds to the
ellipsoidal uncertainty set
U = {a : aj = a∗j + σj ζj ,
X
ζj2 ≤ ω 2 },
j
(!) System (2.6.5) is (equivalent to) the Robust Counterpart (2.6.6) of (2.6.1), the
uncertainty set being the intersection of the above ellipsoid and box:
U∗ = {a : aj = a∗j + σj ζj ,
X
ζj2 ≤ ω 2 , max |ζj | ≤ 1}.
j
j
Specifically, x can be extended to a feasible solution of (2.6.5) if and only if one has
min aT x ≥ b.
a∈U∗
Exercise 2.3 Extend the above constructions and results to the case of uncertain linear inequal-
ity
aT x ≥ b
with certain b and the vector of coefficients a randomly perturbed according to the scheme
a = a∗ + B,
where B is deterministic, and the entries 1 , ..., N of are independent random variables with
zero means and such that |i | ≤ σi for all i (σi are deterministic).
Background. In what follows, you can take for granted the following facts:
1. The diagram of “standardly invoked” harmonic oscillator placed at a point p ∈ R3 is the
following function of a 3D unit direction δ:
! !
2πpT δ 2πpT δ
Dp (δ) = cos + i sin [δ ∈ R3 , δ T δ = 1] (2.6.7)
λ λ
where λ is the wavelength, and i is the imaginary unit.
2. The diagram of an array of oscillators placed at points p1 , ..., pk , is the function
k
X
D(δ) = z` Dp` (δ),
`=1
where z` are the “element weights” (which form the antenna design and can be arbitrary
complex numbers).
2.6. EXERCISES 133
3. A natural for engineers way to measure the “concentration” of the energy sent by antenna
around a given direction e (which from now on is the positive direction of the x-axis) is
• to choose a θ > 0 and to define the corresponding sidelobe angle ∆θ as the set of all
unit 3D directions δ which are at the angle ≥ θ with the direction e;
|D(e)|
• to measure the “energy concentration” by the index ρ = max |D(δ)| , where D(·) is the
δ∈∆θ
diagram of the antenna.
4. To make the index easily computable, let us replace in its definition the maximum over
the entire sidelobe angle with the maximum over a given “fine finite grid” Γ ⊂ ∆θ , thus
arriving at the quantity
|D(e)|
ρ=
max |D(δ)|
δ∈Γθ
Exercise 2.4 1) Whether the objective (2.6.8) is a concave (and thus “easy to maximize”)
function?
2) Prove that (∗) is equivalent to the convex optimization program
( ! )
X X
max < (x` + iy` )D` (e) :| (x` + iy` )D` (δ)| ≤ 1, δ ∈ Γ . (2.6.9)
x` ,y` ∈R
` `
In order to carry out our remaining tasks, it makes sense to approximate (2.6.9) with a Linear
Programming problem. To this end, it suffices to approximate the modulus of a complex number
z (i.e., the Euclidean norm of a 2D vector) by the quantity
With the outlined approximation of the modulus, (2.6.9) becomes the optimization program
( ! ! )
X X
max < (x` + iy` )D` (e) : < ωj (x` + iy` )D` (δ) ≤ 1, 1 ≤ j ≤ J, δ ∈ Γ . (2.6.10)
x` ,y` ∈R
` `
Operational Exercise 2.6.1 1) Verify that (2.6.10) is a Linear Programming program and
solve it numerically for the following two setups:
Data A:
where x∗` , y`∗ are the “nominal optimal weights” obtained when solving (2.6.10), x` , y` are ac-
tual weights, σ > 0 is the “perturbation level”, and ξ` , η` are mutually independent random
perturbations uniformly distributed in [−1, 1].
2.1) Check by simulation what happens with the concentration index of the actual diagram as
a result of implementation errors. Carry out the simulations for the perturbation level σ taking
values 1.e-4, 5.e-4, 1.e-3.
2.2) If you are not satisfied with the behaviour of nominal design(s) in the presence imple-
mentation errors, use the Robust Counterpart methodology to replace the nominal designs with
the robust ones. What is the “price of robustness” in terms of the index? What do you gain in
stability of the diagram w.r.t. implementation errors?
Lecture 3
Semidefinite Programming
In this lecture we study Semidefinite Programming – a generic conic program with an extremely
wide area of applications.
and we may use in connection with these spaces all notions based upon the Euclidean structure,
e.g., the (Frobenius) norm of a matrix
v
u m
q uX q
kXk2 = hX, Xi = t Xij2 = Tr(X T X)
i,j=1
and likewise the notions of orthogonality, orthogonal complement of a linear subspace, etc. Of
course, the Frobenius inner product of symmetric matrices can be written without the transpo-
sition sign:
hX, Y i = Tr(XY ), X, Y ∈ Sm .
Let us focus on the space Sm . After it is equipped with the Frobenius inner product, we may
speak about a cone dual to a given cone K ⊂ Sm :
K∗ = {Y ∈ Sm | hY, Xi ≥ 0 ∀X ∈ K}.
Among the cones in Sm , the one of special interest is the semidefinite cone Sm
+ , the cone of
all symmetric positive semidefinite matrices1) . It is easily seen that Sm
+ indeed is a cone, and
moreover it is self-dual:
(Sm m
+ )∗ = S+ .
1)
Recall that a symmetric n × n matrix A is called positive semidefinite if xT Ax ≥ 0 for all x ∈ Rm ; an
equivalent definition is that all eigenvalues of A are nonnegative
135
136 LECTURE 3. SEMIDEFINITE PROGRAMMING
In the case of semidefinite programs, where K = Sm + , the usual notation leads to a conflict with
m
the notation related to the space where S+ lives. Look at (CP): without additional remarks it
is unclear what is A – is it a m × m matrix from the space Sm or is it a linear mapping acting
from the space of the design vectors – some Rn – to the space Sm ? When speaking about a
conic problem on the cone Sm + , we should have in mind the second interpretation of A, while the
standard notation in (CP) suggests the first (wrong!) interpretation. In other words, we meet
with the necessity to distinguish between linear mappings acting to/from Sm and elements of
Sm (which themselves are linear mappings from Rm to Rm ). In order to resolve the difficulty,
we make the following
Notational convention: To denote a linear mapping acting from a linear space to a space
of matrices (or from a space of matrices to a linear space), we use uppercase script letters like
A, B,... Elements of usual vector spaces Rn are, as always, denoted by lowercase Latin/Greek
letters a, b, ..., z, α, ..., ζ, while elements of a space of matrices usually are denoted by uppercase
Latin letters A, B, ..., Z. According to this convention, a semidefinite program of the form (CP)
should be written as n o
min cT x : Ax − B ≥Sm +
0 . (∗)
x
We also simplify the sign ≥Sm+
to and the sign >Sm
+
to (same as we write ≥ instead of ≥Rm+
and > instead of >Rm +
). Thus, A B (⇔ B A) means that A and B are symmetric matrices
of the same size and A − B is positive semidefinite, while A B (⇔ B ≺ A) means that A, B
are symmetric matrices of the same size with positive definite A − B.
Our last convention is how to write down expressions of the type AAxB (A is a linear
mapping from some Rn to Sm , x ∈ Rn , A, B ∈ Sm ); what we are trying to denote is the result
of the following operation: we first take the value Ax of the mapping A at a vector x, thus
getting an m × m matrix Ax, and then multiply this matrix from the left and from the right by
the matrices A, B. In order to avoid misunderstandings, we write expressions of this type as
A[Ax]B
or as AA(x)B, or as AA[x]B.
Linear Matrix Inequality constraints and semidefinite programs. In the case of conic
quadratic problems, we started with the simplest program of this type – the one with a single
conic quadratic constraint Ax − b ≥Lm 0 – and then defined a conic quadratic program as a
program with finitely many constraints of this type, i.e., as a conic program on a direct product
of the ice-cream cones. In contrast to this, when defining a semidefinite program, we impose on
the design vector just one Linear Matrix Inequality (LMI) Ax − B 0. Now we indeed should
not bother about more than a single LMI, due to the following simple fact:
A system of finitely many LMI’s
Ai x − Bi 0, i = 1, ..., k,
Indeed, a block-diagonal symmetric matrix is positive (semi)definite if and only if all its
diagonal blocks are so.
is necessary and sufficient for a pair of a primal feasible solution x and a dual feasible solution
Λ to be optimal for the corresponding problems.
It is easily seen that for a pair X, Y of positive semidefinite symmetric matrices one has
Tr(XY ) = 0 ⇔ XY = Y X = 0;
in particular, in the case of strictly feasible primal and dual problems, the “primal slack” S∗ =
Ax∗ − B corresponding to a primal optimal solution commutes with (any) dual optimal solution
Λ∗ , and the product of these two matrices is 0. Besides this, S∗ and Λ∗ , as a pair of commuting
symmetric matrices, share a common eigenbasis, and the fact that S∗ Λ∗ = 0 means that the
eigenvalues of the matrices in this basis are “complementary”: for every common eigenvector,
either the eigenvalue of S∗ , or the one of Λ∗ , or both, are equal to 0 (cf. with complementary
slackness in the LP case).
can be cast as a semidefinite program. Just as in the previous lecture, this question actually
asks whether a given convex set/convex function is positive semidefinite representable (in short:
SDr). The definition of the latter notion is completely similar to the one of a CQr set/function:
n
that a convex set X ⊂ R is SDr, if there exists an affine mapping (x, u) →
We say
x
A − B : Rnx × Rku → Sm such that
u
x
x ∈ X ⇔ ∃u : A − B 0;
u
in the original design vector x and a vector u of additional design variables such that
X is a projection of the solution set of the LMI onto the x-space. An LMI with this
property is called Semidefinite Representation (SDR) of the set X.
A convex function f : Rn → R ∪ {+∞} is called SDr, if its epigraph
{(x, t) | t ≥ f (x)}
By exactly the same reasons as in the case of conic quadratic problems, one has:
1. If f is a SDr function, then all its level sets {x | f (x) ≤ a} are SDr; the SDR
of the level sets are explicitly given by (any) SDR of f ;
2. If all the sets Xi in problem (P) are SDr with known SDR’s, then the problem
can explicitly be converted to a semidefinite program.
In order to understand which functions/sets are SDr, we may use the same approach as in
Lecture 2. “The calculus”, i.e., the list of basic operations preserving SD-representability, is
exactly the same as in the case of conic quadratic problems; we just may repeat word by word
the relevant reasoning from Lecture 2, each time replacing “CQr” with “SDr”. Thus, the only
issue to be addressed is the derivation of a catalogue of “simple” SDr functions/sets. Our first
observation in this direction is as follows:
1-17. 2) If a function/set is CQr, it is also SDr, and any CQR of the function/set can be
explicitly converted to its SDR.
Indeed, the notion of a CQr/SDr function is a “derivative” of the notion of a CQr/SDr set:
by definition, a function is CQr/SDr if and only if its epigraph is so. Now, CQr sets are
exactly those sets which can be obtained as projections of the solution sets of systems of
conic quadratic inequalities, i.e., as projections of inverse images, under affine mappings, of
direct products of ice-cream cones. Similarly, SDr sets are projections of the inverse images,
under affine mappings, of positive semidefinite cones. Consequently,
(i) in order to verify that a CQr set is SDr as well, it suffices to show that an inverse image,
under an affine mapping, of a direct product of ice-cream cones – a set of the form
l
Y
Z = {z | Az − b ∈ K = Lki }
i=1
is the inverse image of a semidefinite cone under an affine mapping. To this end, in turn, it
suffices to demonstrate that
l
Lki of ice-cream cones is an inverse image of a semidefinite cone
Q
(ii) a direct product K =
i=1
under an affine mapping.
Indeed, representing K as {y | Ay − b ∈ Sm
+ }, we get
Z = {z | Az − b ∈ K} = {z | Âz − B̂ ∈ Sm
+ },
2)
We refer to Examples 1-17 of CQ-representable functions/sets from Section 2.3
140 LECTURE 3. SEMIDEFINITE PROGRAMMING
In fact the implication (iii) ⇒ (ii) is given by our calculus, since a direct product of SDr
sets is again SDr3) .
We have reached the point where no more reductions are necessary, and here is the demon-
stration of (iii). To see that the Lorentz cone Lk , k > 1, is SDr, it suffices to observe
that
x tIk−1 x
∈ Lk ⇔ A(x, t) = 0 (3.2.1)
t xT t
(x is k − 1-dimensional, t is scalar, Ik−1 is the (k − 1) × (k − 1) unit matrix). (3.2.1) indeed
resolves the problem, since the matrix A(x, t) is linear in (x, t)!
verify (3.2.1), which is immediate. If (x, t) ∈ Lk , i.e., if kxk2 ≤ t, then for
It remainsto
ξ
every y = ∈ Rk (ξ is (k − 1)-dimensional, τ is scalar) we have
τ
We see that the “expressive abilities” of semidefinite programming are even richer than
those of Conic Quadratic programming. In fact the gap is quite significant. The first new
possibility is the ability to handle eigenvalues, and the importance of this possibility can hardly
be overestimated.
we can represent K as the inverse image of a semidefinite cone under an affine mapping, namely, as
18. The largest eigenvalue λmax (X) regarded as a function of m × m symmetric matrix X is
SDr. Indeed, the epigraph of this function
λmax (X : M ) ≤ t
tM − X 0.
18b. The spectral norm |X| of a symmetric m × m matrix X, i.e., the maximum of abso-
lute values of the eigenvalues of X, is SDr. Indeed, a SDR of the epigraph
tIm − X 0, tIm + X 0.
In spite of their simplicity, the indicated results are extremely useful. As a more complicated
example, let us build a SDr for the sum of the k largest eigenvalues of a symmetric matrix.
From now on, speaking about m × m symmetric matrix X, we denote by λi (X), i = 1, ..., m,
its eigenvalues counted with their multiplicities and arranged in a non-ascending order:
The vector of the eigenvalues (in the indicated order) will be denoted λ(X):
The question we are about to address is which functions of the eigenvalues are SDr. We already
know that this is the case for the largest eigenvalue λ1 (X). Other eigenvalues cannot be SDr
since they are not convex functions of X. And convexity, of course, is a necessary condition for
SD-representability (cf. Lecture 2). It turns out, however, that the m functions
k
X
Sk (X) = λi (X), k = 1, ..., m,
i=1
{(X, t) | Sk (x) ≤ t}
X X 0 ⇒ λ(X) ≥ λ(X 0 ).
1
.
λ(X) ≤ λ(Z + sIm ) = λ(Z) + s .. ,
1
whence
Sk (X) ≤ Sk (Z) + sk.
4)
which is n immediate corollary of the fundamental Variational Characterization of Eigenvalues of symmetric
matrices, see Section A.7.3: for a symmetric m × m matrix A,
Since Z 0 (see (3.2.2.b)), we have Sk (Z) ≤ Tr(Z), and combining these inequalities we get
The latter inequality, in view of (3.2.2.a)), implies Sk (X) ≤ t, and (i) is proved.
To prove (ii), assume that we are given X, t with Sk (X) ≤ t, and let us set s = λk (X).
Then the k largest eigenvalues of the matrix X − sIm are nonnegative, and the remaining are
nonpositive. Let Z be a symmetric matrix with the same eigenbasis as X and such that the
k largest eigenvalues of Z are the same as those of X − sIm , and the remaining eigenvalues
are zeros. The matrices Z and Z − X + sIm are clearly positive semidefinite (the first by
construction, and the second since in the eigenbasis of X this matrix is diagonal with the first
k diagonal entries being 0 and the remaining being the same as those of the matrix sIm − X,
i.e., nonnegative). Thus, the matrix Z and the real s we have built satisfy (3.2.2.b, c). In
order to see that (3.2.2.a) is satisfied as well, note that by construction Tr(Z) = Sk (x) − sk,
whence t − sk − Tr(Z) = t − Sk (x) ≥ 0.
CT
B
A=
C D
be a symmetric matrix with k × k block B and ` × ` block D. Assume that B is positive definite.
Then A is positive (semi)definite if and only if the matrix
D − CB −1 C T
CT
T T B x
0 ≤ (x , y ) = xT Bx + 2xT C T y + y T Dy ∀x ∈ Rk , y ∈ R` ,
C D y
Since B is positive definite by assumption, the infimum in x can be computed explicitly for every
fixed y: the optimal x is −B −1 C T y, and the optimal value is
y T Dy − y T CB −1 C T y = y T [D − CB −1 C T ]y.
The positive definiteness/semidefiniteness of A is equivalent to the fact that the latter ex-
pression is, respectively, positive/nonnegative for every y 6= 0, i.e., to the positive definite-
ness/semidefiniteness of the Schur complement of B in A.
144 LECTURE 3. SEMIDEFINITE PROGRAMMING
is neither a convex nor a concave function of X (if m ≥ 2), it turns out that the function
Detq (X) is concave in X whenever 0 ≤ q ≤ m 1
. Function of this type are important in many
volume-related problems (see below); we are about to prove that
1
if q is a rational number,, 0 ≤ q ≤ m, then the function
−Detq (X), X 0
fq (X) =
+∞, otherwise
is SDr.
Consider the following system of LMI’s:
X ∆
0, (D)
∆T D(∆)
where ∆ is m × m lower triangular matrix comprised of additional variables, and D(∆) is
the diagonal matrix with the same diagonal entries as those of ∆. Let diag(∆) denote the
vector of the diagonal entries of the square matrix ∆.
As we know from Lecture 2 (see Example 15), the set
{(δ, t) ∈ Rm q
+ × R | t ≤ (δ1 ...δm ) }
admits an explicit CQR. Consequently, this set admits an explicit SDR as well. The latter
SDR is given by certain LMI S(δ, t; u) 0, where u is the vector of additional variables of
the SDR, and S(δ, t, u) is a matrix affinely depending on the arguments. We claim that
(!) The system of LMI’s (D) & S(diag(∆), t; u) 0 is a SDR for the set
{(X, t) | X 0, t ≤ Detq (X)},
which is basically the epigraph of the function fq (the latter is obtained from our set by
reflection with respect to the plane t = 0).
To support our claim, recall that by Linear Algebra a matrix X is positive semidefinite if and
only if it can be factorized as X = ∆
b∆b T with a lower triangular ∆, b ≥ 0; the resulting
b diag(∆)
matrix ∆ is called the Choleski factor of X. No note that if X 0 and t ≤ Detq (X), then
b
(1) We can extend X by appropriately chosen lower triangular matrix ∆ to a solution of (D)
m
Q
in such a way that if δ = diag(∆), then δi = Det(X).
i=1
Indeed, let ∆
b be the Choleski factor of X. Let D b be the diagonal matrix with the same
diagonal entries as those of ∆, and let ∆ = ∆D, so that the diagonal entries δi of ∆ are
b b b
squares of the diagonal entries bδi of the matrix ∆.
b Thus, D(∆) = D b 2 . It follows that
−1 T 2 −1 b b T b T = X. We
for every > 0 one has ∆[D(∆) + I] ∆ = ∆D[D + I] D∆ ∆
b b b b∆
X ∆
see that by the Schur Complement Lemma all matrices of the form
∆T D(∆) + I
X ∆
with > 0 are positive semidefinite, whence 0. Thus, (D) is indeed
∆T D(∆)
m m
b T ⇒ Det(X) = Det2 (∆) δi2 =
Q Q
satisfied by (X, ∆). And of course X = ∆
b∆ b = b δi .
i=1 i=1
3.2. WHAT CAN BE EXPRESSED VIA LMI’S? 145
m
m q
δi = Det(X), we get t ≤ Detq (X) =
Q Q
(2) Since δ = diag(∆) ≥ 0 and δi , so that we
i=1 i=1
can extend (t, δ) by a properly chosen u to a solution of the LMI S(diag(∆), t; u) 0.
We conclude that if X 0 and t ≤ Detq (X), then one can extend the pair X, t by properly
chosen ∆ and u to a solution of the LMI (D) & S(diag(∆), t; u) 0, which is the first part
of the proof of (!).
To complete the proof of (!), it suffices to demonstrate that if for a given pair X, t there
exist ∆ and u such that (D) and the LMI S(diag(∆), t; u) 0 are satisfied, then X is
positive semidefinite and t ≤ Detq (X). This is immediate: denoting δ = diag(∆) [≥ 0]
and applying the Schur Complement Lemma, we conclude that X ∆[D(∆) + I]−1 ∆T
for every > 0. Applying (W), we get λ(X) ≥ λ(∆[D(∆) + I]−1 ∆T ), whence of course
m
Det(X) ≥ Det(∆[D(∆) + I]−1 ∆T ) = δi2 /(δi + ). Passing to limit as → 0, we get
Q
i=1
m
Q
δi ≤ Det(X). On the other hand, the LMI S(δ, t; u) 0 takes place, which means that
i=1 m q
δi . Combining the resulting inequalities, we come to t ≤ Detq (X), as required.
Q
t≤
i=1
18e. Negative powers of the determinant. Let q be a positive rational. Then the function
Det−q (X),
X0
f (X) =
+∞, otherwise
of symmetric m × m matrix X is SDr.
The construction is completely similar to the one used in Example 18d. As we remember from
Lecture 2, Example 16, the function g(δ) = (δ1 ...δm )−q of positive vector δ = (δ1 , ..., δm )T is
CQr and is therefore SDr as well. Let an SDR of the function be given by LMI R(δ, t, u)
0. The same arguments as in Example 18d demonstrate that the pair of LMI’s (D) &
R(Dg(∆), t, u) 0 is an SDR for f .
t ≥ f (x) ⇔ ∃u : S(x, t, u) 0,
f (X) = g(λ(X))
(a) t ≥ f (X)
m
∃x1 , ...,
xm , u :
S(x1 , ..., xm , t, u) 0 (3.2.3)
x1 ≥ x2 ≥ ... ≥ xm
(b)
Sj (X) ≤ x1 + ... + xj , j = 1, ..., m − 1
Tr(X) = x1 + ... + xm
146 LECTURE 3. SEMIDEFINITE PROGRAMMING
k
P
(recall that the functions Sj (X) = λi (X) are SDr, see Example 18c). Thus, the solution set
i=1
of (b) is SDr (as an intersection of SDr sets), which implies SD-representability of the projection
of this set onto the (X, t)-plane; by (3.2.3) the latter projection is exactly the epigraph of f ).
The proof of Proposition 3.2.1 is based upon an extremely useful result known as Birkhoff’s
Theorem5) .
As a corollary of Proposition 3.2.1, we see that the following functions of a symmetric m × m
matrix X are SDr:
The importance of the singular values comes from the Singular Value Decomposition Theorem
which states that a k × l matrix A (k ≤ l) can be represented as
k
X
A= σi (A)ei fiT ,
i=1
5)
The Birkhoff Theorem, which, aside of other applications, implies a number of crucial facts about eigenvalues
of symmetric matrices, by itself even does not mention the word “eigenvalue” and reads: The extreme points of
the polytope P of double stochastic m × m matrices – those with nonnegative entries and unit sums of entries in
every row and every column – are exactly the permutation matrices (those with a single nonzero entry, equal to
1, in every row and every column). For proof, see Section B.2.8.C.1.
3.2. WHAT CAN BE EXPRESSED VIA LMI’S? 147
where {ei }ki=1 and {fi }ki=1 are orthonormal sequences in Rk and Rl , respectively; this is a
surrogate of the eigenvalue decomposition of a symmetric k × k matrix
k
X
A= λi (A)ei eTi ,
i=1
For a symmetric matrix, the singular values are exactly the moduli of the eigenvalues, and our
new definition of the norm coincides with the one already given in 18b.
It turns out that the sum of a given number of the largest singular values of A
p
X
Σp (A) = σi (A)
i=1
is a convex and, moreover, a SDr function of A. In particular, the operator norm of A is SDr:
19. The sum Σp (X) of p largest singular values of a rectangular matrix X ∈ Mk,l is SDr. In
particular, the operator norm of a rectangular matrix is SDr:
−X T
tIl
|X| ≤ t ⇔ 0.
−X tIk
Indeed, the result in question follows from the fact that the sums of p largest eigenvalues of
a symmetric matrix are SDr (Example 18c) due to the following
Since X̄ linearly depends on X, SDR’s of the functions Sp (·) induce SDR’s of the functions
Σp (X) = Sp (X̄) (Rule on affine substitution, Lecture 2; recall that all “calculus rules”
established in Lecture 2 for CQR’s are valid for SDR’s as well).
k
σi (X)ei fiT be a singular value de-
P
Let us justify our observation. Let X =
i=1
fi
composition of X. We claim that the 2k (k + l)-dimensional vectors gi+ =
ei
− fi
and gi = are orthogonal to each other, and they are eigenvectors of X̄
−ei
with the eigenvalues σi (X) and −σi (X), respectively. Moreover, X̄ vanishes on
the orthogonal complement of the linear span of these vectors. In other words,
we claim that the eigenvalues of X̄, arranged in the non-ascending order, are as
follows:
σ1 (X), σ2 (X), ..., σk (X), 0, ..., 0, −σk (X), −σk−1 (X), ..., −σ1 (X);
| {z }
2(l−k)
148 LECTURE 3. SEMIDEFINITE PROGRAMMING
(we have used that both {fj } and {ej } are orthonormal systems). Thus, gi+ is an
eigenvector of X̄ with the eigenvalue σi (X). Similar computation shows that gi−
is an eigenvector of X̄ with the eigenvalue −σi (X).
f
It remains to verify that if h = is orthogonal to all gi± (f is l-dimensional,
e
e is k-dimensional), then X̄h = 0. Indeed, the orthogonality assumption means
that f T fi ±eT ei = 0 for all i, whence eT ei =0 and f T fi = 0 for all i. Consequently,
k
T
P
f (e
i=1 j j e)
0 XT f
= k
= 0.
X 0 e P
T
ej (fj f )
i=1
Looking at Proposition 3.2.1, we see that the fact that specific functions of eigenvalues of a
symmetric matrix X, namely, the sums Sk (X) of k largest eigenvalues of X, are SDr, underlies
the possibility to build SDR’s for a wide class of functions of the eigenvalues. The role of the
sums of k largest singular values of a rectangular matrix X is equally important:
Proposition 3.2.2 Let g(x1 , ..., xk ) : Rk+ → R ∪ {+∞} be a symmetric monotone function:
f (X) = g(σ(X))
(a) t ≥ f (X)
m
∃x1 , ...,
xk , u : (3.2.4)
S(x1 , ..., xk , t, u) 0
(b) x1 ≥ x2 ≥ ... ≥ xk
Σ (X) ≤ x + ... + x , j = 1, ..., m
j 1 j
3.2. WHAT CAN BE EXPRESSED VIA LMI’S? 149
Note the difference between the symmetric (Proposition 3.2.1) and the non-symmetric (Propo-
sition 3.2.2) situations: in the former the function g(x) was assumed to be SDr and symmetric
only, while in the latter the monotonicity requirement is added.
The proof of Proposition 3.2.2 is outlined in Section 3.8
“Nonlinear matrix inequalities”. There are several cases when matrix inequalities F (x)
0, where F is a nonlinear function of x taking values in the space of symmetric m × m matrices,
can be “linearized” – expressed via LMI’s.
Indeed, let us denote by X 0 the set in the right hand side of the latter relation; we should
prove that X 0 = X . By definition, X is the closure of its intersection with the domain X 0.
It is clear that X 0 also is the closure of its intersection with the domain X 0. Thus, all we
need to prove is that a pair (Y, X) with X 0 belongs to X if and only if it belongs to X 0 .
“If” part: Assume that X 0 and (Y, X) ∈ X 0 . Then there exists Z such that Z 0,
Z Y and X CZC T . Let us choose a sequence Zi Z such that Zi → Z, i → ∞.
Since CZi C T → CZC T X as i → ∞, we can find a sequence of matrices Xi such that
T
Xi → X,i → ∞, and Xi CZi C for all i. By the Schur Complement Lemma, the
Xi C
matrices are positive definite; applying this lemma again, we conclude that
C T Zi−1
−1 −1
Zi C T Xi C. Note that the left and the right hand side matrices in the latter inequality
are positive definite. Now let us use the following simple fact
Lemma 3.2.2 Let U, V be positive definite matrices of the same size. Then
U V ⇔ U −1 V −1 .
Applying Lemma 3.2.2 to the inequality Zi−1 C T Xi−1 C[ 0], we get Zi (C T Xi−1 C)−1 .
As i → ∞, the left hand side in this inequality converges to Z, and the right hand side
converges to (C T X −1 C)−1 . Hence Z (C T X −1 C)−1 , and since Y Z, we get Y
(C T X −1 C)−1 , as claimed.
“Only if” part: Let X 0 and Y (C T X −1 C)−1 ; we should prove that there exists Z 0
such that Z Y and X CZC T . We claim that the required relations are satisfied by
Z = (C T X −1 C)−1 . The only nontrivial part of the claim is that X CZC T , and here is the
required
−1 justification:
by its origin Z 0, and by the Schur Complement Lemma the matrix
Z CT
is positive semidefinite, whence, by the same Lemma, X C(Z −1 )−1 C T =
C X
CZC T .
3.2. WHAT CAN BE EXPRESSED VIA LMI’S? 151
+
21a. Polynomials nonnegative on the entire axis: The set P2k (R) is SDr – it is the image
k+1
of the semidefinite cone S+ under the affine mapping
1
t
X 7→ Coef(eT (t)Xe(t)) : Sk+1 → R2k+1 , e(t) = t2 (C)
...
tk
+
First note that the fact that P + ≡ P2k (R) is an affine image of the semidefinite cone indeed
+
implies the SD-representability of P , see the “calculus” of conic representations in Lecture
2. Thus, all we need is to show that P + is exactly the same as the image, let it be called P ,
of Sk+1
+ under the mapping (C).
(1) The fact that P is contained in P + is immediate. Indeed, let X be a (k + 1) × (k + 1)
positive semidefinite matrix. Then X is a sum of dyadic matrices:
k+1
X
X= pi (pi )T , pi = (pi0 , pi1 , ..., pik )T ∈ Rk+1
i=1
is the sum of squares of other polynomials and therefore is nonnegative on the axis. Thus,
the image of X under the mapping (C) belongs to P + .
Note that reversing our reasoning, we get the following result:
6)
It is clear why we have restricted the degree to be even: a polynomial of an odd degree cannot be nonnegative
on the entire axis!
152 LECTURE 3. SEMIDEFINITE PROGRAMMING
With (!), the remaining part of the proof – the demonstration that the image of Sk+1
+ contains
P + , is readily given by the following well-known algebraic fact:
p(t) = ω 2 [(t − λ1 )2 ]m1 ...[(t − λr )2 ]mr [(t − µ1 )(t − µ∗1 )]...[(t − µs )(t − µ∗s )]
is a product of sums of squares. But such a product is itself a sum of squares (open the
parentheses)!
In fact we can say more: a nonnegative polynomial p is a sum of just two
squares! To see this, note that, as we have seen, p is a product of sums of two
squares and take into account the following fact (Liouville):
The product of sums of two squares is again a sum of two squares:
(cf. with: “the modulus of a product of two complex numbers is the product
of their moduli”).
+
Equipped with the SDR of the set P2k (R) of polynomials nonnegative on the entire axis, we
can immediately obtain SDR’s for the polynomials nonnegative on a given ray/segment:
2) The set Pk+ ([0, 1]) of (coefficients of) polynomials of degree ≤ k which are nonnegative on
the segment [0, 1], is SDr.
Indeed, a polynomial p(t) of degree ≤ k is nonnegative on [0, 1] if and only if the rational
function 2
t
g(t) = p
1 + t2
3.3. APPLICATIONS OF SEMIDEFINITE PROGRAMMING IN ENGINEERING 153
is nonnegative on the entire axis, or, which is the same, if and only if the polynomial
p+ (t) = (1 + t2 )k g(t)
Identifying such a polynomial with its vector of coefficients Coef(p) ∈ R2k+1 , we may ask how to
express the set Sk+ (∆) of those trigonometric polynomials of degree ≤ k which are nonnegative
on a segment ∆ ⊂ [0, 2π].
21c. Trigonometric polynomials nonnegative on a segment. The set Sk+ (∆) is SDr.
Indeed, sin(`φ) and cos(`φ) are polynomials of sin(φ) and cos(φ), and the latter functions,
in turn, are rational functions of ζ = tan(φ/2):
1 − ζ2 2ζ
cos(φ) = 2
, sin(φ) = [ζ = tan(φ/2)].
1+ζ 1 + ζ2
Consequently, a trigonometric polynomial p(φ) of degree ≤ k can be represented as a rational
function of ζ = tan(φ/2):
p+ (ζ)
p(φ) = [ζ = tan(φ/2)],
(1 + ζ 2 )k
where the coefficients of the algebraic polynomial p+ of degree ≤ 2k are linear functions
of the coefficients of p. Now, the requirement for p to be nonnegative on a given segment
∆ ⊂ [0, 2π] is equivalent to the requirement for p+ to be nonnegative on a “segment” ∆+
(which, depending on ∆, may be either the usual finite segment, or a ray, or the entire axis).
We see that Sk+ (∆) is inverse image, under certain linear mapping, of the SDr set P2k
+
(∆+ ),
+
so that Sk (∆) itself is SDr.
Finally, we may ask which part of the above results can be saved when we pass from nonneg-
ative polynomials of one variable to those of two or more variables. Unfortunately, not too much.
E.g., among nonnegative polynomials of a given degree with r > 1 variables, exactly those of
them who are sums of squares can be obtained as the image of a positive semidefinite cone under
certain linear mapping similar to (D). The difficulty is that in the multi-dimensional case the
nonnegativity of a polynomial is not equivalent to its representability as a sum of squares, thus,
the positive semidefinite cone gives only part of the polynomials we are interested to describe.
d2
M x(t) = −Ax(t), (N)
dt2
where x(t) ∈ Rn is the state vector of the system at time t, M is the (generalized) “mass
matrix”, and A is the “stiffness” matrix of the system. Basically, (N) is the Newton law for a
system with the potential energy 21 xT Ax.
As a simple example, consider a system of k points of masses µ1 , ..., µk linked by springs with
given elasticity coefficients; here x is the vector of the displacements xi ∈ Rd of the points
from their equilibrium positions ei (d = 1/2/3 is the dimension of the model). The Newton
equations become
d2 X
µi x i (t) = − νij (ei − ej )(ei − ej )T (xi − xj ), i = 1, ..., k,
dt2
j6=i
l x
d2
dt2 x(t) = −νx(t), ν = κl .
Another example is given by trusses – mechanical constructions, like a railway bridge or the
Eiffel Tower, built from linked to each other thin elastic bars.
Note that in the above examples both the mass matrix M and the stiffness matrix A are
symmetric positive semidefinite; in “nondegenerate” cases they are even positive definite, and
this is what we assume from now on. Under this assumption, we can pass in (N) from the
variables x(t) to the variables y(t) = M 1/2 x(t); in these variables the system becomes
d2
y(t) = −Ây(t), Â = M −1/2 AM −1/2 . (N0 )
dt2
It is well known that the space of solutions of system (N0 ) (where  is symmetric positive
definite) is spanned by fundamental (perhaps complex-valued) solutions of the form exp{µt}f .
A nontrivial (with f 6= 0) function of this type is a solution to (N0 ) if and only if
(µ2 I + Â)f = 0,
so that the allowed values of µ2 are the minus eigenvalues of the matrix Â, and f ’s are the
corresponding eigenvectors of Â. Since the matrix  is symmetric positive definite, the only
3.3. APPLICATIONS OF SEMIDEFINITE PROGRAMMING IN ENGINEERING 155
q
allowed values of µ are purely imaginary, with the imaginary parts ± λj (Â). Recalling that
the eigenvalues/eigenvectors of  are exactly the eigenvalues/eigenvectors of the pencil [M, A],
we come to the following result:
(!) In the case of positive definite symmetric M, A, the solutions to (N) – the “free
motions” of the corresponding mechanical system S – are of the form
n
X
x(t) = [aj cos(ωj t) + bj sin(ωj t)]ej ,
j=1
where aj , bj are free real parameters, ej are the eigenvectors of the pencil [M, A]:
(λj M − A)ej = 0
p
and ωj = λj . Thus, the “free motions” of the system S are mixtures of har-
monic oscillations along the eigenvectors of the pencil [M, A], and the frequencies
of the oscillations (“the eigenfrequencies of the system”) are the square roots of the
corresponding eigenvalues of the pencil.
From the engineering viewpoint, the “dynamic behaviour” of mechanical constructions such
as buildings, electricity masts, bridges, etc., is the better the larger are the eigenfrequencies of
the system7) . This is why a typical design requirement in mechanical engineering is a lower
bound
λmin (A : M ) ≥ λ∗ [λ∗ > 0] (3.3.1)
on the smallest eigenvalue λmin (A : M ) of the pencil [M, A] comprised of the mass and the
stiffness matrices of the would-be system. In the case of positive definite symmetric mass
matrices (3.3.1) is equivalent to the matrix inequality
A − λ∗ M 0. (3.3.2)
7)
Think about a building and an earthquake or about sea waves and a light house: in this case the external
load acting at the system is time-varying and can be represented as a sum of harmonic oscillations of different
(and low) frequencies; if some of these frequencies are close to the eigenfrequencies of the system, the system can
be crushed by resonance. In order to avoid this risk, one is interested to move the eigenfrequencies of the system
away from 0 as far as possible.
156 LECTURE 3. SEMIDEFINITE PROGRAMMING
If M and A are affine functions of the design variables (as is the case in, e.g., Truss Design), the
matrix inequality (3.3.2) is a linear matrix inequality on the design variables, and therefore it
can be processed via the machinery of semidefinite programming. Moreover, in the cases when
A is affine in the design variables, and M is constant, (3.3.2) is an LMI in the design variables
and λ∗ , and we may play with λ∗ , e.g., solve a problem of the type “given the mass matrix of
the system to be designed and a number of (SDr) constraints on the design variables, build a
system with the minimum eigenfrequency as large as possible”, which is a semidefinite program,
provided that the stiffness matrix is affine in the design variables.
σOA
O O O
A simple circuit
Element OA: outer supply of voltage VOA and resistor with conductance σOA
Element AO: capacitor with capacitance CAO
Element AB: resistor with conductance σAB
Element BO: capacitor with capacitance CBO
E.g., a chip is, electrically, a complicated circuit comprised of elements of the indicated type.
When designing chips, the following characteristics are of primary importance:
• Speed. In a chip, the outer voltages are switching at certain frequency from one constant
value to another. Every switch is accompanied by a “transition period”; during this period,
the potentials/currents in the elements are moving from their previous values (correspond-
ing to the static steady state for the “old” outer voltages) to the values corresponding to
the new static steady state. Since there are elements with “inertia” – capacitors – this
transition period takes some time8 ). In order to ensure stable performance of the chip, the
transition period should be much less than the time between subsequent switches in the
outer voltages. Thus, the duration of the transition period is responsible for the speed at
which the chip can perform.
• Dissipated heat. Resistors in the chip dissipate heat which should be eliminated, otherwise
the chip will not function. This requirement is very serious for modern “high-density”
chips. Thus, a characteristic of vital importance is the dissipated heat power.
The two objectives: high speed (i.e., a small transition period) and small dissipated heat –
usually are conflicting. As a result, a chip designer faces the tradeoff problem like “how to get
a chip with a given speed and with the minimal dissipated heat”. It turns out that different
8)
From purely mathematical viewpoint, the transition period takes infinite time – the currents/voltages ap-
proach asymptotically the new steady state, but never actually reach it. From the engineering viewpoint, however,
we may think that the transition period is over when the currents/voltages become close enough to the new static
steady state.
3.3. APPLICATIONS OF SEMIDEFINITE PROGRAMMING IN ENGINEERING 157
optimization problems related to the tradeoff between the speed and the dissipated heat in an
RC circuit belong to the “semidefinite universe”. We restrict ourselves with building an SDR
for the speed.
Simple considerations, based on Kirchoff laws, demonstrate that the transition period in an
RC circuit is governed by a linear system of differential equations as follows:
d
C w(t) = −Sw(t) + Rv. (3.3.3)
dt
Here
• The state vector w(·) is comprised of the potentials at all but one nodes of the circuit (the
potential at the remaining node – “the ground” – is normalized to be identically zero);
• Matrix C 0 is readily given by the topology of the circuit and the capacitances of the
capacitors and is linear in these capacitances. Similarly, matrix S 0 is readily given
by the topology of the circuit and the conductances of the resistors and is linear in these
conductances. Matrix R is given solely by the topology of the circuit;
• v is the vector of outer voltages; recall that this vector is set to certain constant value at
the beginning of the transition period.
As we have already mentioned, the matrices C and S, due to their origin, are positive semidefi-
nite; in nondegenerate cases, they are even positive definite, which we assume from now on.
Let w b = Rv. The difference δ(t) = w(t) − w
b be the steady state of (3.3.3), so that S w b is a
solution to the homogeneous differential equation
d
C δ(t) = −Sδ(t). (3.3.4)
dt
Setting γ(t) = C 1/2 δ(t) (cf. Section 3.3.1), we get
d
γ(t) = −(C −1/2 SC −1/2 )γ(t). (3.3.5)
dt
Since C and S are positive definite, all eigenvalues λi of the symmetric matrix C −1/2 SC −1/2 are
positive. It is clear that the space of solutions to (3.3.5) is spanned by the “eigenmotions”
where {ei } is an orthonormal eigenbasis of the matrix C −1/2 SC −1/2 . We see that all solutions
to (3.3.5) (and thus - to (3.3.4) as well) are exponentially fast converging to 0, or, which is
the same, the state w(t) of the circuit exponentially fast approaches the steady state w. b The
“time scale” of this transition is, essentially, defined by the quantity λmin = min λi ; a typical
i
“decay rate” of a solution to (3.3.5) is nothing but T = λ−1
min . S. Boyd has proposed to use T to
quantify the length of the transition period, and to use the reciprocal of it – i.e., the quantity
λmin itself – as the quantitative measure of the speed. Technically, the main advantage of this
definition is that the speed turns out to be the minimum eigenvalue of the matrix C −1/2 SC −1/2 ,
i.e., the minimum eigenvalue of the matrix pencil [C : S]. Thus, the speed in Boyd’s definition
turns out to be efficiently computable (which is not the case for other, more sophisticated, “time
constants” used by engineers). Even more important, with Boyd’s approach, a typical design
158 LECTURE 3. SEMIDEFINITE PROGRAMMING
specification “the speed of a circuit should be at least such and such” is modelled by the matrix
inequality
S λ∗ C. (3.3.6)
As it was already mentioned, S and C are linear in the capacitances of the capacitors and
conductances of the resistors; in typical circuit design problems, the latter quantities are affine
functions of the design parameters, and (3.3.6) becomes an LMI in the design parameters.
d
x(t) = A(t)x(t), x(0) = x0 . (ULS)
dt
Here x(t) ∈ Rn represents the state of the system at time t, and A(t) is a time-varying n × n
matrix. We assume that the system is uncertain in the sense that we have no idea of what is x0 ,
and all we know about A(t) is that this matrix, at any time t, belongs to a given uncertainty set
U. Thus, (ULS) represents a wide family of linear dynamic systems rather than a single system;
and it makes sense to call a trajectory of the uncertain linear system (ULS) every function x(t)
which is an “actual trajectory” of a system from the family, i.e., is such that
d
x(t) = A(t)x(t)
dt
for all t ≥ 0 and certain matrix-valued function A(t) taking all its values in U.
d
x(t) = f (t, x(t)) [x ∈ Rn ] (NLS)
dt
with a given right hand side f (t, x) and a given equilibrium x(t) ≡ 0 (i.e., f (t, 0) = 0, t ≥ 0)
as an uncertain linear system.
∂ Indeed, let us define the set Uf as the closed convex hull of
the set of n × n matrices ∂x f (t, x) | t ≥ 0, x ∈ Rn . Then for every point x ∈ Rn we have
Rs ∂
f (t, x) = f (t, 0) + ∂x f (t, sx) xds = Ax (t)x,
0
R1 ∂
Ax (t) = ∂x f (t, sx)ds ∈ U.
0
We see that every trajectory of the original nonlinear system (NLS) is also a trajectory of the
uncertain linear system (ULS) associated with the uncertainty set U = Uf (this trick is called
“global lumbarization”). Of course, the set of trajectories of the resulting uncertain linear
system can be much wider than the set of trajectories of (NLS); however, all “good news”
about the uncertain system (like “all trajectories of (ULS) share such and such property”)
are automatically valid for the trajectories of the “nonlinear system of interest” (NLS), and
only “bad news” about (ULS) (“such and such property is not shared by some trajectories
of (ULS)”) may say nothing about the system of interest (NLS).
3.3. APPLICATIONS OF SEMIDEFINITE PROGRAMMING IN ENGINEERING 159
x(t) → 0 as t → ∞
If xT Xx is a Lyapunov function, then the resulting quantity must be at most −αxT (t)Xx(t),
i.e., we should have h i
xT (t) −αX − AT (t)X − XA(t) x(t) ≥ 0
for every possible value of A(t) at any time t and for every possible value x(t) of a trajectory
of the system at this time. Since possible values of x(t) fill the entire Rn and possible values of
A(t) fill the entire U, we conclude that
−αX − AT X − XA 0 ∀A ∈ U.
−ŝX̂ − AT X̂ − X̂A 0 ∀A ∈ U.
thus, (s = −ŝ, X̂) is a feasible solution to (Ly) with negative value of the objective. We have
demonstrated that if (ULS) admits a quadratic Lyapunov function, then (Ly) has a feasible
solution with negative value of the objective. Reversing the reasoning, we can verify the inverse
implication.
Polytopic uncertainty set. The first “tractable case” of (Ly) is when U is a polytope
given as a convex hull of finitely many points:
U = Conv{A1 , ..., AN }.
10)
The only case when the existence of a quadratic Lyapunov function is a criterion (i.e., a necessary and
d
sufficient condition) for stability is the simplest case of certain time-invariant linear system dt x(t) = Ax(t)
(U = {A}). This is the case which led Lyapunov to the general concept of what is now called “a Lyapunov
function” and what is the basic approach to establishing convergence of different time-dependent processes to
their equilibria. Note also that in the case of time-invariant linear system there exists a straightforward algebraic
stability criterion – all eigenvalues of A should have negative real parts. The advantage of the Lyapunov approach
is that it can be extended to more general situations, which is not the case for the eigenvalue criterion.
3.3. APPLICATIONS OF SEMIDEFINITE PROGRAMMING IN ENGINEERING 161
(why?).
The assumption that U is a polytope given as a convex hull of a finite set is crucial for a
possibility to get a “computationally tractable” equivalent reformulation of (Ly). If U is, say,
a polytope given by a list of linear inequalities (e.g., all we know about the entries of A(t) is
that they reside in certain intervals; this case is called “interval uncertainty”), (Ly) may become
as hard as a problem can be: it may happen that just to check whether a given pair (s, X) is
feasible for (Ly) is already a “computationally intractable” problem. The same difficulties may
occur when U is a general-type ellipsoid in the space of n × n matrices. There exists, however,
a specific type of “uncertainty ellipsoids” U for which (Ly) is “easy”. Let us look at this case.
(x is the state, u is the control and y is the output we can observe) “closed” by a feedback
u(t) = Ky(t).
u(t) = K y(t)
Now assume that the input perturbations ∆ are of spectral norm |∆| not exceeding a given ρ
(norm-bounded perturbations):
Proposition 3.3.2 [8] In the case of uncertainty set (3.3.9), (3.3.12) the “semi-infinite”
semidefinite program (Ly) is equivalent to the usual semidefinite program
minimize α
s.t.
αIn − AT∗ X − XA∗ − λC T C
ρXB (3.3.13)
0
ρB T X λIk
X In
in the design variables α, λ, X.
When shrinking the set of perturbations (3.3.12) to the ellipsoid
v
u
uXk X
l
E = {∆ ∈ Mk,l | k∆k2 ≡ t ∆2ij ≤ ρ}, 11 ) (3.3.14)
u
i=1 j=1
we basically do not vary (Ly): in the case of the uncertainty set (3.3.9), (Ly) is still equivalent
to (3.3.13).
Proof. It suffices to verify the following general statement:
Y − QT ∆T P T Z T R − RT ZP ∆Q 0 (3.3.15)
The statement of Proposition 3.4.14 is just a particular case of Lemma 3.3.2. For example, in
the case of uncertainty set (3.3.9), (3.3.12) a pair (α, X) is a feasible solution to (Ly) if and only
if X In and (3.3.15) is valid for Y = αX − AT∗ X − XA∗ , P = B, Q = C, Z = X, R = In ;
Lemma 3.3.2 provides us with an LMI reformulation of the latter property, and this LMI is
exactly what we see in the statement of Proposition 3.4.14.
Proof of Lemma. (3.3.15) is valid for all ∆ with |∆| ≤ ρ (let us call this property of (Y, P, Q, Z, R)
“Property 1”) if and only if
11)
This indeed is a “shrinkage”: |∆| ≤ k∆k2 for every matrix ∆ (prove it!)
3.3. APPLICATIONS OF SEMIDEFINITE PROGRAMMING IN ENGINEERING 163
The maximum over ∆, |∆| ≤ ρ, of the quantity η T ∆ζ, clearly is equal to ρ times the product of the
Euclidean norms of the vectors η and ζ (why?). Thus, Property 2 is equivalent to
ξ T QT Qξ − η T η ≥ 0 (I)
ξ T Y ξ − 2ρη T P T Z T Rξ ≥ 0. (II)
Indeed, for a fixed ξ the minimum over η satisfying (I) of the left hand side in (II) is
nothing but the left hand side in (Property 3).
xT Ax ≥ 0 (A)
is strictly feasible: there exists x̄ such that x̄T Ax̄ > 0. Then the quadratic inequality
xT Bx ≥ 0 (B)
is a consequence of (A) if and only if it is a linear consequence of (A), i.e., if and only if there
exists a nonnegative λ such that
B λA.
(for a proof, see Section 3.5). Property 4 says that the quadratic inequality (II) with variables ξ, η is a
consequence of (I); by the S-Lemma (recall that Q 6= 0, so that (I) is strictly feasible!) this is equivalent
to the existence of a nonnegative λ such that
T
Y −ρRT ZP Q Q
−λ 0,
−ρP T Z T R −Ik
which is exactly the statement of Lemma 3.3.2 for the case of |∆| ≤ ρ. The case of perturbations with
k∆k2 ≤ ρ is completely similar, since the equivalence between Properties 2 and 3 is valid independently
of which norm of ∆ – | · | or k · k2 – is used.
uncertainty set U. The question is whether we can equip our uncertain “open loop” system
(UOS) with a linear feedback
u(t) = Ky(t)
d
x(t) = [A(t) + B(t)KC(t)] x(t) (UCS)
dt
will be stable and, moreover, such that its stability can be certified by a quadratic Lyapunov
function. In other words, now we are simultaneously looking for a “stabilizing controller” and a
quadratic Lyapunov certificate of its stabilizing ability.
With the “global lumbarization” trick we may use the results on uncertain controlled linear
systems to build stabilizing linear controllers for nonlinear controlled systems
d
dt x(t) = f (t, x(t), u(t))
y(t) = g(t, x(t))
Assuming f (t, 0, 0) = 0, g(t, 0) = 0 and denoting by U the closed convex hull of the set
∂ ∂ ∂
g(t, x) t ≥ 0, x ∈ Rn , u ∈ Rk ,
f (t, x, u), f (t, x, u),
∂x ∂u ∂x
we see that every trajectory of the original nonlinear system is a trajectory of the uncertain
linear system (UOS) associated with the set U. Consequently, if we are able to find a
stabilizing controller for (UOS) and certify its stabilizing property by a quadratic Lyapunov
function, then the resulting controller/Lyapunov function will stabilize the nonlinear system
and will certify the stability of the closed loop system, respectively.
Exactly the same reasoning as in the previous Section leads us to the following
Proposition 3.3.3 Let U be the uncertainty set associated with an uncertain open loop con-
trolled system (UOS). The system admits a stabilizing controller along with a quadratic Lya-
punov stability certificate for the resulting closed loop system if and only if the optimal value in
the optimization problem
minimize s
s.t.
(LyS)
[A + BKC]T X + X[A + BKC] sIn ∀(A, B, C) ∈ U
X In ,
in design variables s, X, K, is negative. Moreover, every feasible solution to the problem with
negative value of the objective provides stabilizing controller along with quadratic Lyapunov sta-
bility certificate for the resulting closed loop system.
A bad news about (LyS) is that it is much more difficult to rewrite this problem as a semidefinite
program than in the analysis case (i.e., the case of K = 0), since (LyS) is a semi-infinite system
of nonlinear matrix inequalities. There is, however, an important particular case where this
difficulty can be eliminated. This is the case of a feedback via the full state vector – the case
when y(t) = x(t) (i.e., C(t) is the unit matrix). In this case, all we need in order to get a
3.3. APPLICATIONS OF SEMIDEFINITE PROGRAMMING IN ENGINEERING 165
stabilizing controller along with a quadratic Lyapunov certificate of its stabilizing ability, is to
solve a system of strict matrix inequalities
Indeed, given a solution (X, K, Z) to this system, we always can convert it by normalization of
X to a solution of (LyS). Now let us make the change of variables
h i
Y = X −1 , L = KX −1 , W = X −1 ZX −1 ⇔ X = Y −1 , K = LY −1 , Z = Y −1 W Y −1 .
(we have multiplied all original matrix inequalities from the left and from the right by Y ).
What we end up with, is a system of strict linear matrix inequalities with respect to our new
design variables L, Y, W ; the question of whether this system is solvable can be converted to the
question of whether the optimal value in a problem of the type (LyS) is negative, and we come
to the following
Proposition 3.3.4 Consider an uncertain controlled linear system with a full observer:
d
dt x(t)= A(t)x(t) + B(t)u(t)
y(t) = x(t)
and let U be the corresponding uncertainty set (which now is comprised of pairs (A, B) of possible
values of (A(t), B(t)), since C(t) ≡ In is certain).
The system can be stabilized by a linear controller
minimize s
s.t.
(Ly∗ )
BL + AY + LT B T + Y AT sIn ∀(A, B) ∈ U
Y I
the Quadratic Lyapunov Stability Synthesis reduces to solving the semidefinite program
n o
min s : Bi L + Ai Y + Y ATi + LT BiT sIn , i = 1, ..., N ; Y In .
s,Y,L
(Stones) Given n stones of positive integer weights (i.e., given n positive integers
a1 , ..., an ), check whether you can partition these stones into two groups of equal
weight, i.e., check whether a linear equation
n
X
ai xi = 0
i=1
f ∗ = min{f (x) : x ∈ X}
x
3.4. SEMIDEFINITE RELAXATIONS OF INTRACTABLE PROBLEMS 167
from above is to present a feasible solution x̄; then clearly f ∗ ≤ f (x̄). And a typical way to
bound the optimal value from below is to pass from the problem to its relaxation
f∗ = min{f (x) : x ∈ X 0 }
x
increasing the feasible set: X ⊂ X 0 . Clearly, f∗ ≤ f ∗ , so, whenever the relaxation is efficiently
solvable (to ensure this, we should take care of how we choose X 0 ), it provides us with a
“computable” lower bound on the actual optimal value.
When building a relaxation, one should take care of two issues: on one hand, we want the
relaxation to be “efficiently solvable”. On the other hand, we want the relaxation to be “tight”,
otherwise the lower bound we get may be by far “too optimistic” and therefore not useful. For a
long time, the only practical relaxations were the LP ones, since these were the only problems one
could solve efficiently. With recent progress in optimization techniques, nonlinear relaxations
become more and more “practical”; as a result, we are witnessing a growing theoretical and
computational activity in the area of nonlinear relaxations of combinatorial problems. These
developments mostly deal with semidefinite relaxations. Let us look how they emerge.
To see that this form is “universal”, note that it covers the classical universal combinatorial
problem – a generic LP program with Boolean (0-1) variables:
n o
min cT x : aTi x − bi ≤ 0, i = 1, ..., m; xj ∈ {0, 1}, j = 1, ..., n (B)
x
Indeed, the fact that a variable xj must be Boolean can be expressed by the quadratic equality
x2j − xj = 0,
which can be represented by a pair of opposite quadratic inequalities and a linear inequality
aTi x − bi ≤ 0 is a particular case of quadratic inequality. Thus, (B) is equivalent to the problem
n o
min cT x : aTi x − bi ≤ 0, i = 1, ..., m; x2j − xj ≤ 0, −x2j + xj ≤ 0 j = 1, ..., n ,
x,s
where m
P
A(λ) = A0 + λi Ai
i=1
m
P
b(λ) = b0 + λi bi
i=1
m
P
c(λ) = c0 + λi ci
i=1
By construction, the function fλ (x) is ≤ the actual objective f0 (x) on the feasible set of the
problem (3.4.1). Consequently, the unconstrained infimum of this function
is a lower bound for the optimal value in (3.4.1). We come to the following simple result (cf.
the Weak Duality Theorem:)
(*) Assume that λ ∈ Rm
+ and ζ ∈ R are such that
fλ (x) − ζ ≥ 0 ∀x ∈ Rn (3.4.3)
(i.e., that ζ ≤ a(λ)). Then ζ is a lower bound for the optimal value in (3.4.1).
It remains to clarify what does it mean that (3.4.3) holds. Recalling the structure of fλ , we see
that it means that the inhomogeneous quadratic form
g(x) = xT Ax + 2bT x + c
is nonnegative everywhere if and only if certain associated homogeneous quadratic form is non-
negative everywhere. Indeed, given t 6= 0 and x ∈ Rn , the fact that g(t−1 x) ≥ 0 means exactly
the nonnegativity of the homogeneous quadratic form G(x, t)
xT Ax + 2bT x + c ≥ 0
is identically true – is valid for all x ∈ Rn – if only if the matrix (3.4.4) is positive
semidefinite.
3.4. SEMIDEFINITE RELAXATIONS OF INTRACTABLE PROBLEMS 169
Applying this observation to gλ (x), we get the following equivalent reformulation of (*):
If (λ, ζ) ∈ Rm
+ × R satisfy the LMI
m
P m
λi ci − ζ bT0 + λi bTi
P
i=1 i=1
0,
m
P m
P
b0 + λi bi A0 + λi Ai
i=1 i=1
then ζ is a lower bound for the optimal value in (3.4.1).
Now, what is the best lower bound we can get with this scheme? Of course, it is the optimal
value of the semidefinite program
m m
λi ci − ζ bT0 + λi bTi
P P
c0 +
i=1 i=1
max ζ : 0, λ ≥ 0 . (3.4.5)
m m
ζ,λ P P
b0 + λi bi A0 + λi Ai
i=1 i=1
We have proved the following simple
Proposition 3.4.1 The optimal value in (3.4.5) is a lower bound for the optimal value in
(3.4.1).
The outlined scheme is extremely transparent, but it looks different from a relaxation scheme
as explained above – where is the extension of the feasible set of the original problem? In fact
the scheme is of this type. To see it, note that the value of a quadratic form at a point x ∈ Rn
can be written as the Frobenius inner product of a matrix defined by the problem data and the
T
1 1
dyadic matrix X(x) = :
x x
T
bT bT
1 c 1 c
xT Ax + 2bT x + c = = Tr X(x) .
x b A x b A
Consequently, (3.4.1) can be written as
bT0 bTi
c0 ci
min Tr X(x) : Tr X(x) ≤ 0, i = 1, ..., m . (3.4.6)
x b0 A0 bi Ai
Thus, we may think of (3.4.2) as a problem with linear objective and linear equality constraints
and with the design vector X which is a symmetric (n + 1) × (n + 1) matrix running through
the nonlinear manifold X of dyadic matrices X(x), x ∈ Rn . Clearly, all points of X are positive
semidefinite matrices with North-Western entry 1. Now let X̄ be the set of all such matrices.
Replacing X by X̄ , we get a relaxation of (3.4.6) (the latter problem is, essentially, our original
problem (3.4.1)). This relaxation is the semidefinite program
: Tr(Āi X) ≤
minX Tr(Ā0 X) 0; X11 = 1
0, i = 1, ..., m; X
ci bTi
(3.4.7)
Ai = , i = 1, ..., m .
bi Ai
We get the following
Proposition 3.4.2 The optimal value of the semidefinite program (3.4.7) is a lower bound for
the optimal value in (3.4.1).
One can easily verify that problem (3.4.5) is just the semidefinite dual of (3.4.7); thus, when
deriving (3.4.5), we were in fact implementing the idea of relaxation. This is why in the sequel
we call both (3.4.7) and (3.4.5) semidefinite relaxations of (3.4.1).
170 LECTURE 3. SEMIDEFINITE PROGRAMMING
• Convex case, where all quadratic forms in (3.4.1) are convex (i.e., Qi 0, i = 0, 1, ..., m).
Here, strict feasibility of the problem (i.e., the existence of a feasible solution x̄ with
fi (x̄) < 0, i = 1, ..., m) plus below boundedness of it imply that (3.4.7) is solvable with
the optimal value equal to the one of (3.4.1). This statement is a particular case of the
well-known Lagrange Duality Theorem in Convex Programming.
• The case of m = 1. Here the optimal value in (3.4.1) is equal to the one in (3.4.7), provided
that (3.4.1) is strictly feasible. This highly surprising fact (no convexity is assumed!) is
called Inhomogeneous S-Lemma; we shall prove it in Section .
E C
D
Graph C5
One of the fundamental characteristics of a graph Γ is its stability number α(Γ) defined as the
maximum cardinality of an independent subset of nodes – a subset such that no two nodes
from it are linked by an arc. E.g., the stability number for the graph C5 is 2, and a maximal
independent set is, e.g., {A; C}.
The problem of computing the stability number of a given graph is NP-complete, this is why
it is important to know how to bound this number.
Shannon capacity of a graph. An upper bound on the stability number of a graph which
interesting by its own right is the Shannon capacity Θ(Γ) defined as follows.
Let us treat the nodes of Γ as letters of certain alphabet, and the arcs as possible errors in
certain communication channel: you can send trough the channel one letter per unit time, and
what arrives on the other end of the channel can be either the letter you have sent, or any letter
adjacent to it. Now assume that you are planning to communicate with an addressee through the
12)
One of the formal definitions of a (non-oriented) graph is as follows: a n-node graph is just a n × n symmetric
matrix A with entries 0, 1 and zero diagonal. The rows (and the columns) of the matrix are identified with the
nodes 1, 2, ..., n of the graph, and the nodes i, j are adjacent (i.e., linked by an arc) exactly for those i, j with
Aij = 1.
3.4. SEMIDEFINITE RELAXATIONS OF INTRACTABLE PROBLEMS 171
channel by sending n-letter words (n is fixed). You fix in advance a dictionary Dn of words to be
used and make this dictionary known to the addressee. What you are interested in when building
the dictionary is to get a good one, meaning that no word from it could be transformed by the
channel into another word from the dictionary. If your dictionary satisfies this requirement, you
may be sure that the addressee will never misunderstand you: whatever word from the dictionary
you send and whatever possible transmission errors occur, the addressee is able either to get the
correct message, or to realize that the message was corrupted during transmission, but there
is no risk that your “yes” will be read as “no!”. Now, in order to utilize the channel “at full
capacity”, you are interested to get as large dictionary as possible. How many words it can
include? The answer is clear: this is precisely the stability number of the graph Γn as follows:
the nodes of Γn are ordered n-element collections of the nodes of Γ – all possible n-letter words
in your alphabet; two distinct nodes (i1 , ..., in ) (j1 , ..., jn ) are adjacent in Γn if and only if for
every l the l-th letters il and jl in the two words either coincide, or are adjacent in Γ (i.e., two
distinct n-letter words are adjacent, if the transmission can convert one of them into the other
one). Let us denote the maximum number of words in a “good” dictionary Dn (i.e., the stability
number of Γn ) by f (n), The function f (n) possesses the following nice property:
Indeed, given the best (of the cardinality f (k)) good dictionary Dk and the best good
dictionary Dl , let us build a dictionary comprised of all (k + l)-letter words as follows: the
initial k-letter fragment of a word belongs to Dk , and the remaining l-letter fragment belongs
to Dl . The resulting dictionary is clearly good and contains f (k)f (l) words, and (*) follows.
Now, this is a simple exercise in analysis to see that for a nonnegative function f with property
(*) one has
lim (f (k))1/k = sup(f (k))1/k ∈ [0, +∞].
k→∞ k≥1
In our situation sup(f (k))1/k < ∞, since clearly f (k) ≤ nk , n being the number of letters (the
k≥1
number of nodes in Γ). Consequently, the quantity
is well-defined; moreover, for every k the quantity (f (k))1/k is a lower bound for Θ(Γ). The
number Θ(Γ) is called the Shannon capacity of Γ. Our immediate observation is that
α(Γ) ≤ Θ(Γ).
Indeed, as we remember, (f (k))1/k is a lower bound for Θ(Γ) for every k = 1, 2, ...; setting
k = 1 and taking into account that f (1) = α(Γ), we get the desired result.
We see that the Shannon capacity number is an upper bound on the stability number; and this
bound has a nice interpretation in terms of the Information Theory. The bad news is that we
do not know how to compute the Shannon capacity. E.g., what is it for the toy graph C5 ?
The stability number of C5 clearly is 2, so that our first observation is that
Θ(C5 ) ≥ α(C5 ) = 2.
172 LECTURE 3. SEMIDEFINITE PROGRAMMING
To get a better estimate, let us look the graph (C5 )2 (as we remember, Θ(Γ) ≥ (f (k))1/k =
(α(Γk ))1/k for every k). The graph (C5 )2 has 25 nodes, so that we do not draw it; it, however,
is not that difficult to find its stability number, which turns out to be 5. A good 5-element
dictionary (≡ a 5-node independent set in (C5 )2 ) is, e.g.,
Thus, we get q √
Θ(C5 ) ≥ α((C5 )2 ) = 5.
Attempts to compute the subsequent lower bounds (f (k))1/k , as long as they are implementable
(think how many vertices there are in (C5 )4 !), do not
√ yield any √
improvements, and for more than
20 years it remained unknown whether Θ(C5 ) = 5 or is > 5. And this is for a toy graph!
The breakthrough in the area of upper bounds for the stability number is due to L. Lovasz who
in early 70’s found a new – computable! – bound of this type.
Lovasz capacity number. Given a n-node graph Γ, let us associate with it an affine matrix-
valued function L(x) taking values in the space of n × n symmetric matrices, namely, as follows:
• For every pair i, j of indices (1 ≤ i, j ≤ n) such that the nodes i and j are not linked by
an arc, the ij-th entry of L is equal to 1;
• For a pair i < j of indices such that the nodes i, j are linked by an arc, the ij-th and the
ji-th entries in L are equal to xij – to the variable associated with the arc (i, j).
Thus, L(x) is indeed an affine function of N design variables xij , where N is the number of
arcs in the graph. E.g., for graph C5 the function L is as follows:
1 xAB 1 1 xEA
x 1 xBC 1 1
AB
L= 1 xBC 1 xCD 1 .
1 1 xCD 1 xDE
xEA 1 1 xDE 1
Now, the Lovasz capacity number ϑ(Γ) is defined as the optimal value of the optimization
program
min {λmax (L(x))} ,
x
Proposition 3.4.3 [Lovasz] The Lovasz capacity number is an upper bound for the Shannon
capacity:
ϑ(Γ) ≥ Θ(Γ)
and, consequently, for the stability number:
Indeed, 0-1 n-dimensional vectors can be identified with sets of nodes of Γ: the coordinates
xi of the vector x representing a set A of nodes are ones for i ∈ A and zeros otherwise. The
quadratic equality constraints xi xj = 0 for such a vector express equivalently the fact that the
corresponding set of nodes is independent, and the objective eT x counts the cardinality of this
set.
As we remember, the 0-1 restrictions on the variables can be represented equivalently by
quadratic equality constraints, so that the stability number of Γ is the optimal value of the
following problem with quadratic (in fact linear) objective and quadratic equality constraints:
maximize eT x
s.t.
(3.4.8)
xi xj = 0, (i, j) is an arc
2
xi − xi = 0, i = 1, ..., n.
The latter problem is in the form of (3.4.1), with the only difference that the objective should
be maximized rather than minimized. Switching from maximization of eT x to minimization of
(−e)T x and passing to (3.4.5), we get the problem
−ζ − 21 (e + µ)T
max ζ : 0 ,
ζ,µ − 12 (e + µ) A(µ, λ)
• The off-diagonal cells ij corresponding to non-adjacent nodes i, j (“empty cells”) are zeros;
• The off-diagonal cells ij, i < j, and the symmetric cells ji corresponding to adjacent nodes
i, j (“arc cells”) are filled with free variables λij .
Note that the optimal value in the resulting problem is a lower bound for minus the optimal
value of (3.4.8), i.e., for minus the stability number of Γ.
Passing in the resulting problem from the variable ζ to a new variable ξ = −ζ and again
switching from maximization of ζ = −ξ to minimization of ξ, we end up with the semidefinite
program
− 12 (e + µ)T
ξ
min ξ : 0 . (3.4.9)
ξ,λ,µ − 12 (e + µ) A(µ, λ)
The optimal value in this problem is the minus optimal value in the previous one, which, in
turn, is a lower bound on the minus stability number of Γ; consequently, the optimal value in
(3.4.9) is an upper bound on the stability number of Γ.
174 LECTURE 3. SEMIDEFINITE PROGRAMMING
We have built a semidefinite relaxation (3.4.9) of the problem of computing the stability
number of Γ; the optimal value in the relaxation is an upper bound on the stability number.
To get the Lovasz relaxation, let us further fix the µ-variables at the level 1 (this may only
increase the optimal value in the problem, so that it still will be an upper bound for the stability
number)13) . With this modification, we come to the problem
−eT
ξ
min ξ : 0 .
ξ,λ −e A(e, λ)
In every feasible solution to the problem, ξ should be ≥ 1 (it is an upper bound for α(Γ) ≥ 1).
When ξ ≥ 1, the LMI
−eT
ξ
0
e A(e, λ)
by the Schur Complement Lemma is equivalent to the LMI
ξA(e, λ) − eeT 0.
The left hand side matrix in the latter LMI is equal to ξIn − B(ξ, λ), where the matrix B(ξ, λ)
is as follows:
• The “arc cells” from a symmetric pair off-diagonal pair ij and ji (i < j) are filled with
ξλij .
Passing from the design variables λ to the new ones xij = ξλij , we conclude that problem (3.4.9)
with µ’s set to ones is equivalent to the problem
How good is the Lovasz capacity number? The Lovasz capacity number plays a crucial
role in numerous graph-related problems; there is an important sub-family of graphs – perfect
graphs – for which this number coincides with the stability number. However, for a general-type
graph Γ, ϑ(Γ) may be a fairly poor bound for α(Γ). Lovasz has proved that for any graph Γ with
n nodes, ϑ(Γ)ϑ(Γ̂) ≥ n, where Γ̂ is the complement to Γ (i.e., two distinct nodes are adjacent in
Γ̂ if and only if they are not adjacent in Γ). It follows that for n-node graph Γ one always has
√
max[ϑ(Γ), ϑ(Γ̂)] ≥ n. On the other hand, it turns out that for a random n-node graph Γ (the
arcs are drawn at random and independently of each other, with probability 0.5 to draw an arc
13)
In fact setting µi = 1, we do not vary the optimal value at all.
3.4. SEMIDEFINITE RELAXATIONS OF INTRACTABLE PROBLEMS 175
linking two given distinct nodes) max[α(Γ), α(Γ̂)] is “typically” (with probability approaching
1 as n grows) of order of ln n. It follows that for random n-node graphs a typical value of the
ratio ϑ(Γ)/α(Γ) is at least of order of n1/2 / ln n; as n grows, this ratio blows up to ∞.
A natural question arises: are there “difficult” (NP-complete) combinatorial problems admit-
ting “good” semidefinite relaxations – those with the quality of approximation not deteriorating
as the sizes of instances grow? Let us look at two breakthrough results in this direction.
3.4.1.5 The MAXCUT problem and maximizing quadratic form over a box
The MAXCUT problem. The maximum cut problem is as follows:
Problem 3.4.1 [MAXCUT] Let Γ be an n-node graph, and let the arcs (i, j) of the graph be
associated with nonnegative “weights” aij . The problem is to find a cut of the largest possible
weight, i.e., to partition the set of nodes in two parts S, S 0 in such a way that the total weight
of all arcs “linking S and S 0 ” (i.e., with one incident node in S and the other one in S 0 ) is as
large as possible.
In the MAXCUT problem, we may assume that the weights aij = aji ≥ 0 are defined for every
pair i, j of indices; it suffices to set aij = 0 for pairs i, j of non-adjacent nodes.
In contrast to the minimum cut problem (where we should minimize the weight of a cut
instead of maximizing it), which is, basically, a nice LP program of finding the maximum flow
in a network and is therefore efficiently solvable, the MAXCUT problem is as difficult as a
combinatorial problem can be – it is NP-complete.
For this problem, the semidefinite relaxation (3.4.7) after evident simplifications becomes the
semidefinite program
n
1
aij (1 − Xij )
P
maximize 4
i,j=1
s.t. (3.4.11)
X = [Xij ]ni,j=1 = X T 0
Xii = 1, i = 1, ..., n;
176 LECTURE 3. SEMIDEFINITE PROGRAMMING
the optimal value in the latter problem is an upper bound for the optimal value of MAXCUT.
The fact that (3.4.11) is a relaxation of (3.4.10) can be established directly, independently
of any “general theory”: (3.4.10) is the problem of maximizing the objective
n n n
1 X 1 X 1 X 1
aij − aij xi xj ≡ aij − Tr(AX(x)), X(x) = xxT
4 i,j=1 2 i,j=1 4 i,j=1 4
over all rank 1 matrices X(x) = xxT given by n-dimensional vectors x with entries ±1. All
these matrices are symmetric positive semidefinite with unit entries on the diagonal, i.e.,
they belong the feasible set of (3.4.11). Thus, (3.4.11) indeed is a relaxation of (3.4.10).
The quality of the semidefinite relaxation (3.4.11) is given by the following brilliant result of
Goemans and Williamson (1995):
Theorem 3.4.1 Let OP T be the optimal value of the MAXCUT problem (3.4.10), and SDP
be the optimal value of the semidefinite relaxation (3.4.11). Then
Proof. The left inequality in (3.4.12) is what we already know – it simply says that semidef-
inite program (3.4.11) is a relaxation of MAXCUT. To get the right inequality, Goemans and
Williamson act as follows. Let X = [Xij ] be a feasible solution to the semidefinite relaxation.
Since X is positive semidefinite, it is the covariance matrix of a Gaussian random vector ξ with
zero mean, so that E {ξi ξj } = Xij . Now consider the random vector ζ = sign[ξ] comprised of
signs of the entries in ξ. A realization of ζ is almost surely a vector with coordinates ±1, i.e., it
is a cut. What is the expected weight of this cut? A straightforward computation demonstrates
that E {ζi ζj } = π2 asin(Xij ) 14) . It follows that
n
1 X n
1 X 2
E aij (1 − ζi ζi ) = aij 1 − asin(Xij ) . (3.4.13)
4 4 π
i,j=1 i,j=1
The left hand side in this inequality, by evident reasons, is ≤ OP T . We have proved that the
value of the objective in (3.4.11) at every feasible solution X to the problem is ≤ α · OP T ,
whence SDP ≤ α · OP T as well.
Note that the proof of Theorem 3.4.1 provides a randomized algorithm for building a sub-
optimal, within the factor α−1 = 0.878..., solution to MAXCUT: we find a (nearly) optimal
solution X to the semidefinite relaxation (3.4.11) of MAXCUT, generate a sample of, say, 100
realizations of the associated random cuts ζ and choose the one with the maximum weight.
14)
Recall that Xij 0 isnormalized
by the requirement Xii = 1 for all i. Omitting this normalization, we
would get E {ζi ζj } = 2
π
asin √ Xij .
Xii Xjj
3.4. SEMIDEFINITE RELAXATIONS OF INTRACTABLE PROBLEMS 177
π
3.4.1.6 Nesterov’s 2 Theorem
In the MAXCUT problem, we are in fact maximizing the homogeneous quadratic form
n
X n
X n
X
xT Ax ≡ aij x2i − aij xi xj
i=1 j=1 i,j=1
over the set Sn of n-dimensional vectors x with coordinates ±1. It is easily seen that the matrix
A of this form is positive semidefinite and possesses a specific feature that the off-diagonal
entries are nonpositive, while the sum of the entries in every row is 0. What happens when
we are maximizing over Sn a quadratic form xT Ax with a general-type (symmetric) matrix A?
An extremely nice result in this direction was obtained by Yu. Nesterov. The cornerstone of
Nesterov’s construction relates to the case when A is positive semidefinite, and this is the case
we shall focus on. Note that the problem of maximizing a quadratic form xT Ax with positive
semidefinite (and, say, integer) matrix A over Sn , same as MAXCUT, is NP-complete.
The semidefinite relaxation of the problem
n o
max xT Ax : x ∈ Sn [⇔ xi ∈ {−1, 1}, i = 1, ..., n] (3.4.14)
x
can be built exactly in the same way as (3.4.11) and turns out to be the semidefinite program
maximize Tr(AX)
s.t.
(3.4.15)
X = X T = [Xij ]ni,j=1 0
Xii = 1, i = 1, ..., n.
The optimal value in this problem, let it again be called SDP , is ≥ the optimal value OP T in
the original problem (3.4.14). The ratio OP T /SDP , however, cannot be too large:
π
Theorem 3.4.2 [Nesterov’s 2 Theorem] Let A be positive semidefinite. Then
π π
OP T ≤ SDP ≤ SDP [ = 1.570...]
2 2
The proof utilizes the central idea of Goemans and Williamson in the following brilliant reason-
ing:
The inequality SDP ≥ OP T is valid since (3.4.15) is a relaxation of (3.4.14). Let X be a feasible
solution to the relaxed problem; let, same as in the MAXCUT construction, ξ be a Gaussian random
vector with zero mean and the covariance matrix X, and let ζ = sign[ξ]. As we remember,
X 2 2
E ζ T Aζ =
Aij asin(Xij ) = Tr(A, asin[X]), (3.4.16)
i,j
π π
where for a function f on the axis and a matrix X f [X] denotes the matrix with the entries f (Xij ). Now
– the crucial (although simple) observation:
For a positive semidefinite symmetric matrix X with diagonal entries ±1 (in fact, for any
positive semidefinite X with |Xij | ≤ 1) one has
asin[X] X. (3.4.17)
178 LECTURE 3. SEMIDEFINITE PROGRAMMING
The proof is immediate: denoting by [X]k the matrix with the entries Xij k
and making
use of the Taylor series for the asin (this series converges uniformly on [−1, 1]), for a
matrix X with all entries belonging to [−1, 1] we get
∞
X 1 × 3 × 5 × ... × (2k − 1)
asin[X] − X = [X]2k+1 ,
2k k!(2k + 1)
k=1
and all we need is to note is that all matrices in the left hand side are 0 along with
X 15)
Combining (3.4.16), (3.4.17) and the fact that A is positive semidefinite, we conclude that
2 2
[OP T ≥] E ζ T Aζ = Tr(Aasin[X]) ≥ Tr(AX).
π π
π
The resulting inequality is valid for every feasible solution X of (3.4.15), whence SDP ≤ 2 OP T .
π
The Theorem has a number of far-reaching consequences (see Nesterov’s papers [27, 28]),
2
for example, the following two:
Theorem 3.4.3 Let T be an SDr compact subset in Rn+ . Consider the set
and let A be a symmetric n × n matrix. Then the quantities m∗ (A) = min xT Ax and m∗ (A) =
x∈T
max xT Ax admit efficiently computable bounds
x∈T
n o
s∗ (A) ≡ min Tr(AX) : X 0, (X11 , ..., Xnn )T ∈ T ,
X n o
s∗ (A) ≡ max Tr(AX) : X 0, (X11 , ..., Xnn )T ∈ T ,
X
such that
s∗ (A) ≤ m∗ (A) ≤ m∗ (A) ≤ s∗ (A)
and
π
m∗ (A) − m∗ (A) ≤ s∗ (A) − s∗ (A) ≤ (m∗ (A) − m∗ (A))
4−π
π
(in the case of A 0 and 0 ∈ T , the factor 4−π can be replaced with π2 ).
Thus, the “variation” max xT Ax − min xT Ax of the quadratic form xT Ax on T can be effi-
x∈T x∈T
ciently bounded from above, and the bound is tight within an absolute constant factor.
Note that if T is given by a strictly feasible SDR, then both (−s∗ (A)) and s∗ (A) are SDr
functions of A (Proposition 2.4.4).
Theorem 3.4.4 Let p ∈ [2, ∞], r ∈ [1, 2], and let A be an m × n matrix. Consider the problem
of computing the operator norm kAkp,r of the linear mapping x 7→ Ax, considered as the mapping
from the space Rn equipped with the norm k · kp to the space Rm equipped with the norm k · kr :
note that it is difficult (NP-hard) to compute this norm, except for the case of p = r = 2. The
“computationally intractable” quantity kAkp,r admits an efficiently computable upper bound
1 AT
h i Diag{µ}
ωp,r (A) = min kµk p + kλk 2−r
r : 0 ;
λ∈R ,µ∈Rn
m 2 p−2 A Diag{λ}
this bound is exact for a nonnegative matrix A, and for an arbitrary A the bound is tight within
π
the factor 2√3−2π/3 = 2.293...:
π
kAkp,r ≤ ωp,r (A) ≤ √ kAkp,r .
2 3 − 2π/3
Moreover, when p ∈ [1, ∞) and r ∈ [1, 2] are rational (or p = ∞ and r ∈ [1, 2] is rational), the
bound ωp,r (A) is an SDr function of A.
The Matrix Cube Theorem. The problem “Matrix Cube” (MC for short) is NP-hard; this
is true also for the “feasibility version” MCρ of MC, where we, given a ρ ≥ 0, are interested to
verify the inclusion Aρ ⊂ Sn+ . However, we can point out a simple sufficient condition for the
validity of the inclusion Aρ ⊂ Sn+ :
Proof. Let X 1 , ..., X m be a solution of (Sρ ). From (a) it follows that whenever kzk∞ ≤ ρ, we
have X i zi Ai for all i, whence by (b)
m
X X
A0 + zi Ai A0 − Xi 0.
i=1 i
Our main result is that the sufficient condition for the inclusion Aρ ⊂ Sn+ stated by Proposition
3.4.4 is not too conservative:
here
µ = max Rank(Ai )
1≤i≤m
Let us set
1
ϑ(k) = Z . (3.4.23)
|αi u21 αk u2k |pk (u)du α k
min + ... + ∈ R , kαk1 = 1
is positive. Since the problem is strictly feasible, its optimal value is positive if and only if the optimal
value of the dual problem
X m U i + V i = W, i = 1, ..., m,
Tr([U i − V i ]Ai ) − Tr(W A0 )
max ρ Tr(W ) = 1,
W,{U i ,V i }
U i, V i, W 0
i=1
0
2 . Now let us use simple
Lemma 3.4.1 Let W, A ∈ Sn , W 0. Then
max Tr([U − V ]A) = max Tr(XW 1/2 AW 1/2 ) = kλ(W 1/2 AW 1/2 )k1 . (3.4.26)
U,V 0,U +V =W X=X T :kλ(X)k∞ ≤1
When P, Q are linked by the relation P + Q = I and vary in {P 0, Q 0}, the matrix
X = P − Q runs through the entire “interval” {−I X I} (why?); we have proved the
first equality in (3.4.26). When proving the second equality, we may assume w.l.o.g. that
the matrix W 1/2 AW 1/2 is diagonal, so that Tr(XW 1/2 AW 1/2 ) = λT (W 1/2 AW 1/2 )Dg(X),
where Dg(X) is the diagonal of X. When X runs through the “interval” {−I X I}, the
diagonal of X runs through the entire unit cube {kxk∞ ≤ 1}, which immediately yields the
second equality in (3.4.26).
By Lemma 3.4.1, from (3.4.25) it follows that there exists W 0 such that
m
X
ρ kλ(W 1/2 Ai W 1/2 )k1 > Tr(W 1/2 A0 W 1/2 ). (3.4.27)
i=1
To prove (3.4.28.b), by homogeneity it suffices to consider the case when kλ(A)k1 = 1, and
by rotational invariance of the distribution of ξ – the case when A is diagonal, and the
first Rank(A) of diagonal entries of A are the nonzero eigenvalues of the matrix; with this
normalization, the required relation immediately follows from the definition of ϑ(·).
182 LECTURE 3. SEMIDEFINITE PROGRAMMING
m
P
We see that the matrix A0 + zi Ai is not positive semidefinite, while by construction kzk∞ ≤ ϑ(µ)ρ.
i=1
Thus, (3.4.21) holds true. (i) is proved.
To prove (ii), let α ∈ Rk be such that kαk1 = 1, and let
Z
J = |α1 u21 + ... + αk u2k |pk (u)du.
α
Let β = , and let ξ ∼ N (0, I2k ). We have
−α
( 2k ) ( k ) ( k )
X X X
2 2 2
E βi ξ i ≤ E βi ξ i + E βi+k ξi+k = 2J. (3.4.29)
i=1 i=1 i=1
α1 η1
On the other hand, let ηi = √12 (ξi − ξk+i ), ζi = √12 (ξi + ξk+i ), i = 1, ..., k, and let ω = ... ,
αk ηk
|α1 η1 | ζ1
.
e = ... , ζ = .. . Observe that ζ and ω are independent and ζ ∼ N (0, Ik ). We have
ω
|αk ηk | ζk
( 2k ) ( k )
X X
2
αi ηi ζi = 2E |ω T ζ| = E {kωk2 } E {|ζ1 |} ,
E βi ξi = 2E
i=1 i=1
where the concluding equality follows from the fact that ζ ∼ N (0, Ik ) is independent of ω. We further
have Z
2
E {|ζ1 |} = |t|p1 (t)dt = √
2π
3.4. SEMIDEFINITE RELAXATIONS OF INTRACTABLE PROBLEMS 183
and v
um
Z uX
E {kωk2 } = E {ke
ω k2 } ≥ kE {e
ω } k2 = |t|p1 (t)dt t αi2 .
i=1
2
This relation combines with (3.4.29) to yield J ≥ √
π k
. Recalling the definition of ϑ(k), we come to
√
π k
ϑ(k) ≤ 2 ,
as required in (3.4.22).
It remains to prove that ϑ(2) = π2 . From the definition of ϑ(·) it follows that
Z
ϑ−1 (2) = min |θu21 − (1 − θ)u22 |p2 (u)du ≡ min f (θ).
0≤θ≤1 0≤θ≤1
The function f (θ) is clearly convex and satisfies the identity f (θ) = f (1 − θ), 0 ≤ θ ≤ 1, so that its
minimum is attained at θ = 12 . A direct computation says that f ( 12 ) = π2 .
Corollary 3.4.1 Let the ranks of all matrices A1 , ..., Am in MC be ≤ µ. Then the optimal value
in the semidefinite problem
is a lower bound on R[A1 , ..., Am : A0 ], and the “true” quantity is at most ϑ(µ) times (see
(3.4.23), (3.4.22)) larger than the bound:
share a common quadratic Lyapunov function, i.e., the semi-infinite system of LMI’s
X I; AT X + XA −I ∀A ∈ Uρ (Lyρ )
is contained in Sn+ ; here E ij are the “basic n × n matrices” (ij-th entry of E ij is 1, all other
entries are zero). Note that the ranks of the matrices Aij [X], (i, j) ∈ D, are at most 2. Therefore
from Proposition 3.4.4 and Theorem 3.4.5 we get the following result:
h i X I, h i
X ij −ρDij [E ij ]T X + XE ij , X ij ρDij [E ij ]T X + XE ij , (i, j) ∈ D
n (Aρ )
X ij −I − AT∗ X − XA∗
P
(i,j)∈D
in matrix variables X, X ij , (i, j) ∈ D, is solvable, then so is the system (Lyρ ), and the X-
component of a solution of the former system solves the latter system.
(ii) If the system of LMI’s (Aρ ) is not solvable, then so is the system (Ly πρ ).
2
In particular, the supremum ρ[A∗ , D] of those ρ for which (Aρ ) is solvable is a lower bound
for R[A∗ , D], and the “true” quantity is at most π2 times larger than the bound:
π
ρ[A∗ , D] ≤ R[A∗ , D] ≤ ρ[A∗ , D].
2
Computing ρ[A∗ , D]. The quantity ρ[A∗ , D], in contrast to R[A∗ , D], is “efficiently computable”:
applying dichotomy in ρ, we can find a high-accuracy approximation of ρ[A∗ , D] via solving a small series
of semidefinite feasibility problems (Aρ ). Note, however, that problem (Aρ ), although “computationally
tractable”, is not that simple: in the case of “full uncertainty” (Dij > 0 for all i, j) it has n2 + n
matrix variables of the size n × n each. It turns out that applying semidefinite duality, one can reduce
dramatically the sizes of the problem specifying ρ[A∗ , D]. The resulting (equivalent!) description of the
bound is:
X I,
m
T
P
Y − η ` e j`
e j [Xe i 1
; Xe i 2
; ...; Xe im
]
1 `=1
` 0,
= inf λ
T , (3.4.32)
ρ[A∗ , D] λ,Y,X,{ηi } [Xei1 ; Xei2 ; ...; Xeim ] Diag(η1 , ..., ηm )
T
A0 [X] ≡ −I − A∗ X + XA∗ 0,
Y λA0 [X]
where (i1 , j1 ), ..., (im , jm ) are the positions of the uncertain entries in our uncertain matrix (i.e., the
pairs (i, j) such that Dij > 0) and e1 , ..., en are the standard basic orths in Rn .
Note that the optimization program in (3.4.32) has just two symmetric matrix variables X, Y , a single
scalar variable λ and m ≤ n2 scalar variables ηi , i.e., totally at most 2n2 + n + 2 scalar design variables,
which, for large m, is much less than the design dimension of (Aρ ).
3.4. SEMIDEFINITE RELAXATIONS OF INTRACTABLE PROBLEMS 185
Remark 3.4.1 Note that our results on the Matrix Cube problem can be applied to the in-
terval version of the Lyapunov Stability Synthesis problem, where we are interested to find the
supremum R of those ρ for which an uncertain controllable system
d
x(t) = A(t)x(t) + B(t)u(t)
dt
with interval uncertainty
(A(t), B(t)) ∈ Uρ = {(A, B) : |Aij − (A∗ )ij | ≤ ρDij , |Bi` − (B∗ )i` | ≤ ρCi` ∀i, j, `}
Y I, BL + AY + LT B T + Y AT −I ∀(A, B) ∈ Uρ
in variables L, Y (see Proposition 3.3.4), and them yield an efficiently computable lower bound
on R which is at most π2 times less than R.
We have seen that the Matrix Cube Theorem allows to build tight computationally tractable
approximations to semi-infinite systems of LMI’s responsible for stability of uncertain linear
dynamical systems affected by interval uncertainty. The same is true for many other semi-
infinite systems of LMI’s arising in Control in the presence of interval uncertainty, since in a
typical Control-related LMI, a perturbation of a single entry in the underlying data results in
a small-rank perturbation of the LMI – a situation well-suited for applying the Matrix Cube
Theorem.
Nesterov’s Theorem revisited. Our results on the Matrix Cube problem give an alternative
proof of Nesterov’s π2 Theorem (Theorem 3.4.2). Recall that in this theorem we are comparing
the true maximum
OP T = max{dT Ad | kdk∞ ≤ 1}
d
of a positive semidefinite (A 0) quadratic form on the unit n-dimensional cube and the
semidefinite upper bound
Then ( ! )
1 1 dT 1/2
= max ρ : 0 ∀(d : kdk∞ ≤ ρ ) (3.4.35)
OP T d A−1
and
1 n o
= max ρ : A−1 X ∀(X ∈ Sn : |Xij | ≤ ρ∀i, j) . (3.4.36)
OP T
Proof. To !get (3.4.35), note that by the Schur Complement Lemma, all matrices of the form
1 dT
with kdk∞ ≤ ρ1/2 are 0 if and only if dT (A−1 )−1 d = dT Ad ≤ 1 for all d,
d A−1
kdk∞ ≤ ρ1/2 , i.e., if and only if ρ·OP T ≤ 1; we have derived (3.4.35). We now have
1
(a) OP T ≥ρ
! m [by (3.4.35)]
1 dT
0 ∀(d : kdk∞ ≤ ρ1/2 )
d A−1
m [the Schur Complement Lemma]
A−1
ρddT
∀(d, kdk∞ ≤ 1)
m
T −1 T 2
x A x ≥ ρ(d x) ∀x ∀(d : kdk∞ ≤ 1)
m
xT A−1 x ≥ ρkxk21 ∀x
m
(b) A−1 ρY ∀(Y = Y T : |Yij | ≤ 1∀i, j)
Cρ = {A−1 +
X
zij S ij | max |zij | ≤ ρ}
i,j
1≤i≤j≤n
is contained in Sn+ ; here S ij are the “basic symmetric matrices” (S ii has a single nonzero entry,
equal to 1, in the cell ii, and S ij , i < j, has exactly two nonzero entries, equal to 1, in the cells
ij and ji). Since the ranks of the matrices S ij do not exceed 2, Proposition 3.4.4 and Theorem
3.4.5 say that the optimal value in the semidefinite program
X ij ρS ij , X ij −ρS ij , 1 ≤ i ≤ j ≤ n,
ρ(A) = max ρ P ij
X A−1 (S)
ρ,X ij
i≤j
is a lower bound for R, and this bound coincides with R up to the factor π2 ; consequently, ρ(A)
1
is an upper bound on OP T , and this bound is at most π2 times larger than OP T . It remains
1
to note that a direct computation demonstrates that ρ(A) is exactly the quantity SDP given by
(3.4.33).
3.4. SEMIDEFINITE RELAXATIONS OF INTRACTABLE PROBLEMS 187
with uncertain data (A, b, c, d) ∈ U? The question is how to describe the set of all robust feasible
solutions of this inequality, i.e., the set of x’s such that
We intend to focus on the case when the uncertainty is “side-wise” – the data (A, b) of the
left hand side and the data (c, d) of the right hand side of the inequality (3.4.37) independently of
each other run through respective uncertainty sets Uρleft , U right (ρ is the left hand side uncertainty
level). It suffices to assume the right hand side uncertainty set to be SDr with a strictly feasible
SDR:
U right = {(c, d) | ∃u : Pc + Qd + Ru S}. (3.4.39)
As about the left hand side uncertainty set, we assume that it is an intersection of concentric
ellipsoids, specifically, that
L
( )
left
X
T 2
Uρ = [A, b] = [A∗ , b∗ ] + ζ` [A` , b` ] : ζ Qj ζ ≤ ρ , j = 1, ..., J , (3.4.40)
`=1
where Q1 , ..., QJ are positive semidefinite matrices with positive definite sum.
Since the left hand side and the right hand side data independently of each other run through
respective uncertainty sets, a point x is robust feasible if and only if there exists a real τ such
that
(a) τ ≤ cT x + d ∀(c, d) ∈ U right ,
(3.4.41)
(b) kAx + bk2 ≤ τ ∀[A, b] ∈ Uρleft .
17)
Without loss of generality, we may assume that the objective is “certain” – is not affected by the data
uncertainty. Indeed, we can always ensure this situation by passing to an equivalent problem with linear (and
standard) objective:
min{f (x) : x ∈ X} 7→ min {t : f (x) − t ≤ 0, x ∈ X} .
x t,x
188 LECTURE 3. SEMIDEFINITE PROGRAMMING
We know from the previous Lecture that the set of (τ, x) satisfying (3.4.41.a) is SDr (see Propo-
sition 2.4.2 and Remark 2.4.1); it is easy to verify that the corresponding SDR is as follows:
As about building SDR of the set of pairs (τ, x) satisfying (3.4.41.b), this is much more difficult
(and in many cases even hopeless) task, since (3.4.38) in general turns out to be NP-hard and
as such cannot be posed as an explicit semidefinite program. We can, however, build a kind of
“inner approximation” of the set in question. To this end we shall use the ideas of semidefinite
relaxation. Specifically, let us set
so that
L
X L
X
(A∗ + ζ` A` )x + (b∗ + ζ` b` ) = a[x] + A[x]ζ.
`=1 `=1
In view of the latter identity, relation (3.4.41.b) reads
Predicate {·}II requires from certain quadratic form of t, ξ to be nonnegative when a number
of other quadratic forms of these variables are nonnegative. An evident sufficient condition for
this is that the former quadratic form is a linear combination, with nonnegative coefficients,
of the latter forms. When τ ≥ 0, this sufficient condition for the predicate {·}II to be valid can
be reduced to the existence of nonnegative weights λj such that the quadratic form
X
t2 (τ 2 − aT [x]a[x]) − 2tρaT [x]A[x]ξ − ρ2 ξ T AT [x]A[x]ξ − τ λj (t2 − ξ T Qj ξ)
j
in variables t, ξ is positive semidefinite. This condition is the same as the existence of nonnegative
λj such that
τ − λj
P
j − [a[x], ρA[x]]T [a[x], ρA[x]] 0.
τ P
λj Qj
j
Invoking the Schur Complement Lemma, the latter
condition, in turn, is equivalent to the
τ − λj aT [x]
P
j
existence of nonnegative λj such that the matrix ρAT [x] is positive
P
λj Qj
j
a[x] ρA[x] τI
3.4. SEMIDEFINITE RELAXATIONS OF INTRACTABLE PROBLEMS 189
aT [x]
τ−
P
λj
j
(a) {τ 0} & ∃(λj ≥ 0) : ρAT [x] 0
P
λj Qj
j
(3.4.43)
a[x] ρA[x] τI
⇓
(b) (x, τ ) satisfies (3.4.41.b)
Combining our observations, we arrive at the first – easy – part of the following statement:
Proposition 3.4.6 Let the data in the conic quadratic inequality (3.4.37) be affected by side-
wise uncertainty (3.4.39), (3.4.40). Then
(i) The system (S[ρ]) of LMIs (3.4.42.b), (3.4.43.a) in variables x, τ, Λ, {λj } is a “conserva-
tive approximation” of the Robust Counterpart of (3.4.37) in the sense that whenever x can be
extended to a feasible solution of (S[ρ]), x is robust feasible for (3.4.37), the uncertainty set being
Uρleft × U right .
(ii) The tightness of (S[ρ]) as an approximation to the robust counterpart of (3.4.37) can be
quantified as follows: if x cannot be extended to a feasible solution of (S[ρ]), then x is not robust
feasible for (3.4.37), the uncertainty set being Uϑρleft × U right . Here the “tightness factor” ϑ can
be bounded as follows:
For the proof of the “difficult part” (ii) of the Proposition, see [4].
Example: Antenna Synthesis revisited. To illustrate the potential of the Robust Optimization
methodology as applied to conic quadratic problems, consider the Circular Antenna Design problem
from Section 2.4.1. Assume that now we deal with 40 ring-type antenna elements, and that our goal is
40
P
to minimize the (discretized) L2 -distance from the synthesized diagram xj Drj−1 ,rj (·) to the “ideal”
j=1
diagram D∗ (·) which is equal to 1 in the range 77o ≤ θ ≤ 90o and is equal to 0 in the range 0o ≤ θ ≤ 70o .
The associated problem is just the Least Squares problem
v P P
u
u Dx2 (θ) + (Dx (θ) − 1)2
t θ∈Θcns θ∈Θobj
minτ,x τ : ≤τ ,
card(Θcns ∪ Θobj )
|
{z }
(3.4.44)
kD∗ −Dx k2
40
P
Dx (θ) = xj Drj−1 ,rj (θ)
j=1
190 LECTURE 3. SEMIDEFINITE PROGRAMMING
where Θcns and Θobj are the intersections of the 240-point grid on the segment 0 ≤ θ ≤ 90o with the
“angle of interest” 77o ≤ θ ≤ 90o and the “sidelobe angle” 0o ≤ θ ≤ 70o , respectively.
The Nominal Least Squares design obtained from the optimal solution to this problem is completely
unstable w.r.t. small implementation errors xj 7→ (1 + ξj )xj , |ξj | ≤ ρ:
In order to take into account implementation errors, we should treat (3.4.44) as an uncertain conic
quadratic problem
min {τ : kAx − bk2 ≤ τ } A ∈ U
τ,x
U = {A = A∗ + A∗ Diag(ξ) | kξk∞ ≤ ρ} ,
which is a particular case of the ellipsoidal uncertainty (specifically, what was called “box uncertainty”
in Proposition 3.4.6). In the experiments to be reported, we use ρ = 0.02. The approximate Robust
Counterpart (S[ρ]) of our uncertain conic quadratic problem yields the Robust design as follows:
minimize xT Bx
T (3.5.1)
s.t. x Ai x ≥ 0, i = 1, ..., m
(B, A1 , ..., Am are given symmetric m × m matrices). Assume that the problem is feasible. In
this case (3.5.1) is, at a first glance, a trivial problem: due to homogeneity, its optimal value
is either −∞ or 0, depending on whether there exists or does not exist a feasible vector x such
that xT Bx < 0. The challenge here is to detect which one of these two alternatives takes
place, i.e., to understand whether or not a homogeneous quadratic inequality xT Bx ≥ 0 is a
consequence of the system of homogeneous quadratic inequalities xT Ai x ≥ 0, or, which is the
same, to understand when the implication
(a) xT Ai x ≥ 0, i = 1, ..., m
⇓ (3.5.2)
(b) xT Bx ≥ 0
holds true.
In the case of homogeneous linear inequalities it is easy to recognize when an inequality
xT b ≥ 0 is a consequence of the system of inequalities xT ai ≥ 0, i = 1, ..., m: by Farkas
Lemma, it is the case if and only if the inequality is a linear consequence of the system, i.e.,
if b is representable as a linear combination, with nonnegative coefficients, of the vectors ai .
Now we are asking a similar question about homogeneous quadratic inequalities: when (b) is a
consequence of (a)?
In general, there is no analogy of the Farkas Lemma for homogeneous quadratic inequalities.
Note, however, that the easy “if” part of the Lemma can be extended to the quadratic case:
if the target inequality (b) can be obtained by linear aggregation of the inequalities (a) and a
trivial – identically true – inequality, then the implication in question is true. Indeed, a linear
aggregation of the inequalities (a) is an inequality of the type
m
X
xT ( λi Ai )x ≥ 0
i=1
with nonnegative weights λi , and a trivial – identically true – homogeneous quadratic inequality
is of the form
xT Qx ≥ 0
with Q 0. The fact that (b) can be obtained from (a) and a trivial inequality by linear
m
λi Ai + Q with λi ≥ 0, Q 0, or, which
P
aggregation means that B can be represented as B =
i=1
m
is the same, if B
P
λi Ai for certain nonnegative λi . If this is the case, then (3.5.2) is trivially
i=1
true. We have arrived at the following simple
192 LECTURE 3. SEMIDEFINITE PROGRAMMING
is strictly feasible and its optimal value is ≥ 0, then B λA for certain λ ≥ 0. By homogeneity
reasons, it suffices to prove exactly the same statement for the optimization problem
n o
min xT Bx : xT Ax ≥ 0, xT x = n . (P)
x
If we could show that when passing from the original problem (P) to the relaxed problem (P0 )
the optimal value (which was nonnegative for (P)) remains nonnegative, we would be done.
Indeed, observe that (P0 ) is clearly bounded below (its feasible set is compact!) and is strictly
feasible (which is an immediate consequence of the strict feasibility of (A)). Thus, by the Conic
Duality Theorem the problem dual to (P0 ) is solvable with the same optimal value (let it be
called nθ∗ ) as the one in (P0 ). The dual problem is
max {nµ : λA + µI B, λ ≥ 0} ,
µ,λ
and the fact that its optimal value is nθ∗ means that there exists a nonnegative λ such that
B λA + nθ∗ I.
If we knew that the optimal value nθ∗ in (P0 ) is nonnegative, we would conclude that B λA
for certain nonnegative λ, which is exactly what we are aiming at. Thus, all we need is to prove
that under the premise of the S-Lemma the optimal value in (P0 ) is nonnegative, and here is
the proof:
3.5. S-LEMMA AND APPROXIMATE S-LEMMA 193
Observe first that problem (P0 ) is feasible with a compact feasible set, and thus is solvable.
Let X ∗ be an optimal solution to the problem. Since X ∗ ≥ 0, there exists a matrix D such
that X ∗ = DDT . Note that we have
0 ≤ Tr(AX ∗ ) = Tr(ADDT ) = Tr(DT AD),
nθ∗ = Tr(BX ∗ ) = Tr(BDDT ) = Tr(DT BD), (*)
n = Tr(X ∗ ) = Tr(DDT ) = Tr(DT D).
(!) Let P, Q be symmetric matrices such that Tr(P ) ≥ 0 and Tr(Q) < 0. Then
there exists a vector e such that eT P e ≥ 0 and eT Qe < 0.
Indeed, let us believe that (!) is valid, and let us prove that θ∗ ≥ 0. Assume, on the contrary,
that θ∗ < 0. Setting P = DT BD and Q = DT AD and taking into account (*), we see that
the matrices P, Q satisfy the premise in (!), whence, by (!), there exists a vector e such that
0 ≤ eT P e = [De]T A[De] and 0 > eT Qe = [De]T B[De], which contradicts the premise of the
S-Lemma.
It remains to prove (!). Given P and Q as in (!), note that Q, as every symmetric matrix,
admits a representation
Q = U T ΛU
with an orthonormal U and a diagonal Λ. Note that θ ≡ Tr(Λ) = Tr(Q) < 0. Now let ξ be
a random n-dimensional vector with independent entries taking values ±1 with probabilities
1/2. We have
while
[U T ξ]T P [U T ξ] = ξ T [U P U T ]ξ,
and the expectation of the latter quantity over ξ is clearly Tr(U P U T ) = Tr(P ) ≥ 0. Since
the expectation is nonnegative, there is at least one realization ξ¯ of our random vector ξ such
that
¯ T P [U T ξ].
0 ≤ [U T ξ] ¯
Assume that the problem is strictly feasible and below bounded. Then the Semidefinite relaxation
(3.4.5) of the problem is solvable with the optimal value f∗ .
Proof. By Proposition 3.4.1, the optimal value in (3.4.5) can be only ≤ f∗ . Thus, it suffices to
verify that (3.4.5) admits a feasible solution with the value of the objective ≥ f∗ , that is, that
there exists λ∗ ≥ 0 such that
!
c0 + λ∗ c1 − f∗ bT0 + λ∗ bT1
0. (3.5.4)
b0 + λ∗ b1 A0 + λ∗ A1
194 LECTURE 3. SEMIDEFINITE PROGRAMMING
To this end, let us associate with (3.5.3) a pair of homogeneous quadratic forms of the extended
vector of variables y = (t, x), where t ∈ R, specifically, the forms
We claim that, first, there exist 0 > 0 and ȳ with ȳ T P ȳ < −0 ȳ T ȳ and, second, that for every
∈ (0, 0 ] the implication
y T P y ≤ −y T y ⇒ y T Qy ≤ 0 (3.5.5)
holds true. The first claim is evident: by assumption, there exists x̄ such that f1 (x̄) < 0; setting
ȳ = (1, x̄), we see that ȳ T P ȳ = f1 (x̄) < 0, whence ȳ T P ȳ < −0 ȳ T ȳ for appropriately chosen
0 > 0. To support the second claim, assume that y = (t, x) is such that y T P y ≤ −y T y, and
let us prove that then y T Qy ≤ 0.
Our observations combine with S-Lemma to imply that for every ∈ (0, 0 ] there exists λ =
λ ≥ 0 such that
B λ (A + I), (3.5.6)
whence, in particular,
ȳ T B ȳ ≤ λ ȳ T [A + I]ȳ.
The latter relation, due to ȳ T Aȳ < 0, implies that λ remains bounded as → +0. Thus, we
have λi → λ∗ ≥ 0 as i → ∞ for a properly chosen sequence i → +0 of values of , and (3.5.6)
implies that B λ∗ A. Recalling what are A and B, we arrive at (3.5.4).
m
where ci > 0, Ai 0, i = 1, ..., m (A can be arbitrary symmetric matrix) and Ai 0. Let
P
i=1
SDP be the optimal value in the Semidefinite relaxation of this problem:
m
ω − i=1 ci λi −bT
P
SDP = min ω : m
0, λ ≥ 0 (3.5.8)
ω,λ
−b λi Ai − A
P
i=1
(note that the problem of interest is a maximization one, whence the difference between the
relaxation and (3.4.5)). Then
Opt ≤ SDP ≤ ΘOpt, (3.5.9)
in the case of m > 1. Moreover, in the latter case there exists x∗ such that
bT x∗ ≥ 0,
xT∗ Ax∗ + 2bT x∗ ≥ SDP, (3.5.10)
xT∗ Ai x∗ ≤ Θci , i = 1, ..., m.
Proof. The case of m = 1 is given by the Inhomogeneous S-Lemma. Thus, let m > 1.
10 . We clearly have
n o
Opt = max xT Ax + 2tbT x : t2 ≤ 1, xT Ai x ≤ ci , i = 1, ..., m
t,x n o
= max z T Bz : z T Bi z ≤ ci , i = 0, 1, ..., m ,
z=(t,x)
" #
bT
B= ,
b A
" # (3.5.11)
1
B0 = ,
" #
Bi = , i = 1, ..., m,
Ai
c0 = 1.
Note that (3.5.8) is nothing but the semidefinite dual of the semidefinite program
m
Since ci > 0 for i ≥ 0, (3.5.12) is strictly feasible, and since Bi 0, the feasible set of the
P
i=0
problem is bounded (so that the problem is solvable). Since (3.5.12) is strictly feasible and
bounded, by Semidefinite Duality Theorem we have
Now let ξ be a random vector with independent coordinates taking values ±1 with probability
1/2
1/2, and let η = Z∗ U T ξ. Observe that
(a) η T Bη ≡ ξ T Dξ
n o ≡ SDP n o [by 1)]
(3.5.14)
(b) T
E η Bi η = T
E ξ Di ξ = Tr(Di )
≤ ci , i = 0, 1, ..., m [by 2)]
Indeed, η T B0 η = ξ T D0 ξ. Note that D0 = d0 dT0 for certain vector d0 (since B0 = b0 bT0 for certain
b0 ). Besides this, Tr(D0 ) ≤ c0 = 1 by (3.5.14.b). Thus,
X
η T B0 η = ( pj j )2 ,
j
P 2
where deterministic pj satisfy pj ≤ 1, and j are independent random variables taking values
j
±1 with probabilities 1/2. Now (3.5.15.a) is given by the following fact [4]:
( )
With pj and j as above, one has Prob | pj j | ≤ 1 ≥ 31 .
P
j
To verify (3.5.15.b), let us fix i ≥ 1. Since Di is positive semidefinite along with Bi , we have
k
X
Di = dj dTj [k = Rank(Di ) ≤ Rank(Bi ) = Rank(Ai )].
j=1
k
In the case of ξ T Di ξ ≥ θ kdj k22 we clearly have ξ T dj dTj ξ ≥ θkdj k22 for certain (depending on
P
j=1
ξ) value of j, so that
( )
k k n √ o
Prob ξ T Di ξ > θ kdj k22 ≤ Prob |dTj ξ| ≥ θkdj k2
P P
j=1 j=1
≤ 2k exp{−θ/2}.
The resulting inequality implies (3.5.15.b) due to the facts that η T Bi η = ξ T Di ξ and that
k
X X
kdj k22 = Tr( dj dTj ) = Tr(Di ) ≤ ci .
j=1 j
m
40 . For every θ > Θ = 2 ln(6K) we have θ ≥ 1 and
P
Let K = Rank(Ai ).
i=1
2
3 + 2K exp{−θ/2} < 1. In view of the latter fact and (3.5.15), there exists a realization
η̄ = (t̄, x̄) of η such that
(a) t̄2 ≡ η̄ T B0 η̄ ≤ 1,
(3.5.16)
(b) x̄T Ai x̄ ≡ η̄ T Bi η̄ ≤ θci , i = 1, ..., m.
while
x̄T Ax̄ + 2t̄bT x̄ = η̄ T Bbarη = SDP
by (3.5.14.a). Replacing, if necessary, t̄ with −t̄ and x̄ with −x̄, we ensure the validity of
(3.5.16.b) along with the relation
x̄T Ax̄ + 2b T
| {z x̄} ≥ SDP. (3.5.17)
≥0
Ai 0, relations (3.5.16) imply that (t̄, x̄) remain bounded as θ → +Θ, whence (3.5.16)
P
Since
i
and (3.5.17) are valid for properly chosen t̄, x̄ and θ = Θ; setting x∗ = x̄, we arrive at (3.5.10).
By (3.5.10), Θ−1/2 x∗ is a feasible solution of (3.5.7) and
affinely parameterized by perturbation vector ζ and with variables xj allowed to be affine func-
tions of Pj ζ:
xj = µj + νjT Pj ζ, (3.5.20)
198 LECTURE 3. SEMIDEFINITE PROGRAMMING
It was explained that in the case of fixed recourse (cj [ζ] and Aj [ζ] are independent of ζ for all j
for which xj is adjustable, that is, Pj 6= 0), (AARC) is equivalent to an explicit conic quadratic
program, provided that the perturbation set Z is CQr with strictly feasible CQR. In fact CQ-
representability plays no crucial role here (see Remark 2.4.1); in particular, when Z is SDr
with a strictly feasible SDR, (AARC), in the case of fixed recourse, is equivalent to an explicit
semidefinite program. What indeed plays a crucial role is the assumption of fixed recourse;
it can be shown that when this assumption does not hold, (AARC) can be computationally
intractable. Our current goal is to demonstrate that even in this difficult case (AARC) admits a
“tight” computationally tractable approximation, provided that Z is an intersection of ellipsoids
centered at the origin:
" #
n o X
T 2
Z = Zρ ≡ ζ : ζ Qi ζ ≤ ρ , i = 1, ..., m Qi 0, Qi 0 (3.5.21)
i
Indeed, since Aj [ζ] are affine in ζ, every semi-infinite constraint in (AARC) is of the form
ζ T A[z]ζ + 2bT [z]ζ ≤ c[z] ∀ζ ∈ Zρ (3.5.22)
where z = (t, {µj , νj }nj=1 ) is the vector of variables in (AARC), and A[z], b[z], c[z] are affine in z
matrix, vector and scalar. Applying Approximate S-Lemma, a sufficient condition for (3.5.22)
to be valid for a given z is the relation
m
− ρ2 λi −bT [z]
P
ω
i=1
min ω :
m
0, λ ≥ 0 ≤ c[z],
ω,λ
−b[z] λi Qi − A[z]
P
i=1
or, which is the same, the possibility to extend z, by properly chosen λ, to a solution to the
system of constraints
m
2 −bT [z]
c[z] − i=1 ρ λi
P
λ ≥ 0, m
, (3.5.23)
−b[z] λi Qi − A[z]
P
i=1
which is a system of LMIs in variables (z, λ). Replacing every one of the semi-infinite constraints
in (AARC) with corresponding system (3.5.23), we end up with an explicit semidefinite program
which is a “conservative approximation” of (AARC): both problems have the same objective,
and the z-component of a feasible solution to the approximation is feasible solution of (AARC).
At the same time, the approximation is tight up to the quite moderate factor
v
m
u !
u X
θ = t2 ln 6 Rank(Qi ) :
i=1
whenever z cannot be extended to a feasible solution of the approximation, the “moreover” part
of the Approximate S-Lemma says that z becomes infeasible for (AARC) after the original level
of perturbations is increased by the factor θ.
3.6. SEMIDEFINITE RELAXATION AND CHANCE CONSTRAINTS 199
Probζ∼P {ζ : f (x, ζ) ≤ 0} ≤ 1 − ,
where 1 is a given tolerance, and P is the distribution of ζ. This formulation tacitly assumes
that the distribution of ζ is known exactly, which typically is not the case in reality. In real life,
even if we have reasons to believe that the data perturbations are random (which by itself is a
“big if”), we usually are not informed enough to point out their distribution exactly. Instead,
we usually are able to point out a family P of probability distributions which contains the true
distribution of ζ. In this case, we usually pass to ambiguously chance constrained version of the
uncertain constraint:
sup Probζ∼P {ζ : f (x, ζ) ≤ 0} ≥ 1 − . (3.6.1)
P ∈P
Chance constraints is a quite old notion (going back to mid-1950’s); over the years, they were
subject of intensive and fruitful research of numerous excellent scholars worldwide. With all
due respect to this effort, chance constraints, even as simply-looking as scalar linear inequalities
with randomly perturbed coefficients, remain quite challenging computationally19 . The reason
is twofold:
• Checking the validity of a chance constraint at a given point x, even with exactly known
distribution of ζ, requires multi-dimensional (provided ζ is so) integration, which typi-
cally is a computationally intractable task. Of course, one can replace precise integration
18
By adding slack variable t and replacing minimization of the true objective f (x) with minimizing t under
added constraint f (x) ≤ t, we always can assume that the objective is linear and certain — not affected by
data uncertainty. By this reason, we can restrict ourselves with the case where the only uncertainty-affected
components of an optimization problem are the constraints.
19
By the way, this is perhaps the most important argument in favour of RO: as we have seen in Section 2.4, the
RO approach, at least in the case of non-adjustable uncertain Linear Programming, results in computationally
tractable (provided the uncertainty set is so) Robust Counterpart of the uncertain problem of interest.
200 LECTURE 3. SEMIDEFINITE PROGRAMMING
• Another potential difficulty with chance constraint is that its feasible set can be non-
convex already for a pretty simple – just affine in x – function f (x, ζ), which makes the
optimization of objective over such a set highly problematic.
Essentially, the only generic situation where neither one of the above two difficulties occur is the
case of scalar linear constraint
n
X
ζi x i ≤ 0
i=1
with Gaussian ζ = [ζ1 ; ...; ζN ]. Assuming that the distribution of ζ is known exactly and denoting
by µ, Σ the expectation and the covariance matrix of ζ, we have
√
µi ζi + Φ−1 () xT Σx ≤ 0,
X X
Prob{ ζi xi ≤ 0} ≥ 1 − ⇔
i i
When ≤ 1/2, the above “deterministic equivalent” of the chance constraint is an explicit Conic
Quadratic inequality.
where the data perturbations ζ ∈ Rd , and the symmetric matrix W is affine in the decision
variables; we lose nothing by assuming that W itself is the decision variable.
We start with the description of P. Specifically, we assume that our a priori information on
the distribution P of the uncertain data can be summarized as follows:
P.1 We know that the marginals Pi of P (i.e., the distributions of the entries ζi in ζ) belong
to given families Pi of probability distributions on the axis;
P.2 The matrix VP = Eζ∼P {Z[ζ]} of the first and the second moments of P is known to belong
to a given convex closed subset V of the positive semidefinite cone;
P.3 P is supported on a set S given by a finite system of quadratic constraints:
S = {ζ : Tr(A` Z[ζ]) ≤ 0, 1 ≤ ` ≤ L}
The above assumptions model well enough a priori information on uncertain data in typical
applications of decision-making origin.
Γ[γ] ≤ θ
clearly is sufficient for the validity of (3.6.2), and we can further optimize this sufficient condition
over the pairs γ, θ produced by our hypothetic mechanism.
The question is, how to build the required mechanism, and here is an answer. Let us start
with building Γ[·]. Under our assumptions on P, the most natural family of functions γ(·) for
which one can bound from above Γ∗ [γ] is comprised of functions of the form
Xd
γ(ζ) = Tr(QZ[ζ]) + γi (ζi ) (3.6.3)
i=1
Further, the simplest way to ensure (I) is to use Lagrangian relaxation, specifically, to require
from the function γ(·) given by (3.6.3) to be such that with properly chosen µ` ≥ 0 one has
Xd XL
Tr(QZ[ζ]) + γi (ζi ) + µ Tr(A` Z[ζ]) ≥ 0 ∀ζ ∈ Rd .
i=1 `=1 `
In turn, the simplest way to ensure the latter relation is to impose on γi (ζi ) the restrictions
(3.6.6)
with (b) reducing to the LMI
L
" P #
0 [q 0 ]T
i ri − θ X
Q+ + ν` A` − W 0 (3.6.7)
q0 Diag{p0 }
`=1
in variables W , p0 , q 0 , r0 ∈ Rd , Q, {νi ≥ 0}L i=1 and θ > 0. Finally, observe that under the
restrictions (3.6.4.a), (3.6.6.a), the best – resulting in the smallest possible Γ[γ] – choice of γi (·)
is
γi (ζi ) = max[pi ζi2 + 2qi ζi + ri , p0i ζi2 + 2qi0 ζi + ri0 ].
We have arrived at the following result:
Proposition 3.6.1 Let (S) be the system of constraints in variables W , p, q, r, p0 , q 0 , r0 ∈ Rd ,
Q ∈ Sd+1 , {ν` ∈ R}L L
`=1 , θ ∈ R, {µ` ∈ R}`=1 comprised of the LMIs (3.6.5), (3.6.7) augmented
by the constraints
µ` ≥ 0 ∀`, ν` ≥ 0 ∀`, θ > 0 (3.6.8)
and
Xd Z
sup Tr(QV ) + sup max[pi ζi2 + 2qi ζi + ri , p0i ζi2 + 2qi0 ζi + ri0 ]dPi ≤ θ. (3.6.9)
V ∈V i=1 P ∈P
i i
(S) is a system of convex constraints which is a safe approximation of the chance constraint
(3.6.2), meaning that whenever W can be extended to a feasible solution of (S), W is feasible
for the ambiguous chance constraint aneq500), P being given by P.1-3. This approximation is
tractable, provided that the suprema in (3.6.9) are efficiently computable.
θ 1
Note that the strict inequality θ > 0 in (3.6.8) can be expressed by the LMI 1 λ
0.
3.6.2.1 Illustration I
Consider the situation as follows: there are d = 15 assets with yearly returns ri = 1 + µi + σi ζi ,
where µi is the expected profit of i-th return, σi is the return’s variability, and ζi is random factor
with zero mean supported on [−1, 1]. The quantities µi , σi used in our illustration are shown on
the left plot on figure 3.1. The goal is to distribute $1 between the assets in order to maximize
the value-at-1% risk (the lower 1%-quantile) of the yearly profit. This is the ambiguously chance
constrained problem
( ( 15 15
) 15
)
X X X
Opt = max t : Probζ∼P µ i xi + ζi σ i x i ≥ t ≥ 0.99 ∀P ∈ P, x ≥ 0, xi = 1
t,x
i=1 i=1 i=1
(3.6.10)
3.6. SEMIDEFINITE RELAXATION AND CHANCE CONSTRAINTS 203
3 1
0.9
2.5 0.8
0.7
2 0.6
0.5
1.5 0.4
0.3
1 0.2
0.1
0.5 0
0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16
Consider three hypotheses A, B, C about P. In all of them, ζi are zero mean and supported on
[−1, 1], so that the domain information is given by the quadratic inequalities ζi2 ≤ 1, 1 ≤ i ≤ 15;
this is exactly what is stated by C. In addition, A says that ζi are independent, and B says that
the covariance matrix of ζ is proportional to the unit matrix. Thus, the sets V associated with
the hypotheses are, respectively, {V ∈ Sd+1 + : Vii ≤ V00 = 1, Vij = 0, i 6= j}, {V ∈ Sd+1
+ :1=
d+1
V00 ≥ V11 = V22 = ... = Vdd , Vij = 0, i 6= j}, and {V ∈ S+ : Vii ≤ V00 = 1, V0j = 0, 1 ≤ j ≤ d},
where Sk+ is the cone of positive semidefinite symmetric k × k matrices. Solving the associated
safe tractable approximations of the problem, specifically, the Bernstein approximation in the
case of A, and the Lagrangian approximations in the cases of B, C, we arrive at the results
displayed in table 3.1 and on figure 3.1.
Note that in our illustration, the (identical to each other) single-asset portfolios yielded by
the Lagrangian approximation under hypotheses B, C are exactly optimal under circumstances.
Indeed, on a closest inspection, there exists a distribution P∗ compatible with hypothesis B (and
therefore – with C as well) such that the probability of “crisis,” where all ζi simultaneously are
equal to −1, is ≥ 0.01. It follows that under hypotheses B, C, the worst-case, over P ∈ P,
profit at 1% risk of any portfolio cannot be better than the profit of this portfolio in the case
of crisis, and the latter quantity is maximized by the single-asset portfolio depicted on figure
3.1. Note that the Lagrangian approximation turns out to be “intelligent enough” to discover
this phenomenon and to infer its consequences. A couple of other instructive observations is as
follows:
• the diversified portfolio yielded by the Bernstein approximation20 in the case of crisis
exhibits negative profit, meaning that under hypotheses B, C its worst-case profit at 1%
risk is negative;
• assume that the yearly returns are observed on a year-by-year basis, and the year-by-year
20
See, e.g., [22]. This is a specific safe tractable approximation of affinely perturbed linear scalar chance
constraint which heavily exploits the assumption that the entries ζi in ζ are mutually independent.
204 LECTURE 3. SEMIDEFINITE PROGRAMMING
realizations of ζ are independent and identically distributed. It turns out that it takes
over 100 years to distinguish, with reliability 0.99, between hypothesis A and the “bad”
distribution P∗ via the historical data.
To put these observations into proper perspective, note that it is extremely time-consuming to
identify, to reasonable accuracy and with reasonable reliability, a multi-dimensional distribution
directly from historical data, so that in applications one usually postulates certain parametric
form of the distribution with a relatively small number of parameters to be estimated from
the historical data. When dim ζ is large, the requirement on the distribution to admit a low-
dimensional parameterization usually results in postulating some kind of independence. While in
some applications (e.g., in telecommunications) this independence in many cases can be justified
via the “physics” of the uncertain data, in Finance and other decision-making applications
postulating independence typically is an “act of faith” which is difficult to justify experimentally,
and we believe a decision-maker should be well aware of the dangers related to these “acts of
faith.”
3.6.3 A Modification
When building safe tractable approximation of an affinely perturbed scalar linear chance con-
straint
d
X
∀p ∈ P : Probζ∼P {w0 + ζi wi ≤ 0} ≥ 1 − , (3.6.11)
i=1
one, essentially, is interested in efficient bounding from above the quantity
d
( )
X
p(w) = sup Eζ∼P f (w0 + ζ i wi )
P ∈P i=1
for a very specific function f (s), namely, equal to 0 when s ≤ 0 and equal to 1 when s > 0.
There are situations when we are interested in bounding similar quantity for other functions f ,
specifically, piecewise linear convex function f (s) = max [ai + bi s], see, e.g., [10]. Here again
1≤j≤J
one can use Lagrange relaxation, which in fact is able to cope with a more general problem of
bounding from above the quantity
here the matrices W j ∈ Sd+1 are affine in the decision variables and W = [W 1 , ..., W J ]. Specif-
ically, with the assumptions P.1-3 in force, observe that if, for a given W , a matrix Q ∈ Sd+1
and vectors pj , q j , rj ∈ Rd , 1 ≤ j ≤ J are such that that the relations
Xd
Tr(QZ[ζ]) + [pji ζi2 + 2qij ζi + rij ] ≥ Tr(W j Z[ζ]) ∀ζ ∈ S (Ij )
i=1
is an upper bound on Ψ[W ]. Using Lagrange relaxation, a sufficient condition for the validity
of (Ij ), 1 ≤ j ≤ J, is the existence of nonnegative µj` such that
L
" P #
j
i ri [q j ]T X
Q+ − Wj + µj` A` 0, 1 ≤ j ≤ J. (3.6.14)
qj Diag{pj } `=1
The constraints in the system are convex; they are efficiently computable, provided that the
suprema in (3.6.15.b) are efficiently computable, and whenever W, t can be extended to a feasible
solution to S, one has Ψ[W ] ≤ t. In particular, when the suprema in (3.6.15.b) are efficiently
computable, the efficiently computable quantity
3.6.3.1 Illustration II
Consider a special case of the above situation where all we know about ζ are the marginal
distributions Pi of ζi with well defined first order moments; in this case, Pi = {Pi } are singletons,
and we lose nothing when setting V = Sd+1 d
+ , S = R . Let a piecewise linear convex function on
the axis:
f (s) = max [aj + bj s]
1≤j≤J
be given, and let our goal be to bound from above the quantity
Xd
ψ(w) = sup Eζ∼P {f (ζ w )} , ζ w = w0 + ζ i wi .
P ∈P i=1
In this case, system S from Proposition 3.6.2, where we set pji = 0, Q = 0, reads
(a) 2qij = bj wi , 1 ≤ i ≤ d, q ≤ j ≤ J, j
Pd
i=1 ri ≥ aj + bj w0 ,
max [2qij ζi + rij ]dPi (ζi ) ≤ t
Pd R
(b) i=1
1≤j≤J
206 LECTURE 3. SEMIDEFINITE PROGRAMMING
is an upper bound on ψ(w). A surprising fact is that in the situation in question (i.e., when P
is comprised of all probability distributions with given marginals P1 , ..., Pd ), the upper bound
Opt[w] on ψ(w) is equal to ψ(w) [5, Poposition 4.5.4]. This result offers an alternative (and
simpler) proof for the following remarkable fact established by Dhaene et al [10]: if f is a
convex function on the axis and η1 , ..., ηd are scalar random variables with distributions P1 , ..., Pd
possessing first order moments, then supP ∈P Eη∼P {f (η1 + ... + ηd )}, P being the family of all
distributions on Rd with marginals P1 , ..., Pd , is achieved when η1 , ..., ηd are comonotone, that
is, are deterministic monotone transformation of a single random variable uniformly distributed
on [0, 1].
Exercise 3.1 Here you will learn how to verify claims like “distinguishing, with reliability 0.99,
between distributions A and B takes at least so much observations.”
1. Problem’s setting. Let P and Q be two probability distributions on the same space Ω with
densities p(·), q(·) with respect to some measure µ.
Those with limited experience in measure theory will lose nothing by assuming whenever
possible that Ω is the finite set {1, ..., N } and µ is the “counting measure” (the µ-mass
of every point from Ω is 1). In this case, the density of a probability distribution P
on Ω is just the function p(·) on the N -point set Ω (i.e., N -dimensional vector) with
p(i) = Probω∼P {ω = i} (that is, p(i) is the probability mass which is assigned by P to
a point i ∈ Ω). In this case (in the sequel, we refer to it as to the discrete one), the
density of a probability distribution on Ω is just a probabilistic vector from RN — a
nonnegative vector with entries summing up to 1.
Given an observation ω ∈ Ω drawn at random from one (we do not know in advance from
which one) of the distributions P, Q, we want to decide what is the underlying distribution;
this is called distinguishing between two simple hypotheses21). A (deterministic) decision
rule clearly should be as follows: we specify a subset ΩP ⊂ Ω and accept the hypothesis
HP that the distribution from which ω is drawn is P if and only if ω ∈ ΩP ; otherwise we
accept the alternative hypothesis HQ saying that the “actual” distribution is Q.
A decision rule for distinguishing between the hypotheses can be characterized by two error probabilities: $\epsilon_P$ (to accept $H_Q$ when $H_P$ is true) and $\epsilon_Q$ (to accept $H_P$ when $H_Q$ is true). We clearly have
$$\epsilon_P = \int_{\omega\notin\Omega_P} p(\omega)\,d\mu(\omega),\qquad \epsilon_Q = \int_{\omega\in\Omega_P} q(\omega)\,d\mu(\omega).$$
Prove that this lower bound is achieved for the maximum likelihood test, where ΩP = {ω :
p(ω) ≥ q(ω)}.
Follow-up: Essentially, the only multidimensional case where it is easy to compute the total error $\epsilon_P + \epsilon_Q$ of the maximum likelihood test is when P and Q are Gaussian distributions on $\Omega = \mathbf{R}^k$ with common covariance matrix (which we for simplicity set to the unit matrix) and different expectations, say, a and b, so that the densities (taken w.r.t. the usual Lebesgue measure $d\mu(\omega_1,...,\omega_k) = d\omega_1...d\omega_k$) are $p(\omega) = \frac{1}{(2\pi)^{k/2}}\exp\{-(\omega-a)^T(\omega-a)/2\}$ and $q(\omega) = \frac{1}{(2\pi)^{k/2}}\exp\{-(\omega-b)^T(\omega-b)/2\}$. Prove that in this case the likelihood test reduces to accepting P when $\|\omega-a\|_2 \le \|\omega-b\|_2$ and accepting Q otherwise.
Prove that the total error of the above test is $2\Phi(\|a-b\|_2/2)$, where
$$\Phi(s) = \frac{1}{\sqrt{2\pi}}\int_s^\infty \exp\{-t^2/2\}\,dt.$$
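A quick simulation sketch of this test (the means below and the sample size are illustrative choices of ours; scipy's norm.sf plays the role of Φ):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
k, N = 3, 200_000
a = np.zeros(k)
b = np.array([1.0, 0.5, -0.5])           # illustrative expectations

wP = rng.standard_normal((N, k)) + a     # observations under H_P
wQ = rng.standard_normal((N, k)) + b     # observations under H_Q
# likelihood-ratio test: accept P iff ||w - a||_2 <= ||w - b||_2
eps_P = np.mean(np.sum((wP - a)**2, 1) > np.sum((wP - b)**2, 1))
eps_Q = np.mean(np.sum((wQ - a)**2, 1) <= np.sum((wQ - b)**2, 1))

print(eps_P + eps_Q, 2 * norm.sf(np.linalg.norm(a - b) / 2))  # nearly equal
```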
In particular, when all $p^i$ coincide with some p (in this case, we denote $p^1\times...\times p^k$ by $p^{\otimes k}$) and $q^1, ..., q^k$ coincide with some q, then
(f) In the notation from Problem’s setting, prove that if the hypotheses $H_P$, $H_Q$ can be distinguished from an observation with probabilities of the errors satisfying $\epsilon_P + \epsilon_Q \le 2\epsilon < 1/2$, then
$$H(p,q) \ge (1-4\epsilon)\ln\frac{1}{2\epsilon}. \qquad (3.6.18)$$
Follow-up: Let p and q be Gaussian densities on Rk with unit covariance matrices
and expectations a, b. Given that kb − ak2 = 1, how large should be a sample of
independent realizations (drawn either all from p, or all from q) in order to distinguish
between P and Q with total error 2.e-6? Give a lower bound on the sample size based
on (3.6.18) and its true minimum size (to find it, use the result of the Follow-up in
item 1; note that a sample of vectors drawn independently from a Gaussian distribution
is itself a large Gaussian vector).
Task 3:
(a) Prove that the Hellinger affinity is nonnegative, is concave in (p, q) ≥ 0, does not
exceed 1 when p, q are probability densities (and is equal to one if and only if p = q),
and possesses the following two properties:
$$\mathrm{Hel}(p,q) \le \sqrt{2\int \min[p(\omega), q(\omega)]\,d\mu(\omega)}$$
and
$$\mathrm{Hel}(p^1\times...\times p^k,\ q^1\times...\times q^k) = \mathrm{Hel}(p^1,q^1)\cdot...\cdot\mathrm{Hel}(p^k,q^k).$$
(b) Derive from the previous item that the total error $2\epsilon$ in distinguishing two hypotheses on distribution of $\omega^k = (\omega_1, ..., \omega_k) \in \Omega\times...\times\Omega$, the first stating that the density of $\omega^k$ is $p^{\otimes k}$, and the second stating that this density is $q^{\otimes k}$, admits the lower bound
$$4\epsilon \ge (\mathrm{Hel}(p,q))^{2k}.$$
Follow-up: Compute the Hellinger affinity of two Gaussian densities on Rk , both with
covariance matrices $I_k$, the means of the densities being a and b. Use this result to derive a lower bound on the sample size considered in the previous Follow-up.
4. Experiment.
Task 4: Carry out the experiment as follows:
(a) Use the scheme represented in Proposition 3.6.1 to reproduce the results presented
in Illustration for the hypotheses B and C. Use the following data on the returns:
$$n = 15,\quad r_i = 1 + \mu_i + \sigma_i\zeta_i,\quad \mu_i = 0.001 + 0.9\cdot\frac{i-1}{n-1},\quad \sigma_i = \left[0.9 + 0.2\cdot\frac{i-1}{n-1}\right]\mu_i,\quad 1\le i\le n,$$
where $\zeta_i$ are zero mean random perturbations supported on $[-1, 1]$.
(b) Find numerically the probability distribution p of ζ ∈ B = [−1, 1]15 for which the
probability of “crisis” $\zeta_i = -1$, $1 \le i \le 15$, is as large as possible under the restrictions
that
— p is supported on the set of 215 vertices of the box B (i.e., the factors ζi take values
±1 only);
— the marginal distributions of ζi induced by p are the uniform distributions on
{−1; 1} (i.e., every ζi takes values ±1 with probabilities 1/2);
— the covariance matrix of ζ is I15 .
(c) After p is found, take its convex combination with the uniform distribution on the
vertices of B to get a distribution P∗ on the vertices of B for which the probability
of the crisis P∗ ({ζ = [−1; ...; −1]}) is exactly 0.01.
(d) Use the Kullback-Leibler and the Hellinger bounds to bound from below the observa-
tion time needed to distinguish, with the total error 2 · 0.01, between two probability
distributions on the vertices of B, namely, P∗ and the uniform one (the latter corre-
sponds to the case where ζi , 1 ≤ i ≤ 15, are independent and take values ±1 with
probabilities 1/2).
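A sketch of the bookkeeping behind item (d), written for two generic discrete densities (the toy p, q and the error level below are placeholders of ours, not the distributions of the task): the Hellinger bound comes from Task 3(b), $4\epsilon \ge \mathrm{Hel}(p,q)^{2k}$; the Kullback-Leibler bound uses (3.6.18) together with additivity of H over product distributions.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger affinity Hel(p, q) = sum_i sqrt(p_i q_i)."""
    return np.sum(np.sqrt(p * q))

def kl(p, q):
    """Kullback-Leibler distance sum_i p_i ln(p_i / q_i) (convention: 0 ln 0 = 0)."""
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

p = np.array([0.5, 0.3, 0.2])            # toy densities, stand-ins for P_* and
q = np.array([1.0, 1.0, 1.0]) / 3        # the uniform distribution of the task
eps = 0.01                               # per-hypothesis error level

k_hel = np.log(4 * eps) / (2 * np.log(hellinger(p, q)))   # from 4 eps >= Hel^(2k)
k_kl = (1 - 4 * eps) * np.log(1 / (2 * eps)) / kl(p, q)   # from (3.6.18): k H(p,q) >= rhs
print(np.ceil(k_hel), np.ceil(k_kl))     # lower bounds on the observation time k
```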
E = {x = Au + c | uT u ≤ 1} [A ∈ Mn,q ] (Ell)
• it is easy to specify an ellipsoid – just to point out the corresponding matrix A and vector
c;
• the family of ellipsoids is closed with respect to affine transformations: the image of an
ellipsoid under an affine mapping again is an ellipsoid;
• there are many operations, like minimization of a linear form, computation of volume,
etc., which are easy to carry out when the set in question is an ellipsoid, and difficult to carry out for more general convex sets.
For the indicated reasons, ellipsoids play an important role in different areas of applied mathematics; in particular, people use ellipsoids to approximate more complicated sets. Just as a simple
motivating example, consider a discrete-time linear time invariant controlled system:
x(t + 1) = Ax(t) + Bu(t), t = 0, 1, ...
x(0) = 0
and assume that the control is norm-bounded:
ku(t)k2 ≤ 1 ∀t.
The question is what is the set XT of all states “reachable in a given time T ”, i.e., the set of all
possible values of x(T ). We can easily write down the answer:
but this answer is not “explicit”; just to check whether a given vector x belongs to $X_T$ requires solving a nontrivial conic quadratic problem, whose complexity grows with T. In fact the geometry of $X_T$ may be very complicated, so that there is no possibility to get a “tractable” explicit description of the set. This is why in many applications it makes sense to use “simple” – ellipsoidal – approximations of $X_T$; as we shall see, approximations of this type can be computed in a recurrent and computationally efficient fashion.
It turns out that the natural framework for different problems of the “best possible” approx-
imation of convex sets by ellipsoids is given by semidefinite programming. In this Section we
intend to consider a number of basic problems of this type.
E = {x = Au + c | uT u ≤ 1} [A ∈ Sn++ ] (3.7.1)
Indeed, it is clear that if a matrix A represents, via (Ell), a given ellipsoid E, the matrix AU , U
being an orthogonal n × n matrix, represents E as well. It is known from Linear Algebra that by
multiplying a nonsingular square matrix from the right by a properly chosen orthogonal matrix, we get a positive definite symmetric matrix, so that we can always parameterize a full-dimensional ellipsoid by a positive definite symmetric A.
Indeed, one may take D = A−2 , where A is the matrix from the representation (3.7.1).
Note that the set (3.7.2) makes sense and is convex when the matrix D is positive semidefinite rather than positive definite. When $D \succeq 0$ is not positive definite, the set (3.7.2) is, geometrically, an “elliptic cylinder” – a shift of the direct product of a full-dimensional ellipsoid in the range space of D and the complementary to this range linear subspace – the kernel of D.
In the sequel we deal a lot with volumes of full-dimensional ellipsoids. Since an invertible
affine transformation x 7→ Ax + b : Rn → Rn multiplies the volumes of n-dimensional domains
by |DetA|, the volume of a full-dimensional ellipsoid E given by (3.7.1) is κn DetA, where κn
is the volume of the n-dimensional unit Euclidean ball. In order to avoid meaningless constant
factors, it makes sense to pass from the usual n-dimensional volume mesn (G) of a domain G to
its normalized volume
$$\mathrm{Vol}(G) = \kappa_n^{-1}\mathrm{mes}_n(G),$$
i.e., to choose, as the unit of volume, the volume of the unit ball rather than the one of the cube
with unit edges. From now on, speaking about volumes of n-dimensional domains, we always
mean their normalized volume (and omit the word “normalized”). With this convention, the
volume of a full-dimensional ellipsoid E given by (3.7.1) is just
Vol(E) = DetA,
while for an ellipsoid given by (3.7.2) the volume is
$$\mathrm{Vol}(E) = [\mathrm{Det}\,D]^{-1/2}.$$
Proposition 3.7.1 Assume that S is a full-dimensional polytope (i.e., is bounded and possesses
a nonempty interior). Then the largest volume ellipsoid contained in S is
E = {x = Z∗ u + z∗ | uT u ≤ 1},
$$\begin{array}{ll}
\text{maximize} & t\\
\text{s.t.} & (a)\ \ t \le (\mathrm{Det}\,Z)^{1/n},\\
& (b)\ \ Z \succeq 0,\\
& (c)\ \ \|Za_i\|_2 \le b_i - a_i^Tz,\ i = 1,...,m,
\end{array}\qquad(\mathrm{In})$$
$$a_i^T(Au + c) \le b_i\quad \forall u: u^Tu \le 1,$$
Thus, (In.b − c) just express the fact that the ellipsoid {x = Zu + z | uT u ≤ 1} is contained in
S, so that (In) is nothing but the problem of maximizing (a positive power of) the volume of an
ellipsoid over ellipsoids contained in S.
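For concreteness, here is a minimal cvxpy sketch of (In) (the function name and SCS solver choice are ours; cp.log_det is a monotone surrogate of the $\mathrm{Det}^{1/n}$ objective):

```python
import cvxpy as cp
import numpy as np

def largest_inscribed_ellipsoid(A, b):
    """Sketch of (In): maximize the volume of {Zu + z : ||u||_2 <= 1} over
    ellipsoids contained in the polytope {x : A x <= b} (rows of A are the a_i^T)."""
    m, n = A.shape
    Z = cp.Variable((n, n), PSD=True)
    z = cp.Variable(n)
    cons = [cp.norm(Z @ A[i], 2) <= b[i] - A[i] @ z for i in range(m)]  # (In.c)
    cp.Problem(cp.Maximize(cp.log_det(Z)), cons).solve(solver=cp.SCS)
    return Z.value, z.value

# e.g., for the unit box [-1,1]^2 the answer should be close to the unit ball:
A = np.vstack([np.eye(2), -np.eye(2)]); b = np.ones(4)
print(largest_inscribed_ellipsoid(A, b)[0])
```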
We see that if S is a polytope given by a set of linear inequalities, then the problem of the
best inner ellipsoidal approximation of S is an explicit semidefinite program and as such can be
efficiently solved. In contrast to this, if S is a polytope given as the convex hull of a finite set:
S = Conv{x1 , ..., xm },
then the problem of the best inner ellipsoidal approximation of S is “computationally in-
tractable” – in this case, it is difficult just to check whether a given candidate ellipsoid is
contained in S.
E = E(Z, z) ≡ {x = Zu + z | uT u ≤ 1} [Z ∈ Mn,q ]
$$E \subset W$$
$$\Updownarrow$$
$$u^Tu \le 1 \ \Rightarrow\ (Zu + z - y)^TY^TY(Zu + z - y) \le 1$$
$$\Updownarrow$$
$$u^Tu \le t^2 \ \Rightarrow\ (Zu + t(z-y))^TY^TY(Zu + t(z-y)) \le t^2$$
$$\Updownarrow\ [\text{S-Lemma}]$$
$$\exists\lambda\ge0:\ \left[t^2 - (Zu + t(z-y))^TY^TY(Zu + t(z-y))\right] - \lambda\left[t^2 - u^Tu\right] \ge 0\ \ \forall(u,t)$$
$$\Updownarrow$$
$$\exists\lambda\ge0:\ \begin{bmatrix} 1-\lambda-(z-y)^TY^TY(z-y) & -(z-y)^TY^TYZ\\ -Z^TY^TY(z-y) & \lambda I_q - Z^TY^TYZ \end{bmatrix} \succeq 0$$
$$\Updownarrow$$
$$\exists\lambda\ge0:\ \begin{bmatrix} 1-\lambda & \\ & \lambda I_q \end{bmatrix} - \begin{bmatrix} (z-y)^TY^T\\ Z^TY^T \end{bmatrix}\begin{pmatrix} Y(z-y) & YZ \end{pmatrix} \succeq 0$$
Now note that in view of the Lemma on the Schur Complement the matrix
$$\begin{bmatrix} 1-\lambda & \\ & \lambda I_q \end{bmatrix} - \begin{bmatrix} (z-y)^TY^T\\ Z^TY^T \end{bmatrix}\begin{pmatrix} Y(z-y) & YZ \end{pmatrix}$$
is positive semidefinite if and only if the matrix in (3.7.3) is so. Thus, E ⊂ W if and only if
there exists a nonnegative λ such that the matrix in (3.7.3), let it be called P (λ), is positive
semidefinite. Since the latter matrix can be positive semidefinite only when λ ≥ 0, we have
proved the first statement of the proposition. To prove the second statement, note that the
matrix in (3.7.4), let it be called Q(λ), is closely related to P (λ):
Y −1
$$\begin{array}{ll}
\text{maximize} & t\\
\text{s.t.} & t \le (\mathrm{Det}\,Z)^{1/n},\\
& \begin{bmatrix} I_n & B_i(z-c_i) & B_iZ\\ (z-c_i)^TB_i & 1-\lambda_i & \\ ZB_i & & \lambda_iI_n \end{bmatrix} \succeq 0,\ i = 1,...,m,\\
& Z \succeq 0
\end{array}\qquad(\mathrm{InEll})$$
E = {x = Z∗ u + z∗ | uT u ≤ 1}.
i = 1, ..., m, be given ellipsoids in Rn ; assume that the convex hull W of the union of these
ellipsoids possesses a nonempty interior. Then the problem of the best outer ellipsoidal approx-
imation of W is the explicit semidefinite program
$$\begin{array}{ll}
\text{maximize} & t\\
\text{s.t.} & t \le (\mathrm{Det}\,Y)^{1/n},\\
& \begin{bmatrix} I_n & Yc_i - z & YA_i\\ (Yc_i-z)^T & 1-\lambda_i & \\ A_i^TY & & \lambda_iI_{k_i} \end{bmatrix} \succeq 0,\ i = 1,...,m,\\
& Y \succeq 0
\end{array}\qquad(\mathrm{OutEll})$$
ku(t)k2 ≤ 1, t = 0, ..., T − 1.
How could we build such an approximation recursively? Let $X_t$ be the set of all states where the system can be driven in time $t \le T$, and assume that we have already built inner and outer ellipsoidal approximations $E^t_{\mathrm{in}}$ and $E^t_{\mathrm{out}}$ of the set $X_t$:
$$E^t_{\mathrm{in}} \subset X_t \subset E^t_{\mathrm{out}}.$$
Let also
E = {x = Bu | uT u ≤ 1}.
Given m ellipsoids W1 , ..., Wm in Rn , find the best inner and outer ellipsoidal ap-
proximations of the arithmetic sum
$$W = \{x = w_1 + w_2 + ... + w_m \mid w_i \in W_i,\ i = 1,...,m\}$$
In fact, we have posed two different problems: the one of inner approximation of W (let this
problem be called (I)) and the other one, let it be called (O), of outer approximation. It seems
that in general both these problems are difficult (at least when m is not fixed once and for all). There exist, however, “computationally tractable” approximations of both (I) and (O), which we are about to consider.
In considerations to follow we assume, for the sake of simplicity, that the ellipsoids W1 , ..., Wm
are full-dimensional (which is not a severe restriction – a “flat” ellipsoid can be easily approxi-
mated by a “nearly flat” full-dimensional ellipsoid). Besides this, we may assume without loss
of generality that all our ellipsoids Wi are centered at the origin. Indeed, we have Wi = ci + Vi ,
where $c_i$ is the center of $W_i$ and $V_i = W_i - c_i$ is centered at the origin; consequently,
$$W_1 + ... + W_m = (c_1 + ... + c_m) + (V_1 + ... + V_m),$$
so that the problems (I) and (O) for the ellipsoids $W_1, ..., W_m$ can be straightforwardly reduced to similar problems for the centered at the origin ellipsoids $V_1, ..., V_m$.
$$W_i = \{x\in\mathbf{R}^n \mid x^TB_ix \le 1\}\qquad[B_i \succ 0].$$
Our strategy to approximate (O) is very natural: we intend to build a parametric family of
ellipsoids in such a way that, first, every ellipsoid from the family contains the arithmetic sum
W1 + ... + Wm of given ellipsoids, and, second, the problem of finding the smallest volume
ellipsoid within the family is a “computationally tractable” problem (specifically, is an explicit
semidefinite program)23) . The seemingly simplest way to build the desired family was proposed
in [8] and is based on the idea of semidefinite relaxation. Let us start with the observation that
an ellipsoid
$$W[Z] = \{x \mid x^TZx \le 1\}\qquad[Z \succ 0]$$
contains W1 + ... + Wm if and only if the following implication holds:
$$\forall\{x^i\in\mathbf{R}^n\}_{i=1}^m:\ [x^i]^TB_ix^i \le 1,\ i = 1,...,m\ \Rightarrow\ (x^1+...+x^m)^TZ(x^1+...+x^m) \le 1.\qquad(*)$$
Now let $B^i$ be the $(nm)\times(nm)$ block-diagonal matrix with m diagonal blocks of the size $n\times n$ each, such that all diagonal blocks, except the i-th one, are zero, and the i-th block is the $n\times n$ matrix $B_i$. Let also $M[Z]$ denote the $(mn)\times(mn)$ block matrix with $m^2$ blocks of the size $n\times n$ each, every one of these blocks being the matrix Z. This is how $B^i$ and $M[Z]$ look in the case of m = 2:
$$B^1 = \begin{bmatrix} B_1 & \\ & \end{bmatrix},\qquad B^2 = \begin{bmatrix} & \\ & B_2 \end{bmatrix},\qquad M[Z] = \begin{bmatrix} Z & Z\\ Z & Z \end{bmatrix}.$$
Validity of implication (∗) clearly is equivalent to the following fact:
(*.1) For every (mn)-dimensional vector x such that
$$x^TB^ix \equiv \mathrm{Tr}(B^i\underbrace{xx^T}_{X[x]}) \le 1,\quad i = 1,...,m,$$
one has
$$x^TM[Z]x \equiv \mathrm{Tr}(M[Z]X[x]) \le 1.$$
Now we can use the standard trick: the rank one matrix X[x] is positive semidefinite, so that
we for sure enforce the validity of the above fact when enforcing the following stronger fact:
(*.2) For every (mn) × (mn) symmetric positive semidefinite matrix X such that
Tr(B i X) ≤ 1, i = 1, ..., m,
one has
Tr(M [Z]X) ≤ 1.
W [Z] = {x | xT Zx ≤ 1}
We are basically done: the set of those symmetric matrices Z for which the optimal value
in (SDP) is ≤ 1 is SD-representable; indeed, the problem is clearly strictly feasible, and Z
affects, in a linear fashion, the objective of the problem only. On the other hand, the optimal
value in a strictly feasible semidefinite maximization program is a SDr function of the objective
(“semidefinite version” of Proposition 2.4.4). Consequently, the set of those Z for which the
optimal value in (SDP) is ≤ 1 is SDr (as the inverse image, under affine mapping, of the level set
of a SDr function). Thus, the “parameter” Z of those ellipsoids W [Z] which satisfy the premise
in (D) and thus contain W1 + ... + Wm varies in an SDr set Z. Consequently, the problem of
finding the smallest volume ellipsoid in the family {W [Z]}Z∈Z is equivalent to the problem of
maximizing a positive power of Det(Z) over the SDr set Z, i.e., is equivalent to a semidefinite
program.
It remains to build the aforementioned semidefinite program. By the Conic Duality Theorem the optimal value in the (clearly strictly feasible) maximization program (SDP) is ≤ 1 if and only if the dual problem
$$\min_\lambda\left\{\sum_{i=1}^m \lambda_i \ \Big|\ \sum_i \lambda_iB^i \succeq M[Z],\ \lambda_i \ge 0,\ i = 1,...,m\right\}$$
admits a feasible solution with the value of the objective ≤ 1, or, which is clearly the same (why?), admits a feasible solution with the value of the objective equal to 1. In other words, whenever $Z \succeq 0$ is such that M[Z] is a convex combination of the matrices $B^i$, the set
W [Z] = {x | xT Zx ≤ 1}
(which is an ellipsoid when $Z \succ 0$) contains the set $W_1 + ... + W_m$. We have arrived at the
following result (see [8], Section 3.7.4):
$$W_i = \{x\in\mathbf{R}^n \mid x^TB_ix \le 1\}\qquad[B_i \succ 0],$$
where B i is the (mn) × (mn) block-diagonal matrix with blocks of the size n × n and the only
nonzero diagonal block (the i-th one) equal to Bi , and M [Z] is the (mn)×(mn) matrix partitioned
into m2 blocks, every one of them being Z. Every feasible solution (Z, ...) to this program with
positive value of the objective produces ellipsoid
W [Z] = {x | xT Zx ≤ 1}
which contains $W_1 + ... + W_m$, and the volume of this ellipsoid is at most $t^{-n/2}$. The smallest
volume ellipsoid which can be obtained in this way is given by (any) optimal solution of (Õ).
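A minimal cvxpy rendering of this recipe (function name, solver, and the PSD slack T – a technical device to impose the block inequality – are our choices): we look for $Z \succeq 0$ with $M[Z] \preceq \sum_i\lambda_iB^i$, $\sum_i\lambda_i = 1$, of maximal determinant.

```python
import cvxpy as cp
import numpy as np

def outer_sum_ellipsoid(Bs):
    """Sketch: W[Z] = {x : x^T Z x <= 1} containing W_1 + ... + W_m,
    W_i = {x : x^T B_i x <= 1}, via the semidefinite relaxation above."""
    m, n = len(Bs), Bs[0].shape[0]
    Z = cp.Variable((n, n), PSD=True)
    lam = cp.Variable(m, nonneg=True)
    M = cp.bmat([[Z for _ in range(m)] for _ in range(m)])       # m x m blocks, each = Z
    E = lambda i: np.outer(np.eye(m)[i], np.eye(m)[i])           # e_i e_i^T
    Bmix = sum(lam[i] * np.kron(E(i), Bs[i]) for i in range(m))  # sum_i lam_i B^i
    T = cp.Variable((m * n, m * n), PSD=True)                    # slack: Bmix - M >= 0
    cons = [Bmix - M == T, cp.sum(lam) == 1]
    cp.Problem(cp.Maximize(cp.log_det(Z)), cons).solve(solver=cp.SCS)
    return Z.value
```

By Proposition 3.7.5 below, the volume of the ellipsoid obtained this way exceeds the optimal one by a factor of at most $(\pi/2)^{n/2}$.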
How “conservative” is (Õ) ? The ellipsoid W [Z ∗ ] given by the optimal solution of (Õ)
contains the arithmetic sum W of the ellipsoids Wi , but not necessarily is the smallest volume
ellipsoid containing W ; all we know is that this ellipsoid is the smallest volume one in certain
subfamily of the family of all ellipsoids containing W . “In the nature” there exists the “true”
smallest volume ellipsoid W [Z ∗∗ ] = {x | xT Z ∗∗ x ≤ 1}, Z ∗∗ 0, containing W . It is natural to
ask how large could be the ratio
Vol(W [Z ∗ ])
ϑ= .
Vol(W [Z ∗∗ ])
The answer is as follows:
Proposition 3.7.5 One has $\vartheta \le \left(\frac{\pi}{2}\right)^{n/2}$.
Note that the bound stated by Proposition 3.7.5 is not as bad as it looks: the natural way to compare the “sizes” of two n-dimensional bodies $E'$, $E''$ is to look at the ratio of their average linear sizes $\left(\frac{\mathrm{Vol}(E')}{\mathrm{Vol}(E'')}\right)^{1/n}$ (it is natural to assume that when shrinking a body by a certain factor, say, 2, we reduce the “size” of the body exactly by this factor, and not by $2^n$). With this approach, the “level of non-optimality” of $W[Z^*]$ is no more than $\sqrt{\pi/2} = 1.253...$, i.e., is within a 25% margin.
Proof of Proposition 3.7.5: Since $W[Z^{**}]$ contains W, the implication (*.1) holds true, i.e., one has
$$\max_{x\in\mathbf{R}^{mn}}\left\{x^TM[Z^{**}]x \mid x^TB^ix \le 1,\ i = 1,...,m\right\} \le 1.$$
Since the matrices $B^i$, $i = 1,...,m$, commute and $M[Z^{**}] \succeq 0$, we can apply Proposition 3.8.1 (see Section 3.8.5) to conclude that there exist nonnegative $\mu_i$, $i = 1,...,m$, such that
$$M[Z^{**}] \preceq \sum_{i=1}^m \mu_iB^i,\qquad \sum_i \mu_i \le \frac{\pi}{2}.$$
It follows that setting $\lambda_i = \big(\sum_j\mu_j\big)^{-1}\mu_i$, $Z = \big(\sum_j\mu_j\big)^{-1}Z^{**}$, $t = \mathrm{Det}^{1/n}(Z)$, we get a feasible solution of (Õ). Recalling the origin of $Z^*$, we come to
$$\mathrm{Vol}(W[Z^*]) \le \mathrm{Vol}(W[Z]) = \Big(\sum_j\mu_j\Big)^{n/2}\mathrm{Vol}(W[Z^{**}]) \le \Big(\frac{\pi}{2}\Big)^{n/2}\mathrm{Vol}(W[Z^{**}]),$$
as claimed.
Problem (O), the case of “co-axial” ellipsoids. Consider the co-axial case – the one when
there exist coordinates (not necessarily orthogonal) such that all m quadratic forms defining the
ellipsoids Wi are diagonal in these coordinates, or, which is the same, there exists a nonsingular
matrix C such that all the matrices C T Bi C, i = 1, ..., m, are diagonal. Note that the case of
m = 2 always is co-axial – Linear Algebra says that every two homogeneous quadratic forms, at least one of the forms being positive outside of the origin, become diagonal in properly chosen coordinates.
We are about to prove that
(E) In the “co-axial” case, (Õ) yields the smallest in volume ellipsoid containing
W1 + ... + Wm .
Consider the co-axial case. Since we are interested in volume-related issues, and the ratio
of volumes remains unchanged under affine transformations, we can assume w.l.o.g. that the
matrices Bi defining the ellipsoids Wi = {x | xT Bi x ≤ 1} are positive definite and diagonal; let
bi` be the `-th diagonal entry of Bi , ` = 1, ..., n.
By the Fritz John Theorem, “in the nature” there exists a unique smallest volume ellipsoid
W∗ which contains W1 + ... + Wm ; from uniqueness combined with the fact that the sum of our
ellipsoids is symmetric w.r.t. the origin it follows that this optimal ellipsoid W∗ is centered at
the origin:
$$W_* = \{x \mid x^TZ_*x \le 1\}.$$
Moreover, since the matrices $B_i$ are diagonal, for every diagonal matrix E with diagonal entries ±1 we have
$$x \in W_1 + ... + W_m \ \Leftrightarrow\ Ex \in W_1 + ... + W_m.$$
It follows that the ellipsoid E(W∗ ) = {x | xT (E T Z∗ E)x ≤ 1} covers W1 + ... + Wm along with
W∗ and of course has the same volume as W∗ ; from the uniqueness of the optimal ellipsoid it
follows that E(W∗ ) = W∗ , whence E T Z∗ E = Z∗ (why?). Since the concluding relation should
be valid for all diagonal matrices E with diagonal entries ±1, Z∗ must be diagonal.
Now assume that the set
$$\Big\{x \ \Big|\ \sum_{\ell=1}^n z_\ell x_\ell^2 \le 1\Big\}$$
given by a nonnegative vector z contains $W_1 + ... + W_m$. Then the following implication holds true:
$$\forall\{x_\ell^i\}_{\substack{i=1,...,m\\ \ell=1,...,n}}:\ \sum_{\ell=1}^n b_{i\ell}(x_\ell^i)^2 \le 1,\ i = 1,...,m\ \Rightarrow\ \sum_{\ell=1}^n z_\ell(x_\ell^1 + x_\ell^2 + ... + x_\ell^m)^2 \le 1.\qquad(3.7.6)$$
Denoting y`i = (xi` )2 and taking into account that z` ≥ 0, we see that the validity of (3.7.6)
implies the validity of the implication
$$\forall\{y_\ell^i \ge 0\}_{\substack{i=1,...,m\\ \ell=1,...,n}}:\ \sum_{\ell=1}^n b_{i\ell}y_\ell^i \le 1,\ i = 1,...,m\ \Rightarrow\ \sum_{\ell=1}^n z_\ell\Big[\sum_{i=1}^m y_\ell^i + 2\sum_{1\le i<j\le m}\sqrt{y_\ell^iy_\ell^j}\Big] \le 1.\qquad(3.7.7)$$
Now let Y be an $(mn)\times(mn)$ symmetric matrix satisfying the relations
$$Y \succeq 0,\qquad \mathrm{Tr}(YB^i) \le 1,\ i = 1,...,m.\qquad(3.7.8)$$
Let us partition Y into $m^2$ square blocks, and let $Y_\ell^{ij}$ be the ℓ-th diagonal entry of the ij-th block of Y. For all i, j with $1 \le i < j \le m$, and all ℓ, $1 \le \ell \le n$, the $2\times2$ matrix
$$\begin{bmatrix} Y_\ell^{ii} & Y_\ell^{ij}\\ Y_\ell^{ij} & Y_\ell^{jj} \end{bmatrix}$$
is a principal submatrix of Y and therefore is positive semidefinite along with Y, whence
$$Y_\ell^{ij} \le \sqrt{Y_\ell^{ii}Y_\ell^{jj}}.\qquad(3.7.9)$$
In view of (3.7.8), the numbers y`i ≡ Y`ii satisfy the premise in the implication (3.7.7), so that
" #
n m q
1 ≥ Y`ii Y`ii Y`jj
P P P
z` +2 [by (3.7.7)]
`=1 "i=1 1≤i<j≤m #
n m
≥ Y`ii + 2 Y`ij [since z ≥ 0 and by (3.7.9)]
P P P
z`
`=1 i=1 1≤i<j≤m
= Tr(Y M [Diag(z)]).
Thus, (3.7.8) implies the inequality Tr(Y M [Diag(z)]) ≤ 1, i.e., the implication
$$Y \succeq 0,\ \mathrm{Tr}(YB^i) \le 1,\ i = 1,...,m\ \Rightarrow\ \mathrm{Tr}(YM[\mathrm{Diag}(z)]) \le 1$$
holds true. Since the premise in this implication is strictly feasible, the validity of the implication,
by Semidefinite Duality, implies the existence of nonnegative $\lambda_i$, $\sum_i\lambda_i \le 1$, such that
$$M[\mathrm{Diag}(z)] \preceq \sum_i \lambda_iB^i.$$
$$W_i = \{x = A_iu \mid u^Tu \le 1\}\qquad[\mathrm{Det}(A_i) \ne 0].$$
Indeed, assume, first, that there exists a vector $x_*$ such that the inequality in (3.7.10) is violated at $x = x_*$, and let us prove that in this case E[Z] is not contained in the set $W = W_1 + ... + W_m$. We have
$$\max_{x\in W_i} x_*^Tx = \|A_i^Tx_*\|_2,$$
and similarly
$$\max_{x\in E[Z]} x_*^Tx = \|Z^Tx_*\|_2,$$
whence
$$\max_{x\in W} x_*^Tx = \max_{x^i\in W_i} x_*^T(x^1 + ... + x^m) = \sum_{i=1}^m \max_{x^i\in W_i} x_*^Tx^i = \sum_{i=1}^m \|A_i^Tx_*\|_2 < \|Z^Tx_*\|_2 = \max_{x\in E[Z]} x_*^Tx,$$
and we see that E[Z] cannot be contained in W . Vice versa, assume that E[Z] is not
contained in W , and let y ∈ E[Z]\W . Since W is a convex compact set and y 6∈ W , there
exists a vector $x_*$ such that $x_*^Ty > \max_{x\in W} x_*^Tx$, whence, due to the previous computation,
$$\|Z^Tx_*\|_2 = \max_{x\in E[Z]} x_*^Tx \ge x_*^Ty > \max_{x\in W} x_*^Tx = \sum_{i=1}^m \|A_i^Tx_*\|_2,$$
and we have found a point x = x∗ at which the inequality in (3.7.10) is violated. Thus, E[Z]
is not contained in W if and only if (3.7.10) is not true, which is exactly what should be
proved.
A natural way to generate ellipsoids satisfying (3.7.10) is to note that whenever $X_i$ are $n\times n$ matrices with spectral norms
$$|X_i| \equiv \sqrt{\lambda_{\max}(X_i^TX_i)} = \sqrt{\lambda_{\max}(X_iX_i^T)} = \max_x\{\|X_ix\|_2 \mid \|x\|_2 \le 1\}$$
not exceeding 1, the matrix $Z = A_1X_1 + ... + A_mX_m$ satisfies (3.7.10):
$$\|Z^Tx\|_2 = \|[X_1^TA_1^T + ... + X_m^TA_m^T]x\|_2 \le \sum_{i=1}^m \|X_i^TA_i^Tx\|_2 \le \sum_{i=1}^m |X_i^T|\,\|A_i^Tx\|_2 \le \sum_{i=1}^m \|A_i^Tx\|_2.$$
Thus, every collection of square matrices Xi with spectral norms not exceeding 1 produces
an ellipsoid satisfying (3.7.10) and thus contained in W , and we could use the largest volume
ellipsoid of this form (i.e., the one corresponding to the largest |Det(A1 X1 + ... + Am Xm )|) as a
surrogate of the largest volume ellipsoid contained in W . Recall that we know how to express a
bound on the spectral norm of a matrix via an LMI:
$$|X| \le t\ \Leftrightarrow\ \begin{bmatrix} tI_n & -X^T\\ -X & tI_n \end{bmatrix} \succeq 0\qquad[X \in M^{n,n}]$$
(item 16 of Section 3.2). The difficulty, however, is that the matrix $\sum_{i=1}^m A_iX_i$ specifying the ellipsoid $E(X_1,...,X_m)$, although being linear in the “design variables” $X_i$, is not necessarily symmetric positive semidefinite, and we do not know how to maximize the determinant over general-type square matrices. We may, however, use the following fact from Linear Algebra:
Lemma 3.7.1 Let Y = S + C be a square matrix represented as the sum of a symmetric matrix
S and a skew-symmetric (i.e., C T = −C) matrix C. Assume that S is positive definite. Then
|Det(Y )| ≥ Det(S).
Proof. We have Y = S + C = S 1/2 (I + Σ)S 1/2 , where Σ = S −1/2 CS −1/2 is skew-symmetric along
with C. We have |Det(Y )| = Det(S)|Det(I + Σ)|; it remains to note that all eigenvalues of the skew-
symmetric matrix Σ are purely imaginary, so that the eigenvalues of I + Σ are ≥ 1 in absolute value,
whence |Det(I + Σ)| ≥ 1.
In view of the Lemma, it makes sense to impose on $X_1,...,X_m$, besides the requirement that their spectral norms are ≤ 1, also the requirement that the “symmetric part”
$$S(X_1,...,X_m) = \frac{1}{2}\left[\sum_{i=1}^m A_iX_i + \sum_{i=1}^m X_i^TA_i\right]$$
of the matrix $\sum_i A_iX_i$ is positive semidefinite, and to maximize under these constraints the quantity $\mathrm{Det}(S(X_1,...,X_m))$ – a lower bound on the volume of the ellipsoid $E[Z(X_1,...,X_m)]$.
With this approach, we come to the following result:
Proposition 3.7.6 Let $W_i = \{x = A_iu \mid u^Tu \le 1\}$, $A_i \succ 0$, $i = 1,...,m$. Consider the semidefinite program
$$\begin{array}{ll}
\text{maximize} & t\\
\text{s.t.} & (a)\ \ t \le \mathrm{Det}^{1/n}\Big(\frac{1}{2}\sum\limits_{i=1}^m\left[X_i^TA_i + A_iX_i\right]\Big),\\
& (b)\ \ \sum\limits_{i=1}^m\left[X_i^TA_i + A_iX_i\right] \succeq 0,\\
& (c)\ \ \begin{bmatrix} I_n & -X_i^T\\ -X_i & I_n \end{bmatrix} \succeq 0,\ i = 1,...,m,
\end{array}\qquad(\tilde{\mathrm{I}})$$
with design variables $X_1,...,X_m \in M^{n,n}$, $t \in \mathbf{R}$. Every feasible solution $(\{X_i\}, t)$ to this problem produces the ellipsoid
$$E(X_1,...,X_m) = \Big\{x = \Big(\sum_{i=1}^m A_iX_i\Big)u\ \Big|\ u^Tu \le 1\Big\}$$
contained in the arithmetic sum $W_1 + ... + W_m$ of the original ellipsoids, and the volume of this ellipsoid is at least $t^n$. The largest volume ellipsoid which can be obtained in this way is
associated with (any) optimal solution to (Ĩ).
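A cvxpy sketch of (Ĩ) (the function name and solver are our choices; the symmetric variable S stands for $S(X_1,...,X_m)$, and cp.log_det replaces the $\mathrm{Det}^{1/n}$ objective):

```python
import cvxpy as cp

def inner_sum_ellipsoid(As):
    """Sketch of (I~): maximize Det of the symmetric part of G = sum_i A_i X_i
    over |X_i| <= 1; then {G u : ||u||_2 <= 1} lies inside W_1 + ... + W_m."""
    n = As[0].shape[0]
    Xs = [cp.Variable((n, n)) for _ in As]
    G = sum(A @ X for A, X in zip(As, Xs))
    S = cp.Variable((n, n), symmetric=True)          # S(X_1, ..., X_m)
    cons = [S == 0.5 * (G + G.T)]                    # (I~.a-b) via log_det(S)
    cons += [cp.sigma_max(X) <= 1 for X in Xs]       # (I~.c): spectral norms <= 1
    cp.Problem(cp.Maximize(cp.log_det(S)), cons).solve(solver=cp.SCS)
    return G.value
```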
we have started with, since the latter problem always has an optimal solution $\{X_i^*\}$ with positive semidefinite symmetric matrix $G^* = \sum_{i=1}^m A_iX_i^*$. Indeed, let $\{X_i^+\}$ be an optimal solution of the problem. The matrix $G^+ = \sum_{i=1}^m A_iX_i^+$, as every $n\times n$ square matrix, admits a representation $G^+ = G^*U$, where $G^*$ is a positive semidefinite symmetric matrix and U is an orthogonal matrix. Setting $X_i^* = X_i^+U^T$, we convert $\{X_i^+\}$ into a new feasible solution of (3.7.11); for this solution $\sum_{i=1}^m A_iX_i^* = G^* \succeq 0$, and $\mathrm{Det}(G^+) = \mathrm{Det}(G^*)$, so that the new solution is optimal along with $\{X_i^+\}$.
Problem (I), the co-axial case. We are about to demonstrate that in the co-axial case,
when in properly chosen coordinates in Rn the ellipsoids Wi can be represented as
Wi = {x = Ai u | uT u ≤ 1}
with positive definite diagonal matrices Ai , the above scheme yields the best (the largest volume)
ellipsoid among those contained in W = W1 + ... + Wm . Moreover, this ellipsoid can be pointed
out explicitly – it is exactly the ellipsoid E[Z] with Z = Z(In , ..., In ) = A1 + ... + Am !
The announced fact is nearly evident. Assuming that Ai are positive definite and diagonal,
consider the parallelotope
$$\widehat{W} = \Big\{x\in\mathbf{R}^n\ \Big|\ |x_j| \le \ell_j = \sum_{i=1}^m [A_i]_{jj},\ j = 1,...,n\Big\}.$$
This parallelotope clearly contains W (why?), and the largest volume ellipsoid contained in $\widehat{W}$ clearly is the ellipsoid
$$\Big\{x\ \Big|\ \sum_{j=1}^n \ell_j^{-2}x_j^2 \le 1\Big\},$$
i.e., is nothing else but the ellipsoid E[A1 + ... + Am ]. As we know from our previous
considerations, the latter ellipsoid is contained in W, and since it is the largest volume ellipsoid among those contained in the set $\widehat{W} \supset W$, it is the largest volume ellipsoid contained in W as well.
Example. In the example to follow we are interested to understand what is the domain DT
on the 2D plane which can be reached by a trajectory of the differential equation
$$\frac{d}{dt}\begin{bmatrix} x_1(t)\\ x_2(t) \end{bmatrix} = \underbrace{\begin{bmatrix} -0.8147 & -0.4163\\ 0.8167 & -0.1853 \end{bmatrix}}_{A}\begin{bmatrix} x_1(t)\\ x_2(t) \end{bmatrix} + \begin{bmatrix} u_1(t)\\ 0.7071\,u_2(t) \end{bmatrix},\qquad \begin{bmatrix} x_1(0)\\ x_2(0) \end{bmatrix} = \begin{bmatrix} 0\\ 0 \end{bmatrix}$$
in T sec under a piecewise-constant control $u(t) = \begin{bmatrix} u_1(t)\\ u_2(t) \end{bmatrix}$ which switches from one constant value to another one every $\Delta t = 0.01$ sec and is subject to the norm bound
$$\|u(t)\|_2 \le 1\quad \forall t.$$
The system is stable (the eigenvalues of A are $-0.5 \pm 0.4909i$). In order to build $D_T$, note that the states of the system at the time instants $k\Delta t$, $k = 0, 1, 2, ...$, are the same as the states $x[k] = \begin{bmatrix} x_1(k\Delta t)\\ x_2(k\Delta t) \end{bmatrix}$ of the discrete time system
$$x[k+1] = \underbrace{\exp\{A\Delta t\}}_{S}\,x[k] + \underbrace{\Big[\int_0^{\Delta t}\exp\{As\}\,ds\Big]\begin{bmatrix} 1 & 0\\ 0 & 0.7071 \end{bmatrix}}_{B}\,u[k],\qquad x[0] = \begin{bmatrix} 0\\ 0 \end{bmatrix},\qquad(3.7.12)$$
where u[k] is the value of the control on the “continuous time” interval $(k\Delta t, (k+1)\Delta t)$.
We build the inner $I_k$ and the outer $O_k$ ellipsoidal approximations of the domains $D_k = D_{k\Delta t}$ in a recurrent manner:
where u[k] is the value of the control on the “continuous time” interval (k∆t, (k + 1)∆t).
We build the inner Ik and the outer Ok ellipsoidal approximations of the domains Dk = Dk∆t
in a recurrent manner:
• $I_{k+1}$ is the best (the largest in area) ellipse contained in the set
$$SI_k + BW;$$
• $O_{k+1}$ is the best (the smallest in area) ellipse containing the set
$$SO_k + BW,$$
where $W = \{u \mid u^Tu \le 1\}$.
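To reproduce the data (3.7.12) numerically, one can get both S and B from a single matrix exponential (Van Loan's augmented-matrix trick; scipy assumed available, all names ours):

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[-0.8147, -0.4163],
              [ 0.8167, -0.1853]])
dt = 0.01
# expm of the augmented matrix yields exp(A*dt) and the integral of exp(As) at once:
# expm([[A, I], [0, 0]] * dt) = [[exp(A*dt), int_0^dt exp(As) ds], [0, I]]
aug = np.block([[A, np.eye(2)],
                [np.zeros((2, 2)), np.zeros((2, 2))]])
E = expm(aug * dt)
S = E[:2, :2]                             # exp(A * dt)
B = E[:2, 2:] @ np.diag([1.0, 0.7071])    # [int_0^dt exp(As) ds] diag(1, 0.7071)
print(S, B, sep="\n")
```

The recursion for $I_{k+1}$, $O_{k+1}$ then amounts to solving, at every step, the (InEll)/(OutEll)-type semidefinite programs above.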
3.8 Exercises
3.8.1 Around positive semidefiniteness, eigenvalues and ⪰-ordering
3.8.1.1 Criteria for positive semidefiniteness
Recall the criterion of positive definiteness of a symmetric matrix: a symmetric $m\times m$ matrix A is positive definite if and only if all its angular minors (the determinants of its north-western $k\times k$ sub-matrices, $k = 1,...,m$) are positive.
Exercise 3.2 Prove that a symmetric m × m matrix A is positive semidefinite if and only if all
its principal minors (i.e., determinants of square sub-matrices symmetric w.r.t. the diagonal)
are nonnegative.
Hint: look at the angular minors of the matrices $A + \epsilon I_n$ for small positive ε.
Exercise 3.4 Derive the Variational Characterization of Singular Values from the Variational
Characterization of Eigenvalues.
Exercise 3.5 Derive from the Variational Characterization of Eigenvalues the following facts:
(i) [Monotonicity of the vector of eigenvalues] If $A \succeq B$, then $\lambda(A) \ge \lambda(B)$;
(ii) The functions λ1 (X), λm (X) of X ∈ Sm are convex and concave, respectively.
(iii) If ∆ is a convex subset of the real axis, then the set of all matrices X ∈ Sm with spectrum
from ∆ is convex.
This definition is compatible with the arithmetic of real polynomials: when you add/multiply
polynomials, you add/multiply the “values” of these polynomials at every fixed symmetric ma-
trix:
(p + q)(A) = p(A) + q(A); (p · q)(A) = p(A)q(A).
A nice feature of this definition is that
(A) For A ∈ Sm , the matrix p(A) depends only on the restriction of p on the spectrum
(set of eigenvalues) of A: if p and q are two polynomials such that p(λi (A)) = q(λi (A))
for i = 1, ..., m, then p(A) = q(A).
Indeed, we can represent a symmetric matrix A as $A = U^T\Lambda U$, where U is orthogonal and Λ is diagonal with the eigenvalues of A on its diagonal. Since $UU^T = I$, we have $A^i = U^T\Lambda^iU$;
consequently,
p(A) = U T p(Λ)U,
and since the matrix p(Λ) depends on the restriction of p on the spectrum of A only, the
result follows.
As a byproduct of our reasoning, we get an “explicit” representation of p(A) in terms
of the spectral decomposition A = U T ΛU (U is orthogonal, Λ is diagonal with the
diagonal λ(A)):
(B) The matrix p(A) is just $U^T\mathrm{Diag}(p(\lambda_1(A)), ..., p(\lambda_m(A)))U$.
(A) allows us to define arbitrary functions of matrices, not necessarily polynomials:
Let A be symmetric matrix and f be a real-valued function defined at least at the
spectrum of A. By definition, the matrix f (A) is defined as p(A), where p is a
polynomial coinciding with f on the spectrum of A. (The definition makes sense,
since by (A) p(A) depends only on the restriction of p on the spectrum of A, i.e.,
every “polynomial continuation” p(·) of f from the spectrum of A to the entire axis
results in the same p(A)).
The “calculus of functions of a symmetric matrix” is fully compatible with the usual arithmetic
of functions, e.g:
(f + g)(A) = f (A) + g(A); (µf )(A) = µf (A); (f · g)(A) = f (A)g(A); (f ◦ g)(A) = f (g(A)),
provided that the functions in question are well-defined on the spectrum of the corre-
sponding matrix. And of course the spectral decomposition of f (A) is just f (A) =
U T Diag(f (λ1 (A)), ..., f (λm (A)))U , where A = U T Diag(λ1 (A), ..., λm (A))U is the spectral de-
composition of A.
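In computational form, (B) is one line of linear algebra; the following sketch (function name ours) builds f(A) from the spectral decomposition and cross-checks it on two functions:

```python
import numpy as np
from scipy.linalg import expm

def fun_of_sym(A, f):
    """f(A) = U^T Diag(f(lambda_1), ..., f(lambda_m)) U for symmetric A, as in (B)."""
    lam, V = np.linalg.eigh(A)        # A = V @ diag(lam) @ V.T, V orthogonal
    return (V * f(lam)) @ V.T

A = np.random.default_rng(0).standard_normal((4, 4)); A = (A + A.T) / 2
print(np.allclose(fun_of_sym(A, np.exp), expm(A)))      # matches exp{A}
print(np.allclose(fun_of_sym(A, np.square), A @ A))     # matches A^2
```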
Note that “Calculus of functions of symmetric matrices” becomes very unusual when we
are trying to operate with functions of several (non-commuting) matrices. E.g., it is generally
not true that exp{A + B} = exp{A} exp{B} (the right hand side matrix may be even non-
symmetric!). It is also generally not true that if f is monotone and $A \succeq B$, then $f(A) \succeq f(B)$,
etc.
Exercise 3.6 Demonstrate by an example that the relation $0 \preceq A \preceq B$ does not necessarily imply that $A^2 \preceq B^2$.
By the way, the relation $0 \preceq A \preceq B$ does imply that $0 \preceq A^{1/2} \preceq B^{1/2}$.
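If no example comes to mind, random search finds one immediately; here is a small sketch (entirely our illustration) that samples pairs $0 \preceq A \preceq B$ and tests whether $A^2 \preceq B^2$ fails:

```python
import numpy as np

rng = np.random.default_rng(2)

def min_eig(M):
    return np.linalg.eigvalsh((M + M.T) / 2).min()

for _ in range(1000):
    A = rng.standard_normal((2, 2)); A = A @ A.T   # A >= 0
    D = rng.standard_normal((2, 2)); D = D @ D.T   # D >= 0
    B = A + D                                      # so 0 <= A <= B by construction
    if min_eig(B @ B - A @ A) < -1e-9:             # is A^2 <= B^2 violated?
        print("counterexample found:\n", A, "\n", B)
        break
```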
Sometimes, however, we can get “weak” matrix versions of usual arithmetic relations. E.g.,
Exercise 3.7 Let f be a nondecreasing function on the real line, and let $A \succeq B$. Prove that $\lambda(f(A)) \ge \lambda(f(B))$.
The strongest (and surprising) “weak” matrix version of a usual (“scalar”) inequality is as
follows.
Let f (t) be a closed convex function on the real line; by definition, it means that f is a
function on the axis taking real values and the value +∞ such that
– the set Dom f of the values of argument where f is finite is convex and nonempty;
– if a sequence {ti ∈ Dom f } converges to a point t and the sequence f (ti ) has a limit, then
$t \in \mathrm{Dom}\,f$ and $f(t) \le \lim_{i\to\infty} f(t_i)$ (this property is called “lower semicontinuity”).
E.g., the function
$$f(t) = \begin{cases} 0, & 0 \le t \le 1\\ +\infty, & \text{otherwise} \end{cases}$$
is closed. In contrast to this, the functions
$$g(t) = \begin{cases} 0, & 0 < t \le 1\\ 1, & t = 0\\ +\infty, & \text{for all remaining } t \end{cases}$$
and
$$h(t) = \begin{cases} 0, & 0 < t < 1\\ +\infty, & \text{otherwise} \end{cases}$$
are not closed, although they are convex: a closed function cannot “jump up” at an endpoint
of its domain, as it is the case for g, and it cannot take value +∞ at a point, if it takes values
≤ a < ∞ in a neighbourhood of the point, as it is the case for h.
For a convex function f , its Legendre transformation f∗ (also called the conjugate, or the
Fenchel dual of f ) is defined as
f∗ (s) = sup [ts − f (t)] .
t
It turns out that the Legendre transformation of a closed convex function also is closed and
convex, and that twice taken Legendre transformation of a closed convex function is this function.
The Legendre transformation (which, by the way, can be defined for convex functions on $\mathbf{R}^n$ as well) underlies many standard inequalities. Indeed, by definition of $f_*$ we have
$$ts \le f(t) + f_*(s)\qquad \forall s, t.\qquad(\mathrm{L})$$
For specific choices of f , we can derive from the general inequality (L) many useful inequalities.
E.g.,
• With $f(t) = \frac{1}{2}t^2$ (so that $f_*(s) = \frac{1}{2}s^2$), (L) becomes
$$st \le \frac{1}{2}t^2 + \frac{1}{2}s^2\quad \forall s,t\in\mathbf{R};$$
• If $1 < p < \infty$ and $f(t) = \begin{cases} \frac{t^p}{p}, & t \ge 0\\ +\infty, & t < 0 \end{cases}$, then $f_*(s) = \begin{cases} \frac{s^q}{q}, & s \ge 0\\ +\infty, & s < 0 \end{cases}$, with q given by $\frac{1}{p} + \frac{1}{q} = 1$, and (L) becomes the Young inequality
$$\forall(s, t \ge 0):\ ts \le \frac{t^p}{p} + \frac{s^q}{q},\quad 1 < p, q < \infty,\ \frac{1}{p} + \frac{1}{q} = 1.$$
Now, what happens with (L) if s, t are symmetric matrices? Of course, both sides of (L) still
make sense and are matrices, but we have no hope to say something reasonable about the relation
between these matrices (e.g., the right hand side in (L) is not necessarily symmetric). However,
Exercise 3.8 Let $f_*$ be a closed convex function with the domain $\mathrm{Dom}\,f_* \subset \mathbf{R}_+$, and let f be the Legendre transformation of $f_*$. Then for every pair of symmetric matrices X, Y of the same size with the spectrum of X belonging to Dom f and the spectrum of Y belonging to Dom $f_*$ one has
$$\lambda(f(X)) \ge \lambda\big(Y^{1/2}XY^{1/2} - f_*(Y)\big).^{24)}$$
$$p_{ij} \ge 0,\ i,j = 1,...,m;\qquad \sum_{i=1}^m p_{ij} = 1,\ j = 1,...,m;\qquad \sum_{j=1}^m p_{ij} = 1,\ i = 1,...,m.$$
where all Πi are permutation matrices (i.e., with exactly one nonzero element, equal
to 1, in every row and every column).
For proof, see Section B.2.8.C.1.
An immediate corollary of the Birkhoff Theorem is the following fact:
(C) Let $f: \mathbf{R}^m \to \mathbf{R}\cup\{+\infty\}$ be a convex symmetric function (symmetry means that the value of the function remains unchanged when we permute the coordinates in an argument), let $x \in \mathrm{Dom}\,f$ and let P be an $m\times m$ doubly stochastic matrix. Then
$$f(Px) \le f(x).$$
The role of (C) in numerous questions related to eigenvalues is based upon the following simple observation: for every symmetric $m\times m$ matrix A, the diagonal $\mathrm{Dg}(A)$ of A is the image of the vector $\lambda(A)$ of eigenvalues of A under a doubly stochastic matrix. Combining the Observation and (C), we conclude that if f is a convex symmetric function on $\mathbf{R}^m$, then for every $m\times m$ symmetric matrix A one has
f (Dg(A)) ≤ f (λ(A)).
Moreover, let $O_m$ be the set of all orthogonal $m\times m$ matrices. For every $V \in O_m$, the matrix $V^TAV$ has the same eigenvalues as A, so that for a convex symmetric f one has
$$f(\mathrm{Dg}(V^TAV)) \le f(\lambda(A)),$$
whence
$$f(\lambda(A)) \ge \max_{V\in O_m} f(\mathrm{Dg}(V^TAV)).$$
In fact the inequality here is equality, since for properly chosen V ∈ Om we have Dg(V T AV ) =
λ(A). We have arrived at the following result:
Exercise 3.9 Let g(t) : R → R ∪ {+∞} be a convex function, and let Fn be the set of all
matrices X ∈ Sn with the spectrum belonging to Dom g. Prove that the function Tr(g(X)) is
convex on Fn .
Hint: Apply (D) to the function f (x1 , ..., xn ) = g(x1 ) + ... + g(xn ).
$$\lambda^T(A)\lambda(B) \ge \mathrm{Tr}(AB).$$
Exercise 3.11 Prove that if $A \in S^m$ and $p, q \in [1, \infty]$ are such that $\frac{1}{p} + \frac{1}{q} = 1$, then
In particular, $\|\lambda(\cdot)\|_p$ is a norm on $S^m$, and the conjugate of this norm is $\|\lambda(\cdot)\|_q$, $\frac{1}{p} + \frac{1}{q} = 1$.
Exercise 3.12 Let
$$X = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1m}\\ X_{12}^T & X_{22} & \cdots & X_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ X_{1m}^T & X_{2m}^T & \cdots & X_{mm} \end{bmatrix}$$
be an $n\times n$ symmetric matrix which is partitioned into $m^2$ blocks $X_{ij}$ in a symmetric, w.r.t. the diagonal, fashion (so that the blocks $X_{jj}$ are square), and let
$$\widehat{X} = \begin{bmatrix} X_{11} & & &\\ & X_{22} & &\\ & & \ddots &\\ & & & X_{mm} \end{bmatrix}.$$
1) Let $F: S^n \to \mathbf{R}\cup\{+\infty\}$ be a convex “rotation-invariant” function: for all $Y \in S^n$ and all orthogonal matrices U one has $F(U^TYU) = F(Y)$. Prove that
$$F(\widehat{X}) \le F(X).$$
3) Let $g: \mathbf{R} \to \mathbf{R}\cup\{+\infty\}$ be a convex function on the real line which is finite on the set of eigenvalues of X, and let $\mathcal{F}^n \subset S^n$ be the set of all $n\times n$ symmetric matrices with all eigenvalues belonging to the domain of g. Assume that the mapping
$$Y \mapsto g(Y): \mathcal{F}^n \to S^n$$
is ⪰-convex:
$$g(\lambda Y' + (1-\lambda)Y'') \preceq \lambda g(Y') + (1-\lambda)g(Y'')\qquad \forall Y', Y'' \in \mathcal{F}^n,\ \lambda\in[0,1].$$
Prove that
$$(g(X))_{ii} \succeq g(X_{ii}),\quad i = 1,...,m,$$
where the partition of g(X) into the blocks $(g(X))_{ij}$ is identical to the partition of X into the blocks $X_{ij}$.
3. $[X^2]_{11} \succeq X_{11}^2$
[This inequality is nearly evident; it follows also from Exercise 3.12.3) with $g(t) = t^2$ (the ⪰-convexity of g(Y) is stated in Exercise 3.22.1))];
4. If $X \succ 0$, then $[X^{-1}]_{11} \succeq X_{11}^{-1}$
[Exercise 3.12.3) with $g(t) = t^{-1}$ for $t > 0$; the ⪰-convexity of g(Y) on $S^n_{++}$ is stated by Exercise 3.22.2)];
5. For every $X \succeq 0$, $[X^{1/2}]_{11} \preceq X_{11}^{1/2}$
[Exercise 3.12.3) with $g(t) = -\sqrt{t}$; the ⪰-convexity of g(Y) is stated by Exercise 3.22.4)].
Extension: If $X \succeq 0$, then for every $\alpha \in (0,1)$ one has $[X^\alpha]_{11} \preceq X_{11}^\alpha$.
Exercise 3.13 1) Let $A = [a_{ij}]_{i,j} \succeq 0$, let $\alpha \ge 0$, and let $B \equiv [b_{ij}]_{i,j} = A^\alpha$. Prove that
$$b_{ii}\ \begin{cases} \le a_{ii}^\alpha, & \alpha \le 1\\ \ge a_{ii}^\alpha, & \alpha \ge 1 \end{cases}$$
2) Let $A = [a_{ij}]_{i,j} \succ 0$, and let $B \equiv [b_{ij}]_{i,j} = A^{-1}$. Prove that $b_{ii} \ge a_{ii}^{-1}$.
3) Let [A] denote the northwest 2 × 2 block of a square matrix. Which of the implications
are true?
Claim: [“Majorization principle”] X[x] is exactly the solution set of the following
system of inequalities in variables $y \in \mathbf{R}^m$:
$$S_j(y) \le S_j(x),\ j = 1,...,m-1;\qquad S_m(y) = S_m(x)\qquad(+)$$
(recall that $S_j(y)$ is the sum of the largest j entries of a vector y).
Exercise 3.14 [Easy part of the claim] Let Y be the solution set of (+). Prove that Y ⊃ X[x].
Exercise 3.15 [Difficult part of the claim] Let Y be the solution set of (+). Prove that Y ⊂
X[x].
Sketch of the proof: Let y ∈ Y . We should prove that y ∈ X[x]. By symmetry, we may
assume that the vectors x and y are ordered: x1 ≥ x2 ≥ ... ≥ xm , y1 ≥ y2 ≥ ... ≥ ym .
Assume that y 6∈ X[x], and let us lead this assumption to a contradiction.
1) Since X[x] clearly is a convex compact set and $y \notin X[x]$, there exists a linear functional $c(z) = \sum_{i=1}^m c_iz_i$ which separates y and X[x]:
(Abel’s formula – a discrete version of integration by parts). Use this observation along with
“orderedness” of c(·) and the inclusion y ∈ Y to conclude that c(y) ≤ c(x), thus coming to
the desired contradiction.
Exercise 3.17 Let $x \in \mathbf{R}^m$, and let $X_+[x]$ be the set of all vectors $x'$ dominated by a vector from X[x]:
X+ [x] = {y | ∃z ∈ X[x] : y ≤ z}.
1) Prove that X+ [x] is a closed convex set.
2) Prove the following characterization of X+ [x]:
X+ [x] is exactly the set of solutions of the system of inequalities Sj (y) ≤ Sj (x),
j = 1, ..., m, in variables y.
Exercise 3.18 Derive Proposition 3.2.2 from the result of Exercise 3.17.2).
for reals $x_i, y_i$, $i = 1,...,n$; this inequality is exact in the sense that for every collection $x_1,...,x_n$ there exists a collection $y_1,...,y_n$ with $\sum_i y_i^2 = 1$ which makes (3.8.1) an equality.
where $\sigma(A) = \lambda([AA^T]^{1/2})$ is the vector of singular values of a matrix A arranged in the non-ascending order.
Prove that for every collection $X_1,...,X_n \in M^{p,q}$ there exists a collection $Y_1,...,Y_n \in M^{p,q}$ with $\sum_i Y_i^TY_i = I_q$ which makes (∗) an equality.
(ii) Prove the following “matrix version” of the Cauchy inequality: whenever $X_i, Y_i \in M^{p,q}$, one has
$$\Big|\sum_i \mathrm{Tr}(X_i^TY_i)\Big| \le \Big[\mathrm{Tr}\Big(\sum_i X_i^TX_i\Big)\Big]^{1/2}\Big\|\lambda\Big(\sum_i Y_i^TY_i\Big)\Big\|_\infty^{1/2},\qquad(**)$$
and for every collection $X_1,...,X_n \in M^{p,q}$ there exists a collection $Y_1,...,Y_n \in M^{p,q}$ with $\sum_i Y_i^TY_i = I_q$ which makes (∗∗) an equality.
Both sides of this inequality make sense when the nonnegative reals ai are replaced with positive
semidefinite n × n matrices Ai . What happens with the inequality in this case?
Consider the following four statements (where α > 1 is a real and m, n > 1):
1) $$\forall(A_i \in S^n_+):\quad \Big(\sum_{i=1}^m A_i^\alpha\Big)^{1/\alpha} \preceq \sum_{i=1}^m A_i.$$
2) $$\forall(A_i \in S^n_+):\quad \lambda_{\max}\Big(\Big(\sum_{i=1}^m A_i^\alpha\Big)^{1/\alpha}\Big) \le \lambda_{\max}\Big(\sum_{i=1}^m A_i\Big).$$
3) $$\forall(A_i \in S^n_+):\quad \mathrm{Tr}\Big(\Big(\sum_{i=1}^m A_i^\alpha\Big)^{1/\alpha}\Big) \le \mathrm{Tr}\Big(\sum_{i=1}^m A_i\Big).$$
4) $$\forall(A_i \in S^n_+):\quad \mathrm{Det}\Big(\Big(\sum_{i=1}^m A_i^\alpha\Big)^{1/\alpha}\Big) \le \mathrm{Det}\Big(\sum_{i=1}^m A_i\Big).$$
Among these 4 statements, exactly 2 are true. Identify and prove the true statements.
$$D^2F(x)[h,h] \equiv \frac{d^2}{dt^2}\bigg|_{t=0}F(x+th)$$
convex.
6) Let $G: \mathrm{Dom}\,G \to S^k$ and $F: \mathrm{Dom}\,F \to S^m$, let $G(\mathrm{Dom}\,G) \subset \mathrm{Dom}\,F$, and let $H(x) = F(G(x)): \mathrm{Dom}\,G \to S^m$.
a) Prove that if G and F are ⪰-convex and F is ⪰-monotone, then H is ⪰-convex.
b) Prove that if G and F are ⪰-concave and F is ⪰-monotone, then H is ⪰-concave.
7) Let $F_i: G \to S^m$, and assume that for every $x \in G$ there exists the limit $F(x) = \lim_{i\to\infty} F_i(x)$. Prove that if all functions from the sequence $\{F_i\}$ are (a) ⪰-convex, or (b) ⪰-concave, or (c) ⪰-monotone, or (d) ⪰-antimonotone, then so is F.
The goal of the next exercise is to establish the ⪰-convexity of several matrix-valued functions.
Exercise 3.22 Prove that the following functions are ⪰-convex:
1) $F(x) = xx^T: M^{p,q} \to S^p$;
2) $F(x) = x^{-1}: \mathrm{int}\,S^m_+ \to \mathrm{int}\,S^m_+$;
3) $F(u,v) = u^Tv^{-1}u: M^{p,q}\times\mathrm{int}\,S^p_+ \to S^q$;
Prove that the following functions are ⪰-concave and ⪰-monotone:
4) $F(x) = x^{1/2}: S^m_+ \to S^m$;
5) $F(x) = \ln x: \mathrm{int}\,S^m_+ \to S^m$;
6) $F(x) = (Ax^{-1}A^T)^{-1}: \mathrm{int}\,S^n_+ \to S^m$, provided that A is an $m\times n$ matrix of rank m.
The goal of the subsequent exercises is to answer affirmatively the simplest question of the
above series:
Let π(x) be a convex polynomial of one variable. Then its epigraph
{(t, x) ∈ R2 | t ≥ π(x)}
is SDr.
Let us fix a nonnegative integer k and consider the curve
$$p(x) = (1, x, x^2, ..., x^{2k})^T \in \mathbf{R}^{2k+1}.$$
Let Πk be the closure of the convex hull of values of the curve. How can one describe Πk ?
A convenient way to answer this question is to pass to a matrix representation of all objects involved. Namely, let us associate with a vector $\xi = (\xi_0, \xi_1, ..., \xi_{2k}) \in \mathbf{R}^{2k+1}$ the $(k+1)\times(k+1)$ symmetric matrix
$$\mathcal{M}(\xi) = \begin{bmatrix}
\xi_0 & \xi_1 & \xi_2 & \xi_3 & \cdots & \xi_k\\
\xi_1 & \xi_2 & \xi_3 & \xi_4 & \cdots & \xi_{k+1}\\
\xi_2 & \xi_3 & \xi_4 & \xi_5 & \cdots & \xi_{k+2}\\
\xi_3 & \xi_4 & \xi_5 & \xi_6 & \cdots & \xi_{k+3}\\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots\\
\xi_k & \xi_{k+1} & \xi_{k+2} & \xi_{k+3} & \cdots & \xi_{2k}
\end{bmatrix},$$
so that
$$[\mathcal{M}(\xi)]_{ij} = \xi_{i+j},\quad i,j = 0,...,k.$$
The transformation ξ 7→ M(ξ) : R2k+1 → Sk+1 is a linear embedding; the image of Πk under
this embedding is the closure of the convex hull of values of the curve
P (x) = M(p(x)).
It follows that the image $\widehat{\Pi}_k$ of $\Pi_k$ under the mapping $\mathcal{M}$ possesses the following properties:
(i) $\widehat{\Pi}_k$ belongs to the image of $\mathcal{M}$, i.e., to the subspace $\mathcal{H}_k$ of $S^{k+1}$ comprised of Hankel matrices – matrices with entries depending on the sum of indices only:
$$\mathcal{H}_k = \left\{X \in S^{k+1} \mid i+j = i'+j' \Rightarrow X_{ij} = X_{i'j'}\right\};$$
and we know something about this mapping: Example 21a (Lecture 3) says that
(H) The image of the cone $S^{k+1}_+$ under the mapping $\mathcal{M}^*$ is exactly the cone of coefficients of polynomials of degree ≤ 2k which are nonnegative on the entire real line.
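A quick numerical illustration of (H) (all names ours): the adjoint $\mathcal{M}^*$ sends $X \in S^{k+1}_+$ to the coefficient vector with entries $\sum_{i+j=\ell} X_{ij}$, which are exactly the coefficients of $p(x)^TXp(x)$, and the resulting polynomial is indeed nonnegative on the axis:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
R = rng.standard_normal((k + 1, k + 1)); X = R @ R.T   # a random X in S^{k+1}_+
# (M* X)_l = sum over i + j = l of X_ij -- coefficients of p(x)^T X p(x)
coef = [sum(X[i, l - i] for i in range(max(0, l - k), min(k, l) + 1))
        for l in range(2 * k + 1)]
xs = np.linspace(-5.0, 5.0, 2001)
vals = sum(c * xs**l for l, c in enumerate(coef))
print(vals.min() >= 0)                                 # True: nonnegative on the grid
```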
(I) Let π(x) = π0 + π1 x + π2 x2 + ... + π2k x2k be a convex polynomial of degree 2k.
Then the epigraph of π is SDr:
where
$$P(x) = \mathcal{M}(p(x)) = \begin{bmatrix} 1 & x & x^2 & \cdots & x^k\\ x & x^2 & x^3 & \cdots & x^{k+1}\\ \vdots & \vdots & \vdots & & \vdots \end{bmatrix}.$$
Note that the set $\mathcal{X}[\pi]$ makes sense for an arbitrary polynomial π, not necessarily a convex one. What is the projection of this set onto the (t,x)-plane? The answer is surprisingly nice: it is the convex hull of the epigraph of the polynomial π!
Exercise 3.25 Let $\pi(x) = \pi_0 + \pi_1x + ... + \pi_{2k}x^{2k}$ with $\pi_{2k} > 0$, and let G[π] be the convex hull of the epigraph of π (the set of all convex combinations of points from the epigraph of π).
1) Prove that G[π] is a closed convex set.
2) Prove that
G[π] = X [π].
• the dimension of x is equal to the number of arcs in Γ, and the coordinates of x are indexed
by these arcs;
• the element of L(x) in an “empty” cell ij (one for which the nodes i and j are not linked
by an arc in Γ) is 1;
• the elements of L(x) in a pair of symmetric “non-empty” cells ij, ji (those for which
the nodes i and j are linked by an arc) are equal to the coordinate of x indexed by the
corresponding arc.
As we remember, the importance of Θ(Γ) comes from the fact that Θ(Γ) is a computable upper
bound on the stability number α(Γ) of the graph. We have seen also that the Shore semidefinite
relaxation of the problem of finding the stability number of Γ leads to a “seemingly stronger”
upper bound on α(Γ), namely, the optimal value σ(Γ) in the semidefinite program
$$\min_{\lambda,\mu,\nu}\left\{\lambda\ :\ \begin{bmatrix} \lambda & -\frac{1}{2}(e+\mu)^T\\ -\frac{1}{2}(e+\mu) & A(\mu,\nu) \end{bmatrix} \succeq 0\right\}\qquad(\mathrm{Sh})$$
• the off-diagonal entries of A(µ, ν) in a pair of symmetric “non-empty” cells ij, ji are equal
to the coordinate of ν indexed by the corresponding arc.
We have seen that (L) can be obtained from (Sh) when the variables µi are set to 1, so that
σ(Γ) ≤ Θ(Γ). Thus,
α(Γ) ≤ σ(Γ) ≤ Θ(Γ). (3.8.2)
Exercise 3.26 1) Prove that if (λ, μ, ν) is a feasible solution to (Sh), then there exists a symmetric $n\times n$ matrix A such that $\lambda I_n - A \succeq 0$ and at the same time the diagonal entries of A and the off-diagonal entries in the “empty cells” are ≥ 1. Derive from this observation that the optimal value in (Sh) is not less than the optimal value $\Theta'(\Gamma)$ in the following semidefinite program:
$$\min_{\lambda, X}\left\{\lambda\ :\ \lambda I_n - X \succeq 0,\ X_{ij} \ge 1 \text{ whenever } i, j \text{ are not adjacent in } \Gamma\right\}\qquad(\mathrm{Sc}).$$
The upper bound $\Theta'(\Gamma)$ on the stability number of Γ is called the Schrijver capacity of graph Γ. Note that we have
$$\alpha(\Gamma) \le \Theta'(\Gamma) \le \sigma(\Gamma) \le \Theta(\Gamma).$$
A natural question is which inequalities in this chain may happen to be strict. In order to answer
it, we have computed the quantities in question for about 2,000 random graphs with the number of
nodes varying from 8 to 20. In our experiments, the stability number was computed – by brute force
– for graphs with ≤ 12 nodes; for all these graphs, the integral part of Θ(Γ) was equal to α(Γ).
Furthermore, Θ(Γ) was non-integer in 156 of our 2,000 experiments, and in 27 of these 156 cases
the Schrijver capacity number $\Theta'(\Gamma)$ was strictly less than Θ(Γ). The quantities $\Theta'(\cdot)$, σ(·), Θ(·)
for 13 of these 27 cases are listed in the table below:
Graph # | # of nodes | α | Θ′ | σ | Θ
1 20 ? 4.373 4.378 4.378
2 20 ? 5.062 5.068 5.068
3 20 ? 4.383 4.389 4.389
4 20 ? 4.216 4.224 4.224
5 13 ? 4.105 4.114 4.114
6 20 ? 5.302 5.312 5.312
7 20 ? 6.105 6.115 6.115
8 20 ? 5.265 5.280 5.280
9 9 3 3.064 3.094 3.094
10 12 4 4.197 4.236 4.236
11 8 3 3.236 3.302 3.302
12 12 4 4.236 4.338 4.338
13 10 3 3.236 3.338 3.338
Exercise 3.27 Compute the stability numbers of the graphs # 8 and # 13.
The chromatic number ξ(Γ) of a graph Γ is the minimal number of colours such that one can
colour the nodes of the graph in such a way that no two adjacent (i.e., linked by an arc) nodes
get the same colour25) . The complement Γ̄ of a graph Γ is the graph with the same set of nodes,
and two distinct nodes in Γ̄ are linked by an arc if and only if they are not linked by an arc in
Γ.
25)
E.g., when colouring a geographic map, it is convenient not to use the same colour for a pair of countries
with a common border. It was observed that to meet this requirement for actual maps, 4 colours are sufficient.
The famous “4-colour” Conjecture claims that this is so for every geographic map. Mathematically, you can
represent a map by a graph, where the nodes represent the countries, and two nodes are linked by an arc if and
only if the corresponding countries have common border. A characteristic feature of such a graph is that it is
planar – you may draw it on 2D plane in such a way that the arcs will not cross each other, meeting only at the
nodes. Thus, mathematical form of the 4-colour Conjecture is that the chromatic number of any planar graph is
at most 4. This is indeed true, but it took about 100 years to prove the conjecture!
The standard semidefinite relaxations of the problems are, respectively, the problems
$$\begin{array}{ll}
(a) & \max\limits_X\left\{\mathrm{Tr}(AX): X \succeq 0,\ X_{ii} = 1,\ i = 1,...,n\right\},\\
(b) & \max\limits_X\left\{\mathrm{Tr}(AX): X \succeq 0,\ X_{ii} \le 1,\ i = 1,...,n\right\};
\end{array}\qquad(3.8.4)$$
the optimal value of a relaxation is an upper bound for the optimal value of the respective
original problem.
Exercise 3.30 Let A ∈ Sn . Prove that
$$\max_{x: x_i = \pm1,\ i=1,...,n} x^TAx \ge \mathrm{Tr}(A).$$
Develop an efficient algorithm which, given A, generates a point x with coordinates ±1 such that
xT Ax ≥ Tr(A).
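For the algorithmic part, here is one natural greedy construction (a sketch of the method of conditional expectations; this is our illustration, not necessarily the intended solution): fix the coordinates one by one, choosing each sign so that the accumulated cross terms stay nonnegative, which gives $x^TAx = \mathrm{Tr}(A) + 2\sum_i x_i\big(\sum_{j<i} A_{ij}x_j\big) \ge \mathrm{Tr}(A)$.

```python
import numpy as np

def signs_above_trace(A):
    """Greedy +-1 vector with x^T A x >= Tr(A) for symmetric A:
    each new sign keeps the accumulated off-diagonal contribution nonnegative."""
    n = A.shape[0]
    x = np.zeros(n)
    for i in range(n):
        s = A[i, :i] @ x[:i]              # cross terms with already fixed coordinates
        x[i] = 1.0 if s >= 0 else -1.0    # makes the increment 2 * x_i * s nonnegative
    return x

A = np.random.default_rng(3).standard_normal((6, 6)); A = (A + A.T) / 2
x = signs_above_trace(A)
print(x @ A @ x >= np.trace(A) - 1e-9)    # True
```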
Exercise 3.31 Prove that if the diagonal entries of A are nonnegative, then the optimal values
in (3.8.4.a) and (3.8.4.b) are equal to each other. Thus, in the case in question, the relaxations
“do not understand” whether we are maximizing over the vertices of the cube or over the entire
cube.
Exercise 3.32 Prove that the problems dual to (3.8.4.a, b) are, respectively,
$$\begin{array}{ll}
(a) & \min\limits_\Lambda\left\{\mathrm{Tr}(\Lambda): \Lambda \succeq A,\ \Lambda \text{ is diagonal}\right\},\\
(b) & \min\limits_\Lambda\left\{\mathrm{Tr}(\Lambda): \Lambda \succeq A,\ \Lambda \succeq 0,\ \Lambda \text{ is diagonal}\right\};
\end{array}\qquad(3.8.5)$$
the optimal values in these problems are equal to those of the respective problems in (3.8.4) and
are therefore upper bounds on the optimal values of the respective combinatorial problems from
(3.8.3).
The latter claim is quite transparent, since the problems (3.8.5) can be obtained as follows:
• In order to bound from above the optimal value of a quadratic form xT Ax on a given set
S, we look at those quadratic forms xT Λx which can be easily maximized over S. For the case
of S = Vrt(Cn ) these are quadratic forms with diagonal matrices Λ, and for the case of S = Cn
these are quadratic forms with diagonal and positive semidefinite matrices Λ; in both cases, the
respective maxima are merely Tr(Λ).
• Having specified a family F of quadratic forms xT Λx “easily optimizable over S”, we then
look at those forms from F which majorate everywhere the original quadratic form xT Ax, and
take among these forms the one with the minimal $\max_{x\in S} x^T\Lambda x$, thus coming to the problem
$$\min_\Lambda\left\{\max_{x\in S} x^T\Lambda x\ :\ \Lambda \succeq A,\ \Lambda \in \mathcal{F}\right\}.\qquad(!)$$
It is evident that the optimal value in this problem is an upper bound on $\max_{x\in S} x^TAx$. It is also
immediately seen that in the case of S = Vrt(Cn ) the problem (!), with F specified as the set
D of all diagonal matrices, is equivalent to (3.8.5.a), while in the case of S = Cn (!), with F
specified as the set D+ of positive semidefinite diagonal matrices, is nothing but (3.8.5.b).
Given the direct and quite transparent road leading to (3.8.5.a, b), we can try to move a
little bit further along this road. To this end observe that there are trivial upper bounds on the
maximum of an arbitrary quadratic form xT Λx over Vrt(Cn ) and Cn , specifically:
$$\max_{x\in\mathrm{Vrt}(C_n)} x^T\Lambda x \le \mathrm{Tr}(\Lambda) + \sum_{i\ne j}|\Lambda_{ij}|,\qquad \max_{x\in C_n} x^T\Lambda x \le \sum_{i,j}|\Lambda_{ij}|.$$
For the above families D, D+ of matrices Λ for which xT Λx is “easily optimizable” over Vrt(Cn ),
respectively, Cn , the above bounds are equal to the precise values of the respective maxima. Now
let us update (!) as follows: we eliminate the restriction Λ ∈ F, replacing simultaneously the
objective $\max_{x\in S} x^T\Lambda x$ with its upper bound, thus coming to the pair of problems
$$\begin{array}{lll}
(a) & \min\limits_\Lambda\left\{\mathrm{Tr}(\Lambda) + \sum\limits_{i\ne j}|\Lambda_{ij}|\ :\ \Lambda \succeq A\right\} & [S = \mathrm{Vrt}(C_n)]\\
(b) & \min\limits_\Lambda\left\{\sum\limits_{i,j}|\Lambda_{ij}|\ :\ \Lambda \succeq A\right\} & [S = C_n]
\end{array}\qquad(3.8.6)$$
From the origin of the problems it is clear that they still yield upper bounds on the optimal
values of the respective problems (3.8.3.a, b), and that these bounds are at least as good as the
bounds yielded by the standard relaxations (3.8.4.a, b):
Indeed, consider the problem (3.8.6.a). Whenever Λ is a feasible solution of this problem,
the quadratic form $x^T\Lambda x$ majorates everywhere the form $x^TAx$, so that $\max_{x\in\mathrm{Vrt}(C_n)} x^TAx \le$
Note that problems (3.8.6) are equivalent to semidefinite programs and thus are of the same sta-
tus of “computational tractability” as the standard SDP relaxations (3.8.5) of the combinatorial
problems in question. At the same time, our new bounding problems are more difficult than the
standard SDP relaxations. Can we justify this by getting an improvement in the quality of the
bounds?
Exercise 3.33 Find out whether the problems (3.8.6.a, b) yield better bounds than the respective
problems (3.8.5.a, b), i.e., whether the inequalities (*), (**) in (3.8.7) can be strict.
Exercise 3.34 Let D be a given subset of Rn+ . Consider the following pair of optimization
problems:
$$\max_x\left\{x^TAx\ :\ (x_1^2, x_2^2, ..., x_n^2)^T \in D\right\}\qquad(P)$$
$$\max_X\left\{\mathrm{Tr}(AX)\ :\ X \succeq 0,\ \mathrm{Dg}(X) \in D\right\}\qquad(R)$$
(Dg(X) is the diagonal of a square matrix X). Note that when D = {(1, ..., 1)T }, (P ) is the
problem of maximizing a quadratic form over the vertices of Cn , while (R) is the standard
semidefinite relaxation of (P ); when D = {x ∈ Rn | 0 ≤ xi ≤ 1 ∀i}, (P ) is the problem of
maximizing a quadratic form over the cube Cn , and (R) is the standard semidefinite relaxation
of the latter problem.
1) Prove that if D is semidefinite-representable, then (R) can be reformulated as a semidefi-
nite program.
2) Prove that (R) is a relaxation of (P ), i.e., that
Opt(P ) ≤ Opt(R).
$$\max\{x^TAx \mid x_i = \pm1,\ i = 1,...,m\} = \max\Big\{\frac{2}{\pi}\sum_{i,j=1}^m a_{ij}\,\mathrm{asin}(X_{ij})\ \Big|\ X \succeq 0,\ X_{ii} = 1,\ i = 1,...,m\Big\}.$$
A natural mathematical model of a swing is the linear time invariant dynamic system
$$y''(t) = -\omega^2y(t) - 2\mu y'(t)\qquad(\mathrm{S})$$
with positive $\omega^2$ and $0 \le \mu < \omega$ (the term $2\mu y'(t)$ represents friction). A general solution to this equation is
$$y(t) = a\cos(\omega't + \phi_0)\exp\{-\mu t\},\qquad \omega' = \sqrt{\omega^2 - \mu^2},$$
with free parameters a and φ0 , i.e., this is a decaying oscillation. Note that the equilibrium
y(t) ≡ 0
is stable – every solution to (S) converges to 0, along with its derivative, exponentially fast.
After stability is observed, an immediate question arises: how is it possible to swing on
a swing? Everybody knows from practice that it is possible. On the other hand, since the
equilibrium is stable, it looks as if it was impossible to swing, without somebody’s assistance,
for a long time. The reason which makes swinging possible is highly nontrivial – parametric
resonance. A swinging child does not sit on the swing in a once forever fixed position; what he
does is shown below:
A swinging child
As a result, the “effective length” of the swing l – the distance from the point where the rope
is fixed to the center of gravity of the system – is varying with time: l = l(t). Basic mechanics
says that ω 2 = g/l, g being the gravity acceleration. Thus, the actual swing is a time-varying
linear dynamic system:
$$y''(t) = -\omega^2(t)\,y(t) - 2\mu y'(t)\qquad(\mathrm{S}')$$
and it turns out that for properly varied ω(t) the equilibrium y(t) ≡ 0 is not stable. A swinging child is just varying l(t) in a way which results in an unstable dynamic system (S′), and this
$$y''(t) = -\frac{g}{l + h\sin(2\omega t)}\,y(t) - 2\mu y'(t),\qquad y(0) = 0,\ y'(0) = 0.1$$
$$l = 1\ [\mathrm{m}],\quad g = 10\ \big[\tfrac{\mathrm{m}}{\mathrm{sec}^2}\big],\quad \mu = 0.15\ \big[\tfrac{1}{\mathrm{sec}}\big],\quad \omega = \sqrt{g/l}$$
Graph of y(t)
left: h = 0.125: this child is too small; he should grow up...
right: h = 0.25: this child can already swing...
Exercise 3.36 Assume that you are given parameters l (“nominal length of the swing rope”),
h > 0 and µ > 0, and it is known that a swinging child can vary the “effective length” of the rope
within the bounds l ± h, i.e., his/her movement is governed by the uncertain linear time-varying
system
$$y''(t) = -a(t)y(t) - 2\mu y'(t),\qquad a(t) \in \Big[\frac{g}{l+h},\ \frac{g}{l-h}\Big].$$
Try to identify the domain in the 3D-space of parameters l, µ, h where the system is stable, as
well as the domain where its stability can be certified by a quadratic Lyapunov function. What
is “the difference” between these two domains?
3.8.5 Around Nesterov’s π/2 Theorem
Exercise 3.37 Prove the following statement:
Exercise 3.37 Prove the following statement:
Assume that
2. There exists a combination of the matrices Bi with nonnegative coefficients which is positive
definite;
3. $A \succeq 0$.
Then Opt ≥ 0, (SDP) and (SDD) are solvable with equal optimal values, and
$$\mathrm{Opt} \le \mathrm{SDP} \le \frac{\pi}{2}\,\mathrm{Opt}.\qquad(3.8.8)$$
Hint: Observe that since Bi are commuting symmetric matrices, they share a common or-
thogonal eigenbasis, so that w.l.o.g. we can assume that all Bi ’s are diagonal. In this latter
case, use Theorem 3.4.3.
$$A = \begin{bmatrix}
P_1\Lambda_1P_1^T & & & I_n\\
& \ddots & & \vdots\\
& & P_m\Lambda_mP_m^T & I_n\\
I_n & \cdots & I_n & \sum\limits_{i=1}^m [P_i^T]^{-1}\Lambda_i^{-1}P_i^{-1}
\end{bmatrix}$$
and apply twice the Schur Complement Lemma: first time – to prove that the matrix is positive semidefinite, and the second time – to get from the latter fact the desired inequality.
Exercise 3.40 Assume you are given m full-dimensional ellipsoids
$$W_i = \{x \mid x^TB_ix \le 1\}\quad [B_i \succ 0],\ i = 1, ..., m,$$
centered at the origin in Rⁿ.
1) Prove that for every collection Λ of positive definite n × n matrices Λ_i such that
$$\sum_i \lambda_{\max}(\Lambda_i) \le 1,$$
the ellipsoid
$$E_\Lambda = \Big\{x \;\Big|\; x^T\Big[\sum_{i=1}^m B_i^{-1/2}\Lambda_i^{-1}B_i^{-1/2}\Big]^{-1}x \le 1\Big\}$$
In this form, the approximation scheme in question was proposed by Schweppe (1975).
Exercise 3.41 Let A_i be nonsingular n × n matrices, i = 1, ..., m, and let W_i = {x = A_iu |
uᵀu ≤ 1} be the associated ellipsoids in Rⁿ. Let Δ_m = {λ ∈ Rᵐ₊ | Σ_i λ_i = 1}. Prove that
1) Whenever λ ∈ Δ_m and A ∈ Mⁿ,ⁿ is such that
$$AA^T \preceq F(\lambda) \equiv \sum_{i=1}^m \lambda_i^{-1}A_iA_i^T,$$
In view of this observation, we can try to approximate W from inside and from outside by the
ellipsoids W₋ ≡ A(𝒲₋) and W₊ = A(𝒲₊), where 𝒲₋ and 𝒲₊ are, respectively, the largest
and the smallest volume nm-dimensional ellipsoids contained in/containing B. It follows that
$$W \supset W_- \equiv \Big\{x = \sum_{i=1}^m A_iu[i] \;\Big|\; \sum_{i=1}^m u^T[i]u[i] \le 1\Big\},\qquad
W \subset W_+ \equiv \Big\{x = \sum_{i=1}^m A_iu[i] \;\Big|\; \sum_{i=1}^m u^T[i]u[i] \le m\Big\} = \sqrt m\, W_-.$$
W− = {x = Bu | u ∈ Rn , uT u ≤ 1}
Derive from this observation that the "level of conservativeness" of the inner ellipsoidal
approximation of W given by Proposition 3.7.6 is at most √m: if W∗ is this inner ellipsoidal
approximation and W∗∗ is the largest volume ellipsoid contained in W, then
$$\frac{\mathrm{Vol}^{1/n}(W_{**})}{\mathrm{Vol}^{1/n}(W_*)} \le \frac{\mathrm{Vol}^{1/n}(W)}{\mathrm{Vol}^{1/n}(W_*)} \le \sqrt m.$$
Invariant ellipsoids
Exercise 3.43 Consider a discrete time controlled dynamic system
x(t + 1) = Ax(t) + bu(t), t = 0, 1, 2, ...
x(0) = 0,
where x(t) ∈ Rn is the state vector and u(t) ∈ [−1, 1] is the control at time t. An ellipsoid
centered at the origin
$$W = \{x \mid x^TZx \le 1\}\qquad [Z \succ 0]$$
is called invariant, if
x ∈ W ⇒ Ax ± b ∈ W.
Prove that
1) If W is an invariant ellipsoid and x(t) ∈ W for some t, then x(t′) ∈ W for all t′ ≥ t.
2) Assume that the vectors b, Ab, A2 b, ..., An−1 b are linearly independent. Prove that an
invariant ellipsoid exists if and only if A is stable (the absolute values of all eigenvalues of A
are < 1).
3) Assuming that A is stable, prove that an ellipsoid {x | xᵀZx ≤ 1} [Z ≻ 0] is invariant if
and only if there exists λ ≥ 0 such that
$$\begin{pmatrix} 1 - b^TZb - \lambda & -b^TZA\\ -A^TZb & \lambda Z - A^TZA \end{pmatrix} \succeq 0.$$
How could one use this fact to approximate numerically the smallest volume invariant ellipsoid?
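One possible numerical recipe is as follows (a sketch of ours, not from the text; it assumes the cvxpy package is available): for a fixed λ the matrix inequality above is linear in Z, and the volume of {x | xᵀZx ≤ 1} is proportional to Det(Z)^(−1/2), so one can maximize log Det Z under the LMI for each λ on a grid and keep the best Z found.

```python
import numpy as np
import cvxpy as cp

def smallest_invariant_ellipsoid(A, b, lam_grid=np.linspace(0.0, 0.99, 40)):
    n = A.shape[0]
    b = b.reshape(n, 1)
    best_val, best_Z = -np.inf, None
    for lam in lam_grid:
        Z = cp.Variable((n, n), symmetric=True)
        # the LMI from 3), linear in Z once lam is fixed
        M = cp.bmat([[1 - b.T @ Z @ b - lam, -b.T @ Z @ A],
                     [-A.T @ Z @ b,          lam * Z - A.T @ Z @ A]])
        prob = cp.Problem(cp.Maximize(cp.log_det(Z)),
                          [(M + M.T) / 2 >> 0])  # symmetrized for the solver
        prob.solve(solver=cp.SCS)
        if prob.status in ("optimal", "optimal_inaccurate") and prob.value > best_val:
            best_val, best_Z = prob.value, Z.value
    return best_Z

A = np.array([[0.6, 0.3], [-0.2, 0.5]])   # a stable 2x2 example (ours)
b = np.array([1.0, 0.5])
print(smallest_invariant_ellipsoid(A, b))
```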
in such a way that if u(·) is a control satisfying (3.8.12) and x(0) is an initial state satisfying
(3.8.13), then for every t ≥ 0 it holds
x(t) ∈ E(t).
(L) For every t ≥ 0 and every point xt ∈ E(t) (see (3.8.14)), every trajectory x(τ ),
τ ≥ t, of the system
$$\frac{d}{d\tau}x(\tau) = A(\tau)x(\tau) + B(\tau)u(\tau) + v(\tau),\qquad x(t) = x_t$$
with ku(·)k2 ≤ 1 satisfies x(τ ) ∈ E(τ ) for all τ ≥ t.
Note that (L) is a sufficient (but in general not necessary) condition for the system of ellipsoids
E(t), t ≥ 0, “to cover” all trajectories of (3.8.11) – (3.8.12). Indeed, when formulating (L), we
act as if we were sure that the states x(t) of our system run through the entire ellipsoid E(t),
which is not necessarily the case. The advantage of (L) is that this condition can be converted
into an “infinitesimal” form:
Exercise 3.44 Prove that if G_t ≻ 0 is continuously differentiable and satisfies (L), then
$$\forall\, t \ge 0,\ x, u:\ x^TG_tx = 1,\ u^Tu \le 1:\qquad 2u^TB^T(t)G_tx + x^T\Big[\frac{d}{dt}G_t + A^T(t)G_t + G_tA(t)\Big]x \le 0. \tag{3.8.16}$$
Vice versa, if G_t is a continuously differentiable function taking values in the set of positive
definite symmetric matrices and satisfying (3.8.16) and the initial condition G₀ = G⁰, then the
associated system of ellipsoids {E(t)} satisfies (L).
The result of Exercise 3.44 provides us with a kind of description of the families of ellipsoids
{E(t)} we are interested in. Now let us take care of the volumes of these ellipsoids. The latter
can be done via a “greedy” (locally optimal) policy: given E(t), let us try to minimize, under
restriction (3.8.16), the derivative of the volume of the ellipsoid at time t. Note that this locally
optimal policy does not necessarily yield the smallest volume ellipsoids satisfying (L) (achieving
"instant reward" is not always the best way to happiness); nevertheless, this policy makes sense.
We have 2 ln vol(E_t) = −ln Det(G_t), whence
$$2\frac{d}{dt}\ln\mathrm{vol}(E(t)) = -\mathrm{Tr}\Big(G_t^{-1}\frac{d}{dt}G_t\Big);$$
thus, our greedy policy requires to choose H_t ≡ (d/dt)G_t as a solution to the optimization problem
$$\max_{H = H^T}\Big\{\mathrm{Tr}(G_t^{-1}H) :\ 2u^TB^T(t)G_tx + x^T\big[H + A^T(t)G_t + G_tA(t)\big]x \le 0\quad \forall\, x, u:\ x^TG_tx = 1,\ u^Tu \le 1\Big\}.$$
Exercise 3.45 Prove that the outlined greedy policy results in the solution G_t to the differential
equation
$$\frac{d}{dt}G_t = -A^T(t)G_t - G_tA(t) - \sqrt{\frac{n}{\mathrm{Tr}(G_tB(t)B^T(t))}}\,G_tB(t)B^T(t)G_t - \sqrt{\frac{\mathrm{Tr}(G_tB(t)B^T(t))}{n}}\,G_t,\quad t \ge 0;\qquad G_0 = G^0.$$
Prove that the solution to this equation is symmetric and positive definite for all t > 0, provided
that G⁰ = [G⁰]ᵀ ≻ 0.
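For a quick experiment one can integrate this matrix ODE by, e.g., the forward Euler scheme. The sketch below is ours (written for constant A(t) ≡ A, B(t) ≡ B, which are our illustrative assumptions); it also illustrates the claim that G_t stays symmetric positive definite.

```python
import numpy as np

def greedy_ode(A, B, G0, dt=1e-3, T=1.0):
    # dG/dt = -A^T G - G A - sqrt(n/Tr(G B B^T)) G B B^T G - sqrt(Tr(G B B^T)/n) G
    n = A.shape[0]
    G = G0.copy()
    for _ in range(int(T / dt)):
        GBBt = G @ B @ B.T
        tr = np.trace(GBBt)
        dG = -A.T @ G - G @ A - np.sqrt(n / tr) * GBBt @ G - np.sqrt(tr / n) * G
        G = G + dt * dG
        G = 0.5 * (G + G.T)          # remove round-off asymmetry
    return G

A = np.array([[0.0, 1.0], [-1.0, -0.3]])
B = np.array([[0.0], [1.0]])
G = greedy_ode(A, B, np.eye(2))
print(np.linalg.eigvalsh(G))         # positive, as the exercise asserts
```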
Exercise 3.46 Modify the previous reasoning to demonstrate that the “locally optimal” policy
for building inner ellipsoidal approximation of the set
E(t) = {x | (x − xt )T Wt (x − xt ) ≤ 1},
The model of computations in CT: an idealized computer capable of storing only integers
(i.e., finite binary words), whose operations are bitwise: we are allowed to multiply, add and
compare integers. To add and to compare two ℓ-bit integers takes O(ℓ) "bitwise" elementary
operations, and to multiply a pair of ℓ-bit integers costs O(ℓ²) elementary operations (the cost
of multiplication can be reduced to O(ℓ ln(ℓ)), but this does not matter).
In CT, a solution to an instance (p) of a generic problem P is a finite binary word y such
that the pair (Data(p),y) satisfies certain “verifiable condition” A(·, ·). Namely, it is assumed
that there exists a code M for the above “Integer Arithmetic computer” such that executing the
code on every input pair x, y of finite binary words, the computer after finitely many elementary
operations terminates and outputs either “yes”, if A(x, y) is satisfied, or “no”, if A(x, y) is not
satisfied. Thus, e.g., the problem
Given n positive integers a₁, ..., aₙ, find a vector x = (x₁, ..., xₙ)ᵀ with coordinates
±1 such that $\sum_i x_ia_i = 0$, or detect that no such vector exists
is a generic combinatorial problem. Indeed, the data of the instance of the problem, same as
candidate solutions to the instance, can be naturally encoded by finite sequences of integers. In
turn, finite sequences of integers can be easily encoded by finite binary words. And, of course,
for this problem you can easily point out a code for the “Integer Arithmetic computer” which,
given on input two binary words x = Data(p), y encoding the data vector of an instance (p)
of the problem and a candidate solution, respectively, verifies in finitely many “bit” operations
whether y represents or does not represent a solution to (p).
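As an illustration, here is a toy rendering (ours, not from the text) of such a verification code for this problem: checking a candidate solution is one pass over the data, clearly of polynomial "bit" cost.

```python
def verify_stones(a, y):
    """A(Data(p), y): is y a +/-1 vector with sum_i y_i a_i = 0?"""
    if len(a) != len(y) or any(s not in (-1, 1) for s in y):
        return False
    return sum(s * w for s, w in zip(y, a)) == 0

# 3 + 2 = 1 + 1 + 2 + 1, so this candidate is a solution:
print(verify_stones([3, 1, 1, 2, 2, 1], [1, -1, -1, 1, -1, -1]))  # True
```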
A solution algorithm for a generic problem P is a code S for the Integer Arithmetic computer
which, given on input the data vector Data(p) of an instance (p) ∈ P, after finitely many
operations either returns a solution to the instance, or a (correct!) claim that no solution exists.
The running time T_S(p) of the solution algorithm on instance (p) is exactly the number of
elementary (i.e., bit) operations performed in the course of executing S on Data(p).
A solvability test for a generic problem P is defined similarly to a solution algorithm, but
now all we want of the code is to say (correctly!) whether the input instance is or is not solvable,
just “yes” or “no”, without constructing a solution in the case of the “yes” answer.
The complexity of a solution algorithm/solvability test S is defined as
$$\mathrm{Compl}(\ell) = \sup\big\{T_S(p) : (p) \in \mathcal P,\ \mathrm{length}(\mathrm{Data}(p)) \le \ell\big\},$$
where length(x) is the bit length (i.e., number of bits) of a finite binary word x. The algorithm/test is called polynomial time, if its complexity is bounded from above by a polynomial
of ℓ.
Classes P and NP. A generic problem P is said to belong to the class NP, if the corresponding
condition A, see (4.1.1), possesses the following two properties:
Thus, the first property of an NP problem states that given the data Data(p) of a
problem instance p and a candidate solution y, it is easy to check whether y is an
actual solution of (p) – to verify this fact, it suffices to compute A(Data(p), y), and
this computation requires polynomial in length(Data(p)) + length(y) time.
The second property of an NP problem makes its instances even easier:
A generic problem P is said to belong to the class P, if it belongs to the class NP and is
polynomially solvable.
The question whether P=NP – whether NP-complete problems are or are not polynomially
solvable – is qualified as "the most important open problem in Theoretical Computer Science"
and has remained open for about 30 years. One of the most basic results of Theoretical Computer
Science is that NP-complete problems do exist (the Stones problem is an example). Many of
these problems are of huge practical importance, and have therefore been the subject, over decades,
of intensive study by thousands of excellent researchers. However, no polynomial time algorithm for
any of these problems has been found. Given the huge total effort invested in this research, we should
conclude that it is "highly improbable" that NP-complete problems are polynomially solvable.
Thus, at the "practical level", the fact that a certain problem is NP-complete is sufficient to qualify
the problem as "computationally intractable", at least at our present level of knowledge.
1)
Here and in what follows, we denote by χ positive “characteristic constants” associated with the predi-
cates/problems in question. The particular values of these constants are of no importance, the only thing that
matters is their existence. Note that in different places of even the same equation χ may have different values.
where
• n(p) is the design dimension of program (p);
Families of optimization programs. We want to speak about methods for solving opti-
mization programs (p) “of a given structure” (for example, Linear Programming ones). All
programs (p) “of a given structure”, like in the combinatorial case, form certain family P, and
we assume that every particular program in this family – every instance (p) of P – is specified by
its particular data Data(p). However, now the data is a finite-dimensional real vector; one may
think of the entries of this data vector as of particular values of coefficients of "generic"
(specific for P) analytic expressions for p₀(x) and X(p). The dimension of the vector Data(p)
will be called the size of the instance:
$$\mathrm{Size}(p) = \dim\mathrm{Data}(p).$$
The model of computations. This is what is known as the "Real Arithmetic Model of Computations", as opposed to the "Integer Arithmetic Model" in the CT. We assume that the computations
are carried out by an idealized version of the usual computer which is capable of storing countably many reals and can perform with them the standard exact real arithmetic operations – the
four basic arithmetic operations, evaluation of elementary functions, like cos and exp, and
comparisons.
be the optimal value of the instance (i.e., the infimum of the values of the objective on the
feasible set, if the instance is feasible, and +∞ otherwise). A point x ∈ R^{n(p)} is called an
ε-solution to (p), if
$$\mathrm{Infeas}_{\mathcal P}(x, p) \le \epsilon \quad\text{and}\quad p_0(x) - \mathrm{Opt}(p) \le \epsilon,$$
i.e., if x is both "ε-feasible" and "ε-optimal" for the instance.
It is convenient to define the number of accuracy digits in an ε-solution to (p) as the quantity
$$\mathrm{Digits}(p, \epsilon) = \ln\Big(\frac{\mathrm{Size}(p) + \|\mathrm{Data}(p)\|_1 + \epsilon^2}{\epsilon}\Big).$$
Polynomial computability. Let P be a generic convex program, and let InfeasP (x, p) be
the corresponding measure of infeasibility of candidate solutions. We say that our family is
polynomially computable, if there exist two codes Cobj , Ccons for the Real Arithmetic computer
such that
1. For every instance (p) ∈ P, the computer, when given on input the data vector of the
instance (p) and a point x ∈ Rn(p) and executing the code Cobj , outputs the value p0 (x) and a
subgradient e(x) ∈ ∂p0 (x) of the objective p0 of the instance at the point x, and the running
time (i.e., total number of operations) of this computation Tobj (x, p) is bounded from above by
a polynomial of the size of the instance:
$$\forall\,(p)\in\mathcal P,\ x\in\mathbf R^{n(p)}:\quad T_{\mathrm{obj}}(x, p) \le \chi\,\mathrm{Size}^\chi(p)\qquad [\mathrm{Size}(p) = \dim\mathrm{Data}(p)]. \tag{4.1.3}$$
(recall that in our notation, χ is a common name of characteristic constants associated with P).
2. For every instance (p) ∈ P, the computer, when given on input the data vector of the
instance (p), a point x ∈ R^{n(p)} and an ε > 0 and executing the code C_cons, reports on output
whether Infeas_P(x, p) ≤ ε, and if it is not the case, outputs a linear form a which separates the
point x from all those points y where Infeas_P(y, p) ≤ ε:
$$\forall\, y,\ \mathrm{Infeas}_{\mathcal P}(y, p) \le \epsilon:\quad a^Tx > a^Ty, \tag{4.1.4}$$
the running time T_cons(x, ε, p) of the computation being bounded by a polynomial of the size of
the instance and of the "number of accuracy digits":
$$\forall\,(p)\in\mathcal P,\ x\in\mathbf R^{n(p)},\ \epsilon > 0:\quad T_{\mathrm{cons}}(x, \epsilon, p) \le \chi\,(\mathrm{Size}(p) + \mathrm{Digits}(p, \epsilon))^\chi. \tag{4.1.5}$$
Note that the vector a in (4.1.4) is not supposed to be nonzero; when it is 0, (4.1.4) simply says
that there are no points y with Infeas_P(y, p) ≤ ε.
Polynomial growth. We say that a generic convex program P is with polynomial growth, if
the objectives and the infeasibility measures, as functions of x, grow polynomially with ‖x‖₁,
the degree of the polynomial being a power of Size(p):
$$\forall\,(p)\in\mathcal P,\ x\in\mathbf R^{n(p)}:\qquad |p_0(x)| + \mathrm{Infeas}_{\mathcal P}(x, p) \le \big(\chi\,[\mathrm{Size}(p) + \|x\|_1 + \|\mathrm{Data}(p)\|_1]\big)^{\chi\,\mathrm{Size}^\chi(p)}. \tag{4.1.6}$$
Polynomial boundedness of feasible sets. We say that a generic convex program P has
polynomially bounded feasible sets, if the feasible set X(p) of every instance (p) ∈ P is bounded,
and is contained in the Euclidean ball, centered at the origin, of “not too large” radius:
$$\forall(p)\in\mathcal P:\qquad X(p) \subset \Big\{x\in\mathbf R^{n(p)} : \|x\|_2 \le \big(\chi\,[\mathrm{Size}(p) + \|\mathrm{Data}(p)\|_1]\big)^{\chi\,\mathrm{Size}^\chi(p)}\Big\}. \tag{4.1.7}$$
Example. Consider generic optimization problems LP b , CQP b , SDP b with instances in the
conic form
$$\min_{x\in\mathbf R^{n(p)}}\Big\{p_0(x) \equiv c_{(p)}^Tx : x \in X(p) \equiv \{x : A_{(p)}x - b_{(p)} \in K_{(p)},\ \|x\|_2 \le R\}\Big\}; \tag{4.1.8}$$
where K is a cone belonging to a characteristic for the generic program family K of cones,
specifically,
• the family of nonnegative orthants for LP b ,
• the family of direct products of Lorentz cones for CQP b ,
• the family of semidefinite cones for SDP b .
The data of an instance (p) of the type (4.1.8) is the collection
with naturally defined size(s) of a cone K from the family K associated with the generic program
under consideration: the sizes of Rn+ and of Sn+ equal n, and the size of a direct product of Lorentz
cones is the sequence of the dimensions of the factors.
The generic conic programs in question are equipped with the infeasibility measure
$$\mathrm{Infeas}(x, p) = \min\Big\{t \ge 0 : t\,e[K_{(p)}] + A_{(p)}x - b_{(p)} \in K_{(p)}\Big\}, \tag{4.1.9}$$
Consider a convex optimization program
$$\min_x\big\{f(x) : x \in X\big\}, \tag{4.1.10}$$
where
• X is a convex compact set in Rⁿ with a nonempty interior;
• f is a continuous convex function on X.
Assume that our “environment” when solving (4.1.10) is as follows:
1. We have access to a Separation Oracle Sep(X) for X – a routine which, given on input a
point x ∈ Rⁿ, reports on output whether or not x ∈ int X, and in the case of x ∉ int X,
returns a separator – a nonzero vector e such that
$$e^Tx \ge \max_{y\in X} e^Ty \tag{4.1.11}$$
(the existence of such a separator is guaranteed by the Separation Theorem for convex
sets);
2. We have access to a First Order oracle which, given on input a point x ∈ int X, returns
the value f(x) and a subgradient f′(x) of f at x (recall that a subgradient f′(x) of f at
x is a vector such that
$$f(y) \ge f(x) + (y - x)^Tf'(x) \tag{4.1.12}$$
for all y; a convex function possesses subgradients at every relative interior point of its
domain, see Section C.6.2);
3. We are given two positive reals R ≥ r such that X is contained in the Euclidean ball,
centered at the origin, of the radius R and contains a Euclidean ball of the radius r (not
necessarily centered at the origin).
Theorem 4.1.2 In the outlined "working environment", for every given ε > 0 it is possible to
find an ε-solution to (4.1.10), i.e., a point x ∈ X with
$$f(x) - \min_X f \le \epsilon,$$
in no more than N(ε) subsequent calls to the Separation and the First Order oracles plus no
more than O(1)n²N(ε) arithmetic operations to process the answers of the oracles, with
$$N(\epsilon) = O(1)\,n^2\ln\Big(2 + \frac{\mathrm{Var}_X(f)\,R}{\epsilon\cdot r}\Big). \tag{4.1.13}$$
Here
$$\mathrm{Var}_X(f) = \max_X f - \min_X f.$$
Ellipsoid method: the idea. Assume that we have already found an n-dimensional ellipsoid
$$E = \{x = c + Bu \mid u^Tu \le 1\}\qquad [B \in \mathbf M^{n,n},\ \mathrm{Det}\,B \ne 0]$$
which contains the optimal set X∗ of (4.1.10) (note that X∗ ≠ ∅, since the feasible set X of
(4.1.10) is assumed to be compact, and the objective f – to be convex on the entire Rⁿ and
therefore continuous, see Theorem C.4.1). How could we construct a smaller ellipsoid containing
X∗ ?
The answer is immediate.
1) Let us call the separation oracle Sep(X), the center c of the current ellipsoid being the
input. There are two possible cases:
1.a) Sep(X) reports that c ∉ X and returns a separator a:
$$a \ne 0,\qquad a^Tc \ge \sup_{y\in X} a^Ty. \tag{4.1.14}$$
In this case we can replace our current "localizer" E of the optimal set X∗ by a smaller one –
namely, by the "half-ellipsoid"
$$\hat E = \{x \in E \mid a^Tx \le a^Tc\}.$$
Indeed, by assumption X∗ ⊂ E; when passing from E to Ê, we cut off all points x of E where
aᵀx > aᵀc, and by (4.1.14) all these points are outside of X and therefore outside of X∗ ⊂ X.
Thus, X∗ ⊂ Ê.
1.b) Sep(X) reports that c ∈ X. In this case we call the first order oracle O(f), c being the
input; the oracle returns the value f(c) and a subgradient a = f′(c) of f at c. Again, two cases
are possible:
1.b.1) a = 0. In this case we are done – c is a minimizer of f on X. Indeed, c ∈ X, and
(4.1.12) reads
$$f(y) \ge f(c) + 0^T(y - c) = f(c)\qquad \forall y \in \mathbf R^n.$$
Thus, c is a minimizer of f on Rⁿ, and since c ∈ X, c minimizes f on X as well.
1.b.2) a ≠ 0. In this case (4.1.12) reads
$$f(y) \ge f(c) + a^T(y - c),$$
so that every point y with aᵀy > aᵀc satisfies f(y) > f(c) and thus cannot belong to X∗;
consequently, X∗ is again contained in the half-ellipsoid
$$\hat E = \{x \in E \mid a^Tx \le a^Tc\}.$$
2) Thus, whenever the process does not terminate, we end up with a half-ellipsoid
$$\hat E = \{x \in E \mid a^Tx \le a^Tc\}\qquad [a \ne 0]$$
containing X∗. It turns out that
(*) If n > 1, the set Ê is contained in the ellipsoid
$$\begin{array}{rcl}
E^+ &=& \{x = c^+ + B^+u \mid u^Tu \le 1\},\\
c^+ &=& c - \frac{1}{n+1}Bp,\\
B^+ &=& B\Big(\frac{n}{\sqrt{n^2-1}}(I_n - pp^T) + \frac{n}{n+1}pp^T\Big) = \frac{n}{\sqrt{n^2-1}}B + \Big(\frac{n}{n+1} - \frac{n}{\sqrt{n^2-1}}\Big)(Bp)p^T,\\
p &=& \frac{B^Ta}{\sqrt{a^TBB^Ta}}
\end{array} \tag{4.1.15}$$
and if n = 1, then the set Ê is contained in the ellipsoid (which now is just a segment)
$$E^+ = \{x = c^+ + B^+u \mid |u| \le 1\},\qquad c^+ = c - \frac{1}{2}B\frac{Ba}{|Ba|},\qquad B^+ = \frac{1}{2}B.$$
In all cases, the n-dimensional volume Vol(E⁺) of the ellipsoid E⁺ is less than the
one of E:
$$\mathrm{Vol}(E^+) = \Big(\frac{n}{\sqrt{n^2-1}}\Big)^{n-1}\frac{n}{n+1}\,\mathrm{Vol}(E) \le \exp\{-1/(2n)\}\,\mathrm{Vol}(E) \tag{4.1.16}$$
(in the case of n = 1, the factor (n/√(n²−1))^{n−1} is to be read as 1).
(*) says that there exists (and can be explicitly specified) an ellipsoid E⁺ ⊃ Ê with the volume
a constant factor less than the one of E. Since E⁺ covers Ê, and the latter set, as we have
seen, covers X∗, E⁺ covers X∗. Now we can iterate the above construction, thus obtaining a
sequence of ellipsoids E, E⁺, (E⁺)⁺, ... with volumes going to 0 at a linear rate (depending on
the dimension n only) which "collapses" to the set X∗ of optimal solutions of our problem –
exactly as in the usual bisection!
Note that (*) is just an exercise in elementary calculus. Indeed, the ellipsoid E is given
as an image of the unit Euclidean ball W = {u | uᵀu ≤ 1} under the one-to-one affine
mapping u ↦ c + Bu; the half-ellipsoid Ê is then the image, under the same mapping, of
the half-ball
$$\hat W = \{u \in W \mid p^Tu \le 0\},$$
p being the unit vector from (4.1.15); indeed, if x = c + Bu, then aᵀx ≤ aᵀc if and only if
aᵀBu ≤ 0, or, which is the same, if and only if pᵀu ≤ 0. Now, instead of covering Ê by
a small in volume ellipsoid E⁺, we may cover by a small ellipsoid W⁺ the half-ball Ŵ and
then take E⁺ to be the image of W⁺ under our affine mapping:
$$E^+ = \{x = c + Bu \mid u \in W^+\}. \tag{4.1.17}$$
Indeed, if W⁺ contains Ŵ, then the image of W⁺ under our affine mapping u ↦ c + Bu
contains the image of Ŵ, i.e., contains Ê. And since the ratio of volumes of two bodies
remains invariant under affine mapping (passing from a body to its image under an affine
mapping u ↦ c + Bu, we just multiply the volume by |Det B|), we have
$$\frac{\mathrm{Vol}(E^+)}{\mathrm{Vol}(E)} = \frac{\mathrm{Vol}(W^+)}{\mathrm{Vol}(W)}.$$
Thus, the problem of finding a "small" ellipsoid E⁺ containing the half-ellipsoid Ê can be
reduced to the one of finding a "small" ellipsoid W⁺ containing the half-ball Ŵ, as shown
on Fig. 5.1. Now, the problem of finding a small ellipsoid containing Ŵ is very simple: our
"geometric data" are invariant with respect to rotations around the p-axis, so that we may
look for W⁺ possessing the same rotational symmetry. Such an ellipsoid W⁺ is given by just
3 parameters: its center should belong to our symmetry axis, i.e., should be −hp for certain
h, one of the half-axes of the ellipsoid (let its length be µ) should be directed along p, and the
remaining n − 1 half-axes should be of the same length λ and be orthogonal to p. For our 3
parameters h, µ, λ we have 2 equations expressing the fact that the boundary of W⁺ should
pass through the "South pole" −p of W and through the "equator" {u | uᵀu = 1, pᵀu = 0};
indeed, W⁺ should contain Ŵ and thus – both the pole and the equator, and since we are
looking for W⁺ with the smallest possible volume, both the pole and the equator should
be on the boundary of W⁺. Using our 2 equations to express µ and λ via h, we end up
with a single "free" parameter h, and the volume of W⁺ (i.e., const(n)µλ^{n−1}) becomes an
explicit function of h; minimizing this function in h, we find the "optimal" ellipsoid W⁺,
check that it indeed contains Ŵ (i.e., that our geometric intuition was correct) and then
convert W⁺ into E⁺ according to (4.1.17), thus coming to the explicit formulas (4.1.15) –
(4.1.16).
[Figure 5.1. Left: the ellipsoid E, the half-ellipsoid Ê and the covering ellipsoid E⁺; right: the ball W, the half-ball Ŵ and the covering ellipsoid W⁺. The two pictures are linked by the affine mapping x = c + Bu.]
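The minimization over h described above is easy to reproduce numerically. A tiny check of ours (under our reading of the two boundary conditions: µ = 1 − h from the pole and λ = (1 − h)/√(1 − 2h) from the equator):

```python
import numpy as np
from scipy.optimize import minimize_scalar

n = 6
vol = lambda h: (1 - h) * ((1 - h) / np.sqrt(1 - 2 * h))**(n - 1)  # ~ mu * lambda^(n-1)
res = minimize_scalar(vol, bounds=(1e-9, 0.5 - 1e-9), method="bounded")
h = res.x
print(h, 1 / (n + 1))                                       # optimal shift: h = 1/(n+1)
print(1 - h, n / (n + 1))                                   # mu = n/(n+1)
print((1 - h) / np.sqrt(1 - 2 * h), n / np.sqrt(n**2 - 1))  # lambda = n/sqrt(n^2-1)
```

The recovered constants 1/(n+1), n/(n+1) and n/√(n²−1) are exactly those appearing in (4.1.15).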
Ellipsoid method: the construction. There is a small problem with implementing our idea
of “trapping” the optimal set X∗ of (4.1.10) by a “collapsing” sequence of ellipsoids. The only
thing we can ensure is that all our ellipsoids contain X∗ and that their volumes rapidly (at a
linear rate) converge to 0. However, the linear sizes of the ellipsoids should not necessarily go
to 0 – it may happen that the ellipsoids are shrinking in some directions and are not shrinking
(or even become larger) in other directions (look what happens if we apply the construction to
minimizing a function of 2 variables which in fact depends only on the first coordinate). Thus,
at the moment it is unclear how to build a sequence of points converging to X∗. This difficulty,
however, can be easily resolved: as we shall see, we can form this sequence from the best
feasible solutions generated so far. Another issue which remains open at the moment is when
to terminate the method; as we shall see in a while, this issue also can be settled satisfactorily.
The precise description of the Ellipsoid method as applied to (4.1.10) is as follows (in this
description, we assume that n ≥ 2, which of course does not restrict generality):
c0 = 0, B0 = RI, E0 = {x = c0 + B0 u | uT u ≤ 1};
note that E0 ⊃ X.
We also set
ρ0 = R, L0 = 0.
The quantities ρt will be the “radii” of the ellipsoids Et to be built, i.e., the radii
of the Euclidean balls of the same volumes as Et ’s. The quantities Lt will be our
guesses for the variation
of the objective on the initial ellipsoid E0 . We shall use these guesses in the termi-
nation test.
Finally, we input the accuracy ε > 0 to which we want to solve the problem.
Step t, t = 1, 2, .... At the beginning of step t, we have the previous ellipsoid E_{t−1}
(i.e., have c_{t−1}, B_{t−1}) along with the quantities L_{t−1} ≥ 0 and ρ_{t−1} > 0.
1) We call the separation oracle Sep(X), c_{t−1} being the input. If Sep(X) reports
that c_{t−1} ∉ X and returns a separator
$$a \ne 0:\quad a^Tc_{t-1} \ge \sup_{y\in X} a^Ty,$$
we call step t non-productive, set
$$a_t = a,\qquad L_t = L_{t-1}$$
and go to rule 3) below. Otherwise – i.e., when c_{t−1} ∈ X – we call step t productive
and go to rule 2).
2) We call the first order oracle O(f), c_{t−1} being the input, and get the value
f(c_{t−1}) and a subgradient a ≡ f′(c_{t−1}) of f at the point c_{t−1}. It is possible that
a = 0; in this case we terminate and claim that c_{t−1} is an optimal solution to (4.1.10).
In the case of a ≠ 0 we set
$$a_t = a,$$
compute the quantity
$$\ell_t = \max_{u\in E_0} a_t^T(u - c_{t-1}),$$
update L by setting
$$L_t = \max\{L_{t-1}, \ell_t\}$$
and go to rule 3).
3) We set
$$\hat E_t = \{x \in E_{t-1} \mid a_t^Tx \le a_t^Tc_{t-1}\}$$
and define E_t ⊃ Ê_t as the covering ellipsoid given by (*) and (4.1.15). We also set
$$\rho_t = |\mathrm{Det}\,B_t|^{1/n} = \Big(\frac{n}{\sqrt{n^2-1}}\Big)^{(n-1)/n}\Big(\frac{n}{n+1}\Big)^{1/n}\rho_{t-1}$$
4) Finally, we check whether the termination criterion
$$\frac{\rho_t}{r} < \frac{\epsilon}{\epsilon + L_t} \tag{4.1.19}$$
is satisfied. If it is the case, we terminate and output, as the result of the solution
process, the best (i.e., with the smallest value of f ) of the “search points” cτ −1
associated with productive steps τ ≤ t (we shall see that these productive steps
indeed exist, so that the result of the solution process is well-defined). If (4.1.19) is
not satisfied, we go to step t + 1.
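Putting rules 1) – 4) together yields a rather short algorithm. Below is a compact sketch of ours (not the authors' code); the separation oracle is written for the box used in the 2D illustration which follows, and the gradient is replaced by finite differences for brevity.

```python
import numpy as np

def ellipsoid_method(f, grad, n, R, r, eps, sep):
    c, B = np.zeros(n), R * np.eye(n)   # E0 = {c + Bu : |u| <= 1}
    rho, L = R, 0.0
    best_x, best_f = None, np.inf
    while True:
        a = sep(c)                      # rule 1): separation oracle
        if a is None:                   # productive step: c is feasible
            a = grad(c)
            if np.allclose(a, 0.0):
                return c                # c minimizes f (case 1.b.1)
            if f(c) < best_f:
                best_x, best_f = c.copy(), f(c)
            L = max(L, R * np.linalg.norm(a) - a @ c)  # ell_t over the ball E0
        p = B.T @ a / np.sqrt(a @ B @ B.T @ a)         # rule 3): update (4.1.15)
        c = c - B @ p / (n + 1)
        B = (n / np.sqrt(n**2 - 1)) * B \
            + (n / (n + 1) - n / np.sqrt(n**2 - 1)) * np.outer(B @ p, p)
        rho *= (n / np.sqrt(n**2 - 1))**((n - 1) / n) * (n / (n + 1))**(1 / n)
        if rho / r < eps / (eps + L):                  # termination test (4.1.19)
            return best_x

def sep_box(x):                         # separation oracle for [-1,1]^2
    i = int(np.argmax(np.abs(x)))
    if abs(x[i]) <= 1.0:
        return None
    e = np.zeros_like(x); e[i] = np.sign(x[i])
    return e

f = lambda x: 0.5 * (1.443508244*x[0] + 0.623233851*x[1] - 7.957418455)**2 \
    + 5 * (-0.350974738*x[0] + 0.799048618*x[1] + 2.877831823)**4
def grad(x, h=1e-7):
    return np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(2)])

x = ellipsoid_method(f, grad, 2, R=np.sqrt(2), r=1.0, eps=1e-6, sep=sep_box)
print(x, f(x))                          # approaches (1, -1), f ~ 70.0302
```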
Just to get some feeling for how the method works, here is a 2D illustration. The problem is
$$\min_{-1\le x_1, x_2\le 1} f(x),\qquad f(x) = \tfrac{1}{2}(1.443508244x_1 + 0.623233851x_2 - 7.957418455)^2 + 5(-0.350974738x_1 + 0.799048618x_2 + 2.877831823)^4,$$
the optimal solution is x₁∗ = 1, x₂∗ = −1, and the exact optimal value is 70.030152768...
The values of f at the best (i.e., with the smallest value of the objective) feasible solutions
found in the course of the first t steps of the method, t = 1, 2, ..., 256, are shown in the following table:
Ellipsoid method: complexity analysis. We are about to establish our key result (which,
in particular, immediately implies Theorem 4.1.2):
Theorem 4.1.3 Let the Ellipsoid method be applied to convex program (4.1.10) of dimension
n ≥ 2 such that the feasible set X of the problem contains a Euclidean ball of a given radius
r > 0 and is contained in the ball E₀ = {‖x‖₂ ≤ R} of a given radius R. For every input
accuracy ε > 0, the Ellipsoid method terminates after no more than
$$N(\epsilon) = \mathrm{Ceil}\Big(2n^2\Big[\ln\frac{R}{r} + \ln\frac{\epsilon + \mathrm{Var}_R(f)}{\epsilon}\Big]\Big) + 1 \tag{4.1.20}$$
steps, where
$$\mathrm{Var}_R(f) = \max_{E_0} f - \min_{E_0} f,$$
$$\begin{array}{ll}
(a) & E_0 \supset X;\\
(b) & E_\tau \supset \hat E_\tau = \{x \in E_{\tau-1} \mid a_\tau^Tx \le a_\tau^Tc_{\tau-1}\},\quad \tau = 1, ..., t;\\
(c) & \mathrm{Vol}(E_\tau) = \big(\frac{\rho_\tau}{R}\big)^n\mathrm{Vol}(E_0) = \big(\frac{n}{\sqrt{n^2-1}}\big)^{n-1}\frac{n}{n+1}\mathrm{Vol}(E_{\tau-1}) \le \exp\{-1/(2n)\}\mathrm{Vol}(E_{\tau-1}),\quad \tau = 1, ..., t.
\end{array} \tag{4.1.22}$$
2⁰. We claim that
Indeed, there are only two possible reasons for termination. First, it may happen that
c_{t−1} ∈ X and f′(c_{t−1}) = 0 (see rule 2)). From our preliminary considerations we know that in
this case c_{t−1} is an optimal solution to (4.1.10), which is even more than what we have claimed.
Second, it may happen that at step t relation (4.1.19) is satisfied. Let us prove that the claim
of 2⁰ takes place in this case as well.
2⁰.a) Let us set
$$\nu = \frac{\epsilon}{\epsilon + L_t} \in (0, 1].$$
By (4.1.19), we have ρ_t/r < ν, so that there exists ν′ such that
$$\frac{\rho_t}{r} < \nu' < \nu\ [\le 1]. \tag{4.1.24}$$
Let x∗ be an optimal solution to (4.1.10), and X⁺ be the "ν′-shrinkage" of X to x∗:
$$X^+ = x_* + \nu'(X - x_*) = \{y = x_* + \nu'(z - x_*) \mid z \in X\}. \tag{4.1.25}$$
We have
$$\mathrm{Vol}(X^+) = (\nu')^n\,\mathrm{Vol}(X) \ge \Big(\frac{r\nu'}{R}\Big)^n\,\mathrm{Vol}(E_0) \tag{4.1.26}$$
(the last inequality is given by the fact that X contains a Euclidean ball of the radius r), while
$$\mathrm{Vol}(E_t) = \Big(\frac{\rho_t}{R}\Big)^n\,\mathrm{Vol}(E_0) \tag{4.1.27}$$
by definition of ρt . Comparing (4.1.26), (4.1.27) and taking into account that ρt < rν 0 by
(4.1.24), we conclude that Vol(Et ) < Vol(X + ) and, consequently, X + cannot be contained in
E_t. Thus, there exists a point y which belongs to X⁺ but not to E_t:
$$y = (1 - \nu')x_* + \nu'z\ \ [z \in X],\qquad y \not\in E_t.$$
Since y ∈ X ⊂ E₀ and y ∉ E_t, the point y was cut off at some step τ ≤ t: a_τᵀy > a_τᵀc_{τ−1}.
The step τ must be productive – at a non-productive step the separator a_τ satisfies
a_τᵀc_{τ−1} ≥ sup_{u∈X} a_τᵀu, and the point y ∈ X would not be cut off. Thus, a_τ = f′(c_{τ−1}), and
since z ∈ X ⊂ E₀,
$$a_\tau^T(z - c_{\tau-1}) \le \ell_\tau \le L_\tau.$$
Thus,
$$f(x_*) \ge f(c_{\tau-1}) + a_\tau^T(x_* - c_{\tau-1}),\qquad L_\tau \ge a_\tau^T(z - c_{\tau-1}).$$
Multiplying the first inequality by (1 − ν′), the second – by ν′ and adding the results, we get
$$(1-\nu')f(x_*) + \nu'L_\tau \ge (1-\nu')f(c_{\tau-1}) + a_\tau^T(y - c_{\tau-1}) \ge (1-\nu')f(c_{\tau-1}),$$
and we come to
$$f(c_{\tau-1}) \le f(x_*) + \frac{\nu'L_\tau}{1-\nu'} \le f(x_*) + \frac{\nu'L_t}{1-\nu'}\ \ [\text{since } L_\tau \le L_t \text{ in view of } \tau \le t] \le f(x_*) + \epsilon\ \ [\text{by definition of } \nu \text{ and since } \nu' < \nu] = \mathrm{Opt} + \epsilon.$$
We see that there exists a productive (i.e., with feasible c_{τ−1}) step τ ≤ t such that the
corresponding search point c_{τ−1} is ε-optimal. Since we are in the situation where the result x̂
is the best of the feasible search points generated in the course of the first t steps, x̂ is also
feasible and ε-optimal, as claimed in 2⁰.
3⁰. It remains to verify that the method does terminate in the course of the first N = N(ε)
steps. Assume, on the contrary, that it is not the case, and let us lead this assumption to a
contradiction.
First, observe that for every productive step t we have a_t = f′(c_{t−1}), whence
$$\ell_t \equiv \max_{u\in E_0} a_t^T(u - c_{t-1}) \le \mathrm{Var}_R(f). \tag{4.1.30}$$
Since we have assumed that the method does not terminate in the course of the first N steps, we
have
$$\frac{\rho_N}{r} \ge \frac{\epsilon}{\epsilon + L_N}. \tag{4.1.31}$$
The right hand side in this inequality is ≥ ε/(ε + Var_R(f)) by (4.1.30), while the left hand side
is ≤ exp{−N/(2n²)}R/r by (4.1.23). We get
$$\exp\{-N/(2n^2)\}R/r \ge \frac{\epsilon}{\epsilon + \mathrm{Var}_R(f)}\ \Rightarrow\ N \le 2n^2\Big[\ln\frac{R}{r} + \ln\frac{\epsilon + \mathrm{Var}_R(f)}{\epsilon}\Big],$$
which is the desired contradiction (see the definition of N = N(ε) in (4.1.20)).
Now note that if ℓ = length(Data(q)), then the total number of bits in Data(p[q]) and in ε(q) is
bounded by a polynomial of ℓ (since the transformation Data(q) ↦ (Data(p[q]), ε(q), µ(q)) takes
CT-polynomial time). It follows that both s(q) and d(q) are bounded by polynomials in `, so
that our “Real Arithmetic” solvability test for Q takes polynomial in length(Data(q)) number
of arithmetic operations.
Recall that Q was assumed to be an NP-complete generic problem, so that it would be “highly
improbable” to find a polynomial time solvability test for this problem, while we have managed
to build such a test. We conclude that the polynomial solvability of P is highly improbable as
well.
4.2 Interior Point Polynomial Time Methods for LP, CQP and
SDP
4.2.1 Motivation
Theorem 4.1.1 states that generic convex programs, under mild computability and bounded-
ness assumptions, are polynomially solvable. This result is extremely important theoretically;
however, from the practical viewpoint it is, essentially, no more than “an existence theorem”.
Indeed, the “universal” complexity bounds coming from Theorem 4.1.2, although polynomial,
are not that attractive: by Theorem 4.1.1, when solving problem (4.1.10) with n design vari-
ables, the “price” of an accuracy digit (what it costs to reduce current inaccuracy by factor
2) is O(n2 ) calls to the first order and the separation oracles plus O(n4 ) arithmetic operations
to process the answers of the oracles. Thus, even for the simplest objectives to be minimized over
the simplest feasible sets, the arithmetic price of an accuracy digit is O(n⁴); think how long it
would take to solve a problem with, say, 1,000 variables (which is still a "small" size for many
applications). The good news about the methods underlying Theorem 4.1.2 is their universal-
ity: all they need is a Separation oracle for the feasible set and the possibility to compute the
objective and its subgradient at a given point, which is not that much. The bad news about
these methods has the same source as the good news: the methods are "oracle-oriented" and
capable of using only local information on the program they are solving, in contrast to the fact
that when solving instances of well-structured programs, like LP, we have at our disposal, from
the very beginning, a complete global description of the instance. And of course it is ridiculous
to use complete global knowledge of the instance just to mimic the local in their nature first
order and separation oracles. What we would like to have is an optimization technique capable
of "utilizing efficiently" our global knowledge of the instance and thus allowing us to get a solution
much faster than is possible for "nearly blind" oracle-oriented algorithms. The major event
in the “recent history” of Convex Optimization, called sometimes “Interior Point revolution”,
was the invention of these “smart” techniques.
The simplest way to get a proper impression of (most of) the IP methods is to start with a
quite traditional interior penalty scheme for solving optimization problems.
Practitioners thought the (properly modified) Newton method to be the fastest, in terms
of the iteration count, routine for smooth (not necessarily convex) unconstrained minimization,
although sometimes “too heavy” for practical use: the practical drawbacks of the method are
both the necessity to invert the Hessian matrix at each step, which is computationally costly in
the large-scale case, and especially the necessity to compute this matrix (think how difficult it is
to write a code computing 5,050 second order derivatives of a messy function of 100 variables).
Classical interior penalty scheme: the construction. Now consider a constrained convex
optimization program. As we remember, one can w.l.o.g. make its objective linear, moving, if
necessary, the actual objective to the list of constraints. Thus, let the problem be
n o
min cT x : x ∈ X ⊂ Rn , (C)
x
where X is a closed convex set, which we assume to possess a nonempty interior. How could we
solve the problem?
Traditionally it was thought that the problems of smooth convex unconstrained minimization
are “easy”; thus, a quite natural desire was to reduce the constrained problem (C) to a series
of smooth unconstrained optimization programs. To this end, let us choose somehow a barrier
(another name – "an interior penalty function") F(x) for the feasible set X – a function which
is well-defined (and is smooth and strongly convex) on the interior of X and "blows up" as a
point from int X approaches a boundary point of X:
$$x_i \in \mathrm{int}\,\mathcal X,\ x_i \to x \in \partial\mathcal X\quad \Rightarrow\quad F(x_i) \to \infty,$$
and let us look at the one-parametric family of functions generated by our objective and the
barrier:
Ft (x) = tcT x + F (x) : int X → R.
Here the penalty parameter t is assumed to be nonnegative.
It is easily seen that under mild regularity assumptions (e.g., in the case of bounded X ,
which we assume from now on)
• Every function Ft (·) attains its minimum over the interior of X , the minimizer x∗ (t) being
unique;
• The central path x∗ (t) is a smooth curve, and all its limiting, t → ∞, points belong to the
set of optimal solutions of (C).
This fact is quite clear intuitively. To minimize Ft (·) for large t is the same as to minimize the
function fρ (x) = cT x + ρF (x) for small ρ = 1t . When ρ is small, the function fρ is very close to
cT x everywhere in X , except a narrow stripe along the boundary of X , the stripe becoming thinner
and thinner as ρ → 0. Therefore we have all reasons to believe that the minimizer of Ft for large t
(i.e., the minimizer of fρ for small ρ) must be close to the set of minimizers of cT x on X .
We see that the central path x∗ (t) is a kind of Ariadne’s thread which leads to the solution set of
(C). On the other hand, to reach, given a value t ≥ 0 of the penalty parameter, the point x∗ (t)
on this path is the same as to minimize a smooth strongly convex function Ft (·) which attains
its minimum at an interior point of X . The latter problem is “nearly unconstrained one”, up to
the fact that its objective is not everywhere defined. However, we can easily adapt the methods
of unconstrained minimization, including the Newton one, to handle “nearly unconstrained”
problems. We see that constrained convex optimization in a sense can be reduced to the “easy”
unconstrained one. The conceptually simplest way to make use of this observation would be to
choose a “very large” value t̄ of the penalty parameter, like t̄ = 106 or t̄ = 1010 , and to run an
unconstrained minimization routine, say, the Newton method, on the function Ft̄ , thus getting
a good approximate solution to (C) “in one shot”. This policy, however, is impractical: since
we have no idea where x∗ (t̄) is, we normally will start our process of minimizing Ft̄ very far
from the minimizer of this function, and thus for a long time will be unable to exploit fast local
convergence of the method for unconstrained minimization we have chosen. A smarter way to
use our Ariadne’s thread is exactly the one used by Theseus: to follow the thread. Assume, e.g.,
that we know in advance the minimizer of F0 ≡ F , i.e., the point x∗ (0)3) . Thus, we know where
the central path starts. Now let us follow this path: at i-th step, standing at a point xi “close
enough” to some point x∗ (ti ) of the path, we
• first, increase a bit the current value ti of the penalty parameter, thus getting a new “target
point” x∗ (ti+1 ) on the path,
and
• second, approach our new target point x∗ (ti+1 ) by running, say, the Newton method,
started at our current iterate xi , on the function Fti+1 , until a new iterate xi+1 “close enough”
to x∗ (ti+1 ) is generated.
As a result of such a step, we restore the initial situation – we again stand at a point which
is close to a point on the central path, but this latter point has been moved along the central
path towards the optimal set of (C). Iterating this updating and strengthening appropriately
our “close enough” requirements as the process goes on, we, same as the central path, approach
the optimal set. A conceptual advantage of this “path-following” policy as compared to the
“brute force” attempt to reach a target point x∗ (t̄) with large t̄ is that now we have a hope to
exploit all the time the strongest feature of our “working horse” (the Newton method) – its fast
local convergence. Indeed, assuming that xi is close to x∗ (ti ) and that we do not increase the
penalty parameter too rapidly, so that x∗ (ti+1 ) is close to x∗ (ti ) (recall that the central path is
smooth!), we conclude that x_i is close to our new target point x∗(t_{i+1}). If all our "close enough"
and “not too rapidly” are properly controlled, we may ensure xi to be in the domain of the
quadratic convergence of the Newton method as applied to Fti+1 , and then it will take a quite
small number of steps of the method to recover closeness to our new target point.
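For concreteness, here is a bare-bones sketch of ours of this path-following idea, written for X = {x : Ax ≤ b} with the classical logarithmic barrier F(x) = −Σ_i ln(b_i − a_iᵀx) (all names and parameter choices below are our illustrative assumptions): the penalty parameter is increased by a fixed factor, and a few damped Newton steps on F_t restore closeness to x∗(t).

```python
import numpy as np

def follow_path(cvec, A, b, x, t=1.0, factor=1.2, outer=60, inner=20):
    for _ in range(outer):
        t *= factor                               # new target point x_*(t)
        for _ in range(inner):                    # re-center by Newton steps
            s = b - A @ x                         # slacks (x strictly feasible)
            g = t * cvec + A.T @ (1.0 / s)        # gradient of F_t = t*c^T x + F(x)
            H = A.T @ np.diag(1.0 / s**2) @ A     # Hessian of F_t
            dx = np.linalg.solve(H, -g)
            lam2 = -g @ dx                        # squared Newton decrement
            step = 1.0 if lam2 < 0.25 else 1.0 / (1.0 + np.sqrt(lam2))
            while np.any(b - A @ (x + step * dx) <= 0):
                step *= 0.5                       # stay strictly feasible
            x = x + step * dx
            if lam2 < 1e-10:
                break
    return x

# minimize x1 + x2 over the box [-1,1]^2, starting at its analytic center
A = np.vstack([np.eye(2), -np.eye(2)]); b = np.ones(4)
print(follow_path(np.array([1.0, 1.0]), A, b, np.zeros(2)))  # -> (-1, -1)
```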
4.2.3 But...
Luckily, the pessimistic analysis of the classical interior penalty scheme is not the “final truth”.
It turned out that what prevents this scheme from yielding a polynomial time method is not the
structure of the scheme, but the huge amount of freedom it allows for its elements (too much
freedom is another word for anarchy...). After some order is added, the scheme becomes a
polynomial time one! Specifically, it was understood that
1. There is a (completely non-traditional) class of “good” (self-concordant4) ) barriers. Every
barrier F of this type is associated with a “self-concordance parameter” θ(F ), which is a
real ≥ 1;
2. Whenever a barrier F underlying the interior penalty scheme is self-concordant, one can
specify the notion of “closeness to the central path” and the policy for updating the penalty
parameter in such a way that a single Newton step
xi 7→ xi+1 = xi − [∇2 Fti+1 (xi )]−1 ∇Fti+1 (xi ) (4.2.1)
suffices to update a “close to x∗ (ti )” iterate xi into a new iterate xi+1 which is close, in
the same sense, to x∗ (ti+1 ). All “close to the central path” points belong to int X , so that
the scheme keeps all the iterates strictly feasible.
4)
We do not intend to explain here what is a “self-concordant barrier”; for our purposes it suffices to say that
this is a three times continuously differentiable convex barrier F satisfying a pair of specific differential inequalities
linking the first, the second and the third directional derivatives of F .
3. The penalty updating policy mentioned in the previous item is quite simple:
$$t_i \mapsto t_{i+1} = \Big(1 + \frac{0.1}{\sqrt{\theta(F)}}\Big)t_i;$$
in particular, it does not "slow down" as t_i grows and ensures linear, with the ratio
(1 + 0.1/√θ(F)), growth of the penalty. This is vitally important due to the following fact:
4. The inaccuracy of a point x, which is close to some point x∗(t) of the central path, as an
approximate solution to (C) is inversely proportional to t:
$$c^Tx - \min_{y\in\mathcal X} c^Ty \le \frac{2\theta(F)}{t}.$$
It follows that
(!) After we have managed once to get close to the central path – have built a
point x⁰ which is close to a point x∗(t⁰), t⁰ > 0, on the path – every O(√θ(F))
steps of the scheme improve the quality of approximate solutions generated by
the scheme by an absolute constant factor. In particular, it takes no more than
$$O(1)\sqrt{\theta(F)}\,\ln\Big(2 + \frac{\theta(F)}{t^0\epsilon}\Big)$$
steps to generate a strictly feasible ε-solution to (C).
Note that with our simple penalty updating policy, all that is needed to perform a step of the
interior penalty scheme is to compute the gradient and the Hessian of the underlying
barrier at a single point and to invert the resulting Hessian.
Items 3, 4 say that essentially all we need to derive from the just listed general results a polyno-
mial time method for a generic convex optimization problem is to be able to equip every instance
of the problem with a “good” barrier in such a way that both the parameter of self-concordance
of the barrier θ(F ) and the arithmetic cost at which we can compute the gradient and the Hes-
sian of this barrier at a given point are polynomial in the size of the instance5) . And it turns
out that we can meet the latter requirement for all interesting “well-structured” generic convex
programs, in particular, for Linear, Conic Quadratic, and Semidefinite Programming. Moreover,
“the heroes” of our course – LP, CQP and SDP – are especially nice application fields of the
general theory of interior point polynomial time methods; in these particular applications, the
theory can be simplified, on one hand, and strengthened, on another.
4.3 Interior point methods for LP, CQP, and SDP: building
blocks
We are about to explain what the interior point methods for LP, CQP, SDP look like.
5)
Another requirement is to be able to get close, just once, to a point x∗(t₀) on the central path with a not "disastrously
small" value of t₀ – we should initialize somehow our path-following method! It turns out that such an initialization
is a minor problem – it can be carried out via the same path-following technique, provided we are given in advance
a strictly feasible solution to our problem.
associated with a cone K given as a direct product of m “basic” cones, each of them being either
a second-order, or a semidefinite cone:
$$K = \mathbf S_+^{k_1}\times ... \times \mathbf S_+^{k_p}\times \mathbf L^{k_{p+1}}\times ... \times \mathbf L^{k_m} \subset E = \mathbf S^{k_1}\times ... \times \mathbf S^{k_p}\times \mathbf R^{k_{p+1}}\times ... \times \mathbf R^{k_m}. \tag{Cone}$$
Of course, the generic problem in question covers LP (no Lorentz factors, all semidefinite factors
are of dimension 1), CQP (no semidefinite factors) and SDP (no Lorentz factors).
Now, we shall equip the semidefinite and the Lorentz cones with “canonical barriers”:
• The canonical barrier for the semidefinite cone S₊ᵏ is
$$S_k(x) = -\ln\mathrm{Det}(x) : \mathrm{int}\,\mathbf S_+^k \to \mathbf R;$$
• the canonical barrier for the Lorentz cone Lᵏ is
$$L_k(x) = -\ln(x_k^2 - x_1^2 - ... - x_{k-1}^2) = -\ln(x^TJ_kx),\qquad J_k = \begin{pmatrix} -I_{k-1} & \\ & 1 \end{pmatrix};$$
Recall that all direct factors in the direct product representation (Cone) of our "universe" E
are Euclidean spaces; the matrix factors S^{k_i} are endowed with the Frobenius inner product
$$\langle X_i, Y_i\rangle_F = \mathrm{Tr}(X_iY_i),$$
while the "arithmetic factors" R^{k_i} are endowed with the usual inner product
$$\langle X_i, Y_i\rangle = X_i^TY_i;$$
E itself will be regarded as a Euclidean space endowed with the direct sum of inner products
on the factors:
$$\langle X, Y\rangle_E = \sum_{i=1}^p \mathrm{Tr}(X_iY_i) + \sum_{i=p+1}^m X_i^TY_i.$$
It is clearly seen that our basic barriers, same as their direct sum K, indeed are barriers for
the corresponding cones: they are C∞ -smooth on the interiors of their domains, blow up to
∞ along every sequence of points from these interiors converging to a boundary point of the
corresponding domain and are strongly convex. To verify the latter property, it makes sense to
compute explicitly the first and the second directional derivatives of these barriers (we need the
corresponding formulae in any case); to simplify notation, we write down the derivatives of the
basic functions Sk , Lk at a point x from their domain along a direction h (you should remember
that in the case of Sk both the point and the direction, in spite of their lower-case denotation,
are k × k symmetric matrices):
$$\begin{array}{rcl}
DS_k(x)[h] &\equiv& \frac{d}{dt}\Big|_{t=0}S_k(x+th) = -\mathrm{Tr}(x^{-1}h) = -\langle x^{-1}, h\rangle_{\mathbf S^k},\\[2pt]
&&\text{i.e.}\quad \nabla S_k(x) = -x^{-1};\\[4pt]
D^2S_k(x)[h,h] &\equiv& \frac{d^2}{dt^2}\Big|_{t=0}S_k(x+th) = \mathrm{Tr}(x^{-1}hx^{-1}h) = \langle x^{-1}hx^{-1}, h\rangle_{\mathbf S^k},\\[2pt]
&&\text{i.e.}\quad [\nabla^2S_k(x)]h = x^{-1}hx^{-1};\\[4pt]
DL_k(x)[h] &\equiv& \frac{d}{dt}\Big|_{t=0}L_k(x+th) = -2\,\frac{h^TJ_kx}{x^TJ_kx},\\[2pt]
&&\text{i.e.}\quad \nabla L_k(x) = -\frac{2}{x^TJ_kx}J_kx;\\[4pt]
D^2L_k(x)[h,h] &\equiv& \frac{d^2}{dt^2}\Big|_{t=0}L_k(x+th) = 4\,\frac{[h^TJ_kx]^2}{[x^TJ_kx]^2} - 2\,\frac{h^TJ_kh}{x^TJ_kx},\\[2pt]
&&\text{i.e.}\quad \nabla^2L_k(x) = \frac{4}{[x^TJ_kx]^2}J_kxx^TJ_k - \frac{2}{x^TJ_kx}J_k.
\end{array} \tag{4.3.1}$$
− ln Det(tx) = − ln Det(x) − k ln t,
• In the SDP case, ∇F(x) = ∇S_k(x) = −x⁻¹ and [∇²F(x)]h = ∇²S_k(x)h = x⁻¹hx⁻¹
(see (4.3.1)). Here (a) becomes the identity ⟨x⁻¹, x⟩_F ≡ Tr(x⁻¹x) = k,
and (b) kindly informs us that x⁻¹xx⁻¹ = x⁻¹.
(iii) Consequently, the k-th differential DᵏF(x) of F, k ≥ 1, is homogeneous, of degree −k, in
x ∈ Dom F:
$$\forall(x\in\mathrm{Dom}\,F,\ t>0,\ h_1, ..., h_k):\quad D^kF(tx)[h_1, ..., h_k] \equiv \frac{\partial^kF(tx + s_1h_1 + ... + s_kh_k)}{\partial s_1\partial s_2...\partial s_k}\Big|_{s_1 = ... = s_k = 0} = t^{-k}D^kF(x)[h_1, ..., h_k]. \tag{4.3.2}$$
Proof. (i): it is immediately seen that Sk and Lk are logarithmically homogeneous with parameters of
logarithmic homogeneity −θ(Sk ), −θ(Lk ), respectively; and of course the property of logarithmic homo-
geneity is stable with respect to taking direct sums of functions: if Dom Φ(u) and Dom Ψ(v) are closed
w.r.t. the operation of multiplying a vector by a positive scalar, and both Φ and Ψ are logarithmi-
cally homogeneous with parameters α, β, respectively, then the function Φ(u) + Ψ(v) is logarithmically
homogeneous with the parameter α + β.
(ii): To get (ii.a), it suffices to differentiate the identity
$$F(tx) = F(x) - \theta(F)\ln t$$
in t at t = 1:
$$\langle\nabla F(tx), x\rangle = \frac{d}{dt}F(tx) = -\theta(F)t^{-1},$$
and it remains to set t = 1 in the concluding identity.
Similarly, to get (ii.b), it suffices to differentiate the identity ⟨∇F(tx), x⟩ = −θ(F)t⁻¹ in t at
t = 1, using ⟨[∇²F(x)]h, x⟩ = ⟨[∇²F(x)]x, h⟩ (symmetry of partial derivatives!).
(iii): differentiating k times the identity F(tx) = F(x) − θ ln t in x, we get
$$t^kD^kF(tx)[h_1, ..., h_k] = D^kF(x)[h_1, ..., h_k].$$
Assuming that the linear mapping x 7→ Ax is an embedding (i.e., that KerA = {0} – this
is Assumption A from Lecture 1), we can write down our primal-dual pair in a symmetric
geometric form (Lecture 1, Section 1.4.4):
$$\min_X\big\{\langle C, X\rangle_E : X \in (L - B)\cap K\big\}, \tag{P}$$
$$\max_S\big\{\langle B, S\rangle_E : S \in (L^\perp + C)\cap K\big\}, \tag{D}$$
where L is a linear subspace in E (the image space of the linear mapping x ↦ Ax), L⊥ is the
orthogonal complement to L in E, and C ∈ E satisfies A∗C = c, i.e., ⟨C, Ax⟩_E ≡ cᵀx.
To simplify things, from now on we assume that both problems (CP) and (CD) are strictly
feasible. In terms of (P) and (D) this assumption means that both the primal feasible plane
L − B and the dual feasible plane L⊥ + C intersect the interior of the cone K.
Remark 4.4.1 By the Conic Duality Theorem (Lecture 1), both (CP) and (D) are solvable with
equal optimal values:
Opt(CP) = Opt(D)
(recall that we have assumed strict primal-dual feasibility). Since (P) is equivalent to (CP), (P)
is solvable as well, and the optimal value of (P) differs from the one of (CP) by ⟨C, B⟩_E⁷⁾. It
follows that the optimal values of (P) and (D) are linked by the relation
Opt(P) − Opt(D) + hC, BiE = 0. (4.4.1)
this barrier is
$$\hat K(x) = K(Ax - B) : \mathrm{int}\,\mathcal X \to \mathbf R \tag{4.4.2}$$
and is indeed a barrier. Now we can apply the interior penalty scheme to trace the central
path x∗(t) associated with the resulting barrier; with some effort it can be derived from the
primal-dual strict feasibility that this central path is well-defined (i.e., that the minimizer of
$$\hat K_t(x) = tc^Tx + \hat K(x)$$
on int X exists for every t ≥ 0 and is unique)8) . What is important for us for the moment, is
the central path itself, not how to trace it. Moreover, it is highly instructive to pass from the
central path x∗ (t) in the space of design variables to its image
X∗ (t) = Ax∗ (t) − B
in E. The resulting curve has a name – it is called the primal central path of the primal-dual
pair (P), (D); by its origin, it is a curve comprised of strictly feasible solutions of (P) (since it
is the same – to say that x belongs to the (interior of) the set X and to say that X = Ax − B
is a (strictly) feasible solution of (P)). A simple and very useful observation is that the primal
central path can be defined solely in terms of (P), (D) and thus is a “geometric entity” – it is
independent of a particular parameterization of the primal feasible plane L − B by the design
vector x:
7)
Indeed, the values of the respective objectives cT x and hC, Ax − BiE at the corresponding to each other
feasible solutions x of (CP) and X = Ax − B of (P) differ from each other by exactly hC, BiE :
$$c^Tx - \langle C, X\rangle_E = c^Tx - \langle C, Ax - B\rangle_E = \underbrace{c^Tx - \langle A^*C, x\rangle}_{=0\ \text{due to}\ A^*C = c} + \langle C, B\rangle_E.$$
8)
In Section 4.2.1, there was no problem with the existence of the central path, since there X was assumed to
be bounded; in our present context, X is not necessarily bounded.
(*) A point X∗(t) of the primal central path is the minimizer of the aggregate
$$P_t(X) = t\langle C, X\rangle_E + K(X)$$
on the set of strictly feasible solutions of (P).
Indeed, the function P̂_t(x) = P_t(Ax − B) of x ∈ int X differs from the function K̂_t(x)
by a constant (depending on t) and has therefore the same minimizer x∗(t) as the function
K̂_t(x). Now, when x runs through int X, the point X = Ax − B runs exactly through the
set of strictly feasible solutions of (P), so that the minimizer X∗ of P_t on the latter set and
the minimizer x∗(t) of the function P̂_t(x) = P_t(Ax − B) on int X are linked by the relation
X∗ = Ax∗(t) − B.
(*′) A point X∗(t) of the primal central path is exactly the strictly feasible solution X to (P)
such that the vector tC + ∇K(X) belongs to L⊥.
Indeed, we know that X∗(t) is the unique minimizer of the smooth convex function P_t(X) =
t⟨C, X⟩_E + K(X) on the intersection of the primal feasible plane L − B and the interior of the
cone K; a necessary and sufficient condition for a point X of this intersection to minimize
P_t over the intersection is that ∇P_t(X) must be orthogonal to L.
• In the SDP case, a point X∗(t), t > 0, of the primal central path is uniquely defined
by the following two requirements: (1) X∗(t) ≻ 0 should be feasible for (P), and (2)
the k × k matrix
$$tC - X_*^{-1}(t) = tC + \nabla S_k(X_*(t))$$
(see (4.3.1)) should belong to L⊥, i.e., should be orthogonal, w.r.t. the Frobenius
inner product, to every matrix of the form Ax.
The dual problem (D) is in no sense “worse” than the primal problem (P) and thus also
possesses the central path, now called the dual central path S∗ (t), t ≥ 0, of the primal-dual pair
(P), (D). Similarly to (*), (*′), the dual central path can be characterized as follows:
(**′) A point S∗(t), t ≥ 0, of the dual central path is the unique minimizer of the
aggregate
$$D_t(S) = -t\langle B, S\rangle_E + K(S)$$
on the set of strictly feasible solutions of (D)⁹⁾. S∗(t) is exactly the strictly feasible
solution S to (D) such that the vector −tB + ∇K(S) is orthogonal to L⊥ (i.e., belongs
to L).
9)
Note the slight asymmetry between the definitions of the primal aggregate Pt and the dual aggregate Dt :
in the former, the linear term is thC, XiE , while in the latter it is −thB, SiE . This asymmetry is in complete
accordance with the fact that we write (P) as a minimization, and (D) – as a maximization problem; to write
(D) in exactly the same form as (P), we were supposed to replace B with −B, thus getting the formula for Dt
completely similar to the one for Pt .
• In the SDP case, a point S∗(t), t > 0, of the dual central path is uniquely defined
by the following two requirements: (1) S∗(t) ≻ 0 should be feasible for (D), and (2)
the k × k matrix
$$-tB - S_*^{-1}(t) = -tB + \nabla S_k(S_*(t))$$
(see (4.3.1)) should belong to L, i.e., should be representable in the form Ax for
some x.
From Proposition 4.3.2 we can derive a wonderful connection between the primal and the dual
central paths:
Theorem 4.4.1 For t > 0, the primal and the dual central paths X∗(t), S∗(t) of a (strictly
feasible) primal-dual pair (P), (D) are linked by the relations
$$S_*(t) = -t^{-1}\nabla K(X_*(t)),\qquad X_*(t) = -t^{-1}\nabla K(S_*(t)).$$
Proof. By (*′), the vector tC + ∇K(X∗(t)) belongs to L⊥, so that the vector S = −t⁻¹∇K(X∗(t)) belongs
to the dual feasible plane L⊥ + C. On the other hand, by Proposition 4.4.3 the vector −∇K(X∗(t)) belongs
to Dom K, i.e., to the interior of K; since K is a cone and t > 0, the vector S = −t⁻¹∇K(X∗(t)) belongs
to the interior of K as well. Thus, S is a strictly feasible solution of (D). Now let us compute the gradient
of the aggregate D_t at the point S:
$$\nabla D_t(S) = -tB + \nabla K\big(-t^{-1}\nabla K(X_*(t))\big) = -tB - tX_*(t) = -t(B + X_*(t)) \in L$$
(we have used the homogeneity of ∇K(·) of degree −1, the identity ∇K(−∇K(Y)) = −Y for the canonical
barrier, and the inclusion X∗(t) ∈ L − B).
Thus, S is strictly feasible for (D) and ∇D_t(S) ∈ L. But by (**′) these properties characterize S∗(t);
thus, S∗(t) = S ≡ −t⁻¹∇K(X∗(t)). This relation, in view of Proposition 4.3.2, implies that X∗(t) =
−t⁻¹∇K(S∗(t)). Another way to get the latter relation from the one S∗(t) = −t⁻¹∇K(X∗(t)) is just to
refer to the primal-dual symmetry.
In fact, the connection between the primal and the dual central paths stated by Theorem 4.4.1
can be used to characterize both the paths:
Theorem 4.4.2 Let (P), (D) be a strictly feasible primal-dual pair.
For every t > 0, there exists a unique strictly feasible solution X of (P) such that −t−1 ∇K(X)
is a feasible solution to (D), and this solution X is exactly X∗ (t).
Similarly, for every t > 0, there exists a unique strictly feasible solution S of (D) such that
−t−1 ∇K(S) is a feasible solution of (P), and this solution S is exactly S∗ (t).
Proof. By primal-dual symmetry, it suffices to prove the first claim. We already know (Theorem
4.4.1) that X = X∗ (t) is a strictly feasible solution of (P) such that −t−1 ∇K(X) is feasible
for (D); all we need to prove is that X∗ (t) is the only point with these properties, which is
immediate: if X is a strictly feasible solution of (P) such that −t−1 ∇K(X) is dual feasible, then
−t−1 ∇K(X) ∈ L⊥ + C, or, which is the same, ∇K(X) ∈ L⊥ − tC, or, which again is the same,
∇P_t(X) = tC + ∇K(X) ∈ L⊥. And we already know from (*′) that the latter property, taken
together with the strict primal feasibility, is characteristic for X∗ (t).
Characterization of the central path. By Theorem 4.4.2, the points (X∗ (t), S∗ (t)) of the
central path possess the following properties:
(CentralPath):
In fact, the indicated properties fully characterize the central path: whenever two points X, S
possess the properties 1) - 3) with respect to some t > 0, X is nothing but X∗ (t), and S is
nothing but S∗ (t) (this again is said by Theorem 4.4.2).
Duality gap along the central path. Recall that for an arbitrary primal-dual feasible pair
(X, S) of the (strictly feasible!) primal-dual pair of problems (P), (D), the duality gap
$$\mathrm{DualityGap}(X, S) \equiv [\langle C, X\rangle_E - \mathrm{Opt}(\mathrm P)] + [\mathrm{Opt}(\mathrm D) - \langle B, S\rangle_E] = \langle C, X\rangle_E - \langle B, S\rangle_E + \langle C, B\rangle_E$$
(see (4.4.1)), which measures the "total inaccuracy" of X, S as approximate solutions of the
respective problems, can be written down equivalently as ⟨S, X⟩_E (see statement (!) in Section
1.4.5). Now, what is the duality gap along the central path? The answer is immediate:
$$\mathrm{DualityGap}(X_*(t), S_*(t)) = \langle S_*(t), X_*(t)\rangle_E = \langle -t^{-1}\nabla K(X_*(t)), X_*(t)\rangle_E = \frac{\theta(K)}{t},$$
where the concluding equality is given by Proposition 4.3.2(ii.a).
A distance to the central path. Our canonical barrier K(·) is a strongly convex smooth
function on int K; in particular, its Hessian matrix ∇2 K(Y ), taken at a point Y ∈ int K, is
positive definite. We can use the inverse of this matrix to measure the distances between points
of E, thus arriving at the norm
$$\|H\|_Y = \sqrt{\langle[\nabla^2K(Y)]^{-1}H, H\rangle_E}.$$
It turns out that
10)
Which, among other, much more important consequences, explains the name "augmented complementary
slackness" of the property 1⁰.3): at the primal-dual pair of optimal solutions X∗, S∗ the duality gap should be
zero: ⟨S∗, X∗⟩_E = 0. Property 1⁰.3, as we just have seen, implies that the duality gap at a primal-dual pair
(X∗(t), S∗(t)) from the central path, although nonzero, is "controllable" – it equals θ(K)/t – and becomes small as t grows.
Observe that dist(Z, Z∗ (t)) ≥ 0, and dist(Z, Z∗ (t)) = 0 if and only if S = −t−1 ∇K(X), which,
for a strictly primal-dual feasible pair Z = (X, S), means that Z = Z∗ (t) (see the characterization
of the primal-dual central path); thus, dist(Z, Z∗ (t)) indeed can be viewed as a kind of distance
from Z to Z∗ (t).
In the SDP case X, S are k × k symmetric matrices, and
$$\begin{array}{rcl}
\mathrm{dist}^2(Z, Z_*(t)) &=& \|tS + \nabla S_k(X)\|_X^2 = \langle[\nabla^2S_k(X)]^{-1}(tS + \nabla S_k(X)), tS + \nabla S_k(X)\rangle_F\\
&=& \mathrm{Tr}\big(X(tS - X^{-1})X(tS - X^{-1})\big)\qquad [\text{see }(4.3.1)]\\
&=& \mathrm{Tr}\big([tX^{1/2}SX^{1/2} - I]^2\big),
\end{array}$$
so that
$$\mathrm{dist}(Z, Z_*(t)) = \|tX^{1/2}SX^{1/2} - I\|_2. \tag{4.4.6}$$
Besides this,
$$\begin{array}{rcl}
\|tX^{1/2}SX^{1/2} - I\|_2^2 &=& \mathrm{Tr}\big([tX^{1/2}SX^{1/2} - I]^2\big)\\
&=& \mathrm{Tr}\big(t^2X^{1/2}SX^{1/2}X^{1/2}SX^{1/2} - 2tX^{1/2}SX^{1/2} + I\big)\\
&=& \mathrm{Tr}(t^2X^{1/2}SXSX^{1/2}) - 2t\,\mathrm{Tr}(X^{1/2}SX^{1/2}) + \mathrm{Tr}(I)\\
&=& \mathrm{Tr}(t^2XSXS - 2tXS + I)\\
&=& \mathrm{Tr}(t^2SXSX - 2tSX + I)\\
&=& \mathrm{Tr}(t^2S^{1/2}XS^{1/2}S^{1/2}XS^{1/2} - 2tS^{1/2}XS^{1/2} + I)\\
&=& \mathrm{Tr}\big([tS^{1/2}XS^{1/2} - I]^2\big),
\end{array}$$
so that the distance is in fact symmetric in X and S. It turns out that whenever a triple (t, X, S),
with X, S strictly primal-dual feasible, satisfies dist(Z, Z∗(t)) ≤ 1, the duality gap is, up to an
absolute constant factor, as small as on the central path:
$$\mathrm{DualityGap}(X, S) = \langle S, X\rangle_E \le 2\,\mathrm{DualityGap}(Z_*(t)) = \frac{2\theta(K)}{t}. \tag{4.4.7}$$
Let us check A in the SDP case. Let (t, X, S) satisfy the premise of A. The duality gap
at the pair (X, S) of strictly primal-dual feasible solutions is
$$\mathrm{DualityGap}(X, S) = \langle X, S\rangle_F = \mathrm{Tr}(XS),$$
while by (4.4.6) the relation dist((X, S), Z∗(t)) ≤ 1 means that
$$\|tX^{1/2}SX^{1/2} - I\|_2 \le 1,$$
i.e., that the eigenvalues δ₁, ..., δ_k of the matrix X^{1/2}SX^{1/2} satisfy Σ_{i=1}^k (tδ_i − 1)² ≤ 1, or,
which is the same, Σ_{i=1}^k (δ_i − t⁻¹)² ≤ t⁻². Consequently,
$$\mathrm{DualityGap}(X, S) = \mathrm{Tr}(XS) = \mathrm{Tr}(X^{1/2}SX^{1/2}) = \sum_{i=1}^k\delta_i \le kt^{-1} + \sum_{i=1}^k|\delta_i - t^{-1}| \le kt^{-1} + \sqrt k\sqrt{\sum_{i=1}^k(\delta_i - t^{-1})^2} \le kt^{-1} + \sqrt k\,t^{-1},$$
and (4.4.7) follows.
iterate (t̄, X̄, S̄) such that t̄ > 0, X̄ is strictly primal feasible, S̄ is strictly dual feasible, and
(X̄, S̄) is “good”, in certain precise sense, approximation of the point Z∗ (t̄) = (X∗ (t̄), S∗ (t̄))
on the central path, into a new iterate (t+ , X+ , S+ ) with similar properties and a larger value
t+ > t̄ of the penalty parameter. Given such an updating and iterating it, we indeed shall trace
the central path, with all the benefits (see above) coming from the latter fact12) How could we
construct the required updating? Recalling the description of the central path, we see that our
question is:
X ∈ L − B,
(4.5.1)
S ∈ L⊥ + C
(which is in fact a system of linear equations) and approximately satisfies the system
of nonlinear equations
update it into a new triple (t+ , X+ , S+ ) with the same properties and t+ > t̄.
Since the left hand side G(·) in our system of nonlinear equations is smooth around (t̄, X̄, S̄)
(recall that X̄ was assumed to be strictly primal feasible), the most natural, from the viewpoint
of Computational Mathematics, way to achieve our target is as follows:
2. We linearize the left hand side Gt+ (X, S) of the system of nonlinear equations (4.5.2) at
the point (X̄, S̄), and replace (4.5.2) with the linearized system of equations
3. We define the corrections ∆X, ∆S from the requirement that the updated pair X+ =
X̄ + ∆X, S+ = S̄ + ∆S must satisfy (4.5.1) and the linearized version (4.5.3) of (4.5.2).
In other words, the corrections should solve the system
∆X ∈ L,
∆S ∈ L⊥ , (4.5.4)
∂Gt+ (X̄,S̄) ∂Gt+ (X̄,S̄)
Gt+ (X̄, S̄) + ∂X ∆X + ∂S ∆S =0
X+ = X̄ + ∆X,
(4.5.5)
S+ = S̄ + ∆S.
12)
Of course, besides knowing how to trace the central path, we should also know how to initialize this process
– how to come close to the path to be able to start its tracing. There are different techniques to resolve this
“initialization difficulty”, and basically all of them achieve the goal by using the same path-tracing technique,
now applied to an appropriate auxiliary problem where the “initialization difficulty” does not arise at all. Thus,
at the level of ideas the initialization techniques do not add something essentially new, which allows us to skip in
our presentation all initialization-related issues.
290 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
The primal-dual IP methods we are describing basically fit the outlined scheme, up to the
following two important points:
• If the current iterate (X̄, S̄) is not enough close to Z∗ (t̄), and/or if the desired improvement
t+ − t̄ is too large, the corrections given by the outlined scheme may be too large; as a
result, the updating (4.5.5) as it is may be inappropriate, e.g., X+ , or S+ , or both, may
be kicked out of the cone K. (Why not: linearized system (4.5.3) approximates well the
“true” system (4.5.2) only locally, and we have no reasons to trust in corrections coming
from the linearized system, when these corrections are large.)
There is a standard way to overcome the outlined difficulty – to use the corrections in a
damped fashion, namely, to replace the updating (4.5.5) with
X+ = X̄ + α∆X,
(4.5.6)
S+ = S̄ + β∆S,
and to choose the stepsizes α > 0, β > 0 from additional “safety” considerations, like
ensuring the updated pair (X+ , S+ ) to reside in the interior of K, or enforcing it to stay in
a desired neighbourhood of the central path, or whatever else. In IP methods, the solution
(∆X, ∆S) of (4.5.4) plays the role of search direction (and this is how it is called), and the
actual corrections are proportional to the search ones rather than to be exactly the same.
In this sense the situation is completely similar to the one with the Newton method from
Section 4.2.2 (which is natural: the latter method is exactly the lumbarization method for
solving the Fermat equation ∇f (x) = 0).
• The “augmented complementary slackness” system (4.5.2) can be written down in many
different forms which are equivalent to each other in the sense that they share a common
solution set. E.g., we have the same reasons to express the augmented complementary
slackness requirement by the nonlinear system (4.5.2) as to express it by the system
b t (X, S) ≡ X + t−1 ∇K(S) = 0,
G
not speaking about other possibilities. And although all systems of nonlinear equations
Ht (X, S) = 0
expressing the augmented complementary slackness are “equivalent” in the sense that they
share a common solution set, their linearizations are different and thus – lead to different
search directions and finally to different path-following methods. Choosing appropriate (in
general even varying from iteration to iteration) analytic representation of the augmented
complementary slackness requirement, one can gain a lot in the performance of the result-
ing path-following method, and the IP machinery facilitates this flexibility (see “SDP case
examples” below).
reaches the point (t1 = 2t0 , X 1 , S 1 ) from the same neighbourhood, after the same O(1) θ(K)
p
H = A∗ [∇2 K(X)]A;
2. Subsequent Choleski factorization of the matrix H (which, due to its origin, is symmetric
positive definite and thus admits Choleski decomposition H = DDT with lower triangular
D).
Looking at (Cone), (CP) and (4.3.1), we immediately conclude that the arithmetic cost of
assembling and factorizing H is polynomial in the size dim Data(·) of the data defining (CP),
and that the parameter θ(K) also is polynomial in this size. Thus, the cost of an accuracy digit
for the methods in question is polynomial in the size of the data, as is required from polynomial
time methods13) . Explicit complexity bounds for LP b , CQP b , SDP b are given in Sections 4.6.1,
4.6.2, 4.6.3, respectively.
where t+ is the target value of the penalty parameter. The system (4.5.4) now becomes
(a) ∆X ∈ L
m
0
(a ) ∆X = A∆x [∆x ∈ Rn ]
(b) ∆S ∈ L⊥ (4.5.7)
m
0
(b ) ∗
A ∆S = 0
(c) t+ [S̄ + ∆S] + ∇K(X̄) + [∇2 K(X̄)]∆X = 0;
13)
Strictly speaking, the outlined complexity considerations are applicable to the “highway” phase of the
solution process, after we once have reached the neighbourhood N0.1 of the central path. However, the results of
our considerations remain unchanged after the initialization expenses are taken into account, see Section 4.6.
292 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
the unknowns here are ∆X, ∆S and ∆x. To process the system, we eliminate ∆X via (a0 ) and
multiply both sides of (c) by A∗ , thus getting the equation
A∗ [∇2 K(X̄)]A ∆x + [t+ A∗ [S̄ + ∆S] + A∗ ∇K(X̄)] = 0. (4.5.8)
| {z }
H
thus getting a solution to (4.5.7). Restricting ourselves with the stepsizes α = β = 1 (see
(4.5.6)), we come to the “closed form” description of the method:
(a) t 7→ t+ > t
(b) x 7→ x+ = x + −[A (∇2 K(X))A]−1 [t+ c + A∗ ∇K(X)] ,
∗
| {z } (4.5.11)
∆x
(c) S 7→ S+ = −t−1 2
+ [∇K(X) + [∇ K(X)]A∆x],
where x is the current iterate in the space Rn of design variables and X = Ax − B is its image
in the space E.
The resulting scheme admits a quite natural explanation. Consider the function
F (x) = K(Ax − B);
you can immediately verify that this function is a barrier for the feasible set of (CP). Let also
Ft (x) = tcT x + F (x)
be the associated barrier-generated family of penalized objectives. Relation (4.5.11.b) says that
the iterates in the space of design variables are updated according to
x 7→ x+ = x − [∇2 Ft+ (x)]−1 ∇Ft+ (x),
i.e., the process in the space of design variables is exactly the process (4.2.1) from Section 4.2.3.
Note that (4.5.11) is, essentially, a purely primal process (this is where the name of the
method comes from). Indeed, the dual iterates S, S+ just do not appear in formulas for x+ , X+ ,
and in fact the dual solutions are no more than “shadows” of the primal ones.
Remark 4.5.1 When constructing the primal path-following method, we have started with the
augmented slackness equations in form (4.5.2). Needless to say, we could start our developments
with the same conditions written down in the “swapped” form
X + t−1 ∇K(S) = 0
as well, thus coming to what is called “dual path-following method”. Of course, as applied to a
given pair (P), (D), the dual path-following method differs from the primal one. However, the
constructions and results related to the dual path-following method require no special care – they
can be obtained from their “primal counterparts” just by swapping “primal” and “dual” entities.
4.5. TRACING THE CENTRAL PATH 293
The complexity analysis of the primal path-following method can be summarized in the following
Theorem 4.5.1 Let 0 < χ ≤ κ ≤ 0.1. Assume that we are given a starting point (t0 , x0 , S0 )
such that t0 > 0 and the point
(X0 = Ax0 − B, S0 )
is κ-close to Z∗ (t0 ):
dist((X0 , S0 ), Z∗ (t0 )) ≤ κ.
Starting with (t0 , x0 , X0 , S0 ), let us iterate process (4.5.11) equipped with the penalty updating
policy
!
χ
t+ = 1+ p t (4.5.12)
θ(K)
ti = 1 + √ χ ti−1 ,
θ(K)
xi = xi−1 − [A∗ (∇2 K(Xi−1 ))A] −1
[ti c + A∗ ∇K(Xi−1 )],
| {z }
∆xi
Xi = Axi − B,
−1
Si = −ti [∇K(Xi−1 ) + [∇2 K(Xi−1 )]A∆xi ]
The resulting process is well-defined and generates strictly primal-dual feasible pairs (Xi , Si ) such
that (ti , Xi , Si ) stay in the neighbourhood Nκ of the primal-dual central path.
The theorem says that with properly chosen κ, χ (e.g., κ = χ = 0.1) we can, getting once close to
the primal-dual central path, trace it by the primal path-following method, keeping the iterates
in Nκ -neighbourhood
p of the path and increasing the penalty parameter by an absolute constant
factor every O( θ(K)) steps – exactly as it was claimed in Sections 4.2.3, 4.5.2. This fact is
extremely important theoretically; in particular, it underlies the polynomial time complexity
bounds for LP, CQP and SDP from Section 4.6 below. As a practical tool, the primal and the
dual path-following methods, at least in their short-step form presented above, are not that
attractive. The computational power of the methods can be improved by passing to appropriate
large-step versions of the algorithms, but even these versions are thought of to be inferior as
compared to “true” primal-dual path-following methods (those which “indeed work with both
(P) and (D)”, see below). There are, however, cases when the primal or the dual path-following
scheme seems to be unavoidable; these are, essentially, the situations where the pair (P), (D) is
“highly asymmetric”, e.g., (P) and (D) have different by order of magnitudes design dimensions
dim L, dim L⊥ . Here it becomes too expensive computationally to treat (P), (D) in a “nearly
symmetric way”, and it is better to focus solely on the problem with smaller design dimension.
294 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
To get an impression of how the primal path-following method works, here is a picture:
What you see is the 2D feasible set of a toy SDP (K = S3+ ). “Continuous curve” is the primal central
path; dots are iterates xi of the algorithm. We cannot draw the dual solutions, since they “live” in 4-
dimensional space (dim L⊥ = dim S3 − dim L = 6 − 2 = 4)
Recall that our generic scheme of a path-following IP method suggests, given a current triple
(t̄, X̄, S̄) with positive t̄ and strictly primal, respectively, dual feasible X̄ and S̄, to update the
this triple into a new triple (t+ , X+ , S+ ) of the same type as follows:
(i) First, we somehow rewrite the system (4.5.13) as an equivalent system
(ii) Second, we choose somehow a new value t+ > t̄ of the penalty parameter and linearize
system (4.5.14) (with t set to t+ ) at the point (X̄, S̄), thus coming to the system of linear
equations
∂ Ḡt+ (X̄, S̄) ∂ Ḡt+ (X̄, S̄)
∆X + ∆S = −Ḡt+ (X̄, S̄), (4.5.15)
∂X ∂S
for the “corrections” (∆X, ∆S);
We add to (4.5.15) the system of linear equations on ∆X, ∆S expressing the requirement that
a shift of (X̄, S̄) in the direction (∆X, ∆S) should preserve the validity of the linear constraints
in (P), (D), i.e., the equations saying that ∆X ∈ L, ∆S ∈ L⊥ . These linear equations can be
written down as
∆X = A∆x [⇔ ∆X ∈ L]
(4.5.16)
A∗ ∆S = 0 [⇔ ∆S ∈ L⊥ ]
(iii) We solve the system of linear equations (4.5.15), (4.5.16), thus obtaining a primal-dual
search direction (∆X, ∆S), and update current iterates according to
X+ = X̄ + α∆x, S+ = S̄ + β∆S
where the primal and the dual stepsizes α, β are given by certain “side requirements”.
The major “degree of freedom” of the construction comes from (i) – from how we construct
the system (4.5.14). A very popular way to handle (i), the way which indeed leads to primal-dual
methods, starts from rewriting (4.5.13) in a form symmetric w.r.t. X and S. To this end we
first observe that (4.5.13) is equivalent to every one of the following two matrix equations:
XS = t−1 I; SX = t−1 I.
XS + SX = 2t−1 I, (4.5.17)
which, by its origin, is a consequence of (4.5.13). On a closest inspection, it turns out that
(4.5.17), regarded as a matrix equation with positive definite symmetric matrices, is equivalent
to (4.5.13). It is possible to use in the role of (4.5.14) the matrix equation (4.5.17) as it is;
this policy leads to the so called AHO (Alizadeh-Overton-Haeberly) search direction and the
“XS + SX” primal-dual path-following method.
It is also possible to use a “scaled” version of (4.5.17). Namely, let us choose somehow a
positive definite scaling matrix Q and observe that our original matrix equation (4.5.13) says
that S = t−1 X −1 , which is exactly the same as to say that Q−1 SQ−1 = t−1 (QXQ)−1 ; the latter,
in turn, is equivalent to every one of the matrix equations
Se = Q−1 S̄Q−1 , X
b = QX̄Q
commute (X̄, S̄ are the iterates to be updated); we call such a policy a “commutative scaling”.
Popular commutative scalings are:
1. Q = S̄ 1/2 (Se = I, X
b = S̄ 1/2 X̄ S̄ 1/2 ) (the “XS” method);
3. Q is such that Se = X
b (the NT (Nesterov-Todd) method, extremely attractive and deep)
1/4
S̄
If X̄ and S̄ were just positive reals, the formula for Q would be simple: Q = X̄
.
In the matrix case this simple formula becomes a bit more complicated (to make our
life easier, below we write X instead of X̄ and S instead of S̄):
P ? =? P T
m
X −1/2 (X 1/2 SX 1/2 )−1/2 X 1/2 S ? =? SX 1/2 (X 1/2 SX 1/2 )−1/2 X −1/2
m
X −1/2 (X 1/2 SX 1/2 )−1/2 X 1/2 S X 1/2 (X 1/2 SX 1/2 )1/2 X −1/2 S −1 ? =? I
m
X −1/2 (X 1/2 SX 1/2 )−1/2 (X 1/2 SX 1/2 )(X 1/2 SX 1/2 )1/2 X −1/2 S −1 ? =? I
m
X −1/2 (X 1/2 SX 1/2 )X −1/2 S −1 ? =? I
and the concluding ? =? indeed is =.
Now let us verify that P is positive definite. Recall that the spectrum of the product of
two square matrices, symmetric or not, remains unchanged when swapping the factors.
Therefore, denoting σ(A) the spectrum of A, we have
and the argument of the concluding σ(·) clearly is a positive definite symmetric matrix.
Thus, the spectrum of symmetric matrix P is positive, i.e., P is positive definite.
(b): To verify that QXQ = Q−1 SQ−1 , i.e., that P 1/2 XP 1/2 = P −1/2 SP −1/2 , is the
same as to verify that P XP = S. The latter equality is given by the following compu-
tation:
P XP = X −1/2 (X 1/2 SX 1/2 )−1/2 X 1/2 S X X −1/2 (X 1/2 SX 1/2 )−1/2 X 1/2 S
= X −1/2 (X 1/2 SX 1/2 )−1/2 (X 1/2 SX 1/2 )(X 1/2 SX 1/2 )−1/2 X 1/2 S
= X −1/2 X 1/2 S
= S.
You should not think that Nesterov and Todd guessed the formula for this scaling ma-
trix. They did much more: they have developed an extremely deep theory (covering the
general LP-CQP-SDP case, not just the SDP one!) which, among other things, guar-
antees that the desired scaling matrix exists (and even is unique). After the existence
is established, it becomes much easier (although still not that easy) to find an explicit
formula for Q.
Scalings. We already have mentioned what a scaling of Sk+ is: this is the linear one-to-one transfor-
mation of Sk given by the formula
H 7→ QHQT , (Scl)
where Q is a nonsingular scaling matrix. It is immediately seen that (Scl) is a symmetry of the semidefinite
cone Sk+ – it maps the cone onto itself. This family of symmetries is quite rich: for every pair of points
A, B from the interior of the cone, there exists a scaling which maps A onto B, e.g., the scaling
1/2 −1/2 −1/2 1/2
H 7→ (B
| {z A })H(A
| {zB }).
Q QT
Essentially, this is exactly the existence of that rich family of symmetries of the underlying cones which
makes SDP (same as LP and CQP, where the cones also are “perfectly symmetric”) especially well suited
for IP methods.
In what follows we will be interested in scalings associated with positive definite scaling matrices.
The scaling given by such a matrix Q (X,S,...) will be denoted by Q (resp., X ,S,...):
Q[H] = QHQ.
Given a problem of interest (CP) (where K = Sk+ ) and a scaling matrix Q 0, we can scale the problem,
i.e., pass from it to the problem
min cT x : Q [Ax − B] 0
Q(CP)
x
which, of course, is equivalent to (CP) (since Q[H] is positive semidefinite iff H is so). In terms of
“geometric reformulation” (P) of (CP), this transformation is nothing but the substitution of variables
• a primal-dual feasible pair (X, S) of solutions to (P), (D) – it is converted to the pair (X
b =
−1 −1
QXQ, S = Q SQ ), which, as it is immediately seen, is a pair of feasible solutions to (P), (D).
e b e
Note that the primal-dual scaling preserves strict feasibility and the duality gap:
DualityGapP,D (X, S) = Tr(XS) = Tr(QXSQ−1 ) = Tr(X
b S)
e = DualityGap (X,
P,D
b e
b S);
e
• the primal-dual central path (X∗ (·), S∗ (·)) of (P), (D); it is converted into the curve (X
b∗ (t) =
−1 −1
QX∗ (t)Q, S∗ (t) = Q S∗ (t)Q ), which is nothing but the primal-dual central path Z(t) of the
e
primal-dual pair (P),
b (D).
e
The latter fact can be easily derived from the characterization of the primal-dual central path; a
more instructive derivation is based on the fact that our “hero” – the barrier Sk (·) – is “semi-
invariant” w.r.t. scaling:
Sk (Q(X)) = − ln Det(QXQ) = − ln Det(X) − 2 ln Det(Q) = Sk (X) + const(Q).
Now, a point on the primal central path of the problem (P) b associated with penalty parameter t,
let this point be temporarily denoted by Y (t), is the unique minimizer of the aggregate
Skt (Y ) = thQ−1 CQ−1 , Y iF + Sk (Y ) ≡ tTr(Q−1 CQ−1 Y ) + Sk (Y )
over the set of strictly feasible solutions to (P). We see that X(t) is exactly the point X∗ (t) on
the primal central path associated with problem (P). Thus, the point Y (t) of the primal central
path associated with (P)b is nothing but X b∗ (t) = QX∗ (t)Q. Similarly, the point of the central path
associated with the problem (D) e is exactly Se∗ (t) = Q−1 S∗ (t)Q−1 .
• the neighbourhood Nκ of the primal-dual central path Z(·) associated with the pair of problems
(P), (D) (see (4.4.8)). As you can guess, the image of Nκ is exactly the neighbourhood N κ , given
by (4.4.8), of the primal-dual central path Z(·) of (P),
b (D).
e
The latter fact is immediate: for a pair (X, S) of strictly feasible primal and dual solutions to (P),
(D) and a t > 0 we have (see (4.4.6)):
= Tr X(tS − X −1 )X(tS − X −1 )
= dist2 ((X, S), Z∗ (t)).
we
1. Choose the new value t+ of the penalty parameter according to
−1
χ
t+ = 1− √ t̄, (4.5.21)
k
where χ ∈ (0, 1) is a parameter of the method;
2. Choose somehow the scaling matrix Q 0 such that the matrices X
b = QX̄Q and
−1 −1
S = Q S̄Q commute with each other;
e
3. Linearize the equation
2
QXSQ−1 + Q−1 SXQ = I
t+
at the point (X̄, S̄), thus coming to the equation
2
Q[∆XS +X∆S]Q−1 +Q−1 [∆SX +S∆X]Q = I −[QX̄ S̄Q−1 +Q−1 S̄ X̄Q]; (4.5.22)
t+
∆X ∈ L,
(4.5.23)
∆S ∈ L⊥ ;
5. Solve system (4.5.22), (4.5.23), thus getting “primal-dual search direction” (∆X, ∆S);
300 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
6. Update current primal-dual solutions (X̄, S̄) into a new pair (X+ , S+ ) according to
X+ = X̄ + ∆X, S+ = S̄ + ∆S.
We already have explained the ideas underlying (U), up to the fact that in our previous explanations we
dealt with three “independent” entities t̄ (current value of the penalty parameter), X̄, S̄ (current primal
and dual solutions), while in (U) t̄ is a function of X̄, S̄:
k
t̄ = . (4.5.24)
Tr(X̄ S̄)
The reason for establishing this dependence is very simple: if (t, X, S) were on the primal-dual central
path: XS = t−1 I, then, taking traces, we indeed would get t = Tr(XS)k
. Thus, (4.5.24) is a reasonable
way to reduce the number of “independent entities” we deal with.
Note also that (U) is a “pure Newton scheme” – here the primal and the dual stepsizes are equal to
1 (cf. (4.5.6)).
The major element of the complexity analysis of path-following polynomial time methods for SDP is
as follows:
Let, further, (X̄, S̄) be a pair of strictly feasible primal and dual solutions to (P), (D) such that the triple
(4.5.19) satisfies (4.5.20). Then the updated pair (X+ , S+ ) is well-defined (i.e., system (4.5.22), (4.5.23)
is solvable with a unique solution), X+ , S+ are strictly feasible solutions to (P), (D), respectively,
k
t+ =
Tr(X+ S+ )
The theorem says that with properly chosen κ, χ (say, κ = χ = 0.1), updating (U) converts a close to
the primal-dual central path, in the sense of (4.5.20), strictly primal-dual feasible iterate (X̄, S̄) into
a new strictly primal-dual feasible iterate with the same closeness-to-the-path property and larger, by
factor (1 + O(1)k −1/2 ), value of the penalty parameter. Thus, after we get close to the path – reach
its 0.1-neighbourhood N0.1 – we are able to trace
√ this path,
p staying in N0.1 and increasing the penalty
parameter by absolute constant factor in O( k) = O( θ(K)) steps, exactly as announced in Section
4.5.2.
Proof of Theorem 4.5.2. 10 . Observe, first (this observation is crucial!) that it suffices to prove our
Theorem in the particular case when X̄, S̄ commute with each other and Q = I. Indeed, it is immediately
seen that the updating (U) can be represented as follows:
1. We first scale by Q the “input data” of (U) – the primal-dual pair of problems (P), (D) and the
strictly feasible pair X̄, S̄ of primal and dual solutions to these problems, as explained in sect.
“Scaling”. Note that the resulting entities – a pair of primal-dual problems and a strictly feasible
pair of primal-dual solutions to these problems – are linked with each other exactly in the same
fashion as the original entities, due to scaling invariance of the duality gap and the neighbourhood
Nκ . In addition, the scaled primal and dual solutions commute;
2. We apply to the “scaled input data” yielded by the previous step the updating (U)
b completely
similar to (U), but using the unit matrix in the role of Q;
3. We “scale back” the result of the previous step, i.e., subject this result to the scaling associated
with Q−1 , thus obtaining the updated iterate (X + , S + ).
4.5. TRACING THE CENTRAL PATH 301
Given that the second step of this procedure preserves primal-dual strict feasibility, w.r.t. the scaled
primal-dual pair of problems, of the iterate and keeps the iterate in the κ-neighbourhood Nκ of the
corresponding central path, we could use once again the “scaling invariance” reasoning to assert that the
result (X + , S + ) of (U) is well-defined, is strictly feasible for (P), (D) and is close to the original central
path, as claimed in the Theorem. Thus, all we need is to justify the above “Given”, and this is exactly
the same as to prove the theorem in the particular case of Q = I and commuting X̄, S̄. In the rest of
the proof we assume that Q = I and that the matrices X̄, S̄ commute with each other. Due to the latter
property, X̄, S̄ are diagonal in a properly chosen orthonormal basis; representing all matrices from Sk in
this basis, we can reduce the situation to the case when X̄ and S̄ are diagonal. Thus, we may (and do)
assume in the sequel that X̄ and S̄ are diagonal, with diagonal entries xi ,si , i = 1, ..., k, respectively, and
that Q = I. Finally, to simplify notation, we write t, X, S instead of t̄, X̄, S̄, respectively.
20 . Our situation and goals now are as follows. We are given orthogonal to each other affine planes
L − B, L⊥ + C in Sk and two positive definite diagonal matrices X = Diag({xi }) ∈ L − B, S =
Diag({si }) ∈ L⊥ + C. We set
1 Tr(XS)
µ= =
t k
and know that
ktX 1/2 SX 1/2 − Ik2 ≤ κ.
We further set
1
µ+ = = (1 − χk −1/2 )µ (4.5.26)
t+
and consider the system of equations w.r.t. unknown symmetric matrices ∆X, ∆S:
(a) ∆X ∈ L
(b) ∆S ∈ L⊥ (4.5.27)
(c) ∆XS + X∆S + ∆SX + S∆X = 2µ+ I − 2XS
We should prove that the system has a unique solution such that the matrices
X+ = X + ∆X, S+ = S + ∆S
are
(i) positive definite,
(ii) belong, respectively, to L − B, L⊥ + C and satisfy the relation
Tr(X+ S+ ) = µ+ k; (4.5.28)
Tr(XS 0 ) = k × 1
and
kX 1/2 S 0 X 1/2 − Ik2 = kµ−1 X 1/2 SX 1/2 − Ik2 ≤ κ.
The “we should prove” part becomes: to verify that the system of equations
(a) ∆X ∈ L
(b) ∆S 0 ∈ L⊥
(c) ∆XS 0 + X∆S 0 + ∆S 0 X + S 0 ∆X = 2(1 − χk −1/2 )I − 2XS 0
302 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
0
has a unique solution and that the matrices X+ = X + ∆X, S+ = S 0 + ∆S+
0
are positive definite, are
⊥ 0
contained in L − B, respectively, L + C and satisfy the relations
0 µ+
Tr(X+ S+ )= = 1 − χk −1/2
µ
and
1/2 1/2
k(1 − χk −1/2 )−1 X+ S+
0
X+ − Ik2 ≤ κ.
Thus, the general situation indeed can be reduced to the one with µ = 1, µ+ = 1 − χk −1/2 , and we loose
nothing assuming, in addition to what was already postulated, that
Tr(XS)
µ ≡ t−1 ≡ = 1, µ+ = 1 − χk −1/2 ,
k
whence
k
X
[Tr(XS) =] xi s i = k (4.5.30)
i=1
and
n
X
[ktX 1/2 SX 1/2 − Ik22 ≡] (xi si − 1)2 ≤ κ2 . (4.5.31)
i=1
30 . We start with proving that (4.5.27) indeed has a unique solution. It is convenient to pass in
(4.5.27) from the unknowns ∆X, ∆S to the unknowns
has only trivial solution. Let (δX, δS) be a solution to the homogeneous system. Relation L(δX, ∆S) = 0
means that
ψij
(δX)ij = − (δS)ij , (4.5.34)
φij
4.5. TRACING THE CENTRAL PATH 303
whence
X ψij
Tr(δXδS) = − (∆S)2ij . (4.5.35)
i,j
φij
Tr(δXδS) = Tr(X −1/2 ∆XX −1/2 X 1/2 ∆SX 1/2 ) = Tr(X −1/2 ∆X∆SX 1/2 ) = Tr(∆X∆S),
and the latter quantity is 0 due to ∆X = X 1/2 δXX 1/2 ∈ L and ∆S = X −1/2 δSX −1/2 ∈ L⊥ . Thus, the
left hand side in (4.5.35) is 0; since φij > 0, ψij > 0, (4.5.35) implies that δS = 0. But then δX = 0 in
view of (4.5.34). Thus, the homogeneous version of (4.5.33) has the trivial solution only, so that (4.5.33)
is solvable with a unique solution.
40 . Let δX, δS be the unique solution to (4.5.33), and let ∆X, ∆S be linked to δX, δS according to
(4.5.32). Our local goal is to bound from above the Frobenius norms of δX and δS.
From (4.5.33.c) it follows (cf. derivation of (4.5.35)) that
(δS)ij + 2 µ+ φ−x
ψ
ij i si
(a) (δX)ij = − φij ii
δij , i, j = 1, ..., k;
(4.5.36)
2 µ+ψ−x
φ i si
(b) (δS)ij = − ψij
ij
(δX)ij + ii
δij , i, j = 1, ..., k.
Multiplying (4.5.36.a) by (δS)ij and taking sum over i, j, we get, in view of (4.5.37), the relation
X ψij X µ + − xi s i
(δS)2ij = 2 (δS)ii ; (4.5.38)
i,j
φij i
φii
Now let
θ i = xi s i , (4.5.40)
so that in view of (4.5.30) and (4.5.31) one has
P
(a) θi = k,
P i (4.5.41)
(b) (θi − 1)2 ≤ κ2 .
i
Observe that r
√ √ θi θj xi xj
r
φij = xi xj (si + sj ) = xi xj + = θj + θi .
xi xj xj xi
Thus, q
q x
φij = θj xxji + θi xji ,
q q
xj
(4.5.42)
xi
ψij = xj + xi ;
φij
1−κ≤ ≤ 1 + κ. (4.5.43)
ψij
304 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
r rP [see (4.5.44)]
(δX)2ij
P
≤ χ2 +(1 − θi )2
i i,j
P
[since (1 − θi ) = 0 by (4.5.41.a)]
p rP i
≤ χ2 + κ2 (δX)2ij
i,j
[see (4.5.41.b)]
and from the resulting inequality it follows that
p
χ2 + κ2
kδXk2 ≤ ρ ≡ . (4.5.45)
1−κ
Similarly,
P ψij
(1 + κ)−1 (δS)2ij (δS)2ij
P
≤ φij
i,j i,j
[see (4.5.43)]
P µ+ −xi si
≤ 2 φii (δS)ii
i
rP rP[see (4.5.38)]
≤ 2 (µ+ − xi si φ−2 2)2
ij (δS)ii
i i
rP rP
≤ (1 − κ)−1 (µ+ − θi )2 (δS)2ij
i i,j
[see (4.5.44)]
rP
p
−1
≤ (1 − κ) 2
χ +κ2 (δS)2ij
i,j
[same as above]
and from the resulting inequality it follows that
p
(1 + κ) χ2 + κ2
kδSk2 ≤ = (1 + κ)ρ. (4.5.46)
1−κ
50 . We are ready to prove 20 .(i-ii). We have
X+ = X + ∆X = X 1/2 (I + δX)X 1/2 ,
and the matrix I + δX is positive definite due to (4.5.45) (indeed, the right hand side in (4.5.45) is ρ ≤ 1,
whence the Frobenius norm (and therefore - the maximum of moduli of eigenvalues) of δX is less than
1). Note that by the just indicated reasons I + δX (1 + ρ)I, whence
X+ (1 + ρ)X. (4.5.47)
4.5. TRACING THE CENTRAL PATH 305
is positive definite. Indeed, the eigenvalues of the matrix X 1/2 SX 1/2 are ≥ min θi ≥ 1 − κ, while
√ i
(1+κ) χ2 +κ2
the moduli of eigenvalues of δS, by (4.5.46), do not exceed 1−κ < 1 − κ. Thus, the matrix
1/2 1/2
X SX + δS is positive definite, whence S+ also is so. We have proved 20 .(i).
20 .(ii) is easy to verify. First, by (4.5.33), we have ∆X ∈ L, ∆S ∈ L⊥ , and since X ∈ L − B,
S ∈ L⊥ + C, we have X+ ∈ L − B, S+ ∈ L⊥ + C. Second, we have
20 .(ii) is proved.
60 . It remains to verify 20 .(iii). We should bound from above the quantity
and our plan is first to bound from above the “close” quantity
−1
Z = X 1/2 (S+ − µ+ X+ )X 1/2
= 1/2
X (S + ∆S)X 1/2
− µ+ X 1/2 [X + ∆X]−1 X 1/2
= XS + δS − µ+ X [X 1/2 (I + δX)X 1/2 ]−1 X 1/2
1/2
[see (4.5.32)]
= XS + δS − µ+ (I + δX)−1
= XS + δS − µ+ (I − δX) − µ+ [(I + δX)−1 − I + δX]
= XS + δS + δX − µ+ I + (µ+ − 1)δX + µ+ [I − δX − (I + δX)−1 ],
| {z } | {z } | {z }
Z1 Z2 Z3
so that
kZk2 ≤ kZ 1 k2 + kZ 2 k2 + kZ 3 k2 . (4.5.49)
We are about to bound separately all 3 terms in the right hand side of the latter inequality.
Bounding kZ 2 k2 : We have
1
Zij = (XS)ij + (δS)ij + (δX)ij − µ+ δij
= (δX)ij h+ (δS)iji+ (x h i si − µ+ )δij i
2 µ+ψ−x
φ i si
= (δX)ij 1 − ψij ij
+ ii
+ xi si − µ+ δij
h i [we have used (4.5.36.b)]
φij
= (δX)ij 1 − ψij
[since ψii = 2, see (4.5.42)]
1
1 κ
|Zij | ≤ 1 − |(δX)ij | = |(δX)ij |,
1−κ 1−κ
so that
κ κ
kZ 1 k2 ≤ kδXk2 ≤ ρ (4.5.52)
1−κ 1−κ
χ ρ κ
kZk2 ≤ ρ √ + + ,
k 1−ρ 1−κ
whence, by (4.5.48),
ρ χ ρ κ
b≤
Ω √ + + . (4.5.53)
1 − χk −1/2 k 1−ρ 1−κ
4.5. TRACING THE CENTRAL PATH 307
−1 1/2 1/2
Ω2 = kµ+ X+ S+ X+ − Ik22
1/2 1/2 2
= kX+ [µ−1 −1
+ S+ − X+ ] X+ k2
| {z }
Θ=ΘT
1/2 1/2
= Tr X+ ΘX+ ΘX+
1/2 1/2
≤ (1 + ρ)Tr X+ ΘXΘX+
[see (4.5.47)]
1/2 1/2
= (1 + ρ)Tr X+ ΘX 1/2 X 1/2 ΘX+
1/2 1/2
= (1 + ρ)Tr X 1/2 ΘX+ X+ ΘX 1/2
= (1 + ρ)Tr X 1/2 ΘX+ ΘX 1/2
≤ (1 + ρ)2 Tr X 1/2 ΘXΘX 1/2
[the same (4.5.47)]
= (1 + ρ)2 kX 1/2 ΘX 1/2 k22
= (1 + ρ)2 kX 1/2 [µ−1
+ S+ − X+ ]X
−1 1/2 2
k2
2 b2
= (1 + ρ) Ω
[see (4.5.48)]
so that
h i
ρ(1+ρ) χ ρ κ
Ω ≤ (1 + ρ)Ω
b=
1−χk√−1/2
√
k
+ 1−ρ + 1−κ ,
(4.5.54)
χ2 +κ2
ρ = 1−κ .
Remark 4.5.2 We have carried out the complexity analysis for a large group of primal-dual
path-following methods for SDP (i.e., for the case of K = Sk+ ). In fact, the constructions and
the analysis we have presented can be word by word extended to the case when K is a direct
product of semidefinite cones – you just should bear in mind that all symmetric matrices we
deal with, like the primal and the dual solutions X, S, the scaling matrices Q, the primal-dual
search directions ∆X, ∆S, etc., are block-diagonal with common block-diagonal structure. In
particular, our constructions and analysis work for the case of LP – this is the case when K
is a direct product of one-dimensional semidefinite cones. Note that in the case of LP Zhang’s
family of primal-dual search directions reduces to a single direction: since now X, S, Q are
diagonal matrices, the scaling (4.5.17) 7→ (4.5.18) does not vary the equations of augmented
complementary slackness.
The recipe to translate all we have presented for the case of SDP to the case of LP is very
simple: in the above text, you should assume all matrices like X, S,... to be diagonal and look
what the operations with these matrices required by the description of the method do with their
diagonals. By the way, one of the very first approaches to the design and the analysis of IP
methods for SDP was exactly opposite: you take an IP scheme for LP, replace in its description
the words “nonnegative vectors” with “positive semidefinite diagonal matrices” and then erase
the adjective “diagonal”.
308 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
4.5.2; note, however, that the complexity bounds to follow take into account the necessity to
“reach the highway” – to come close to the central path before tracing it, while in Section
4.5.2 we were focusing on how fast could we reduce the duality gap after the central path (“the
highway”) is reached.
Along with complexity bounds expressed in terms of the Newton complexity, we present the
bounds on the number of operations of Real Arithmetic required to build an -solution. Note
that these latter bounds typically are conservative – when deriving them, we assume the data
of an instance “completely unstructured”, which is usually not the case (cf. Warning in Section
4.5.2); exploiting structure of the data, one usually can reduce significantly computational effort
per step of an IP method and consequently – the arithmetic cost of -solution.
4.6.1 Complexity of LP b
Family of problems:
Data:
Data(p) = [m; n; c; a1 , b1 ; ...; am , bm ; R],
Size(p) ≡ dim Data(p) = (m + 1)(n + 1) + 2.
Data:
Data(P ) = [m; n; k1 , ..., km ; c; A1 , b1 , c1 , d1 ; ...; Am , bm , cm , dm ; R],
m
Size(p) ≡ dim Data(p) = (m +
P
ki )(n + 1) + m + n + 3.
i=1
(i)
where Aj , j = 0, 1, ..., n, are symmetric block-diagonal matrices with m diagonal blocks Aj of
sizes ki × ki , i = 1, ..., m.
Data:
(1) (m) (1) (m)
Data(p) = [m; n; k1 , ...km ;
c; A0 , ..., A0 ; ...; An , ..., An ; R],
m
ki (ki +1)
Size(p) ≡ dim Data(P ) =
P
2 (n + 1) + m + n + 3.
i=1
310 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
kxk2 ≤ R,
n
−I,
P
A0 + x j Aj
j=1
cT x ≤ Opt(p) + .
the influence of the size of an SDP/CQP program on the complexity of its solving by an IP
method is twofold:
– first, the size affects the Newton complexity of the process. Theoretically, the number of
steps
p required to reduce the duality gap by a constant factor, say, factor 2, is proportional to
θ(K) (θ(K) is twice the total # of conic quadratic inequalities for CQP and the total row
size of LMI’s for SDP). Thus, we could expect an unpleasant growth of the iteration count with
θ(K). Fortunately, the iteration count for good IP methods usually is much less than the one
given by the worst-case complexity analysis and is typically about few tens, independently of
θ(K).
– second, the larger is the instance, the larger is the system of linear equations one should
solve to generate new primal (or primal-dual) search direction, and, consequently, the larger is
the computational effort per step (this effort is dominated by the necessity to assemble and to
solve the linear system). Now, the system to be solved depends, of course, on what is the IP
method we are speaking about, but it newer is simpler (and for most of the methods, is not
more complicated as well) than the system (4.5.8) arising in the primal path-following method:
The size n of this system is exactly the design dimension of problem (CP).
In order to process (Nwt), one should assemble the system (compute H and h) and then solve
it. Whatever is the cost of assembling (Nwt), you should be able to store the resulting matrix H
in memory and to factorize the matrix in order to get the solution. Both these problems – storing
and factorizing H – become prohibitively expensive when H is a large dense15) matrix. (Think
how happy you will be with the necessity to store 5000×50012 = 12, 502, 500 reals representing a
3
dense 5000 × 5000 symmetric matrix H and with the necessity to perform ≈ 5000 6 ≈ 2.08 × 1010
arithmetic operations to find its Choleski factor).
The necessity to assemble and to solve large-scale systems of linear equations is intrinsic
for IP methods as applied to large-scale optimization programs, and in this respect there is no
difference between LP and CQP, on one hand, and SDP, on the other hand. The difference is
in how difficult is to handle these large-scale linear systems. In real life LP’s-CQP’s-SDP’s, the
structure of the data allows to assemble (Nwt) at a cost negligibly small as compared to the
cost of factorizing H, which is a good news. Another good news is that in typical real world
LP’s, and to some extent for real-world CQP’s, H turns out to be “very well-structured”, which
reduces dramatically the expenses required by factorizing the matrix and storing the Choleski
factor. All practical IP solvers for LP and CQP utilize these favourable properties of real life
problems, and this is where their ability to solve problems with tens/hundreds thousands of
variables and constraints comes from. Spoil the structure of the problem – and an IP method
will be unable to solve an LP with just few thousands of variables. Now, in contrast to real life
LP’s and CQP’s, real life SDP’s typically result in dense matrices H, and this is where severe
limitations on the sizes of “tractable in practice” SDP’s come from. In this respect, real life
CQP’s are somewhere in-between LP’s and SDP’s, so that the sizes of “tractable in practice”
CQP’s could be significantly larger than in the case of SDP’s.
It should be mentioned that assembling matrices of the linear systems we are interested in
and solving these systems by the standard Linear Algebra techniques is not the only possible
way to implement an IP method. Another option is to solve these linear systems by iterative
15)
I.e., with O(n2 ) nonzero entries.
312 LECTURE 4. POLYNOMIAL TIME INTERIOR POINT METHODS
methods. With this approach, all we need to solve a system like (Nwt) is a possibility to multiply
a given vector by the matrix of the system, and this does not require assembling and storing
in memory the matrix itself. E.g., to multiply a vector ∆x by H, we can use the multiplicative
representation of H as presented in (Nwt). Theoretically, the outlined iterative schemes, as
applied to real life SDP’s, allow to reduce by orders of magnitudes the arithmetic cost of building
search directions and to avoid the necessity to assemble and store huge dense matrices, which is
an extremely attractive opportunity. The difficulty, however, is that the iterative schemes are
much more affected by rounding errors that the usual Linear Algebra techniques; as a result,
for the time being “iterative-Linear-Algebra-based” implementation of IP methods is no more
than a challenging goal.
Although the sizes of SDP’s which can be solved with the existing codes are not that im-
pressive as those of LP’s, the possibilities offered to a practitioner by SDP IP methods could
hardly be overestimated. Just ten years ago we could not even dream of solving an SDP with
more than few tens of variables, while today we can solve routinely 20-25 times larger SDP’s,
and we have all reasons to believe in further significant progress in this direction.
Lecture 5
313
314 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
γ-quants penetrate the surrounding tissue and are registered outside the patient by a PET scanner
consisting of circular arrays (rings) of gamma radiation detectors. Since the two gamma rays are emitted
simultaneously and travel in almost exactly opposite directions, we can say a lot on the location of their
source: when a pair of detectors register high-energy γ-quants within a short (∼ 10−8 sec) time window
(“a coincidence event”), we know that the photons came from a disintegration act, and that the act took
place on the line (“line of response” (LOR)) linking the detectors. The measured data set is the collection
of numbers of coincidences counted by different pairs of detectors (“bins”), and the problem is to recover
from these measurements the 3D density of the tracer.
The mathematical model of the process, after appropriate discretization, is
y = P λ + ξ,
where
• λ ≥ 0 is the vector representing the (discretized) density of the tracer; the entries of λ are indexed
by voxels – small cubes into which we partition the field of view, and λj is the mean density of
the tracer in voxel j. Typically, the number n of voxels is in the range from 3 × 105 to 3 × 106 ,
depending on the resolution of the discretization grid;
• y are the measurements; the entries in y are indexed by bins – pairs of detectors, and yi is the
number of coincidences counted by i-th pair of detectors. Typically, the dimension m of y – the
total number of bins – is millions (at least 3 × 106 );
• P is the projection matrix; its entries pij are the probabilities for a LOR originating in voxel j to
be registered by bin i. These probabilities are readily given by the geometry of the scanner;
• ξ is the measurement noise coming mainly from the fact that all physical processes underlying PET
are random. The standard statistical model for PET implies that yi , i = 1, ..., m, are independent
Poisson random variables with the expectations (P λ)i .
The problem we are interested in is to recover tracer’s density λ given measurements y. As far as
the quality of the result is concerned, the most attractive reconstruction scheme is given by the standard
in Statistics Likelihood Ratio maximization: denoting p(·|λ) the density, taken w.r.t. an appropriate
dominating measure, of the probability distribution of the measurements coming from λ, the estimate of
the unknown true value λ∗ of λ is
λ
b = argmin p(y|λ),
λ≥0
This is a nicely structured convex program (by the way, polynomially reducible to CQP and even LP).
The only difficulty – and a severe one – is in huge sizes of the problem: as it was already explained, the
number n of decision variables is at least 300, 000, while the number m of log-terms in the objective is in
the range from 3 × 106 to 25 × 106 .
At the present level of our knowledge, the design dimension n of order of tens and hundreds
of thousands rules out the possibility to solve a nonlinear convex program, even a well-structured
one, by polynomial time methods because of at least quadratic in n “blowing up” the arithmetic
cost of an iteration. When n is really large, all we can use are simple methods with linear in n
cost of an iteration. As a byproduct of this restriction, we cannot utilize anymore our knowledge
5.1. MOTIVATION: WHY SIMPLE METHODS? 315
of the analytic structure of the problem, since all known for the time being ways of doing so are
too expensive, provided that n is large. As a result, we are enforced to restrict ourselves with
black-box-oriented optimization techniques – those which use solely the possibility to compute
the values and the (sub)gradients of the objective and the constraints at a point. In Convex
Optimization, two types of “cheap” black-box-oriented optimization techniques are known:
• subgradient-type techniques, called also First Order algorithms, for nonsmooth convex
programs, including constrained ones.
Since the majority of applications are constrained, we restrict our exposition to the techniques
of the second type. We start with investigating of what, in principle, can be expected of black-
box-oriented optimization techniques.
where X is a convex compact set in a Euclidean space (which we w.l.o.g. identify with Rn )
and the objective f is a continuous convex function on Rn . Let us fix a family P(X) of convex
programs (CP) with X common for all programs from the family, so that such a program can be
identified with the corresponding objective, and the family itself is nothing but certain family
of convex functions on Rn . We intend to explain what is the Information-based complexity of
P(X) – informally, complexity of the family w.r.t. “black-box-oriented” methods. We start with
defining such a method as a routine B as follows:
1. When starting to solve (CP), B is given an accuracy > 0 to which the problem should
be solved and knows that the problem belongs to a given family P(X). However, B does
not know what is the particular problem it deals with.
2. In course of solving the problem, B has an access to the First Order oracle for f . This
oracle is capable, given
on input a point x ∈ Rn , to report on output what is the value f (x) and a subgradient
f 0 (x) of f at x.
B generates somehow a sequence of search points x1 , x2 , ... and calls the First Order oracle
to get the values and the subgradients of f at these points. The rules for building xt can be
arbitrary, except for the fact that they should be non-anticipating (causal): xt can depend
only on the information f (x1 ), f 0 (x1 ), ..., f (xt−1 ), f 0 (xt−1 ) on f accumulated by B at the
first t − 1 steps.
3. After certain number T = TB (f, ) of calls to the oracle, B terminates and outputs the
result zB (f, ). This result again should depend solely on the information on f accumulated
by B at the T search steps, and must be an -solution to (CP), i.e.,
– by the minimal number of steps in which B is capable to solve within accuracy every instance
of P(X). Finally, the Information-based complexity of the family P(X) of problems is defined
as
Compl() = min ComplB (),
B
the minimum being taken over all solution methods. Thus, the relation Compl() = N means,
first, that there exists a solution method B capable to solve within accuracy every instance of
P(X) in no more than N calls to the First Order oracle, and, second, that for every solution
method B there exists an instance of P(X) such that B solves the instance within the accuracy
in at least N steps.
Note that as far as black-box-oriented optimization methods are concerned, the information-
based complexity Compl() of a family P(X) is a lower bound on “actual” computational effort,
whatever it means, sufficient to find -solution to every instance of the family.
where
• O(1)’s are appropriately chosen positive absolute constants,
1
• (X) depends on the geometry of X, but never is less than n2
, where n is the dimen-
sion of X.
II. Complexity of finding solutions of fixed accuracy in high dimensions does depend on the
geometry of X. Here are 3 typical results:
(a) Let X be an n-dimensional box: X = {x ∈ Rn : kxk∞ ≤ 1}. Then
1 1 1
≤ ⇒ O(1)n ln( ) ≤ Compl() ≤ O(1)n ln( ). (5.1.3)
2
(b) Let X be an n-dimensional ball: X = {x ∈ Rn : kxk2 ≤ 1}. Then
1 O(1) O(1)
n≥ 2
⇒ 2 ≤ Compl() ≤ 2 . (5.1.4)
5.1. MOTIVATION: WHY SIMPLE METHODS? 317
1 O(1) O(ln n)
n≥ 2
⇒ 2 ≤ Compl() ≤ (5.1.5)
2
(in fact, O(1) in the lower bound can be replaced with O(ln n), provided that n >>
1
2
).
Since we are interested in extremely large-scale problems, the moral which we can extract from
the outlined results is as follows:
• I is discouraging: it says that we have no hope to guarantee high accuracy, like = 10−6 ,
when solving large-scale problems with black-box-oriented methods; indeed, with O(n) steps per
accuracy digit and at least O(n) operations per step (this many operations are required already
to input a search point to the oracle), the arithmetic cost per accuracy digit is at least O(n2 ),
which is prohibitively large for really large n.
• II is partly discouraging, partly encouraging. A bad news reported by II is that when X is
a box, which is the most typical situation in applications, we have no hope to solve extremely
large-scale problems, in a reasonable time, to guaranteed, even low, accuracy, since the required
number of steps should be at least of order of n. A good news reported by II is that there
exist situations where the complexity of minimizing a convex function within a fixed accuracy is
independent, or nearly independent, of the design dimension. Of course, the dependence of the
complexity bounds in (5.1.4) and (5.1.5) on is very bad and has nothing in common with being
polynomial in ln(1/); however, this drawback is tolerable when we do not intend to get high
accuracy. Another drawback is that there are not that many applications where the feasible set
is a ball or a hyperoctahedron. Note, however, that in fact we can save the most important
for us upper complexity bounds in (5.1.4) and (5.1.5) when requiring from X to be a subset of
a ball, respectively, of a hyperoctahedron, rather than to be the entire ball/hyperoctahedron.
This extension is not costless: we should simultaneously strengthen the normalization condition
(5.1.1). Specifically, we shall see that
B. The upper complexity bound in (5.1.4) remains valid when X ⊂ {x : kxk2 ≤ 1} and
S. The upper complexity bound in (5.1.5) remains valid when X ⊂ {x : kxk1 ≤ 1} and
Note that the “ball-like” case mentioned in B seems to be rather artificial: the Euclidean norm
associated with this case is a very natural mathematical entity, but this is all we can say in its
favour. For example, the normalization of the objective in B is that the Lipschitz constant of f
w.r.t. k · k2 is ≤ 1, or, which is the same, that the vector of the first order partial derivatives of f
should, at every point, be of k·k2 -norm not exceeding 1. In order words, “typical” magnitudes of
the partial derivatives of f should become smaller and smaller as the number of variables grows;
what could be the reasons for such a strange behaviour? In contrast to this, the normalization
condition imposed on f in S is that the Lipschitz constant of f w.r.t. k · k1 is ≤ 1, or, which is
the same, that the k · k∞ -norm of the vector of partial derivatives of f is ≤ 1. In other words,
the normalization is that the magnitudes of the first order partial derivatives of f should be ≤ 1,
and this normalization is “dimension-independent”. Of course, in B we deal with minimization
318 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
over subsets of the unit ball, while in S we deal with minimization over the subsets of the unit
hyperoctahedron, which is much smaller than the unit ball. However, there do exist problems
in reality where we should minimize over the standard simplex
X
∆n = {x ∈ Rn : x ≥ 0, xi = 1},
x
which indeed is a subset of the unit hyperoctahedron. For example, it turns out that the PET
Image Reconstruction problem (PET) is in fact the problem of minimization over the standard
simplex. Indeed, the optimality condition for (PET) reads
X pij
λj pj − yi P = 0, j = 1, ..., n;
i
pi` λ`
`
It follows that the optimal solution to (PET) remains unchanged when we add to the non-
negativity constraints λj ≥ 0 also the constraint
P
pj λj = B. Passing to the new variables
j
xj = B −1 pj λj , we further convert (PET) to the equivalent form
( )
min f (x) ≡ − : x ∈ ∆n
P P
yi ln( qij xj )
x
hi j i , (PET0 )
Bpij
qij = pj
Intermediate conclusion. The discussion above says that this perhaps is a good idea to
look for simple convex minimization techniques which, as applied to convex programs (CP)
with feasible sets of appropriate geometry, exhibit dimension-independent (or nearly dimension-
independent) and nearly optimal information-based complexity. We are about to present a
family of techniques of this type.
5.2. THE SIMPLEST: SUBGRADIENT DESCENT AND EUCLIDEAN BUNDLE LEVEL319
where
SD is the recurrence
xt+1 = ΠX (xt − γt f 0 (xt )) [x1 ∈ X] (SD)
where
• f 0 (x) is a subgradient of f at x:
Note: We always assume that int X 6= ∅ and that the subgradients f 0 (x) reported by the
First Order oracle at points x ∈ X satisfy the requirement
With this assumption, for every norm k · k on Rn and for every x ∈ X one has
|f (x) − f (y)|
kf 0 (x)k∗ ≡ max ξ T f 0 (x) ≤ Lk·k (f ) ≡ sup . (5.2.2)
ξ:kξk≤1 x6=y, kx − yk
x,y∈X
When, why and how SD converges? To analyse convergence of SD, we start with a simple
geometric fact:
(!) Let X ⊂ Rn be a closed convex set, and let x ∈ Rn . Then the vector e =
x − ΠX (x) forms an acute angle with every vector of the form y − ΠX (x), y ∈ X:
In particular,
whence
0 ≤ φ0 (0) = 2(ΠX (x) − x)T (y − ΠX (x)).
Consequently,
ky − xk22 = ky − ΠX (x)k22 + kΠX (x) − xk22 + 2(y − ΠX (x))T (ΠX (x) − x)
≥ ky − ΠX (x)k22 + kΠX (x) − xk22 .
T
Θ+ 12
P
γt2 kf 0 (xt )k22
t=T0
∀(T, T0 , T ≥ T0 ≥ 1) : T ≡ min f (xt ) − f∗ ≤ T (5.2.6)
t≤T P
γt
t=T0
Note that T is the non-optimality, in terms of f , of the best (with the smallest value of f )
solution xTbst found in course of the first T steps of SD.
Relation (5.2.6) allows to arrive at various convergence results.
lim T = 0. (5.2.7)
T →∞
one has √
Lk·k2 (f ) Θ
T ≡ min f (xt ) − f∗ ≤ O(1) √
T
, T ≥1 (5.2.9)
t≤T
is fixed. Moreover, when X is a Euclidean ball in Rn , this efficiency estimate “is as good as
an efficiency estimate of a black-box-oriented method can be”, provided that the dimension is
large:
!2
Vark·k2 ,X (f )
n≥
Bad news: Our “dimension-independent” efficiency estimate
• is pretty slow
1
10
0
10
−1
10
−2
10
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
We see that after rapid progress at the initial iterates, the method “gets stuck,” reflecting the
slow convergence rate.
322 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
• the first-order information {f (xτ ), f 0 (xτ )}1≤τ <t on f along the previous search points
xτ ∈ X, τ < t;
• current iterate xt ∈ X.
• At step t we
1. compute f (xt ), f 0 (xt ); this information, along with the past first-order information on f ,
provides is with the current model of the objective
This model underestimates the objective and is exact at the points x1 , ..., xt ;
2. define the best found so far value of the objective f t = min f (xτ )
τ ≤t
Note that the current gap ∆t = f t − ft is an upper bound on the inaccuracy of the best
found so far approximate solution to the problem;
f1 ≤ f2 ≤ f3 ≤ ... ≤ f∗
f 1 ≥ f 2 ≥ f 3 ≤ ... ≥ f∗
∆1 ≥ ∆2 ≥ ... ≥ 0
5.2. THE SIMPLEST: SUBGRADIENT DESCENT AND EUCLIDEAN BUNDLE LEVEL323
• Let us say that a group of subsequent iterations J = {s, s + 1, ..., r} form a segment, if
∆r ≥ (1 − λ)∆s . We claim that
If J = {s, s + 1, ..., r} is a segment, then
(i) All the sets Lt = {x ∈ X : ft (x) ≤ `t }, t ∈ J, have a point in common, specifically, (any)
minimizer u of fr (·) over X;
(ii) For t ∈ J, one has kxt − xt+1 k2 ≥ (1−λ)∆
L
r
.
k·k2 (f )
Indeed,
(i): for t ∈ J we have
• Main observation:
Indeed, when t ∈ J, the sets Lt = {x ∈ X : ft (x) ≤ `t } have a point u in common, and xt+1 is
the projection of xt onto Lt . It follows that
X
⇒ kxt − xt+1 k22 ≤ kxs − uk22 ≤ max kx − yk22
x,y∈X
t∈J
max kx − yk22
x,y∈X
⇒ Card(J) ≤
min kxt − xt+1 k22
t∈J
Corollary 5.2.2 For every , 0 < < ∆1 , the number N of steps before a gap ≤ is obtained
(i.e., before an -solution is found) does not exceed the bound
Var2k·k2 ,X (f )
N () = . (5.2.10)
λ(1 − λ)2 (2 − λ)2
Proof. Assume that N is such that ∆N > , and let us bound N from above. Let us split the
set of iterations I = {1, ..., N } into segments J1 , ..., Jm as follows:
324 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
J1 = {t : t ≤ N, (1 − λ)∆t ≤ ∆N }
and so on.
As a result, I will be partitioned “from the end to the beginning” into segments of iterations
J1 , J2 ,...,Jm . Let d` be the gap corresponding to the last iteration from J` . By maximality of
segments J` , we have
d1 ≥ ∆N >
d`+1 > (1 − λ)−1 d` , ` = 1, 2, ..., m − 1
whence
d` > (1 − λ)−(`−1) .
We now have
m m Var2 Var2k·k2 ,X (f ) P
m
k·k2 ,X (f )
Card(J` ) ≤ ≤ (1 − λ)2(`−1) −2
P P
N = (1−λ)2 d2` (1−λ)2
`=1 `=1 `=1 2
Var2k·k2 ,X (f ) P
∞ Var2k·k2 ,X (f )
≤ (1−λ)2 2
(1 − λ)2(`−1) = (1−λ)2 [1−(1−λ)2 ]2
= N ().
`=1
Discussion. 1. We have seen that the Bundle-Level method shares the dimension-
independent (and optimal in the “favourable” large-scale case) theoretical complexity bound
For every > 0, the number of steps before an -solution to convex program min f (x)
x∈X
2
Vark·k2 ,X (f )
is found, does not exceed O(1) .
There exists quite convincing experimental evidence that the Bundle-Level method obeys the
optimal in fixed dimension “polynomial time” complexity bound
For every ∈ (0, VarX (f ) ≡ max f − min f ), the number of steps before an -solution
X X
to convex program min n f (x) is found does not exceed n ln VarX (f ) + 1.
x∈X⊂R
Experimental rule: When solving convex program with n variables by BL, every n steps add
new accuracy digit.
5.2. THE SIMPLEST: SUBGRADIENT DESCENT AND EUCLIDEAN BUNDLE LEVEL325
with somehow generated 50 × 50 matrix A; in the problem, f (0) = 2.61, f∗ = 0. This is how SD
and BL work on this problem
2
10
1
10
0
10
−1
10
−2
10
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
0
10
−1
10
−2
10
−3
10
−4
10
−5
10
0 50 100 150 200 250
is equal to the size t of the current bundle – the collection of affine forms gτ (x) = f (xτ ) +
(x − xτ )T f 0 (xτ ) participating in model ft (·). Thus, the complexity of an iteration in BL grows
with the iteration number. In order to suppress this phenomenon, one needs a mechanism for
shrinking the bundle (and thus – simplifying the models of f ).
• The simplest way of shrinking the bundle is to initialize d as ∆1 and to run plain BL until an
iteration t with ∆t ≤ d/2 is met. At such an iteration, we
• shrink the current bundle, keeping in it the minimum number of the forms gτ sufficient to
ensure that
ft ≡ min max gτ (x) = min max gτ (x)
x∈X 1≤τ ≤t x∈X selected τ
• reset d as ∆t ,
and proceed with plain BL until the gap is again reduced by factor 2, etc.
Computational experience demonstrates that the outlined approach does not slow BL down,
while keeping the size of the bundle below the level of about 2n.
What is ahead. Subgradient Descent looks very geometric, which simplifies “digesting” of
the construction. However, on a close inspection this “geometricity” is completely misleading;
the “actual” reasons for convergence have nothing to do with projections, distances, etc. What
in fact is going on, is essentially less transparent and will be explained in the next section.
where X is a closed convex set in a Euclidean space E with inner product h·, ·i (unless otherwise
is explicitly stated, we assume w.l.o.g. E to be just Rn ) and assume that
To quantify this assumption, we fix once for ever a norm k · k on E and associate with f the
Lipschitz constant of f |X w.r.t. the norm k · k:
Note that from Convex Analysis it follows that f at every point x ∈ X admits a subgradient
f 0 (x) such that
kf 0 (x)k∗ ≤ Lk·k (f ),
this is not a severe restriction, since at least in the interior of X all subgradients of f are “small”
in the outlined sense.
Note that by the definition of the conjugate norm, we have
3. A distance generating function (DGF for short) ω(x) : X → R – a convex and continuous
function on X with the following properties:
or, equivalently,
1
ω(y) ≥ ω(x) + hω 0 (x), y − xi + kx − yk2 ∀(x ∈ X o , y ∈ X). (5.3.3)
2
The outlined setup (which we refer to as Proximal setup) specifies several other important
entities, specifically
• The ω-center of X — the point
xω = argmin ω(x).
x∈X
This point does exist (since X is closed and convex, and ω is continuous and strongly
convex on X) and is unique due to the strong convexity of ω. Besides this, xω ∈ X o and
hω 0 (xω ), x − xω )i ≥ 0 ∀x ∈ X (5.3.4)
Fact 5.3.1 With our ω, X o is exactly the set of minimizers, over X, of functions of the
form hp, xi + ω(x). Besides this, the latter function attains its minimum at a point x̄ ∈ X
if and only if
hω 0 (x̄) + p, x − x̄i ≥ 0 ∀x ∈ X. (5.3.5)
Fact 5.3.2 Let X, X o , k · k, ω(·) be as above and let U be a closed convex subset of
X which intersects the relative interior of X. Then for every p, the minimizer x∗ of
hp (x) = hp, xi + ω(x) over U exists and is unique, belongs to X o and is fully characterized
by the relations x∗ ∈ U ∩ X o and
hp + ω 0 (x∗ ), u − x∗ i ≥ 0 ∀u ∈ U. (5.3.6)
1
From now on, “subgradient” means “subgradient on X”
328 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
• The prox-term (or local distance, or Bregman distance from x to y), defined as
• The quantity
e = sup Vx (y) = sup ω(y) − hω 0 (xω ), y − xω i − ω(xω ) ≤ Θ := sup ω(·) − min ω(·),
Θ ω
y∈X y∈X X X
where the concluding inequality is due to (5.3.4). Every one of the quantities Θ,
e Θ is finite
if and only if X is bounded. For a bounded X, the quantity
q
Ω= 2Θ,
e
will be called the ω-radius of X. Observe that from Vx (y) ≥ 12 ky − xk2 it follows that
ky − xk2 ≤ 2Vx (y) whenever x ∈ X o , y ∈ X, which combines with the definitions of Θe and
Ω to imply that
e < ∞ ⇒ ky − xω k ≤ Ω ∀y ∈ X;
Θ (5.3.7)
this is where the name of Ω comes from.
Remark 5.3.1 The importance of ω-radius of a domain X stems from the fact that, as we
shall see in the man time, the efficiency estimate (the dependence of accuracy on iteration
count) of Mirror Descent algorithm as applied to problem (CP) on a bounded domain X
depends on the underlying proximal setup solely via the value Ω of the ω-radius of X (the
less is this radius, the better).
• Prox-mapping
Proxx (ξ) = argmin {Vx (y) + hξ, yi} = argmin ω(y) + hξ − ω 0 (x), yi : Rn → X.
y∈X y∈X
Here the parameter of the mapping — prox-center x – is a point from X o . By Fact 5.3.1,
the prox-mapping takes its values in X o .
Sometimes we shall use a modification of the prox-mapping as follows:
0
ProxU n
x (ξ) = argmin {Vx (y) + hξ, yi} = argmin ω(y) + hξ − ω (x), yi : R → U
y∈U y∈U
where U is a closed convex subset of X intersecting the relative interior of X. Here the
prox-center x is again a point from X o . By Fact 5.3.2, the latter mapping takes its values
in U ∩ X o .
Note that the methods we are about to present require at every step computing one (or two)
values of the prox-mapping, and in order for the iterations to be simple, this task should be
relatively easy. In other words, from the viewpoint of implementation, we need X and ω to be
“simple” and to “match each other.”
5.3. MIRROR DESCENT ALGORITHM 329
p Ball (called 1also the Euclidean) setup is as follows: E is a Euclidean space, kxk = kxk2 :=
The
hx, xi, ω(x) = 2 hx, xi.
Here X o = X, Vx (y) = 12 kx − yk22 , xω is the k · k2 -closest to the origin point of X. For a bounded
X, we have Ω = max ky − xω k2 . The prox-mapping is
y∈X
where
ΠX (h) = argmin kh − yk2
y∈X
is what is called the metric projector onto X. This projector is easy to compute when X is
k · kp -ball, or the “nonnegative part” of k · kp -ball – the set {x ≥ a : kx − akp ≤ r}, or the
standard simplex X
∆n = {x ∈ Rn : x ≥ 0, xi = 1},
i
We shall use this setup only when X is a convex and closed subset of the full-dimensional
simplex ∆+ n with n ≥ 3 and, in addition, X contains a positive vector, in which case ω(·) indeed
is strongly convex, modulus 1 w.r.t. k · k1 , on X, X o = {x ∈ X : x > 0}, and
q
Ω≤ 2 ln(n). (5.3.9)
The ω-center can be easily pointed out when X is either the standard simplex ∆n , or the
full-dimensional standard simplex ∆+
n ; in both cases
In these two cases it is also easy to compute the prox-mapping; here is an explicit formula for
the case of X = ∆n (work out the case of X = ∆+ n by yourself):
" #−1
h i
xi e−ξi x1 e−ξ1 ; ...; xn e−ξn .
X
X = ∆n ⇒ Proxx (ξ) =
i
Bounded case. When X is a closed convex subset of the unit ball Z = {x ∈ E : kxk ≤ 1},
which we refer to as bounded case2 , a good choice of ω is
p 1, n=1
(
1 2, n≤2
kxj kp2 , p = 1
X
ω(x) =
1+ 1
, n ≥3
,γ= 2, n=2 (5.3.10)
pγ j=1 ln n
1
, n >2
e ln(n)
which results in q
Ω ≤ O(1) ln(n + 1). (5.3.11)
General case. It turns out that the function
2
n(p−1)(2−p)/p Xn
p
b (x = [x1 ; ...; xn ]) =
ω kxj kp2 : E → R (5.3.12)
2γ j=1
with p, γ given by (5.3.10) is a compatible with k · k-norm DGF for the entire E, so that its
restriction on an arbitrary closed convex nonempty subset X of E is a DGF for X compatible
with k · k. When, in addition, X is bounded, the ω
b |X -radius of X can be bounded as
q
Ω ≤ O(1) ln(n + 1)R, (5.3.13)
Indeed, in the case of A, computing prox-mapping reduces to solving convex program of the
form
n
X
[hy j , bj i + ky j kp2 :
X
min ky j k2 ≤ 1 ; (P )
y 1 ,...,y n
j=1 j
2
note that whenever X is bounded, we can make X a subset of B`1 /`2 by shift and scaling.
5.3. MIRROR DESCENT ALGORITHM 331
clearly, at the optimum y j are nonpositive multiples of bj , which immediately reduces the prob-
lem to
X
[spj − βj sj ] : s ≥ 0,
X
min sj ≤ 1 , (P 0 )
s1 ,...,sn
j j
where βj ≥ 0 are given and p > 1. The latter problem is just a problem of minimizing a
single separable convex function under a single separable constraint. The Lagrange dual of this
problem is just a univariate convex program with easy to compute (just O(n) a.o.) value and
derivative of the objective. As a result, (P 0 ) can be solved within machine accuracy in O(n)
a.o., and therefore (P ) can be solved within machine accuracy in O(dim E) a.o.
In the case of B, similar argument reduces computing prox-mapping to solving the problem
X
min k[s1 ; ...; sn ]k2p − βj sj : s ≥ 0 ,
s1 ,...,sn
j
where β ≥ 0. This problem admits a closed form solution (find it!), and here again computing
prox-mapping takes just O(dim E) a.o.
Comments. When n = 1, the `1 /`2 norm becomes nothing but the Euclidean norm k · k2
on E = Rm1 , and the proximal setups we have just described become nothing but the Ball
setup. As another extreme, consider the case when m1 = ... = mn = 1, that is, the `1 /`2 norm
becomes the plain k · k1 -norm on E = Rn . Now let X be a closed convex subset of the unit
k · k1 -ball of Rn ; restricting the DGF (5.3.11) for the latter set onto X, we get what we shall call
Simplex setup. Note that when X is a part of the full dimensional simplex ∆+ n , Simplex setup
can be viewed as alternative to Entropy setup. Note that in the case in question both Entropy
and Simplex setups yield basically identical bounds on the ω-radius of X (and thus result in
algorithms with basically identical efficiency estimates, see Remark 5.3.1).
Note that in the case of X ⊂ ∆+ n Simplex setup, in contrast to Entropy one, yields a DGF
which is continuously differentiable on the entire X, moreover, such that
(in fact, this bound works whenever X is contained in the unit k · k1 -ball), which is stronger
than the entropy analogy of this bound, that is,
where u is the barycenter of ∆n . Note that in the entropy case, Vu (v) can be arbitrarily large,
provided that some entries in u are close to zero. Property (5.3.14) is quite valuable in some
situations (see, e.g., the issues related to updating prox-centers on page 376). The only – and
completely unimportant for all practical purposes – shortcoming of Simplex setup as compared
to Entropy one is that when X = ∆n or X = ∆+ n , prox-mapping for Entropy setup admits a
closed form representation and can be computed in O(dim X) a.o., while with Simplex setup,
computing prox-mapping requires solving a univariate convex program, and the same O(dim X)
a.o. allow to solve the problem “only” within machine accuracy.
332 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
• Given a collection µ, ν of n pairs of positive integers µi , νi , let Mµν be the space of all real
matrices x = Diag{x1 , ..., xn } with n diagonal blocks xi of sizes µi × νi ,l i = 1, ..., n. We
equip this space with natural linear operations and Frobenius inner product
hx, yi = Tr(xy T ).
non-ascending order.
Nuclear norm setup. Here E = Mµν , µi ≤ νi , 1 ≤ i ≤ n, and k · k is the nuclear norm k · knuc
on E. We set
n
X
m= µi ,
i=1
which results in q √ q
Ω ≤ 2 2 e ln(2m) ≤ 4 ln(2m). (5.3.16)
General case. It turns out that the function
2
1
X
m 1+q
ω
b (x) = 2e ln(2m) σ 1+q (x) : E → R, q = (5.3.17)
j=1 j 2 ln(2m)
3
When this is not the case, we can replace “badly sized” diagonal blocks with their transposes, with no effect
on the results of linear operations, inner products, and nuclear norm - the only entities we are interested in.
5.3. MIRROR DESCENT ALGORITHM 333
This (clearly solvable) problem has a rich group of symmetries; specifically, due to the diagonal
nature of hj , replacing a block z j in a feasible solution with Diag{}z j Diag{; }, where is a
vector of ±1’s of dimension µj , and 0 is obtained from by adding νj − µj of ±1 entries to the
right of . Since the problem is convex, it has an optimal solution which is preserved by the
above symmetries, that is, such that all z j are diagonal matrices. Denoting by ζ j the diagonals
of these diagonal matrices and setting g = [g 1 ; ...; g n ] ∈ Rm , the problem becomes
m
( )
X X
T p
min ζ g+ |ζk | : |ζk | ≤ 1 ,
ζ=[ζ 1 ;...;ζ n ]∈Rn
k=1 k
which, basically, is the problem we have met when computing prox–mapping in the case A of
`1 /`2 setup; as we remember, this problem can be solved within machine precision in O(m) a.o.
Completely similar reasoning shows that in the case of B, computing prox-mapping reduces,
at the price of a single singular value decomposition of a matrix from Mµν , to solving the same
problem as in the case B of `1 /`2 setup. The bottom line is that in the cases of A, B, the
computational effort of computing prox-mapping is dominated by the necessity to carry out
singular value decomposition of a matrix from Mµν .
Remark 5.3.2 A careful reader should have recognize at this point the deep similarity between
the `1 /`2 and the nuclear norm setups. This similarity has a very simple explanation: the `1 /`2
situation is a vary special case of the nuclear norm one. Indeed, a block vector y = [y 1 ; ...; y n ],
y j ∈ Rkj , can be thought of as the block-diagonal matrix y + with n 1 × kj diagonal blocks [y j ]T ,
j = 1, ..., n. with this identification of block-vectors y and block-diagonal matrices, the singular
values of y + are exactly the norms ky j k2 , j = 1, ..., m ≡ n, so that the nuclear norm of y + is
exactly the block-`1 norm of y. Up to minor differences in the coefficients in the formulas for
DGFs, our proximal setups respect this reduction of the `1 /`2 situation to the nuclear norm one.
334 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
∆ν = {x ∈ Sν : x 0, Tr(x) = 1}, ∆+ ν
ν = {x ∈ S : x 0, Tr(x) ≤ 1}.
Let X be a nonempty closed convex subset of ∆+ ν . Restricting the DGF (5.3.15) associated
with Mνν onto Sν (the latter space is a linear subspace of Mνν ) and further onto p X, we get a
continuously differentiable DGF for X, and the ω-radius of X does not exceed O(1) ln(m + 1),
where, m = ν1 + ... + νn . When X = ∆ν or X = ∆+ ν , the computational effort for computing
prox-mapping is dominated by the necessity to carry out eigenvalue decomposition of a matrix
from Sν .
Fact 5.3.3 Given proximal setup k · k, X, ω(·) along with a closed convex subset U of X inter-
secting the relative interior of X, let x ∈ X o , ξ ∈ Rn and x+ = ProxU
x (ξ). Then
where γt > 0 are positive stepsizes. Note that this procedure is well defined since, as we have
seen, xω ∈ X o and the values of the prox-mapping also belong to X o , so that xt ∈ X o for all t
and thus Proxxt (·) makes sense.
The points xt in (5.3.21) are what is called search points – the points where we collect the
first order information on f . There is no reason to insist that these search points should be also
the approximate solutions xt generated in course of t = 1, 2, ... steps. In MD as applied to (CP),
there are two good ways to define the approximate solutions xt , specifically:
— the best found so far approximate solution xtbst – the point of the collection x1 , ..., xt with
the smallest value of the objective f , and
— the aggregated solution
" t #−1 t
X X
xt = γτ γτ xτ . (5.3.22)
τ =1 τ =1
Note that xt ∈ X (as a convex combination of the points xτ which by construction belong to
X).
Remark 5.3.3 Note that MD with Ball setup is nothing but SD (check it!).
1
γτ hf 0 (xτ ), xτ − ui ≤ Vxτ (u) − Vxτ +1 (u) + kγτ f 0 (xτ )k2∗ . (5.3.23)
2
As a result, for every t ≥ 1 one has
Ω2 1 Pt 2 0 2
t 2 + 2 τ =1 γτ kf (xτ )k∗
f (x ) − min f ≤ Pt , (5.3.24)
X τ =1 γτ
Ω
γt = √ , 1 ≤ t ≤ T, (5.3.25)
kf 0 (x t )k∗ T
we ensure that
√
T T Ω maxt≤T kf 0 (xt )k∗ ΩLk·k (f )
f (x ) − min f ≤ Ω PT 1
≤ √ ≤ √ , (5.3.26)
X
t=1 kf 0 (xt ))k∗ T T
(Lk·k (f ) is the Lipschitz constant of f w.r.t. k · k), and similarly for xTbst .
We have
hξτ − ω 0 (xτ ) + ω 0 (xτ +1 ), u − xτ +1 i ≥ 0 [by Fact 5.3.1]
⇒ hξτ , xτ − ui ≤ hξτ , xτ − xτ +1 i + hω 0 (xτ +1 ) − ω 0 (xτ ), u − xτ +1 i
= hξτ , xτ − xτ +1 i + Vxτ (u) − Vxτ +1 (u) − Vxτ (xτ +1 ) [Magic Identity (5.3.20)]
= hξτ , xτ − xτ +1 i + Vxτ (u) − Vxτ +1 (u) − [ ω(xτ +1 ) − ω(xτ ) − hω 0 (xτ ), xτ +1 − xτ i ]
| {z }
≥ 12 kxτ +1 −xτ k2
h i
= Vxτ (u) − Vxτ +1 (u) + hξτ , xτ − xτ +1 i − 12 kxτ +1 − xτ k2
h i
≤ Vxτ (u) − Vxτ +1 (u) + kξτ k∗ kxτ − xτ +1 k − 12 kxτ +1 − xτ k2 [see (5.3.2)]
≤ Vxτ (u) − Vxτ +1 (u) + 12 kξτ k2∗ [since ab − 21 b2 ≤ 12 a2 ]
Recalling that ξτ = γτ f 0 (xτ ), we arrive at (5.3.23).
20 . Denoting by x∗ a minimizer of f on X, substituting u = x∗ in (5.3.23) and taking into
account that f is convex, we get
1
γτ [f (xτ ) − f (x∗ )] ≤ γτ hf 0 (xτ ), xτ − x∗ i ≤ Vxτ (x∗ ) − Vxτ +1 (x∗ ) + γτ2 kf 0 (xτ )k2∗ .
2
Pt
Summing up these inequalities over τ = 1, ..., t, dividing both sides by Γt ≡ τ =1 γτ and setting
ντ = γτ /Γt , we get
" t Pt
Vx (x∗ ) − Vxt+1 (x∗ ) + 21 0
#
2 2
τ =1 γτ kf (xτ )k∗
X
ντ f (xτ ) − f (x∗ ) ≤ 1 . (∗)
τ =1
Γt
Since ντ are positive and sum up to 1, and f is convex, the left hand side in (∗) is ≥ f (xt )−f (x∗ )
2
(recall that xt = tτ =1 ντ xτ ); it clearly is also ≥ f (xtbst ) − f (x∗ ). Since Vx1 (x) ≤ Ω2 due to the
P
definition of Ω and to x1 = xω , while Vxt+1 (·) ≥ 0, the right hand side in (∗) is ≤ the right hand
side in (5.3.24), and we arrive at (5.3.24).
30 . Relation (5.3.26) is readily given by (5.3.24), (5.3.25) and the fact that kf 0 (xt )k∗ ≤
Lk·k (f ), see (5.3.1). 2
5.3.4.5 Refinement
Under favorable circumstances we can slightly modify the MD algorithm and refine its complexity
analysis. Specifically, let us introduce the following assumption on the MD setup:
(!) In the situation and notation of sections 5.3.1, 5.3.2, for properly selected Ω∗ ,
0 < Ω∗ < ∞, one has
Ω2
sup Vu (v) ≤ ∗ . (5.3.27)
u∈X o ,v∈X 2
Note that when X is bounded, the Ball setup, and the Bounded and General cases of `1 /`2 and
the Nuclear norm setups meet the above assumption. Moreover, it is easily seen that in these
cases the quantities Ω∗ satisfy the same, within absolute constant factor, upper bounds as the
respective quantities Ω, that is, bounds (5.3.11) (`1 /`2 setup, Bounded case), (5.3.13) (`1 /`2
setup, General case), (5.3.16) (Nuclear norm setup, Bounded case), and (5.3.18) (Nuclear norm
setup, General case). In the case of Ball setup, we clearly have Ω∗ ≤ 2Ω.
The outlined refinement of the MD complexity analysis is given by the following version of
Theorem 5.3.1:
5.3. MIRROR DESCENT ALGORITHM 337
Theorem 5.3.2 Let (5.3.27) takes place, and let λ1 , λ2 , ... be positive reals such that
and let Pt
τ =1 λτ xτ
xt = P t .
τ =1 λτ
Then xt ∈ X and
λt 2 Pt 0 (x 2
τ =1 λτ γτ kf
Pt
t τ =1 λτ [f (xτ ) − minX f ] γt Ω∗ + τ )k∗
f (x ) − min f ≤ Pt ≤ , (5.3.29)
2 tτ =1 λτ
P
X τ =1 λτ
Note that (5.3.24) is nothing but (5.3.29) as applied with λτ = γτ , 1 ≤ τ ≤ t, and with Ω in the
role of Ω∗ .
Proof of Theorem 5.3.2. We still have at our disposal (5.3.23); multiplying both sides of this
relation by λτ /γτ and summing up the resulting inequalities over τ = 1, ..., t, we get
∀(u ∈ X) :
Pt Pt
0
τ =1 λτ hf (xτ), xτ − ui≤
λτ
τ =1 γτ[Vxτ (u) − (u)] + 12 tτ =1 λτ γτ kf 0 (xτ )k2∗
P
V
xτ +1
λ2 λ1 λ3 λ2 λt λt−1
λ1
= γ1 Vx1 (u) + − Vx2 (u) + − Vx3 (u) + ... + − Vxt (u)
γ2 γ1 γ3 γ2 γt γt−1
| {z } | {z } | {z }
≥0 ≥0 ≥0
2
1 Pt
≤ λγtt Ω2∗
0 Pt 0 (x
− λγtt Vxt+1 (u) + 2
2
τ =1 λτ γτ kf (xτ )k∗ + 1
2 τ =1 λτ γτ kf
2
τ )k∗ ,
Ω2∗
where the concluding inequality is due to 0 ≤ Vxτ (u) ≤ 2 for all τ . As a result,
λt 2 Pt
Ω + λτ γτ
maxu∈X Λ−1 γt ∗
Pt
t τ =1 λ
hτ
hf 0 (xτ ), xτ −iui ≤ τ =1
2Λt (5.3.30)
Λt = tτ =1 λτ .
P
t t
f (xt ) − f (x∗ ) ≤ Λ−1 λτ [f (xτ ) − f (x∗ )] ≤ Λ−1 λτ hf 0 (xτ ), xτ − x∗ i,
X X
t t (5.3.31)
τ =1 τ =1
t
f (xtbst ) ≤ Λ−1
X
t λτ f (xτ ),
τ =1
the first inequality in (5.3.31) is preserved when replacing xt with xtbst , and the resulting in-
equality combines with (5.3.30) to imply the validity of (5.3.29) for xtbst in the role of xt . 2
338 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
Discussion. As we have already mentioned, in the case of (!) Theorem 5.3.1 is “essentially
covered” by Theorem 5.3.2. The latter, however, sometimes yields more information. For
example, the “In particular” part of Theorem 5.3.1 states that for every given “time horizon” T ,
specifying γt , t ≤ T , according to (5.3.25), we can upper-bound the non-optimalities f (xTbst ) −
minX f and f (xT ) − minX f of the approximate solutions generated in course of T steps by
ΩLk·k (f )
√
T
, (see (5.3.26)), which is the best result we can extract from Theorem; note, however,
that to get this “best result,” we should tune the stepsizes to the total number of steps T
we intend to perform. What to do if we do not want to fix the total number T of steps in
advance? A natural way is to pass from the stepsize policy (5.3.25) to its “rolling horizon”
analogy, specifically, to use
Ω∗
γt = 0 √ , t = 1, 2, ... (5.3.32)
kf (xt )k∗ t
With these stepsizes, the efficiency estimate (5.3.26) yields
Ω∗ Lk·k (f ) ln(T +1)
max[f (xT ), f (xTbst )] − minX f ≤ O(1) √
T
, T = 1, 2, ...
T
PT
γt xt
(5.3.33)
x = PT
t=1
γ
t=1 t
(recall that O(1)’s are absolute constants); this estimate is worse that (5.3.26), although just by
logarithmic in T factor4 . On the other hand, utilizing the stepsize policy (5.3.32) and setting
tp
λt = , t = 1, 2, ... (5.3.34)
kf 0 (xt )k∗
where p > −1/2, we ensure (5.3.28) and thus can apply (5.3.29), arriving at
Ω∗ Lk·k (f )
max[f (xt ), f (xtbst )] − minX f ≤ Cp √
t
, t = 1, 2, ...
Pt
λ x
(5.3.35)
t τ =1 τ τ
x = P t
τ =1
λτ
where Cp depends solely on p > −1/2 and is continuous in this range of values of p. Note that the
efficiency estimate (5.3.35) is, essentially, the same as (5.3.26) whatever be time horizon t. Note
also that the weights with which we aggregate search points xτ to get approximate solutions
in (5.3.33) and in (5.3.35) are different: in (5.3.33), other things (specifically, kf 0 (xτ )k∗ ) being
equal, the “newer” is a search point, the less is the weight with which the point participates
in the “aggregated” approximate solution. In contrast, when p = 0, the aggregation (5.3.35)
is “time invariant,”, and when p > 0, it pays the more attention to a search point the newer
is the point, this feature being amplified as p > 0 grows. Intuitively, “more attention to the
new search points” seems to be desirable, provided the search trajectory xτ itself, not just the
sequences of averaged solutions xτ or the best found so far solutions xτbst , converges, as τ → ∞,
to the optimal set of the problem minx∈X f (x); on a closest inspection, this convergence indeed
takes place, provided (!) takes place and the stepsizes γt > 0 satisfy γt → 0,, t → ∞, and
P∞
t=1 γt = ∞.
Ball setup and optimization over the ball. As we remember, in the case of the ball setup
one has
Ω = max ky − xω k2 ≤ Dk·k2 (X),
y∈X
where Dk·k2 (X) = max kx − yk2 if the k · k2 -diameter of X. Consequently, (5.3.26) becomes
x,y∈X
Dk·k2 (X)Lk·k2 (f )
f (xt ) − min f ≤ √ , (5.3.36)
X T
meaning that the number N () of MD steps needed to solve (CP) within accuracy can be
bounded as
Dk·k2 (X)L2 (f )
2 k·k2
N () ≤ 2
+ 1. (5.3.37)
On the other hand, let L > 0, and let Pk·k2 ,L (X) be the family of all convex problems (CP) with
convex Lipschitz continuous, with constant L w.r.t. k · k2 , objectives. It is known that if X is an
2
Dk·k (X)L2
n-dimensional Euclidean ball and n ≥ 2
2
, then the information-based complexity of the
2
Dk·k (X)L2
family Pk·k2 ,L (X) is at least O(1) 2
2
(cf. (5.1.4)). Comparing this result with (5.3.37),
we conclude that
If X is an n-dimensional Euclidean ball, then the complexity of the family Pk·k2 ,L (X)
D2 (X)L2
w.r.t. the MD algorithm with the ball setup in the “large-scale case” n ≥ k·k22
coincides (within an absolute constant factor) with the information-based complexity
of the family.
Why the standard setups? “The contribution” of ω(·) to the performance estimate (5.3.26)
is in the factor Ω; the less it is, the better. In principle, given X and k · k, we could play with
ω(·) to minimize Ω. The standard setups are given by a kind of such optimization for the cases
when X is the ball and k · k = k · k2 (“the ball case”), when X is the simplex and k · k = k · k1
(“the entropy case”), and when X is the `1 /`2 ball, or the nuclear norm ball, and the norm
k · k is the block `1 norm, respectively, the nuclear norm. We did not try to solve the arising
variational problems exactly; however, it can be p proved in all three cases that the value of Ω
we havepreached (i.e., O(1) in the ball case, O( ln(n)) in the simplex case and the `1 /`2 case,
and O( ln(m) in the nuclear norm case) cannot be reduced by more than an absolute constant
factor.
Adjusting to problem’s geometry. A natural question is: When solving a particular prob-
lem (CP), which version of the Mirror Descent method to use? A “theory-based” answer could
be “look at the complexity estimates of various versions of MD and run the one with the
best complexity.” This “recommendation” usually is of no actual value, since the complexity
bounds involve Lipschitz constants of the objective w.r.t. appropriate norms, and these con-
stants usually are not known in advance. We are about to demonstrate that there are cases
where reasonable recommendations can be made based solely on the geometry of the domain
X of the problem. As an instructive example, consider the case where X is the unit k · kp -ball:
X = X p = {x ∈ Rn : kxkp ≤ 1}, where p ∈ {1, 2, ∞}.
Note that the cases of p = 1 and especially p = ∞ are “quite practical.” Indeed, the first of
them (p = 1) is, essentially, what goes on in problems of `1 minimization (section 1.3.1); the
second (p = ∞) is perhaps the most typical case (minimization under box constraints). The
case of p = 2 seems to be more of academic nature.
The question we address is: which setup — Ball or Entropy one — to choose5 .
First of all, we should explain how to process the domains in question via Entropy setup –
the latter requires X to be a part of the full-dimensional standard simplex ∆+
n , which is not the
case for the domains X we are interested in now. This issue can be easily resolved, specifically,
as follows: setting
xρ [u, v] = ρ(u − v), [u; v] ∈ Rn × Rn ,
we get a mapping from R2n onto Rn such that the image of ∆+ 2n under this mapping covers X ,
p
γ
provided that ρ = ρp = n p , where γ1 = 0, γ2 = 1/2, γ∞ = 1. It follows that in order to apply
5
of course, we could look for more options as well, but as a matter of fact, in the case in question no more
attractive options are known.
5.3. MIRROR DESCENT ALGORITHM 341
Setup X = X1 X = X2 X = X∞
Ball L22 (f ) L22 (f ) nL22 (f )
2 2 2
Entropy L1 (f ) ln(n) L1 (f )n ln(n) L1 (f )n2 ln(n)
kxk1 √
Now note that for 0 6= x ∈ Rn we have 1 ≤ kxk2 ≤ n, whence
L21 (f ) 1
1≥ 2 ≥ .
L2 (f ) n
With this in mind, the data from the above table suggest that in the case of p = 1 (X is the `1
ball X 1 ), the Entropy setup is more preferable; in all remaining cases under consideration, the
preferable setup if the Ball one.
Indeed, when X = X 1 , the Entropy setup can lose to the Ball one at most by the “small”
factor ln(n) and can gain by factor as large as n/ ln(n); these two options, roughly speaking,
correspond to the cases when subgradients of f have just O(1) significant entries (loss), or, on
the contrary, have all entries of the same order of magnitude (gain). Since the potential loss
is small, and potential gain can be quite significant, the Entropy setup in the case in question
seems to be more preferable than the Ball one. In contrast to this, in the cases of X = X 2
and X = X ∞ even the maximal potential gain associated with the Entropy setup (the ratio of
L21 (f )/L22 (f ) as small as 1/n) does not lead to an overall gain in the complexity, this is why in
these cases we would advice to use the Ball setup.
is to find an approximate solution of a desired accuracy via calls to an oracle (say, First
Order one) representing f . What we are interested in, is the number N of steps (calls to the
oracle) needed to build an -solution, and the overall computational effort in executing these N
steps. What we ignore completely, is how “good,” in terms of f , are the search points xt ∈ X,
1 ≤ t ≤ N – all we want is “to learn” f to the extent allowing to point out an -minimizer,
and the faster is our learning process, the better, no matter how “expensive,” in terms of their
non-optimality, are the search points. This traditional approach corresponds to the situation
where our optimization is off line: when processing the problem numerically, calling the oracle
at a search point does not incur any actual losses. What indeed is used “in reality,” is the
342 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
The “horizon” N here can be fixed, in which case we want to make LN as small as possible, or
can vary, in which case we want to enforce LN to approach Opt as fast as possible as N grows.
In fact, in online regret minimization it is not even assumed that the objective f remains
constant as the time evolves; instead, it is assumed that we are dealing with a sequence of
objectives ft , t = 1, 2, ..., selected “by nature” from a known in advance family F of functions
on a known in advance set X ⊂ Rn . At a step t = 1, 2, ..., we select a point xt ∈ X, and “an
oracle” provides us with certain information on ft at xt and charges us with “toll” ft (xt ). It
should be stressed that in this setting, no assumptions on how the nature selects the sequence
{ft (·) ∈ F : t ≥ 1} are made. Now our “ideal goal” would be to find a non-anticipative policy
(rules for generating xt based solely on the information acquired from the oracle at the first t − 1
steps of the process) which enforces our average toll
N
1 X
LN = ft (xt ) (5.3.40)
N t=1
whatever be the sequence {ft (·) ∈ F : t ≥ 1} selected by the nature. It is absolutely clear that
aside of trivial cases (like all functions from F share common minimizer), this ideal goal cannot
be achieved. Indeed, with LN − L∗N → 0, n → ∞, the average, over time horizon N , of (always
nonnegative) non-optimalities ft (xt )−minxt ∈X ft (x) should go to 0 as N → ∞, implying that for
large N , “near all” of these non-optimalities should be “near-zero.” The latter, with the nature
allowed to select ft ∈ F in completely unpredictable manner, clearly cannot be guaranteed,
whatever be non-anticipative policy used to generate x1 , x2 ,...
To make the online setting with “abruptly varying” ft ’s meaningful, people compare LN not
with its “utopian” lower bound L∗N (which can be achieved only by a clairvoyant knowing in
advance the trajectory f1 , f2 , ...), but with a less utopian bound, most notably, with the quantity
N
1 X
L
b N = min ft (x).
x∈X N t=1
Thus, we compare the average, on horizon N , toll of a “capable to move” (capable to select
xt ∈ X in a whatever non-anticipative fashion) “mortal” which cannot predict the future with
5.3. MIRROR DESCENT ALGORITHM 343
the best average toll of a clairvoyant who knows in advance the trajectory f1 , f2 , ... selected by
the nature, but is unable to move (must use the same action x during all “rounds” t = 1, ..., N ).
We arrive at the online regret minimization problem where we are interested in a non-anticipative
policy of selecting xt ∈ X which enforces the associated regret
N
" #
1 X 1 X
∆N = sup ft (xt ) − min ft (x) (5.3.42)
ft ∈F ,1≤t≤N N t=1 x∈X N
t=1
Note that the problem of online regret minimization admits various settings, depending on what
are our assumptions on X and F, and what is the information on ft revealed by the oracle at
the search point xt . In this section, we focus on the simple case where
1. X is a convex compact set in Euclidean space E, and (X, E) is equipped with proximal
setup – a norm k · k on E and a compatible with this norm DGF ω(·), which we assume
to be continuously differentiable on X;
2. F is the family of all convex and Lipschitz continuous, with a given constant L w.r.t. k · k,
functions on X;
3. the information we get in round t, our search point being xt , and “nature’s selection” being
ft (·) ∈ F, is the first order information ft (xt ), ft0 (xt ) on ft at xt , with the k · k∗ -norm of
the taken at xt subgradient ft0 (xt ) of ft (·) not exceeding L.
Our goal is to demonstrate that Mirror Descent algorithm, in the form presented in section
5.3.4.5, is well-suited for online regret minimization. Specifically, consider algorithm as follows:
2. Step t = 1, 2, .... Given xt and a reported by the oracle subgradient ft0 (xt ), of k · k∗ -norm
not exceeding L, of ft at xt , set
xt+1 = Proxxt (γt ft0 (xt )) := argmin hγt ft0 (xt ) − ω 0 (xt ), xi + ω(x)
x∈X
Theorem 5.3.3 In the notation and under assumptions of this section, whatever be a sequence
{ft ∈ F : t ≥ 1} selected by the nature, for the trajectory {xt : t ≥ 1} of search points generated
by the above algorithm and for a whatever sequence of weights {λt > 0 : t ≥ 1} such that λt /γt
is nondecreasing in t, it holds for all N = 1, 2, ...:
N N λN 2 PN 2
1 X 1 X γN Ω∗ + t=1 λt γt L
λt ft (xt ) − min λt ft (u) ≤ (5.3.43)
ΛN t=1 u∈X ΛN
t=1
2ΛN
where
N
X q q
ΛN = λt , Ω∗ = sup 2Vx (y) ≡ sup 2[ω(y) − ω(x) − hω 0 (x), y − xi].
t=1 x,y∈X x,y∈X
Proof of Theorem 5.3.3. With our current algorithm, relation (5.3.23) becomes
1
∀(t ≥ 1, u ∈ X) : γt hft0 (xτ ), xt − ui ≤ Vxt (u) − Vxt+1 (u) + kγτ ft0 (xt )k2∗ (5.3.45)
2
(replace f with ft in the derivation in item 10 of the proof of Theorem 5.3.1). Now let us proceed
exactly as in the proof of Theorem 5.3.2: multiply both sides in (5.3.45) by λt /γt and sum up
the resulting inequalities over t = 1, ..., N , arriving at
∀(u ∈ X) :
PN 0
t=1 λt hft (xt ), x − ui ≤ N λt
[V (u) − Vxt+1 (u)] + 12 N λ γ ¨kf 0 (x )k2
P P
t t=1 γt xt t t t t∗
t=1
λ2 λ1 λ3 λ2 λN λN −1
= λγ11 Vx1 (u) + − Vx2 (u) + − Vx3 (u) + ... + − VxN (u)
γ2 γ1 γ3 γ2 γN γN −1
| {z } | {z } | {z }
≥0 ≥0 ≥0
2
− λγN 1 PT
kf 0 (x )k2 ≤ λN Ω∗ 1 PN 0 2
VxN +1 (u) + λ γ + t=1 λt γt kft (xt )k∗ ,
N
2 t=1 t t t t ∗ γN 2 2
5.3. MIRROR DESCENT ALGORITHM 345
Ω2∗
where the concluding inequality is due to 0 ≤ Vxt (u) ≤ 2 for all t. As a result, taking into
account that kft0 (xt )k∗ ≤ L, we get
λN PN
Ω2∗ + λt γt L2
maxu∈X Λ−1
PN 0 γN t=1
N t=1 λht hft (xt ), xt − ui
i
≤ 2ΛN (5.3.46)
PN
ΛN = t=1 λt .
where
• X ⊂ Rnx , Y ⊂ Rny are nonempty convex compact subsets in the respective Euclidean
spaces (which at this point we w.l.o.g. identified with a pair of Rn ’s),
Basic facts on saddle points. (SP) gives rise to a pair of convex optimization problems
(the primal problem (P) requires to minimize a convex function over a convex compact set, the
dual problem (D) requires to maximize a concave function over another convex compact set).
The problems are dual to each other, meaning that Opt(P ) = Opt(D); this is called strong
duality6 . Note that strong duality is based upon convexity-concavity of φ and convexity of X, Y
(and to a slightly lesser extent – on the continuity of φ on Z = X × Y and compactness of the
latter domain), while the weak duality Opt(P ) ≥ Opt(D) is a completely general (and simple
to verify) phenomenon.
In our case of convex-concave Lipschitz continuous φ the induced objectives in the primal
and the dual problem are Lipschitz continuous as well. Since the domains of these problems are
compact, the problems are solvable. Denoting by X∗ , Y∗ the corresponding optimal sets, the
6
For missing proofs of all claims in this section, see section D.4.
346 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
(in words: with y set to y∗ , φ attains its minimum over x ∈ X at x∗ , and with x set to x∗ , φ
attains its maximum over y ∈ Y at y = y∗ ). These points are considered as exact solutions to
the saddle point problem in question. The common optimal value of (P) and (D) is nothing but
the value of φ at a saddle point:
The standard interpretation of a saddle point problem is a game, where the first player chooses
his action x from the set X, his adversary – the second player – chooses his action y in Y , and
with a choice (x, y) of the players, the first pays the second the sum φ(x, y). Such a game is
called antagonistic (since the interests of the players are completely opposite to each other), or
a zero sum game (since the sum φ(x, y) + [−φ(x, y)] of losses of the players is identically zero).
In terms of the game, saddle points (x∗ , y∗ ) are exactly the equilibria: if the first player sticks
to the choice x∗ , the adversary has no incentive to move away from the choice y∗ , since such a
move cannot increase his profit, and similarly for the first player, provided that the second one
sticks to the choice y∗ .
It is convenient to characterize the accuracy of a candidate (x, y) ∈ X × Y to the role of an
approximate saddle point by the “saddle point residual”
where the concluding equality is due to strong duality. We see that the saddle point inaccuracy
sad (x, y) is just the sum of non-optimalities, in terms of the respective objectives, of x as a
solution to the primal, and y as a solution to the dual problem; in particular, the inaccuracy is
nonnegative and is zero if and only if (x, y) is a saddle point of φ.
Vector field associated with (SP). From now on, we focus on problem (SP) with Lipschitz
continuous convex-concave cost function φ : X × Y → R and convex compact X, Y and assume
that φ is represented by a First Order oracle as follows: given on input a point z = (x, y) ∈
Z := X × Y , the oracle returns a subgradient Fx (z) of the convex function φ(·, y) : X → R at
the point x, and a subgradient Fy (x, y) of the convex function −φ(x, ·) : Y → R at the point Y .
It is convenient to arrange Fx , Fy in a single vector
In addition, we assume that the vector field {F (z) : z ∈ Z} is bounded (cf. (5.3.1)).
where γt > 0 are stepsizes, and z t are approximate solutions built in course of t steps.
The saddle point analogy of Theorem 5.3.1 is as follows:
1
γτ hF (zτ ), zτ − wi ≤ Vzτ (w) − Vzτ +1 (w) + kγτ F (zτ )k2∗ . (5.3.49)
2
As a result, for every t ≥ 1 one has
Ω2 1 Pt 2 2
t 2 + 2 τ =1 γτ kF (zτ )k∗
sad (z ) ≤ Pt . (5.3.50)
τ =1 γτ
Ω
γt = √ , 1 ≤ t ≤ T, (5.3.51)
kF (zt )k∗ T
we ensure that
√
T T Ω maxt≤T kF (xt )k∗ ΩLk·k (F )
sad (z ) ≤ Ω PT 1
≤ √ ≤ √ , Lk·k (F ) = sup kF (z)k∗ .
t=1 kF (xt ))k∗ T T z∈Z
(5.3.52)
Proof. 10 . (5.3.49) is proved exactly as its “minimization counterpart” (5.3.23), see item 10 of
the proof of Theorem 5.3.1; on a closest inspection, the reasoning there is absolutely independent
of where the vectors ξt in the recurrence zt+1 = Proxzt (ξt ) come from.
20 . Summing up relations (5.3.49) over τ = 1, ..., t, plugging in F = [Fx ; Fy ], dividing both
sides of the resulting inequality by Γt = tτ =1 γτ and setting ντ = γτ /Γt , we get
P
T t
" #
Ω2 1 X
Γ−1
X
ντ [hFx (zt ), xt − ui + hFy (zt ), yt − vi] ≤ Q := + γ 2 kF (zτ )k2∗ ; (!)
τ =1
t
2 2 τ =1 τ
this inequality holds true for all (u, v) ∈ X × Y . Now, recalling the origin of Fx and Fy , we have
Now, the right hand side of the latter inequality is independent of (u, v) ∈ Z; taking maximum
over (u, v) ∈ Z in the left hand side of the inequality, we get (5.3.50). This inequality clearly
implies (5.3.52). 2
In our course, the role of Theorem 5.3.4 (which is quite useful by its own right) is to be a
“draft version” of a much more powerful Theorem 5.5.4, and we postpone illustrations till the
latter Theorem is in place.
5.3.5.3 Refinement
Similarly to the minimization case, the complexity analysis of saddle point MD can be refined.
Specifically, assume that for properly selected Ω∗ , 0 < Ω∗ < ∞, we have
Ω2∗
sup Vz (u) ≤ . (5.3.53)
z∈Z o ,u∈Z 2
Same as in the proof of Theorem 5.3.4, the left hand side in this inequality upper-bounds sad (z t ),
and we arrive at (5.3.54). The “In particular” part of Theorem is straightforwardly implied by
(5.3.54). 2
under the same assumptions as above (i.e., X, Y are convex compact sets in Euclidean spaces
Rnx , Rny φ : Z = X × Y → R is convex-concave and Lipschitz continuous), but now, instead
of deterministic First Order oracle which reports the values of the associated with φ vector field
F , see (5.3.47), we have at our disposal a stochastic First Order oracle – a routine which reports
unbiased random estimates of F rather than the exact values of F . A convenient model of the
oracle is as follows: at t-th call to the oracle, the query point being z ∈ Z = X × Y , the oracle
returns a vector G = G(z, ξt ) which is a deterministic function of z and of t-th realization of
oracle’s noise ξt . We assume the noises ξ1 , ξ2 , ... to be independent of each other and identically
distributed: ξt ∼ P , t = 1, 2, .... We always assume the oracle is unbiased:
Eξ∼P {G(z, ξ)} =: F (z) = [Fx (z); Fy (z)] ∈ ∂x φ(z) × ∂y [−φ(z)] (5.3.58)
(thus, Fx (x, y) is a subgradient of φ(·, y) taken at the point x, while Fy (x, y) is a subgradient of
−φ(x, ·) taken at the point y). Besides this, we assume that the second moment of the stochastic
subgradient is bounded uniformly in z ∈ Z:
Note that when Y is a singleton, our setting recovers the problem of minimizing a Lipschitz con-
tinuous convex function over a compact convex set in the case when instead of the subgradients
of the objective, unbiased random estimates of these subgradients are available.
where γt > 0 are deterministic stepsizes. This is nothing but the recurrence (5.3.48), with the
only difference that now this recurrence “is fed” by the estimates G(zt , ξt ) of the vectors F (zt )
participating in (5.3.48). Surprisingly, the performance of SSPMD is, essentially, the same as
the performance of its deterministic counterpart:
350 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
Setting
∆τ = G(zτ , ξτ ) − F (zτ ),
we can rewrite the above relation as
t t t
X Ω2 1 X X
γτ hF (zτ ), zτ − ui ≤ + γτ2 kG(zτ , ξτ )k2∗ + γτ h∆τ , u − zτ i, (5.3.64)
τ =1
2 2 τ =1 τ =1
| {z }
D(ξ t ,u)
hAs i seen when proving Theorem 5.3.4, the left hand side in the latter relation is ≥
we have
t
sad (z t ), so that
P
τ =1 γτ
" t # t
X Ω2 1 X
γτ sad (z t ) ≤ + γ 2 kG(zτ , ξτ )k2∗ + max D(ξ t , u). (5.3.65)
τ =1
2 2 τ =1 τ u∈Z
n o n o n o
E kG(zτ , ξτ )k2∗ = Eξt kG(Zτ (ξ τ −1 ), ξτ )k2∗ = Eξτ −1 Eξτ kG(Zτ (ξ τ −1 ), ξτ )k2∗ ≤ σ2.
| {z }
≤σ 2
20 . It remains to bound E maxu∈Z D(ξ t , u) , which is done in the following two lemmas:
5.3. MIRROR DESCENT ALGORITHM 351
Pτ
Proof. Setting S0 = 0, Sτ = s=1 γs h∆s , vs − zω i, we have
n o n o n o
E Sτ2+1 = E Sτ2−1 + 2Sτ −1 ψτ + ψτ2 ≤ E Sτ2−1 + 4σ 2 Ω2 γτ2 , (5.3.68)
v1 = zω ; vτ +1 = Proxvτ (−ργτ ∆τ ).
or, equivalently,
t t
X Ω2 ρ X
γτ h−∆τ , vτ − ui ≤ + γτ2 k∆τ k2∗ ,
τ =1
2ρ 2 τ =1
352 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
whence
Pt Pt Pt
D(ξ t , u) = τ =1 γτ h−∆τ , zτ − ui ≤ τ =1 γτ h−∆τ , vτ − ui + γτ h−∆τ , zτ − vτ i
Ω2 ρ Pt Pt Pτt=1
≤ Qρ (ξ t ) := 2ρ + 2
2 2
τ =1 γτ k∆τ k∗ + τ =1 γτ h∆τ , vτ − zω i + τ =1 γτ h∆τ , zω − zτ i
Note that Qρ (ξ t ) indeed depends solely on ξ t , so that the resulting inequality says that
max D(ξ , u) ≤ Qρ (ξ ) ⇒ E max D(ξ , u) ≤ E{Qρ (ξ t )}.
t t t
u∈Z uiZ
whence v
t t
Ω2
u
X uX
E max D(ξ t , u) ≤ + 2ρσ 2 γτ2 + 4σΩt γτ2 .
uiZ 2ρ τ =1 τ =1
The resulting inequality holds true for every ρ > 0; minimizing its the right hand side in ρ > 0,
we arrive at (5.3.69). 2
which combines with (5.3.66) to yield (5.3.61). The latter relation, in turn, implies immediately
the “in particular” part of Theorem. 2
5.3.6.3 Refinement
Same as its deterministic counterparts, Theorem 5.3.6 admits a refinement as follows:
Theorem 5.3.7 Assume that (5.3.53) takes place, let λ1 , λ2 , ... be deterministic reals satisfying
(5.3.28), and let
Pt
λτ zτ
z = Pτ =1
t
t .
τ =1 λτ
Then λ Pt
t Ω2 +σ 2 λ γ
∗ τ =1 τ τ
≤ 72 γt
(z t )
E[ξ1 ;...;ξt ]∼P ×...×P sad Λτ , (5.3.70)
σ 2 = supz∈Z Eξ∼P {kG(z, ξ)k2∗ }, Λt = Tτ=1 λτ .
P
In particular, setting
Ω∗
γt = √ , t = 1, 2, ... (5.3.71)
σ t
and
λt = tp , t = 1, 2, ... (5.3.72)
5.3. MIRROR DESCENT ALGORITHM 353
Ω∗ σ
E{sad (z t )} ≤ Cp √ , (5.3.73)
t
t λt 2 Pt 2
X γt Ω∗ + τ =1 λτ γτ kG(zτ , ξτ )k∗
∀(u ∈ Z) : λτ hG(zτ , ξτ ), zτ − ui ≤ .
τ =1
2
Setting
∆τ = G(zτ , ξτ ) − F (zτ ),
we can rewrite the above relation as
t λt 2 Pt 2 t
X γt Ω∗ + τ =1 λτ γτ kG(zτ , ξτ )k∗ X
λτ hF (zτ ), zτ − ui ≤ + λτ h∆τ , u − zτ i, (5.3.74)
τ =1
2 τ =1
| {z }
D(ξ t ,u)
t λt 2 Pt 2
X γt Ω∗ + τ =1 λτ γτ kG(zτ , ξτ )k∗
max λτ hF (zτ ), zτ − ui ≤ + max D(ξ t , u).
u∈Z
τ =1
2 u∈Z
As we have seen when proving Theorem 5.3.4, the left hand side in the latter relation is ≥
Λt sad (z t ), so that
λt 2 Pt 2
t γt Ω∗ + τ =1 λτ γτ kG(zτ , ξτ )k∗
Λt sad (z ) ≤ + max D(ξ t , u). (5.3.75)
2 u∈Z
Same as in the proof of Theorem 5.3.6, the latter relation implies that
λt 2 Pt
t γt Ω∗ + σ2 τ =1 λτ γτ
t
Λt E{sad (z )} ≤ + E max D(ξ , u) . (5.3.76)
2 u∈Z
20 . Applying Lemma 5.3.2 with λτ in the role of γτ , we get the first inequality in the
following chain:
qP qP h i
t t λt 2 γt 2 Pt
E{maxu∈Z D(ξ t , u)} ≤ 6σΩ 2
τ =1 λτ ≤ 6σΩ∗
2
τ =1 λτ ≤3 γt Ω∗ + λt σ
2
τ =1 λτ
h Pt i
λt 2
≤3 γt Ω∗ + σ2 τ =1 λτ γτ ,
(5.3.77)
γt 2
where the concluding inequality is due to ≤ λτ γτ for τ ≤ t (recall that (5.3.28) holds true).
λt λ τ
(5.3.77) combines with (5.3.76) to imply (5.3.70). The latter relation straightforwardly implies
the “In particular” part of Theorem. 2
354 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
Illustration: Matrix Game and `1 minimization with uniform fit. “In the nature” there
exist “genuine” stochastic minimization/saddle point problems, those where the cost function is
given as expectation Z
φ(x, y) = Φ(x, y, ξ)dP (ξ)
with efficiently computable integrant Φ and a distribution P of ξ we can sample from. This
is what happens when the actual cost is affected by random factors (prices, demands, etc.).
In such a situation, assuming Φ convex-concave and, say, continuously differentiable in x, y
for every ξ, the vector G(x, y, ξ) = [ ∂Φ(x,y,ξ)
∂x ; − ∂Φ(x,y,ξ)
∂y ] under minimal regularity assumptions
satisfies (5.3.58) and (5.3.59). Thus, we can mimic a stochastic first order oracle for the cost
function by generating a realization ξt ∼ P and returning the vector G(x, y, ξt ) (t is the serial
number of the call to the oracle, (x, y) is the corresponding query point). Consequently, we can
solve the saddle point problem by the SSPMD algorithm. Note that typically in a situation in
question it is extremely difficult, if at all possible, to compute φ in a closed analytic form and thus
to mimic a deterministic first order oracle for φ. We may say that in the “genuine” stochastic
saddle point problems, the necessity in stochastic oracle-oriented algorithms, like SSPMD, is
genuine as well. This being said, we intend to illustrate the potential of stochastic oracles and
associated algorithms by an example of a completely different flavour, where the saddle point
problem to be solved is a simply-looking deterministic one, and the rationale for randomization
is the desire to reduce its computational complexity.
The problem we are interested in is a bilinear saddle point problem on the direct product of
two simplexes:
min max φ(x, y) := aT x + y T Ax + bT y.
x∈∆n y∈∆m
Note that due to the structure of ∆n and ∆m , we can reduce the situation to the one where
a = 0, b = 0; indeed, it suffices to replace the original A with the matrix
By this reason, and in order to simplify notation, we pose the problem of interest as
This problem arises in several applications, e.g., in matrix games and `1 minimization under
`∞ -loss.
Matrix game interpretation of (5.3.78) (this problem is usually even called “matrix
game”) is as follows:
There are two players, every one selecting his action from a given finite set; w.l.o.g.,
we can assume that the actions of the first player form the set J = {1, ..., n}, and the
actions of the second player form the set I = {1, ..., .m}. If the first player chooses
an action j ∈ J, and the second – an action i ∈ I, the first pays to the second aij ,
where the matrix A = [aij ] 1≤i≤m, is known in advance to both the players. What
1≤j≤n
should rational players (interested to reduce their losses and to increase their gains)
do?
5.3. MIRROR DESCENT ALGORITHM 355
The answer is easy when the matrix has a saddle point – there exists (i∗ , j∗ ) such
that the entry ai∗ ,j∗ is the minimal one in its row and the maximal one it its column.
You can immediately verify that in this case the maximal entry in every column is at
least ai∗ ,j∗ , and the minimal entry in every row is at most ai∗ ,j∗ , meaning that when
the first player chooses some action j, the second one can take from him at least
ai∗ ,j∗ , and if the second player chooses an action i, the first can pay to the second at
most ai∗ ,j∗ . The bottom line is that rational players should stick to choices j∗ and
i∗ ; in this case, no one of them has an incentive to change his choice, provided that
the adversary sticks to his choice. We see that if A has a saddle point, the rational
players should use its components as their choices. Now, what to do if, as it typically
is the case, A has no saddle point? In this case there is no well-defined solution to
the game as it was posed – there is no pair (i∗ , j∗ ) such that the rational players
will use, as their respective choices, i∗ and j∗ . However, there still is a meaningful
definition of a solution to a slightly modified game – one in mixed strategies, due to
von Niemann and Morgenstern. Specifically, assume that the players play the game
not just once, but many times, and are interested in their average, over the time,
losses. In this case, they can stick to randomized choices, where at t-th tour, the
first player draws his choice from J according to some probability distribution on
J (this distribution is just a vector x ∈ ∆n , called mixed strategy of the first player),
and the second player draws his choice according to some probability distribution
on I (represented by a vector y ∈ ∆m , the mixed strategy of the second player). In
this case, assuming the draws independent of each other, the expected per-tour loss
of the first player (which is also the expected per-tour gain of the second player) is
y T Ax. Rational players would now look for a saddle point of φ(x, y) = y T Ax on
the domain ∆n × ∆m comprised of pairs of mixed strategies, and in this domain the
saddle point does exist.
`1 minimization with `∞ fit is the problem min kBξ − bk∞ , or, in the saddle point
ξ:kξk1 ≤1
form, minkξk1 ≤1 maxkηk1 ≤1 η T [Bξ − b]. Representing ξ = p − q, η = r − s with nonnegative
p, q, r, s, the problem can be rewritten in terms of x = [p; q], y = [r; s], specifically, as
" #
h
T T
i B −B
min max y Ax − y a [A = , a = [b; −b]]
x∈∆n y∈∆m −B B
Problem (5.3.78) is a nice bilinear saddle point problem. The best known so far algorithms
for solving large-scale instances of this problem within a medium accuracy find an -solution
(x , y ),
sad (x , y ) ≤
in O(1) ln(m) ln(n) kAk∞ + steps, where kAk∞ is the maximal magnitude of entries in A. A
p
single step requires O(1) matrix-vector multiplications involving the matrices A and AT , plus
O(1)(m + n) operations of computational overhead. Thus, the overall arithmetic cost of an
-solution is
A+
Idet = O(1) M
356 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
arithmetic operations (a.o.), where M is the maximum of m+n and the arithmetic cost of a pair
of matrix-vector multiplications, one involving A and the other one involving AT . The outlined
bound is achieved, e.g., by the Mirror Prox algorithm with Simplex setup, see section 5.5.4.3.
For dense m × n matrices A with no specific structure, the quantity M is O(mn), so that the
above complexity bound is quadratic in m, n and can became prohibitively large when m, n are
really large. Could we do better? The answer is “yes” — we can replace time-consuming in the
large scale case precise matrix-vector multiplication by its computationally cheap randomized
version.
E{G} = Bx
If you think of how could you implement this procedure, you immediately realize that all you
need is access to a generator producing a uniformly distributed on [0, 1] random number ξ; the
output of the outlined stochastic oracle will be a deterministic function of x and ξ, as required in
our model of stochastic oracle. Moreover, the number of operations to build G modulo generating
ξ is dominated by dim x operations to generate and the effort to extract from B a column
given its number, which we from now on will assume to be O(1)m, m being the dimension of
the column. Thus, the arithmetic cost of calling the oracle in the case of m × n matrix B is
O(1)(m + n) a.o. (note that for all practical purposes, generating ξ takes just O(1) a.o.). Note
also that the output G of our stochastic oracle satisfies
Now, the vector field F (x, y) associated with the bilinear saddle point problem (5.3.78) is
" #
T AT
F (x, y) = [Fx (y) = A y; Fy (x) = −Ax] = A[x; z], A = .
−A
We see that computing F (x, y) at a point (x, y) ∈ ∆n × ∆n can be randomized — we can get an
unbiased estimate G(x, y, ξ) of F (x, y) at the cost of O(1)(m + n) a.o., and the estimate satisfies
the condition
kG(x, y, ξ)k∞ ≤ kAk∞ . (5.3.79)
where α, β are positive weight coefficients. With this choice of the norm and the DGF, we
clearly ensure that ω is strongly convex modulus 1 with respect to k · k, while the ω-radius of
Z = ∆n × ∆m is given by
Ω2 = 2[α ln(n) + β ln(m)],
provided that m, n ≥ 3, which we assume from now on. Finally, the norm conjugate to k · k is
q
k[ξ; η]k∗ = α−1 kξk2∞ + β −1 kηk2∞
(why?), so that the quantity σ defined in (5.3.61) is, in view of (5.3.79), bounded from above by
s
1 1
b = kAk∞
σ + .
α β
It makes sense to choose α, β in order to minimize the error bound (5.3.63), or, which is the
same, to minimize Ωσ. One of the optimizers (they clearly are unique up to proportionality) is
1 1
α= p p p ,β = p p p ,
ln(n)[ ln(n) + ln(m)] ln(m)[ ln(n) + ln(m)]
which results in
√ √ q q
Ω= 2, Ωσ = 2 ln(n) + ln(m) kAk∞ .
T T 6Ωσ √ q q
kAk∞
E{sad (x , y )} ≤ √ =6 2 ln(n) + ln(m) √ . (5.3.80)
T T
Discussion. The number of steps T needed to make the right hand side in the bound (5.3.80)
at most a desired ≤ kAk∞ is
2
kAk∞
N () = O(1) ln(mn) ,
Recall that for a dense general-type matrix A, to solve (5.3.78) within accuracy deterministi-
cally costs
kAk∞
q
Cdet () = O(1) ln(m) ln(n)mn a.o.
We see that as far as the dependence of the arithmetic cost of an -solution on the desired relative
accuracy δ := /kAk∞ is concerned, the deterministic algorithm significantly outperforms the
randomized one. At the same time, the dependence of the cost on m, n is much better for the
randomized algorithm. As a result, when a moderate and fixed relative accuracy is sought and
m, n are large, the randomized algorithm by far outperforms the deterministic one.
Assessing the quality. Strictly speaking, the outlined comparison of the deterministic and
the randomized algorithms for (5.3.78) is not completely fair: with the deterministic algorithm,
we know for sure that the solution we end up with indeed is an -solution to (5.3.78), while
with the randomized algorithm we know only that the expected inaccuracy of our (random!)
approximate solution is ≤ ; nobody told us that a realization of this random solution zb we
actually get solves the problem within accuracy about . If we were able to quantify the actual
accuracy of zb by computing sad (zb), we could circumvent this difficulty as follows: let us choose
the number T of steps in our randomized algorithm in such a way that the resulting expected
inaccuracy is ≤ /2. After zb is built, we compute sad (zb); if this quantity is ≤ , we terminate
with -solution to (5.3.78) at hand. Otherwise (this “otherwise” may happen with probability
≤ 1/2 only, since the inaccuracy is nonnegative, and the expected inaccuracy is ≤ /2) we rerun
the algorithm, check the quality of the resulting solution, rerun the algorithm again when this
quality still is worse then , and so on, until either an -solution is found, or a chosen in advance
maximum number N of reruns is reached. With this approach, in order to ensure generating
an -solution to the problem with probability at least 1 − β we need N = O(ln(1/β)); thus,
even for pretty small β, N is just a moderate constant8 , and the randomized algorithm still can
outperform by far its 100%-reliable deterministic counterpart. The difficulty, however, is that
in order to compute sad (zb), we need to know F (z) exactly, since in our case
7
The first sublinear time algorithm for matrix games was proposed by M. Grigoriadis and L. Khachiyan in
1995 [13]. In retrospect, their algorithm is pretty close, although not identical, to SSPMD; both algorithms share
the same complexity bound. In [13] it is also proved that in the situation in question randomization is essential:
no deterministic algorithm for solving matrix games can exhibit sublinear time behaviour.
8
e.g., N = 20 for β = 1.e-6 and N = 40 for β = 1.e-12; note that β = 1.e-12 is, for all practical purposes, the
same as β = 0.
5.3. MIRROR DESCENT ALGORITHM 359
and even a single computation of F could be too expensive (and will definitely destroy the
sublinear time behaviour). Can we circumvent somehow the difficulty? The answer is positive.
Indeed, note that
A. For a general bilinear saddle point problem
h i
min max aT x + bT y + y T Ax
x∈X y∈Y
B. The randomized computation of F we have used in the case of (5.3.78) also is of a very
specific structure, specifically, as follows: we associate with z ∈ Z = X × Y a probability
distribution Pz which is supported on Z and is such that if ζ ∼ Pz , then E{ζ} = z, and
our stochastic oracle works as follows: in order to produce an unbiased estimate of F (z),
it draws a realization ζ from Pz and returns G = F (ζ).
The point is that when A and B are met, we can modify the SSPMD algorithm, preserving its
efficiency estimate and making F (z t ) readily available for every t. Specifically, at a step τ of
the algorithm, we have at our disposal a realization ζτ of the distribution Pzτ which we used to
compute G(zτ , ξτ ) = F (ζτ ). Now let us replace the rule for generating approximate solutions,
which originally was
" t #−1 t
X X
t t
z = z̄ := γτ γ τ zτ ,
τ =1 τ =1
with the rule " t #−1 t
X X
t t
z = zb := γτ γ τ ζτ .
τ =1 τ =1
since ζt ∼ Pzt and the latter distribution is supported on Z, we still have z t ∈ Z. Further, we
have computed F (ζτ ) exactly, and since F is affine, we just have
" t #−1 t
X X
t
F (z ) = γτ γτ F (ζτ ),
τ =1 τ =1
meaning that sequential computing F (z t ) costs us just additional O(m + n) a.o. per step and
thus increases the overall complexity by just a close to 1 absolute constant factor. What is
360 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
not immediately clear, is why the new rule for generating approximate solutions preserves the
efficiency estimate. The explanation is as follows (work it out in full details by yourself): The
proof of Theorem 5.3.6 started from the observation that for every u ∈ Z we have
t
X
γτ hG(zτ , ξτ ), zτ − ui ≤ Bt (ξ t ) (!)
τ =1
which, upon taking the expectations, led us to the desired efficiency estimate. Now we should
act as follows: it is immediately seen that (!) remains valid and reads
t
X
γτ hF (ζt ), zt − ui ≤ Bt (ξ t ),
τ =1
whence
t
X
γτ hF (ζt ), ζt − ui ≤ Bt (ξ T ) + Dt (ξ t ),
τ =1
where
t
X
Dt (ξ t ) = γτ hF (ζt ), ζt − zt i,
τ =1
which, exactly as in the original proof, leads us to a slightly modified version of (!!), specifically,
to " t #
X
γτ sad (zbt ) ≤ Bt (ξ t ) + Ct (ξ t ) + Dt (ξ t )
τ =1
with the same as in the original proof upper bound on E{Bt (ξ t ) + Ct (ξ t )}. When taking
expectations of the both sides of this inequality, we get exactly the same upper bound on
E{sad (...)} as in the original proof due to the following immediate observation:
E{Dt (ξ t )} = 0.
where we used the fact that A is skew symmetric. Now, the quantity
−hAζτ , zτ i + hc, ζτ − zτ i
(we again used the fact that A is skew symmetric). Thus, E{hF (ζτ ), ζτ − zτ i} = 0,
whence E{Dt (ξ t } = 0 as well.
Operational Exercise 5.3.1 The “covering story” for your tasks is as follows:
There are n houses in a city; i-th house contains wealth wi . A burglar chooses
a house, secretly arrives there and starts his business. A policeman also chooses
his location alongside one of the houses; the probability for the policeman to catch
the burglar is a known function of the distance d between their locations, say, it is
exp{−cd} with some coefficient c > 0 (you may think, e.g., that when the burglary
starts, an alarm sounds or 911 is called, and the chances of catching the burglar
depend on how long it takes from the policeman to arrive at the scene). This cat
and mouse game takes place every night. The goal is to find numerically near-optimal
mixed strategies of the burglar, who is interested to maximize his expected “profit”
[1−exp{−cd(i, j)}]wi (i is the house chosen by the burglar, j is the house at which the
policeman is located), and of the policeman who has a completely opposite interest.
Remark: Note that the sizes of problems of the outlined type can be really huge. For
example, let there be 1000 locations, and 3 burglars and 3 policemen instead of one
burglar and one policeman. Now there are about 1003 /3! “locations” of the burglar
side, and about 10003 /3! “locations” of the police side, which makes the sizes of A
well above 108 .
The model of the situation is a matrix game where the n × n matrix A has the entries
[1 − exp{−cd(i, j)}]wi ,
d(i, j) being the distance between the locations i and j, 1 ≤ i, j ≤ n. In the matrix game
associated with A, the policeman chooses mixed strategy x ∈ ∆n , and the burglar chooses a
mixed strategy y ∈ ∆n . Your task is to get an -approximate saddle point. We assume for
the sake of definiteness that the houses are located at the nodes of a square k × k grid, and
c = 2/D, where D is the maximal distance between a pair of houses. The distance d(i, j) is
either the Euclidean distance kPi − Pj k2 between the locations Pi , Pj of i-th and j-th houses on
the 2D plane, or d(i, j) = kPi − Pj k1 (“Manhattan distance” corresponding to rectangular road
network).
Task: Solve the problem within accuracy around 1.e-3 for k = 100 (i.e., n = k 2 = 10000). You
may play with
• “profile of wealth” w. For example, in my experiment reported on fig. 5.1, I used wi =
max[pi /(k − 1), qi /(k − 1)], where pi and qi are the coordinates of i-th house (in my
experiment the houses were located at the 2D points with integer coordinates p, q varying
from 0 to k − 1). You are welcome to use other profiles as well.
• distance between houses – Euclidean or Manhattan one (I used the Manhattan distance).
362 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
0.25 0.35
0.3
0.2
0.25
0.15
0.2
0.15
0.1
0.1
0.05
0.05
0 0
100 100
80 100 80 100
60 80 60 80
40 60 40 60
40 40
20 20
20 20
0 0 0 0
0.25 0.35
0.3
0.2
0.25
0.15
0.2
0.15
0.1
0.1
0.05
0.05
0 0
100 100
80 100 80 100
60 80 60 80
40 60 40 60
40 40
20 20
20 20
0 0 0 0
Figure 5.1: My experiment: 100 × 100 square grid of houses, T = 50, 000 steps of SSPMD, the stepsize
policy (5.3.81) with α = 20. At the resulting solution (x̄, ȳ), max [y T Ax̄] = 0.5016, min [ȳ T Ax] = 0.4985,
y∈∆n x∈∆n
meaning that the saddle point value is in-between 0.4985 and 0.5016, while the sum of inaccuracies,
in terms of the respective objectives, of the primal solution x̄ and the dual solution ȳ is at most
sad (x̄, ȳ) = 0.5016 − 0.4985 = 0.0031. It took 1857 sec to compute this solution on my laptop.
It took just 3.7 sec more to find a refined solution (x̂, ŷ) such that sad (x̂, ŷ) ≤ 9.e-9, and within 6 decimals
after the dot one has max [y T Ax̂] = 0.499998, min [ŷ T Ax] = 0.499998. It follows that up to 6 accuracy
y∈∆n x∈∆n
digits, the saddle point value is 0.499998 (not exactly 0.5!).
• The number of steps T and the stepsize policy. For example, along with the constant
stepsize policy (5.3.62), you can try other policies, e.g.,
Ω
γt = α √ , 1 ≤ t ≤ T. (5.3.81)
σ t
• ∗ Looking at your approximate solution, how could you try to improve its quality at a low
computational cost? Get an idea, implement it and look how it works.
You could also try to guess in advance what are good mixed strategies of the players and then
compare your guesses with the results of your computations.
5.3. MIRROR DESCENT ALGORITHM 363
1. a convex compact set X in Euclidean space E, with (X, E) equipped with proximal setup
– a norm k · k on E and a compatible with this norm DGF ω(·), which we assume to be
continuously differentiable on X;
3. access to stochastic oracle as follows: at t-th call to the oracle, the input to the oracle
being a query point xt ∈ X and a function ft (·) ∈ F, the oracle returns vector Gt =
Gt [ft (·), xt , ξt ] and a real gt = gt [ft (·), xt , ξt ], where for every t Gt [f (·), x, ξ] and gt [f (·), x, ξ]
are deterministic functions of f ∈ F, x ∈ X and a realization ξ of “oracle’s noise.” It is
further assumed that
Our crucial assumption (which is in force everywhere in this section) is that whatever be t, x
and f (·) ∈ F, we have
(the equality here stems from (5.3.82.a) combined with the facts that xt is a deterministic
function of ξ t−1 and that ξ0 , ξ1 , ξ2 , ... are i.i.d.). Note that we want to achieve the outlined goal
whatever be nature’s policy, that is, whatever be a sequence {Ft (ξ t−1 , x) : t ≥ 1} of legitimate
deterministic functions.
2. Step t = 1, 2, .... Given xt , call the oracle, xt being the query point, pay the corresponding
toll gt , observe Gt , set
Proof. Repeating word by word the reasoning in the proof of Theorem 5.3.3, we arrive at the
relation
N N
" #
X
t−1 1 λN 2 X
λt hGt [Ft (ξ , ·), xt , ξt ], xt − xi ≤ Ω∗ + λt γt kGt [Ft (ξ t−1 , ·), xt , ξt ]k2∗ .
t=1
2 γN t=1
which holds true for every x ∈ X. Taking expectations of both sides and invoking the fact that
{ξt } are i.i.d. in combination with (5.3.82.b) and the fact that xt is a deterministic function of
ξ t−1 , we arrive at
(N ) " N
#
1 λN 2
λt hFt0 (ξ t−1 , xt ), xt
X X
Eξ N − xi ≤ Ω∗ + σ 2 λ t γt , (5.3.85)
t=1
2 γN t=1
where Ft0 (ξ t−1 , xt ) is a subgradient, taken w.r.t. x, of the function Ft (ξ t−1 , x) at x = xt . Since
Ft (ξ t−1 , x) is convex in x, we have Ft (ξ t−1 , xt )−Ft (ξ t−1 , x) ≤ hFt0 (ξ t−1 , xt ), xt −xi, and therefore
(5.3.85) implies that for all x ∈ X it holds
(N ) " N
#
X 1 λN 2 X
Eξ N λt [Ft (ξ t−1 , xt ) − Ft (ξ t−1 , x)] ≤ Ω +σ 2
λt γt ,
t=1
2 γN ∗ t=1
whence also
λN 2 PN
N N + σ2
( ) ( )
1 X 1 X γN Ω∗ t=1 λt γt
Eξ N λt Ft (ξ t−1 , xt ) − min EξN λt Ft (ξ t−1 , x) ≤ .
ΛN t=1 x∈X ΛN t=1 2ΛN
Invoking the second representation of the expected regret in (5.3.83), the resulting relation is
exactly (5.3.84). 2
with α ≥ 0. This choice meets the requirements from the description of our algorithm and from
the premise of Theorem 5.3.8, and, similarly to what happened in section ??, results in
∀(N ≥ α) : √
n
1 PN t−1 , x )
o n
1 PN 1 + αΩ∗ σ o
Eξ N ΛN t=1 λt Ft (ξ t − min EξN ΛN
√
t=1 λt Ft ,(ξ t−1 , x) ≤ O(1)
x∈X N
(5.3.86)
with interpretation completely similar to the interpretation of the bound (5.3.44), see section
??.
“The nature” selects a sequence {zt : t ≥ 1} with terms zt from the d-element set
Zd = {1, ..., d}. We observe these terms one by one, and at every time t want to
predict zt via the past observations z1 , ..., zt−1 . Our loss at step t, our prediction
being zbt , is φ(zbt , zt ), where φ is a given loss function such that φ(z, z) = 0, z = 1, ..., d,
and φ(z 0 , z) > 0 when z 0 6= z. What we are interested in is to make the average, over
time horizon t = 1, ..., N , loss small.
What was just said is a “high level” imprecise setting. Missing details are as follows. We assume
that
1. the nature generates sequences z1 , z2 , ... in a randomized fashion. Specifically, there exists
an i.i.d. sequence of “primitive” random variables ζt , t = 0, 1, 2, ... observed one by one by
the nature, and zt is a deterministic function Zt (ζ t−1 ) of ζ t−1 9 . We refer to the collection
Z ∞ = {Zt (·) : t ≥ 1} as to the generation policy.
2. we are allowed for randomized prediction, where the predicted value zbt of zt is generated at
random according to probability distribution xt – a vector from the probabilistic simplex
X = {x ∈ Rd : x ≥ 0, dz=1 xz = 1}; an entry xzt in xt is the probability for our randomized
P
Introducing auxiliary sequence η0 , η1 , ... of independent of each other and of ζt ’s random variables
uniformly distributed on [0, 1), we can make our randomized predictions zbt ∼ xt deterministic
functions of xt and ηt :
zbt = Z̆(xt , ηt ),
where, for a probabilistic vector x and a real η ∈ [0, 1), Z̆(x, η) is the integer z ∈ Zd uniquely
defined by the relation z−1 ` Pz `
`=1 x ≤ η <
P
`=1 x (think of how would you sample from a proba-
bility distribution x on {1, ..., d}). As a result, setting ξt = [ζt ; ηt ] and thinking about the i.i.d.
sequence ξ0 , ξ1 , ... as of the input “driving” generation and prediction, the respective policies
being Z ∞ and X ∞ , we see that
• xt ’s are deterministic functions of z t−1 , specifically, Xt (z t−1 ), and thus are deterministic
functions of ξ t−1 : xt = X
e t (ξ t−1 );
• predictions zbt ’s are deterministic functions of (ζ t−1 , ηt ) and thus – of (ξ t−1 , ηt ): zbt =
Zbt (ξ t−1 , ηt ).
Note that functions Zt are fully determined by generation policy, while functions X
e t and Z
bt stem
from interaction between the policies for generation and for prediction and are fully determined
by these two policies.
Given a generation policy Z ∞ = {Zt (·) : t ≥ 1}, a prediction policy X ∞ = {Xt (·) : t ≥ 1}
and a time horizon N , we
9
as always, given a sequence u0 , u1 , u2 , ..., we denote by ut = (u0 , u1 , ..., ut ) the initial fragment, of length t + 1,
of the sequence. Similarly, for a sequence u1 , u2 , ..., ut stands for the initial fragment (u1 , ..., ut ) of length t of the
sequence.
5.3. MIRROR DESCENT ALGORITHM 367
that is, by the averaged over the time horizon expected prediction error as quantified by
our loss function φ, and
where the only yet undefined quantity Zbt,x (ζ t−1 , ηt ) is the “stationary counterpart” of
Zbt (ζ t−1 , ηt ), that is, Zbt,x (ζ t−1 , ηt ) is t-th prediction yielded by interaction between the
generation policy Z ∞ and the stationary prediction policy Xx∞ = {xt ≡ x : t ≥ 1}.
Thus, the regret is the excess of the average (over time and “driving input” ξ0 , ξ1 , ...)
prediction error yielded by prediction policy X ∞ , generation policy being Z ∞ , over the
similar average error of the best possible under the circumstances (under generation policy
Z ∞ ) stationary (xt = x, t ≥ 1) prediction policy.
Put broadly and vaguely, our goal is to build a prediction policy which, whatever be a generation
policy, makes the regret small provided N is large.
To achieve our goal, we are about to “translate” our prediction problem to the language of
stochastic regret minimization as presented in section 5.3.7.1. A sketch of the translation is as
follows. Let F be the set of d linear functions
d
X
`
f (x) = φ(z, `)xz : X → R, ` = 1, ..., d;
z=1
f ` (x) is just the expected error of the randomized prediction, the distribution of the prediction
being x ∈ X, in the case when the true value to be predicted is `. Note that f ` (·) is fully
determined by ` and “remembers” ` (since by our assumption on the loss function we have
f ` (`) = 0 and f ` (z) > 0 when z 6= `). Consequently, we lose nothing when assuming that the
nature, instead of generating a sequence {zt : t ≥ 1} of points from Zd , generates a sequence
of functions ft (·) = f zt (·) from F, t = 1, 2, ...; with this interpretation, a generation policy Z ∞
becomes what was in section 5.3.7.1 called nature’s policy, a policy which specifies t-th selection
ft (·) of nature as deterministically depending on ξ t−1 as on a parameter convex (in fact, linear)
function Ft (ξ t−1 , ·) : X → R.
Now, at a step t, after the nature shows us zt , we get full information on the function
ft (·) selected by the nature at this step, and thus know the gradient of this affine function;
this gradient can be taken as what in the decision making framework of section 5.3.7.1 was
called Gt . Besides this, after the nature reveals zt , we know which prediction error we have
made, and this error can be taken as what in section 5.3.7.1 was called the toll gt . On a
straightforward inspection which we leave to the reader, the outlined translation converts the
regret minimization form of our prediction problem into the problem of stochastic online regret
minimization considered in section 5.3.7.1 and ensures validity of all assumptions of Theorem
5.3.8. As a result, we conclude that
368 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
(!) The MD algorithm from section 5.3.7.2 can be straightforwardly applied to the
Sequence Prediction problem; with proper proximal setup (namely, the Simplex one)
for the probabilistic simplex X and properly selected stepsize policy, this algorithm
∞ which guarantees the regret bound
yields prediction policy XMD
p
ln(d + 1) max |φ(z, `)|
∞ z,`
RN [XMD |Z ∞ ] ≤ O(1) √ . (5.3.88)
N
whatever be N = 1, 2, ... and a sequence generation policy Z ∞ .
√
We see that MD allows to push regret to 0 as N → ∞ at an O(1/ N ) rate and uniformly
in sequence generation policies of the nature. As a result, when sequence generating policy used
by the nature is “well suited” for stationary prediction, i.e., nature’s policy makes small the
comparator
N
( )
1 X t−1 t−1
inf Eζ N ,ηN φ(Zt,x (ζ , ηt ), Zt (ζ ))
b (5.3.89)
x∈X N t=1
∞ is good provided N is large. In
in (5.3.87), (!) says that the MD-based prediction policy XMD
the examples to follow, we use the simplest possible loss function: φ(z, `) = 0 when z = ` and
φ(z, `) = 1 otherwise.
Example 1: i.i.d. generation policy. Assume that the nature generates zt by drawing it,
independently across t, from some distribution p (let us call this generation policy an i.i.d. one).
It is immediately seen that due to the i.i.d. nature of z1 , z2 , ..., there is no way to predict zt via
z1 , ..., zt−1 too well, even when we know p in advance; specifically, the smallest possible expected
prediction error is
(p) := 1 − max pz ,
1≤z≤d
and the corresponding optimal prediction policy is to say all the time that the predicted value
of zt is the most probable value z∗ (the index of the largest entry in p). Were p known to us
in advance, we could use this trivial optimal predictor from the very beginning. On the other
hand, the optimal predictor is stationary, meaning that with the generating policy in question,
the comparator is equal to (p). As a result, (!) implies that our MD-based predictor policy is
“asymptotically optimal” in the i.i.d. case: whatever be an i.i.d. generation policy used by the
nature, the average, over time horizon N , expected prediction error of the MD-based
p prediction
√
can be worse than the best one can get under the circumstances by at most O(1) ln(d + 1)/ N .
A good news is that when designing the MD-based policy, we did not assume that the generation
policy is an i.i.d. one; in fact, we made no assumptions on this policy at all!
that is, it is [pN ], where pN is the empirical distribution of z1 , ..., zN . We see that out MD-
based predictor behaves reasonably well on individual sequences which are “nearly stationary,”
5.4. BUNDLE MIRROR ALGORITHM 369
yielding results like: “if the fraction of entries different from 1 in any initial fragment, of length
at least N0 , of z ∞ is at most 0.05, then the percent of errors in the MD-based prediction of
∞
p
the first N ≥ N0 entries of z will be at most 0.05 + O(1) ln(d + 1)/N .” Not that much of a
claim, but better than nothing!
Of course, there exist “easy to predict” individual sequences (like the alternating sequence
1, 2, 1, 2, 1, 2, ...) where the performance of any stationary predictor is poor; in these cases, we
cannot expect much from our MD-based predictor.
Note that BM is in the same relation to Bundle Level method from section 5.2.2 as MD is to
SD.
The assumptions on the problem and the setups for these algorithms are exactly the same as
for MD; in particular, X is assumed to be bounded (see Standing Assumption, section 5.3.4.2).
A. The algorithm generates a sequence of search points, all belonging to X, where the
First Order oracle is called, and at every step builds the following entities:
1. the best found so far value of f along with the best found so far search point, which is
treated as the current approximate solution built by the method;
• `s = fs + λ(f s − fs ), where
– f s is the best value of the objective known at the time when the phase starts;
– fs is the lower bound on f∗ we have at our disposal when the phase starts;
– λ ∈ (0, 1) is a parameter of the method.
In principle, the prox-center c1 corresponding to the very first phase can be chosen in X o
in an arbitrary fashion; we shall discuss details in the sequel. We start the entire process with
computing f , f 0 at this prox-center, which results in
f 1 = f (c1 )
and set
f1 = min[f (c1 ) + (x − c1 )T f 0 (x1 )],
x∈X
1. When generating xt , we already have at our disposal xt−1 and a convex compact set
Xt−1 ⊂ X such that
and
(at−1 ) x ∈ X\Xt−1 ⇒ f (x) > `s ;
(bt−1 ) xt−1 ∈ argmin ωs . (5.4.2)
Xt−1
(a) Let fe ≤ +∞ be the optimal value in (Lt−1 ) (as always, fe = ∞ means that the
problem is infeasible). Observe that the quantity
fb = min[fe, `s ]
is a lower bound on f∗ : in X\Xt−1 we have f (x) > `s by (5.4.2.at−1 ), while on Xt−1
we have f (x) ≥ fe due to the inequality f (x) ≥ gt−1 (x) given by the convexity of f .
Thus, f (x) ≥ min[`s , fe] everywhere on X.
In the case of
fb ≥ `s − θ(`s − fs ) (5.4.3)
(“significant” progress in the lower bound), we terminate the phase and update the
lower bound on f∗ as
fs 7→ fs+1 = fb.
The prox-center cs+1 for the new phase can, in principle, be chosen in X o in an
arbitrary fashion (see below).
(b) When no significant progress in the lower bound is observed, we solve the optimization
problem
min {ωs (x) : x ∈ Xt−1 , gt−1 (x) ≤ `s } . (Pt−1 ).
x
This problem is feasible, and moreover, its feasible set is of the same dimension as X.
Indeed, since Xt−1 is of the same dimension as X, the only case when the set
{x ∈ Xt−1 : gt−1 (x) ≤ `s } would lack the same property could be the case of
fe = min gt−1 (x) ≥ `s , whence fb = `s , and therefore (5.4.3) takes place, which is
x∈Xt−1
not the case.
When solving (Pt−1 ), we get the optimal solution xt of this problem and compute
f (xt ), f 0 (xt ). It can be proved (see below) that xt ∈ X o , as required from all our
search points, and that with ωs0 (x) := ω 0 (x) − ω 0 (cs ) one has
hωs0 (xt ), x − xt i ≥ 0, ∀(x ∈ Xt−1 : gt (xt ) ≤ `s ). (5.4.4)
It is possible that we get a “significant” progress in the objective, specifically, that
f (xt ) − `s ≤ θ(f s − `s ), (5.4.5)
where θ ∈ (0, 1) is a parameter of the method. In this case, we again terminate the
phase and set
f s+1 = f (xt );
the lower bound fs+1 on the optimal value is chosen as the maximum of fs and of all
lower bounds fb generated during the phase. The prox-center cs+1 for the new phase,
same as above, can in principle be chosen in X o in an arbitrary fashion.
(c) When (Pt−1 ) is feasible and (5.4.5) does not take place, we continue the phase s,
choosing as Xt an arbitrary convex compact set such that
X t ≡ {x ∈ Xt−1 : gt−1 (x) ≤ `s } ⊂ Xt ⊂ X t ≡ {x ∈ X : (x−xt )T ωs0 (xt ) ≥ 0}. (5.4.6)
Note that we are in the case when (Pt−1 ) is feasible and xt is the optimal solution to
the problem; as we have claimed, this implies (5.4.4), so that
∅=
6 X t ⊂ X t,
372 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
and thus (5.4.6) indeed allows to choose Xt . Since, as we have seen, X t is of the same
dimension as X, this property will be automatically inherited by Xt , as required by
(ot ). Besides this, every choice of Xt compatible with (5.4.6) ensures (5.4.2.at ) and
(5.4.2.bt ); the first relation is clearly ensured by the left inclusion in (5.4.6) combined
with (5.4.2.at−1 ) and the fact that f (x) ≥ gt−1 (x), while the second relation (5.4.2.bt )
follows from the right inclusion in (5.4.6) due to the convexity of ωs (·).
Clearing debts. We have claimed that when xt is well defined, ωs0 (xt ) exists and (5.4.4)
is satisfied. This is evident when ω(·) is continuously differentiable on the entire X, and
needs certain justification otherwise. Here is the justification. W.l.o.g. we can assume that
X is full-dimensional, whence, as we have already explained, the set Y = {x ∈ Xt−1 :
gt−1 (x) ≤ `s } also is full-dimensional (recall that we are in the case when the phase s does
not terminate at step t due to significant progress in the lower bound on f∗ ). Let x̄ ∈ int Y ,
let V ⊂ Y be a ball of positive radius ρ centered at x̄, and let x(r) = xt + r(x̄ − xt ), so that
the function ψ(r) = ωs (x(r)) (which clearly is convex) attains its minimum on [0, 1] at r = 0.
For 0 < r ≤ 1 we have x(r) ∈ int X, so that g(r) := ωs0 (x(r)) exists. Since ψ(r) ≥ ψ(0)
and ψ is convex, we have 0 ≤ ψ 0 (r) = hg(r), x̄ − xt i and besides this, for khk2 ≤ ρ we have
hg(r), x̄ + h − x(r)i ≤ ωs (x̄ + h) − ωs (x(r)) ≤ C, where C is independent of r and h. In other
words,
meaning that the vectors g(r) = ωs0 (x(r)) remain bounded as r → +0. It follows that ωs
(and then ω) does have a subgradient at xt = x(0) (you can take as such a subgradient every
limiting, as r → +0, point of the curve g(r)). Recalling the smoothness properties of ω, we
conclude that ωs0 (xt ) is well defined. With this in mind, (5.4.4) is proved by exactly the same
argument as we have used to establish Fact 5.3.1.
s = f s − fs
By its origin, the gap is nonnegative and nonincreasing in s; besides this, it clearly is an upper
bound on the inaccuracy, in terms of the objective, of the approximate solution z s we have at
our disposal at the beginning of phase s.
The convergence and the complexity properties of the BM algorithm are given by the fol-
lowing statement.
Theorem 5.4.1 (i) The number Ns of oracle calls at a phase s can be bounded from above as
follows:
5Ω2s L2k·k (f ) r
Ns ≤ 2 , Ω s = 2 max[ω(y) − ω(cs ) − hω 0 (cs ), y − cs i]. (5.4.7)
θ (1 − λ)2 2s y∈X
Ωs ≤ Ω
b < ∞, s = 1, 2, ... (5.4.8)
5.4. BUNDLE MIRROR ALGORITHM 373
Then for every > 0 the total number of oracle calls before the phase s with s ≤ is started
(i.e., before -solution to the problem is built) does not exceed
b 2 L2 (f )
Ω k·k
N () = c(θ, λ) (5.4.9)
2
with an appropriate c(θ, λ) depending solely and continuously on θ, λ ∈ (0, 1).
Proof. (i): Assume that phase s did not terminate in course of N steps. Observe that then
θ(1 − λ)s
kxt − xt−1 k ≥ , 1 ≤ t ≤ N. (5.4.10)
Lk·k (f )
Indeed, we have gt−1 (xt ) ≤ `s by construction of xt and gt−1 (xt−1 ) = f (xt−1 ) > `s + θ(f s − `s ), since
otherwise the phase would be terminated at the step t − 1. It follows that gt−1 (xt−1 ) − gt−1 (xt ) >
θ(f s − `s ) = θ(1 − λ)s . Taking into account that gt−1 (·), due to its origin, is Lipschitz continuous w.r.t.
k · k with constant Lk·k (f ), (5.4.10) follows.
Now observe that xt−1 is the minimizer of ωs on Xt−1 by (5.4.2.at−1 ), and the latter set, by con-
struction, contains xt , whence hxt − xt−1 , ωs0 (xt−1 )i ≥ 0 (see (5.4.4)). Applying (5.4.1), we get
2
1 θ(1 − λ)s
ωs (xt ) ≥ ωs (xt−1 ) + , t = 1, ..., N,
2 Lk·k (f )
whence 2
N θ(1 − λ)s
ωs (xN ) − ωs (x0 ) ≥ .
2 Lk·k (f )
Ω2s
The latter relation, due to the evident inequality max ωs (x)−min ωs ≤ 2 (readily given by the definition
X X
of Ωs and ωs ) implies that
Ω2s L2k·k (f )
N≤ .
θ2 (1 − λ)2 2s
Recalling the origin of N , we conclude that
Ω2s L2k·k (f )
Ns ≤ + 1.
θ2 (1 − λ)2 2s
All we need in order to get from this inequality the required relation (5.4.7) is to demonstrate that
Ω2s L2k·k (f ) 1
≥ , (5.4.11)
θ2 (1 − λ)2 2s 4
as required in (5.4.12). The only other possibility is that the phase s was terminated when the relation
(5.4.5) took place. In this case,
s+1 = f s+1 − fs+1 ≤ f s+1 − fs ≤ `s + θ(f s − `s ) − fs = λs + θ(1 − λ)s = (1 − (1 − θ)(1 − λ))s ,
A. Setting all the time Xt = X t , we ensure that Xt is cut off X by a single linear inequality;
B. Setting all the time Xt = X t , we ensure that Xt is cut off X by t linear inequalities (so
that the larger is t, the “more complicated” is the description of Xt );
C. We can choose something in-between the above extremes. For example, assume that we
have chosen certain m and are ready to work with Xt ’s cut off X by at most m linear
inequalities. In this case, we could use the policy B. at the initial steps of a phase, until
the number of linear inequalities in the description of Xt−1 reaches the maximal allowed
value m. At the step t, we are supposed to choose Xt in-between the two sets
where
implying that the linear constraints hj (x) ≤ 0 (which we w.l.o.g. assume to be nonconstant)
satisfy the Slater condition on int X: there exists x̄ ∈ int X such that hj (x̄) ≤ 0, j ≤ m.
Applying Convex Programming Duality Theorem (Theorem D.2.2), we get
m
X
Opt = max G(λ), G(λ) = min[gt−1 (x) + λj hj (x)]. (D)
λ≥0 x∈X
j=1
Now, assuming that prox-mapping associated with ω(·) is easy to compute, it is equally easy to
minimize an affine function over X (why?), so that G(λ) is easy to compute:
m
X m
X
G(λ) = gt−1 (xλ ) + λj hj (xλ ), xλ ∈ Argmin[gt−1 (x) + λj hj (x)].
j=1 x∈X j=1
and thus is as readily available as the value G(λ) of G. It follows that we can compute Opt (which
is all we need as far as the auxiliary problem (Lt−1 ) is concerned) by solving the m-dimensional
convex program (D) by a whatever rapidly converging first order method11 .
The situation with the second auxiliary problem
is similar. Here again we are interested to solve the problem only in the case when its feasible
domain is full-dimensional (recall that we w.l.o.g. have assumed X to be full-dimensional), and
the constraints cutting the feasible domain off X are linear, so that we can write (Pt−1 ) down
as the problem
min {ωs (x) : hj (x) ≤ 0, 1 ≤ j ≤ m + 1}
x
11
e.g., by the Ellipsoid algorithm, provided m is in range of tens.
376 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
Here again, the concave and well defined dual objective H(λ) can be equipped with a cheap
First Order oracle (provided, as above, that the prox-mapping associated with ω(·) is easy to
compute), so that we can reduce the auxiliary problem (Pt−1 ) to solving a low-dimensional
(provided m is not large) black-box-represented convex program by a first order method. Note
that now we want to build a high-accuracy approximation to the optimal solution of the problem
of actual interest (Pt−1 ) rather than to approximate well its optimal value, and a high-accuracy
solution to the Lagrange dual of a convex program not always allows to recover a good solution
to the program itself. Fortunately, in our case the objective of the problem of actual interest
(Pt−1 ) is strongly convex, and in this case a high accuracy solution λ to the dual problem (Dt−1 ),
the accuracy being measured in terms of the dual objective H, does produce a high-accuracy
solution
m+1
X
xλ = argmin ωs (x) + λj hj (x)
X j=1
to (Pt−1 ), the accuracy being measured in terms of the distance to the precise solution xt to the
latter problem.
The bottom line is that when the “memory depth” m (which is fully controlled by us!) is
not too big and the prox-mapping associated with ω is easy to compute, the auxiliary problems
arising in the BM algorithm can be solved, at a relatively low computational cost, via Lagrange
duality combined with good high-accuracy first order algorithm for solving black-box-represented
low dimensional convex programs.
Thus, we can choose cs ∈ X as we want, at the price of setting Ω b = Ω.e Note that when X
is either a Euclidean ball/box and Ball setup is used, or is a `1 `2 ball, or nuclear norm ball,
or spectahedron, and the respective of `1 /`2 /Nuclear norm/Spectahedron setups is used, the
outlined price is not too high (since on a closest inspection, the quantities Ω e in these cases
are, up to moderate dimension independent (or nearly so) factors, the best possible). Practical
experience says that in the situations in question, when there is a full freedom in choosing
prox-centers, a reasonable policy is to use as cs the best, in terms of the values of f , search point
generated so far.
5.4. BUNDLE MIRROR ALGORITHM 377
Figure 5.1. Ring with 360 detectors, field of view and a line of response
The set Xt summarizes, in a sense, all the information on f we have accumulated so far and
intend to use in the sequel. Relation (5.4.6) allows for a tradeoff between the quality (and the
volume) of this information and the computational effort required to solve the auxiliary problems
(Pt−1 ). With no restrictions on this effort, the most promising policy for updating Xt ’s would be
to set Xt = X t−1 (“collecting information with no compression of it”). With this policy the BM
algorithm with the ball setup is basically identical to the Prox-Level Algorithm of Lemarechal,
Nemirovski and Nesterov [18]; the “truncated memory” version of the latter method (that is,
the generic BM algorithm with ball setup) was proposed by Kiwiel [17]. Aside of theoretical
complexity bounds similar to (5.3.37), most of bundle methods (in particular, the Prox-Level
one) share the following experimentally observed property: the practical performance of the
algorithm is in full accordance with the complexity bound (5.1.2): every n steps reduce the
inaccuracy by at least an absolute constant factor (something like 3). This property is very
attractive in moderate dimensions, where we indeed are capable to carry out several times the
dimension number of steps.
The model. We process simulated measurements as if they were registered by a ring of 360
detectors, the inner radius of the ring being 1 (Fig. 5.1). The field of view is a concentric circle
of the radius 0.9, and it is covered by the 129×129 rectangular grid. The grid partitions the field
of view into 10, 471 pixels, and we act as if tracer’s density was constant in every pixel. Thus,
the design dimension of the problem (PET0 ) we are interested to solve is “just” n = 10471.
The number of bins (i.e., number m of log-terms in the objective of (PET0 )) is 39784, while
the number of nonzeros among qij is 3,746,832.
378 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
The true image is “black and white” (the density in every pixel is either 1 or 0). The
measurement time (which is responsible for the level of noise in the measurements) is mimicked
as follows: we model the measurements according to the Poisson model as if during the period
of measurements the expected number of positrons emitted by a single pixel with unit density
was a given number M .
The algorithm we are using to solve (PET0 ) is the plain BM method with the simplex setup
and the sets Xt cut off X = ∆n by just one linear inequality:
λ = 0.95, θ = 0.5.
The approximate solution reported by the algorithm at a step is the best found so far search
point (the one with the best value of the objective we have seen to the moment).
The results of two sample runs we are about to present are not that bad.
Experiment 2: Noisy measurements (40 LOR’s per pixel with unit density, to-
tally 63,092 LOR’s registered). The pictures are presented at Fig. 5.4. Here are the
numbers. With noisy measurements, we have no a priori knowledge of the true optimal value
in (PET0 ); in simulated experiments, a kind of orientation is given by the value of the objective
at the true image (which is hopefully a close to f∗ upper bound on f∗ ). In our experiment, this
bound equals to -0.8827. The best value of the objective found in 115 oracle calls is -0.8976
(which is less that the objective at the true image in fact, the algorithm went below the value
of f at the true image already after 35 oracle calls). The upper bound on the optimality gap
at termination is 9.7e-4. The progress in accuracy is plotted on Fig. 5.5. We have built totally
115 search points; the entire computation took 200 4100 .
3D PET. As it happened, the BT algorithm was not tested on actual clinical 3D PET data.
However, in late 1990’s we were participating in a EU project on 3D PET Image Reconstruction;
in this project, among other things, we did apply to the actual 3D clinical data an algorithm
which, up to minor details, is pretty close to the MD with Entropy setup as applied to convex
minimization over the standard simplex (details can be found in [2]). In the sequel, with slight
12
Do not be surprised by the obsolete hardware – the computations were carried out in 2002
5.4. BUNDLE MIRROR ALGORITHM 379
True image: 10 “hot spots” x1 = n−1 (1, ..., 1)T x2 – some traces of 8 spots
f = 2.817 f = 3.247 f = 3.185
x3 – traces of 8 spots x5 – some trace of 9-th spot x8 – 10-th spot still missing...
f = 3.126 f = 3.016 f = 2.869
x24 – trace of 10-th spot x27 – all 10 spots in place x31 – that is it...
f = 2.828 f = 2.823 f = 2.818
Figure 5.2. Reconstruction from noiseless measurements
380 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
0
10
−1
10
−2
10
−3
10
−4
10
0 20 40 60 80 100 120
abuse of terminology, we refer to this algorithm as to MD, and present some experimental data
on its the practical performance in order to demonstrate that simple optimization techniques
indeed have chances when applied to really huge convex programs. We restrict ourselves with a
single numerical example – real clinical brain study carried out on a very powerful PET scanner.
In the corresponding problem (PET0 ), there are n = 2, 763, 635 design variables (this is the
number of voxels in the field of view) and about 25,000,000 log-terms in the objective. The
reconstruction was carried out on the INTEL Marlinspike Windows NT Workstation (500 MHz
1Mb Cache INTEL Pentium III Xeon processor, 2GB RAM; this hardware, completely obsolete
now, was state-of-the-art one in late 1990s). A single call to the First Order oracle (a single
computation of the value and the subgradient of f ) took about 90 min. Pictures of clinically
acceptable quality were obtained after just four calls to the oracle (as it was the case with other
sets of PET data); for research purposes, we carried out 6 additional steps of the algorithm.
The pictures presented on Fig. 5.6 are slices – 2D cross-sections – of the reconstructed 3D
image. Two series of pictures shown on Fig. 5.6 correspond to two different versions, plain
and “ordered subsets” one (MD and OSMD, respectively), of the method, see [2] for details.
Relevant numbers are presented in Table 5.1. The best known lower bound on the optimal value
in the problem is -2.050; MD and OSMD decrease in 10 oracle calls the objective from its initial
value -1.436 to the values -2.009 and -2.016, respectively (optimality gaps 4.e-2 and 3.5e-2) and
reduce the initial inaccuracy in terms of the objective by factors 15.3 (MD) and 17.5 (OSMD).
5.4. BUNDLE MIRROR ALGORITHM 381
True image: 10 “hot spots” x1 = n−1 (1, ..., 1)T x2 – light traces of 5 spots
f = −0.883 f = −0.452 f = −0.520
x12 – all 10 spots in place x35 – all 10 spots in place x43 – ...
f = −0.872 f = −0.886 f = −0.896
Figure 5.4. Reconstruction from noisy measurements
0
10
−1
10
−2
10
−3
10
−4
10
0 20 40 60 80 100 120
where X is a closed convex set in a Euclidean space E equipped with norm k · k (not necessarily
the Euclidean one), Ψ(x) : X → R ∪ {+∞} is a lower semicontinuous convex function which
is finite on the relative interior of X, and f : X → R is a convex function. In this section, we
FAST FIRST ORDER MINIMIZATION AND SADDLE POINTS 383
focus on the smooth composite case, where f has Lipschitz continuous gradient:
is solvable with the unique solution z = argminx∈X [h(x) + ω(x)] belonging to X o and fully
characterized by the property
Note that for h as above, the set argmin x∈X {h(x) + ω(x)} is nonempty, since it clearly contains
the point argminx∈X {h(x) + ω(x)}.
The algorithm we are about to develop requires the ability to find, given ξ ∈ E and α ≥ 0,
a point from the set
argmin {hξ, xi + αΨ(x) + ω(x)}
x∈X
where α > 0 and 1 < p ≤ 2. This problem has a closed form solution (find it!) which can be
computed in O(dim E) a.o.
B. Low rank recovery: E = Mµν , k · k = k · knuc is the nuclear norm. The rationale behind
this setting is similar to the one for sparse recovery, but now the trade off is between the model
fitting and the rank of the optimal solution (which is a block-diagonal matrix) to (5.5.4).
In the case in question, for both the Euclidean and the Nuclear norm proximal setups (in
the latter case, one should take X = E and use the DGF (5.3.17)), so that computing composite
prox-mapping reduces to solving the optimization problem
!2/p
Xn m
p
X X
min Tr(y j [bj ]T ) + αkσ(y)k1 + σi (y) , m= µj , (5.5.6)
y=Diag{y 1 ,...,y n }∈Mµν
j=1 i=1 j
where b = Diag{b1 , ..., bn } ∈ Mµν , α > 0 and p ∈ (1, 2]. Same as in the case of the usual prox-
mapping associated with the Nuclear norm setup (see section 5.3.3), after computing the svd of
b the problem reduces to the similar one with diagonal bj ’s: bj = [Diag{β1j , ..., βµj j }, 0µj ×(νj −µj ) ],
where βkj ≥ 0. By the same symmetry argument as in section 5.3.3, in the case of diagonal bj ,
(5.5.6) has an optimal solution with diagonal y j ’s. Denoting sjk , 1 ≤ k ≤ j, the diagonal entries
386 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
which is of exactly the same structure as (5.5.5) and thus admits a closed form solution which
can be computed at the cost of O(m) a.o. We see that in the case in question computing com-
posite prox-mapping is no more involved that computing the usual prox-mapping and reduces,
essentially, to finding the svd of a matrix from Mµν .
In order to ensure O(1/t2 ) convergence, the sequence should satisfy certain conditions which
can be verified online; to meet these conditions, the sequence can be adjusted online. We shall
discuss this issue in-depth later.
Let us fix two tolerances ≥ 0, ¯ ≥ 0. The algorithm is as follows:
• Initialization: Set y0 = xω := argminx∈X ω(x), ψ0 (w) = Vxω (w), and select y0+ ∈ X such
that φ(y0+ ) ≤ φ(y0 ). Set A0 = 0.
1. Compute
zt ∈ argmin ψt (w). (5.5.8)
w∈X
and set
at+1
At+1 = At + at+1 , τt = . (5.5.10)
At+1
Note that τt ∈ (0, 1].
3. Set
xt+1 = τt zt + (1 − τt )yt+ . (5.5.11)
and compute f (xt+1 ), ∇f (xt+1 ).
4. Compute
5. Set
1
δt = Vzt (x̂t+1 ) + h∇f (xt+1 ), yt+1 − xt+1 i + f (xt+1 ) − f (yt+1 ) (5.5.14)
At+1
+ +
and select somehow yt+1 ∈ X in such a way that φ(yt+1 ) ≤ φ(yt+1 ).
Step t is completed, and we pass to step t + 1.
+
The approximate solution generated by the method in course of t steps is yt+1 .
In the sequel, we refer to the just described algorithm as to Fast Composite Gradient Method,
nicknamed FGM.
Theorem 5.5.1 FGM is well defined and ensures that zt ∈ X o , xt+1 , x̂t+1 , yt , yt+ ∈ X for all
t ≥ 0. Assuming that (5.5.7) holds true and that δt ≥ 0 for all t (the latter definitely is the case
when Lt = Lf for all t), one has
4Lf
yt , yt+ ∈ X, φ(yt+ ) − φ∗ ≤ φ(yt ) − φ∗ ≤ A−1
t [Vxω (x∗ ) + 2t] ≤ [Vxω (x∗ ) + 2t] (5.5.15)
t2
for t = 1, 2, ...
A step of the algorithm, modulo updates yt 7→ yt+ , reduces to computing the value and the
gradient of f at a point, computing the value of f at another point and to computing within
accuracy two prox mappings.
Comments. The O(1/t2 ) efficiency estimate (5.5.15) is conditional, the condition being that
the quantities δt defined by (5.5.14) are nonnegative for all t and that 0 < Lt ≤ Lf for all t.
Invoking (5.5.3), we see that at every step t, setting Lt = Lf ensures that δt ≥ 0. In other
words, we can safely use Lt ≡ Lf for all t. However, from the description of At ’s it follows that
the less are Lt ’s, the larger are At ’s and thus the better is the efficiency estimate (5.5.15). This
observations suggests more “aggressive” policies aimed at operating with Lt ’s much smaller than
Lf . A typical policy of this type is as follows: at the beginning of step t, we have at our disposal
a trial value L0t ≤ Lf of Lt . We start with carrying out step t with L0t in the role of Lt ; if we are
lucky and this choice of Lt results in δt ≥ 0, we pass to step t + 1 and reduce the trial value of L,
say, set L0t+1 = L0t /2. If with Lt = L0t we get δt < 0, we rerun step t with increased value of Lt ,
say, with Lt = min[2L0t , Lf ] or even with Lt = Lf . If this new value of Lt still does not result in
δt ≥ 0, we rerun step t with larger value of Lt , say, with Lt = min[4L0t , Lf ], and proceed in this
fashion until arriving at Lt ≤ Lf resulting in δt ≥ 0. We can maintain a whatever upper bound
k ≥ 1 on the number of trials at a step, just by setting Lt = Lf at k-th trial. As far as the
evolution of initial trial values L0t of Lt in time is concerned, a reasonable “aggressive” policy
could be “set L0t+1 to a fixed fraction (say, 1/2) of the last (the one resulting in δt ≥ 0) value of
Lt tested at step t.”
In spite of the fact that every trial requires some computational effort, the outlined on-line
adjustment of Lt ’s usually significantly outperforms the “safe” policy Lt ≡ Lf .
388 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
with nonnegative αt and affine `t (·), whence, taking into account that τt ∈ (0, 1], the algorithm
is well defined and maintains the inclusions zt ∈ X o , xt , x̂t+1 , yt , yt+ ∈ X.
Besides this, (5.5.16), (5.5.12) show that computing zt , same as computing x̂t+1 , reduces to
computing within accuracy composite prox-mappings, and the remaining computational effort
at step t is dominated by the necessity to compute f (xt+1 ), ∇f (xt+1 ), and f (yt+1 ), as stated in
Theorem.
20 . Observe that by construction
t
X
At = ai (5.5.17)
i=1
1 At+1 At + at+1 Lt
= 2 = = , t≥0 (5.5.18)
2τt2 At+1 2at+1 2a2t+1 2
(see (5.5.9) and (5.5.10)). Observe also that the initialization rules, (5.5.17.a) and (5.5.13.b)
imply that whenever t ≥ 1, setting λti = ai /At we have
t
a [h∇f (xi ), w − xi i + f (xi ) + Ψ(w)] + Vxω (w)
P
ψt (w) =
P i
i=1
= At ti=1 λti [h∇f (xi ), w − xi i + f (xi ) + Ψ(w)] + Vxω (w), (5.5.19)
Pt t Pt ai
i=1 λi = i=1 At = 1.
Proof. Invoking the definition of argmin , we get w̄ ∈ X o , and for some h0 (w̄) ∈ ∂X h(w̄) we
have
hh0 (w̄) + ω 0 (w̄), w − w̄i ≥ − ∀w ∈ X,
implying by convexity that
∀w ∈ X :
h(w) + ω(w) ≥ [h(w̄) + hh0 (w̄), w − w̄i] + ω(w)
= [h(w̄) + ω(w̄) + hh0 (w̄) + ω 0 (w̄), w − w̄i] + Vw̄ (w),
| {z }
≥−
as required. 2
FAST FIRST ORDER MINIMIZATION AND SADDLE POINTS 389
Invoking (5.5.13.b), the resulting inequality implies that for all w ∈ X it holds
h i
ψt+1 (w) ≥ Vzt (w) + At f (xt+1 ) + h∇f (xt+1 ), yt+ − xt+1 i + Ψ(yt+ )
(5.5.21)
+at+1 [f (xt+1 ) + h∇f (xt+1 ), w − xt+1 i + Ψ(w)] − − Bt .
By (5.5.11) and (5.5.10) we have At (yt+ − xt+1 ) − at+1 xt+1 = −at+1 zt , whence (5.5.21) implies
that
ψt+1 (w) ≥ Vzt (w) + At+1 f (xt+1 ) + At Ψ(yt+ ) + at+1 [h∇f (xt+1 ), w − zt i + Ψ(w)] − − Bt
(5.5.22)
for all w ∈ X. Thus,
∗
ψt+1 = minw∈X ψt+1 (w)
=h(w)+ω(w) with convex lover semicondinuous h
z }| {
≥ min { Vzt (w) + At+1 f (xt+1 ) + At Ψ(yt+ ) + at+1 [h∇f (xt+1 ), w − zt i + Ψ(w)] } − − Bt
w∈X
[by (5.5.22)]
≥ Vzt (x̂t+1 ) + At+1 f (xt+1 ) + At Ψ(yt+ ) + at+1 [h∇f (xt+1 ), x̂t+1 − zt i + Ψ(x̂t+1 )]
− − ¯ − Bt
| {z }
−Bt+1
[by (5.5.12) and Lemma 5.5.1]
At at+1
= Vzt (x̂t+1 ) + [At + at+1 ] [ Ψ(yt+ ) + Ψ(x̂t+1 ) ]
| {z } At + at+1 At + at+1
=At+1 by (5.5.10) | {z }
≥Ψ(yt+1 ) by (5.5.13.a) and (5.5.10); recall that Ψ is convex
+At+1 f (xt+1 ) + at+1 h∇f (xt+1 ), x̂t+1 − zt i − Bt+1
≥ Vzt (x̂t+1 ) + At+1 f (xt+1 ) + at+1 h∇f (xt+1 ), x̂t+1 − zt i + At+1 Ψ(yt+1 ) − Bt+1
that is,
∗
ψt+1 ≥ Vzt (x̂t+1 ) + At+1 f (xt+1 ) + at+1 h∇f (xt+1 ), x̂t+1 − zt i + At+1 Ψ(yt+1 ) − Bt+1 . (5.5.23)
Now note that x̂t+1 − zt = (yt+1 − xt+1 )/τt by (5.5.11) and (5.5.13.a), whence (5.5.23) implies
390 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
where the concluding inequality is due to (5.5.19). The resulting inequality taken together with
(∗t ) yields
φ(yt+ ) ≤ φ∗ + A−1
t [Vxω (x∗ ) + Bt ], t = 1, 2, ... (5.5.24)
Now, from A0 = 0, (5.5.9) and (5.5.10) it immediately follows that if Lt ≤ L for some L and all
t and if Āt are given by the recurrence
s
1 1 Āt
Ā0 = 0; Āt+1 = Āt + + 2
+ , (5.5.25)
2L 4L L
t 2
then At ≥ Āt for all t. It is immediately seen that (5.5.25) implies that Āt ≥ 4L for all t
14 . Taking into account that L ≤ L for all t by Theorem’s premise, the bottom line is that
t f
t2
At ≥ 4L f
for all t, which combines with (5.5.24) to imply (5.5.15). 2
gradient of a convex function f and Lipschitz continuity of f itself are “two extremes” in char-
acterizing function’s smoothness; they can be linked together by “in-between” requirements for
the gradient f 0 of f to be Hölder continuous with a given Hölder exponent κ ∈ [1, 2]:
kf 0 (x) − f 0 (y)k∗ ≤ Lκ kx − ykκ−1 , ∀x, y ∈ X. (5.5.26)
Inspired by Yu. Nesterov’s recent paper [32], we are about to modify the algorithm from section
5.5.1 to become applicable to problems of composite minimization (5.5.1) with functions f of
smoothness varying between the outlined two extremes. It should be stressed that the resulting
algorithm does not require a priori knowledge of smoothness parameters of f and adjusts itself
to the actual level of smoothness as characterized by Hölder scale.
While inspired by [32], the algorithm to be developed is not completely identical to the one
proposed in this reference; in particular, we still are able to work with inexact prox mappings.
where
1. X is a closed convex set in a Euclidean space E equipped with norm k · k (not necessarily
the Euclidean one). Same as in section 5.5.1, we assume that X is equipped with distance-
generating function ω(x) compatible with k · k, and ,
2. Ψ(x) : X → R ∪ {+∞} is a lower semicontinuous convex function which is finite on the
relative interior of X,
3. f : X → R is a convex Lipschitz function satisfying, for some κ ∈ [1, 2], L ∈ (0, ∞), and a
selection f 0 (·) : X → E of subgradients, the relation
L
∀(x, y ∈ X) : f (y) ≤ f (x) + hf 0 (x), y − xi +
ky − xkκ . (5.5.27)
κ
Observe that in section 5.5.1 we dealt with the case κ = 2 of the outlined situation. Note
also that a sufficient (necessary and sufficient when κ = 2) condition for (5.5.27) is Hölder
continuity of the gradient of f , i.e., the relation (5.5.26).
It should be stressed that the algorithm to be built does not require a priori knowledge of
κ and L.
From now on we assume that (5.5.1) is solvable and denote by x∗ an optimal solution to this
problem.
In the sequel, we continus to use our standard “prox-related” notation xω , X o , Vx (y), same
as the notation
argmin {h(x) + ω(x)} = {z ∈ X o : ∃h0 (z) ∈ ∂X h(z) : hh0 (z) + ω 0 (z), w − zi ≥ − ∀w ∈ X}
x∈X
see p. 384.
The algorithm we are about to develop requires the ability to find, given ξ ∈ E and α ≥ 0,
a point from the set
argminν {hξ, xi + αΨ(x) + ω(x)}
x∈X
where ν ≥ 0 is construction’s parameter.
392 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
ν ≥ 0, ν̄ ≥ 0, L > 0.
The method works in stages. The stages differ from each other only in their numbers of steps;
for stage s, s = 1, 2, ..., this number is Ns = 2s .
• Initialization: Set y0 = xω := argminx∈X ω(x), ψ0 (w) = Vxω (w), and select y0+ ∈ X such
that φ(y0+ ) ≤ φ(y0 ). Set A0 = 0.
1. Compute
zt ∈ argminν ψt (w). (5.5.28)
w∈X
2. Inner Loop: Set Lt,0 = L. For µ = 0, 1, ..., carry out the following computations.
(a) Set Lt,µ = 2µ Lt,0 ;
(b) Find the positive root at+1,µ of the quadratic equation
" s #
1 1 At
Lt,µ a2t+1 = At + at+1,µ ⇔ at+1,µ = + 2 + (5.5.29)
2Lt,µ 4Lt,µ Lt,µ
and set
at+1,µ
At+1,µ = At + at+1,µ , τt,µ = . (5.5.30)
At+1,µ
Note that τt,µ ∈ (0, 1].
(c) Set
xt+1,µ = τt,µ zt + (1 − τt,µ )yt+ . (5.5.31)
and compute f (xt+1,µ ), f 0 (xt+1,µ ).
(d) Compute
(e) Set
yt+1,µ = τt,µ x̂t+1,µ + (1 − τt,µ )yt+ , (5.5.33)
compute f (yt+1,µ ) and then compute
h i
Lt,µ
δt,µ = 2 kyt+1,µ − xt+1,µ k2 + hf 0 (xt+1,µ ), yt+1,µ − xt+1,µ i + f (xt+1,µ )
−f (yt+1,µ ).
(5.5.34)
FAST FIRST ORDER MINIMIZATION AND SADDLE POINTS 393
(f) If δt,µ < − N , pass to the step µ + 1 of Inner loop, otherwise terminate Inner loop
and set
Lt = Lt,µ , at+1 = at+1,µ , At+1 = At+1,µ , τt = τt,µ ,
xt+1 = xt+1,µ , yt+1 = yt+1,µ , δt = δt,µ ,
ψt+1 (w) = ψt (w) + at+1 [f (xt+1 ) + f 0 (xt+1 ), w − xt+1 i + Ψ(w)]
(5.5.35)
+ +
3. Select somehow yt+1 ∈ X in such a way that φ(yt+1 ) ≤ φ(yt+1 ).
Step t is completed. If t < N − 1, pass to outer step t + 1, otherwise terminate the
+
stage and output yN
Furthermore,
2 2−κ
(a) L ≤ Lt ≤ max[L, 2L κ [N/] κ ], 0 ≤ t ≤ N − 1,
P
N −1 √1
2 (5.5.37)
(b) AN ≥ τ =0 2 Lτ ,
Besides this, the number of steps of Inner loop at every outer step of SN does not exceed
2 2−κ
h i
M (N ) := 1 + log2 max 1, 2L−1 L κ [N/] κ . (5.5.38)
Corollary 5.5.1 Let function f in (5.5.1) satisfy (5.5.27) with some L ∈ (0, ∞) and κ ∈ [1, 2],
and let 2
√
κ κ
3κ−2
2
LΘ∗ 8 2 LΘ∗
N∗ () = max 2 √ , . (5.5.39)
A−1 −1
N Vxω (x∗ ) ≤ AN Θ∗ ≤ . (5.5.40)
A−1
N Θ∗ ≤ (5.5.41)
394 LECTURE 5. SIMPLE METHODS FOR EXTREMELY LARGE-SCALE PROBLEMS
In particular, with exact prox-mappings (i.e., with ν = ν̄ = 0) it takes totaly at most Ceil(N ∗ ())
outer steps to ensure (5.5.41) and thus to ensure the error bound
+
φ(yN ) − φ∗ ≤ 2.
Postponing for a moment Corollary’s verification, let us discuss its consequences, for the time
being – in the case ν = ν̄ = 0 of exact prox-mappings (the case of inexact mappings will be
discussed in section 5.5.3).
Discussion. Corollary 5.5.1 suggests the following policy of solving problem (5.5.1) within a
given accuracy > 0: we run FUGM with somehow selected L and ν = ν̄ = 0 stage by stage,
starting with a single-step stage and increasing the number of steps N by factor 2 when passing
from a stage to the next one. This process is continued until at the end of the current stage the
condition , until at the end of the current stage the condition (5.5.41) (which is on-line verifiable)
takes place. When it happens, we terminate; by Corollary 5.5.1, at this moment we have at
our disposal a feasible 2-optimal solution to the problem of interest. On the other hand, the
total, over all stages, number of Outer steps before termination clearly does not exceed twice
the number of Outer steps at the concluding stage, that is, invoking Corollary 5.5.1, it does not
exceed
N ∗ () := 4Ceil(N∗ ()).
As about the total, over all Outer steps of all stages, number M of Inner Loop steps (which is
the actual complexity measure of FUGM), it, by Theorem 5.5.2, can be larger than N ∗ () by
at most by logarithmic in L, L, 1/ factor. Moreover, when “safe” stepsize policy is used, see
Remark 5.5.1, we have by this Remark
with M (N ) given by (5.5.38). It is easily seen that for small the “logarithmic complexity term”
M is negligible as compared to the “leading complexity term” N ∗ ().
κ/2
Further, setting L = max[L, L] and assuming ≤ LΘ∗ , we clearly have
2
κ
3κ−2
LΘ∗2
N ∗ () ≤ O(1) . (5.5.43)
Thus, our principal complexity term is expressed in terms of the true and unknown in advance
smoothness parameters (L, κ) and is the better the better is smoothness (the larger
q is κ and
the smaller is L). In the “most smooth case” κ = 2, we arrive at N ∗ () = O(1) LΘ ∗
, which,
basically, is the complexity bound for the Fast Composite Gradient√ Method FGM from section
5.5.1. In the “least smooth case” κ = 1, we get N ∗ () = O(1) L 2Θ∗ , which, basically, is the
complexity bound of Mirror Descent15 Note that while in the nonsmooth case the complexity
15
While we did not consider MD in the composite minimization setting, such a consideration is quite straight-
forward, and the complexity bounds for “composite MD” are identical to those of the “plain MD.”
FAST FIRST ORDER MINIMIZATION AND SADDLE POINTS 395
bounds of FUGM and MD are “nearly the same,” the methods are essentially different, and
in fact FUGM exhibits some theoretical advantage over MD: the complexity bound of FUGM
depends on the maximal deviation L of the gradients of f from each other, the deviation being
measured in k · k∗ , while what matters for MD is the Lipschitz constant, taken w.r.t. k · k∗ , of f ,
that is, the maximal deviation of the gradients of f from zero; the latter deviation can be much
larger than the former one. Equivalently: the complexity bounds of MD are sensitive to adding
to the objective a linear form, while the complexity bounds of FUGM are insensitive to such a
perturbation.
Proof of Corollary 5.5.1. The first inequality in (5.5.40) is evident. Now let N ≥ N∗ ().
Consider two possible cases, I and II, as defined below.
2 2−κ
Case I: 2L κ [N/] κ ≥ L. In this case, (5.5.37.a) ensures the first relation in the following
chain:
2 2−κ
Lt ≤ 2L κ [N/] κ , 0≤t≤N −1
1 2−κ 2−κ
2
⇒ AN ≥ 2−3/2 L− κ N 1− 2κ 2κ [by (5.5.37.b)]
3κ−2 2−κ
1
= 2 N κ κ
8L κ
3κ−2
κ κ 2 κ
3κ−2 2−κ
1 8 2 LΘ∗2
≥ 2
κ [by (5.5.39) and due to N ≥ N∗ ()]
8L κ
2 2−κ 2
1 −κ
= 2 8L κ Θ∗ κ = Θ∗ /,
8L κ
2 2−κ
Case II: 2L κ [N/] κ < L. In this case, (5.5.37.a) ensures the first relation in the following
chain:
Lt ≤ L, 0 ≤ t ≤ N − 1
2
⇒ AN ≥ N 4L [by (5.5.37.b)]
≥ Θ∗ / [by (5.5.39) and due to N ≥ N∗ ()]
and the second inequality in (5.5.40) follows. 2
we have δt,µ ≥ − N . Consequently, in the case in question Inner loop as finite as well, and
2 2−κ
Lt ≤ max[L, 2L κ [N/] κ ], (5.5.46)
Note that by (5.5.45) the resulting inequality holds true for κ = 2 as well.
The bottom line is that FUGM is well defined and maintains the following relations:
2 2−κ
(a.1) L ≤ Lt ≤ max[L,
q 2L [N/]
κ κ ],
1 1 At
(a.2) at+1 = 2Lt + 4L2 + Lt [⇔ Lt a2t+1 = At + at+1 ],
t
A0 = 0,
(a.3) At+1 = t+1
P
τ =1 aτ ,
(a.4) τt = Aat+1
t+1
∈ (0, 1],
1
(a.5) τ 2 A = Lt [by (a.2-4)];
t t+1
(b.1) ψ0 (w) = Vxω (w),
(b.2) ψt+1 (w) = ψt (w) + at+1 [f (xt+1 ) + f 0 (xt+1 ), w − xt+1 i + Ψ(w)] (5.5.47)
(c.1) zt ∈ argminν ψt (w);
w∈X
(c.2) xt+1 = τt zt + (1 − τt )yt+ ,
(c.3) x̂t+1 ∈ argminν̄ [at+1 [hf 0 (xt+1 ), wi + Ψ(w)] + Vzt (w)] ,
w∈X
(c.4) yt+1 = τt x̂t+1 + (1 − τt )yt+ ;
Lt 2 0
(d) 2 kyt+1 − xt+1 k + hf (xt+1 ), yt+1 − xt+1 i + f (xt+1 ) − f (yt+1 ) ≥ − N
[by termination rule for Inner loop]
(e) yt+ ∈ X & φ(yt+ ) ≤ φ(yt ).
and, on the top of it, the number M of steps of Inner loop does not exceed
2 2−κ
h i
M (N ) := 1 + log2 max 1, 2L−1 L κ [N/] κ . (5.5.48)
t
Bt + ψt∗ ≥ At φ(yt+ ),
X
t ≥ 0, Bt = (ν + ν̄)t + Aτ . (∗t )
τ =1
N
Invoking (5.5.47.b.2), the resulting inequality implies that for all $w\in X$ it holds
\[
\psi_{t+1}(w) \ge V_{z_t}(w) + A_t\big[f(x_{t+1}) + \langle f'(x_{t+1}), y_t^+ - x_{t+1}\rangle + \Psi(y_t^+)\big] + a_{t+1}\big[f(x_{t+1}) + \langle f'(x_{t+1}), w - x_{t+1}\rangle + \Psi(w)\big] - \nu - B_t.\tag{5.5.53}
\]
By (5.5.47.c.2) and (5.5.47.a.4-5) we have $A_t(y_t^+ - x_{t+1}) - a_{t+1}x_{t+1} = -a_{t+1}z_t$, whence (5.5.53) implies that
\[
\psi_{t+1}(w) \ge V_{z_t}(w) + A_{t+1}f(x_{t+1}) + A_t\Psi(y_t^+) + a_{t+1}\big[\langle f'(x_{t+1}), w - z_t\rangle + \Psi(w)\big] - \nu - B_t\tag{5.5.54}
\]
for all $w\in X$. Thus,
\[
\begin{array}{rl}
\psi_{t+1}^* =& \min_{w\in X}\psi_{t+1}(w)\\
\ge& \min_{w\in X}\big\{\underbrace{V_{z_t}(w) + A_{t+1}f(x_{t+1}) + A_t\Psi(y_t^+) + a_{t+1}[\langle f'(x_{t+1}), w - z_t\rangle + \Psi(w)]}_{=h(w)+\omega(w)\text{ with convex lower semicontinuous }h}\big\} - \nu - B_t\\
&\text{[by (5.5.54)]}\\
\ge& V_{z_t}(\hat x_{t+1}) + A_{t+1}f(x_{t+1}) + A_t\Psi(y_t^+) + a_{t+1}[\langle f'(x_{t+1}), \hat x_{t+1} - z_t\rangle + \Psi(\hat x_{t+1})] - \nu - \bar\nu - B_t\\
&\text{[by (5.5.47.c.3) and Lemma 5.5.1]}\\
=& V_{z_t}(\hat x_{t+1}) + \underbrace{[A_t + a_{t+1}]}_{=A_{t+1}\text{ by (5.5.47.a.3)}}\underbrace{\Big[\frac{A_t}{A_t + a_{t+1}}\Psi(y_t^+) + \frac{a_{t+1}}{A_t + a_{t+1}}\Psi(\hat x_{t+1})\Big]}_{\ge\Psi(y_{t+1})\text{ by (5.5.47.c.4) and (5.5.47.a.2-3); recall that }\Psi\text{ is convex}}\\
&+\ A_{t+1}f(x_{t+1}) + a_{t+1}\langle f'(x_{t+1}), \hat x_{t+1} - z_t\rangle - \nu - \bar\nu - B_t\\
\ge& V_{z_t}(\hat x_{t+1}) + A_{t+1}f(x_{t+1}) + a_{t+1}\langle f'(x_{t+1}), \hat x_{t+1} - z_t\rangle + A_{t+1}\Psi(y_{t+1}) - \nu - \bar\nu - B_t,
\end{array}
\]
that is,
\[
\psi_{t+1}^* \ge V_{z_t}(\hat x_{t+1}) + A_{t+1}f(x_{t+1}) + a_{t+1}\langle f'(x_{t+1}), \hat x_{t+1} - z_t\rangle + A_{t+1}\Psi(y_{t+1}) - \nu - \bar\nu - B_t.\tag{5.5.55}
\]
Further, $\hat x_{t+1} - z_t = \tau_t^{-1}(y_{t+1} - x_{t+1})$ (5.5.56) by (5.5.47.c), whence (5.5.55) implies the first inequality in the following chain:
\[
\begin{array}{rl}
\psi_{t+1}^* \ge& V_{z_t}(\hat x_{t+1}) + A_{t+1}f(x_{t+1}) + \frac{a_{t+1}}{\tau_t}\langle f'(x_{t+1}), y_{t+1} - x_{t+1}\rangle + A_{t+1}\Psi(y_{t+1}) - \nu - \bar\nu - B_t\\
=& V_{z_t}(\hat x_{t+1}) + A_{t+1}f(x_{t+1}) + A_{t+1}\langle f'(x_{t+1}), y_{t+1} - x_{t+1}\rangle + A_{t+1}\Psi(y_{t+1}) - \nu - \bar\nu - B_t\\
&\text{[by (5.5.47.a.4)]}\\
=& A_{t+1}\Big[\frac{1}{A_{t+1}}V_{z_t}(\hat x_{t+1}) + f(x_{t+1}) + \langle f'(x_{t+1}), y_{t+1} - x_{t+1}\rangle + \Psi(y_{t+1})\Big] - \nu - \bar\nu - B_t\\
=& A_{t+1}\Big[\phi(y_{t+1}) + \big[\frac{1}{A_{t+1}}V_{z_t}(\hat x_{t+1}) + f(x_{t+1}) + \langle f'(x_{t+1}), y_{t+1} - x_{t+1}\rangle - f(y_{t+1})\big]\Big] - \nu - \bar\nu - B_t\\
\ge& A_{t+1}\Big[\phi(y_{t+1}) + \big[\frac{1}{2A_{t+1}}\|\hat x_{t+1} - z_t\|^2 + f(x_{t+1}) + \langle f'(x_{t+1}), y_{t+1} - x_{t+1}\rangle - f(y_{t+1})\big]\Big] - \nu - \bar\nu - B_t\\
=& A_{t+1}\Big[\phi(y_{t+1}) + \big[\frac{1}{2\tau_t^2A_{t+1}}\|x_{t+1} - y_{t+1}\|^2 + f(x_{t+1}) + \langle f'(x_{t+1}), y_{t+1} - x_{t+1}\rangle - f(y_{t+1})\big]\Big]\\
&-\ \nu - \bar\nu - B_t\quad\text{[by (5.5.56)]}\\
=& A_{t+1}\Big[\phi(y_{t+1}) + \big[\frac{L_t}{2}\|x_{t+1} - y_{t+1}\|^2 + f(x_{t+1}) + \langle f'(x_{t+1}), y_{t+1} - x_{t+1}\rangle - f(y_{t+1})\big]\Big]\\
&-\ \nu - \bar\nu - B_t\quad\text{[by (5.5.47.a.5)]}\\
\ge& A_{t+1}\phi(y_{t+1}) - A_{t+1}\frac{\epsilon}{N} - \nu - \bar\nu - B_t\quad\text{[by (5.5.47.d)]}\\
=& A_{t+1}\phi(y_{t+1}) - B_{t+1}\quad\text{[see $(*_t)$]}\\
\ge& A_{t+1}\phi(y_{t+1}^+) - B_{t+1}\quad\text{[by (5.5.47.e)]},
\end{array}
\]
where the concluding inequality is due to (5.5.52). The resulting inequality taken together with $(*_t)$ yields
\[
\begin{array}{rl}
\phi(y_t^+) - \phi_* \le& A_t^{-1}\big[V_{x_\omega}(x_*) + B_t\big] = A_t^{-1}V_{x_\omega}(x_*) + A_t^{-1}\sum_{\tau=1}^tA_\tau\frac{\epsilon}{N} + A_t^{-1}[\nu+\bar\nu]t\\
\le& A_t^{-1}V_{x_\omega}(x_*) + \frac{t}{N}\epsilon + A_t^{-1}[\nu+\bar\nu]t,\quad t = 1,2,...,N,
\end{array}\tag{5.5.57}
\]
Indeed, (5.5.58) clearly holds true for $t = 0$. Assuming that the relation holds true for some $t$ and invoking (5.5.47.a.2), we get
\[
A_{t+1} = A_t + a_{t+1} \ge \bar A_t^2 + \sqrt{\bar A_t^2/L_t} + \frac{1}{2L_t}
\ge \Big[\sum_{\tau=0}^{t-1}\frac{1}{2\sqrt{L_\tau}}\Big]^2 + 2\Big[\sum_{\tau=0}^{t-1}\frac{1}{2\sqrt{L_\tau}}\Big]\frac{1}{2\sqrt{L_t}} + \frac{1}{4L_t}
= \Big[\sum_{\tau=0}^{t}\frac{1}{2\sqrt{L_\tau}}\Big]^2 = \bar A_{t+1}^2,
\]
Bottom line. Combining (5.5.47.a.1), (5.5.57), (5.5.58) and taking into account (5.5.48), we arrive at Theorem 5.5.2. $\square$
to be proximal friendly – to admit a strongly convex (and thus nonlinear!) DGF $\omega(\cdot)$ such that linear perturbations $\omega(x) + \langle\xi, x\rangle$ are easy to minimize over $X$ $^{16}$. Whenever this is the case, it is easy to minimize over $X$ linear forms $\langle\xi, x\rangle$ (indeed, minimizing such a form over $X$ within whatever high accuracy reduces to minimizing over $X$ the sum $\omega(x) + \alpha\langle\xi, x\rangle$ with large $\alpha$). The inverse conclusion is, in general, not true: it may happen that $X$ admits a computationally cheap Linear Minimization Oracle (LMO) – a procedure capable of minimizing linear forms over $X$ – while no DGFs leading to equally computationally cheap proximal mappings are known. Examples important for applications include:
1. The nuclear norm ball $X \subset \mathbf{R}^{n\times n}$, playing a crucial role in low rank matrix recovery: with all known proximal setups, computing prox-mappings requires computing the full singular value decomposition of an $n\times n$ matrix. At the same time, minimizing a linear form over $X$ requires finding just the pair of leading singular vectors of an $n\times n$ matrix $x$ (why?). The latter task can be easily solved by the Power method$^{17}$ (a sketch of which follows this list). For $n$ in the range of a few thousands and more, computing the leading singular vectors is orders of magnitude cheaper – progressively so as $n$ grows – than the full singular value decomposition.
3. The Total Variation ball $X$ in the space of $n\times n$ images ($n\times n$ arrays with zero mean). The Total Variation of an $n\times n$ image $x$ is
\[
{\rm TV}(x) = \sum_{i=1}^{n-1}\sum_{j=1}^{n}|x_{i+1,j} - x_{i,j}| + \sum_{i=1}^{n}\sum_{j=1}^{n-1}|x_{i,j+1} - x_{i,j}|
\]
$^{16}$ On top of proximal friendliness, we would like to ensure a moderate, "nearly dimension-independent" value of the $\omega$-radius of $X$, in order to avoid rapid growth of the iteration count with problem's dimension. Note, however, that in the absence of proximal friendliness, we cannot even start running a proximal algorithm...
$^{17}$ In the most primitive implementation, the Power method iterates matrix-vector multiplications $e_t \mapsto e_{t+1} = x^Txe_t$, with a starting vector $e_0$ selected at random, usually from the standard Gaussian distribution; after a few tens of iterations, the vectors $e_t$ and $xe_t$ become good approximations of the right and the left leading singular vectors of $x$, respectively.
and is a norm on the space of images (the discrete analogy of the $L_1$ norm of the gradient of a function on a 2D square); convex minimization over TV balls and related problems (like the LASSO problem (5.5.4) with the space of images in the role of $E$ and ${\rm TV}(\cdot)$ in the role of $\|\cdot\|_E$) plays an important role in image reconstruction. Computing the simplest – Euclidean – prox mapping on the TV ball in the space of $n\times n$ images requires computing the Euclidean projection of a point onto a large-scale polytope cut off the $O(1)n^2$-dimensional $\ell_1$ ball by $O(1)n^2$ homogeneous linear equality constraints; for $n$ of practical interest (a few hundreds), this problem is quite time consuming. In contrast, it turns out that minimizing a linear form over the TV ball requires solving a special Maximum Flow problem [15], and this turns out to be orders of magnitude cheaper than metric projection onto the TV ball.
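To make footnote 17 concrete, here is a minimal numpy sketch of the power method and of its use inside an LMO for the nuclear norm ball; the function names and the iteration budget are our illustrative choices, not part of the lecture:

```python
import numpy as np

def leading_singular_triple(x, iters=50, rng=None):
    # power method of footnote 17: iterate e <- x^T x e from a random
    # Gaussian start, normalizing at each step to avoid overflow
    rng = np.random.default_rng() if rng is None else rng
    e = rng.standard_normal(x.shape[1])
    for _ in range(iters):
        e = x.T @ (x @ e)
        e /= np.linalg.norm(e)
    u = x @ e                      # direction of the leading left singular vector
    sigma = np.linalg.norm(u)      # leading singular value
    return u / sigma, sigma, e     # (left vector, sigma, right vector)

def lmo_nuclear_ball(c):
    # minimize <c, x> = Tr(c^T x) over the nuclear-norm unit ball: the
    # minimizer is the rank-one matrix -u v^T built from the leading
    # singular pair of c
    u, _, v = leading_singular_triple(c)
    return -np.outer(u, v)
```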
Here is some numerical data allowing one to get an impression of the "magnitude" of the phenomena just outlined:
[Figure: log-log plots of CPU time ratios vs. n – "full svd"/"finding leading singular vectors" for an n×n matrix; "full evd"/"finding leading eigenvector" for an n×n symmetric matrix; "metric projection"/"LMO computation" for the TV ball in R^{n×n}.]

CPU ratio "full svd"/"finding leading singular vectors" for an n×n matrix:
n      1024   2048   4096   8192
ratio   0.5    2.6    4.5    7.5
(Full svd for n = 8192 takes 475.6 sec!)

CPU ratio "full evd"/"finding leading eigenvector" for an n×n symmetric matrix:
n      1024   2048   4096   8192
ratio   2.0    4.1    7.9   13.0
(Full evd for n = 8192 takes 142.1 sec!)

CPU ratio "metric projection"/"LMO computation" for the TV ball in R^{n×n}:
n       129    256    512   1024
ratio  10.8    8.8   11.3   20.6
(Metric projection onto the TV ball for n = 1024 takes 1062.1 sec!)

Platform: 2 × 3.40 GHz CPU, 16.0 GB RAM, 64-bit Windows 7
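For the record, the TV quantity behind the last table is trivial to compute; here is a minimal numpy sketch per the definition above (the helper name is ours):

```python
import numpy as np

def total_variation(x):
    # sum of absolute differences between vertically and horizontally
    # adjacent pixels of an n x n image, per the TV definition above
    vert = np.abs(x[1:, :] - x[:-1, :]).sum()   # |x_{i+1,j} - x_{i,j}|
    horz = np.abs(x[:, 1:] - x[:, :-1]).sum()   # |x_{i,j+1} - x_{i,j}|
    return vert + horz

img = np.random.randn(256, 256)
img -= img.mean()                               # zero mean, as required
print(total_variation(img))
```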
The outlined and some other generic large-scale problems of convex minimization motivate investigating the ability to solve problems (CP) on LMO-represented domains $X$ – convex compact domains equipped with Linear Minimization Oracles. As for the objective $f$, we still assume that it is given by a First Order Oracle.
The associated Composite Conditional Gradient Algorithm (nicknamed CGA in the sequel) as applied to (5.5.1) generates iterates $x_t \in X$, $t = 1, 2, ...$, as follows:
1. Initialization: $x_1$ is an arbitrary point of $X$.
2. Step $t = 1, 2, ...$: Given $x_t$, we
   • compute $f'(x_t)$ and call the CLMO to get the point
\[
x_t^+ = {\rm CLMO}(f'(x_t), 1)\ \Big[\in \operatorname*{Argmin}_{x\in X}\big[\ell_t(x) := f(x_t) + \langle f'(x_t), x - x_t\rangle + \Psi(x)\big]\Big]\tag{5.5.60}
\]
   • compute the point
\[
\hat x_{t+1} = x_t + \gamma_t(x_t^+ - x_t),\qquad \gamma_t = \frac{2}{t+1}\tag{5.5.61}
\]
   • take as $x_{t+1}$ a (whatever) point in $X$ such that $\phi(x_{t+1}) \le \phi(\hat x_{t+1})$.
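A minimal Python sketch of this recursion may help; `clmo` below stands for a user-supplied Composite LMO returning a minimizer of $\langle g, x\rangle + \Psi(x)$ over $X$ (with $\Psi \equiv 0$ and a plain LMO this is the classical Frank–Wolfe method [11]); the names and the iteration budget are our choices:

```python
import numpy as np

def composite_cga(grad_f, phi, clmo, x1, T=500):
    # Composite CGA: x_t^+ from the CLMO (5.5.60), stepsize
    # gamma_t = 2/(t+1) as in (5.5.61); any x_{t+1} with
    # phi(x_{t+1}) <= phi(x_hat_{t+1}) is admissible, here we keep
    # the better of the candidate and the current iterate
    x = x1
    for t in range(1, T + 1):
        x_plus = clmo(grad_f(x))
        gamma = 2.0 / (t + 1)
        x_hat = x + gamma * (x_plus - x)
        x = x if phi(x) <= phi(x_hat) else x_hat
    return x

# example LMO for the unit l1 ball (Psi = 0): the minimizer of <g, x>
# is -sign(g_i) e_i at a coordinate i of maximal |g_i|
def lmo_l1(g):
    e = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    e[i] = -np.sign(g[i])
    return e
```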
The convergence properties of the algorithm are summarized in the following simple statement going back to [11]:

Theorem 5.5.3 Let Composite CGA be applied to problem (5.5.1) with convex $f$ satisfying (5.5.27) with some $\kappa \in (1,2]$, and let $D$ be the $\|\cdot\|$-diameter of $X$. Denoting by $\phi_*$ the optimal value in (5.5.1), for all $t \ge 2$ one has
\[
\epsilon_t := \phi(x_t) - \phi_* \le \frac{2LD^\kappa}{\kappa(3-\kappa)}\gamma_t^{\kappa-1} = \frac{2^\kappa LD^\kappa}{\kappa(3-\kappa)(t+1)^{\kappa-1}}.\tag{5.5.62}
\]
Proof. Let
\[
\ell_t(x) = f(x_t) + \langle f'(x_t), x - x_t\rangle + \Psi(x).
\]
We have
\[
\begin{array}{rl}
\epsilon_{t+1} := \phi(x_{t+1}) - \phi_* \le& \phi(\hat x_{t+1}) - \phi_* = \phi(x_t + \gamma_t(x_t^+ - x_t)) - \phi_*\\
=& \ell_t(x_t + \gamma_t(x_t^+ - x_t)) + \big[f(x_t + \gamma_t(x_t^+ - x_t)) - f(x_t) - \langle f'(x_t), \gamma_t(x_t^+ - x_t)\rangle\big] - \phi_*\\
\le& \big[(1-\gamma_t)\underbrace{\ell_t(x_t)}_{=\phi(x_t)} + \gamma_t\ell_t(x_t^+) - \phi_*\big] + \frac{L}{\kappa}[\gamma_tD]^\kappa\\
&\text{[since $\ell_t(\cdot)$ is convex and due to (5.5.27); note that $\|\gamma_t(x_t^+ - x_t)\| \le \gamma_tD$]}\\
=& (1-\gamma_t)\underbrace{[\phi(x_t) - \phi_*]}_{\epsilon_t} + \gamma_t[\ell_t(x_t^+) - \phi_*] + \frac{L}{\kappa}[\gamma_tD]^\kappa\\
\le& (1-\gamma_t)\epsilon_t + \frac{L}{\kappa}[\gamma_tD]^\kappa\\
&\text{[since $\ell_t(x) \le \phi(x)$ for all $x\in X$ and $\ell_t(x_t^+) = \min_{x\in X}\ell_t(x)$, whence $\ell_t(x_t^+) \le \phi_*$]}
\end{array}
\]
where the concluding inequality in the chain is the gradient inequality for the convex function $u^{1-\kappa}$ of $u > 0$. We see that the validity of $(*_t)$ for some $t$ implies the validity of $(*_{t+1})$; induction, and thus the proof of Theorem 5.5.3, is complete. $\square$
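For the reader's convenience, here is our reconstruction of the elided induction step. With $C := \frac{2LD^\kappa}{\kappa(3-\kappa)}$, assume $\epsilon_t \le C\gamma_t^{\kappa-1}$; then
\[
\epsilon_{t+1} \le (1-\gamma_t)C\gamma_t^{\kappa-1} + \frac{L}{\kappa}D^\kappa\gamma_t^\kappa
= C\gamma_t^{\kappa-1} - \underbrace{\Big[C - \frac{LD^\kappa}{\kappa}\Big]}_{=C\frac{\kappa-1}{2}}\gamma_t^\kappa
\le C\gamma_{t+1}^{\kappa-1},
\]
the concluding inequality being exactly the gradient inequality $u^{1-\kappa} \ge v^{1-\kappa} + (1-\kappa)v^{-\kappa}(u-v)$ for $u = t+2$, $v = t+1$, multiplied by $2^{\kappa-1}$: it reads $\gamma_{t+1}^{\kappa-1} \ge \gamma_t^{\kappa-1} - \frac{\kappa-1}{2}\gamma_t^\kappa$.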
A. $X$ is a compact set, $\omega(\cdot)$ is continuously differentiable on $X$ (so that $X^o = X$), and for some known $\Upsilon < \infty$ one has

Remark: For all standard proximal setups described in section 5.3.3, except for the Entropy one, one can take as $\Upsilon$ the quantity $O(1)\Theta$, with $\Theta$ defined in section 5.3.2.

$\nu = \Upsilon$, $\bar\nu = 3\Upsilon$.
\[
\langle\xi + \alpha\Psi'(w), x - w\rangle \ge 0\quad\forall x\in X.
\]
$^{18}$ Indeed, the required inequality boils down to the inequality $3^\kappa - \kappa3^{\kappa-1} \le 2^\kappa$, which indeed is true when $\kappa \ge 1$, due to the convexity of the function $u^\kappa$ on the ray $u \ge 0$.
Consequently,
whence $w \in \operatorname*{argmin}^{\bar\nu}_{x\in X}\{\langle\xi, x\rangle + \alpha\Psi(x) + V_z(x)\}$. The case of $\alpha = 0$ is completely similar (redefine $\Psi$ as identically zero and $\alpha$ as 1). $\square$
Combining Lemma 5.5.2 and Theorem 5.5.1, we arrive at the following

Corollary 5.5.2 In the situation described in section 5.5.1.1 and under assumptions A, B, consider the implementation of the Fast Composite Gradient Method (5.5.8) – (5.5.14) where (5.5.8) is implemented as
\[
z_t = {\rm CLMO}(\nabla\ell_t(\cdot), \alpha_t)\tag{5.5.65}
\]
(see (5.5.16)), and (5.5.12) is implemented as (5.5.66)
(which, by Lemma 5.5.2, ensures (5.5.8), (5.5.12) with $\nu = \Upsilon$, $\bar\nu = 3\Upsilon$). Assuming that (5.5.7) holds true and that $\delta_t \ge 0$ for all $t$ (the latter definitely is the case when $L_t = L_f$ for all $t$), one has
\[
y_t, y_t^+ \in X,\quad \phi(y_t^+) - \phi_* \le \phi(y_t) - \phi_* \le A_t^{-1}\big[V_{x_\omega}(x_*) + 4\Upsilon t\big] \le \frac{4L_f}{t^2}\big[V_{x_\omega}(x_*) + 4\Upsilon t\big]\tag{5.5.67}
\]
for $t = 1, 2, ...$, where, as always, $x_*$ is an optimal solution to (5.5.1) and $x_\omega$ is the minimizer of $\omega(\cdot)$ on $X$.
A step of the algorithm, modulo the updates $y_t \mapsto y_t^+$, reduces to computing the value and the gradient of $f$ at a point, computing the value of $f$ at another point, and to two calls to the Composite Linear Minimization Oracle, with no computations of the values and the derivatives of $\omega(\cdot)$ at all.
Discussion. Observe that (5.5.64) clearly implies that $V_{x_\omega}(x_*) \le \Upsilon$, which combines with (5.5.67) to yield the relation
\[
\phi(y_t^+) - \phi_* \le O(1)\frac{L_f\Upsilon}{t},\quad t = 1, 2, ...\tag{5.5.68}
\]
On the other hand, under the premise of Corollary 5.5.2, we have at our disposal a Composite LMO for $X$, and the function $f$ in (5.5.1) is convex and satisfies (5.5.27) with $L = L_f$ and $\kappa = 2$, see (5.5.2). As a result, we can solve the problem of interest (5.5.1) by CGA; the resulting efficiency estimate, as stated by Theorem 5.5.3 in the case of $L = L_f$, $\kappa = 2$, is
\[
\phi(x_t) - \phi_* \le O(1)\frac{L_fD^2}{t},\tag{5.5.69}
\]
\[
\Upsilon \le \alpha(X)D^2\tag{5.5.70}
\]
for properly selected $\alpha(X) \ge 1$, the efficiency estimate (5.5.68) is at most by a factor $O(1)\alpha(X)$ worse than the CGA efficiency estimate (5.5.69), so that for moderate $\alpha(X)$ the estimates are "nearly the same." This is not too surprising, since FGM with imprecise proximal mappings given by (5.5.65), (5.5.66) is, basically, an LMO-based rather than a proximal algorithm. Indeed, with on-line tuning of the $L_t$'s, the implementation of FGM described in Corollary 5.5.2 requires operating with $\omega(z_t)$, $\omega'(z_t)$, and $\omega(\hat x_{t+1})$ solely in order to compute $\delta_t$ when checking whether this quantity is nonnegative. We could avoid the necessity to work with $\omega$ by computing $\|\hat x_{t+1} - z_t\|$ in order to build the lower bound
\[
\bar\delta_t = \frac{1}{2A_{t+1}}\|\hat x_{t+1} - z_t\|^2 + \langle\nabla f(x_{t+1}), y_{t+1} - x_{t+1}\rangle + f(x_{t+1}) - f(y_{t+1})
\]
on $\delta_t$, and to use this bound in the role of $\delta_t$ when updating the $L_t$'s (note that from item $5^0$ of the proof of Theorem 5.5.1 it follows that $\bar\delta_t \ge 0$ whenever $L_t = L_f$). Finally, when both the norm $\|\cdot\|$ and $\omega(\cdot)$ are difficult to compute, we can use the off-line policy $L_t = L_f$ for all $t$, thus completely avoiding the necessity to check whether $\delta_t \ge 0$ and ending up with an algorithm, let it be called FGM-LMO, based solely on the Composite Linear Minimization Oracle and never "touching" the DGF $\omega$ (which is used solely in the algorithm's analysis). The efficiency estimate of the resulting algorithm is within an $O(1)\alpha(X)$ factor of the one of CGA.
Note that in many important situations (5.5.70) indeed is satisfied with "quite moderate" $\alpha(X)$. For example, for the Euclidean setup one can take $\alpha(X) = 1$. For other standard proximal setups presented in section 5.3.3, except for the Entropy one, $\alpha(X)$ grows only logarithmically with the dimension of $X$.
An immediate question is: what, if anything, do we gain when passing from the simple and transparent CGA to the much more sophisticated FGM-LMO? The answer is: as far as theoretical efficiency estimates are concerned, nothing is gained. Nevertheless, "bridging" the proximal Fast Composite Gradient Method and LMO-based Conditional Gradients possesses some "methodological value." For example, FGM-LMO replaces the precise composite prox mapping

which ignores the Bregman distance $V_z(x)$ altogether. In principle, one could utilize a less crude approximation of the composite prox mapping, "shifting" the CGA efficiency estimate $O(LD^2/t)$ towards the $O(LD^2/t^2)$ efficiency estimate of FGM with precise proximal mappings – an option we here just mention rather than explore in depth.
With assumptions A and B in force, let us look at the Composite Minimization problem (5.5.1) with convex $f$ satisfying the smoothness requirement (5.5.27) with some $L > 0$ and $\kappa \in (1,2]$ (pay attention to $\kappa > 1$!). By Lemma 5.5.2, setting
\[
\nu = \Upsilon,\quad \bar\nu = 3\Upsilon,
\]
we ensure that for every $\xi\in E$, $\alpha \ge 0$ and every $z\in X$ one has
\[
\begin{array}{ll}
(a)& {\rm CLMO}(\xi, \alpha) \in \operatorname*{argmin}^{\nu}_{x\in X}\{\langle\xi, x\rangle + \alpha\Psi(x) + \omega(x)\},\\
(b)& {\rm CLMO}(\xi, \alpha) \in \operatorname*{argmin}^{\bar\nu}_{x\in X}\{\langle\xi, x\rangle + \alpha\Psi(x) + V_z(x)\}.
\end{array}\tag{5.5.71}
\]
Therefore we can consider the implementation of FUGM (in the sequel referred to as FUGM-LMO) where the target accuracy is a given $\epsilon$, the parameters $\nu$ and $\bar\nu$ are specified as $\nu = \Upsilon$, $\bar\nu = 3\Upsilon$, and
• $z_t$ is defined according to
\[
z_t = {\rm CLMO}(\nabla\ell_t(\cdot), \alpha_t)
\]
with $\ell_t(\cdot)$, $\alpha_t$ defined in (5.5.49);
• x̂t+1,µ is defined according to
Lemma 5.5.3 Let assumption A take place, and let FUGM-LMO be applied to the Composite Minimization problem (5.5.1) with convex function $f$ satisfying the smoothness condition (5.5.27) with some $L \in (0,\infty)$ and $\kappa \in (1,2]$. When the number of steps $N$ at a stage of the method satisfies the relation
\[
N \ge N^\#(\epsilon) := \max\left[\frac{16L\Upsilon}{\epsilon},\ \left(\frac{2^{5\kappa}L^2\Upsilon^{\kappa}}{\epsilon^2}\right)^{\frac{1}{2(\kappa-1)}}\right]\tag{5.5.72}
\]
one has
\[
[\nu + \bar\nu]NA_N^{-1} \le \epsilon.
\]
— in Case II we have
\[
A_N \ge \frac{N^2}{4L}\quad\Rightarrow\quad 4A_N^{-1}N\Upsilon \le \frac{16L\Upsilon}{N} \le \epsilon\quad\Big[\text{due to }N \ge \frac{16L\Upsilon}{\epsilon}\Big].
\]
Combining Corollary 5.5.1, Lemma 5.5.3 and taking into account that the quantity Υ partici-
pating in assumption A can be taken as Θ∗ participating in Corollary 5.5.1 (see Discussion after
Corollary 5.5.2), we arrive at the following result:
Corollary 5.5.3 Let assumption A take place, and let FUGM-LMO be applied to the Composite Minimization problem (5.5.1) with convex function $f$ satisfying, for some $L \in (0,\infty)$ and $\kappa \in (1,2]$, the smoothness condition (5.5.27). The on-line verifiable condition
\[
4A_N^{-1}N\Upsilon \le \epsilon\tag{5.5.73}
\]
ensures that the result $y_N^+$ of the $N$-step stage $S_N$ of FUGM-LMO satisfies the relation
\[
\phi(y_N^+) - \phi_* \le 3\epsilon,
\]
where
\[
N_\#(\epsilon) = \max\left[2\sqrt{\frac{L\Upsilon}{\epsilon}},\ \left(\frac{8L^{\frac{2}{\kappa}}\Upsilon}{\epsilon^{\frac{2}{\kappa}}}\right)^{\frac{\kappa}{3\kappa-2}}\right]\tag{5.5.74}
\]
Remarks. I. Note that FUGM-LMO does not require computing $\omega(\cdot)$ and $\omega'(\cdot)$ and operates solely with ${\rm CLMO}(\cdot,\cdot)$, the First Order oracle for $f$, and the Zero Order oracle for $\|\cdot\|$ (the latter is used to compute $\delta_{t,\mu}$, see (5.5.34)). Assuming $\kappa$ and $L$ known in advance, we can further avoid the necessity to compute the norm $\|\cdot\|$ and skip, essentially, the Inner loop. To this end it suffices to carry out a single step $\mu = 0$ of the Inner loop with $L_{t,0}$ set to $L^{\frac{2}{\kappa}}[N/\epsilon]^{\frac{2-\kappa}{\kappa}}$ and to skip computing $\delta_{t,0}$ – the latter quantity automatically is $\ge -\epsilon/N$, see item $1^0$ in the proof of Theorem 5.5.2, p. 396.
II. It is immediately seen that for small enough values of $\epsilon > 0$, we have $N^\#(\epsilon) \ge N_\#(\epsilon)$, so that the total, over all stages, number of outer steps before an $\epsilon$-solution is built is upper-bounded by $O(1)N^\#(\epsilon)$; with the "safe" stepsize policy, see Remark 5.5.1, and small $\epsilon$ the total, over all stages, number of Inner steps also is upper-bounded by $O(1)N^\#(\epsilon)$. Note that in the case of (5.5.70) and for small enough $\epsilon > 0$, we have
\[
N^\#(\epsilon) \le [O(1)\alpha(X)]^{\frac{\kappa}{2(\kappa-1)}}\left(\frac{LD^\kappa}{\epsilon}\right)^{\frac{1}{\kappa-1}},
\]
implying that the complexity bound of FUGM-LMO for small $\epsilon$ is at most by a factor $[O(1)\alpha(X)]^{\frac{\kappa}{2(\kappa-1)}}$ worse than the complexity bound for CGA as given by (5.5.62). The conclusions one can make from these observations are completely similar to those related to FGM-LMO, see p. 404.
becomes too time-consuming to be practical. Thus, we were looking for methods with cheap iterations, and stated that in the case of constrained nonsmooth problems (this is what we deal with in most applications), the only "cheap" methods we know are First Order algorithms. We then observed that, as a matter of fact, all these algorithms are black-box-oriented, and thus obey the limits of performance established by Information-based Complexity Theory, and focused on algorithms which achieve these limits of performance. The bad news here is that the "limits of performance" are quite poor; the best we can get here is complexity like $N(\epsilon) = O(1/\epsilon^2)$. When one tries to understand why this indeed is the best we can get, it becomes clear that the reason is in the "blindness" of the black-box-oriented algorithms we use; they utilize only local information on the problem (values and subgradients of the objective) and make no use of a priori knowledge of the problem's structure. Here it should be stressed that as far as Convex Programming is concerned, we nearly always do have detailed knowledge of the problem's structure – otherwise, how could we know that the problem is convex in the first place? The First Order methods as presented so far just do not know how to utilize the problem's structure, in sharp contrast with IPM's, which are fully "structure-oriented." Indeed, it is the structure of an optimization problem which underlies its setting as an LP/CQP/SDP program, and IPM's, same as the Simplex method, are quite far from operating with the values and subgradients of the objective and the constraints; instead, they operate directly on the problem's data. We can say that the LP/CQP/SDP framework is a "structure-revealing" one, and it allows for algorithms with fast convergence, algorithms capable of building a high accuracy solution in a moderate number of iterations. In contrast to this, the black-box-oriented framework is "structure-obscuring," and the associated algorithms, at least in the large-scale case, possess a really slow rate of convergence. The bad news here is that the LP/CQP/SDP framework requires iterations which in many applications are prohibitively time-consuming in the large scale case.
Now, for a long time the LP/CQP/SDP framework was the only one which allowed for utilizing the problem's structure. The situation changed significantly circa 2003, when Yu. Nesterov discovered a "computationally cheap" way to do so. The starting point in his discovery was the well known fact that the disastrous complexity $N(\epsilon) \ge O(1/\epsilon^2)$ in large-scale nonsmooth convex optimization comes from nonsmoothness; passing from nonsmooth problems (minimizing a convex Lipschitz continuous objective over a convex compact domain) to their smooth counterparts, where the objective, still a black-box-represented one, has a Lipschitz continuous gradient, reduces the complexity bound to $O(1/\sqrt\epsilon)$, and for smooth problems with favourable geometry, this reduced complexity turns out to be nearly- or fully dimension-independent, see section 5.5.1. Now, the "smooth" complexity $O(1/\sqrt\epsilon)$, while still not a polynomial-time one, is a dramatic improvement as compared to its "nonsmooth" counterpart $O(1/\epsilon^2)$. All this was known for a couple of decades and was of nearly no practical use, since problems of minimizing a smooth convex objective over a simple convex set (the latter should be simple in order to allow for computationally cheap algorithms) are a "rare commodity", almost never met in applications. The major part of Nesterov's 2003 breakthrough is the observation, surprisingly simple in hindsight$^{19}$, that
$^{19}$ Let me note that the number of actual breakthroughs which are "trivial in retrospect" is much larger than one could expect. For example, George Dantzig, the founder of Mathematical Programming, considered as one of his major contributions (and I believe, rightly so) the very formulation of an LP problem, specifically, the idea of adding to the constraints an objective to be optimized, instead of making decisions via "ground rules," which was standard at the time.
with a pretty simple (quite often, just bilinear) convex-concave function $\phi$ and a simple convex compact set $Y$.
Now, the saddle point reformulation (SP) of the problem of interest (CP) can be used in different ways. Nesterov himself used it to smoothen the function $f$. Specifically, he assumes (see [29]) that $\phi$ is of the form
\[
\phi(x,y) = x^T(Ay + a) - \psi(y)
\]
with convex $\psi$, and observes that then the function
\[
f_\delta(x) = \max_{y\in Y}\big[\phi(x,y) - \delta d(y)\big],
\]
where the "regularizer" $d(\cdot)$ is strongly convex, is smooth (possesses a Lipschitz continuous gradient) and is close to $f$ when $\delta$ is small. He then minimizes the smooth function $f_\delta$ by his optimal method for smooth convex minimization – the one with complexity $O(1/\sqrt\epsilon)$. Since the Lipschitz constant of the gradient of $f_\delta$ deteriorates as $\delta \to +0$, on one hand, and we need to work with small $\delta$, of the order of the accuracy $\epsilon$ to which we want to solve (CP), on the other hand, the overall complexity of the outlined smoothing method is $O(1/\epsilon)$, which still is much better than the "black-box nonsmooth" complexity $O(1/\epsilon^2)$ and in fact is, in many cases, as good as it could be.
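A textbook one-dimensional illustration of the construction (a standard example, not taken from the lecture itself): smoothing $f(x) = |x|$ with $Y = [-1,1]$, $\phi(x,y) = xy$ and $d(y) = \frac12y^2$ yields the Huber function
\[
f_\delta(x) = \max_{|y|\le1}\Big[xy - \frac{\delta}{2}y^2\Big]
= \begin{cases}\frac{x^2}{2\delta}, & |x|\le\delta,\\[2pt] |x| - \frac{\delta}{2}, & |x| > \delta,\end{cases}
\]
whose gradient is Lipschitz with constant $1/\delta$ and which satisfies $0 \le f - f_\delta \le \delta/2$; accuracy $\epsilon$ thus forces $\delta = O(\epsilon)$, which is where the overall $O(1/\epsilon)$ complexity comes from.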
where $B$ is an $n\times n$ matrix of spectral norm $\|B\|$ not exceeding $L$. This problem can be easily reformulated in the form of (SP), specifically, as
$^{20}$ these are exactly the functions responsible for the lower complexity bound $O(1/\epsilon^2)$.
and Nesterov's smoothing with $d(y) = \frac12y^Ty$ allows one to find an $\epsilon$-solution to the problem in $O(1)\frac{LR}{\epsilon}$ steps, provided that $LR/\epsilon \ge 1$; each step here requires two matrix-vector multiplications, one involving $B$ and one involving $B^T$. On the other hand, it is known that when $n > LR/\epsilon$, every method for solving (LS) which is given $b$ in advance and is allowed to "learn" $B$ via matrix-vector multiplications involving $B$ and $B^T$ needs, in the worst (over $b$ and $B$ with $\|B\| \le L$) case, at least $O(1)LR/\epsilon$ matrix-vector multiplications in order to find an $\epsilon$-solution to (LS). We can say that in the large-scale case the complexity of the pretty well structured Least Squares problem (LS) with respect to First Order methods is $O(1/\epsilon)$, and this is what Nesterov's smoothing achieves.
Nesterov's smoothing is one way to utilize a "nice" (with "good" $\phi$) saddle point representation (SP) of the objective in (CP) to accelerate processing the latter problem. In the sequel, we present another way to utilize (SP) – the Mirror Prox (MP) algorithm, which needs $\phi$ to be smooth (with Lipschitz continuous gradient), works directly on the saddle point problem (SP), and solves it with complexity $O(1/\epsilon)$, same as Nesterov's smoothing. Let us stress that the "fast gradient methods" (this seems to have become a common name for the algorithms we have just mentioned and their numerous "relatives" developed in recent years) do exploit the problem's structure: this is exactly what is used to build the saddle point reformulation of the problem of interest; and since the reformulated problem is solved by black-box-oriented methods, we have good chances to end up with computationally cheap algorithms.
Here we present instructive examples of "nice" – just bilinear – saddle point representations of important and "heavily nonsmooth" convex functions. Justification of the examples is left to the reader (a small numerical sanity check follows the list). Note that the list can be easily extended.
1. $\|\cdot\|_p$-norm:
\[
\|x\|_p = \max_{y:\|y\|_q\le1}\langle y, x\rangle,\quad q = \frac{p}{p-1}.
\]
Variation: the $p$-norm of the "positive part" $x^+ = [\max[x_1,0];...;\max[x_n,0]]$ of a vector $x\in\mathbf{R}^n$:
\[
\|x^+\|_p = \max_{y\in\mathbf{R}^n: y\ge0,\ \|y\|_q\le1}\langle y, x\rangle \equiv y^Tx,\quad q = \frac{p}{p-1}.
\]
Matrix analogy: for a symmetric matrix $x\in\mathbf{S}^n$, let $x^+$ be the matrix with the same eigenvectors and the eigenvalues replaced with their positive parts. Then
\[
|x^+|_p = \max_{y\in\mathbf{S}^n: y\succeq0,\ |y|_q\le1}\langle y, x\rangle = {\rm Tr}(yx),\quad q = \frac{p}{p-1}.
\]
2. The maximal entry $s_1(x)$ in a vector $x$ and the sum $s_k(x)$ of the $k$ maximal entries in a vector $x\in\mathbf{R}^n$:
\[
\forall(x\in\mathbf{R}^n,\ k\le n):\quad
s_1(x) = \max_{y\in\Delta_n}\langle y, x\rangle,\quad \Delta_n = \{y\in\mathbf{R}^n: y\ge0,\ \textstyle\sum_iy_i = 1\};
\]
\[
s_k(x) = \max_{y\in\Delta_{n,k}}\langle y, x\rangle,\quad \Delta_{n,k} = \{y\in\mathbf{R}^n: 0\le y_i\le1\ \forall i,\ \textstyle\sum_iy_i = k\}.
\]
Matrix analogies: the maximal eigenvalue $S_1(x)$ and the sum $S_k(x)$ of the $k$ largest eigenvalues of a matrix $x\in\mathbf{S}^n$:
\[
\forall(x\in\mathbf{S}^n,\ k\le n):\quad
S_1(x) = \max_{y\in D_n}\langle y, x\rangle \equiv {\rm Tr}(yx),\quad D_n = \{y\in\mathbf{S}^n: y\succeq0,\ {\rm Tr}(y) = 1\};
\]
\[
S_k(x) = \max_{y\in D_{n,k}}\langle y, x\rangle \equiv {\rm Tr}(yx),\quad D_{n,k} = \{y\in\mathbf{S}^n: 0\preceq y\preceq I,\ {\rm Tr}(y) = k\}.
\]
3. The maximal magnitude $s_1({\rm abs}[x])$ of entries in $x$ and the sum $s_k({\rm abs}[x])$ of the $k$ largest magnitudes of entries in $x$:
\[
\forall(x\in\mathbf{R}^n,\ k\le n):\quad
s_1({\rm abs}[x]) = \max_{y\in\mathbf{R}^n,\ \|y\|_1\le1}\langle y, x\rangle;\quad
s_k({\rm abs}[x]) = \max_{y\in\mathbf{R}^n: \|y\|_\infty\le1,\ \|y\|_1\le k}\langle y, x\rangle \equiv y^Tx.
\]
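Here is the promised sanity check of the first and the last representations in the list – a quick numpy script under our own naming; the closed-form maximizers used below are standard facts about dual norms:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(7)

# ||x||_p = max over ||y||_q <= 1 of <y, x>: the maximizer has
# y_i proportional to sign(x_i)|x_i|^{p-1}
p = 3.0
q = p / (p - 1.0)
y = np.sign(x) * np.abs(x) ** (p - 1)
y /= np.linalg.norm(y, ord=q)
assert abs(y @ x - np.linalg.norm(x, ord=p)) < 1e-10

# s_k(abs[x]): the max of <y, x> over {||y||_inf <= 1, ||y||_1 <= k}
# equals the sum of the k largest |x_i|, attained at a +-1 vector
# supported on those entries
k = 3
y = np.zeros_like(x)
idx = np.argsort(-np.abs(x))[:k]
y[idx] = np.sign(x[idx])
assert abs(y @ x - np.sort(np.abs(x))[-k:].sum()) < 1e-10
```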
We augment this sample of "raw materials" with calculus rules. We restrict ourselves to bilinear saddle point representations:
\[
f(x) = \max_{y\in Y}\big[\langle y, Ax\rangle + \langle a, x\rangle + \langle b, y\rangle + c\big],
\]
where $x$ runs through a Euclidean space $E_x$, $Y$ is a convex compact subset in a Euclidean space $E_y$, $x \mapsto Ax: E_x \to E_y$ is a linear mapping, and $\langle\cdot,\cdot\rangle$ are the inner products in the respective spaces. In the rules to follow, all $Y_i$ are nonempty convex compact sets.
1. [Taking conic combinations] Let
\[
f_i(x) = \max_{y_i\in Y_i}\big[\langle y_i, A_ix\rangle + \langle a_i, x\rangle + \langle b_i, y_i\rangle + c_i\big],\quad 1\le i\le m.
\]
Then
\[
f(x \equiv [x^1;...;x^m]) = \sum_{i=1}^m\lambda_if_i(x^i)
= \max_{y=[y_1;...;y_m]\in Y = Y_1\times...\times Y_m}\Big[\underbrace{\textstyle\sum_i\lambda_i\langle y_i, A_ix^i\rangle}_{=:\langle y, Ax\rangle} + \underbrace{\textstyle\sum_i\lambda_i\langle a_i, x^i\rangle}_{=:\langle a, x\rangle} + \underbrace{\textstyle\sum_i\lambda_i\langle b_i, y_i\rangle}_{=:\langle b, y\rangle} + \underbrace{\textstyle\sum_i\lambda_ic_i}_{=:c}\Big].
\]
Then
\[
f(x) := \max_if_i(x)
= \max_{y=[(\eta_1,u_1);...;(\eta_m,u_m)]\in Y}\Big[\underbrace{\textstyle\sum_i\langle u_i, A_ix\rangle}_{=:\langle y, Ax\rangle} + \langle\textstyle\sum_i\eta_ia_i, x\rangle + \underbrace{\textstyle\sum_i\langle b_i, u_i\rangle + \textstyle\sum_ic_i\eta_i}_{=:\langle b, y\rangle}\Big],
\]
\[
Y = \Big\{y = [(\eta_1,u_1);...;(\eta_m,u_m)]:\ (\eta_i,u_i)\in\widehat Y_i,\ 1\le i\le m,\ \textstyle\sum_i\eta_i = 1\Big\},
\]
due to
\[
\begin{array}{rl}
\max_if_i(x) =& \max_{\eta\in\Delta_m}\sum_i\eta_if_i(x)\\
=& \max_{\eta\in\Delta_m,\ y_i\in Y_i,\ 1\le i\le m}\Big[\sum_i\langle\underbrace{\eta_iy_i}_{=:u_i}, A_ix\rangle + \sum_i\langle\eta_ia_i, x\rangle + \sum_i\langle b_i, \eta_iy_i\rangle + \sum_i\eta_ic_i\Big]\\
=& \max_{(\eta_i,u_i)\in\widehat Y_i,\ 1\le i\le m,\ \sum_i\eta_i=1}\Big[\big[\textstyle\sum_i\langle u_i, A_ix\rangle + \langle\textstyle\sum_i\eta_ia_i, x\rangle\big] + \big[\textstyle\sum_i\langle b_i, u_i\rangle + \textstyle\sum_ic_i\eta_i\big]\Big].
\end{array}
\]
with smooth convex-concave cost function $\phi$. Smoothness means that $\phi$ is continuously differentiable on $Z = X\times Y$, and that the associated vector field (cf. (5.3.47))
\[
F(z = (x,y)) = [F_x(x,y) := \nabla_x\phi(x,y);\ F_y(x,y) := -\nabla_y\phi(x,y)]:\ Z := X\times Y \to \mathbf{R}^{n_x}\times\mathbf{R}^{n_y}
\]
is Lipschitz continuous on $Z$.
The setup for the MP algorithm as applied to (SP) is the same as for the saddle point Mirror Descent, see section 5.3.5: specifically, we fix a norm $\|\cdot\|$ on the embedding space $\mathbf{R}^{n_x}\times\mathbf{R}^{n_y}$ of the domain $Z$ of $\phi$ and a DGF $\omega$ for this domain. Since $F$ is Lipschitz continuous, we have for properly chosen $L < \infty$:

where $\gamma_t$ are positive stepsizes. This looks pretty much like the saddle point MD algorithm, see (5.3.48); the only difference is in what is called the extra-gradient step: what was the updating $z_t \mapsto z_{t+1}$ in MD now defines an intermediate point $w_t$, and the actual updating $z_t \mapsto z_{t+1}$ uses, instead of the value of $F$ at $z_t$, the value of $F$ at this intermediate point. Another difference is that now the approximate solutions $z^t$ are weighted sums of the intermediate points $w_\tau$, and not of the subsequent prox-centers $z_\tau$, as in MD. These "minor changes" result in major consequences (provided that $F$ is Lipschitz continuous). Here is the corresponding result:
Theorem 5.5.4 Let the convex-concave saddle point problem (SP) be solved by the MP algorithm. Then
(i) For every $\tau$, one has for all $u\in Z$:
\[
\begin{array}{rll}
\gamma_\tau\langle F(w_\tau), w_\tau - u\rangle \le& V_{z_\tau}(u) - V_{z_{\tau+1}}(u) + \underbrace{[\gamma_\tau\langle F(w_\tau), w_\tau - z_{\tau+1}\rangle - V_{z_\tau}(z_{\tau+1})]}_{=:\delta_\tau},&(a)\\
\delta_\tau \le& \gamma_\tau\langle F(z_{\tau+1}) - F(w_\tau), w_\tau - z_{\tau+1}\rangle - V_{z_\tau}(w_\tau) - V_{w_\tau}(z_{\tau+1})&(b)\\
\le& \frac{\gamma_\tau^2}{2}\|F(w_\tau) - F(z_{\tau+1})\|_*^2 - \frac12\|w_\tau - z_{\tau+1}\|^2&(c)
\end{array}\tag{5.5.77}
\]
In particular, in the case of (5.5.75) one has
\[
\gamma_\tau \le \frac1L\ \Rightarrow\ \delta_\tau \le 0,
\]
see (5.5.77.c).
(ii) For $t = 1, 2, ...$, one has
\[
\epsilon_{\rm sad}(z^t) \le \frac{\frac{\Omega^2}{2} + \sum_{\tau=1}^t\delta_\tau}{\sum_{\tau=1}^t\gamma_\tau}.\tag{5.5.78}
\]
In particular, let the stepsizes $\gamma_\tau$, $1\le\tau\le t$, be such that $\gamma_\tau \ge \frac1L$ and $\delta_\tau \le 0$ for all $\tau \le t$ (by (i), this is definitely so for the stepsizes $\gamma_\tau \equiv \frac1L$, provided that (5.5.75) takes place). Then
\[
\epsilon_{\rm sad}(z^t) \le \frac{\Omega^2}{2\sum_{\tau=1}^t\gamma_\tau} \le \frac{\Omega^2L}{2t}.\tag{5.5.79}
\]
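To make the recursion concrete, here is a minimal numpy sketch of MP in the Euclidean setup for the bilinear saddle point $\min_{\|x\|_2\le1}\max_{\|y\|_2\le1}y^TAx$, where the prox-mapping is plain projection; the naming and the stopping rule are ours, and this is an illustration under those assumptions rather than the general algorithm:

```python
import numpy as np

def project_ball(z):
    # Euclidean projection onto the unit ball (Euclidean proximal setup)
    nrm = np.linalg.norm(z)
    return z if nrm <= 1 else z / nrm

def mirror_prox(A, T=1000):
    # F(x, y) = [A^T y; -A x]; gamma = 1/L with L = ||A||_2, so that
    # delta_tau <= 0 by (5.5.77.c)
    m, n = A.shape
    x, y = np.zeros(n), np.zeros(m)
    gamma = 1.0 / np.linalg.norm(A, 2)
    xs, ys = np.zeros(n), np.zeros(m)
    for _ in range(T):
        # extra-gradient step: w_t from F(z_t), then z_{t+1} from F(w_t)
        wx = project_ball(x - gamma * (A.T @ y))
        wy = project_ball(y + gamma * (A @ x))
        x = project_ball(x - gamma * (A.T @ wy))
        y = project_ball(y + gamma * (A @ wx))
        xs += wx; ys += wy          # approximate solution averages the w_tau's
    return xs / T, ys / T
```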
As we remember, from $w = {\rm Prox}_z(\xi)$ and $z_+ = {\rm Prox}_z(\eta)$ it follows that $w, z_+ \in Z^o$ and
5.5.4.4 Refinement
Same as for the various versions of MD, the complexity analysis of MP can be refined. The corresponding version of Theorem 5.5.4 is as follows.
Theorem 5.5.5 Let (5.3.53) take place, and let positive reals $\lambda_1, \lambda_2, ...$ satisfy (5.3.28). Let the convex-concave saddle point problem (SP) be solved by the MP algorithm, and let
\[
z^t = \frac{\sum_{\tau=1}^t\lambda_\tau w_\tau}{\sum_{\tau=1}^t\lambda_\tau}.
\]
Then
\[
\epsilon_{\rm sad}(z^t) \le \frac{\frac{\lambda_t}{\gamma_t}\frac{\Omega_*^2}{2} + \sum_{\tau=1}^t\frac{\lambda_\tau}{\gamma_\tau}\delta_\tau}{\sum_{\tau=1}^t\lambda_\tau},\tag{5.5.81}
\]
$^{21}$ Note that the derivation we have just carried out is based solely on (5.5.80) and is completely independent of what $\xi$ and $\eta$ are; this observation underlies numerous extensions and modifications of MP.
In particular, if $\delta_\tau \le 0$ for all $\tau$, then
\[
\epsilon_{\rm sad}(z^t) \le \frac{\frac{\lambda_t}{\gamma_t}\Omega_*^2}{2\sum_{\tau=1}^t\lambda_\tau}.\tag{5.5.82}
\]
Proof. Invoking Theorem 5.5.4, we get (5.5.77). This relation implies for every $\tau$:
\[
\forall(u\in Z):\ \lambda_\tau\langle F(w_\tau), w_\tau - u\rangle \le \frac{\lambda_\tau}{\gamma_\tau}\big[V_{z_\tau}(u) - V_{z_{\tau+1}}(u)\big] + \frac{\lambda_\tau}{\gamma_\tau}\delta_\tau,
\]
whence
\[
\max_{u\in Z}\Lambda_t^{-1}\sum_{\tau=1}^t\lambda_\tau\langle F(w_\tau), w_\tau - u\rangle \le \frac{\frac{\lambda_t}{\gamma_t}\frac{\Omega_*^2}{2} + \sum_{\tau=1}^t\frac{\lambda_\tau}{\gamma_\tau}\delta_\tau}{\Lambda_t},\qquad \Lambda_t = \sum_{\tau=1}^t\lambda_\tau.
\]
Same as in the proof of Theorem 5.3.4, the latter inequality implies (5.5.81). The latter relation under the premise of the Theorem immediately implies (5.5.82). $\square$
Now, the simplest way to aggregate the standard proximal setups for the standard blocks $Z_j$ into such a setup for $\widehat Z$ is as follows: denoting by $\|\cdot\|_{(j)}$ the norms, and by $\omega_j(z^j)$ the DGFs as given by the setups for the blocks, we equip the embedding space of $\widehat Z$ with the norm
\[
\|[z^1;...;z^N]\| = \sqrt{\sum_{j=1}^N\alpha_j\|z^j\|_{(j)}^2},
\]
and with the DGF
\[
\omega([z^1;...;z^N]) = \sum_{j=1}^N\beta_j\omega_j(z^j).
\]
Here $\alpha_j$ and $\beta_j$ are positive aggregation weights, which we choose in order to end up with a legitimate proximal setup for $Z \subset E$ with the best possible efficiency estimate. The standard recipe is as follows (for justification, see [21]):
Our initial data are the "partial proximal setups" $\|\cdot\|_{(j)}$, $\omega_j(\cdot)$ along with the upper bounds $\Theta_j$ on the variations
Now, the direct product structure of the embedding space of $Z$ allows us to represent the vector field $F$ associated with the convex-concave function in question in the "block form"
\[
F(z^1,...,z^N) = [F_1(z);...;F_N(z)].
\]
$^{22}$ We could treat the standard simplex as a standard block associated with the Entropy/Simplex proximal setup; to save words, we prefer to treat standard simplexes as parts of the unit $\ell_1$ balls.
Assuming this field Lipschitz continuous, we can find (upper bounds on) the "partial Lipschitz constants" $L_{ij}$ of this field, so that
\[
\forall(i,\ z = [z^1;...;z^N]\in Z,\ w = [w^1;...;w^N]\in Z):\quad \|F_i(z) - F_i(w)\|_{(i,*)} \le \sum_{j=1}^NL_{ij}\|z^j - w^j\|_{(j)},
\]
where $\|\cdot\|_{(i,*)}$ are the norms conjugate to $\|\cdot\|_{(i)}$. W.l.o.g. we assume the bounds $L_{ij}$ symmetric: $L_{ij} = L_{ji}$.
In terms of the parameters of the partial setups and the quantities $L_{ij}$, the aggregation is as follows: we set
\[
M_{ij} = L_{ij}\sqrt{\Theta_i\Theta_j},\qquad \sigma_k = \frac{\sum_jM_{kj}}{\sum_{i,j}M_{ij}},\qquad \alpha_j = \frac{\sigma_j}{\Theta_j},
\]
which results in
\[
\Theta = \max_{\widehat Z}\omega(\cdot) - \min_{\widehat Z}\omega(\cdot) \le 1
\]
(that is, the $\omega$-radius of $\widehat Z$ does not exceed $\sqrt2$). The condition (5.5.75) is now satisfied with
\[
L = \mathcal{L} := \sum_{i,j}L_{ij}\sqrt{\Theta_i\Theta_j},
\]
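The recipe is mechanical enough to code; here is a minimal numpy sketch (the helper name and inputs are our assumptions) computing the weights $\alpha_j$ and the aggregated Lipschitz constant from given $\Theta_j$ and $L_{ij}$:

```python
import numpy as np

def aggregate_setup(Theta, Lip):
    # Theta: omega-radii bounds Theta_j of the blocks;
    # Lip: symmetric matrix of partial Lipschitz constants L_ij
    Theta = np.asarray(Theta, dtype=float)
    M = np.asarray(Lip, dtype=float) * np.sqrt(np.outer(Theta, Theta))
    sigma = M.sum(axis=1) / M.sum()   # sigma_k = sum_j M_kj / sum_{i,j} M_ij
    alpha = sigma / Theta             # alpha_j = sigma_j / Theta_j
    return alpha, M.sum()             # aggregated L = sum_{i,j} M_ij

# e.g., two blocks with Theta = (ln n, ln m) and off-diagonal constant 2.5:
alpha, L = aggregate_setup([np.log(1000.0), np.log(500.0)],
                           [[0.0, 2.5], [2.5, 0.0]])
```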
where, as always, $r_* = \frac{r}{r-1}$.
We are in the situation of $Z_1 = X$, $Z_2 = Y$ with
\[
F(z^1 \equiv x, z^2 \equiv y) = \big[F_1(y) = B^Ty;\ F_2(x) = b - Bx\big].
\]
Besides this, the blocks $Z_1$, $Z_2$ are either unit $\|\cdot\|_2$-balls, or unit $\ell_1$ balls. The associated quantities $\Theta_i$, $L_{ij}$, $i,j = 1,2$, within absolute constant factors, are as follows:
\[
\begin{array}{c|cc}
& p = 1 & p = 2\\\hline
\Theta_1 & \ln(n) & 1
\end{array}
\qquad
\begin{array}{c|cc}
& r = 2 & r = \infty\\\hline
\Theta_2 & 1 & \ln(m)
\end{array}
\qquad
[L_{ij}]_{i,j=1}^2 = \begin{bmatrix}0 & \|B\|_{p\to r}\\ \|B\|_{p\to r} & 0\end{bmatrix},\quad \|B\|_{p\to r} = \max_{\|x\|_p\le1}\|Bx\|_r.
\]
The matrix norms $\|B\|_{p\to\max[r,2]}$, again within absolute constant factors, are given in the following table:
\[
\begin{array}{c|cc}
& p = 1 & p = 2\\\hline
r = 1 & \max_{j\le n}\|{\rm Col}_j[B]\|_2 & \sigma_{\max}(B)\\
r = 2 & \max_{j\le n}\|{\rm Col}_j[B]\|_2 & \sigma_{\max}(B)
\end{array}
\]
where, as always, ${\rm Col}_j[B]$ is the $j$-th column of the matrix $B$. For example, consider the situations of extreme interest for Compressed Sensing, namely, $\ell_1$-minimization with $\|\cdot\|_2$-fit and $\ell_1$-minimization with $\ell_\infty$-fit (that is, the cases ($p = 1, r = 2$) and ($p = 1, r = \infty$), respectively). Here the efficiency estimates achievable with MP (the best known so far in the large scale case, although achievable not only with MP) are as follows:
Proof. Let us set $\hat\omega(w) = \omega(w) + \phi(w)$. Since $\phi$ is continuously differentiable on $X$, $\hat\omega$ possesses the same smoothness properties as $\omega$: specifically, $\hat\omega$ is convex and continuous on $X$, $\{w\in X: \partial\hat\omega(w)\ne\emptyset\} = X^o := \{w: \partial\omega(w)\ne\emptyset\}$, and $\hat\omega'(w) := \omega'(w) + \nabla\phi(w): X^o\to E$ is a continuous selection of subgradients of $\hat\omega$.
W.l.o.g. we can assume that ${\rm int}\,X \ne \emptyset$ and that $z_+ = 0$; note that $z_+$ is a minimizer of $\hat\omega(w) + \Psi(w)$ over $w\in X$. Let $\bar w\in{\rm int}\,X$, and let $w_t = t\bar w$, so that $w_t\in{\rm int}\,X$ for $0 < t \le 1$. Let us prove first that $z_+ = 0\in X^o$. To this end it suffices to verify that the vectors $\hat\omega'(w_t)$ remain bounded as $t\to+0$; indeed, in this case the set of limiting points of $\hat\omega'(w_t)$ as $t\to+0$ is nonempty; since $\hat\omega(\cdot)$ is continuous on $X$, every one of these limiting points belongs to $\partial\hat\omega(0)$, and thus $\partial\hat\omega(0)\ne\emptyset$, meaning that $0\in X^o$. To prove that $\hat\omega'(w_t)$ is bounded as $t\to+0$, let $r > 0$ be such that with $B = \{h\in E: \|h\|_E \le r\}$ we have $B^+ = \bar w + B \subset {\rm int}\,X$. All we need to prove is that the linear forms $\langle\hat\omega'(w_t), h\rangle$ of $h\in E$ are, uniformly in $t\in(0,1/2)$, bounded from above on $B$. Convexity of $\hat\omega$ says that these forms are uniformly bounded from
$t\in[0,1]$, whence $\langle p + \omega'(x_t), z - x_t\rangle \le V$ for all $t\in(0,1]$ and all $z\in B$. On the other hand,
We see that $\omega'(x_t)$ remains bounded as $t\to0$; thus, there exists a sequence $t_i\to+0$, $i\to\infty$, and $e\in\mathbf{R}^n$ such that $\omega'(x_{t_i})\to e$ as $i\to\infty$. Since $\omega$ is continuous on $X$ and $x_{t_i}\to x_*$ as $i\to\infty$, $e\in\partial\omega(x_*)$, that is, $x_*\in X^o$, whence in fact $\omega'(x_t)\to\omega'(x_*)$, $t\to+0$, and thus $\langle p + \omega'(x_*), y - x_*\rangle = \lim_{t\to+0}\phi'(t) \ge 0$. Thus, $x_*\in X^o$ and (5.3.6) holds true for all $u\in U\cap{\rm int}\,X$. The latter set is dense in $U$, so that (5.3.6) holds true for all $u\in U$, as claimed. $\square$
\[
D^2\omega(x)[h,h] := \frac{d^2}{dt^2}\Big|_{t=0}\omega(x + th) \ge \kappa\|h\|^2\quad\forall\Big(x, h:\ x > 0,\ \sum_ix_i \le 1\Big)\tag{5.6.1}
\]
Theorem 5.6.1 Let $E = \mathbf{R}^{k_1}\times...\times\mathbf{R}^{k_n}$, $\|\cdot\|$ be the $\ell_1/\ell_2$ norm on $E$, and let $Z$ be the unit ball of the norm $\|\cdot\|$. Then the function
\[
\omega(x = [x^1;...;x^n]) = \frac{1}{p\gamma}\sum_{j=1}^n\|x^j\|_2^p:\ Z\to\mathbf{R},\qquad
p = \begin{cases}2, & n\le2\\ 1 + \frac{1}{\ln n}, & n\ge3\end{cases},\quad
\gamma = \begin{cases}1, & n = 1\\ 1/2, & n = 2\\ \frac{1}{e\ln(n)}, & n > 2\end{cases}\tag{5.6.2}
\]
is a continuously differentiable DGF for $Z$ compatible with $\|\cdot\|$. As a result, whenever $X\subset Z$ is a nonempty closed convex subset of $Z$, the restriction $\omega|_X$ of $\omega$ on $X$ is a DGF for $X$ compatible with $\|\cdot\|$, and the $\omega|_X$-radius of $X$ can be bounded as
\[
\Omega \le \sqrt{2e\ln(n+1)}.\tag{5.6.3}
\]
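A minimal numpy sketch evaluating the DGF (5.6.2) on a block vector, with $p$ and $\gamma$ set as in the theorem (the helper name is ours):

```python
import numpy as np

def l1l2_dgf(blocks):
    # blocks: list of the blocks x^1, ..., x^n of x
    n = len(blocks)
    if n <= 2:
        p, gamma = 2.0, (1.0 if n == 1 else 0.5)
    else:
        p, gamma = 1.0 + 1.0 / np.log(n), 1.0 / (np.e * np.log(n))
    return sum(np.linalg.norm(xj) ** p for xj in blocks) / (p * gamma)

# omega-radius bound (5.6.3) for, e.g., n = 100 blocks:
n = 100
print(np.sqrt(2 * np.e * np.log(n + 1)))
```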
Setting $t_j = \|x^j\|_2 \ge 0$, we have $\sum_jt_j \le 1$, whence due to $0 \le 2 - p \le 1$ it holds $\sum_jt_j^{2-p} \le n^{p-1}$. Thus,
\[
\Big[\sum_j\|h^j\|_2\Big]^2 \le \frac{\gamma}{p-1}n^{p-1}\,D^2\omega(x)[h,h],\tag{5.6.6}
\]
while
\[
\max_{x\in Z}\omega(x) - \min_{x\in Z}\omega(x) \le \frac{1}{\gamma p}.\tag{5.6.7}
\]
With $p$, $\gamma$ as stated in Theorem 5.6.1, when $n\ge3$ we get $\frac{\gamma}{p-1}n^{p-1} = 1$, and similarly for $n = 1, 2$. Consequently,
\[
\forall(x\in Z',\ h = [h^1;...;h^n]):\quad \Big[\sum_{j=1}^n\|h^j\|_2\Big]^2 \le D^2\omega(x)[h,h].\tag{5.6.8}
\]
Since $\omega(\cdot)$ is continuously differentiable and the complement of $Z'$ in $Z$ is the union of finitely many proper linear subspaces of $E$, (5.6.8) implies that $\omega$ is strongly convex on $Z$, modulus 1, w.r.t. the $\ell_1/\ell_2$ norm $\|\cdot\|$. Besides this, we have
\[
\frac{1}{\gamma p}\ \begin{cases}= \frac12, & n = 1\\ = 1, & n = 2\\ \le e\ln(n), & n\ge3\end{cases}\qquad \le\ O(1)\ln(n+1).
\]
with $\gamma$, $p$ given by (5.6.2); note that $p\in(1,2]$. As we know from Theorem 5.6.1, the restriction of $w(\cdot)$ onto the unit ball of the $\ell_1/\ell_2$ norm $\|\cdot\|$ is continuously differentiable and strongly convex, modulus 1, w.r.t. $\|\cdot\|$. Besides this, $w(\cdot)$ clearly is nonnegative and positively homogeneous of degree $p$: $w(\lambda x) \equiv |\lambda|^pw(x)$. Note that the outlined properties clearly imply that $w(\cdot)$ is convex and continuously differentiable on the entire $E$. We need the following simple
Lemma 5.6.2 Let $H$ be a Euclidean space, $\|\cdot\|_H$ be a norm (not necessarily the Euclidean one) on $H$, and let $Y_H = \{y\in H: \|y\|_H \le 1\}$. Let, further, $\zeta: H\to\mathbf{R}$ be a nonnegative continuously differentiable function, absolutely homogeneous of a degree $p\in(1,2]$, and strongly convex, modulus 1, w.r.t. $\|\cdot\|_H$, on $Y_H$. Then the function
\[
\hat\zeta(y) = \zeta^{2/p}(y):\ H\to\mathbf{R}
\]
is strongly convex, modulus $\beta = \frac2p\big[\min_{y:\|y\|_H = 1}\zeta(y)\big]^{\frac2p - 1}$, w.r.t. $\|\cdot\|_H$, on the entire $H$.
Proof. The facts that $\hat\zeta(\cdot)$ is a continuously differentiable, convex, and homogeneous of degree 2 function on $H$ are evident, and all we need is to verify that the function is strongly convex, modulus $\beta$, w.r.t. $\|\cdot\|_H$:
It is immediately seen that for a continuously differentiable function $\hat\zeta$, the target property is local: to establish (5.6.10), it suffices to verify that every $\bar y\in H\setminus\{0\}$ admits a neighbourhood $U$ such that the inequality in (5.6.10) holds true for all $y, y'\in U$. By evident homogeneity reasons, in the case of a homogeneous of degree 2 function $\hat\zeta$, to establish the latter property for every $\bar y\in H\setminus\{0\}$ is the same as to prove it in the case of $\|\bar y\|_H = 1 - \epsilon$, for any fixed $\epsilon\in(0,1/3)$. Thus, let us fix an $\epsilon\in(0,1)$ and $\bar y\in H$ with $\|\bar y\|_H = 1 - \epsilon$, and let $U = \{y': \|y' - \bar y\|_H \le \epsilon/2\}$, so that $U$ is a compact set contained in the interior of $Y_H$. Now let $\phi(\cdot)$ be a nonnegative $C^\infty$ function on $H$ which vanishes outside of $Y_H$ and has unit integral w.r.t. the Lebesgue measure on $H$. For $0 < r \le 1$, let us set

where $*$ stands for convolution. As $r\to+0$, the nonnegative $C^\infty$ function $\zeta_r(\cdot)$ converges, with the first order derivative, uniformly on $U$, to $\zeta(\cdot)$, and $\hat\zeta_r(\cdot)$ converges, with the first order derivative, uniformly on $U$, to $\hat\zeta(\cdot)$. Besides this, when $r < \epsilon/2$, the function $\zeta_r(\cdot)$ is, along with $\zeta(\cdot)$, strongly convex, modulus 1, w.r.t. $\|\cdot\|_H$, on $U$ (since $\zeta_r'(\cdot)$ is the convolution of $\zeta'(\cdot)$ and $\phi_r(\cdot)$). Since $\zeta_r$ is $C^\infty$, we conclude that when $r < \epsilon/2$, we have

whence also
\[
D^2\hat\zeta_r(y)[h,h] = \frac2p\Big(\frac2p - 1\Big)\zeta_r^{2/p-2}(y)\big[D\zeta_r(y)[h]\big]^2 + \frac2p\zeta_r^{2/p-1}(y)D^2\zeta_r(y)[h,h]
\ge \frac2p\zeta_r^{2/p-1}(y)\|h\|_H^2\quad\forall(h\in H,\ y\in U).
\]
As $r\to+0$, $\hat\zeta_r'(\cdot)$ uniformly on $U$ converges to $\hat\zeta'(\cdot)$, and $\beta_r[U]$ converges to $\beta[U] = \frac2p\min_{u\in U}\zeta^{2/p-1}(u) \ge (1 - 3\epsilon/2)^{2-p}\beta$ (recall the origin of $\beta$ and $U$ and note that $\zeta$ is absolutely homogeneous of degree $p$), and we arrive at the relation

As we have already explained, the just established validity of the latter relation for every $U = \{y: \|y - \bar y\|_H \le \epsilon/2\}$ and every $\bar y$, $\|\bar y\|_H = 1 - \epsilon$, implies that $\hat\zeta$ is strongly convex, modulus $(1 - 3\epsilon/2)^{2-p}\beta$, on the entire $H$. Since $\epsilon > 0$ is arbitrary, Lemma 5.6.2 is proved. $\square$
B. Applying Lemma 5.6.2 to $E$ in the role of $H$, the $\ell_1/\ell_2$ norm $\|\cdot\|$ in the role of $\|\cdot\|_H$, and the function $w(\cdot)$ given by (5.6.9) in the role of $\zeta$, we conclude that the function $\hat w(\cdot) \equiv w^{2/p}(\cdot)$ is strongly convex, modulus $\beta = \frac2p\big[\min_y\{w(y): y\in E, \|y\| = 1\}\big]^{2/p-1}$, w.r.t. $\|\cdot\|$, on the entire space $E$, whence the function $\beta^{-1}\hat w(z)$ is continuously differentiable and strongly convex, modulus 1, w.r.t. $\|\cdot\|$, on the entire $E$. This observation, in view of the evident relation $\beta = \frac2p(n^{1-p}\gamma^{-1}p^{-1})^{2/p-1}$ and the origin of $\gamma$, $p$, immediately implies all claims in Corollary 5.6.1. $\square$
Theorem 5.6.2 Let $\mu = (\mu_1,...,\mu_n)$ and $\nu = (\nu_1,...,\nu_n)$ be collections of positive reals such that $\mu_j \le \nu_j$ for all $j$, and let
\[
m = \sum_{j=1}^n\mu_j.
\]
With $E = \mathbf{M}^{\mu\nu}$ and $\|\cdot\| = \|\cdot\|_{\rm nuc}$, let $Z = \{x\in E: \|x\|_{\rm nuc} \le 1\}$. Then the function
\[
\omega(x) = \frac{4\sqrt{e}\ln(2m)}{2^q(1+q)}\sum_{i=1}^m\sigma_i^{1+q}(x):\ Z\to\mathbf{R},\quad q = \frac{1}{2\ln(2m)},\tag{5.6.11}
\]
is a continuously differentiable DGF for $Z$ compatible with $\|\cdot\|_{\rm nuc}$, so that for every nonempty closed convex subset $X$ of $Z$, the function $\omega|_X$ is a DGF for $X$ compatible with $\|\cdot\|_{\rm nuc}$. Besides this, the $\omega|_X$-radius of $X$ can be bounded as
\[
\Omega \le 2\sqrt{2\sqrt{e}\ln(2m)} \le 4\sqrt{\ln(2m)}.\tag{5.6.12}
\]
The function $\hat\omega(\cdot)$ is continuously differentiable, convex, and its restriction on the set $Y_F = \{y\in F: |y|_1 \le 1\}$ is strongly convex, modulus 1, w.r.t. $|\cdot|_1$.
Proof:
$1^0$. Let $0 < q < 1$. Consider the following function of $y\in\mathbf{S}^N$:
\[
\chi(y) = \frac{1}{1+q}\sum_{j=1}^N|\lambda_j(y)|^{1+q} = {\rm Tr}(f(y)),\quad f(s) = \frac{1}{1+q}|s|^{1+q}.\tag{5.6.15}
\]
$2^0$. The function $f(s)$ is continuously differentiable on the axis and twice continuously differentiable outside of the origin; consequently, we can find a sequence of polynomials $f_k(s)$ converging, as $k\to\infty$, to $f$ along with their first derivatives uniformly on every compact subset of $\mathbf{R}$ and, besides this, converging to $f$ uniformly along with the first and the second derivative on every compact subset of $\mathbf{R}\setminus\{0\}$. Now let $y, h\in\mathbf{S}^N$, let $y = u{\rm Diag}\{\lambda\}u^T$ be the eigenvalue decomposition of $y$, and let $\hat h = u^Thu$. For a polynomial $p(s) = \sum_{k=0}^Kp_ks^k$, setting $P(w) = {\rm Tr}(\sum_{k=0}^Kp_kw^k): \mathbf{S}^N\to\mathbf{R}$, and denoting by $\gamma$ a closed contour in $\mathbf{C}$ encircling the spectrum of $y$, we have
\[
\begin{array}{rl}
(a)& P(y) = {\rm Tr}(p(y)) = \sum_{j=1}^Np(\lambda_j(y))\\
(b)& DP(y)[h] = \sum_{k=1}^Kkp_k{\rm Tr}(y^{k-1}h) = {\rm Tr}(p'(y)h) = \sum_{j=1}^Np'(\lambda_j(y))\hat h_{jj}\\
(c)& D^2P(y)[h,h] = \frac{d}{dt}\big|_{t=0}DP(y+th)[h] = \frac{d}{dt}\big|_{t=0}{\rm Tr}(p'(y+th)h)\\
&= \frac{d}{dt}\big|_{t=0}\frac{1}{2\pi\imath}\oint_\gamma{\rm Tr}\big(h(zI - (y+th))^{-1}\big)p'(z)dz = \frac{1}{2\pi\imath}\oint_\gamma{\rm Tr}\big(h(zI-y)^{-1}h(zI-y)^{-1}\big)p'(z)dz\\
&= \frac{1}{2\pi\imath}\oint_\gamma\sum_{i,j=1}^N\hat h_{ij}^2\frac{p'(z)}{(z-\lambda_i(y))(z-\lambda_j(y))}dz = \sum_{i,j=1}^N\hat h_{ij}^2\Gamma_{ij},\\
&\Gamma_{ij} = \begin{cases}\frac{p'(\lambda_i(y)) - p'(\lambda_j(y))}{\lambda_i(y) - \lambda_j(y)}, & \lambda_i(y)\ne\lambda_j(y)\\ p''(\lambda_i(y)), & \lambda_i(y) = \lambda_j(y)\end{cases}
\end{array}
\]
We conclude from $(a,b)$ that as $k\to\infty$, the real-valued polynomials $F_k(\cdot) = {\rm Tr}(f_k(\cdot))$ on $\mathbf{S}^N$ converge, along with their first order derivatives, uniformly on every bounded subset of $\mathbf{S}^N$, and the limit of the sequence, by $(a)$, is exactly $\chi(\cdot)$. Thus, $\chi(\cdot)$ is continuously differentiable, and $(b)$ says that
\[
D\chi(y)[h] = \sum_{j=1}^Nf'(\lambda_j(y))\hat h_{jj}.\tag{5.6.16}
\]
Besides this, $(a$-$c)$ say that if $U$ is a closed convex set in $\mathbf{S}^N$ which does not contain singular matrices, then the $F_k(\cdot)$, as $k\to\infty$, converge along with the first and the second derivative uniformly on every compact subset of $U$, so that $\chi(\cdot)$ is twice continuously differentiable on $U$, and at every point $y\in U$ we have
\[
D^2\chi(y)[h,h] = \sum_{i,j=1}^N\hat h_{ij}^2\Gamma_{ij},\quad \Gamma_{ij} = \begin{cases}\frac{f'(\lambda_i(y)) - f'(\lambda_j(y))}{\lambda_i(y) - \lambda_j(y)}, & \lambda_i(y)\ne\lambda_j(y)\\ f''(\lambda_i(y)), & \lambda_i(y) = \lambda_j(y)\end{cases}\tag{5.6.17}
\]
\[
\langle\chi'(y') - \chi'(y''), y' - y''\rangle \ge 0\tag{$*$}
\]
for a dense in $\mathbf{S}^N\times\mathbf{S}^N$ set of pairs $(y', y'')$, e.g., those with nonsingular $y' - y''$. For a pair of the latter type, the polynomial $q(t) = {\rm Det}(y' + t(y'' - y'))$ of $t\in\mathbf{R}$ is not identically zero and thus has finitely many roots on $[0,1]$. In other words, we can find finitely many points $t_0 = 0 < t_1 < ... < t_n = 1$ such that all "matrix intervals" $\Delta_i = (y_i, y_{i+1})$, $y_k = y' + t_k(y'' - y')$, $1\le i\le n-1$, are comprised of nonsingular matrices. Therefore $\chi$ is convex on every closed segment contained in one of the $\Delta_i$'s, and since $\chi$ is continuously differentiable, $(*)$ follows.
$4^0$. Now let us prove that with properly defined $\alpha_\epsilon > 0$ one has
Let $\epsilon > 0$, and let $Y_\epsilon$ be a convex, open in $Y = \{y: |y|_1 \le 1\}$, neighbourhood of $Y_F$ such that for all $y\in Y_\epsilon$ at most $M$ eigenvalues of $y$ are of magnitude $> \epsilon$. We intend to prove that for some $\alpha_\epsilon > 0$ one has
\[
\langle\chi'(y') - \chi'(y''), y' - y''\rangle \ge \alpha_\epsilon|y' - y''|_1^2\quad\forall y', y''\in Y_\epsilon.\tag{5.6.19}
\]
Same as above, it suffices to verify this relation for a dense in $Y_\epsilon\times Y_\epsilon$ set of pairs $y', y''\in Y_\epsilon$, e.g., for those pairs $y', y''\in Y_\epsilon$ for which $y' - y''$ is nonsingular. Defining matrix intervals $\Delta_i$ as above and taking into account the continuous differentiability of $\chi$, it suffices to verify that if $y\in\Delta_i$ and $h = y' - y''$, then $D^2\chi(y)[h,h] \ge \alpha_\epsilon|h|_1^2$. To this end observe that by (5.6.17) all we have to prove is that
\[
D^2\chi(y)[h,h] = \sum_{i,j=1}^N\hat h_{ij}^2\Gamma_{ij} \ge \alpha_\epsilon|h|_1^2.\tag{\#}
\]
$5^0$. Setting $\lambda_j = \lambda_j(y)$, observe that $\lambda_i \ne 0$ for all $i$ due to the origin of $y$. We claim that if $|\lambda_i| \ge |\lambda_j|$, then $\Gamma_{ij} \ge q|\lambda_i|^{q-1}$. Indeed, the latter relation definitely holds true when $\lambda_i = \lambda_j$. Now, if $\lambda_i$ and $\lambda_j$ are of the same sign, then $\Gamma_{ij} = \frac{|\lambda_i|^q - |\lambda_j|^q}{|\lambda_i| - |\lambda_j|} \ge q|\lambda_i|^{q-1}$, since the derivative of the concave (recall that $0 < q \le 1$) function $t^q$ of $t > 0$ is positive and nonincreasing. If $\lambda_i$ and $\lambda_j$ are of different signs, then $\Gamma_{ij} = \frac{|\lambda_i|^q + |\lambda_j|^q}{|\lambda_i| + |\lambda_j|} \ge |\lambda_i|^{q-1}$ due to $|\lambda_j|^q \ge |\lambda_j||\lambda_i|^{q-1}$, and therefore $\Gamma_{ij} \ge q|\lambda_i|^{q-1}$. Thus, our claim is justified.
W.l.o.g. we can assume that the positive reals $\mu_i = |\lambda_i|$, $i = 1,...,N$, form a nondecreasing sequence, so that, by the above, $\Gamma_{ij} \ge q\mu_j^{q-1}$ when $i \le j$. Besides this, at most $M$ of the $\mu_j$ are $\ge \epsilon$, since $y', y''\in Y_\epsilon$ and therefore $y\in Y_\epsilon$ by convexity of $Y_\epsilon$. By the above,
\[
D^2\chi(y)[h,h] \ge 2q\sum_{i<j\le N}\hat h_{ij}^2\mu_j^{q-1} + q\sum_{j=1}^N\hat h_{jj}^2\mu_j^{q-1},
\]
Observe that the image space $F$ of $A$ is a linear subspace of $\mathbf{S}^N$, and that the eigenvalues of $y = Ax$ are the $2m$ reals $\pm\sigma_i(x)/2$, $1\le i\le m$, and $N - M$ zeros, so that $\|x\|_{\rm nuc} \equiv |Ax|_1$ and $M$, $F$ satisfy the premise of Lemma 5.6.3. Setting
\[
\omega(x) = \hat\omega(Ax) = \frac{4\sqrt{e}\ln(2m)}{2^q(1+q)}\sum_{i=1}^m\sigma_i^{1+q}(x),\quad q = \frac{1}{2\ln(2m)},
\]
and invoking Lemma 5.6.3, we see that $\omega$ is a convex continuously differentiable function on $E$ which, due to the identity $\|x\|_{\rm nuc} \equiv |Ax|_1$, is strongly convex, modulus 1, w.r.t. $\|\cdot\|_{\rm nuc}$, on the $\|\cdot\|_{\rm nuc}$-unit ball $Z$ of $E$. This function clearly satisfies (5.6.12). $\square$
[1] Ben-Tal, A., and Nemirovski, A., “Robust Solutions of Uncertain Linear Programs” – OR
Letters v. 25 (1999), 1–13.
[2] Ben-Tal, A., Margalit, T., and Nemirovski, A., “The Ordered Subsets Mirror Descent Op-
timization Method with Applications to Tomography” – SIAM Journal on Optimization
v. 12 (2001), 79–108.
[3] Ben-Tal, A., and Nemirovski, A., Lectures on Modern Convex Optimization: Analy-
sis, Algorithms, Engineering Applications – MPS-SIAM Series on Optimization, SIAM,
Philadelphia, 2001.
[4] Ben-Tal, A., Nemirovski, A., and Roos, C. (2001), “Robust solutions of uncertain
quadratic and conic-quadratic problems” – to appear in SIAM J. on Optimization.
[5] Ben-Tal, A., El Ghaoui, L., Nemirovski, A., Robust Optimization – Princeton University Press, 2009.
[6] Bertsimas, D., Popescu, I., Sethuraman, J., “Moment problems and semidefinite program-
ming.” — In H. Wolkovitz, & R. Saigal (Eds.), Handbook on Semidefinite Programming
(pp. 469-509). Kluwer Academic Publishers, 2000.
[7] Bertsimas, D., Popescu, I., “Optimal inequalities in probability theory: a convex opti-
mization approach.” — SIAM Journal on Optimization 15:3 (2005), 780-804.
[8] Boyd, S., El Ghaoui, L., Feron, E., and Balakrishnan, V., Linear Matrix Inequalities
in System and Control Theory – volume 15 of Studies in Applied Mathematics, SIAM,
Philadelphia, 1994.
[9] Vandenberghe, L., Boyd, S., El Gamal, A., "Optimizing dominant time constant in RC circuits" – IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems v. 17 (1998), 110–125.
[10] Dhaene, J., Denuit, M., Goovaerts, M.J., Kaas, R., Vyncke, D., “The concept of comono-
tonicity in actuarial science and finance: theory” — Insurance: Mathematics and Eco-
nomics, 31 (2002), 3-33.
[11] Frank, M.; Wolfe, P. “An algorithm for quadratic programming” – Naval Research Logis-
tics Quarterly 3:1-2 (1956), 95–110. doi:10.1002/nav.3800030109
[12] Goemans, M.X., and Williamson, D.P., “Improved approximation algorithms for Maxi-
mum Cut and Satisfiability problems using semidefinite programming” – Journal of ACM
42 (1995), 1115–1145.
[13] Grigoriadis, M.D., Khachiyan, L.G. A sublinear time randomized approximation algo-
rithm for matrix games. – OR Letters 18 (1995), 53-58.
[14] Grötschel, M., Lovász, L., and Schrijver, A., The Ellipsoid Method and Combinatorial Optimization, Springer, Heidelberg, 1988.
[15] Harchaoui, Z., Juditsky, A., Nemirovski, A. “Conditional Gradient Algorithms for
Norm-Regularized Smooth Convex Optimization” – Mathematical Programming 152:1-2
(2015), 75–112.
[16] Juditsky, A., Nemirovski, A., "On verifiable sufficient conditions for sparse signal recovery via $\ell_1$ minimization" – Mathematical Programming Series B, 127 (2011), 57–88.
[17] Kiwiel, K., "Proximal level bundle method for convex nondifferentiable optimization, saddle point problems and variational inequalities", Mathematical Programming Series B v. 69 (1995), 89–109.
[18] Lemarechal, C., Nemirovski, A., and Nesterov, Yu., “New variants of bundle methods”,
Mathematical Programming Series B v. 69 (1995), 111-148.
[19] Lobo, M.S., Vandenberghe, L., Boyd, S., and Lebret, H., "Second-Order Cone Programming" – Linear Algebra and Applications v. 284 (1998), 193–228.
[21] Nemirovski, A. “Prox-Method with Rate of Convergence O(1/t) for Variational Inequali-
ties with Lipschitz Continuous Monotone Operators and Smooth Convex-Concave Saddle
Point Problems” – SIAM Journal on Optimization v. 15 No. 1 (2004), 229-251.
[24] Nesterov, Yu., and Nemirovski, A. Interior point polynomial time methods in Convex
Programming, SIAM, Philadelphia, 1994.
[25] Nesterov, Yu., "A method of solving a convex programming problem with convergence rate $O(1/k^2)$" – Doklady Akademii Nauk SSSR v. 269 No. 3 (1983) (in Russian; English translation: Soviet Math. Dokl. 27:2 (1983), 372–376).
[26] Nesterov, Yu. “Squared functional systems and optimization problems”, in: H. Frenk,
K. Roos, T. Terlaky, S. Zhang, Eds., High Performance Optimization, Kluwer Academic
Publishers, 2000, 405–440.
[27] Nesterov, Yu. “Semidefinite relaxation and non-convex quadratic optimization” – Opti-
mization Methods and Software v. 12 (1997), 1–20.
[28] Nesterov, Yu. "Nonconvex quadratic optimization via conic relaxation", in: R. Saigal, H. Wolkowicz, L. Vandenberghe, Eds. Handbook on Semidefinite Programming, Kluwer Academic Publishers, 2000, 363–387.
[29] Nesterov, Yu. “Smooth minimization of non-smooth functions” – CORE Discussion Paper,
2003, and Mathematical Programming Series A, v. 103 (2005), 127-152.
[30] Nesterov, Yu. Gradient Methods for Minimizing Composite Objective Function — CORE
Discussion Paper 2007/96 http://www.ecore.be/DPs/dp 1191313936.pdf and Mathe-
matical Programming 140:1 (2013), 125–161.
[31] Nesterov, Yu., Nemirovski, A. “On first order algorithms for `1 /nuclear norm minimiza-
tion” – Acta Numerica 22 (2013).
[32] Nesterov, Yu., “Universal gradient methods for convex optimization problems” – Mathe-
matical Programming 152 (2015), 381–404.
[33] Roos, C., Terlaky, T., and Vial, J.-P. Theory and Algorithms for Linear Optimization: An Interior Point Approach, J. Wiley & Sons, 1997.
[34] Wu S.-P., Boyd, S., and Vandenberghe, L. (1997), “FIR Filter Design via Spectral Fac-
torization and Convex Optimization” – Biswa Datta, Ed., Applied and Computational
Control, Signal and Circuits, Birkhauser, 1997, 51–81.
[35] Ye, Y. Interior Point Algorithms: Theory and Analysis, J. Wiley & Sons, 1997.
Appendix A
Prerequisites from Linear Algebra and Analysis
Regarded as mathematical entities, the objective and the constraints in a Mathematical Programming problem are functions of several real variables; therefore, before entering Optimization Theory and Methods, we need to recall several basic notions and facts about the spaces $\mathbf{R}^n$ where these functions live, same as about the functions themselves. The reader is supposed to know most of the facts to follow, so he/she should not be surprised by the "cooking book" style which we intend to use in this Lecture.
A.1.1 A point in Rn
A point in Rn (called also an n-dimensional vector) is an ordered collection x = (x1 , ..., xn ) of
n reals, called the coordinates, or components, or entries of vector x; the space Rn itself is the
set of all collections of this type.
As far as addition and multiplication by reals are concerned, the arithmetic of $\mathbf{R}^n$ inherits most of the common rules of Real Arithmetic, like $x + y = y + x$, $(x+y)+z = x+(y+z)$, $(\lambda+\mu)(x+y) = \lambda x + \mu x + \lambda y + \mu y$, $\lambda(\mu x) = (\lambda\mu)x$, etc.
A.1.3.B. Sums and intersections of linear subspaces. Let $\{L_\alpha\}_{\alpha\in I}$ be a family (finite or infinite) of linear subspaces of $\mathbf{R}^n$. From this family, one can build two sets:
1. The sum $\sum_\alpha L_\alpha$ of the subspaces $L_\alpha$, which consists of all vectors which can be represented as finite sums of vectors taken each from its own subspace of the family;
2. The intersection $\bigcap_\alpha L_\alpha$ of the subspaces from the family.
2. Every vector from $\mathbf{R}^n$ is a linear combination of vectors from the collection (i.e., ${\rm Lin}\{f^1,...,f^m\} = \mathbf{R}^n$).
Besides the bases of the entire $\mathbf{R}^n$, one can speak about the bases of linear subspaces:
A collection $\{f^1,...,f^m\}$ of vectors is called a basis of a linear subspace $L$, if
2. $L = {\rm Lin}\{f^1,...,f^m\}$, i.e., all vectors $f^i$ belong to $L$, and every vector from $L$ is a linear combination of the vectors $f^1,...,f^m$.
In order to avoid trivial remarks, it makes sense to agree once and for ever that
With this convention, the trivial linear subspace $L = \{0\}$ also has a basis, specifically, an empty set of vectors.
Theorem A.1.2 (i) Let L be a linear subspace of Rn . Then L admits a (finite) basis, and all
bases of L are comprised of the same number of vectors; this number is called the dimension of
L and is denoted by dim (L).
We have seen that Rn admits a basis comprised of n elements (the standard basic
orths). From (i) it follows that every basis of Rn contains exactly n vectors, and the
dimension of Rn is n.
(ii) The larger a linear subspace of $\mathbf{R}^n$, the larger its dimension: if $L\subset L'$ are linear subspaces of $\mathbf{R}^n$, then $\dim(L) \le \dim(L')$, and the equality takes place if and only if $L = L'$.
We have seen that the dimension of Rn is n; according to the above convention, the
trivial linear subspace {0} of Rn admits an empty basis, so that its dimension is 0.
Since {0} ⊂ L ⊂ Rn for every linear subspace L of Rn , it follows from (ii) that the
dimension of a linear subspace in Rn is an integer between 0 and n.
(iii) Let L be a linear subspace in Rn . Then
(iii.1) Every linearly independent subset of vectors from L can be extended to a basis of L;
(iii.2) From every spanning subset X for L – i.e., a set X such that Lin(X) = L – one
can extract a basis of L.
It follows from (iii) that
– every linearly independent subset of L contains at most dim (L) vectors, and if it
contains exactly dim (L) vectors, it is a basis of L;
– every spanning set for L contains at least dim (L) vectors, and if it contains exactly
dim (L) vectors, it is a basis of L.
(iv) Let $L$ be a linear subspace in $\mathbf{R}^n$, and $f^1,...,f^m$ be a basis in $L$. Then every vector $x\in L$ admits exactly one representation
\[
x = \sum_{i=1}^m\lambda_i(x)f^i
\]
as a linear combination of $f^1,...,f^m$, and the mapping $x \mapsto (\lambda_1(x),...,\lambda_m(x))$ is a one-to-one mapping of $L$ onto $\mathbf{R}^m$ which is linear, i.e., for every $i = 1,...,m$ one has
\[
\begin{array}{l}
\lambda_i(x + y) = \lambda_i(x) + \lambda_i(y)\quad\forall(x, y\in L);\\
\lambda_i(\nu x) = \nu\lambda_i(x)\quad\forall(x\in L,\ \nu\in\mathbf{R}).
\end{array}\tag{A.1.2}
\]
The reals $\lambda_i(x)$, $i = 1,...,m$, are called the coordinates of $x\in L$ in the basis $f^1,...,f^m$.
E.g., the coordinates of a vector $x\in\mathbf{R}^n$ in the standard basis $e_1,...,e_n$ of $\mathbf{R}^n$ – the one comprised of the standard basic orths – are exactly the entries of $x$.
(v) [Dimension formula] Let $L_1$, $L_2$ be linear subspaces of $\mathbf{R}^n$. Then
\[
\dim(L_1\cap L_2) + \dim(L_1 + L_2) = \dim(L_1) + \dim(L_2).
\]
Important convention. When speaking about adding $n$-dimensional vectors and multiplying them by reals, it is absolutely unimportant whether we treat the vectors as column vectors, or row vectors, or write down the entries in rectangular tables, or something else. However, when matrix operations (matrix-vector multiplication, transposition, etc.) become involved, it is important whether we treat our vectors as columns, as rows, or as something else. For the sake of definiteness, from now on we treat all vectors as column ones, independently of how we refer to them in the text. For example, when saying for the first time what a vector is, we wrote $x = (x_1,...,x_n)$, which might suggest that we were speaking about row vectors. We stress that this is not the case, and the only reason for using the notation $x = (x_1,...,x_n)$ instead of the "correct" one $x = [x_1;...;x_n]$ is to save space and to avoid ugly formulas like $f([x_1;...;x_n])$ when speaking about functions with vector arguments. After we have agreed that there is no such thing as a row vector in this Lecture course, we can use (and do use) without any harm whatever notation we want.
Exercise A.1 1. Mark in the list below those subsets of $\mathbf{R}^n$ which are linear subspaces, find out their dimensions and point out their bases:
(a) $\mathbf{R}^n$
(b) $\{0\}$
(c) $\emptyset$
(d) $\{x\in\mathbf{R}^n: \sum_{i=1}^nix_i = 0\}$
(e) $\{x\in\mathbf{R}^n: \sum_{i=1}^nix_i^2 = 0\}$
(f) $\{x\in\mathbf{R}^n: \sum_{i=1}^nix_i = 1\}$
(g) $\{x\in\mathbf{R}^n: \sum_{i=1}^nix_i^2 = 1\}$
3. Consider the space $\mathbf{R}^{m\times n}$ of $m\times n$ matrices with real entries. As far as linear operations – addition of matrices and multiplication of matrices by reals – are concerned, this space can be treated as a certain $\mathbf{R}^N$.
(a) Find the dimension of $\mathbf{R}^{m\times n}$ and point out a basis in this space
(b) In the space $\mathbf{R}^{n\times n}$ of square $n\times n$ matrices, there are two interesting subsets: the set $\mathbf{S}^n$ of symmetric matrices $\{A = [A_{ij}]: A_{ij} = A_{ji}\}$ and the set $\mathbf{J}^n$ of skew-symmetric matrices $\{A = [A_{ij}]: A_{ij} = -A_{ji}\}$.
i. Verify that both $\mathbf{S}^n$ and $\mathbf{J}^n$ are linear subspaces of $\mathbf{R}^{n\times n}$
ii. Find the dimension and point out a basis in $\mathbf{S}^n$
iii. Find the dimension and point out a basis in $\mathbf{J}^n$
iv. What is the sum of $\mathbf{S}^n$ and $\mathbf{J}^n$? What is the intersection of $\mathbf{S}^n$ and $\mathbf{J}^n$?
3. [positive definiteness]: The quantity ⟨x, x⟩ always is nonnegative, and it is zero if and only
if x is zero.

Remark A.2.1 The outlined 3 properties – bi-linearity, symmetry and positive definiteness –
form the definition of a Euclidean inner product, and there are infinitely many ways to
satisfy these properties; in other words, there are infinitely many different
Euclidean inner products on Rn . The standard inner product ⟨x, y⟩ = xT y is just a particular
case of this general notion. Although in the sequel we normally work with the standard inner
product, the reader should remember that the facts we are about to recall are valid for all
Euclidean inner products, and not only for the standard one.
The notion of an inner product underlies a number of purely algebraic constructions, in partic-
ular, those of inner product representation of linear forms and of orthogonal complement.
Every linear form f (x) on Rn can be represented, in exactly one way, as

    f (x) = ⟨f, x⟩  ∀x

for a certain vector f ∈ Rn , and, vice versa, every vector f ∈ Rn produces, via this formula,
a linear form on Rn ;
(iii) The above one-to-one correspondence between the linear forms and vectors on Rn is
linear: adding linear forms (or multiplying a linear form by a real), we add (respectively, multiply
by the real) the vector(s) representing the form(s).
The orthogonal complement of a linear subspace L in Rn is defined as

    L⊥ = {f : ⟨f, x⟩ = 0 ∀x ∈ L}.
Theorem A.2.2 (i) Twice taken, orthogonal complement recovers the original subspace: when-
ever L is a linear subspace of Rn , one has
(L⊥ )⊥ = L;
(ii) The larger is a linear subspace L, the smaller is its orthogonal complement: if L1 ⊂ L2
are linear subspaces of Rn , then L1⊥ ⊃ L2⊥ ;
(iii) The intersection of a subspace and its orthogonal complement is trivial, and the sum of
these subspaces is the entire Rn :
L ∩ L⊥ = {0}, L + L⊥ = Rn .
Remark A.2.2 From Theorem A.2.2.(iii) and the Dimension formula (Theorem A.1.2.(v)) it
follows, first, that for every subspace L in Rn one has

    dim(L⊥ ) = n − dim(L);

second, that every vector x ∈ Rn admits a unique decomposition

    x = xL + xL⊥

into a sum of two vectors: the first of them, xL , belongs to L, and the second, xL⊥ , belongs
to L⊥ . This decomposition is called the orthogonal decomposition of x taken with respect to
L, L⊥ ; xL is called the orthogonal projection of x onto L, and xL⊥ – the orthogonal projection
of x onto the orthogonal complement of L. Both projections depend on x linearly, for example,
(x + y)L = xL + yL .

A collection of vectors f 1 , ..., f m is called orthonormal, if distinct vectors from the collection
are orthogonal to each other, and the inner product of every vector with itself is 1:

    i ≠ j ⇒ ⟨f i , f j ⟩ = 0;    ⟨f i , f i ⟩ = 1, i = 1, ..., m.
Theorem A.2.3 (i) An orthonormal collection f 1 , ..., f m always is linearly independent and is
therefore a basis of its linear span L = Lin(f 1 , ..., f m ) (such a basis in a linear subspace is called
orthonormal). The coordinates of a vector x ∈ L w.r.t. an orthonormal basis f 1 , ..., f m of L are
given by explicit formulas:
    x = Σ_{i=1}^m λi (x) f i  ⇔  λi (x) = ⟨x, f i ⟩.
Example of an orthonormal basis in Rn : The standard basis {e1 , ..., en } is orthonormal with
respect to the standard inner product ⟨x, y⟩ = xT y on Rn (but is not orthonormal w.r.t.
other Euclidean inner products on Rn ).
Proof of (i): Taking inner product of both sides in the equality

    x = Σ_j λj (x) f j

with f i , we get

    ⟨x, f i ⟩ = ⟨ Σ_j λj (x) f j , f i ⟩
             = Σ_j λj (x) ⟨f j , f i ⟩    [bilinearity of inner product]
             = λi (x)    [orthonormality of {f i }]
(ii) If f 1 , ..., f m is an orthonormal basis in a linear subspace L, then the inner product of
two vectors x, y ∈ L in the coordinates λi (·) w.r.t. this basis is given by the standard formula
    ⟨x, y⟩ = Σ_{i=1}^m λi (x) λi (y).
Proof:

    x = Σ_i λi (x) f i ,   y = Σ_i λi (y) f i
    ⇒ ⟨x, y⟩ = ⟨ Σ_i λi (x) f i , Σ_i λi (y) f i ⟩
             = Σ_{i,j} λi (x) λj (y) ⟨f i , f j ⟩    [bilinearity of inner product]
             = Σ_i λi (x) λi (y)    [orthonormality of {f i }]
(iii) Every linear subspace L of Rn admits an orthonormal basis; moreover, every orthonor-
mal system f 1 , ..., f m of vectors from L can be extended to an orthonormal basis in L.
Important corollary: All Euclidean spaces of the same dimension are “the same”. Specifically,
if L is an m-dimensional subspace in a space Rn equipped with a Euclidean inner product
⟨·, ·⟩, then there exists a one-to-one mapping x ↦ A(x) of L onto Rm such that
• The mapping converts the ⟨·, ·⟩ inner product on L into the standard inner product on
Rm :

    ⟨x, y⟩ = (A(x))T A(y)  ∀x, y ∈ L.
Indeed, by (iii) L admits an orthonormal basis f 1 , ..., f m ; using (ii), one can immediately
check that the mapping
    x ↦ A(x) = (λ1 (x), ..., λm (x)),
which maps x ∈ L into the m-dimensional vector comprised of the coordinates of x in the
basis f 1 , ..., f m , meets all the requirements.
Proof of (iii) is given by the Gram-Schmidt orthogonalization process, important in its own
right, as follows. We start with an arbitrary basis h1 , ..., hm in L and step by step convert it into
an orthonormal basis f 1 , ..., f m . At the beginning of a step t of the construction, we already
have an orthonormal collection f 1 , ..., f t−1 such that Lin{f 1 , ..., f t−1 } = Lin{h1 , ..., ht−1 }.
At a step t we set

    g t = ht − Σ_{j<t} ⟨ht , f j ⟩ f j ,    f t = g t / ⟨g t , g t ⟩^{1/2}

(g t ≠ 0, since h1 , ..., ht are linearly independent); by construction, f t is a unit vector orthogonal
to f 1 , ..., f t−1 , and Lin{f 1 , ..., f t } = Lin{h1 , ..., ht }.
After m steps of the orthogonalization process, we end up with an orthonormal system f 1 , ..., f m
of vectors from L such that Lin{f 1 , ..., f m } = L, i.e., with an orthonormal basis of L.
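For the computationally minded reader, here is a minimal sketch of the Gram-Schmidt process in Python/NumPy (the input basis h1 , ..., hm is stored as the columns of H; the particular H below is an arbitrary illustration):

import numpy as np

def gram_schmidt(H):
    # orthonormalize the columns of H (assumed linearly independent)
    F = np.zeros_like(H, dtype=float)
    for t in range(H.shape[1]):
        # subtract the projections onto the already built f1, ..., f_{t-1}
        g = H[:, t] - F[:, :t] @ (F[:, :t].T @ H[:, t])
        F[:, t] = g / np.linalg.norm(g)
    return F

H = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
F = gram_schmidt(H)
print(np.round(F.T @ F, 10))  # identity matrix: the columns are orthonormal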
Exercise A.2 1. What is the orthogonal complement (w.r.t. the standard inner product) of
the subspace {x ∈ Rn : Σ_{i=1}^n xi = 0} in Rn ?
2. Find an orthonormal basis (w.r.t. the standard inner product) in the linear subspace {x ∈
Rn : x1 = 0} of Rn
    (L1 + L2 )⊥ = L1⊥ ∩ L2⊥ ;    (L1 ∩ L2 )⊥ = L1⊥ + L2⊥ .
5. Consider the space of m × n matrices Rm×n , and let us equip it with the “standard inner
product” (called the Frobenius inner product)
    ⟨A, B⟩ = Σ_{i,j} Aij Bij
(a) Verify that in terms of matrix multiplication the Frobenius inner product can be written as

    ⟨A, B⟩ = Tr(AB T ),
where Tr(C) is the trace (the sum of diagonal elements) of a square matrix C.
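A quick numerical sanity check of this identity (a sketch; the random matrices are used only as an illustration):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((2, 3))

entrywise = np.sum(A * B)                          # sum of A_ij * B_ij
print(np.isclose(entrywise, np.trace(A @ B.T)))    # True: <A,B> = Tr(A B^T)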
(b) Build an orthonormal basis in the linear subspace Sn of symmetric n × n matrices
(c) What is the orthogonal complement of the subspace Sn of symmetric n × n matrices
in the space Rn×n of square n × n matrices?
" #
1 2
(d) Find the orthogonal decomposition, w.r.t. S2 , of the matrix
3 4
Definition A.3.1 [Affine subspace] An affine subspace (a plane) in Rn is a set of the form
M = a + L = {y = a + x | x ∈ L}, (A.3.1)
442 APPENDIX A. PREREQUISITES FROM LINEAR ALGEBRA AND ANALYSIS
E.g., shifting the linear subspace L comprised of vectors with zero first entry by a vector a =
(a1 , ..., an ), we get the set M = a + L of all vectors x with x1 = a1 ; according to our terminology,
this is an affine subspace.
An immediate question about the notion of an affine subspace is: what are the “degrees of
freedom” in decomposition (A.3.1) – how strictly does M determine a and L? The answer is as
follows:

Proposition A.3.1 The linear subspace L in decomposition (A.3.1) is uniquely defined by M :

    L = M − M = {x − y | x, y ∈ M }.    (A.3.2)
The shifting vector a is not uniquely defined by M and can be chosen as an arbitrary vector from
M.
Corollary A.3.1 Let {Mα } be an arbitrary family of affine subspaces in Rn , and assume that
the set M = ∩α Mα is nonempty. Then M is an affine subspace.
From Corollary A.3.1 it immediately follows that for every nonempty subset Y of Rn there exists
the smallest affine subspace containing Y – the intersection of all affine subspaces containing Y .
This smallest affine subspace containing Y is called the affine hull of Y (notation: Aff(Y )).
All this strongly resembles the story about linear spans. Can we further extend this analogy
and get a description of the affine hull Aff(Y ) in terms of elements of Y similar to the one
of the linear span (“the linear span of X is the set of all linear combinations of vectors from X”)?
Sure we can!
Let us choose somehow a point y0 ∈ Y , and consider the set
X = Y − y0 .
All affine subspaces containing Y should contain also y0 and therefore, by Proposition A.3.1,
can be represented as M = y0 + L, L being a linear subspace. It is absolutely evident that an
affine subspace M = y0 + L contains Y if and only if the subspace L contains X, and that the
larger is L, the larger is M :
    L ⊂ L′ ⇒ M = y0 + L ⊂ M ′ = y0 + L′ .
Thus, to find the smallest among affine subspaces containing Y , it suffices to find the smallest
among the linear subspaces containing X and to translate the latter space by y0 :

    Aff(Y ) = y0 + Lin(X) = y0 + Lin(Y − y0 ).

Now, we know what Lin(Y − y0 ) is – this is the set of all linear combinations of vectors from
Y − y0 , so that a generic element of Lin(Y − y0 ) is
    x = Σ_{i=1}^k µi (yi − y0 )    [k may depend on x]
with yi ∈ Y and real coefficients µi . It follows that the generic element of Aff(Y ) is
    y = y0 + Σ_{i=1}^k µi (yi − y0 ) = Σ_{i=0}^k λi yi ,

where

    λ0 = 1 − Σ_i µi ,   λi = µi , i ≥ 1.
We see that a generic element of Aff(Y ) is a linear combination of vectors from Y . Note, however,
that the coefficients λi in this combination are not completely arbitrary: their sum is equal to
1. Linear combinations of this type – with the unit sum of coefficients – have a special name –
they are called affine combinations.
We have seen that every vector from Aff(Y ) is an affine combination of vectors of Y . Is the
converse true, i.e., does Aff(Y ) contain all affine combinations of vectors from Y ? The
answer is positive. Indeed, if
    y = Σ_{i=1}^k λi yi

is an affine combination of vectors from Y , then, using the equality Σ_i λi = 1, we can write it
also as

    y = y0 + Σ_{i=1}^k λi (yi − y0 ),
y0 being the “marked” vector we used in our previous reasoning, and the vector of this form, as
we already know, belongs to Aff(Y ). Thus, we come to the following

Proposition [Structure of affine hull]: Aff(Y ) is exactly the set of all affine combinations of
vectors from Y .

When Y itself is an affine subspace, it, of course, coincides with its affine hull, and the above
Proposition leads to the following
Corollary A.3.2 An affine subspace M is closed with respect to taking affine combinations of
its members – every combination of this type is a vector from M . Vice versa, a nonempty set
which is closed with respect to taking affine combinations of its members is an affine subspace.
Affinely independent sets. A linearly independent set x1 , ..., xk is a set such that no nontrivial
linear combination of x1 , ..., xk equals zero. An equivalent definition is given by Theorem
A.1.2.(iv): x1 , ..., xk are linearly independent, if the coefficients in a linear combination
    x = Σ_{i=1}^k λi xi
are uniquely defined by the value x of the combination. This equivalent form reflects the essence
of the matter – what we indeed need, is the uniqueness of the coefficients in expansions. Ac-
cordingly, this equivalent form is the prototype for the notion of an affinely independent set: we
want to introduce this notion in such a way that the coefficients λi in an affine combination
    y = Σ_{i=0}^k λi yi

of the vectors y0 , ..., yk are uniquely defined by the value y of the combination. Non-uniqueness
would mean that the same y can be represented as above for two different collections of
coefficients λi and λ′i with unit sums of coefficients; if it is the case, then
    Σ_{i=0}^k (λi − λ′i ) yi = 0,

so that the yi ’s are linearly dependent and, moreover, there exists a nontrivial linear combination
of them which is zero and has zero sum of coefficients (since Σ_i (λi − λ′i ) = Σ_i λi − Σ_i λ′i =
1 − 1 = 0). Our reasoning can be inverted – if there exists a nontrivial linear combination of the
yi ’s with zero sum of coefficients which is zero, then the coefficients in the representation of a
vector as an affine combination of the yi ’s are not uniquely defined. Thus, in order to get
uniqueness we should for sure forbid relations

    Σ_{i=0}^k µi yi = 0

with nontrivial (not all zero) collections of coefficients µi with zero sum. Thus, we have motivated
the following
Definition [Affinely independent set]: A collection y0 , ..., yk of n-dimensional vectors is called
affinely independent, if no nontrivial linear combination of the vectors with zero sum of
coefficients is zero.

With this definition, we get a result completely similar to the one of Theorem A.1.2.(iv):
Corollary A.3.3 Let y0 , ..., yk be affinely independent. Then the coefficients λi in an affine
combination
    y = Σ_{i=0}^k λi yi    [Σ_i λi = 1]
of the vectors y0 , ..., yk are uniquely defined by the value y of the combination.
Proposition A.3.4 k + 1 vectors y0 , ..., yk are affinely independent if and only if the k vectors
(y1 − y0 ), (y2 − y0 ), ..., (yk − y0 ) are linearly independent.
From the latter Proposition it follows, e.g., that the collection 0, e1 , ..., en comprised of the origin
and the standard basic orths is affinely independent. Note that this collection is linearly depen-
dent (as every collection containing zero). You should definitely know the difference between
the two notions of independence we deal with: linear independence means that no nontrivial
linear combination of the vectors can be zero, while affine independence means that no nontrivial
linear combination from certain restricted class of them (with zero sum of coefficients) can be
zero. Therefore, there are more affinely independent sets than the linearly independent ones: a
linearly independent set is for sure affinely independent, but not vice versa.
Affine bases and affine dimension. Propositions A.3.2 and A.3.3 reduce the notions of
affine spanning/affinely independent sets to the notions of spanning/linearly independent ones.
Combined with Theorem A.1.2, they result in the following analogues of the latter two statements:
Barycentric coordinates. Let M be an affine subspace, and let y0 , ..., yk be an affine basis
of M . Since the basis, by definition, affinely spans M , every vector y from M is an affine
combination of the vectors of the basis:
    y = Σ_{i=0}^k λi yi    [Σ_{i=0}^k λi = 1],
and since the vectors of the affine basis are affinely independent, the coefficients of this com-
bination are uniquely defined by y (Corollary A.3.3). These coefficients are called barycentric
coordinates of y with respect to the affine basis in question. In contrast to the usual coordinates
with respect to a (linear) basis, the barycentric coordinates cannot be quite arbitrary: their
sum must be equal to 1.
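Barycentric coordinates can be computed by solving the linear system which augments the equations Σ_i λi yi = y with the normalization Σ_i λi = 1; here is a small sketch in Python/NumPy (the affine basis 0, e1 , e2 of R2 is an arbitrary choice for the example):

import numpy as np

# an affine basis of R^2: the origin and the two standard basic orths (columns)
Y = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
y = np.array([0.25, 0.5])

# solve: sum_i lam_i y_i = y together with sum_i lam_i = 1
A = np.vstack([Y, np.ones(3)])
lam = np.linalg.solve(A, np.append(y, 1.0))
print(lam)        # [0.25 0.25 0.5 ] -- barycentric coordinates of y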
Comment. The “outer” description of a linear subspace/affine subspace – the “artist’s” one
– is in many cases much more useful than the “inner” description via linear/affine combinations
(the “worker’s” one). E.g., with the outer description it is very easy to check whether a given
vector belongs or does not belong to a given linear subspace/affine subspace, which is not that
easy with the inner one5) . In fact both descriptions are “complementary” to each other and
perfectly well work in parallel: what is difficult to see with one of them, is clear with another.
The idea of using “inner” and “outer” descriptions of the entities we deal with – linear subspaces,
affine subspaces, convex sets, optimization problems – the general idea of duality – is, I would
say, the main driving force of Convex Analysis and Optimization, and in the sequel we will
meet various implementations of this fundamental idea all the time.
• Subspaces of dimension 0 are translations of the only 0-dimensional linear subspace {0},
i.e., are singleton sets – vectors from Rn . These subspaces are called points; a point is a
solution to a square system of linear equations with nonsingular matrix.
• Affine subspaces of dimension 1 are lines; the “inner” description of a line ℓ is

    ℓ = { y = (1 − t)y0 + ty1 : t ∈ R },

where y0 , y1 is an affine basis of ℓ; you can choose as such a basis any pair of distinct points
on the line.
The “outer” description of a line is as follows: it is the set of solutions to a linear system
with n variables and n − 1 linearly independent equations.
• Subspaces of dimension > 2 and < n − 1 have no special names; sometimes they are called
affine planes of such and such dimension.
• Affine subspaces of dimension n − 1, due to the important role they play in Convex Analysis,
have a special name – they are called hyperplanes. The outer description of a hyperplane
is that a hyperplane is the solution set of a single linear equation
aT x = b
5) In principle it is not difficult to certify that a given point belongs to, say, a linear subspace given as the linear
span of some set – it suffices to point out a representation of the point as a linear combination of vectors from
the set. But how could you certify that the point does not belong to the subspace?
with nontrivial left hand side (a ≠ 0). In other words, a hyperplane is the level set
a(x) = const of a nonconstant linear form a(x) = aT x.
• The “largest possible” affine subspace – the one of dimension n – is unique and is the
entire Rn . This subspace is given by an empty system of linear equations.
The quantity

    |x| = (xT x)^{1/2} = ( Σ_{i=1}^n xi2 )^{1/2}

is well-defined; this quantity is called the (standard) Euclidean norm of the vector x (or simply
the norm of x) and is treated as the distance from the origin to x. The distance between two
arbitrary points x, y ∈ Rn is, by definition, the norm |x − y| of the difference x − y. The notions
we have defined satisfy all basic requirements on the general notions of a norm and distance,
specifically:
1. Positivity of norm: The norm of a vector always is nonnegative; it is zero if and only if
the vector is zero:

    |x| ≥ 0 ∀x;   |x| = 0 ⇔ x = 0.
2. Homogeneity of norm: When a vector is multiplied by a real, its norm is multiplied by the
absolute value of the real:

    |λx| = |λ| · |x|  ∀(x ∈ Rn , λ ∈ R).

3. Triangle inequality: Norm of the sum of two vectors is ≤ the sum of their norms:

    |x + y| ≤ |x| + |y|  ∀(x, y ∈ Rn ).
In contrast to the properties of positivity and homogeneity, which are absolutely evident,
the Triangle inequality is not trivial and definitely requires a proof. The proof goes
through a fact which is extremely important by its own right – the Cauchy Inequality,
which perhaps is the most frequently used inequality in Mathematics:
Theorem A.4.1 [Cauchy’s Inequality] The absolute value of the inner product of two
vectors does not exceed the product of their norms:

    |xT y| ≤ |x| |y|,

and is equal to the product of the norms if and only if one of the vectors is proportional
to the other one:

    |xT y| = |x| |y|  ⇔  x = λy or y = λx for some real λ.
Proof is immediate: we may assume that both x and y are nonzero (otherwise the
Cauchy inequality clearly is an equality, and one of the vectors is constant times (specifically, zero times) the other one, as announced in the Theorem). Assuming x, y ≠ 0, consider
the function
the function
f (λ) = (x − λy)T (x − λy) = xT x − 2λxT y + λ2 y T y.
By positive definiteness of the inner product, this function – which is a second order
polynomial – is nonnegative on the entire axis, whence the discriminant of the polyno-
mial
(xT y)2 − (xT x)(y T y)
is nonpositive:
(xT y)2 ≤ (xT x)(y T y).
Taking square roots of both sides, we arrive at the Cauchy Inequality. We also see that
the inequality is equality if and only if the discriminant of the second order polynomial
f (λ) is zero, i.e., if and only if the polynomial has a (multiple) real root; but due to
positive definiteness of inner product, f (·) has a root λ if and only if x = λy, which
proves the second part of Theorem.
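A numerical illustration of the Cauchy Inequality and of its equality case (a sketch with arbitrary data):

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5)
y = rng.standard_normal(5)

print(abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y))   # True
# equality holds when one vector is proportional to the other:
print(np.isclose(abs((2 * y) @ y), np.linalg.norm(2 * y) * np.linalg.norm(y)))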
The properties of norm (i.e., of the distance to the origin) we have established induce properties
of the distances between pairs of arbitrary points in Rn , specifically:
1. Positivity of distances: The distance |x − y| between two points is positive, except for the
case when the points coincide (x = y), when the distance between x and y is zero;
2. Symmetry of distances: The distance from x to y is the same as the distance from y to x:
|x − y| = |y − x|;
3. Triangle inequality for distances: For every three points x, y, z, the distance from x to z
does not exceed the sum of distances between x and y and between y and z:
|z − x| ≤ |y − x| + |z − y| ∀(x, y, z ∈ Rn )
A.4.2 Convergence
Equipped with distances, we can define the fundamental notion of convergence of a sequence of
vectors. Specifically, we say that a sequence x1 , x2 , ... of vectors from Rn converges to a vector x̄,
or, equivalently, that x̄ is the limit of the sequence {xi } (notation: x̄ = lim_{i→∞} xi ), if the distances
from x̄ to xi go to 0 as i → ∞:

    x̄ = lim_{i→∞} xi  ⇔  |x̄ − xi | → 0, i → ∞,
or, which is the same, for every ε > 0 there exists i = i(ε) such that the distance between every
point xi , i ≥ i(ε), and x̄ does not exceed ε:

    { |x̄ − xi | → 0, i → ∞ }  ⇔  { ∀ε > 0 ∃i(ε) : i ≥ i(ε) ⇒ |x̄ − xi | ≤ ε }.
• A set X ⊂ Rn is called open, if whenever x belongs to X, all points close enough to x also
belong to X:
    ∀(x ∈ X) ∃(δ > 0) : |x′ − x| < δ ⇒ x′ ∈ X.
An open set containing a point x is called a neighbourhood of x.
Examples of closed sets: (1) Rn ; (2) ∅; (3) the sequence xi = (i, 0, ..., 0), i = 1, 2, 3, ...;
(4) {x ∈ Rn : Σ_{j=1}^n aij xj = 0, i = 1, ..., m} (in other words: a linear subspace in Rn
always is closed, see Proposition A.3.6); (5) {x ∈ Rn : Σ_{j=1}^n aij xj = bi , i = 1, ..., m}
(in other words: an affine subspace of Rn always is closed, see Proposition A.3.7); (6)
any finite subset of Rn .

Examples of non-closed sets: (1) Rn \{0}; (2) the sequence xi = (1/i, 0, ..., 0), i =
1, 2, 3, ...; (3) {x ∈ Rn : xj > 0, j = 1, ..., n}; (4) {x ∈ Rn : Σ_{j=1}^n xj > 5}.

Examples of open sets: (1) Rn ; (2) ∅; (3) {x ∈ Rn : Σ_{j=1}^n aij xj > bi , i = 1, ..., m};
(4) the complement of a finite set.

Examples of non-open sets: (1) a nonempty finite set; (2) the sequence xi =
(1/i, 0, ..., 0), i = 1, 2, 3, ..., and the sequence xi = (i, 0, 0, ..., 0), i = 1, 2, 3, ...; (3)
{x ∈ Rn : xj ≥ 0, j = 1, ..., n}; (4) {x ∈ Rn : Σ_{j=1}^n xj ≥ 5}.
Exercise A.4 Mark in the list below those sets which are closed and those which are open:
6. {x : |x| = 1};
7. {x : |x| ≤ 1};
8. {x : |x| ≥ 1};
2. Intersection of every family (finite or infinite) of closed sets is closed. Union of every
family (finite or infinite) of open sets is open.
3. Union of finitely many closed sets is closed. Intersection of finitely many open sets is open.
1. An affine mapping

    f (x) = ( Σ_{j=1}^n A1j xj + b1 ; ... ; Σ_{j=1}^n Amj xj + bm ) ≡ Ax + b : Rn → Rm

is continuous.
1. R2 ;
2. R2 \{0};
3. {x ∈ R2 : x1 = 0};
4. {x ∈ R2 : x2 = 0};
5. {x ∈ R2 : x1 + x2 = 0};
6. {x ∈ R2 : x1 − x2 = 0};
7. {x ∈ R2 : |x1 − x2 | ≤ x41 + x42 };
• X ⊂ Rn , Y ⊂ Rm ;
• g : Y → Rk be a continuous mapping.
Proof. Assume, contrary to what should be proved, that f is unbounded, so that
for every i there exists a point xi ∈ X such that |f (xi )| > i. By Theorem A.4.2, we can
extract from the sequence {xi } a subsequence {xij }∞ j=1 which converges to a point x̄ ∈ X.
The real-valued function g(x) = |f (x)| is continuous (as the superposition of two continuous
mappings, see Theorem A.5.1.(ii)) and therefore its values at the points xij should converge,
as j → ∞, to its value at x̄; on the other hand, g(xij ) ≥ ij → ∞ as j → ∞, and we get the
desired contradiction.
Proof. Assume, contrary to what should be proved, that there exists ε > 0 such
that for every δ > 0 one can find a pair of points x, y in X such that |x − y| < δ and
|f (x) − f (y)| ≥ ε. In particular, for every i = 1, 2, ... we can find two points xi , y i in X
such that |xi − y i | ≤ 1/i and |f (xi ) − f (y i )| ≥ ε. By Theorem A.4.2, we can extract from
the sequence {xi } a subsequence {xij }∞_{j=1} which converges to a certain point x̄ ∈ X. Since
|y ij − xij | ≤ 1/ij → 0 as j → ∞, the sequence {y ij }∞_{j=1} converges to the same point x̄ as the
sequence {xij }∞_{j=1} (why?). Since f is continuous, we have

    lim_{j→∞} f (y ij ) = f (x̄) = lim_{j→∞} f (xij ),

whence lim_{j→∞} (f (xij ) − f (y ij )) = 0, which contradicts the fact that |f (xij ) − f (y ij )| ≥ ε > 0
for all j.
Proof: Let us prove that f attains its maximum on X (the proof for the minimum is completely
similar). Since f is bounded on X by (i), the quantity

    f* = sup_{x∈X} f (x)

is finite; of course, we can find a sequence {xi } of points from X such that f* = lim_{i→∞} f (xi ).
By Theorem A.4.2, we can extract from the sequence {xi } a subsequence {xij }∞_{j=1} which
converges to a certain point x̄ ∈ X. Since f is continuous on X, we have

    f (x̄) = lim_{j→∞} f (xij ) = f*,

so that f attains its maximum on X at the point x̄.
Exercise A.6 Prove that in general none of the three statements in Theorem A.5.2 remains
valid when X is closed but not bounded, and similarly when X is bounded but not closed.
A proper way to extend the notion of the derivative to real- and vector-valued functions of
vector argument is to realize what in fact is the meaning of the derivative in the univariate case.
What f 0 (x) says to us is how to approximate f in a neighbourhood of x by a linear function.
Specifically, if f ′(x) exists, then the linear function f ′(x)∆x of ∆x approximates the change
f (x + ∆x) − f (x) in f up to a remainder which is of higher order than |∆x| as ∆x → 0:

    |f (x + ∆x) − f (x) − f ′(x)∆x| ≤ ō(|∆x|) as ∆x → 0.
In the above formula, we meet with the notation ō(|∆x|), and here is the explanation of this
notation:
ō(|∆x|) is a common name of all functions φ(∆x) of ∆x which are well-defined in a
neighbourhood of the point ∆x = 0 on the axis, vanish at the point ∆x = 0 and are
such that
    φ(∆x)/|∆x| → 0 as ∆x → 0.
For example,
1. (∆x)2 = ō(|∆x|), ∆x → 0,
2. |∆x|1.01 = ō(|∆x|), ∆x → 0,
3. sin2 (∆x) = ō(|∆x|), ∆x → 0,
4. ∆x ≠ ō(|∆x|), ∆x → 0.
Later on we shall meet with the notation “ō(|∆x|k ) as ∆x → 0”, where k is a positive
integer. The definition is completely similar to the one for the case of k = 1:
ō(|∆x|k ) is a common name of all functions φ(∆x) of ∆x which are well-defined in
a neighbourhood of the point ∆x = 0 on the axis, vanish at the point ∆x = 0 and
are such that
    φ(∆x)/|∆x|k → 0 as ∆x → 0.
Note that if f (·) is a function defined in a neighbourhood of a point x on the axis, then there
perhaps are many linear functions a∆x of ∆x which approximate f (x + ∆x) − f (x) well, in the
sense that the remainder in the approximation

    f (x + ∆x) − f (x) − a∆x

tends to 0 as ∆x → 0; among these approximations, however, there exists at most one which
approximates f (x + ∆x) − f (x) “very well” – so that the remainder is ō(|∆x|), and not merely
tends to 0 as ∆x → 0. Indeed, if

    f (x + ∆x) − f (x) − a∆x = ō(|∆x|),

then, dividing both sides by ∆x and passing to the limit as ∆x → 0, we get a = f ′(x).
Thus, if a linear function a∆x of ∆x approximates the change f (x + ∆x) − f (x) in f up to the
remainder which is ō(|∆x|) as ∆x → 0, then a is the derivative of f at x. You can easily verify
that the converse statement also is true: if the derivative of f at x exists, then the linear function
f 0 (x)∆x of ∆x approximates the change f (x + ∆x) − f (x) in f up to the remainder which is
ō(|∆x|) as ∆x → 0.
The advantage of the “ō(|∆x|)”-definition of derivative is that it can be naturally extended
to vector-valued functions of vector arguments (you should just replace “axis” with Rn in
the definition of ō) and it illuminates the essence of the notion of derivative: when it exists, this is
exactly the linear function of ∆x which approximates the change f (x + ∆x) − f (x) in f up to
a remainder which is ō(|∆x|). The precise definition is as follows:
Definition A.6.1 [Frechet differentiability] Let f be a function which is well-defined in a neigh-
bourhood of a point x ∈ Rn and takes values in Rm . We say that f is differentiable at x, if
there exists a linear function Df (x)[∆x] of ∆x ∈ Rn taking values in Rm which approximates
the change f (x + ∆x) − f (x) in f up to a remainder which is ō(|∆x|):
|f (x + ∆x) − f (x) − Df (x)[∆x]| ≤ ō(|∆x|). (A.6.1)
Equivalently: a function f which is well-defined in a neighbourhood of a point x ∈ Rn and takes
values in Rm is called differentiable at x, if there exists a linear function Df (x)[∆x] of ∆x ∈ Rn
taking values in Rm such that for every ε > 0 there exists δ > 0 satisfying the relation

    |∆x| ≤ δ ⇒ |f (x + ∆x) − f (x) − Df (x)[∆x]| ≤ ε|∆x|.
1. The right hand side limit in (A.6.2) is an important entity called the directional derivative
of f taken at x along (a direction) ∆x; note that this quantity is defined in the “purely
univariate” fashion – by dividing the change in f by the magnitude of a shift in a direction
∆x and passing to limit as the magnitude of the shift approaches 0. Relation (A.6.2)
says that the derivative, if it exists, is, at every ∆x, nothing but the directional derivative
of f taken at x along ∆x. Note, however, that differentiability is much more than the
existence of directional derivatives along all directions ∆x; differentiability requires also
the directional derivatives to be “well-organized” – to depend linearly on the direction ∆x.
It is easily seen that mere existence of directional derivatives does not imply their “good
organization”: for example, the Euclidean norm

    f (x) = |x|

possesses at the point x = 0 directional derivatives along all directions (the derivative along a
direction ∆x equals |∆x|), but this directional derivative is not linear in ∆x, so that f is not
differentiable at x = 0.
2. It should be stressed that the derivative, if it exists, is what it is: a linear function of ∆x ∈ Rn
taking values in Rm . As we shall see in a while, we can represent this function by some-
thing “tractable”, like a vector or a matrix, and can understand how to compute such a
representation; however, an intelligent reader should bear in mind that a representation is
not exactly the same as the represented entity. Sometimes the difference between deriva-
tives and the entities which represent them is reflected in the terminology: what we call
the derivative, is also called the differential, while the word “derivative” is reserved for the
vector/matrix representing the differential.
Case of m = 1 – the gradient. Let us start with real-valued functions (i.e., with the case
of m = 1); in this case the derivative is a linear real-valued function on Rn . As we remember,
the standard Euclidean structure on Rn allows to represent every linear function on Rn as the
inner product of the argument with certain fixed vector. In particular, the derivative Df (x)[∆x]
of a scalar function can be represented as

    Df (x)[∆x] = ⟨vector, ∆x⟩;

what is denoted “vector” in this relation is called the gradient of f at x and is denoted by
∇f (x):

    Df (x)[∆x] = (∇f (x))T ∆x.    (A.6.3)
How to compute the gradient? The answer is given by (A.6.2). Indeed, let us look what (A.6.3)
and (A.6.2) say when ∆x is the i-th standard basic orth. According to (A.6.3), Df (x)[ei ] is the
i-th coordinate of the vector ∇f (x); according to (A.6.2),

    Df (x)[ei ] = lim_{t→+0} (f (x + tei ) − f (x))/t ,
    Df (x)[ei ] = −Df (x)[−ei ] = − lim_{t→+0} (f (x − tei ) − f (x))/t = lim_{t→−0} (f (x + tei ) − f (x))/t
    ⇒ Df (x)[ei ] = ∂f (x)/∂xi .
Thus,

    ∇f (x) = ( ∂f (x)/∂x1 , ..., ∂f (x)/∂xn ).

For general m, the derivative, being linear in ∆x, can be written as the product of a certain
m × n matrix and ∆x:

    Df (x)[∆x] = [matrix] · ∆x.    (A.6.4)

What is denoted by “matrix” in (A.6.4) is called the Jacobian of f at x and is denoted by f ′(x).
How to compute the entries of the Jacobian? Here again the answer is readily given by (A.6.2).
Indeed, on one hand, we have
Df (x)[∆x] = f 0 (x)∆x, (A.6.5)
whence
[Df (x)[ej ]]i = (f 0 (x))ij , i = 1, ..., m, j = 1, ..., n.
On the other hand, denoting

    f (x) = (f1 (x), ..., fm (x)),

the same computation as in the case of the gradient demonstrates that
    [Df (x)[ej ]]i = ∂fi (x)/∂xj .
If a vector-valued function f (x) = (f1 (x), ..., fm (x)) is differentiable at x, then the
first order partial derivatives of all fi at x exist, and the Jacobian of f at x is just
the m × n matrix with the entries [∂fi (x)/∂xj ]i,j (so that the rows in the Jacobian are
[∇f1 (x)]T , ..., [∇fm (x)]T ). The derivative of f , taken at x, is the linear vector-valued
function of ∆x given by

    Df (x)[∆x] = f ′(x)∆x = ( [∇f1 (x)]T ∆x ; ... ; [∇fm (x)]T ∆x ).
Remark A.6.1 Note that for a real-valued function f we have defined both the gradient ∇f (x)
and the Jacobian f 0 (x). These two entities are “nearly the same”, but not exactly the same:
the Jacobian is a vector-row, and the gradient is a vector-column, linked by the relation

    f ′(x) = (∇f (x))T .

Of course, both these representations of the derivative of f yield the same linear approximation
of the change in f :
Df (x)[∆x] = (∇f (x))T ∆x = f 0 (x)∆x.
3. The first order partial derivatives of the components fi of f are continuous at the point
x0 .
Theorem A.6.2 (i) [Differentiability and linear operations] Let f1 (x), f2 (x) be mappings de-
fined in a neighbourhood of x0 ∈ Rn and taking values in Rm , and λ1 (x), λ2 (x) be real-valued
functions defined in a neighbourhood of x0 and differentiable at x0 . Then the function
f (x) = λ1 (x)f1 (x) + λ2 (x)f2 (x) is differentiable at x0 , with
Df (x0 )[∆x] = [Dλ1 (x0 )[∆x]]f1 (x0 ) + λ1 (x0 )Df1 (x0 )[∆x]
+[Dλ2 (x0 )[∆x]]f2 (x0 ) + λ2 (x0 )Df2 (x0 )[∆x]
⇓
    f ′(x0 ) = f1 (x0 )[∇λ1 (x0 )]T + λ1 (x0 )f1′ (x0 ) + f2 (x0 )[∇λ2 (x0 )]T + λ2 (x0 )f2′ (x0 ).
If the outer function g is real-valued, then the latter formula implies that

    ∇(g(f (x0 ))) = [f ′(x0 )]T ∇g(f (x0 )).

Example 1: The gradient of an affine function. An affine function

    f (x) = a + g T x : Rn → R

is differentiable at every point (Theorem A.6.1) and its gradient, of course, equals g:

    (∇f (x))T ∆x = lim_{t→+0} t−1 [f (x + t∆x) − f (x)] = g T ∆x,

and we arrive at

    ∇(a + g T x) = g.
Example 2: The gradient of a quadratic form. For the time being, let us define a
homogeneous quadratic form on Rn as a function
    f (x) = Σ_{i,j} Aij xi xj = xT Ax,
where A is an n × n matrix. Note that the matrices A and AT define the same quadratic form,
and therefore the symmetric matrix B = (1/2)(A + AT ) also produces the same quadratic form as
A and AT . It follows that we always may assume (and do assume from now on) that the matrix
A producing the quadratic form in question is symmetric.
A quadratic form is a simple polynomial and as such is differentiable at every point (Theorem
A.6.1). What is the gradient of f at a point x? Here is the computation:
    (∇f (x))T ∆x = Df (x)[∆x]
                 = lim_{t→+0} t−1 [(x + t∆x)T A(x + t∆x) − xT Ax]    [(A.6.2)]
                 = lim_{t→+0} t−1 [xT Ax + t(∆x)T Ax + txT A∆x + t2 (∆x)T A∆x − xT Ax]    [opening parentheses]
                 = lim_{t→+0} t−1 [2t(Ax)T ∆x + t2 (∆x)T A∆x]    [arithmetics + symmetry of A]
                 = 2(Ax)T ∆x.

We conclude that

    ∇(xT Ax) = 2Ax
(recall that A = AT ).
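The formula ∇(xT Ax) = 2Ax is easy to verify numerically, e.g., by central finite differences (a sketch; the matrix and the point are arbitrary choices):

import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = 0.5 * (M + M.T)              # symmetrize, as agreed above
x = rng.standard_normal(4)

f = lambda z: z @ A @ z
grad = 2 * A @ x                 # the formula just derived

eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(4)])
print(np.allclose(grad, num, atol=1e-5))   # True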
Example 3: The derivative of the log-det barrier. Let us compute the derivative of
the log-det barrier (playing an extremely important role in modern optimization)
F (X) = ln Det(X);
here X is an n × n matrix (or, if you prefer, n2 -dimensional vector). Note that F (X) is well-
defined and differentiable in a neighbourhood of every point X̄ with positive determinant (indeed,
Det(X) is a polynomial of the entries of X and thus – is everywhere continuous and differentiable
with continuous partial derivatives, while the function ln(t) is continuous and differentiable on
the positive ray; by Theorems A.5.1.(ii), A.6.2.(ii), F is differentiable at every X such that
Det(X) > 0). The reader is kindly asked to try to find the derivative of F by the standard
techniques; if the result is not obtained in, say, 30 minutes, please look at the 8-line
computation to follow (in this computation, Det(X̄) > 0, and G(X) = Det(X)):
    DF (X̄)[∆X]
      = D ln(G(X̄))[DG(X̄)[∆X]]    [chain rule]
      = G−1 (X̄) DG(X̄)[∆X]    [ln′(t) = t−1 ]
      = Det−1 (X̄) lim_{t→+0} t−1 [Det(X̄ + t∆X) − Det(X̄)]    [definition of G and (A.6.2)]
      = Det−1 (X̄) lim_{t→+0} t−1 [Det(X̄(I + tX̄ −1 ∆X)) − Det(X̄)]
      = Det−1 (X̄) lim_{t→+0} t−1 [Det(X̄)(Det(I + tX̄ −1 ∆X) − 1)]    [Det(AB) = Det(A)Det(B)]
      = lim_{t→+0} t−1 [Det(I + tX̄ −1 ∆X) − 1]
      = Tr(X̄ −1 ∆X) = Σ_{i,j} [X̄ −1 ]ji (∆X)ij .
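The relation DF (X̄)[∆X] = Tr(X̄ −1 ∆X) can likewise be checked numerically (a sketch; the positive definite X and the direction ∆X below are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
X = B @ B.T + 4 * np.eye(4)      # ensures Det(X) > 0
dX = rng.standard_normal((4, 4))

analytic = np.trace(np.linalg.solve(X, dX))     # Tr(X^{-1} dX)

t = 1e-6
numeric = (np.log(np.linalg.det(X + t * dX))
           - np.log(np.linalg.det(X))) / t
print(np.isclose(analytic, numeric, atol=1e-4))  # True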
    Dℓ f (x)[∆x1 , ∆x2 , ..., ∆xℓ ] = ∂ℓ/(∂tℓ ∂tℓ−1 ...∂t1 ) |_{t1 =...=tℓ =0} f (x + t1 ∆x1 + t2 ∆x2 + ... + tℓ ∆xℓ ).    (A.6.7)
1. When ℓ = 1, (∗) says to us that the first order derivative of f , taken at x, is a linear function
Df (x)[∆x1 ] of ∆x1 ∈ Rn , taking values in Rm , and that the value of this function at every
∆x1 is

    Df (x)[∆x1 ] = ∂/∂t1 |_{t1 =0} f (x + t1 ∆x1 ).    (A.6.8)
2. To understand what is the second derivative, let us take the first derivative Df (x)[∆x1 ],
let us temporarily fix somehow the argument ∆x1 and treat the derivative as a function of
x. As a function of x, ∆x1 being fixed, the quantity Df (x)[∆x1 ] is again a mapping which
maps U into Rm and is differentiable by Theorem A.6.1 (provided, of course, that k ≥ 2).
The derivative of this mapping is certain linear function of ∆x ≡ ∆x2 ∈ Rn , depending
on x as on a parameter; and of course it depends on ∆x1 as on a parameter as well. Thus,
the derivative of Df (x)[∆x1 ] in x is certain function
D2 f (x)[∆x1 , ∆x2 ]
of x ∈ U and ∆x1 , ∆x2 ∈ Rn and taking values in Rm . What we know about this function
is that it is linear in ∆x2 . In fact, it is also linear in ∆x1 , since it is the derivative in x of
certain function (namely, of Df (x)[∆x1 ]) linearly depending on the parameter ∆x1 , so that
the derivative of the function in x is linear in the parameter ∆x1 as well (differentiation is
a linear operation with respect to a function we are differentiating: summing up functions
and multiplying them by real constants, we sum up, respectively, multiply by the same
constants, the derivatives). Thus, D2 f (x)[∆x1 , ∆x2 ] is linear in ∆x1 when x and ∆x2 are
fixed, and is linear in ∆x2 when x and ∆x1 are fixed. Moreover, we have

    D2 f (x)[∆x1 , ∆x2 ] = ∂/∂t2 |_{t2 =0} Df (x + t2 ∆x2 )[∆x1 ]    [cf. (A.6.8)]
                        = ∂/∂t2 |_{t2 =0} ∂/∂t1 |_{t1 =0} f (x + t2 ∆x2 + t1 ∆x1 )    [by (A.6.8)]
                        = ∂2/(∂t2 ∂t1 ) |_{t1 =t2 =0} f (x + t1 ∆x1 + t2 ∆x2 )    (A.6.9)
as claimed in (A.6.7) for ` = 2. The only piece of information about the second derivative
which is contained in (∗) and is not justified yet is that D2 f (x)[∆x1 , ∆x2 ] is symmetric
in ∆x1 , ∆x2 ; but this fact is readily given by the representation (A.6.7), since, as they
prove in Calculus, if a function φ possesses continuous partial derivatives of orders ≤ ` in
a neighbourhood of a point, then these derivatives in this neighbourhood are independent
of the order in which they are taken; it follows that
    D2 f (x)[∆x1 , ∆x2 ] = ∂2/(∂t2 ∂t1 ) |_{t1 =t2 =0} f (x + t1 ∆x1 + t2 ∆x2 )    [(A.6.9)]
                        = ∂2/(∂t1 ∂t2 ) |_{t1 =t2 =0} f (x + t1 ∆x1 + t2 ∆x2 )    [symmetry of partial derivatives]
                        = ∂2/(∂t1 ∂t2 ) |_{t1 =t2 =0} f (x + t2 ∆x2 + t1 ∆x1 )
                        = D2 f (x)[∆x2 , ∆x1 ]    [the same (A.6.9)]
3. Now it is clear how to proceed: to define D3 f (x)[∆x1 , ∆x2 , ∆x3 ], we fix in the second
order derivative D2 f (x)[∆x1 , ∆x2 ] the arguments ∆x1 , ∆x2 and treat it as a function of
x only, thus arriving at a mapping which maps U into Rm and depends on ∆x1 , ∆x2 as
on parameters (linearly in every one of them). Differentiating the resulting mapping in
x, we arrive at a function D3 f (x)[∆x1 , ∆x2 , ∆x3 ] which by construction is linear in every
one of the arguments ∆x1 , ∆x2 , ∆x3 and satisfies (A.6.7); the latter relation, due to the
Calculus result on the symmetry of partial derivatives, implies that D3 f (x)[∆x1 , ∆x2 , ∆x3 ]
is symmetric in ∆x1 , ∆x2 , ∆x3 . After we have at our disposal the third derivative D3 f , we
can build from it in the already explained fashion the fourth derivative, and so on, until the
k-th derivative is defined.
Remark A.6.2 Since Dℓ f (x)[∆x1 , ..., ∆xℓ ] is linear in every one of the ∆xi , we can expand the
derivative in a multiple sum:

    ∆xi = Σ_{j=1}^n ∆xij ej
    ⇓
    Dℓ f (x)[∆x1 , ..., ∆xℓ ] = Dℓ f (x)[ Σ_{j1 =1}^n ∆x1j1 ej1 , ..., Σ_{jℓ =1}^n ∆xℓjℓ ejℓ ]
                             = Σ_{1≤j1 ,...,jℓ ≤n} Dℓ f (x)[ej1 , ..., ejℓ ] ∆x1j1 · · · ∆xℓjℓ .    (A.6.10)

What is the origin of the coefficients Dℓ f (x)[ej1 , ..., ejℓ ]? According to (A.6.7), one has

    Dℓ f (x)[ej1 , ..., ejℓ ] = ∂ℓ/(∂tℓ ∂tℓ−1 ...∂t1 ) |_{t1 =...=tℓ =0} f (x + t1 ej1 + t2 ej2 + ... + tℓ ejℓ )
                             = ∂ℓ f (x)/(∂xjℓ ∂xjℓ−1 ...∂xj1 ),

so that the coefficients in (A.6.10) are nothing but the partial derivatives, of order ℓ, of f .
Remark A.6.3 An important particular case of relation (A.6.7) is the one when ∆x1 = ∆x2 =
... = ∆xℓ ; let us call the common value of these ℓ vectors d. According to (A.6.7), we have

    Dℓ f (x)[d, d, ..., d] = ∂ℓ/(∂tℓ ∂tℓ−1 ...∂t1 ) |_{t1 =...=tℓ =0} f (x + t1 d + t2 d + ... + tℓ d).

This relation can be interpreted as follows: consider the univariate function

    φ(t) = f (x + td).

Then

    φ(ℓ) (0) = ∂ℓ/(∂tℓ ∂tℓ−1 ...∂t1 ) |_{t1 =...=tℓ =0} f (x + t1 d + t2 d + ... + tℓ d) = Dℓ f (x)[d, ..., d].
In other words, Dℓ f (x)[d, ..., d] is what is called the ℓ-th directional derivative of f taken at x along
the direction d; to define this quantity, we pass from the function f of several variables to the
univariate function φ(t) = f (x + td) – restrict f onto the line passing through x and directed
by d – and then take the “usual” derivative of order ℓ of the resulting function of a single real
variable t at the point t = 0 (which corresponds to the point x of our line).
    Dk f (x)[∆x1 , ..., ∆xk ] = Σ_{1≤i1 ,...,ik ≤n} ( ∂k f (x)/(∂xik ∂xik−1 ...∂xi1 ) ) (∆x1 )i1 ...(∆xk )ik .
We may say that the derivative can be represented by the k-index collection of m-dimensional
vectors ∂k f (x)/(∂xik ∂xik−1 ...∂xi1 ). This collection, however, is a difficult-to-handle entity, so that such a
representation does not help. There is, however, a case when the collection becomes an entity we
know how to handle; this is the case of the second-order derivative of a scalar function (k = 2, m = 1).
In this case, the collection in question is just the symmetric matrix

    H(x) = [ ∂2 f (x)/(∂xi ∂xj ) ]_{1≤i,j≤n} .

This matrix is called the Hessian of f at x. Note that

    D2 f (x)[∆x1 , ∆x2 ] = (∆x1 )T H(x) ∆x2 .
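Numerically, the Hessian can be approximated by second-order finite differences; for the quadratic form f (x) = xT Ax with symmetric A it equals 2A, so the answer is easy to check (a sketch with an arbitrary 2 × 2 matrix):

import numpy as np

def hessian_fd(f, x, eps=1e-5):
    # crude numerical Hessian via central second differences
    n = x.size
    I = np.eye(n)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + eps*I[i] + eps*I[j]) - f(x + eps*I[i] - eps*I[j])
                       - f(x - eps*I[i] + eps*I[j]) + f(x - eps*I[i] - eps*I[j])
                       ) / (4 * eps**2)
    return H

A = np.array([[2.0, 1.0], [1.0, 3.0]])
f = lambda z: z @ A @ z
print(np.round(hessian_fd(f, np.array([0.3, -0.7])), 4))   # approx 2A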
Theorem A.6.3 (i) Let U be an open set in Rn , f1 (·), f2 (·) : Rn → Rm be Ck in U , and let
real-valued functions λ1 (·), λ2 (·) be Ck in U . Then the function

    f (x) = λ1 (x)f1 (x) + λ2 (x)f2 (x)

is Ck in U .
(ii) Let U be an open set in Rn , V be an open set in Rm , let a mapping f : Rn → Rm be
Ck in U and such that f (x) ∈ V for x ∈ U , and, finally, let a mapping g : Rm → Rp be Ck in
V . Then the superposition

    h(x) = g(f (x))

is Ck in U .
Remark A.6.4 For higher order derivatives, in contrast to the first order ones, there is no
simple “chain rule” for computing the derivative of a superposition. For example, the second-order
derivative of the superposition h(x) = g(f (x)) of two C2 -mappings is given by the formula

    D2 h(x)[∆x1 , ∆x2 ] = D2 g(f (x))[Df (x)[∆x1 ], Df (x)[∆x2 ]] + Dg(f (x))[D2 f (x)[∆x1 , ∆x2 ]]

(check it!). We see that both the first- and the second-order derivatives of f and g contribute
to the second-order derivative of the superposition h.
The only case when there does exist a simple formula for high order derivatives of a superposition
is the case when the inner function is affine: if f (x) = Ax + b and h(x) = g(f (x)) = g(Ax + b)
with a Cℓ mapping g, then

    Dℓ h(x)[∆x1 , ..., ∆xℓ ] = Dℓ g(Ax + b)[A∆x1 , ..., A∆xℓ ].

Example: The second derivative of an affine function. For an affine function f (x) = a + bT x,
the first derivative

    Df (x)[∆x1 ] = bT ∆x1

is independent of x, and therefore the derivative of Df (x)[∆x1 ] in x, which should give us the
second derivative D2 f (x)[∆x1 , ∆x2 ], is zero. Clearly, the third, the fourth, etc., derivatives of
an affine function are zero as well.
Example: The second derivative of a quadratic form. As we have already seen, the first
derivative of the quadratic form f (x) = xT Ax (A = AT ) is Df (x)[∆x1 ] = 2xT A∆x1 .
Differentiating in x, we get
    D2 f (x)[∆x1 , ∆x2 ] = lim_{t→+0} t−1 [2(x + t∆x2 )T A∆x1 − 2xT A∆x1 ] = 2(∆x2 )T A∆x1 ,
so that
D2 f (x)[∆x1 , ∆x2 ] = 2(∆x2 )T A∆x1
Note that the second derivative of a quadratic form is independent of x; consequently, the third,
the fourth, etc., derivatives of a quadratic form are identically zero.
Example: The second derivative of the log-det barrier. As we have seen, the first derivative of
F (X) = ln Det(X) at a nondegenerate X is DF (X)[∆X] = Tr(X −1 ∆X).
To differentiate the right hand side in X, let us first find the derivative of the mapping G(X) =
X −1 which is defined on the open set of non-degenerate n × n matrices. We have

    DG(X)[∆X] = lim_{t→+0} t−1 [(X + t∆X)−1 − X −1 ]
              = lim_{t→+0} t−1 [(X(I + tX −1 ∆X))−1 − X −1 ]
              = lim_{t→+0} t−1 [(I + tY )−1 X −1 − X −1 ]    [Y ≡ X −1 ∆X]
              = [ lim_{t→+0} t−1 ((I + tY )−1 − I) ] X −1
              = [ lim_{t→+0} t−1 (I − (I + tY )) (I + tY )−1 ] X −1
              = [ lim_{t→+0} (−Y (I + tY )−1 ) ] X −1
              = −Y X −1 = −X −1 ∆X X −1 ,

and we arrive at the important in its own right relation

    D(X −1 )[∆X] = −X −1 ∆X X −1 .
Now we are ready to compute the second derivative of F (X) = ln Det(X):

    DF (X)[∆X 1 ] = Tr(X −1 ∆X 1 )
    ⇓
    D2 F (X)[∆X 1 , ∆X 2 ] = lim_{t→+0} t−1 [Tr((X + t∆X 2 )−1 ∆X 1 ) − Tr(X −1 ∆X 1 )]
                          = lim_{t→+0} Tr( t−1 [(X + t∆X 2 )−1 ∆X 1 − X −1 ∆X 1 ] )
                          = lim_{t→+0} Tr( t−1 [(X + t∆X 2 )−1 − X −1 ] ∆X 1 )
                          = Tr( [−X −1 ∆X 2 X −1 ] ∆X 1 ),

so that

    D2 F (X)[∆X 1 , ∆X 2 ] = −Tr(X −1 ∆X 2 X −1 ∆X 1 ).
Since Tr(AB) = Tr(BA) (check it!) for all matrices A, B such that the product AB makes sense
and is square, the right hand side in the above formula is symmetric in ∆X 1 , ∆X 2 , as it should
be for the second derivative of a C2 function.
Assembling the value and the first derivative of f at x̄, we get the first order Taylor expansion
of f at x̄,

    F1 (x) = f (x̄) + f ′ (x̄)(x − x̄)

– this is the affine function of x which approximates f (x) “very well” in a neighbourhood of
x̄, namely, within approximation error ō(|x − x̄|). A similar fact is true for Taylor expansions of
higher order:
Theorem A.6.4 Let f : Rn → Rm be Ck in a neighbourhood of x̄, and let Fk (x) be the Taylor
expansion of f at x̄ of degree k. Then
(i) Fk (x) is a vector-valued polynomial of full degree ≤ k (i.e., every one of the coordinates
of the vector Fk (x) is a polynomial of x1 , ..., xn , and the sum of powers of xi ’s in every term of
this polynomial does not exceed k);
(ii) Fk (x) approximates f (x) in a neighbourhood of x̄ up to a remainder which is ō(|x − x̄|k )
as x → x̄:

    |f (x) − Fk (x)| ≤ ō(|x − x̄|k );

Fk (·) is the unique polynomial with components of full degree ≤ k which approximates f up to a
remainder which is ō(|x − x̄|k ).
(iii) The value and the derivatives of Fk of orders 1, 2, ..., k, taken at x̄, are the same as the
value and the corresponding derivatives of f taken at the same point.
As stated in the Theorem, Fk (x) approximates f (x) for x close to x̄ up to a remainder which is
ō(|x − x̄|k ). In many cases, it is not enough to know that the remainder is “ō(|x − x̄|k )” — we
need an explicit bound on this remainder. The standard bound of this type is as follows:
Theorem A.6.5 Let k be a positive integer, and let f : Rn → Rm be Ck+1 in a ball Br =
Br (x̄) = {x ∈ Rn : |x − x̄| < r} of radius r > 0 centered at a point x̄. Assume that the
directional derivatives of order k + 1, taken at every point of Br along every unit direction, do
not exceed a certain L < ∞:

    |Dk+1 f (x)[d, ..., d]| ≤ L  ∀(x ∈ Br ) ∀(d, |d| = 1).

Then for the Taylor expansion Fk of order k of f taken at x̄ one has

    |f (x) − Fk (x)| ≤ L|x − x̄|k+1 / (k + 1)!  ∀(x ∈ Br ).
Thus, in a neighbourhood of x̄ the remainder of the k-th order Taylor expansion, taken at x̄, is
of order of L|x − x̄|k+1 , where L is the maximal (over all unit directions and all points from the
neighbourhood) magnitude of the directional derivatives of order k + 1 of f .
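As a univariate illustration of this bound (a sketch): for f (x) = exp(x), x̄ = 0, k = 2 and r = 1, all directional derivatives of order 3 on Br are bounded by L = e, and the bound of Theorem A.6.5 indeed holds:

import numpy as np

r, L = 1.0, np.e
xs = np.linspace(-0.99, 0.99, 7)
F2 = 1 + xs + xs**2 / 2              # degree-2 Taylor expansion at 0
print(np.all(np.abs(np.exp(xs) - F2) <= L * np.abs(xs)**3 / 6))   # True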
Here Tr stands for the trace – the sum of diagonal elements of a (square) matrix. With this inner
product (called the Frobenius inner product), Mm,n becomes a legitimate Euclidean space, and
we may use in connection with this space all notions based upon the Euclidean structure, e.g.,
the (Frobenius) norm of a matrix

    ‖X‖2 = ⟨X, X⟩^{1/2} = ( Σ_{i=1}^m Σ_{j=1}^n Xij2 )^{1/2} = (Tr(X T X))^{1/2} ,
and likewise the notions of orthogonality, orthogonal complement of a linear subspace, etc.
The same applies to the space Sm equipped with the Frobenius inner product; of course, the
Frobenius inner product of symmetric matrices can be written without the transposition sign:
    ⟨X, Y ⟩ = Tr(XY ),   X, Y ∈ Sm .
Theorem A.7.1 states, in particular, that for a symmetric matrix A, all eigenvalues are real,
and the corresponding eigenvectors can be chosen to be real and to form an orthonormal basis
in Rn .
A useful fact is that commuting symmetric matrices admit simultaneous diagonalization: if
symmetric matrices A1 , ..., Ak commute with each other, then there exist an orthogonal matrix
U and diagonal matrices Λi such that

    Ai = U Λi U T , i = 1, ..., k.

You are welcome to prove this statement by yourself; to simplify your task, here are two simple
and important in their own right statements which help to reach your target:
A.7.2.E.1: Let λ be a real and A, B be two commuting n × n matrices. Then the
spectral subspace E = {x : Ax = λx} of A corresponding to λ is invariant for B
(i.e., Be ∈ E for every e ∈ E).
A.7.2.E.2: If A is an n × n matrix and L is an invariant subspace of A (i.e., L is a
linear subspace such that Ae ∈ L whenever e ∈ L), then the orthogonal complement
L⊥ of L is invariant for the matrix AT . In particular, if A is symmetric and L is an
invariant subspace of A, then L⊥ is an invariant subspace of A as well.
The Variational Characterization of Eigenvalues (VCE) says that to get the largest eigenvalue
λ1 (A), you should maximize the quadratic form xT Ax over the unit sphere S = {x ∈ Rn : xT x = 1};
the maximum is exactly λ1 (A). To get
the second largest eigenvalue λ2 (A), you should act as follows: you choose a linear subspace E
of dimension n − 1 and maximize the quadratic form xT Ax over the cross-section of S by this
subspace; the maximum value of the form depends on E, and you minimize this maximum over
linear subspaces E of the dimension n − 1; the result is exactly λ2 (A). To get λ3 (A), you replace
in the latter construction subspaces of the dimension n − 1 by those of the dimension n − 2,
and so on. In particular, the smallest eigenvalue λn (A) is just the minimum, over all linear
subspaces E of the dimension n − n + 1 = 1, i.e., over all lines passing through the origin, of the
quantities xT Ax, where x ∈ E is unit (xT x = 1); in other words, λn (A) is just the minimum of
the quadratic form xT Ax over the unit sphere S.
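One can illustrate the first claim of the VCE numerically, comparing the largest eigenvalue with a crude sampled maximum of xT Ax over the unit sphere (a sketch; numpy.linalg.eigvalsh returns the eigenvalues in ascending order):

import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((5, 5))
A = 0.5 * (M + M.T)

lam_max = np.linalg.eigvalsh(A)[-1]      # largest eigenvalue

X = rng.standard_normal((5, 100000))
X /= np.linalg.norm(X, axis=0)           # random points on the unit sphere
sampled = np.max(np.einsum('ij,ij->j', X, A @ X))
print(lam_max, sampled)                  # sampled max is close to lam_max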
Proof of the VCE is pretty easy. Let e1 , ..., en be an orthonormal eigenbasis of A: Aeℓ =
λℓ (A)eℓ . For 1 ≤ ℓ ≤ n, let Fℓ = Lin{e1 , ..., eℓ }, Gℓ = Lin{eℓ , eℓ+1 , ..., en }. Finally, for
x ∈ Rn let ξ(x) be the vector of coordinates of x in the orthonormal basis e1 , ..., en . Note
that

    xT x = ξ T (x)ξ(x),
since {e1 , ..., en } is an orthonormal basis, and that
    xT Ax = xT A( Σ_i ξi (x)ei ) = xT Σ_i λi (A)ξi (x)ei = Σ_i λi (A)ξi (x)(xT ei ) = Σ_i λi (A)ξi2 (x),    (A.7.4)

the concluding equality coming from xT ei = ξi (x).
Now, given ℓ, 1 ≤ ℓ ≤ n, let us set E = Gℓ ; note that E is a linear subspace of dimension
n − ℓ + 1. In view of (A.7.4), the maximum of the quadratic form xT Ax over the intersection
of our E with the unit sphere is

    max { Σ_{i=ℓ}^n λi (A)ξi2 : Σ_{i=ℓ}^n ξi2 = 1 },

and the latter quantity clearly equals max_{ℓ≤i≤n} λi (A) = λℓ (A). Thus, for appropriately chosen
E ∈ Eℓ , the inner maximum in the right hand side of (A.7.3) equals λℓ (A), whence the
right hand side of (A.7.3) is ≤ λℓ (A). It remains to prove the opposite inequality. To this
right hand side of (A.7.3) is ≤ λ` (A). It remains to prove the opposite inequality. To this
end, consider a linear subspace E of the dimension n−`+1 and observe that it has nontrivial
intersection with the linear subspace F` of the dimension ` (indeed, dim E + dim F` = (n −
` + 1) + ` > n, so that dim (E ∩ F ) > 0 by the Dimension formula). It follows that there
exists a unit vector y belonging to both E and F` . Since y is a unit vector from F` , we have
P̀ P̀ 2
y= ηi ei with ηi = 1, whence, by (A.7.4),
i=1 i=1
`
X
y T Ay = λi (A)ηi2 ≥ min λi (A) = λ` (A).
1≤i≤`
i=1
We conclude that

    max_{x∈E: xT x=1} xT Ax ≥ y T Ay ≥ λℓ (A).

Since E is an arbitrary subspace from Eℓ , we conclude that the right hand side in (A.7.3) is
≥ λℓ (A).
    max_{x∈E: xT x=1} xT Ax ≥ max_{x∈E: xT x=1} xT Bx .
Indeed, by VCE, λℓ (Ā) = min_{E∈Ēℓ} max_{x∈E: xT x=1} xT Ax, where Ēℓ is the family of all linear
subspaces of dimension n − k − ℓ + 1 contained in the linear subspace {x ∈ Rn : xn−k+1 =
xn−k+2 = ... = xn = 0}. Since Ēℓ ⊂ Eℓ+k , we have

    λℓ (Ā) = min_{E∈Ēℓ} max_{x∈E: xT x=1} xT Ax ≥ min_{E∈Eℓ+k} max_{x∈E: xT x=1} xT Ax = λℓ+k (A).

We have proved the left inequality in (A.7.5). Applying this inequality to the matrix −A, we
get

    −λℓ (Ā) = λn−k−ℓ (−Ā) ≥ λn−ℓ (−A) = −λℓ (A),

or, which is the same, λℓ (Ā) ≤ λℓ (A), which is the first inequality in (A.7.5).
    A ⪰ 0 ⇔ {A = AT and xT Ax ≥ 0 ∀x}.

A is called positive definite (notation: A ≻ 0), if it is positive semidefinite and the corresponding
quadratic form is positive outside the origin:

    A ≻ 0 ⇔ {A = AT and xT Ax > 0 ∀x ≠ 0}.
as required in (vi).
(vi)⇒(v): evident.
(v)⇒(iv): Let A = B 2 with certain symmetric B, and let bi be the i-th column of B. Applying
the Gram-Schmidt orthogonalization process (see proof of Theorem A.2.3.(iii)), we can find an
orthonormal system of vectors u1 , ..., un and a lower triangular matrix L such that bi = Σ_{j=1}^i Lij uj ,
or, which is the same, B T = LU , where U is the orthogonal matrix with the rows uT1 , ..., uTn . We
now have A = B 2 = B T (B T )T = LU U T LT = LLT . We see that A = ∆T ∆, where the matrix
∆ = LT is upper triangular.
(iv)⇒(iii): evident.
(iii)⇒(i): If A = DT D, then xT Ax = (Dx)T (Dx) ≥ 0 for all x.
We have proved the equivalence of the properties (i) – (vi). Slightly modifying the reasoning
(do it yourself!), one can prove the equivalence of the properties (i′) – (vi′).
Remark A.7.1 (i) [Checking positive semidefiniteness] Given an n × n symmetric matrix A,
one can check whether it is positive semidefinite by a purely algebraic finite algorithm (the so-called
Lagrange diagonalization of a quadratic form) which requires at most O(n3 ) arithmetic
operations. Positive definiteness of a matrix can be checked also by the Choleski factorization
algorithm which finds the decomposition in (iv′), if it exists, in approximately n3 /6 arithmetic
operations.
There exists another useful algebraic criterion (Sylvester’s criterion) for positive semidefiniteness
of a matrix; according to this criterion, a symmetric matrix A is positive definite if
and only if its angular minors are positive, and A is positive semidefinite if and only if all its
principal minors are nonnegative. For example, a symmetric 2 × 2 matrix A = [a b; b c] is
positive semidefinite if and only if a ≥ 0, c ≥ 0 and Det(A) ≡ ac − b2 ≥ 0.
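For the 2 × 2 case just described, the criterion is easy to test against an eigenvalue check (a sketch):

import numpy as np

def psd_2x2(a, b, c):
    # Sylvester-type test for [[a, b], [b, c]] being positive semidefinite
    return a >= 0 and c >= 0 and a * c - b * b >= 0

for a, b, c in [(1.0, 0.0, 1.0), (1.0, 2.0, 1.0), (0.0, 0.0, 0.0)]:
    A = np.array([[a, b], [b, c]])
    by_eigs = np.all(np.linalg.eigvalsh(A) >= -1e-12)
    print(psd_2x2(a, b, c), by_eigs)     # the two tests agree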
(ii) [Square root of a positive semidefinite matrix] By the first chain of equivalences in
Theorem A.7.5, a symmetric matrix A is ⪰ 0 if and only if A is the square of a positive
semidefinite matrix B. The latter matrix is uniquely defined by A ⪰ 0 and is called the square
root of A (notation: A1/2 ).
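The square root A1/2 can be computed from the eigenvalue decomposition of A (a sketch; the particular A is a random positive semidefinite matrix):

import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
A = B @ B.T                              # a positive semidefinite matrix

w, U = np.linalg.eigh(A)                 # A = U diag(w) U^T with w >= 0
sqrtA = U @ np.diag(np.sqrt(np.maximum(w, 0))) @ U.T

print(np.allclose(sqrtA @ sqrtA, A))     # True: (A^{1/2})^2 = A
print(np.all(np.linalg.eigvalsh(sqrtA) >= -1e-12))   # A^{1/2} is PSD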
A.7.4.B. The semidefinite cone. When adding symmetric matrices and multiplying them
by reals, we add, respectively multiply by reals, the corresponding quadratic forms. It follows
that
A.7.4.B.1: The sum of positive semidefinite matrices and a product of a positive
semidefinite matrix and a nonnegative real is positive semidefinite,
or, which is the same (see Section B.1.4),
A.7.4.B.2: n × n positive semidefinite matrices form a cone Sn+ in the Euclidean
space Sn of symmetric n × n matrices, the Euclidean structure being given by the
Frobenius inner product ⟨A, B⟩ = Tr(AB) = Σ_{i,j} Aij Bij .
The cone Sn+ is called the semidefinite cone of size n. It is immediately seen that the semidefinite
cone Sn+ is “good” (see Lecture 5), specifically,
• Sn+ is closed: the limit of a converging sequence of positive semidefinite matrices is positive
semidefinite;
• Sn+ is pointed: the only n × n matrix A such that both A and −A are positive semidefinite
is the zero n × n matrix;
• Sn+ possesses a nonempty interior which is comprised of positive definite matrices.
Note that the relation A ⪰ B means exactly that A − B ∈ Sn+ , while A ≻ B is equivalent to
A − B ∈ int Sn+ . The “matrix inequalities” A ⪰ B (A ≻ B) match the standard properties of
the usual scalar inequalities, e.g.:

    A ⪰ A    [reflexivity]
    A ⪰ B, B ⪰ A ⇒ A = B    [antisymmetry]
    A ⪰ B, B ⪰ C ⇒ A ⪰ C    [transitivity]
    A ⪰ B, C ⪰ D ⇒ A + C ⪰ B + D    [compatibility with linear operations, I]
    A ⪰ B, λ ≥ 0 ⇒ λA ⪰ λB    [compatibility with linear operations, II]
    Ai ⪰ Bi , Ai → A, Bi → B as i → ∞ ⇒ A ⪰ B    [closedness]

with evident modifications when ⪰ is replaced with ≻, or

    A ⪰ B, C ≻ D ⇒ A + C ≻ B + D,
etc. Along with these standard properties of inequalities, the inequality ⪰ possesses a nice
additional property:

A.7.4.B.3: In a valid ⪰-inequality

    A ⪰ B

one can multiply both sides from the left and from the right by a (rectangular) matrix
and its transpose:

    A, B ∈ Sn , A ⪰ B, V ∈ Mn,m
    ⇓
    V T AV ⪰ V T BV .

Indeed, we should prove that if A − B ⪰ 0, then also V T (A − B)V ⪰ 0, which is
immediate – the quadratic form y T [V T (A − B)V ]y = (V y)T (A − B)(V y) of y is
nonnegative along with the quadratic form xT (A − B)x of x.
An important additional property of the semidefinite cone is its self-duality:
Theorem A.7.6 A symmetric matrix Y has nonnegative Frobenius inner products with all pos-
itive semidefinite matrices if and only if Y itself is positive semidefinite.
Proof. “if” part: Assume that Y ⪰ 0, and let us prove that then Tr(Y X) ≥ 0 for every X ⪰ 0.
Indeed, the eigenvalue decomposition of Y can be written as

    Y = Σ_{i=1}^n λi (Y ) ei eiT ,

whence

    Tr(Y X) = Σ_{i=1}^n λi (Y ) Tr(ei eiT X) = Σ_{i=1}^n λi (Y ) eiT Xei ,    (A.7.6)

where the concluding equality is given by the following well-known property of the trace:
A.7.4.B.4: Whenever matrices A, B are such that the product AB makes sense and
is a square matrix, one has
Tr(AB) = Tr(BA).
Indeed, we should verify that if A ∈ Mp,q and B ∈ Mq,p , then Tr(AB) = Tr(BA). The
left hand side quantity in our hypothetic equality is Σ_{i=1}^p Σ_{j=1}^q Aij Bji , and the right hand side
quantity is Σ_{j=1}^q Σ_{i=1}^p Bji Aij ; they indeed are equal.
Looking at the concluding quantity in (A.7.6), we see that it indeed is nonnegative whenever
X ⪰ 0 (since Y ⪰ 0 and thus λi (Y ) ≥ 0 by P.7.5).
“only if” part: We are given Y such that Tr(Y X) ≥ 0 for all matrices X ⪰ 0, and we should
prove that Y ⪰ 0. This is immediate: for every vector x, the matrix X = xxT is positive
semidefinite (Theorem A.7.5.(iii)), so that 0 ≤ Tr(Y xxT ) = Tr(xT Y x) = xT Y x. Since the
resulting inequality xT Y x ≥ 0 is valid for every x, we have Y ⪰ 0.
Appendix B
Convex sets in Rn
A set M ⊂ Rn is called convex, if it contains, along with every pair of its points x and y, the
entire segment

    [x, y] = {z = λx + (1 − λ)y : 0 ≤ λ ≤ 1}

linking these points:

    x, y ∈ M, 0 ≤ λ ≤ 1 ⇒ λx + (1 − λ)y ∈ M.
Note that by this definition an empty set is convex (by convention, or better to say, by the
exact sense of the definition: for the empty set, you cannot present a counterexample to show
that it is not convex).
Convexity of affine subspaces immediately follows from the possibility to represent these sets as
solution sets of systems of linear equations (Proposition A.3.7), due to the following simple and
important fact:
Proposition B.1.1 The solution set of an arbitrary (possibly, infinite) system

    aTα x ≤ bα , α ∈ A    (!)

of nonstrict linear inequalities with n unknowns x – the set
S = {x ∈ Rn : aTα x ≤ bα , α ∈ A}
is convex.
In particular, the solution set of a finite system

    Ax ≤ b

of nonstrict linear inequalities (A being an m × n matrix) is convex.
Remark B.1.1 Note that every set given by Proposition B.1.1 is not only convex, but also
closed (why?). In fact, from Separation Theorem (Theorem B.2.9 below) it follows that
Every closed convex set in Rn is the solution set of a (perhaps, infinite) system of
nonstrict linear inequalities.
Remark B.1.2 Note that replacing some of the nonstrict linear inequalities aTα x ≤ bα in (!)
with their strict versions aTα x < bα , we get a system with a solution set which still is convex
(why?), but now not necessarily closed.
    ‖λx‖ = |λ| ‖x‖;
    ‖x + y‖ ≤ ‖x‖ + ‖y‖.

The unit ball of the norm ‖ · ‖ is the set

    {x ∈ E : ‖x‖ ≤ 1};

this set is convex (why?).
An important family of norms on Rn is the family of ℓp norms

    ‖x‖p = ( Σ_{i=1}^n |xi |p )^{1/p} , 1 ≤ p < ∞;    ‖x‖∞ = max_{1≤i≤n} |xi |.

These indeed are norms (which is not clear in advance). When p = 2, we get the usual Euclidean
norm; of course, you know how the Euclidean ball looks. When p = 1, we get

    ‖x‖1 = Σ_{i=1}^n |xi |,

and the unit ball is the set

    V = {x ∈ Rn : Σ_{i=1}^n |xi | ≤ 1}.

When p = ∞, we get

    ‖x‖∞ = max_{1≤i≤n} |xi |,

and the unit ball is the cube

    V = {x ∈ Rn : −1 ≤ xi ≤ 1, 1 ≤ i ≤ n}.
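These norms are available directly in NumPy (a quick illustration):

import numpy as np

x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x, 1))        # 7.0 : ||x||_1
print(np.linalg.norm(x, 2))        # 5.0 : the Euclidean norm
print(np.linalg.norm(x, np.inf))   # 4.0 : ||x||_inf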
Exercise B.2 † Prove that unit balls of norms on R^n are exactly the same as convex sets V in R^n satisfying the following three properties:
1. V is symmetric with respect to the origin: x ∈ V ⇒ −x ∈ V;
2. V is bounded and closed;
3. V contains a neighbourhood of the origin.
A set V satisfying the outlined properties is the unit ball of the norm
‖x‖ = inf{t ≥ 0 : t^{-1} x ∈ V}.
Hint: You may find it useful to verify and to exploit the following facts:
1. A norm ‖·‖ on R^n is Lipschitz continuous with respect to the standard Euclidean distance: there exists C_{‖·‖} < ∞ such that |‖x‖ − ‖y‖| ≤ C_{‖·‖} ‖x − y‖_2 for all x, y;
2. Vice versa, the Euclidean norm is Lipschitz continuous with respect to a given norm ‖·‖: there exists c_{‖·‖} < ∞ such that |‖x‖_2 − ‖y‖_2| ≤ c_{‖·‖} ‖x − y‖ for all x, y.
Another example is given by ellipsoids: let Q be a symmetric positive definite n × n matrix, a ∈ R^n and r > 0. Then the ellipsoid
{x : (x − a)^T Q(x − a) ≤ r^2}
is convex.
To see that an ellipsoid {x : (x − a)^T Q(x − a) ≤ r^2} is convex, note that since Q is positive definite, the matrix Q^{1/2} is well-defined and positive definite. Now, if ‖·‖ is a norm on R^n and P is a nonsingular n × n matrix, the function ‖Px‖ is a norm along with ‖·‖ (why?). Thus, the function ‖x‖_Q ≡ √(x^T Q x) = ‖Q^{1/2} x‖_2 is a norm along with ‖·‖_2, and the ellipsoid in question clearly is just the ‖·‖_Q-ball of radius r centered at a.
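The identity ‖x‖_Q = ‖Q^{1/2}x‖_2 underlying this argument is easy to check numerically. A sketch (assuming Python with numpy; Q is built as G^T G + 0.1 I only to get some positive definite instance):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 4
    G = rng.standard_normal((n, n))
    Q = G.T @ G + 0.1 * np.eye(n)        # symmetric positive definite

    # Q^{1/2} via the eigenvalue decomposition Q = U diag(lam) U^T
    lam, U = np.linalg.eigh(Q)
    Q_half = U @ np.diag(np.sqrt(lam)) @ U.T

    x = rng.standard_normal(n)
    norm_Q = np.sqrt(x @ Q @ x)          # ||x||_Q = sqrt(x^T Q x)
    assert np.isclose(norm_Q, np.linalg.norm(Q_half @ x))

    # membership in the ellipsoid {y : (y-a)^T Q (y-a) <= r^2} is just
    # a ||.||_Q-ball test around the center a
    a, r = rng.standard_normal(n), 2.0
    print("in ellipsoid:", np.sqrt((x - a) @ Q @ (x - a)) <= r)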
B.1.3 Inner description of convex sets: Convex combinations and convex hull
B.1.3.A. Convex combinations
Recall the notion of linear combination y of vectors y1 , ..., ym – this is a vector represented as
y = ∑_{i=1}^m λ_i y_i,
where λi are real coefficients. Specifying this definition, we have come to the notion of an affine
combination - this is a linear combination with the sum of coefficients equal to one. The last
notion in this genre is the one of convex combination.
Definition B.1.2 A convex combination of vectors y_1, ..., y_m is their affine combination with nonnegative coefficients, or, which is the same, a linear combination
y = ∑_{i=1}^m λ_i y_i
with nonnegative coefficients summing to 1: λ_i ≥ 0, ∑_{i=1}^m λ_i = 1.
B.1.4 Cones
A nonempty subset M of Rn is called conic, if it contains, along with every point x ∈ M , the
entire ray Rx = {tx : t ≥ 0} spanned by the point:
x ∈ M ⇒ tx ∈ M ∀t ≥ 0.
A convex conic set is called a cone.
Proposition B.1.5 A nonempty subset M of Rn is a cone if and only if it possesses the fol-
lowing pair of properties:
• is conic: x ∈ M, t ≥ 0 ⇒ tx ∈ M ;
• contains sums of its elements: x, y ∈ M ⇒ x + y ∈ M .
Exercise B.6 Prove Proposition B.1.5.
As an immediate consequence, we get that a cone is closed with respect to taking linear com-
binations with nonnegative coefficients of the elements, and vice versa – a nonempty set closed
with respect to taking these combinations is a cone.
Example B.1.5 The solution set of an arbitrary (possibly, infinite) system
aTα x ≤ 0, α ∈ A
of homogeneous linear inequalities with n unknowns x – the set
K = {x : aTα x ≤ 0 ∀α ∈ A}
– is a cone.
In particular, the solution set to a homogeneous finite system of m homogeneous linear
inequalities
Ax ≤ 0
(A is m × n matrix) is a cone; a cone of this latter type is called polyhedral.
Note that the cones given by systems of linear homogeneous nonstrict inequalities necessarily
are closed. From Separation Theorem B.2.9 it follows that, vice versa, every closed convex cone
is the solution set to such a system, so that Example B.1.5 is the generic example of a closed
convex cone.
Cones form a very important family of convex sets, and one can develop theory of cones
absolutely similar (and in a sense, equivalent) to that one of all convex sets. E.g., introducing
the notion of conic combination of vectors x1 , ..., xk as a linear combination of the vectors with
nonnegative coefficients, you can easily prove the following statements completely similar to
those for general convex sets, with conic combination playing the role of convex one:
• A set is a cone if and only if it is nonempty and is closed with respect to taking all
conic combinations of its elements;
• Intersection of a family of cones is again a cone; in particular, for every nonempty set
M ⊂ R^n there exists the smallest cone containing M – its conic hull Cone(M), and
this conic hull is comprised of all conic combinations of vectors from M .
In particular, the conic hull of a nonempty finite set M = {u1 , ..., uN } of vectors in Rn is
the cone
Cone(M) = {∑_{i=1}^N λ_i u_i : λ_i ≥ 0, i = 1, ..., N}.
2. Taking direct product: if M_1 ⊂ R^{n_1} and M_2 ⊂ R^{n_2} are convex sets, so is the set
M_1 × M_2 = {x = (x_1, x_2) ∈ R^{n_1+n_2} : x_1 ∈ M_1, x_2 ∈ M_2}.
3. Arithmetic summation and multiplication by reals: if M1 , ..., Mk are convex sets in Rn and
λ1 , ..., λk are arbitrary reals, then the set
λ_1 M_1 + ... + λ_k M_k = {∑_{i=1}^k λ_i x_i : x_i ∈ M_i, i = 1, ..., k}
is convex.
E.g., the closure of the open ball
{x : ‖x − a‖_2 < r} [r > 0]
is the closed ball {x : ‖x − a‖_2 ≤ r}. Another useful application example is the closure of a set
M = {x : aTα x < bα , α ∈ A}
given by strict linear inequalities: if such a set is nonempty, then its closure is given by the
nonstrict versions of the same inequalities:
cl M = {x : aTα x ≤ bα , α ∈ A}.
Nonemptiness of M in the latter example is essential: the set M given by two strict inequal-
ities
x < 0, −x < 0
in R clearly is empty, so that its closure also is empty; in contrast to this, formally applying the above rule, we would get the wrong answer
cl M = {x : x ≤ 0, x ≥ 0} = {0}.
A point x is called an interior point of a set M ⊂ R^n if M contains a ball of positive radius centered at x. The set of all interior points of M is called the interior of M [notation: int M].
E.g.,
• The interior of the closed ball {x : kx − ak2 ≤ r} is the open ball {x : kx − ak2 < r} (why?)
• The interior of a polyhedral set {x : Ax ≤ b} with matrix A not containing zero rows is
the set {x : Ax < b} (why?)
The latter statement is not, generally speaking, valid for sets of solutions of infinite systems of linear inequalities. E.g., the system of inequalities
x ≤ 1/n, n = 1, 2, ...
in R has, as its solution set, the nonpositive ray R_− = {x ≤ 0}; the interior of this ray is the negative ray {x < 0}. At the same time, the strict versions of our inequalities
x < 1/n, n = 1, 2, ...
define the same nonpositive ray, not the negative one.
It is also easily seen (this fact is valid for arbitrary metric spaces, not for R^n only) that the interior of a set is open and the closure is closed.
The interior of a set is, of course, contained in the set, which, in turn, is contained in its closure:
int M ⊂ M ⊂ cl M. (B.1.1)
The complement of the interior in the closure of M – the set
∂M = cl M \ int M
– is called the boundary of M, and the points of the boundary are called boundary points of
M (Warning: these points not necessarily belong to M , since M can be less than cl M ; in fact,
all boundary points belong to M if and only if M = cl M , i.e., if and only if M is closed).
The boundary of a set clearly is closed (as the intersection of two closed sets cl M and R^n \ int M; the latter set is closed as the complement of an open set). From the definition of the boundary,
M ⊂ int M ∪ ∂M [= cl M],
so that every point of M is either an interior point, or a boundary point of M.
A point x is called a relative interior point of a set M ⊂ R^n if M contains the intersection of Aff(M) with a ball of positive radius centered at x. The set of all relative interior points of M is called its relative interior [notation: ri M].
E.g. the relative interior of a singleton is the singleton itself (since a point in the 0-dimensional
space is the same as a ball of a positive radius); more generally, the relative interior of an affine
subspace is the set itself. The interior of a segment [x, y] (x ≠ y) in R^n is empty whenever n > 1; in contrast to this, the relative interior is nonempty independently of n and is the interval (x, y) – the segment with deleted endpoints. Geometrically speaking, the relative interior is the interior we get when we regard M as a subset of its affine hull (the latter, geometrically, is nothing but R^k, k being the affine dimension of Aff(M)).
Exercise B.8 Prove that the relative interior of a simplex with vertices y_0, ..., y_m is exactly the set {x = ∑_{i=0}^m λ_i y_i : λ_i > 0, ∑_{i=0}^m λ_i = 1}.
We can play with the notion of the relative interior in basically the same way as with the
one of interior, namely:
• since Aff(M), as every affine subspace, is closed and contains M, it contains also the smallest closed set containing M, i.e., cl M. Therefore we have the following analogies of
inclusions (B.1.1):
ri M ⊂ M ⊂ cl M [⊂ Aff(M )]; (B.1.2)
• we can define the relative boundary ∂ri M = cl M \ri M which is a closed set contained in
Aff(M ), and, as for the “actual” interior and boundary, we have
ri M ⊂ M ⊂ cl M = ri M ∪ ∂ri M.
Of course, if Aff(M) = R^n, then the relative interior becomes the usual interior, and similarly for the boundary; this for sure is the case when int M ≠ ∅ (since then M contains a ball B, and therefore the affine hull of M is the entire R^n, which is the affine hull of B).
For a general (not necessarily convex) set M, the inclusions in the chain
ri M ⊂ M ⊂ cl M
can be very “non-tight”. E.g., let M be the set of rational numbers in the segment [0, 1] ⊂ R.
Then ri M = int M = ∅ – since every neighbourhood of every rational real contains irrational
reals – while cl M = [0, 1]. Thus, ri M is “incomparably smaller” than M , cl M is “incomparably
larger”, and M is contained in its relative boundary (by the way, what is this relative boundary?).
The following proposition demonstrates that the topology of a convex set M is much better than it might be for an arbitrary set.
Theorem B.1.1 Let M be a convex set in R^n. Then
(i) the interior int M, the closure cl M and the relative interior ri M of M are convex;
(ii) if M is nonempty, then ri M is nonempty;
(iii) the closure of M is the same as the closure of its relative interior:
cl M = cl ri M;
(iv) the relative interior of M remains unchanged when M is replaced with its closure:
ri M = ri cl M.
(ii): Translating M, we may assume that 0 ∈ M, and replacing, if necessary, R^n with Lin(M), that Lin(M) = R^n. Since Lin(M) = R^n, we can find in M n linearly independent vectors a_1, ..., a_n. Let also
a0 = 0. The n + 1 vectors a0 , ..., an belong to M , and since M is convex, the convex hull of these
vectors also belongs to M . This convex hull is the set
∆ = {x = ∑_{i=0}^n λ_i a_i : λ ≥ 0, ∑_{i=0}^n λ_i = 1} = {x = ∑_{i=1}^n μ_i a_i : μ ≥ 0, ∑_{i=1}^n μ_i ≤ 1}.
The latter set is the image of the standard simplex {μ ∈ R^n : μ ≥ 0, ∑_{i=1}^n μ_i ≤ 1} under the linear transformation μ ↦ Aμ, where A is the matrix with the columns a_1, ..., a_n. The standard simplex clearly has a nonempty interior (comprised of all vectors μ > 0 with ∑_i μ_i < 1); since A is nonsingular (due to the linear independence of a_1, ..., a_n), multiplication by A maps open sets onto open ones, so that ∆ has a nonempty interior. Since ∆ ⊂ M, the interior of M is nonempty.
(iii): We should prove that the closure of ri M is exactly the same as the closure of M. In
fact we shall prove even more:
Lemma B.1.1 Let x ∈ ri M and y ∈ cl M . Then all points from the half-segment [x, y),
[x, y) = {z = (1 − λ)x + λy : 0 ≤ λ < 1}
belong to the relative interior of M .
Proof of the Lemma. Let Aff(M) = a + L, L being a linear subspace; since x ∈ M ⊂ Aff(M), we also have
M ⊂ Aff(M) = x + L.
Let B be the unit ball in L:
B = {h ∈ L : khk2 ≤ 1}.
Since x ∈ ri M , there exists positive radius r such that
x + rB ⊂ M. (B.1.3)
Now let λ ∈ [0, 1), and let z = (1 − λ)x + λy. Since y ∈ cl M, we have y = lim_{i→∞} y_i for a certain sequence of points y_i ∈ M. Setting z_i = (1 − λ)x + λy_i, we get z_i → z as i → ∞. Now, from (B.1.3) and the convexity of M it follows that the sets Z_i = {u = (1 − λ)x′ + λy_i : x′ ∈ x + rB} are contained in M; clearly, Z_i is exactly the set z_i + r′B, where r′ = (1 − λ)r > 0. Thus, z is the limit of the sequence z_i, and the r′-neighbourhood (in Aff(M)) of every one of the points z_i belongs to M. For every r″ < r′ and all i such that z_i is close enough to z, the r′-neighbourhood of z_i contains the r″-neighbourhood of z; thus, a neighbourhood (in Aff(M)) of z belongs to M, whence z ∈ ri M.
A useful byproduct of Lemma B.1.1 is as follows:
Corollary B.1.1 Let M be a convex set. Then every convex combination ∑_i λ_i x_i of points x_i ∈ cl M in which at least one term with positive coefficient corresponds to x_i ∈ ri M is a point from ri M.
(iv): The statement is evidently true when M is empty, so assume that M is nonempty. The
inclusion ri M ⊂ ri cl M is evident, and all we need is to prove the inverse inclusion. Thus, let
z ∈ ri cl M , and let us prove that z ∈ ri M . Let x ∈ ri M (we already know that the latter set is
nonempty). Consider the segment [x, z]; since z is in the relative interior of cl M , we can extend
a little bit this segment through the point z, not leaving cl M , i.e., there exists y ∈ cl M such
that z ∈ [x, y). We are done, since by Lemma B.1.1 from z ∈ [x, y), with x ∈ ri M , y ∈ cl M , it
follows that z ∈ ri M .
We see from the proof of Theorem B.1.1 that to get a closure of a (nonempty) convex set,
it suffices to subject it to the “radial” closure, i.e., to take a point x ∈ ri M , take all rays in
Aff(M ) starting at x and look at the intersection of such a ray l with M ; such an intersection
will be a convex set on the line which contains a one-sided neighbourhood of x, i.e., is either
a segment [x, yl ], or the entire ray l, or a half-interval [x, yl ). In the first two cases we should
not do anything; in the third we should add y to M . After all rays are looked through and
all ”missed” endpoints yl are added to M , we get the closure of M . To understand what
is the role of convexity here, look at the nonconvex set of rational numbers from [0, 1]; the
interior (≡ relative interior) of this ”highly percolated” set is empty, the closure is [0, 1], and
there is no way to restore the closure in terms of the interior.
Theorem B.2.1 [Caratheodory] Let M ⊂ Rn , and let dim ConvM = m. Then every point
x ∈ ConvM is a convex combination of at most m + 1 points from M .
Proof. Let x ∈ Conv M. By Proposition B.1.4 on the structure of the convex hull, x is a convex combination of certain points x_1, ..., x_N from M:
x = ∑_{i=1}^N λ_i x_i [λ_i ≥ 0, ∑_{i=1}^N λ_i = 1].
Let us choose among all these representations of x as a convex combination of points from M the
one with the smallest possible N , and let it be the above combination. I claim that N ≤ m + 1
(this claim leads to the desired statement). Indeed, if N > m + 1, then the system of m + 1 homogeneous equations
∑_{i=1}^N μ_i x_i = 0, ∑_{i=1}^N μ_i = 0
with N unknowns μ_1, ..., μ_N has a nontrivial solution δ_1, ..., δ_N:
∑_{i=1}^N δ_i x_i = 0, ∑_{i=1}^N δ_i = 0, (δ_1, ..., δ_N) ≠ 0.
It follows that for every t ≥ 0
∑_{i=1}^N (λ_i + tδ_i) x_i = x. (*)
What is to the left, is an affine combination of xi ’s. When t = 0, this is a convex combination
- all coefficients are nonnegative. When t is large, this is not a convex combination, since some
of δi ’s are negative (indeed, not all of them are zero, and the sum of δi ’s is 0). There exists, of
course, the largest t for which the combination (*) has nonnegative coefficients, namely
t* = min_{i: δ_i<0} λ_i/|δ_i|.
For this value of t, the combination (*) is with nonnegative coefficients, and at least one of the
coefficients is zero; thus, we have represented x as a convex combination of less than N points
from M , which contradicts the definition of N .
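The proof is constructive, and the reduction it describes runs directly in code. Below is a sketch (assuming Python with numpy; the helper name caratheodory_reduce is ours, and we use the ambient dimension n in place of dim Conv M): a null vector δ of the system above is found by SVD, the step t* is taken, and a point with zero coefficient is dropped.

    import numpy as np

    def caratheodory_reduce(points, lam, tol=1e-12):
        # points: N x n array; lam: nonnegative weights summing to 1
        points, lam = np.asarray(points, float), np.asarray(lam, float)
        while len(lam) > points.shape[1] + 1:
            # delta with sum_i delta_i x_i = 0, sum_i delta_i = 0:
            # a null vector of the (n+1) x N matrix [x_i; 1]
            A = np.vstack([points.T, np.ones(len(lam))])
            delta = np.linalg.svd(A)[2][-1]
            if delta.min() >= -tol:           # make sure some entries are negative
                delta = -delta
            neg = delta < -tol
            t = np.min(lam[neg] / -delta[neg])    # t* of the proof
            lam = lam + t * delta
            keep = lam > tol                  # at least one coefficient vanished
            points, lam = points[keep], lam[keep]
        return points, lam / lam.sum()        # renormalize rounding drift

    # usage: 10 random points in the plane, reduced to at most 3
    rng = np.random.default_rng(3)
    P = rng.standard_normal((10, 2))
    w = rng.random(10); w /= w.sum()
    x = w @ P
    P2, w2 = caratheodory_reduce(P, w)
    assert len(w2) <= 3 and np.allclose(w2 @ P2, x, atol=1e-8)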
Theorem B.2.2 [Radon] Let x_1, ..., x_N be points in R^n with N ≥ n + 2. Then the points can be split into two nonempty and non-overlapping groups with intersecting convex hulls.
Proof. Since N > n + 1, the homogeneous system of n + 1 scalar equations with N unknowns μ_1, ..., μ_N
∑_{i=1}^N μ_i x_i = 0, ∑_{i=1}^N μ_i = 0
has a nontrivial solution λ_1, ..., λ_N:
∑_{i=1}^N λ_i x_i = 0, ∑_{i=1}^N λ_i = 0, [(λ_1, ..., λ_N) ≠ 0].
Let I = {i : λi ≥ 0}, J = {i : λi < 0}; then I and J are nonempty and form a partitioning of
{1, ..., N}. We have
a ≡ ∑_{i∈I} λ_i = ∑_{j∈J} (−λ_j) > 0
(since the sum of all λ’s is zero and not all λ’s are zero). Setting
α_i = λ_i/a, i ∈ I, β_j = −λ_j/a, j ∈ J,
we get
α_i ≥ 0, β_j ≥ 0, ∑_{i∈I} α_i = 1, ∑_{j∈J} β_j = 1,
and
[∑_{i∈I} α_i x_i] − [∑_{j∈J} β_j x_j] = a^{-1}[∑_{i∈I} λ_i x_i] − a^{-1}[∑_{j∈J} (−λ_j) x_j] = a^{-1} ∑_{i=1}^N λ_i x_i = 0,
so that ∑_{i∈I} α_i x_i = ∑_{j∈J} β_j x_j is a common point of the convex hulls of {x_i : i ∈ I} and of {x_j : j ∈ J}.
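This computation is directly implementable: a null vector of the same homogeneous system yields the two groups and the common point. A sketch (Python with numpy assumed; the function name is ours):

    import numpy as np

    def radon_partition(points):
        # points: N x n array with N >= n + 2
        X = np.asarray(points, float)
        N, n = X.shape
        A = np.vstack([X.T, np.ones(N)])   # encodes sum lam_i x_i = 0, sum lam_i = 0
        lam = np.linalg.svd(A)[2][-1]      # nontrivial null vector
        I, J = lam > 0, lam < 0
        a = lam[I].sum()                   # equals the sum of -lam_j over J
        point = (lam[I] @ X[I]) / a        # the common point of the two hulls
        assert np.allclose(point, (-lam[J]) @ X[J] / a, atol=1e-8)
        return I, J, point

    rng = np.random.default_rng(4)
    P = rng.standard_normal((4, 2))        # N = n + 2 points in the plane
    I, J, z = radon_partition(P)
    print(np.where(I)[0], np.where(J)[0], z)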
Theorem B.2.3 [Helley, I] Let S_1, ..., S_N be N convex sets in R^n. Assume that every n + 1 sets of the family have a point in common. Then all the sets have a point in common.
Proof. Let us prove the statement by induction on the number N of sets in the family. The case of N ≤ n + 1 is evident. Now assume that the statement holds true for all families with a certain number N ≥ n + 1 of sets, and let S_1, ..., S_N, S_{N+1} be a family of N + 1 convex sets which satisfies the premise of the Helley Theorem; we should prove that the intersection of the sets S_1, ..., S_N, S_{N+1} is nonempty.
Deleting from our (N + 1)-set family the set S_i, we get an N-set family which satisfies the premise
of the Helley Theorem and thus, by the inductive hypothesis, the intersection of its members is
nonempty:
(∀i ≤ N + 1): T^i = S_1 ∩ S_2 ∩ ... ∩ S_{i−1} ∩ S_{i+1} ∩ ... ∩ S_{N+1} ≠ ∅.
Exercise B.9 Let S1 , ..., SN be a family of N convex sets in Rn , and let m be the affine di-
mension of Aff(S1 ∪ ... ∪ SN ). Assume that every m + 1 sets from the family have a point in
common. Prove that all sets from the family have a point in common.
In the aforementioned version of the Helley Theorem we dealt with finite families of convex
sets. To extend the statement to the case of infinite families, we need to strengthen slightly
the assumption. The resulting statement is as follows:
Theorem B.2.4 [Helley, II] Let F be an arbitrary family of convex sets in Rn . Assume
that
(a) every n + 1 sets from the family have a point in common,
and
(b) every set in the family is closed, and the intersection of the sets from certain finite
subfamily of the family is bounded (e.g., one of the sets in the family is bounded).
Then all the sets from the family have a point in common.
Proof. By the previous theorem, all finite subfamilies of F have nonempty intersections, and
these intersections are convex (since intersection of a family of convex sets is convex, Theorem
B.1.3); in view of (b) these intersections are also closed. Adding to F all intersections of finite subfamilies of F, we get a larger family F′ comprised of closed convex sets, and every finite subfamily of this larger family again has a nonempty intersection. Besides this, from (b) it follows that this new family contains a bounded set Q. Since all the sets are closed, the family of sets
{Q ∩ Q′ : Q′ ∈ F′}
is a nested family of compact sets (i.e., a family of compact sets with nonempty intersection
of sets from every finite subfamily); by the well-known Analysis theorem such a family has
a nonempty intersection 1).
1) Here is the proof of this Analysis theorem: assume, on the contrary, that the compact sets Q_α, α ∈ A, have empty intersection. Choose a set Q_{α*} from the family; for every x ∈ Q_{α*} there is a set Q_x in the family which does not contain x – otherwise x would be a common point of all our sets. Since Q_x is closed, there is an open ball V_x centered at x which does not intersect Q_x. The balls V_x, x ∈ Q_{α*}, form an open covering of the compact set Q_{α*}, and therefore there exists a finite subcovering V_{x_1}, ..., V_{x_N} of Q_{α*} by balls from the covering. Since Q_{x_i} does not intersect V_{x_i}, we conclude that the intersection of the finite subfamily Q_{α*}, Q_{x_1}, ..., Q_{x_N} is empty, which is a contradiction.
X = {x ∈ Rn : Ax ≤ b} = {x ∈ Rn : aTi x ≤ bi , 1 ≤ i ≤ m}.
We shall call such a representation of X its polyhedral description. A polyhedral set always is
convex and closed (Proposition B.1.1). Now let us introduce the notion of polyhedral represen-
tation of a set X ⊂ Rn .
X = {x ∈ Rn : ∃u ∈ Rk : Ax + Bu ≤ c} (B.2.1)
Note that every polyhedrally representable set is the image under linear mapping (even a pro-
jection) of a polyhedral, and thus convex, set. It follows that a polyhedrally representable set
definitely is convex (Proposition B.1.6).
Examples: 1) Every polyhedral set X = {x ∈ Rn : Ax ≤ b} is polyhedrally representable – a
polyhedral description of X is nothing but a polyhedral representation with no slack variables
2) Looking at the set X = {x ∈ R^n : ∑_{i=1}^n |x_i| ≤ 1}, we cannot say immediately whether it is or is not polyhedral; at least the initial description of X is not of the form {x : Ax ≤ b}. However, X admits a polyhedral representation, e.g., the representation
X = {x ∈ R^n : ∃u ∈ R^n : −u_i ≤ x_i ≤ u_i (⇔ |x_i| ≤ u_i), 1 ≤ i ≤ n, ∑_{i=1}^n u_i ≤ 1}. (B.2.2)
Note that the set X in question can be described by a system of linear inequalities in x-variables
only, namely, as
X = {x ∈ R^n : ∑_{i=1}^n ε_i x_i ≤ 1 ∀(ε_i = ±1, 1 ≤ i ≤ n)},
that is, X is polyhedral. However, the above polyhedral description of X (which in fact is minimal in terms of the number of inequalities involved) requires 2^n inequalities – an astronomically large number when n is just a few tens. In contrast to this, the polyhedral representation (B.2.2) of the same set requires just n slack variables u and 2n + 1 linear inequalities on x, u – the “complexity” of this representation is just linear in n.
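The economy of (B.2.2) is easy to see in code: testing x ∈ X via the representation is a tiny feasibility LP in the slack variables u, whereas the slack-free description needs all 2^n sign patterns. A sketch (assuming Python with numpy and scipy):

    import numpy as np
    from scipy.optimize import linprog

    def in_l1_ball(x):
        # membership via the representation (B.2.2): feasibility in u of
        # x - u <= 0, -x - u <= 0, sum u <= 1
        n = len(x)
        I = np.eye(n)
        A_ub = np.vstack([-I, -I, np.ones((1, n))])
        b_ub = np.concatenate([-x, x, [1.0]])
        res = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
        return res.status == 0                 # 0 = a feasible optimum was found

    x_in, x_out = np.array([0.3, -0.4, 0.2]), np.array([0.8, -0.4, 0.2])
    print(in_l1_ball(x_in), in_l1_ball(x_out))  # True False
    # the slack-free description would need 2**3 = 8 inequalities here,
    # and 2**50 of them for n = 50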
3) Let a_1, ..., a_m be given vectors in R^n. Consider the conic hull of the finite set {a_1, ..., a_m} – the set Cone{a_1, ..., a_m} = {x = ∑_{i=1}^m λ_i a_i : λ ≥ 0} (see Section B.1.4). It is absolutely unclear whether this set is polyhedral. In contrast to this, its polyhedral representation is immediate:
Cone{a_1, ..., a_m} = {x ∈ R^n : ∃λ ∈ R^m : −λ ≤ 0, x − ∑_{i=1}^m λ_i a_i ≤ 0, −x + ∑_{i=1}^m λ_i a_i ≤ 0}.
In other words, the original description of X is nothing but its polyhedral representation (in
slight disguise), with λi ’s in the role of slack variables.
Let
Y = {(x, u) : Ax + bu ≤ c}
be a polyhedral set with n variables x and a single variable u; we want to prove that the projection
X = {x : ∃u : Ax + bu ≤ c}
of Y onto the space of x-variables is polyhedral. To see this, let us split the inequalities defining Y
into three groups (some of them can be empty):
— “black” inequalities — those with bi = 0; these inequalities do not involve u at all;
— “red” inequalities – those with b_i > 0. Such an inequality can be rewritten equivalently as u ≤ b_i^{-1}[c_i − a_i^T x], and it imposes a (depending on x) upper bound on u;
— “green” inequalities – those with b_i < 0. Such an inequality can be rewritten equivalently as u ≥ b_i^{-1}[c_i − a_i^T x], and it imposes a (depending on x) lower bound on u.
Now it is clear when x ∈ X, that is, when x can be extended, by some u, to a point (x, u) from
Y : this is the case if and only if, first, x satisfies all black inequalities, and, second, the red
upper bounds on u specified by x are compatible with the green lower bounds on u specified by
x, meaning that every lower bound is ≤ every upper bound (the latter is necessary and sufficient
to be able to find a value of u which is ≥ all lower bounds and ≤ all upper bounds). Thus,
X = {x : a_i^T x ≤ c_i for all “black” indexes i (those with b_i = 0), and
b_j^{-1}[c_j − a_j^T x] ≤ b_k^{-1}[c_k − a_k^T x] for all “green” (b_j < 0) indexes j and all “red” (b_k > 0) indexes k}.
We see that X is given by finitely many nonstrict linear inequalities in x-variables only, as
claimed.
The outlined procedure for building polyhedral descriptions (i.e., polyhedral representations
not involving slack variables) for projections of polyhedral sets is called Fourier-Motzkin elimi-
nation.
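The black/red/green recipe is a complete algorithm, and one elimination step fits in a few lines. A sketch (assuming Python with numpy; the function name is ours) that eliminates the last variable from a system Ay ≤ c:

    import numpy as np

    def fm_eliminate_last(A, c):
        # one Fourier-Motzkin step: project {y : A y <= c} in R^{n+1}
        # onto the first n coordinates by eliminating the last variable
        a, b = A[:, :-1], A[:, -1]
        rows = [a[i] for i in np.where(b == 0)[0]]     # "black": no u at all
        rhs  = [c[i] for i in np.where(b == 0)[0]]
        for j in np.where(b < 0)[0]:                   # "green": lower bounds on u
            for k in np.where(b > 0)[0]:               # "red": upper bounds on u
                # normalized green bound <= normalized red bound cancels u
                rows.append(a[k] / b[k] - a[j] / b[j])
                rhs.append(c[k] / b[k] - c[j] / b[j])
        if not rows:                                   # projection is all of R^n
            return np.zeros((0, A.shape[1] - 1)), np.zeros(0)
        return np.array(rows), np.array(rhs)

    # usage: eliminate u from {x + u <= 1, -x + u <= 1, -u <= 0};
    # the projection should come out as {x <= 1, -x <= 1}
    A = np.array([[1.0, 1.0], [-1.0, 1.0], [0.0, -1.0]])
    c = np.array([1.0, 1.0, 0.0])
    A1, c1 = fm_eliminate_last(A, c)
    print(A1, c1)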
this value to a pair (t, x) with t = a = c^T x and Ax ≤ b, that is, we can augment the optimal value by an optimal solution. Thus, we can say that Fourier-Motzkin elimination is a finite Real Arithmetics algorithm which allows us to check whether an LP is feasible and bounded and, when this is the case, to find the optimal value and an optimal solution. An unpleasant fact
of life is that this algorithm is completely impractical, since the elimination process can blow
up exponentially the number of inequalities. Indeed, from the description of the process it is
clear that if a polyhedral set is given by m linear inequalities, then, eliminating one variable, we can end up with as many as m²/4 inequalities (this is what happens if there are m/2 red, m/2 green and no black inequalities). Eliminating the next variable, we again can “nearly square” the number of inequalities, and so on. Thus, the number of inequalities in the description of X can become astronomically large even when the dimension of x is something like 10. The
actual importance of Fourier-Motzkin elimination is of theoretical nature. For example, the LP-
related reasoning we have just carried out shows that every feasible and bounded LP program is
solvable – has an optimal solution (we shall revisit this result in more detail in Section B.2.9.B).
This is a fundamental fact for LP, and the above reasoning (even with the justification of the
elimination “charged” to it) is the shortest and most transparent way to prove this fundamental
fact. Another application of the fact that polyhedrally representable sets are polyhedral is the
Homogeneous Farkas Lemma to be stated and proved in Section B.2.5.A; this lemma will be
instrumental in numerous subsequent theoretical developments.
1. Taking intersection: let M_1, ..., M_m be polyhedral sets given by polyhedral representations
M_i = {x ∈ R^n : ∃u^i ∈ R^{k_i} : A_i x + B_i u^i ≤ c_i}, 1 ≤ i ≤ m.
Then the intersection of the sets M_i is polyhedral with an explicit polyhedral representation, specifically,
⋂_{i=1}^m M_i = {x ∈ R^n : ∃u = (u^1, ..., u^m) ∈ R^{k_1+...+k_m} : A_i x + B_i u^i ≤ c_i, 1 ≤ i ≤ m},
where the constraints form a system of nonstrict linear inequalities in x, u.
4. Taking the image under an affine mapping: Let M ⊂ R^n be a polyhedral set given by polyhedral representation
M = {x ∈ R^n : ∃u ∈ R^k : Ax + Bu ≤ c},
and let x ↦ Px + p be an affine mapping from R^n to R^m. Then the image of M under the mapping is the polyhedral set
PM + p = {y ∈ R^m : ∃(x, u) : y − Px − p ≤ 0, Px + p − y ≤ 0, Ax + Bu ≤ c}.
5. Taking the inverse image under an affine mapping: Let M ⊂ R^n be a polyhedral set given by polyhedral representation
M = {x ∈ R^n : ∃u ∈ R^k : Ax + Bu ≤ c},
and let y ↦ Py + p be an affine mapping from R^m to R^n. Then the inverse image of M under the mapping is the polyhedral set
P^{-1}(M) = {y ∈ R^m : ∃u : A(Py + p) + Bu ≤ c}.
Note that rules for intersection, taking direct products and taking inverse images, as applied to
polyhedral descriptions of operands, lead to polyhedral descriptions of the results. In contrast to
this, the rules for taking sums with coefficients and images under affine mappings heavily exploit
the notion of polyhedral representation: even when the operands in these rules are given by
polyhedral descriptions, there are no simple ways to point out polyhedral descriptions of the
results.
Finally, we note that the problem of minimizing a linear form c^T x over a set M given by a polyhedral representation
M = {x ∈ R^n : ∃u ∈ R^k : Ax + Bu ≤ c}
reduces immediately to an explicit LP program, namely, min_{x,u} {c^T x : Ax + Bu ≤ c}.
A reader with some experience in Linear Programming has definitely made extensive use of the above “calculus of polyhedral representations” when building LPs (perhaps without a clear understanding of what in fact is going on, just as Molière's Monsieur Jourdain spoke prose all his life without knowing it).
Clearly, if a vector a is a conic combination of vectors a_1, ..., a_N, then every vector h which has nonnegative inner products with all a_i should also have nonnegative inner product with a:
a = ∑_i λ_i a_i & λ_i ≥ 0 ∀i & h^T a_i ≥ 0 ∀i ⇒ h^T a ≥ 0.
The Homogeneous Farkas Lemma says that this evident necessary condition is also sufficient:
Lemma B.2.1 [Homogeneous Farkas Lemma] Let a, a1 , ..., aN be vectors from Rn . The vector
a is a conic combination of the vectors ai (linear combination with nonnegative coefficients) if
and only if every vector h satisfying h^T a_i ≥ 0, i = 1, ..., N, satisfies also h^T a ≥ 0. In other words, a homogeneous linear inequality
a^T h ≥ 0
is a consequence of a system
a_i^T h ≥ 0, 1 ≤ i ≤ N,
of homogeneous linear inequalities if and only if it can be obtained from the inequalities of the system by “admissible linear aggregation” – taking their weighted sum with nonnegative weights.
Proof. The necessity – the “only if” part of the statement – was proved before the Farkas
Lemma was formulated. Let us prove the “if” part of the Lemma. Thus, assume that every
vector h satisfying hT ai ≥ 0 ∀i satisfies also hT a ≥ 0, and let us prove that a is a conic
combination of the vectors ai .
An “intelligent” proof goes as follows. The set Cone {a1 , ..., aN } of all conic combinations of
a1 , ..., aN is polyhedrally representable (Example 3 in Section B.2.5.A.1) and as such is polyhe-
dral (Theorem B.2.5):
Cone{a_1, ..., a_N} = {x ∈ R^n : p_j^T x ≥ b_j, 1 ≤ j ≤ J} (!)
for properly chosen vectors p_j and reals b_j.
Observing that 0 ∈ Cone{a_1, ..., a_N}, we conclude that b_j ≤ 0 for all j; and since λa_i ∈ Cone{a_1, ..., a_N} for every i and every λ ≥ 0, we should have λ p_j^T a_i ≥ b_j for all i, j and all λ ≥ 0, whence p_j^T a_i ≥ 0 for all i and j. For every j, the relation p_j^T a_i ≥ 0 for all i implies, by the premise of the statement we want to prove, that p_j^T a ≥ 0, and since b_j ≤ 0, we see that p_j^T a ≥ b_j for all j, meaning that a indeed belongs to Cone{a_1, ..., a_N} due to (!).
An interested reader can get a better understanding of the power of Fourier-Motzkin elim-
ination, which ultimately is the basis for the above intelligent proof, by comparing this proof
with the one based on Helley’s Theorem.
Proof based on Helley’s Theorem. As above, we assume that every vector h satisfying
hT ai ≥ 0 ∀i satisfies also hT a ≥ 0, and we want to prove that a is a conic combination of the
vectors ai .
There is nothing to prove when a = 0 – the zero vector of course is a conic combination of
the vectors ai . Thus, from now on we assume that a 6= 0.
1°. Let
Π = {h : aT h = −1},
and let
Ai = {h ∈ Π : aTi h ≥ 0}.
Π is a hyperplane in Rn , and every Ai is a polyhedral set contained in this hyperplane and is
therefore convex.
2°. What we know is that the intersection of all the sets A_i, i = 1, ..., N, is empty (since a
vector h from the intersection would have nonnegative inner products with all ai and the inner
product −1 with a, and we are given that no such h exists). Let us choose the smallest, in the
number of elements, of those sub-families of the family of sets A1 , ..., AN which still have empty
intersection of their members; without loss of generality we may assume that this is the family
A1 , ..., Ak . Thus, the intersection of all k sets A1 , ..., Ak is empty, but the intersection of every
k − 1 sets from the family A1 , ..., Ak is nonempty.
3°. We claim that
(A) a ∈ E ≡ Lin({a_1, ..., a_k});
(B) the vectors a_1, ..., a_k are linearly independent.
(A) is easy: assuming that a ∉ E = Lin({a_1, ..., a_k}), we conclude that the orthogonal projection f of the vector a onto the orthogonal complement E^⊥ of E is nonzero.
The inner product of f and a is the same as f^T f, i.e., is positive, while f^T a_i = 0, i = 1, ..., k. Taking h = −(f^T f)^{-1} f, we see that h^T a = −1 and h^T a_i = 0, i = 1, ..., k.
In other words, h belongs to every set Ai , i = 1, ..., k, by definition of these sets, and
therefore the intersection of the sets A1 , ..., Ak is nonempty, which is a contradiction.
(B) is given by the Helley Theorem I. (B) is evident when k = 1, since in this case linear dependence of a_1, ..., a_k would mean that a_1 = 0; by (A), this implies that
a = 0, which is not the case. Now let us prove (B) in the case of k > 1. Assume,
on the contrary to what should be proven, that a1 , ..., ak are linearly dependent, so
that the dimension of E = Lin({a_1, ..., a_k}) is a certain m < k. We already know from (A) that a ∈ E. Now let A′_i = A_i ∩ E. We claim that every k − 1 of the sets A′_i have a nonempty intersection, while all k of these sets have empty intersection. The second claim is evident – since the sets A_1, ..., A_k have empty intersection, the same is the case with their parts A′_i. The first claim also is easily supported: let us take k − 1 of the dashed sets, say, A′_1, ..., A′_{k−1}. By construction, the intersection of A_1, ..., A_{k−1} is nonempty; let h be a vector from this intersection, i.e., a vector with nonnegative inner products with a_1, ..., a_{k−1} and the product −1 with a. Replacing h with its orthogonal projection h′ on E, we do not vary all these inner products, since these are products with vectors from E; thus, h′ also is a common point of A_1, ..., A_{k−1}, and since this is a point from E, it is a common point of the dashed sets A′_1, ..., A′_{k−1} as well.
Now we can complete the proof of (B): the sets A′_1, ..., A′_k are convex sets belonging to the hyperplane Π′ = Π ∩ E = {h ∈ E : a^T h = −1} (Π′ indeed is a hyperplane in E, since 0 ≠ a ∈ E) in the m-dimensional linear subspace E. Π′ is an affine subspace of affine dimension ℓ = dim E − 1 = m − 1 < k − 1 (recall that we are in the situation where m = dim E < k), and every ℓ + 1 ≤ k − 1 sets from the family A′_1, ..., A′_k have a nonempty intersection. From the Helley Theorem I (see Exercise B.9) it follows that all the sets A′_1, ..., A′_k have a point in common, which, as we know, is not the case. The contradiction we have obtained proves that a_1, ..., a_k are linearly independent.
4°. With (A) and (B) at our disposal, we can easily complete the proof of the “if” part of the Farkas Lemma. Specifically, by (A), we have
a = ∑_{i=1}^k λ_i a_i
with some real coefficients λi , and all we need is to prove that these coefficients are nonnegative.
Assume, on the contrary, that, say, λ1 < 0. Let us extend the (linearly independent in view
of (B)) system of vectors a1 , ..., ak by vectors f1 , ..., fn−k to a basis in Rn , and let ξi (x) be the
coordinates of a vector x in this basis. The function ξ_1(x) is a linear form of x and therefore is the inner product with a certain vector f:
ξ_1(x) = f^T x ∀x.
Now we have
f T a = ξ1 (a) = λ1 < 0
and
f^T a_1 = ξ_1(a_1) = 1, f^T a_i = ξ_1(a_i) = 0, i = 2, ..., k,
so that f^T a_i ≥ 0, i = 1, ..., k. We conclude that a proper normalization of f – namely, the vector |λ_1|^{-1} f – belongs to the intersection of A_1, ..., A_k, which is the desired contradiction – by construction, this intersection is empty.
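Numerically, the dichotomy of the Lemma can be realized with nonnegative least squares, which computes the projection of a onto Cone{a_1, ..., a_N}: either the residual is zero and the weights λ give the conic combination, or the residual itself provides a separating h. A sketch (assuming Python with numpy and scipy; the function name is ours):

    import numpy as np
    from scipy.optimize import nnls

    def farkas_alternative(a_list, a, tol=1e-9):
        # returns ('conic', lam) with a = sum lam_i a_i, lam >= 0,
        # or ('separator', h) with h^T a_i >= 0 for all i and h^T a < 0
        A = np.column_stack(a_list)
        lam, rnorm = nnls(A, a)        # min ||A lam - a|| over lam >= 0
        if rnorm <= tol:
            return "conic", lam
        h = A @ lam - a                # points from a to its projection on the cone
        assert all(h @ ai >= -tol for ai in a_list) and h @ a < 0
        return "separator", h

    a1, a2 = np.array([1.0, 0.0]), np.array([1.0, 1.0])
    print(farkas_alternative([a1, a2], np.array([2.0, 1.0])))   # inside: weights
    print(farkas_alternative([a1, a2], np.array([-1.0, 1.0])))  # outside: h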
Knowing how to answer the question (?), we are able to answer many other questions. E.g., to
verify whether a given real a is a lower bound on the optimal value c∗ of (LP) is the same as to
verify whether the system
−c^T x + a > 0, Ax − b ≥ 0
has no solutions.
The general question above is too difficult, and it makes sense to pass from it to a seemingly
simpler one:
(??) How to certify that (S) has, or does not have, a solution.
Imagine that you are very smart and know the correct answer to (?); how could you convince
me that your answer is correct? What could be an “evident for everybody” validity certificate
for your answer?
If your claim is that (S) is solvable, a certificate could be just to point out a solution x∗ to
(S). Given this certificate, one can substitute x∗ into the system and check whether x∗ indeed
is a solution.
Assume now that your claim is that (S) has no solutions. What could be a “simple certificate”
of this claim? How could one certify a negative statement? This is a highly nontrivial problem not only in mathematics; consider, for example, criminal law: how should someone accused of a murder prove his innocence? The “real life” answer to the question “how to certify a negative statement” is discouraging: such a statement normally cannot be certified (this is where the rule “a person is presumed innocent until proven guilty” comes from). In mathematics, however, the situation
is different: in some cases there exist “simple certificates” of negative statements. E.g., in order
to certify that (S) has no solutions, it suffices to demonstrate that a consequence of (S) is a
contradictory inequality such as
−1 ≥ 0.
For example, assume that λi , i = 1, ..., m, are nonnegative weights. Combining inequalities from
(S) with these weights, we come to the inequality
∑_{i=1}^m λ_i f_i(x) Ω 0 (Comb(λ))
where Ω is either ” > ” (this is the case when the weight of at least one strict inequality from
(S) is positive), or ” ≥ ” (otherwise). Since the resulting inequality, due to its origin, is a
consequence of the system (S), i.e., it is satisfied by every solution to (S), it follows that if
(Comb(λ)) has no solutions at all, we can be sure that (S) has no solution. Whenever this is
the case, we may treat the corresponding vector λ as a “simple certificate” of the fact that (S)
is infeasible.
Let us look at what the outlined approach means when (S) is comprised of linear inequalities:
(S): {a_i^T x Ω_i b_i, i = 1, ..., m}, where Ω_i is either “>” or “≥”.
Here the “combined inequality” is linear as well:
(Comb(λ)): (∑_{i=1}^m λ_i a_i)^T x Ω ∑_{i=1}^m λ_i b_i
(Ω is ” > ” whenever λi > 0 for at least one i with Ωi = ” > ”, and Ω is ” ≥ ” otherwise). Now,
when can a linear inequality
dT x Ω e
be contradictory? Of course, it can happen only when d = 0. Whether the inequality is contradictory in this case depends on the relation Ω: if Ω = “>”, then the inequality is contradictory if and only if e ≥ 0, and if Ω = “≥”, it is contradictory if and only if e > 0. We
have established the following simple result:
Proposition B.2.1 Consider a system of linear inequalities
(S):
a_i^T x > b_i, i = 1, ..., m_s,
a_i^T x ≥ b_i, i = m_s + 1, ..., m,
with n-dimensional vector of unknowns x. Let us associate with (S) two systems of linear
inequalities and equations with m-dimensional vector of unknowns λ:
T_I:
(a) λ ≥ 0;
(b) ∑_{i=1}^m λ_i a_i = 0;
(c_I) ∑_{i=1}^m λ_i b_i ≥ 0;
(d_I) ∑_{i=1}^{m_s} λ_i > 0;

T_II:
(a) λ ≥ 0;
(b) ∑_{i=1}^m λ_i a_i = 0;
(c_II) ∑_{i=1}^m λ_i b_i > 0.
Assume that at least one of the systems TI , TII is solvable. Then the system (S) is infeasible.
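Since T_I and T_II are themselves systems of linear constraints, a solver can search for a certificate λ mechanically. A sketch for the case of nonstrict inequalities only (m_s = 0), where T_II is the relevant system (assuming Python with numpy and scipy; the normalization ∑λ_i ≤ 1 is our arbitrary way to keep the auxiliary LP bounded):

    import numpy as np
    from scipy.optimize import linprog

    def infeasibility_certificate(A, b):
        # look for lam >= 0 with A^T lam = 0 and b^T lam > 0, certifying
        # that the nonstrict system A x >= b has no solutions
        m = len(b)
        res = linprog(-b,                                  # maximize b^T lam
                      A_eq=A.T, b_eq=np.zeros(A.shape[1]),
                      A_ub=np.ones((1, m)), b_ub=[1.0],    # sum lam <= 1
                      bounds=(0, None))
        if res.status == 0 and -res.fun > 1e-9:
            return res.x
        return None                                        # no certificate found

    A = np.array([[1.0], [-1.0]])    # system: x >= 1 and -x >= 0 (infeasible)
    b = np.array([1.0, 0.0])
    print(infeasibility_certificate(A, b))
    # e.g. (0.5, 0.5): aggregating the rows with these weights gives 0 >= 0.5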
B.2.5.B.2 General Theorem on Alternative. Proposition B.2.1 says that in some cases it
is easy to certify infeasibility of a linear system of inequalities: a “simple certificate” is a solution
to another system of linear inequalities. Note, however, that the existence of a certificate of this
latter type is to the moment only a sufficient, but not a necessary, condition for the infeasibility
of (S). A fundamental result in the theory of linear inequalities is that the sufficient condition
in question is in fact also necessary:
Theorem B.2.6 [General Theorem on Alternative] In the notation from Proposition B.2.1,
system (S) has no solutions if and only if either TI , or TII , or both these systems, are solvable.
Proof. GTA is a more or less straightforward corollary of the Homogeneous Farkas Lemma.
Indeed, in view of Proposition B.2.1, all we need to prove is that if (S) has no solution, then
at least one of the systems TI , or TII is solvable. Thus, assume that (S) has no solutions, and
let us look at the consequences. Let us associate with (S) the following system of homogeneous linear inequalities in variables x, τ, ε:
(a) τ − ε ≥ 0,
(b) a_i^T x − b_i τ − ε ≥ 0, i = 1, ..., m_s, (B.2.3)
(c) a_i^T x − b_i τ ≥ 0, i = m_s + 1, ..., m.
I claim that (S) has no solutions if and only if the homogeneous linear inequality
−ε ≥ 0 (B.2.4)
is a consequence of the system (B.2.3).
Recall that by their origin, ν and all λ_i are nonnegative. Now, it may happen that λ_1, ..., λ_{m_s} are zero. In this case ν > 0 by (B.2.5.c), and relations (B.2.5.a−b) say that λ_1, ..., λ_m solve T_II. In the remaining case (that is, when not all λ_1, ..., λ_{m_s} are zero, or, which is the same, when ∑_{i=1}^{m_s} λ_i > 0), the same relations say that λ_1, ..., λ_m solve T_I.
B.2.5.C.1 Dual to an LP program: the origin The motivation for constructing the
problem dual to an LP program
c* = min_x {c^T x : Ax − b ≥ 0}, where A = [a_1^T; a_2^T; ...; a_m^T] ∈ R^{m×n}, (LP)
is the desire to generate, in a systematic way, lower bounds on the optimal value c∗ of (LP).
An evident way to bound from below a given function f (x) in the domain given by system of
inequalities
gi (x) ≥ bi , i = 1, ..., m, (B.2.6)
is offered by what is called the Lagrange duality and is as follows:
Lagrange Duality:
• Let us look at all inequalities which can be obtained from (B.2.6) by linear aggre-
gation, i.e., at the inequalities of the form
∑_i y_i g_i(x) ≥ ∑_i y_i b_i (B.2.7)
with the “aggregation weights” yi ≥ 0. Note that the inequality (B.2.7), due to its
origin, is valid on the entire set X of solutions of (B.2.6).
• Depending on the choice of aggregation weights, it may happen that the left hand
side in (B.2.7) is ≤ f(x) for all x ∈ R^n. Whenever it is the case, the right hand side ∑_i y_i b_i of (B.2.7) is a lower bound on f in X.
Indeed, on X the quantity ∑_i y_i b_i is a lower bound on ∑_i y_i g_i(x), and for y in question the latter function of x is everywhere ≤ f(x).
It follows that
• The optimal value in the problem
max_y {∑_i y_i b_i : y ≥ 0 (a), ∑_i y_i g_i(x) ≤ f(x) ∀x ∈ R^n (b)} (B.2.8)
is a lower bound on the values of f on the set of solutions to the system (B.2.6).
Let us look what happens with the Lagrange duality when f and gi are homogeneous linear
functions: f = c^T x, g_i(x) = a_i^T x. In this case, the requirement (B.2.8.b) merely says that c = ∑_i y_i a_i (or, which is the same, A^T y = c due to the origin of A). Thus, problem (B.2.8) becomes the Linear Programming problem
max_y {b^T y : A^T y = c, y ≥ 0}, (LP*)
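The primal-dual pair is easy to experiment with numerically. A sketch (assuming Python with numpy and scipy; note that scipy's linprog minimizes subject to A_ub x ≤ b_ub, so (LP) is passed as −Ax ≤ −b, and the data are sampled so that the primal is feasible and bounded below):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(5)
    m, n = 6, 3
    A = rng.standard_normal((m, n))
    b = A @ rng.random(n) - rng.random(m)   # Ax >= b is feasible by construction
    c = A.T @ rng.random(m)                 # c = A^T y0 with y0 >= 0: bounded below

    # primal (LP): min c^T x s.t. Ax >= b
    primal = linprog(c, A_ub=-A, b_ub=-b, bounds=(None, None))
    # dual (LP*): max b^T y s.t. A^T y = c, y >= 0
    dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=(0, None))

    print(primal.fun, -dual.fun)            # coincide up to solver tolerance
    assert abs(primal.fun - (-dual.fun)) < 1e-6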
[Weak Duality] The optimal value in (LP∗ ) is less than or equal to the optimal value
in (LP).
In fact, the “less than or equal to” in the latter statement is “equal”, provided that the optimal value c* in (LP) is a number (i.e., (LP) is feasible and bounded below). To see that this indeed is the case, note that a real a is a lower bound on c* if and only if c^T x ≥ a whenever Ax ≥ b, or, which is the same, if and only if the system of linear inequalities
(S_a): −c^T x + a > 0, Ax − b ≥ 0
has no solution. We know by the Theorem on Alternative that the latter fact means that some other system of linear inequalities (more exactly, at least one of a certain pair of systems) does have a solution. More precisely,
(*) (S_a) has no solutions if and only if at least one of the following two systems with m + 1 unknowns λ = (λ_0, λ_1, ..., λ_m):

T_I:
(a) λ = (λ_0, λ_1, ..., λ_m) ≥ 0;
(b) −λ_0 c + ∑_{i=1}^m λ_i a_i = 0;
(c_I) −λ_0 a + ∑_{i=1}^m λ_i b_i ≥ 0;
(d_I) λ_0 > 0,

or

T_II:
(a) λ = (λ_0, λ_1, ..., λ_m) ≥ 0;
(b) −λ_0 c + ∑_{i=1}^m λ_i a_i = 0;
(c_II) −λ_0 a + ∑_{i=1}^m λ_i b_i > 0

– has a solution.
Now assume that (LP) is feasible. We claim that under this assumption (Sa ) has no solutions
if and only if TI has a solution.
The implication ”TI has a solution ⇒ (Sa ) has no solution” is readily given by the above
remarks. To verify the converse implication, assume that (S_a) has no solutions and the system Ax ≥ b has a solution, and let us prove that then T_I has a solution. If T_I has no solution, then
by (*) TII has a solution and, moreover, λ0 = 0 for (every) solution to TII (since a solution
to the latter system with λ0 > 0 solves TI as well). But the fact that TII has a solution λ
with λ0 = 0 is independent of the values of a and c; if this fact would take place, it would
mean, by the same Theorem on Alternative, that, e.g., the following instance of (S_a):
0^T x > −1, Ax ≥ b
has no solutions. The latter means that the system Ax ≥ b has no solutions – a contradiction
with the assumption that (LP) is feasible.
Now, if TI has a solution, this system has a solution with λ0 = 1 as well (to see this, pass from
a solution λ to the one λ/λ0 ; this construction is well-defined, since λ0 > 0 for every solution
to TI ). Now, an (m + 1)-dimensional vector λ = (1, y) is a solution to TI if and only if the
m-dimensional vector y solves the system of linear inequalities and equations
y ≥ 0;
A^T y ≡ ∑_{i=1}^m y_i a_i = c; (D)
b^T y ≥ a.
Proposition B.2.2 Assume that system (D) associated with the LP program (LP) has a solu-
tion (y, a). Then a is a lower bound on the optimal value in (LP). Vice versa, if (LP) is feasible
and a is a lower bound on the optimal value of (LP), then a can be extended by a properly chosen
m-dimensional vector y to a solution to (D).
We see that the entity responsible for lower bounds on the optimal value of (LP) is the system
(D): every solution to the latter system induces a bound of this type, and in the case when
(LP) is feasible, all lower bounds can be obtained from solutions to (D). Now note that if
(y, a) is a solution to (D), then the pair (y, bT y) also is a solution to the same system, and the
lower bound b^T y on c* is not worse than the lower bound a. Thus, as far as lower bounds on c* are concerned, we lose nothing by restricting ourselves to the solutions (y, a) of (D) with a = b^T y; the best lower bound on c* given by (D) is therefore the optimal value of the problem max_y {b^T y : A^T y = c, y ≥ 0}, which is nothing but the dual problem (LP*).
Proposition B.2.3 Whenever y is a feasible solution to (LP∗ ), the corresponding value of the
dual objective bT y is a lower bound on the optimal value c∗ in (LP). If (LP) is feasible, then for
every a ≤ c∗ there exists a feasible solution y of (LP∗ ) with bT y ≥ a.
Theorem B.2.7 [Duality Theorem in Linear Programming] Consider a linear programming program (LP) along with its dual (LP*). Then
1) The duality is symmetric: the problem dual to dual is equivalent to the primal;
2) The value of the dual objective at every dual feasible solution is ≤ the value of the primal
objective at every primal feasible solution
3) The following 5 properties are equivalent to each other:
where I_m is the m-dimensional unit matrix. Applying the duality transformation to the latter problem, we come to the problem
max_{ξ,η,ζ} {0^T ξ + c^T η + (−c)^T ζ : ξ ≥ 0, η ≥ 0, ζ ≥ 0, ξ − Aη + Aζ = −b},
An immediate corollary of the LP Duality Theorem is the following necessary and sufficient
optimality condition in LP:
Theorem B.2.8 [Necessary and sufficient optimality conditions in linear programming] Consider an LP program (LP) along with its dual (LP*). A pair (x, y) of primal and dual feasible solutions is comprised of optimal solutions to the respective problems if and only if
y_i[Ax − b]_i = 0, i = 1, ..., m [complementary slackness],
which, in turn, is the case if and only if
c^T x − b^T y = 0 [zero duality gap].
Indeed, the “zero duality gap” optimality condition is an immediate consequence of the fact
that the value of primal objective at every primal feasible solution is ≥ the value of the
dual objective at every dual feasible solution, while the optimal values in the primal and the
dual are equal to each other, see Theorem B.2.7. The equivalence between the “zero duality
gap” and the “complementary slackness” optimality conditions is given by the following
computation: whenever x is primal feasible and y is dual feasible, the products yi [Ax − b]i ,
i = 1, ..., m, are nonnegative, while the sum of these products is precisely the duality gap:
∑_i y_i[Ax − b]_i = (Ax − b)^T y = x^T A^T y − b^T y = c^T x − b^T y.
Thus, the duality gap can vanish at a primal-dual feasible pair (x, y) if and only if all products
yi [Ax − b]i for this pair are zeros.
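Continuing the numerical sketch style used above (Python with numpy and scipy assumed; the data are sampled to make both problems solvable), complementary slackness can be observed at an optimal primal-dual pair:

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(6)
    m, n = 5, 2
    A = rng.standard_normal((m, n))
    b = A @ rng.random(n) - rng.random(m)    # primal Ax >= b feasible
    c = A.T @ rng.random(m)                  # primal bounded below

    x = linprog(c, A_ub=-A, b_ub=-b, bounds=(None, None)).x
    y = linprog(-b, A_eq=A.T, b_eq=c, bounds=(0, None)).x

    slack = A @ x - b
    print(y * slack)                         # componentwise ~ 0
    print(c @ x - b @ y)                     # duality gap ~ 0 (the sum of the products)
    assert np.allclose(y * slack, 0, atol=1e-6)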
M ⊂ R^n is a hyperplane
⇕
∃a ∈ R^n, b ∈ R, a ≠ 0 : M = {x ∈ R^n : a^T x = b}.
We can, consequently, associate with the hyperplane (or, better to say, with the associated linear form a; this form is defined uniquely up to multiplication by a nonzero real) the following sets: the two open half-spaces M^{−−} = {x : a^T x < b} and M^{++} = {x : a^T x > b}, so that
R^n = M^{−−} ∪ M ∪ M^{++}.
• A hyperplane
M = {x ∈ Rn : aT x = b} [a 6= 0]
is said to separate S and T , if, first,
S ⊂ {x : aT x ≤ b}, T ⊂ {x : aT x ≥ b}
(i.e., S and T belong to the opposite closed half-spaces into which M splits Rn ), and,
second, at least one of the sets S, T is not contained in M itself:
S ∪ T 6⊂ M.
The separation is called strong, if there exist b0 , b00 , b0 < b < b00 , such that
S ⊂ {x : aT x ≤ b0 }, T ⊂ {x : aT x ≥ b00 }.
• A linear form a 6= 0 is said to separate (strongly separate) S and T , if for properly chosen
b the hyperplane {x : aT x = b} separates (strongly separates) S and T .
• We say that S and T can be (strongly) separated, if there exists a hyperplane which
(strongly) separates S and T .
Exercise B.11 Let S, T be nonempty convex sets in R^n. Prove that a linear form a separates S and T if and only if
sup_{x∈S} a^T x ≤ inf_{y∈T} a^T y
and
inf_{x∈S} a^T x < sup_{y∈T} a^T y.
This separation is strong if and only if
sup_{x∈S} a^T x < inf_{y∈T} a^T y.
(i), Necessity. Assume that S, T can be separated, so that for certain a ≠ 0 we have
sup_{x∈S} a^T x ≤ inf_{y∈T} a^T y; inf_{x∈S} a^T x < sup_{y∈T} a^T y. (B.2.9)
We should lead to a contradiction the assumption that ri S and ri T have a point x̄ in common. Assume that this is the case; then from the first inequality in (B.2.9) it is clear that x̄ maximizes the linear function f(x) = a^T x on S and simultaneously minimizes this function on T. Now, we have the following simple and important
Lemma B.2.2 A linear function f (x) = aT x can attain its maximum/minimum
over a convex set Q at a point x ∈ ri Q if and only if the function is constant on Q.
Proof. The “if” part is evident. To prove the “only if” part, let x̄ ∈ ri Q be, say, a minimizer of f over Q and y be an arbitrary point of Q; we should prove that f(x̄) = f(y). There is nothing to prove if y = x̄, so let us assume that y ≠ x̄. Since x̄ ∈ ri Q, the segment [y, x̄], which is contained in Q, can be extended a little bit through the point x̄ without leaving Q, so that there exists z ∈ Q such that x̄ ∈ [y, z), i.e., x̄ = (1 − λ)y + λz with a certain λ ∈ (0, 1]; since y ≠ x̄, we have in fact λ ∈ (0, 1). Since f is linear, we have
f(x̄) = (1 − λ)f(y) + λf(z);
since f(x̄) ≤ min{f(y), f(z)} and 0 < λ < 1, this relation can be satisfied only when f(x̄) = f(y) = f(z).
By Lemma B.2.2, f (x) = f (x̄) on S and on T , so that f (·) is constant on S ∪ T , which yields
the desired contradiction with the second inequality in (B.2.9).
(i), Sufficiency. The proof of sufficiency part of the Separation Theorem is much more instruc-
tive. There are several ways to prove it, and I choose the one which goes via the Homogeneous
Farkas Lemma B.2.1, which is extremely important in its own right.
(i), Sufficiency, Step 1: Separation of a convex polytope and a point outside the
polytope. Let us start with a seemingly very particular case of the Separation Theorem – the one where S is the convex hull of points x_1, ..., x_N, and T is a singleton T = {x} which does not belong to S. We intend to prove that in this case there exists a linear form which separates x and S; in fact we shall prove even the existence of strong separation.
Let us associate with the n-dimensional vectors x_1, ..., x_N, x the (n + 1)-dimensional vectors
a = (x; 1) and a_i = (x_i; 1), i = 1, ..., N.
I claim that a does not belong to the conic hull of a_1, ..., a_N. Indeed, if a would be representable as a linear combination of a_1, ..., a_N with
of a1 , ..., aN . Indeed, if a would be representable as a linear combination of a1 , ..., aN with
nonnegative coefficients, then, looking at the last, (n+1)-st, coordinates in such a representation,
we would conclude that the sum of coefficients should be 1, so that the representation, actually,
represents x as a convex combination of x1 , ..., xN , which was assumed to be impossible.
Since a does not belong to the conic hull of a_1, ..., a_N, by the Homogeneous Farkas Lemma (Lemma B.2.1) there exists a vector h = (f; α) ∈ R^{n+1} which “separates” a and a_1, ..., a_N in the sense that
h^T a > 0, h^T a_i ≤ 0, i = 1, ..., N,
whence, of course,
h^T a > max_i h^T a_i.
Since the components in all the inner products h^T a, h^T a_i coming from the (n + 1)-st coordinates are equal to each other, we conclude that the n-dimensional component f of h separates x and x_1, ..., x_N:
f^T x > max_i f^T x_i.
(i), Sufficiency, Step 2: Separation of a convex set and a point outside of the set.
Now consider the case when S is an arbitrary nonempty convex set and T = {x} is a singleton
outside S (the difference with Step 1 is that now S is not assumed to be a polytope).
First of all, without loss of generality we may assume that S contains 0 (if it is not the case,
we may subject S and T to translation S 7→ p + S, T 7→ p + T with p ∈ −S). Let L be the
linear span of S. If x ∉ L, the separation is easy: taking as f the component of x orthogonal to L, we get
f^T x = f^T f > 0 = max_{y∈S} f^T y.
It remains to consider the case x ∈ L; replacing, if necessary, R^n with L, we may assume that L = Lin(S) = R^n. Let Σ = {h ∈ R^n : ‖h‖_2 = 1}. It suffices to find f ∈ Σ such that
f^T x ≥ sup_{y∈S} f^T y. (B.2.10)
Assume, on the contrary, that no such f exists, and let us lead this assumption to a contradiction.
Under our assumption for every h ∈ Σ there exists yh ∈ S such that
hT yh > hT x.
Since the inequality is strict, it immediately follows that there exists a neighbourhood Uh of the
vector h such that
(h0 )T yh > (h0 )T x ∀h0 ∈ Uh . (B.2.11)
The family of open sets {Uh }h∈Σ covers Σ; since Σ is compact, we can find a finite subfamily
U_{h_1}, ..., U_{h_N} of the family which still covers Σ. Let us take the corresponding points y_1 = y_{h_1}, y_2 = y_{h_2}, ..., y_N = y_{h_N} and the polytope S′ = Conv({y_1, ..., y_N}) spanned by these points. Due to the origin of the y_i, all of them are points from S; since S is convex, the polytope S′ is contained in S and, consequently, does not contain x. By Step 1, x can be strongly separated
from S′: there exists a such that
a^T x > sup_{y∈S′} a^T y. (B.2.12)
By normalization, we may also assume that ‖a‖_2 = 1, so that a ∈ Σ. Now we get a contradiction: since a ∈ Σ and U_{h_1}, ..., U_{h_N} form a covering of Σ, a belongs to a certain U_{h_i}. By construction of U_{h_i} (see (B.2.11)), we have
a^T y_i ≡ a^T y_{h_i} > a^T x,
which contradicts (B.2.12) – recall that y_i ∈ S′.
The contradiction we get proves that there exists f ∈ Σ satisfying (B.2.10). We claim that
f separates S and {x}; in view of (B.2.10), all we need to verify our claim is to show that the
linear form f (y) = f T y is non-constant on S ∪ T , which is evident: we are in the situation when
0 ∈ S and L ≡ Lin(S) = Rn and f 6= 0, so that f (y) is non-constant already on S.
A mathematically oriented reader should take into account that the simple-looking reasoning underlying Step 2 in fact brings us into a completely new world. Indeed, the considerations at Step 1 and in the proof of the Homogeneous Farkas Lemma are “pure arithmetic” – we never used things like convergence, compactness, etc., and used rational arithmetic only – no square roots, etc. It means that the Homogeneous Farkas Lemma and the result stated at Step 1 remain valid if we, e.g., replace our universe R^n with the space Q^n of n-dimensional rational vectors (those with rational coordinates; of course, the multiplication by reals in this space should be restricted to multiplication by rationals). The “rational” Farkas Lemma, or the possibility to separate a rational vector from a “rational” polytope by a rational linear form, which is the “rational” version of the result of Step 1, definitely are of interest (e.g., for Integer Programming). In contrast to these “purely arithmetic” considerations, at Step 2 we used compactness – something heavily exploiting the fact that our universe is R^n and not, say, Q^n (in the latter space bounded and closed sets are not necessarily compact). Note also that we could not avoid things like compactness arguments at Step 2, since the very fact we are proving is true in R^n but not in Q^n. Indeed, consider the “rational plane” – the
universe comprised of all 2-dimensional vectors with rational entries, and let S be the half-plane in
this rational plane given by the linear inequality
x1 + αx2 ≤ 0,
where α is irrational. S clearly is a “convex set” in Q2 ; it is immediately seen that a point outside
this set cannot be separated from S by a rational linear form.
(i), Sufficiency, Step 3: Separation of two convex sets. Finally, let S and T be nonempty convex sets with non-intersecting relative interiors. Set S′ = ri S, T′ = ri T and
∆ = S′ − T′ = {x − y : x ∈ S′, y ∈ T′}.
By Proposition B.1.6.3, ∆ is a convex (and, of course, nonempty) set; since S′ and T′ do not intersect, ∆ does not contain 0. By Step 2, we can separate ∆ and {0}: there exists f ≠ 0 such that
f^T 0 = 0 ≥ sup_{z∈∆} f^T z & f^T 0 > inf_{z∈∆} f^T z.
In other words,
0 ≥ sup_{x∈S′, y∈T′} [f^T x − f^T y] & 0 > inf_{x∈S′, y∈T′} [f^T x − f^T y]. (B.2.13)
It is immediately seen that in fact f separates S and T. Indeed, the quantities in the left and right hand sides of the first inequality in (B.2.13) clearly remain unchanged when we replace S′ with cl S′ and T′ with cl T′; by Theorem B.1.1, cl S′ = cl S ⊃ S and cl T′ = cl T ⊃ T, and we get inf_{x∈T} f^T x = inf_{x∈T′} f^T x, and similarly sup_{y∈S} f^T y = sup_{y∈S′} f^T y. Thus, we get from (B.2.13)
inf_{x∈T} f^T x ≥ sup_{y∈S} f^T y.
It remains to note that T′ ⊂ T, S′ ⊂ S, so that the second inequality in (B.2.13) implies that
inf_{x∈S} f^T x < sup_{y∈T} f^T y,
so that f indeed separates S and T (see Exercise B.11).
(ii), Sufficiency: Assuming that ρ ≡ inf{‖x − y‖_2 : x ∈ S, y ∈ T} > 0, consider the set S′ = {x : inf_{y∈S} ‖x − y‖_2 ≤ ρ/2}. Note that S′ is convex along with S (Example B.1.4) and that S′ ∩ T = ∅ (why?). By (i), S′ and T can be separated, and if f is a linear form which separates S′ and T, then the same form strongly separates S and T (why?). The “in particular” part of
(ii) readily follows from the just proved statement due to the fact that if two closed nonempty
sets in Rn do not intersect and one of them is compact, then the sets are at positive distance
from each other (why?).
Exercise B.13 Derive the statement in Remark B.1.1 from the Separation Theorem.
Exercise B.14 Implement the following alternative approach to the proof of Separation Theo-
rem:
1. Prove that if x is a point in R^n and S is a nonempty closed convex set in R^n, then the problem
min_y {‖x − y‖_2 : y ∈ S}
has a unique optimal solution x̄.
2. In the situation of 1), prove that if x ∉ S, then the linear form e = x − x̄ strongly separates {x} and S:
max_{y∈S} e^T y = e^T x̄ = e^T x − e^T e < e^T x,
thus getting a direct proof of the possibility to separate strongly a nonempty closed convex
set and a point outside this set.
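For a set S with an explicitly computable projection, this recipe is a two-line separation oracle. A sketch for S a closed Euclidean ball (assuming Python with numpy; the projection formula for a ball is elementary):

    import numpy as np

    def separate_from_ball(x, center, r):
        # strongly separate x from S = {y : ||y - center||_2 <= r}
        d = x - center
        dist = np.linalg.norm(d)
        assert dist > r, "x must lie outside S"
        x_bar = center + (r / dist) * d      # projection of x onto S
        e = x - x_bar                        # the separating linear form
        # max over S of e^T y equals e^T x_bar and is < e^T x
        max_over_S = e @ center + r * np.linalg.norm(e)
        assert np.isclose(max_over_S, e @ x_bar) and max_over_S < e @ x
        return e

    print(separate_from_ball(np.array([2.0, 1.0]), np.zeros(2), 1.0))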
Definition B.2.3 [Supporting plane] Let M be a convex closed set in Rn , and let x be a point
from the relative boundary of M. A hyperplane
Π = {y : a^T y = a^T x} [a ≠ 0]
is called supporting to M at x if it separates M and {x}, i.e., if
a^T x ≥ sup_{y∈M} a^T y & a^T x > inf_{y∈M} a^T y. (B.2.14)
Note that since x is a point from the relative boundary of M and therefore belongs to cl M = M, the first inequality in (B.2.14) in fact is equality. Thus, an equivalent definition of a supporting plane is as follows:
Proof. All we need is to prove that if M is closed and convex and 0 ∈ M , then M =
Polar (Polar (M )). By definition,
y ∈ Polar (M ), x ∈ M ⇒ y T x ≤ 1,
so that M ⊂ Polar (Polar (M )). To prove that this inclusion is in fact equality, assume, on the
contrary, that there exists x̄ ∈ Polar (Polar (M ))\M . Since M is nonempty, convex and closed
and x̄ 6∈ M , the point x̄ can be strongly separated from M (Separation Theorem, (ii)). Thus,
for appropriate b one has
b^T x̄ > sup_{x∈M} b^T x.
Since 0 ∈ M , the left hand side quantity in this inequality is positive; passing from b to a
proportional vector a = λb with appropriately chosen positive λ, we may ensure that
a^T x̄ > 1 ≥ sup_{x∈M} a^T x.
This is the desired contradiction, since the relation 1 ≥ sup_{x∈M} a^T x implies that a ∈ Polar(M), so that the relation a^T x̄ > 1 contradicts the assumption that x̄ ∈ Polar(Polar(M)).
Exercise B.15 Let M be a convex set containing the origin, and let M′ be the polar of M. Prove the following facts:
Let M be a nonempty set in R^n. The set
M_* = {a : a^T x ≥ 0 ∀x ∈ M},
of all vectors which have nonnegative inner products with all vectors from M , is called the cone
dual to M . By Proposition B.2.5 and Exercise B.15.4, the family of closed cones in Rn is closed
with respect to passing to a dual cone, and the duality is symmetric: for a closed cone M , M∗
also is a closed cone, and (M∗ )∗ = M .
Exercise B.16 Let M be a closed cone in Rn , and M∗ be its dual cone. Prove that
1. M is pointed (i.e., does not contain lines) if and only if M_* has a nonempty interior. Derive
from this fact that M is a closed pointed cone with a nonempty interior if and only if the
dual cone has the same properties.
2. Prove that a ∈ int M∗ if and only if aT x > 0 for all nonzero vectors x ∈ M .
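As a quick numeric sanity check of this duality (our illustration; the text does not include it), one can test the ice-cream cone {(x, y, z) : z ≥ √(x² + y²)}, which is self-dual: a vector has nonnegative inner products with the whole cone exactly when it belongs to the cone itself. The witness point below minimizes a^Tx over the relevant slice of the cone, so a single inner product decides dual membership.

```python
import numpy as np

rng = np.random.default_rng(1)

def in_icecream(v):
    """Membership in the ice-cream cone {(x, y, z) : z >= sqrt(x^2 + y^2)}."""
    return v[2] >= np.hypot(v[0], v[1]) - 1e-9

# Self-duality check: a has nonnegative inner product with the whole cone
# iff a itself lies in the cone.
for _ in range(1000):
    a = rng.normal(size=3)            # (a0, a1) = 0 has probability zero here
    h = np.hypot(a[0], a[1])
    witness = np.array([-a[0], -a[1], h])   # a point of the cone minimizing a^T x
    # a^T witness = h * (a[2] - h): nonnegative iff a is in the cone
    assert in_icecream(a) == (a @ witness >= -1e-7)
```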
Let M1 , ..., Mk be cones (not necessarily closed), and M be their intersection; of course, M also
is a cone. How to compute the cone dual to M ?
Proposition B.2.6 Let M1, ..., Mk be cones. The cone M′ dual to the intersection M of the cones M1, ..., Mk contains the arithmetic sum M̃ of the cones M1′, ..., Mk′ dual to M1, ..., Mk. If all the cones M1, ..., Mk are closed, then M′ is equal to cl M̃. In particular, for closed cones M1, ..., Mk, M′ coincides with M̃ if and only if the latter set is closed.
Proof. Whenever a_i ∈ M_i′ and x ∈ M, we have a_i^Tx ≥ 0, i = 1, ..., k, whence (a_1 + ... + a_k)^Tx ≥ 0. Since the latter relation is valid for all x ∈ M, we conclude that a_1 + ... + a_k ∈ M′. Thus, M̃ ⊂ M′.
Now assume that the cones M1, ..., Mk are closed, and let us prove that M′ = cl M̃. Since M′ is closed and we have seen that M̃ ⊂ M′, all we should prove is that if a ∈ M′, then a ∈ M̂ ≡ cl M̃ as well. Assume, on the contrary, that a ∈ M′ \ M̂. Since the set M̃ clearly is a cone, its closure M̂ is a closed cone; by assumption, a does not belong to this closed cone and therefore, by the Separation Theorem (ii), a can be strongly separated from M̂ and therefore from M̃ ⊂ M̂. Thus, for some x one has
a^Tx < inf_{b∈M̃} b^Tx = inf_{a_i∈M_i′, i=1,...,k} (a_1 + ... + a_k)^Tx = Σ_{i=1}^k inf_{a_i∈M_i′} a_i^Tx.   (B.2.16)
From the resulting inequality it follows that inf_{a_i∈M_i′} a_i^Tx > −∞; since M_i′ is a cone, the latter is possible if and only if inf_{a_i∈M_i′} a_i^Tx = 0, i.e., if and only if x has nonnegative inner products with all vectors from M_i′, that is, x ∈ (M_i′)∗ = M_i (recall that the cones M_i are closed). Thus, x ∈ M_i for all i, and the concluding quantity in (B.2.16) is 0. We see that x ∈ M = ∩_i M_i, and that (B.2.16) reduces to a^Tx < 0. This contradicts the inclusion a ∈ M′.
Note that in general M̃ can be non-closed even when all the cones M1, ..., Mk are closed. Indeed, take k = 2, and let M1′ be the ice-cream cone {(x, y, z) ∈ R³ : z ≥ √(x² + y²)}, and M2′ be the ray {(x, y, z) : z = x ≤ 0, y = 0} in R³. Observe that the points from M̃ ≡ M1′ + M2′ are exactly the points of the form (x − t, y, z − t) with t ≥ 0 and z ≥ √(x² + y²). In particular, for x positive the points (0, 1, √(x² + 1) − x) = (x − x, 1, √(x² + 1) − x) belong to M̃; as x → ∞, these points converge to the point (0, 1, 0), which does not belong to M̃ (why?), so that M̃ is not closed.
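A numeric companion to this counterexample (ours, not from the text): the membership test below exploits the explicit description of M̃ just given, and shows the points above entering M̃ while their limit stays outside.

```python
import numpy as np

def in_Mtilde(p, t_max=1e6):
    """Membership test for M~ = {(x - t, y, z - t) : t >= 0, z >= hypot(x, y)}.
    p is in M~ iff (p0 + t, p1, p2 + t) lies in the ice-cream cone for some
    t >= 0; for the points used below (p0 = 0) the defect p2 + t - hypot(t, p1)
    increases in t, so it is enough to test one large t."""
    t = t_max
    return p[2] + t >= np.hypot(p[0] + t, p[1])

# The points (0, 1, sqrt(x^2 + 1) - x) all lie in M~ and approach (0, 1, 0):
for x in [1.0, 10.0, 100.0, 1000.0]:
    z = np.hypot(x, 1.0) - x
    print(x, z, in_Mtilde(np.array([0.0, 1.0, z])))    # z -> 0, True every time

# ...but the limit point (0, 1, 0) is NOT in M~: t >= hypot(t, 1) fails for all t.
print(in_Mtilde(np.array([0.0, 1.0, 0.0])))            # False
```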
Proposition B.2.7 [Dubovitski-Milutin Lemma] Let M1 , ..., Mk be cones such that Mk is closed
and the set Mk ∩ int M1 ∩ int M2 ∩ ... ∩ int Mk−1 is nonempty, and let M = M1 ∩ ... ∩ Mk . Let
also Mi0 be the cones dual to Mi . Then
(i) cl M = ∩_{i=1}^{k} cl M_i;
(ii) the cone M̃ = M1′ + ... + Mk′ is closed, and thus coincides with the cone M′ dual to cl M (or, which is the same by Exercise B.15.1, with the cone dual to M). In other words, every linear form which is nonnegative on M can be represented as a sum of k linear forms which are nonnegative on the respective cones M1, ..., Mk.
Proof. (i): We should prove that under the premise of the Dubovitski-Milutin Lemma, cl M = ∩_i cl M_i. The right hand side here contains M and is closed, so that all we should prove is that every point x in ∩_{i=1}^{k} cl M_i is the limit of an appropriate sequence x_t ∈ M. By the premise of the Lemma, there exists a point x̄ ∈ Mk ∩ int M1 ∩ int M2 ∩ ... ∩ int M_{k−1}; setting x_t = t⁻¹x̄ + (1 − t⁻¹)x, t = 1, 2, ..., we get a sequence converging to x as t → ∞; at the same time, x_t ∈ Mk (since x, x̄ are in cl Mk = Mk) and x_t ∈ M_i for every i < k (by Lemma B.1.1; note that for i < k one has x̄ ∈ int M_i and x ∈ cl M_i), and thus x_t ∈ M. Thus, every point x ∈ ∩_{i=1}^{k} cl M_i is the limit of a sequence from M.
(ii): Under the premise of the Lemma, when replacing the cones M1, ..., Mk with their closures, we do not vary the dual cones Mi′ (and thus do not vary M̃) and replace the intersection of the sets M1, ..., Mk with its closure (by (i)), thus not varying the dual of the intersection. And of course when replacing the cones M1, ..., Mk with their closures, we preserve the premise of the Lemma. Thus, we lose nothing when assuming, in addition to the premise of the Lemma, that the cones M1, ..., Mk are closed. To prove the Lemma for closed cones M1, ..., Mk, we use induction on k ≥ 2.
Inductive step: Assume that the statement we are proving is valid in the case of k − 1 ≥ 2 cones, and let M1, ..., Mk be k cones satisfying the premise of the Dubovitski-Milutin Lemma. By this premise, the cone M¹ = M1 ∩ ... ∩ M_{k−1} has a nonempty interior, and Mk intersects this interior. Applying to the pair of cones M¹, Mk the already proved 2-cone version of the Lemma, we see that the set (M¹)′ + Mk′ is closed; here (M¹)′ is the cone dual to M¹. Further, the cones M1, ..., M_{k−1} satisfy the premise of the (k − 1)-cone version of the Lemma; by the inductive hypothesis, the set M1′ + ... + M_{k−1}′ is closed and therefore, by Proposition B.2.6, equals (M¹)′. Thus, M1′ + ... + Mk′ = (M¹)′ + Mk′, and we have seen that the latter set is closed.
combination of other points of the set; and the importance of the notion comes from the fact
(which we shall prove in the meantime) that the set of all extreme points of a “good enough”
convex set M is the “shortest worker’s instruction for building the set” – this is the smallest set
of points for which M is the convex hull.
A point x of a convex set M ⊂ Rn is called an extreme point of M if the only representation
x = λu + (1 − λ)v   [u, v ∈ M, λ ∈ (0, 1)]
of x is the trivial one:
u = v = x.
E.g., the extreme points of a segment are exactly its endpoints; the extreme points of a triangle
are its vertices; the extreme points of a (closed) circle on the 2-dimensional plane are the points
of the circumference.
An equivalent definition of an extreme point is as follows:
1. x is extreme if and only if the only vector h such that x ± h ∈ M is the zero vector;
Theorem B.2.10 [Krein-Milman] Let M be a closed convex nonempty set in Rn. Then
(i) M possesses extreme points if and only if M does not contain lines;
(ii) if M is in addition bounded, then
M = Conv(Ext(M)),
i.e., M is the convex hull of the set of its extreme points.
Part (ii) of this theorem is the finite-dimensional version of the famous Krein-Milman Theorem.
Proof. Let us start with (i). The "only if" part is easy, due to the following simple
Lemma B.2.3 Let M be a closed convex set in Rn . Assume that for some x̄ ∈ M
and h ∈ Rn M contains the ray
{x̄ + th : t ≥ 0}
starting at x̄ with the direction h. Then M contains also all parallel rays starting at
the points of M :
(∀x ∈ M ) : {x + th : t ≥ 0} ⊂ M.
In particular, if M contains certain line, then it contains also all parallel lines passing
through the points of M .
Comment. For a closed convex set M, the set of all directions h such that x + th ∈
M for some x and all t ≥ 0 (i.e., by Lemma – such that x + th ∈ M for all x ∈ M
and all t ≥ 0) is called the recessive cone of M [notation: Rec(M )]. With Lemma
B.2.3 it is immediately seen (prove it!) that Rec(M ) indeed is a closed cone, and
that
M + Rec(M ) = M.
Directions from Rec(M ) are called recessive for M .
Exercise B.18 Let M be a closed nonempty convex set. Prove that Rec(M ) 6= {0}
if and only if M is unbounded.
Lemma B.2.3, of course, resolves all our problems with the "only if" part. Indeed, here we should prove that if M possesses extreme points, then M does not contain lines, or, which is the same, that if M contains lines, then it has no extreme points. But the latter statement is immediate: if M contains a line, then, by the Lemma, there is a line in M passing through every given point of M, so that no point can be extreme.
Now let us prove the "if" part of (i). Thus, from now on we assume that M does not contain lines; our goal is to prove that then M possesses extreme points. Let us start with the following
due to the definition of a supporting plane, and Q contains x̄ due to the closedness
of Q). Second, let a be the linear form associated with Π:
Π = {y : aT y = aT x̄},
so that
inf_{x∈Q} a^Tx < sup_{x∈Q} a^Tx = a^Tx̄.   (B.2.17)
By the same Proposition, the set T = Π ∩ M (which is closed, convex and nonempty) is of affine
dimension less than the one of M , i.e., of the dimension ≤ k. T clearly does not contain lines
(since even the larger set M does not contain lines). By the inductive hypothesis, T possesses
extreme points, and by Lemma B.2.4 all these points are extreme also for M . The inductive
step is completed, and (i) is proved.
Now let us prove (ii). Thus, let M be nonempty, convex, closed and bounded; we should
prove that
M = Conv(Ext(M )).
What is immediately seen is that the right hand side set is contained in the left hand side one.
Thus, all we need is to prove that every x ∈ M is a convex combination of points from Ext(M ).
Here we again use induction on the affine dimension of M . The case of 0-dimensional set M (i.e.,
a point) is trivial. Assume that the statement in question is valid for all k-dimensional convex
closed and bounded sets, and let M be a convex closed and bounded set of dimension k + 1. Let
x ∈ M ; to represent x as a convex combination of points from Ext(M ), let us pass through x an
arbitrary line ` = {x + λh : λ ∈ R} (h 6= 0) in the affine span Aff(M ) of M . Moving along this
line from x in each of the two possible directions, we eventually leave M (since M is bounded);
as it was explained in the proof of (i), it means that there exist nonnegative λ+ and λ− such
that the points
x̄± = x + λ± h
both belong to the relative boundary of M . Let us verify that x̄± are convex combinations of
the extreme points of M (this will complete the proof, since x clearly is a convex combination
of the two points x̄±). Indeed, M admits a supporting hyperplane Π at x̄+; as was explained in the proof of (i), the set Π ∩ M (which clearly is convex, closed and bounded) is of affine dimension less than that of M; by the inductive hypothesis, the point x̄+ of this set is a
convex combination of extreme points of the set, and by Lemma B.2.4 all these extreme points
are extreme points of M as well. Thus, x̄+ is a convex combination of extreme points of M .
Similar reasoning is valid for x̄− .
– those with i ∈ I – all points y of this segment satisfy the relations a_i^Ty = a_i^Tx = b_i. Now, if i is a “nonactive” index – one with a_i^Tx < b_i – then a_i^Ty ≤ b_i for all y ∈ ∆_ε, provided that ε is small enough. Since there are finitely many nonactive indices, we can choose ε > 0 in such a way that all y ∈ ∆_ε will satisfy all “nonactive” inequalities a_i^Ty ≤ b_i, i ∉ I. Since every y ∈ ∆_ε satisfies, as we have seen, also all “active” inequalities, we conclude that with the above choice of ε we get ∆_ε ⊂ K, which is a contradiction: ε > 0 and h ≠ 0, so that ∆_ε is a nontrivial segment with the midpoint x, and no such segment can be contained in K, since x is an extreme point of K.
To prove the “if” part, assume that x ∈ K is such that among the inequalities aTi x ≤ bi
which are equalities at x there are n linearly independent, say, those with indices 1, ..., n, and let
us prove that x is an extreme point of K. This is immediate: assuming that x is not an extreme
point, we would get the existence of a nonzero vector h such that x ± h ∈ K. In other words,
for i = 1, ..., n we would have bi ± aTi h ≡ aTi (x ± h) ≤ bi , which is possible only if aTi h = 0,
i = 1, ..., n. But the only vector which is orthogonal to n linearly independent vectors in Rn is
the zero vector (why?), and we get h = 0, which was assumed not to be the case.
Corollary B.2.1 The set of extreme points of a polyhedral set is finite.
Indeed, according to the above Theorem, every extreme point of a polyhedral set K = {x ∈
Rn : Ax ≤ b} satisfies the equality version of certain n-inequality subsystem of the original
system, the matrix of the subsystem being nonsingular. Due to the latter fact, an extreme point
is uniquely defined by the corresponding subsystem, so that the number of extreme points does
not exceed the number C_m^n of n × n submatrices of the m × n matrix A and is therefore finite.
Note that C_m^n is nothing but an upper (and typically very conservative) bound on the number
of extreme points of a polyhedral set given by m inequalities in Rn : some n × n submatrices of
A can be singular and, what is more important, the majority of the nonsingular ones normally
produce “candidates” which do not satisfy the remaining inequalities.
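The recipe behind Theorem B.2.11 is easy to sketch in code (our illustration, practical only for small m and n): run over all n-row subsystems, solve each nonsingular one, and keep the solutions that satisfy the remaining inequalities.

```python
import numpy as np
from itertools import combinations

def extreme_points(A, b, tol=1e-9):
    """Enumerate extreme points of K = {x : A x <= b} (A is m x n) by checking,
    for every n-row subsystem, whether it is nonsingular and whether its unique
    solution satisfies the remaining inequalities (Theorem B.2.11)."""
    m, n = A.shape
    vertices = []
    for rows in combinations(range(m), n):      # at most C(m, n) candidates
        Asub, bsub = A[list(rows)], b[list(rows)]
        if abs(np.linalg.det(Asub)) < tol:
            continue                            # singular subsystem: skip
        x = np.linalg.solve(Asub, bsub)
        if (A @ x <= b + tol).all() and not any(
                np.allclose(x, v, atol=tol) for v in vertices):
            vertices.append(x)
    return vertices

# The unit square {0 <= x <= 1} in R^2, written as A x <= b with m = 4:
A = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
b = np.array([1., 0., 1., 0.])
print(extreme_points(A, b))    # the 4 corners; C(4, 2) = 6 candidates examined
```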
Remark B.2.1 The result of Theorem B.2.11 is very important, in particular, for
the theory of the Simplex method – the traditional computational tool of Linear
Programming. When applied to the LP program in the standard form
min_x { c^Tx : Px = p, x ≥ 0 }   [x ∈ Rn],
with k×n matrix P , the result of Theorem B.2.11 is that extreme points of the feasible
set are exactly the basic feasible solutions of the system P x = p, i.e., nonnegative
vectors x such that P x = p and the set of columns of P associated with positive
entries of x is linearly independent. Since the feasible set of an LP program in the
standard form clearly does not contain lines, among the optimal solutions (if they
exist) to an LP program in the standard form at least one is an extreme point of the
feasible set (Theorem B.2.14.(ii)). Thus, in principle we could look through the finite
set of all extreme points of the feasible set (≡ through all basic feasible solutions)
and to choose the one with the best value of the objective. This recipe allows to find
a feasible solution in finitely many arithmetic operations, provided that the program
is solvable, and is, basically, what the Simplex method does; this latter method, of
course, looks through the basic feasible solutions in a smart way which normally
allows to deal with a negligible part of them only.
Another useful consequence of Theorem B.2.11 is that if all the data in an LP program
are rational, then every extreme point of the feasible domain of the program is a
vector with rational entries. In particular, a solvable standard form LP program
with rational data has at least one rational optimal solution.
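A small illustration of these observations (ours; scipy's HiGHS-based linprog is used as the LP solver, and the data is made up): the solver returns an optimal solution which here is basic feasible, i.e., the columns of P at its positive entries are linearly independent.

```python
import numpy as np
from scipy.optimize import linprog

# A standard-form LP: min c^T x s.t. P x = p, x >= 0 (data for illustration).
c = np.array([1.0, 2.0, 0.0, 3.0])
P = np.array([[1.0, 1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
p = np.array([4.0, 2.0])

res = linprog(c, A_eq=P, b_eq=p, bounds=[(0, None)] * 4, method="highs")
x = res.x
support = np.flatnonzero(x > 1e-9)          # indices of positive entries
cols = P[:, support]
# Basic feasible solution: the columns of P at positive entries are independent.
print(x, np.linalg.matrix_rank(cols) == len(support))
```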
By the Krein-Milman Theorem, the polytope Πn of n × n double stochastic matrices (i.e., nonnegative matrices with all row and column sums equal to 1) is the convex hull of its extreme points. What are these extreme points? The answer is given by the important
Theorem B.2.12 [Birkhoff Theorem] Extreme points of Πn are exactly the permutation ma-
trices of order n, i.e., n × n Boolean (i.e., with 0/1 entries) matrices with exactly one nonzero
element (equal to 1) in every row and every column.
Exercise B.19 [Easy part] Prove the easy part of the Theorem, specifically, that every n × n
permutation matrix is an extreme point of Πn .
Proof of difficult part. Now let us prove that every extreme point of Πn is a permutation matrix. To this end let us note that the 2n linear equations in the definition of Πn — those saying that all row and column sums are equal to 1 — are linearly dependent, and dropping one of them, say, Σ_i x_{in} = 1, we do not alter the set. Indeed, the remaining equalities say that all row sums are equal to 1, so that the total sum of all entries in X is n, and that the first n − 1 column sums are equal to 1, meaning that the last column sum is n − (n − 1) = 1. Thus, we lose nothing when assuming that there are just 2n − 1 equality constraints in the description of Πn. Now let us prove the claim by induction on n. The base n = 1 is trivial. Let us justify the inductive step n − 1 ⇒ n. Thus, let X be an extreme point of Πn. By Theorem B.2.11, among the constraints defining Πn (i.e., 2n − 1 equalities and n² inequalities x_{ij} ≥ 0) there should be n² linearly independent constraints which are satisfied at X as equations. Thus, at least n² − (2n − 1) = (n − 1)² entries in X should be zeros. It follows that at least one of the columns of X contains at most 1 nonzero entry (since otherwise the number of zero entries in X would be at most n(n − 2) < (n − 1)²). Thus, there exists at least one column with at most 1 nonzero entry; since the sum of entries in this column is 1, this nonzero entry, let it be x_{ī j̄}, is equal to 1. Since the entries in row ī are nonnegative, sum up to 1 and x_{ī j̄} = 1, x_{ī j̄} is the only nonzero entry in its row and its column. Eliminating from X the row ī and the column j̄, we get an (n − 1) × (n − 1) double stochastic matrix. By the inductive hypothesis, this matrix is a convex combination of (n − 1) × (n − 1) permutation matrices. Augmenting every one of these matrices by the column and the row we have eliminated, we get a representation of X as a convex combination of n × n permutation matrices: X = Σ_ℓ λ_ℓ P_ℓ with nonnegative λ_ℓ summing to 1. Since X is an extreme point of Πn, such a representation is possible only if X coincides with one of the P_ℓ; thus, X is a permutation matrix, and the inductive step is complete.
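The inductive argument is constructive in spirit; here is a sketch (ours, not the author's) of the resulting Birkhoff–von Neumann decomposition, peeling permutation matrices off a double stochastic matrix with scipy's assignment solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_decompose(X, tol=1e-9):
    """Greedy Birkhoff-von Neumann decomposition: peel permutation matrices
    off a doubly stochastic X until nothing is left."""
    X = X.copy()
    terms = []
    while X.max() > tol:
        # Find a permutation supported on positive entries: cost 1 for a zero
        # cell, 0 for a positive one; an all-positive permutation has cost 0.
        cost = (X <= tol).astype(float)
        rows, cols = linear_sum_assignment(cost)
        assert cost[rows, cols].sum() == 0, "no positive permutation found"
        lam = X[rows, cols].min()               # largest weight we can remove
        P = np.zeros_like(X)
        P[rows, cols] = 1.0
        terms.append((lam, P))
        X -= lam * P                            # at least one new zero appears
    return terms

X = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.25, 0.5],
              [0.25, 0.25, 0.5]])
terms = birkhoff_decompose(X)
print(sum(l for l, _ in terms))                 # the weights sum to 1
print(np.allclose(sum(l * P for l, P in terms), X))
```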
where A is a matrix with n columns and a certain number m of rows, and b is an m-dimensional vector.
This is an “outer” description of a polyhedral set. We are about to establish an important result
on the equivalent “inner” representation of a polyhedral set.
Consider the following construction. Let us take two finite nonempty sets of vectors V (“vertices”) and R (“rays”) and build the set
M(V, R) = Conv(V) + Cone(R) = { Σ_{v∈V} λ_v v + Σ_{r∈R} μ_r r : λ_v ≥ 0, μ_r ≥ 0, Σ_{v∈V} λ_v = 1 }.
Thus, we take all vectors which can be represented as sums of convex combinations of the points
from V and conic combinations of the points from R. The set M (V, R) clearly is convex (as
the arithmetic sum of two convex sets Conv(V) and Cone(R)). The promised inner description of polyhedral sets is as follows:
Theorem B.2.13 [Inner description of a polyhedral set] The sets of the form M (V, R) are
exactly the nonempty polyhedral sets: M (V, R) is polyhedral, and every nonempty polyhedral set
M is M (V, R) for properly chosen V and R.
The polytopes M (V, {0}) = Conv(V ) are exactly the nonempty and bounded polyhedral sets.
The sets of the type M ({0}, R) are exactly the polyhedral cones (sets given by finitely many
nonstrict homogeneous linear inequalities).
Remark B.2.2 In addition to the results of the Theorem, it can be proved that in the repre-
sentation of a nonempty polyhedral set M as M = Conv(V ) + Cone (R)
– the “conic” part Cone(R) (not the set R itself!) is uniquely defined by M and is the recessive cone of M (see the Comment to Lemma B.2.3);
– if M does not contain lines, then V can be chosen as the set of all extreme points of M .
Postponing temporarily the proof of Theorem B.2.13, let us explain why this theorem is that
important – why it is so nice to know both inner and outer descriptions of a polyhedral set.
Consider a number of natural questions:
• A. Is it true that the inverse image of a polyhedral set M ⊂ Rn under an affine mapping
y 7→ P(y) = P y + p : Rm → Rn , i.e., the set
P −1 (M ) = {y ∈ Rm : P y + p ∈ M }
is polyhedral?
• B. Is it true that the image of a polyhedral set M ⊂ Rn under an affine mapping x ↦ P(x) = Px + p : Rn → Rm, i.e., the set
P(M) = {Px + p : x ∈ M},
is polyhedral?
• C. Is it true that the intersection of two polyhedral sets is again a polyhedral set?
• D. Is it true that the arithmetic sum of two polyhedral sets is again a polyhedral set?
The answers to all these questions are positive; one way to see it is to use calculus of polyhedral
representations along with the fact that polyhedrally representable sets are exactly the same as
polyhedral sets (section B.2.4). Another very instructive way is to use the just outlined results
on the structure of polyhedral sets, which we intend to do now.
It is very easy to answer A affirmatively, starting from the original – outer – definition of a polyhedral set: if M = {x : Ax ≤ b}, then, of course,
P⁻¹(M) = {y : A(Py + p) ≤ b} = {y : (AP)y ≤ b − Ap}
is a polyhedral set.
Note that in this computation we used two rules which should be justified: Conv(V) + Conv(V′) = Conv(V + V′) and Cone(R) + Cone(R′) = Cone(R ∪ R′). The second is evident from the definition of the conic hull, and only the first needs simple reasoning. To prove it, note that Conv(V) + Conv(V′) is a convex set which contains V + V′ and therefore contains Conv(V + V′). The inverse inclusion is proved as follows: if
x = Σ_i λ_i v_i,  y = Σ_j λ′_j v′_j
are convex combinations of points from V, resp., V′, then, as is immediately seen (please check!),
x + y = Σ_{i,j} λ_i λ′_j (v_i + v′_j)
– a convex combination of points from V + V′.
max_x { c^Tx : Ax ≤ b };   (P)
here c is a given n-dimensional vector – the objective, A is a given m × n constraint matrix and b ∈ Rm is the right hand side vector. Note that (P) is called a “Linear Programming program in the canonical form”; there are other equivalent forms of the problem. The program (P) is called:
• feasible, if it admits a feasible solution, i.e., the system Ax ≤ b is solvable, and infeasible
otherwise;
• bounded, if it is feasible and the objective is above bounded on the feasible set, and
unbounded, if it is feasible, but the objective is not bounded from above on the feasible
set;
• solvable, if it is feasible and the optimal solution exists – the objective attains its maximum
on the feasible set.
If the program is bounded, then the upper bound of the values of the objective on the feasible set is a real number; this real is called the optimal value of the program and is denoted by c∗. It is convenient to assign an optimal value to unbounded and infeasible programs as well – for an unbounded program it is, by definition, +∞, and for an infeasible one it is −∞.
Note that our terminology is aimed to deal with maximization programs; if the program
is to minimize the objective, the terminology is updated in the natural way: when defining
bounded/unbounded programs, we should speak about below boundedness rather than about the
above boundedness of the objective, etc. E.g., the optimal value of an unbounded minimization
program is −∞, and of an infeasible one it is +∞. This terminology is consistent with the usual
way of converting a minimization problem into an equivalent maximization one by replacing
the original objective c with −c: the properties of feasibility – boundedness – solvability remain
unchanged, and the optimal value in all cases changes its sign.
I have said that you for sure know the above terminology; this is not exactly true, since you
definitely have heard and used the words “infeasible LP program”, “unbounded LP program”,
but hardly used the words “bounded LP program” – only the “solvable” one. This indeed is
true, although absolutely unclear in advance – a bounded LP program always is solvable. We have already established this fact, even twice — via Fourier-Motzkin elimination (section B.2.4) and via the LP Duality Theorem. Let us reestablish this fact, fundamental for Linear Programming, with the tools we have at our disposal now.
Theorem B.2.14 (i) A Linear Programming program is solvable if and only if it is bounded.
(ii) If the program is solvable and the feasible set of the program does not contain lines, then
at least one of the optimal solutions is an extreme point of the feasible set.
Proof. (i): The “only if” part of the statement is tautological: the definition of solvability
includes boundedness. What we should prove is the “if” part – that a bounded program is
solvable. This is immediately given by the inner description of the feasible set M of the program:
this is a polyhedral set, so that being nonempty (as it is for a bounded program), it can be
represented as
M (V, R) = Conv(V ) + Cone (R)
for some nonempty finite sets V and R. I claim first of all that since (P) is bounded, the inner product of c with every vector from R is nonpositive. Indeed, otherwise there would be r ∈ R with c^Tr > 0; since M(V, R) contains, along with every point x of it, the entire ray {x + tr : t ≥ 0}, and the objective evidently is unbounded above on this ray, it would be unbounded above on M, which is not the case.
Now let us choose in the finite and nonempty set V the point, let it be called v ∗ , which
maximizes the objective on V . I claim that v ∗ is an optimal solution to (P), so that (P) is
solvable. The justification of the claim is immediate: v∗ clearly belongs to M; now, a generic point of M = M(V, R) is
x = Σ_{v∈V} λ_v v + Σ_{r∈R} μ_r r
with nonnegative λ_v and μ_r and with Σ_v λ_v = 1, so that
c^Tx = Σ_v λ_v c^Tv + Σ_r μ_r c^Tr
     ≤ Σ_v λ_v c^Tv    [since μ_r ≥ 0 and c^Tr ≤ 0, r ∈ R]
     ≤ Σ_v λ_v c^Tv∗   [since λ_v ≥ 0 and c^Tv ≤ c^Tv∗]
     = c^Tv∗           [since Σ_v λ_v = 1]
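A numeric rendering of this chain (ours, with made-up V, R and c): once c^Tr ≤ 0 for all rays, no feasible point sampled from M(V, R) beats the best point of V.

```python
import numpy as np

rng = np.random.default_rng(2)

# Inner description M(V, R) = Conv(V) + Cone(R); data chosen for illustration.
V = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])   # "vertices"
R = np.array([[-1.0, 0.0], [-1.0, 1.0]])             # "rays"
c = np.array([1.0, 0.5])

assert (R @ c <= 0).all()        # boundedness of max c^T x over M(V, R)
best_vertex_value = (V @ c).max()

# Sample generic points x = sum_v lam_v v + sum_r mu_r r and compare objectives.
for _ in range(10000):
    lam = rng.dirichlet(np.ones(len(V)))             # lam >= 0, sums to 1
    mu = rng.exponential(size=len(R))                # mu >= 0
    x = lam @ V + mu @ R
    assert c @ x <= best_vertex_value + 1e-9
print(best_vertex_value)         # the optimal value, attained at a point of V
```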
(ii): if the feasible set of (P), let it be called M , does not contain lines, it, being convex
and closed (as a polyhedral set) possesses extreme points. It follows that (ii) is valid in the
trivial case when the objective of (ii) is constant on the entire feasible set, since then every
extreme point of M can be taken as the desired optimal solution. The case when the objective
is nonconstant on M can be immediately reduced to the aforementioned trivial case: if x∗ is
an optimal solution to (P) and the linear form cT x is nonconstant on M , then the hyperplane
Π = {x : cT x = c∗ } is supporting to M at x∗ ; the set Π ∩ M is closed, convex, nonempty and
does not contain lines, therefore it possesses an extreme point x∗∗ which, on one hand, clearly is an optimal solution to (P), and on the other hand is an extreme point of M by Lemma B.2.4.
B.2.9.C.1. Structure of a bounded polyhedral set. Let us start with proving a significant
part of Theorem B.2.13 – the one describing bounded polyhedral sets.
Theorem B.2.15 [Structure of a bounded polyhedral set] A bounded and nonempty polyhedral set M in Rn is a polytope, i.e., is the convex hull of a finite nonempty set:
M = Conv(V)
for properly chosen finite nonempty V; and vice versa, every polytope is a bounded and nonempty polyhedral set.
Proof. The first part of the statement – that a bounded nonempty polyhedral set is a polytope
– is readily given by the Krein-Milman Theorem combined with Corollary B.2.1. Indeed, a
polyhedral set always is closed (as a set given by nonstrict inequalities involving continuous
functions) and convex; if it is also bounded and nonempty, it, by the Krein-Milman Theorem,
is the convex hull of the set V of its extreme points; V is finite by Corollary B.2.1.
Now let us prove the more difficult part of the statement – that a polytope is a bounded
polyhedral set. The fact that a convex hull of a finite set is bounded is evident. Thus, all we
need is to prove that the convex hull of finitely many points is a polyhedral set. To this end
note that this convex hull clearly is polyhedrally representable:
Conv{v_1, ..., v_N} = {x : ∃λ : λ ≥ 0, Σ_i λ_i = 1, x = Σ_i λ_i v_i}.
B.2.9.C.2. Structure of a general polyhedral set: completing the proof. Now let us
prove the general Theorem B.2.13. The proof basically follows the lines of the one of Theorem
B.2.15, but with one elaboration: now we cannot use the Krein-Milman Theorem to take upon
itself part of our difficulties.
To simplify language let us call VR-sets (“V” from “vertex”, “R” from rays) the sets of the
form M (V, R), and P-sets the nonempty polyhedral sets. We should prove that every P-set is a
VR-set, and vice versa. We start with proving that every P-set is a VR-set.
B.2.9.C.2.A. P⇒VR:
P⇒VR, Step 1: reduction to the case when the P-set does not contain lines.
Let M be a P-set, so that M is the set of all solutions to a solvable system of linear inequalities:
M = {x ∈ Rn : Ax ≤ b} (B.2.19)
with m × n matrix A. Such a set may contain lines; if h is the direction of a line in M , then
A(x + th) ≤ b for some x and all t ∈ R, which is possible only if Ah = 0. Vice versa, if h is
from the kernel of A, i.e., if Ah = 0, then the line x + Rh with x ∈ M clearly is contained in
M . Thus, we come to the following fact:
Lemma B.2.5 A nonempty polyhedral set (B.2.19) contains lines if and only if the
kernel of A is nontrivial, and the nonzero vectors from the kernel are exactly the
directions of lines contained in M : if M contains a line with direction h, then h ∈
KerA, and vice versa: if 0 6= h ∈ KerA and x ∈ M , then M contains the entire line
x + Rh.
Given a nonempty set (B.2.19), let us denote by L the kernel of A and by L⊥ the orthogonal complement to the kernel, and let M′ be the cross-section of M by L⊥:
M′ = {x ∈ L⊥ : Ax ≤ b}.
The set M′ clearly does not contain lines (since the direction of every line in M′, on one hand, should belong to L⊥ due to M′ ⊂ L⊥, and on the other hand should belong to L = Ker A, since a line in M′ ⊂ M is a line in M as well). The set M′ is nonempty and, moreover, M = M′ + L. Indeed, M′ contains the orthogonal projections of all points from M onto L⊥ (since to project a point onto L⊥, you should move from this point along a certain line with the direction in L, and all these movements, started in M, keep you in M by the Lemma) and therefore is nonempty, first, and is such that M′ + L ⊃ M, second. On the other hand, M′ ⊂ M and M + L = M by Lemma B.2.5, whence M′ + L ⊂ M. Thus, M′ + L = M.
Finally, M′ is a polyhedral set together with M, since the inclusion x ∈ L⊥ can be represented by dim L linear equations (i.e., by 2 dim L nonstrict linear inequalities): you should say that x is orthogonal to dim L somehow chosen vectors a_1, ..., a_{dim L} forming a basis in L.
The results of our effort are as follows: given an arbitrary P-set M, we have represented it as the sum of a P-set M′ not containing lines and a linear subspace L. With this decomposition in mind we see that in order to achieve our current goal – to prove that every P-set is a VR-set – it suffices to prove the same statement for P-sets not containing lines. Indeed, given that M′ = M(V, R) and denoting by R₀ a finite set such that L = Cone(R₀) (to get R₀, take the set of 2 dim L vectors ±a_i, i = 1, ..., dim L, where a_1, ..., a_{dim L} is a basis in L), we would obtain
M = M′ + L
  = [Conv(V) + Cone(R)] + Cone(R₀)
  = Conv(V) + [Cone(R) + Cone(R₀)]
  = Conv(V) + Cone(R ∪ R₀)
  = M(V, R ∪ R₀).
We see that in order to establish that a P-set is a VR-set it suffices to prove the same
statement for the case when the P-set in question does not contain lines.
P⇒VR, Step 2: the P-set does not contain lines. Our situation is as follows: we are
given a not containing lines P-set in Rn and should prove that it is a VR-set. We shall prove
this statement by induction on the dimension n of the space. The case of n = 0 is trivial. Now
assume that the statement in question is valid for n ≤ k, and let us prove that it is valid also
for n = k + 1. Let M be a not containing lines P-set in R^{k+1}:
M = {x ∈ R^{k+1} : a_i^Tx ≤ b_i, i = 1, ..., m} ≡ {x : Ax ≤ b}.   (B.2.20)
Without loss of generality we may assume that all ai are nonzero vectors (since M is nonempty,
the inequalities with ai = 0 are valid on the entire Rn , and removing them from the system, we
do not vary its solution set). Note that m > 0 – otherwise M would contain lines, since k ≥ 0.
1⁰. We may assume that M is unbounded – otherwise the desired result is given already by Theorem B.2.15. By Exercise B.18, there exists a recessive direction r ≠ 0 of M. Thus, M contains the ray {x + tr : t ≥ 0}, whence, by Lemma B.2.3, M + Cone({r}) = M.
2⁰. For every i ≤ m, where m is the row size of the matrix A from (B.2.20), that is, the
number of linear inequalities in the description of M , let us denote by Mi the corresponding
“facet” of M – the polyhedral set given by the system of inequalities (B.2.20) with the inequality
aTi x ≤ bi replaced by the equality aTi x = bi . Some of these “facets” can be empty; let I be the
set of indices i of nonempty Mi ’s.
When i ∈ I, the set Mi is a nonempty polyhedral set – i.e., a P-set – which does not
contain lines (since Mi ⊂ M and M does not contain lines). Besides this, Mi belongs to the
hyperplane {aTi x = bi }, i.e., actually it is a P-set in Rk . By the inductive hypothesis, we have
representations
Mi = M (Vi , Ri ), i ∈ I,
I claim that
M = Conv(∪_{i∈I} V_i) + Cone({r} ∪ ∪_{i∈I} R_i),   (B.2.21)
where r is the recessive direction of M found in 1⁰; after the claim is supported, our induction will be completed.
To prove (B.2.21), note, first of all, that the right hand side of this relation is contained in
the left hand side one. Indeed, since Mi ⊂ M and Vi ⊂ Mi , we have Vi ⊂ M , whence also
V = ∪i Vi ⊂ M ; since M is convex, we have
Conv(V ) ⊂ M. (B.2.22)
B.2.9.C.2.B. VR⇒P: We already know that every P-set is a VR-set. Now we shall prove
that every VR-set is a P-set, thus completing the proof of Theorem B.2.13. This is immediate:
a VR-set is polyhedrally representable (why?) and thus is a P-set by Theorem B.2.5.
Appendix C
Convex functions
If the above inequality is strict whenever x 6= y and 0 < λ < 1, f is called strictly convex.
A function f such that −f is convex is called concave; the domain Q of a concave function
should be convex, and the function itself should satisfy the inequality opposite to (C.1.1):
f(λx + (1 − λ)y) ≥ λf(x) + (1 − λ)f(y)  ∀x, y ∈ Q ∀λ ∈ [0, 1].
The simplest example of a convex function is an affine function
f(x) = a^Tx + b
– the sum of a linear form and a constant. This function clearly is convex on the entire space,
and the “convexity inequality” for it is equality. An affine function is both convex and concave;
it is easily seen that a function which is both convex and concave on the entire space is affine.
Here are several elementary examples of “nonlinear” convex functions of one variable:
• functions convex on the whole axis:
x^{2p}, p a positive integer;
exp{x};
• functions convex on the nonnegative ray:
x^p, 1 ≤ p;
−x^p, 0 ≤ p ≤ 1;
x ln x;
More examples of convex functions: norms. Equipped with Proposition C.1.1, we can
extend our initial list of convex functions (several one-dimensional functions and affine ones) by more examples – norms. Let π(x) be a norm on Rn (see Section B.1.2.B). To the moment we know three examples of norms – the Euclidean norm ‖x‖₂ = √(x^Tx), the 1-norm ‖x‖₁ = Σ_i |x_i| and the ∞-norm ‖x‖∞ = max_i |x_i|. It was also claimed (although not proved) that these are three members of an infinite family of norms
‖x‖_p = (Σ_{i=1}^{n} |x_i|^p)^{1/p},  1 ≤ p ≤ ∞
(the right hand side of the latter relation for p = ∞ is, by definition, max_i |x_i|).
We are about to prove that every norm is convex:
Proposition C.1.2 Let π(x) be a real-valued function on Rn which is positively homogeneous
of degree 1:
π(tx) = tπ(x) ∀x ∈ Rn , t ≥ 0.
π is convex if and only if it is subadditive:
π(x + y) ≤ π(x) + π(y)  ∀x, y ∈ Rn.
In particular, a norm (which by definition is positively homogeneous of degree 1 and subadditive) is convex.
The following elementary observation is, I believe, one of the most useful observations in the
world:
Proposition C.1.3 [Jensen's inequality] Let f be convex and Q be the domain of f. Then for every convex combination Σ_{i=1}^{N} λ_i x_i of points x_i ∈ Q one has
f(Σ_{i=1}^{N} λ_i x_i) ≤ Σ_{i=1}^{N} λ_i f(x_i).
The proof is immediate: the points (f(x_i), x_i) clearly belong to the epigraph of f; since f is convex, its epigraph is a convex set, so that the convex combination
Σ_{i=1}^{N} λ_i (f(x_i), x_i) = (Σ_{i=1}^{N} λ_i f(x_i), Σ_{i=1}^{N} λ_i x_i)
of the points also belongs to Epi(f). By definition of the epigraph, the latter means exactly that
Σ_{i=1}^{N} λ_i f(x_i) ≥ f(Σ_{i=1}^{N} λ_i x_i).
Note that the definition of convexity of a function f is exactly the requirement on f to satisfy
the Jensen inequality for the case of N = 2; we see that to satisfy this inequality for N = 2 is
the same as to satisfy it for all N .
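A tiny empirical check of Jensen's inequality (ours; f = exp is one of the convex functions listed earlier):

```python
import numpy as np

rng = np.random.default_rng(3)
f = np.exp                       # a convex function on the whole axis

# Jensen: f(sum_i lam_i x_i) <= sum_i lam_i f(x_i) for any convex combination.
for _ in range(10000):
    N = rng.integers(2, 8)
    x = rng.normal(size=N)
    lam = rng.dirichlet(np.ones(N))      # lam >= 0, sums to 1
    assert f(lam @ x) <= lam @ f(x) + 1e-12
print("Jensen's inequality held on all sampled convex combinations")
```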
Proposition C.1.4 [convexity of level sets] Let f be a convex function with the domain Q.
Then, for every real α, the set
lev_α(f) = {x ∈ Q : f(x) ≤ α}
is convex.
The proof takes one line: if x, y ∈ levα (f ) and λ ∈ [0, 1], then f (λx + (1 − λ)y) ≤ λf (x) + (1 −
λ)f (y) ≤ λα + (1 − λ)α = α, so that λx + (1 − λ)y ∈ levα (f ).
Note that the convexity of level sets does not characterize convex functions; there are non-
convex functions which share this property (e.g., every monotone function on the axis). The
“proper” characterization of convex functions in terms of convex sets is given by Proposition
C.1.1 – convex functions are exactly the functions with convex epigraphs. Convexity of level sets specifies a wider family of functions, the so-called quasiconvex ones.
If the expression in the right hand side involves infinities, it is assigned the value according to
the standard and reasonable conventions on what are arithmetic operations in the “extended
real axis” R ∪ {+∞} ∪ {−∞}:
• the sum of +∞ and a real, same as the sum of +∞ and +∞ is +∞; similarly, the sum
of a real and −∞, same as the sum of −∞ and −∞ is −∞. The sum of +∞ and −∞ is
undefined;
• the product of a real and +∞ is +∞, 0 or −∞, depending on whether the real is positive,
zero or negative, and similarly for the product of a real and −∞. The product of two "infinities" is again an infinity, with the usual rule for assigning the sign to the product.
Note that it is not clear in advance that our new definition of a convex function is equivalent to the initial one: initially we included in the definition the requirement for the domain to be convex, and now we omit indicating this requirement explicitly. In fact, of course, the definitions are equivalent: convexity of Dom f – i.e., of the set where f is finite – is an immediate consequence of the "convexity inequality" (C.1.2).
It is convenient to think of a convex function as of something which is defined everywhere,
since it saves a lot of words. E.g., with this convention I can write f + g (f and g are convex
functions on Rn ), and everybody will understand what is meant; without this convention, I
am supposed to add to this expression the explanation as follows: “f + g is a function with
the domain being the intersection of those of f and g, and in this intersection it is defined as
(f + g)(x) = f (x) + g(x)”.
convexity of the objective f and the constraints gi is crucial: it turns out that problems with this
property possess nice theoretical properties (e.g., the local necessary optimality conditions for
these problems are sufficient for global optimality); and what is much more important, convex
problems can be efficiently (both in theoretical and, to some extent, in the practical meaning
of the word) solved numerically, which is not, unfortunately, the case for general nonconvex
problems. This is why it is so important to know how one can detect convexity of a given
function. This is the issue we are coming to.
The scheme of our investigation is typical for mathematics. Let me start with the example
which you know from Analysis. How do you detect continuity of a function? Of course, there is
a definition of continuity in terms of ε and δ, but it would be an actual disaster if each time we need to prove continuity of a function, we were supposed to write down the proof that "for every positive ε there exists positive δ such that ...". In fact we use another approach: we list once and for all a number of standard operations which preserve continuity, like addition, multiplication, taking superpositions, etc., and point out a number of standard examples of continuous functions – like the power function, the exponent, etc. To prove that the operations in the list preserve continuity, same as to prove that the standard functions are continuous, takes a certain effort and indeed is done in ε − δ terms; but after this effort is once invested, we normally have no difficulties with proving continuity of a given function: it suffices to demonstrate that the function can be obtained, in finitely many steps, from our "raw materials" – the standard functions which are known to be continuous – by applying our machinery – the combination rules which preserve continuity. Normally this demonstration is given by a single word "evident" or even is understood by default.
This is exactly the case with convexity. Here we also should point out the list of operations
which preserve convexity and a number of standard convex functions.
Remark C.2.1 The expression F (f1 (x), ..., fk (x)) makes no evident sense at a point x
where some of fi ’s are +∞. By definition, we assign the superposition at such a point the
value +∞.
[To justify the rule, note that if λ ∈ (0, 1) and x, x′ ∈ Dom φ, then z = f(x), z′ = f(x′) are vectors from Rk which belong to Dom F, and due to the convexity of the components of f we have
f(λx + (1 − λ)x′) ≤ λz + (1 − λ)z′;
in particular, the left hand side is a vector from Rk – it has no "infinite entries", and we may further use the monotonicity of F:
φ(λx + (1 − λ)x′) = F(f(λx + (1 − λ)x′)) ≤ F(λz + (1 − λ)z′)
and now use the convexity of F:
F(λz + (1 − λ)z′) ≤ λF(z) + (1 − λ)F(z′)
to get the required relation
φ(λx + (1 − λ)x′) ≤ λφ(x) + (1 − λ)φ(x′).]
Imagine how many extra words would be necessary here if there were no convention on the value
of a convex function outside its domain!
Two more rules are as follows:
• [stability under partial minimization] if f(x, y) : R^n_x × R^m_y → R ∪ {+∞} is convex (as a function of z = (x, y); this is called joint convexity) and the function
g(x) = inf_y f(x, y)
is proper, i.e., is > −∞ everywhere and is finite at least at one point, then g is convex (for a small numeric illustration, see the sketch after this list)
[this can be proved as follows. We should prove that if x, x′ ∈ Dom g and x″ = λx + (1 − λ)x′ with λ ∈ [0, 1], then x″ ∈ Dom g and g(x″) ≤ λg(x) + (1 − λ)g(x′). Given positive ε, we can find y and y′ such that (x, y) ∈ Dom f, (x′, y′) ∈ Dom f and g(x) + ε ≥ f(x, y), g(x′) + ε ≥ f(x′, y′). Taking a weighted sum of these two inequalities, we get
λg(x) + (1 − λ)g(x′) + ε ≥ λf(x, y) + (1 − λ)f(x′, y′) ≥
[since f is convex]
≥ f(λx + (1 − λ)x′, λy + (1 − λ)y′) = f(x″, λy + (1 − λ)y′)
(the last ≥ follows from the convexity of f). The concluding quantity in the chain is ≥ g(x″), and we get g(x″) ≤ λg(x) + (1 − λ)g(x′) + ε. In particular, x″ ∈ Dom g (recall that g is assumed to take only values from R and the value +∞). Moreover, since the resulting inequality is valid for all ε > 0, we come to g(x″) ≤ λg(x) + (1 − λ)g(x′), as required.]
• the “conic transformation” of a convex function f on Rn – the function g(y, x) =
yf (x/y) – is convex in the half-space y > 0 in Rn+1 .
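Here is the sketch promised in the partial-minimization rule above (ours; the particular f and the use of scipy's scalar minimizer are illustrative assumptions): we compute g(x) = inf_y f(x, y) numerically and test midpoint convexity on a grid.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def f(x, y):
    """A jointly convex function of (x, y): a sum of convex terms in (x, y)."""
    return (x - y) ** 2 + abs(y)

def g(x):
    """Partial minimization g(x) = inf_y f(x, y), computed numerically."""
    return minimize_scalar(lambda y: f(x, y), bounds=(-100, 100),
                           method="bounded").fun

# Midpoint convexity check of g on a grid of pairs:
xs = np.linspace(-3, 3, 25)
for a in xs:
    for b in xs:
        assert g(0.5 * (a + b)) <= 0.5 * g(a) + 0.5 * g(b) + 1e-5
print("g passed the midpoint convexity test")
```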
Now we know what are the basic operations preserving convexity. Let us look what are the
standard functions these operations can be applied to. A number of examples was already given,
but we still do not know why the functions in the examples are convex. The usual way to check
convexity of a “simple” – given by a simple formula – function is based on differential criteria
of convexity. Let us look what are these criteria.
or, which is the same (write f(z) as λf(z) + (1 − λ)f(z)), that
(f(z) − f(x))/(z − x) ≤ (f(y) − f(z))/(y − z).
But in this equivalent form the inequality is evident: by the Lagrange Mean Value Theorem, its left hand side is f′(ξ) with some ξ ∈ (x, z), while the right hand one is f′(η) with some η ∈ (z, y). Since f′ is nondecreasing and ξ ≤ z ≤ η, we have f′(ξ) ≤ f′(η), and the left hand side in the inequality we should prove indeed is ≤ the right hand one.
(ii) is an immediate consequence of (i), since, as we know from the very beginning of Calculus, a differentiable function – in the case in question, f′ – is monotonically nondecreasing on an interval if and only if its derivative is nonnegative on this interval.
In fact, for functions of one variable there is a differential criterion of convexity which does
not assume any smoothness (we shall not prove this criterion):
Proof of Proposition C.2.2: Let x, y ∈ M and z = λx + (1 − λ)y, λ ∈ [0, 1], and let us
prove that
f (z) ≤ λf (x) + (1 − λ)f (y).
As we know from Theorem B.1.1.(iii), there exist sequences xi ∈ ri M and yi ∈ ri M con-
verging, respectively to x and to y. Then zi = λxi + (1 − λ)yi converges to z as i → ∞, and
since f is convex on ri M, we have
f(z_i) ≤ λf(x_i) + (1 − λ)f(y_i);
passing to the limit and taking into account that f is continuous on M and x_i, y_i, z_i converge, as i → ∞, to x, y, z ∈ M, respectively, we obtain the required inequality.
From Propositions C.2.1.(ii) and C.2.2 we get the following convenient necessary and sufficient
condition for convexity of a smooth function of n variables:
Corollary C.2.1 [convexity criterion for smooth functions on Rn ]
Let f : Rn → R ∪ {+∞} be a function. Assume that the domain Q of f is a convex set with
a nonempty interior and that f is
• continuous on Q
and
• twice differentiable on int Q.
Then f is convex on Q if and only if
h^Tf″(x)h ≥ 0  ∀x ∈ int Q ∀h ∈ Rn.
Proof. The "only if" part is evident: if f is convex and x ∈ Q₀ = int Q, then the function of one variable
g(t) = f(x + th)
(h is an arbitrary fixed direction in Rn) is convex in a certain neighbourhood of the point t = 0 on the axis (recall that affine substitutions of argument preserve convexity). Since f is twice differentiable in a neighbourhood of x, g is twice differentiable in a neighbourhood of t = 0, so that g″(0) = h^Tf″(x)h ≥ 0 by Proposition C.2.1.
Now let us prove the "if" part, so that we are given that h^Tf″(x)h ≥ 0 for every x ∈ int Q and every h ∈ Rn, and we should prove that f is convex.
Let us first prove that f is convex on the interior Q₀ of the domain Q. By Theorem B.1.1, Q₀ is a convex set. Since, as was already explained, the convexity of a function on a convex set is a one-dimensional fact, all we should prove is that every one-dimensional function
Applying the combination rules preserving convexity to simple functions which pass the "infinitesimal" convexity tests, we can prove convexity of many complicated functions. Consider,
e.g., an exponential posynomial – a function
f(x) = Σ_{i=1}^{N} c_i exp{a_i^Tx}
with positive coefficients c_i (this is why the function is called posynomial). How could we prove that the function is convex? This is immediate:
exp{t} is convex (since its second order derivative is positive and therefore the first derivative
is monotone, as required by the infinitesimal convexity test for smooth functions of one variable);
consequently, all functions exp{a_i^Tx} are convex (stability of convexity under affine substitutions of argument);
consequently, f is convex (stability of convexity under taking linear combinations with non-
negative coefficients).
And if we were supposed to prove that the maximum of three posynomials is convex? Ok,
we could add to our three steps the fourth, which refers to stability of convexity under taking
pointwise supremum.
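To connect this chain with the differential criterion, here is a numeric check (ours, with random data): the Hessian of an exponential posynomial is Σ_i c_i exp{a_i^Tx} a_i a_i^T, a nonnegative combination of rank-one matrices, hence positive semidefinite everywhere.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 3, 5
a = rng.normal(size=(N, n))          # exponents a_i
c = rng.uniform(0.5, 2.0, size=N)    # positive coefficients: a posynomial

def hessian(x):
    """Exact Hessian sum_i c_i exp(a_i^T x) a_i a_i^T of the posynomial."""
    w = c * np.exp(a @ x)
    return a.T @ (w[:, None] * a)

# The Hessian is PSD everywhere (here: at random points), so f is convex:
for _ in range(200):
    x = rng.normal(size=n)
    assert np.linalg.eigvalsh(hessian(x)).min() >= -1e-9
print("Hessian PSD at all sampled points; f is convex by Corollary C.2.1")
```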
yτ = x + τ (y − x), 0 < τ ≤ 1,
so that y1 = y and yτ is an interior point of the segment [x, y] for 0 < τ < 1. Now let us use the
following extremely simple
Lemma C.3.1 Let x, x0 , x00 be three distinct points with x0 ∈ [x, x00 ], and let f be
convex and finite on [x, x00 ]. Then
(f(x′) − f(x)) / ‖x′ − x‖₂ ≤ (f(x″) − f(x)) / ‖x″ − x‖₂.   (C.3.2)
‖y − x‖₂⁻¹ (y − x)^T∇f(x) ≤ ‖y − x‖₂⁻¹ (f(y) − f(x)),
is given by Proposition C.3.1. To prove the "if" part, i.e., to establish the implication inverse
to the above, assume that f satisfies the Gradient inequality for all x ∈ int Q and all y ∈ Q,
and let us verify that f is convex on Q. It suffices to prove that f is convex on the interior
Q0 of the set Q (see Proposition C.2.2; recall that by assumption f is continuous on Q and
Q is convex). To prove that f is convex on Q0 , note that Q0 is convex (Theorem B.1.1) and
that, due to the Gradient inequality, on Q0 f is the upper bound of the family of affine (and
therefore convex) functions:
f(y) = sup_{x∈Q₀} f_x(y),   f_x(y) = f(x) + (y − x)^T∇f(x).
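A one-dimensional numeric illustration of this representation (ours, for f(t) = t²): the supremum of the tangent lines recovers f.

```python
import numpy as np

# For a differentiable convex f, the Gradient inequality makes f the upper
# bound of its tangents: f(y) = sup_x [f(x) + f'(x) (y - x)].
f = lambda t: t ** 2
df = lambda t: 2 * t

xs = np.linspace(-5, 5, 2001)              # tangency points x
ys = np.linspace(-3, 3, 13)                # test points y (on the x-grid)
for y in ys:
    tangents = f(xs) + df(xs) * (y - xs)
    assert abs(tangents.max() - f(y)) < 1e-9   # sup over tangents = f(y)
print("f coincides with the supremum of its affine minorants")
```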
Remark C.4.1 All three assumptions on K – (1) closedness, (2) boundedness, and (3) K ⊂
ri Dom f – are essential, as it is seen from the following three examples:
• f(x) = 1/x, Dom f = (0, +∞), K = (0, 1]. We have (2), (3) but not (1); f is neither bounded, nor Lipschitz continuous on K.
• f (x) = x2 , Dom f = R, K = R. We have (1), (3) and not (2); f is neither bounded nor
Lipschitz continuous on K.
• f(x) = −√x, Dom f = [0, +∞), K = [0, 1]. We have (1), (2) and not (3); f is not Lipschitz continuous on K 1), although it is bounded (see the numeric illustration below). With a properly chosen convex function f of two variables and a non-polyhedral compact domain (e.g., with Dom f being the unit circle), we could demonstrate also that lack of (3), even in presence of (1) and (2), may cause unboundedness of f on K as well.
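The numeric illustration promised in the third example (ours): the difference quotients of −√x on [0, 1] blow up near 0, so no Lipschitz constant exists.

```python
import numpy as np

# f(x) = -sqrt(x) is convex and bounded on K = [0, 1], but K touches the
# boundary of Dom f, and Lipschitz continuity fails: difference quotients
# near x = 0 grow without bound.
f = lambda x: -np.sqrt(x)
for h in [1e-2, 1e-4, 1e-6, 1e-8]:
    print(h, abs(f(h) - f(0.0)) / h)   # = 1/sqrt(h) -> infinity as h -> 0
```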
Remark C.4.2 Theorem C.4.1 says that a convex function f is bounded on every compact (i.e., closed and bounded) subset of the relative interior of Dom f. In fact there is a much stronger statement on the below boundedness of f: f is below bounded on any bounded subset of Rn!
Proof of Theorem C.4.1. We shall start with the following local version of the Theorem.
Proposition C.4.1 Let f be a convex function, and let x̄ be a point from the relative interior
of the domain Dom f of f . Then
(i) f is bounded at x̄: there exists a positive r such that f is bounded in the r-neighbourhood
Ur (x̄) of x̄ in the affine span of Dom f :
(ii) f is Lipschitz continuous at x̄, i.e., there exists a positive ρ and a constant L such that
as i → ∞. Thus, the left hand side in (C.4.2) remains bounded as i → ∞. In the right hand side one factor – i – tends to ∞, and the other one has a nonzero limit ‖x − y‖, so that the right hand side tends to ∞ as i → ∞; this is the desired contradiction.
Proof of Proposition C.4.1.
1⁰. We start with proving the boundedness of f from above in a neighbourhood of x̄. This is immediate: we know that there exists a neighbourhood Ur̄(x̄) which is contained in Dom f (since, by assumption, x̄ is a relative interior point of Dom f). Now, we can find a small simplex ∆ of the dimension m = dim Aff(Dom f) with the vertices x0, ..., xm in Ur̄(x̄) in such a way that x̄ will be a convex combination of the vectors xi with positive coefficients, even with the coefficients 1/(m + 1):
x̄ = Σ_{i=0}^{m} (1/(m + 1)) x_i. 2)
We know that x̄ is the point from the relative interior of ∆ (Exercise B.8); since ∆ spans the
same affine subspace as Dom f , it means that ∆ contains Ur (x̄) with certain r > 0. Now, we
have
∆ = { Σ_{i=0}^{m} λ_i x_i : λ_i ≥ 0, Σ_i λ_i = 1 }
so that in ∆ f is bounded from above by the quantity max_{0≤i≤m} f(x_i) by Jensen's inequality:
f(Σ_{i=0}^{m} λ_i x_i) ≤ Σ_{i=0}^{m} λ_i f(x_i) ≤ max_i f(x_i).
whence
f(x) ≥ 2f(x̄) − f(x′) ≥ 2f(x̄) − C,  x ∈ Ur(x̄),
x, x′ ∈ U_{r/2}, x ≠ x′. Let us extend the segment [x, x′] through the point x′ until it reaches, at a certain point x″, the (relative) boundary of Ur. We have
whence
|f(x) − f(x′)| ≤ (4C/r)‖x − x′‖₂,  x, x′ ∈ U_{r/2},
as required in (ii).
Theorem C.5.1 ["Unimodality"] Let f be a convex function on a convex set Q ⊂ Rn, and let x∗ ∈ Q ∩ Dom f be a local minimizer of f on Q:
(∃r > 0) : f(y) ≥ f(x∗)  ∀y ∈ Q, ‖y − x∗‖₂ < r.
Then x∗ is a global minimizer of f on Q. Moreover, the set Argmin_Q f of all minimizers of f on Q is convex, and if f is strictly convex, then this set, if nonempty, is a singleton.
Since x∗ is a local minimizer of f , the left hand side in this inequality is nonnegative for all small
enough values of τ > 0. We conclude that the right hand side is nonnegative, i.e., f (y) ≥ f (x∗ ).
2) To prove convexity of Argmin_Q f, note that Argmin_Q f is nothing but the level set lev_α(f) of f associated with the minimal value α = min_Q f of f on Q; as a level set of a convex function, this set is convex (Proposition C.1.4).
3) To prove that the set Argmin_Q f associated with a strictly convex f is, if nonempty, a singleton, note that if there were two distinct minimizers x′, x″, then, by strict convexity, we would have
f((1/2)x′ + (1/2)x″) < (1/2)[f(x′) + f(x″)] = min_Q f,
which clearly is impossible – the argument in the left hand side is a point from Q!
Another pleasant fact is that in the case of differentiable convex functions the necessary optimality condition known from Calculus (the Fermat rule) is sufficient for global optimality:
Theorem C.5.2 [Necessary and sufficient optimality condition for a differentiable convex func-
tion]
Let f be convex function on convex set Q ⊂ Rn , and let x∗ be an interior point of Q. Assume
that f is differentiable at x∗ . Then x∗ is a minimizer of f on Q if and only if
∇f (x∗ ) = 0.
Proof. As a necessary condition for local optimality, the relation ∇f (x∗ ) = 0 is known from
Calculus; it has nothing in common with convexity. The essence of the matter is, of course, the
sufficiency of the condition ∇f (x∗ ) = 0 for global optimality of x∗ in the case of convex f . This
sufficiency is readily given by the Gradient inequality (C.3.1): by virtue of this inequality and
due to ∇f (x∗ ) = 0,
f(y) ≥ f(x∗) + (y − x∗)^T∇f(x∗) = f(x∗)
for all y ∈ Q.
A natural question is what happens if x∗ in the above statement is not necessarily an interior
point of Q. Thus, assume that x∗ is an arbitrary point of a convex set Q and that f is convex
on Q and differentiable at x∗ (the latter means exactly that Dom f contains a neighbourhood
of x∗ and f possesses the first order derivative at x∗ ). Under these assumptions, when x∗ is a
minimizer of f on Q?
The answer is as follows: let
TQ (x∗ ) = {h ∈ Rn : x∗ + th ∈ Q ∀ small enough t > 0}
be the radial cone of Q at x∗ ; geometrically, this is the set of all directions leading from x∗
inside Q, so that a small enough positive step from x∗ along the direction keeps the point in
Q. From the convexity of Q it immediately follows that the radial cone indeed is a convex cone
(not necessary closed). E.g., when x∗ is an interior point of Q, then the radial cone to Q at x∗
clearly is the entire Rn . A more interesting example is the radial cone to a polyhedral set
Q = {x : a_i^Tx ≤ b_i, i = 1, ..., m};   (C.5.3)
for x∗ ∈ Q the corresponding radial cone clearly is the polyhedral cone
{h : a_i^Th ≤ 0 ∀i : a_i^Tx∗ = b_i}   (C.5.4)
corresponding to the active at x∗ (i.e., satisfied at the point as equalities rather than as strict inequalities) constraints a_i^Tx ≤ b_i from the description of Q.
Now, for the functions in question the necessary and sufficient condition for x∗ to be a
minimizer of f on Q is as follows:
Proposition C.5.1 Let Q be a convex set, let x∗ ∈ Q, and let f be a function, convex on Q, which is differentiable at x∗. The necessary and sufficient condition for x∗ to be a minimizer of f on Q is that the derivative of f taken at x∗ along every direction from TQ(x∗) should be nonnegative:
h^T∇f(x∗) ≥ 0  ∀h ∈ TQ(x∗).
Proof is immediate. The necessity is an evident fact which has nothing in common with convexity: assuming that x∗ is a local minimizer of f on Q, we note that if there were h ∈ TQ(x∗) with h^T∇f(x∗) < 0, then we would have
f(x∗ + th) = f(x∗) + t h^T∇f(x∗) + o(t) < f(x∗)
for all small enough positive t. On the other hand, x∗ + th ∈ Q for all small enough positive t due to h ∈ TQ(x∗). Combining these observations, we conclude that in every neighbourhood of x∗ there are points from Q with values of f strictly smaller than f(x∗); this contradicts the assumption that x∗ is a local minimizer of f on Q.
The sufficiency is given by the Gradient Inequality, exactly as in the case when x∗ is an
interior point of Q.
Proposition C.5.1 says that whenever f is convex on Q and differentiable at x∗ ∈ Q, the
necessary and sufficient condition for x∗ to be a minimizer of f on Q is that the linear form given
by the gradient ∇f (x∗ ) of f at x∗ should be nonnegative at all directions from the radial cone
TQ(x∗). The linear forms nonnegative at all directions from the radial cone also form a cone; it is called the cone normal to Q at x∗ and is denoted NQ(x∗). Thus, the Proposition says that the necessary and sufficient condition for x∗ to minimize f on Q is the inclusion
∇f(x∗) ∈ NQ(x∗).   (*)
What this condition actually means depends on what the normal cone is: whenever we have an explicit description of it, we have an explicit form of the optimality condition.
E.g., when TQ (x∗ ) = Rn (it is the same as to say that x∗ is an interior point of Q), then the
normal cone is comprised of the linear forms nonnegative at the entire space, i.e., it is the trivial
cone {0}; consequently, for the case in question the optimality condition becomes the Fermat
rule ∇f (x∗ ) = 0, as we already know.
When Q is the polyhedral set (C.5.3), the radial cone is the polyhedral cone (C.5.4); it is comprised of all directions which have nonpositive inner products with all a_i coming from the active, in the aforementioned sense, constraints. The normal cone is comprised of all vectors which have nonnegative inner products with all these directions, i.e., of vectors a such that the inequality h^Ta ≥ 0 is a consequence of the inequalities h^Ta_i ≤ 0, i ∈ I(x∗) ≡ {i : a_i^Tx∗ = b_i}.
From the Homogeneous Farkas Lemma we conclude that the normal cone is simply the conic hull of the vectors −a_i, i ∈ I(x∗). Thus, in the case in question (*) reads:
x∗ ∈ Q is a minimizer of f on Q if and only if there exist nonnegative reals λ∗_i associated with "active" (those from I(x∗)) values of i such that
∇f(x∗) + Σ_{i∈I(x∗)} λ∗_i a_i = 0.
These are the famous Karush-Kuhn-Tucker optimality conditions; these conditions are necessary
for optimality in an essentially wider situation.
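A computational companion (ours; the data, the quadratic f, and the use of scipy.optimize.nnls are illustrative assumptions): at a candidate minimizer we look for nonnegative multipliers on the active constraints certifying the KKT condition.

```python
import numpy as np
from scipy.optimize import nnls

# KKT check over Q = {x : a_i^T x <= b_i}: find lambda >= 0 supported on the
# active constraints with grad f(x*) + sum_i lambda_i a_i = 0, via
# nonnegative least squares.
A = np.array([[-1.0, 0.0], [0.0, -1.0]])           # Q = nonnegative orthant
b = np.zeros(2)
grad = lambda x: 2 * (x - np.array([-1.0, 2.0]))   # f(x) = ||x - (-1, 2)||^2

x_star = np.array([0.0, 2.0])                # minimizer of f over Q
active = np.isclose(A @ x_star, b)           # I(x*): active constraint indices
G = A[active].T                              # columns are the active a_i
lam, resid = nnls(G, -grad(x_star))          # solve sum_i lam_i a_i = -grad f
print(lam, resid)                            # resid ~ 0: KKT certificate found
```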
The indicated results demonstrate that the fact that a point x∗ ∈ Dom f is a global minimizer of a convex function f depends only on the local behaviour of f at x∗. This is not the case with maximizers of a convex function. First of all, such a maximizer, if it exists, in all nontrivial cases should belong to the boundary of the domain of the function:
Theorem C.5.3 Let f be convex, and let Q be the domain of f . Assume that f attains its
maximum on Q at a point x∗ from the relative interior of Q. Then f is constant on Q.
Proof. Let y ∈ Q; we should prove that f (y) = f (x∗ ). There is nothing to prove if y = x∗ , so
that we may assume that y 6= x∗ . Since, by assumption, x∗ ∈ ri Q, we can extend the segment
[x∗ , y] through the endpoint x∗ , keeping the left endpoint of the segment in Q; in other words,
there exists a point y 0 ∈ Q such that x∗ is an interior point of the segment [y 0 , y]:
x∗ = λy′ + (1 − λ)y
with certain λ ∈ (0, 1), whence, by convexity,
f(x∗) ≤ λf(y′) + (1 − λ)f(y).
Since both f(y′) and f(y) do not exceed f(x∗) (x∗ is a maximizer of f on Q!) and both the weights λ and 1 − λ are strictly positive, the indicated inequality can be valid only if f(y′) = f(y) = f(x∗).
The next theorem gives further information on maxima of convex functions:
Theorem C.5.4 Let f be a convex function and E be a subset of the domain of f. Then
sup_{Conv E} f = sup_E f. (C.5.5)
In particular, if S is a convex compact subset of the domain of f, then
sup_S f = sup_{Ext(S)} f. (C.5.6)
Proof. To prove (C.5.5), let x ∈ Conv E, so that x is a convex combination of points from E (Theorem B.1.4 on the structure of convex hull):
x = ∑_i λ_i x_i  [x_i ∈ E, λ_i ≥ 0, ∑_i λ_i = 1].
By Jensen's Inequality,
f(x) ≤ ∑_i λ_i f(x_i) ≤ sup_E f,
so that the left hand side in (C.5.5) is ≤ the right hand one; the inverse inequality is evident, since Conv E ⊃ E.
To derive (C.5.6) from (C.5.5), it suffices to note that from the Krein-Milman Theorem
(Theorem B.2.10) for a convex compact set S one has S = ConvExt(S).
Theorem C.5.5 Let f be a convex function such that the domain Q of f is closed and does not contain lines. Then
(i) If the set
Argmax_Q f ≡ {x ∈ Q : f(x) ≥ f(y) ∀y ∈ Q}
of global maximizers of f is nonempty, then it intersects the set Ext(Q) of the extreme points of Q, so that at least one of the maximizers of f is an extreme point of Q;
(ii) If the set Q is polyhedral and f is above bounded on Q, then the maximum of f on Q is achieved: Argmax_Q f ≠ ∅.
Proof. Let us start with (i). We shall prove this statement by induction on the dimension of Q. The base dim Q = 0, i.e., the case of a singleton Q, is trivial, since here Q = Ext Q = Argmax_Q f. Now assume that the statement is valid for the case of dim Q ≤ p, and let us prove that it is valid also for the case of dim Q = p + 1. Let us first verify that the set Argmax_Q f intersects the (relative) boundary of Q. Indeed, let x ∈ Argmax_Q f. There is nothing to prove if x itself is a relative boundary point of Q; and if x is not a boundary point, then, by Theorem C.5.3, f is constant on Q, so that Argmax_Q f = Q; and since Q is closed, every relative boundary point of Q (such a point does exist, since Q does not contain lines and is of positive dimension) is a maximizer of f on Q, so that here again Argmax_Q f intersects the relative boundary of Q.
Thus, among the maximizers of f there exists at least one, let it be x, which belongs to the relative boundary of Q. Let H be the hyperplane which supports Q at x (see Section B.2.6), and let Q′ = Q ∩ H. The set Q′ is closed and convex (since Q and H are), nonempty (it contains x) and does not contain lines (since Q does not). We have
max_Q f = f(x) = max_{Q′} f
(note that Q′ ⊂ Q), whence
∅ ≠ Argmax_{Q′} f ⊂ Argmax_Q f.
Same as in the proof of the Krein-Milman Theorem (Theorem B.2.10), we have dim Q′ < dim Q. In view of this inequality we can apply to f and Q′ our inductive hypothesis to get
Ext(Q′) ∩ Argmax_{Q′} f ≠ ∅.
Since Ext(Q′) ⊂ Ext(Q) by Lemma B.2.4 and, as we have just seen, Argmax_{Q′} f ⊂ Argmax_Q f, we conclude that the set Ext(Q) ∩ Argmax_Q f is not smaller than Ext(Q′) ∩ Argmax_{Q′} f and is therefore nonempty, as required.
To prove (ii), let us use the results on the structure of a polyhedral convex set known to us from Lecture 4:
Q = Conv(V) + Cone(R),
where V and R are finite sets. We are about to prove that the upper bound of f on Q is exactly the maximum of f on the finite set V:
sup_Q f = max_{v∈V} f(v). (C.5.7)
This will mean, in particular, that f attains its maximum on Q – e.g., at the point of V where f attains its maximum on V.
To prove the announced statement, I first claim that if f is above bounded on Q, then every direction r ∈ Cone(R) is descent for f, i.e., is such that every step in this direction taken from every point x ∈ Q does not increase f:
f(x + tr) ≤ f(x)  ∀x ∈ Q, r ∈ Cone(R), t ≥ 0. (C.5.8)
Indeed, if, on the contrary, there were x ∈ Q, r ∈ Cone(R) and t ≥ 0 such that f(x + tr) > f(x), we would have t > 0 and, by Lemma C.3.1,
f(x + sr) ≥ f(x) + (s/t)(f(x + tr) − f(x)),  s ≥ t.
Since x ∈ Q and r ∈ Cone (R), x + sr ∈ Q for all s ≥ 0, and since f is above bounded on Q,
the left hand side in the latter inequality is above bounded, while the right hand one, due to
f (x + tr) > f (x), goes to +∞ as s → ∞, which is the desired contradiction.
Now we are done: to prove (C.5.7), note that a generic point x ∈ Q can be represented as
x = ∑_{v∈V} λ_v v + r  [r ∈ Cone(R); ∑_v λ_v = 1, λ_v ≥ 0],
and we have
f(x) = f(∑_{v∈V} λ_v v + r)
     ≤ f(∑_{v∈V} λ_v v)  [by (C.5.8)]
     ≤ ∑_{v∈V} λ_v f(v)  [Jensen's Inequality]
     ≤ max_{v∈V} f(v).
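For the reader who likes to experiment, here is a small Python sketch of the phenomenon just established (the data are invented for illustration): the values of a convex function at random convex combinations of the vertices of a polytope never beat the maximum over the vertices themselves:

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: float(np.sum(x**2))          # a convex function, for illustration
    V = rng.normal(size=(5, 3))                # vertices of a made-up polytope Conv(V)

    max_on_V = max(f(v) for v in V)
    W = rng.dirichlet(np.ones(len(V)), size=20000)   # random convex weight vectors
    max_inside = max(f(w @ V) for w in W)

    print(max_on_V, max_inside)
    assert max_inside <= max_on_V + 1e-9       # sup over Conv(V) is attained on V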
Recall that a convex function f can be identified with its epigraph
Epi(f) = {(t, x) ∈ R^{n+1} : t ≥ f(x)},
which is a nonempty convex set. Thus, there is no essential difference between convex functions and convex sets: a convex function generates a convex set – its epigraph – which of course remembers everything about the function. And the only specific property of the epigraph as a convex set is that it has a recessive direction – namely, e = (1, 0) – such that the intersection of the epigraph with every line directed by e is either empty, or is a closed ray. Whenever a nonempty convex set possesses such a property with respect to a certain direction, it can be represented, in properly chosen coordinates, as the epigraph of some convex function. Thus, a convex function is, basically, nothing but a way to look, in the literal meaning of the latter verb, at a convex set.
Now, we know that “actually good” convex sets are closed ones: they possess a lot of
important properties (e.g., admit a good outer description) which are not shared by arbitrary
convex sets. It means that among convex functions there also are “actually good” ones –
those with closed epigraphs. Closedness of the epigraph can be “translated” to the functional
language and there becomes a special kind of continuity – lower semicontinuity:
Definition C.6.1 [Lower semicontinuity] Let f be a function (not necessarily convex) defined on Rn and taking values in R ∪ {+∞}. We say that f is lower semicontinuous at a point x̄, if for every sequence of points {x_i} converging to x̄ one has
f(x̄) ≤ lim inf_{i→∞} f(x_i)
(here, of course, lim inf of a sequence with all terms equal to +∞ is +∞).
f is called lower semicontinuous, if it is lower semicontinuous at every point.
Proposition C.6.1 A function f defined on Rn and taking values from R ∪ {+∞} is lower
semicontinuous if and only if its epigraph is closed (e.g., due to its emptiness).
I shall not prove this statement, as well as most of the other statements in this Section; the reader definitely is able to restore the (very simple) proofs I am skipping.
An immediate consequence of the latter proposition is as follows:
Corollary: the upper bound sup_α f_α(x) of an arbitrary family of lower semicontinuous functions is lower semicontinuous.
Indeed, the epigraph of the upper bound is the intersection of the epigraphs of the functions forming the bound, and the intersection of closed sets always is closed.
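A toy numeric illustration of the definition (mine, not part of the formal development): the function equal to 1 at the origin and to 0 elsewhere violates the lower semicontinuity inequality at 0:

    import numpy as np

    f = lambda x: 1.0 if x == 0 else 0.0   # equals 1 at the origin, 0 elsewhere
    xs = 1.0 / np.arange(1, 1000)          # a sequence of points converging to 0
    liminf = min(f(x) for x in xs)         # all values along the sequence are 0
    print(liminf, f(0.0))                  # 0.0 < 1.0: f is not lower semicontinuous at 0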
Now let us look at convex lower semicontinuous functions; according to our general conven-
tion, “convex” means “satisfying the convexity inequality and finite at least at one point”,
or, which is the same, “with convex nonempty epigraph”; and as we just have seen, “lower
semicontinuous” means “with closed epigraph”. Thus, we are interested in functions with
closed convex nonempty epigraphs; to save words, let us call these functions proper.
What we are about to do is to translate to the functional language several constructions and
results related to convex sets. In everyday life, a translation (e.g., of poetry) typically results in something less rich than the original; in mathematics, in contrast, translation is a powerful source of new ideas and constructions.
containing Epi(f ) (note that what is written in the right hand side of the latter relation,
is one of many universal forms of writing down a general nonstrict linear inequality in the
space where the epigraph lives; this is the form the most convenient for us now). Thus, e
should be a recessive direction of Π ⊃ Epi(f ); as it is immediately seen, recessivity of e for
Π means exactly that α ≥ 0. Thus, speaking about closed half-spaces containing Epi(f ), we
in fact are considering some of the half-spaces (*) with α ≥ 0.
Now, there are two essentially different possibilities for α to be nonnegative – (A) to be positive, and (B) to be zero. In the case of (B) the boundary hyperplane of Π is "vertical" – it is parallel to e, and in fact it "bounds" only x – Π is comprised of all pairs (t, x) with x belonging to a certain half-space in the x-subspace and t being an arbitrary real. These "vertical" half-spaces will be of no interest for us.
The half-spaces which indeed are of interest for us are the "nonvertical" ones: those given by the case (A), i.e., with α > 0. For a non-vertical half-space Π, we always can divide the inequality defining Π by α and thus make α = 1. Thus, a "nonvertical" candidate to the role of a closed half-space containing Epi(f) always can be written down as
Π = {(t, x) : t ≥ d^T x − a}.
(!) a proper convex function is the upper bound of affine functions – all its affine
minorants.
(indeed, we already know that it is the same – to say that a function is an upper bound of
certain family of functions, and to say that the epigraph of the function is the intersection
of the epigraphs of the functions of the family).
It turns out that (!) indeed is true:
Proposition C.6.2 A proper convex function f is the upper bound of all its affine minorants. Moreover, at every point x̄ ∈ ri Dom f from the relative interior of the domain of f, f is not just the upper bound, but simply the maximum of its minorants: there exists an affine function f_x̄(x) which is ≤ f(x) everywhere in R^n and is equal to f at x = x̄.
Proof. I. We start with the “Moreover” part of the statement; this is the key to the entire
statement. Thus, we are about to prove that if x̄ ∈ ri Dom f , then there exists an affine
function fx̄ (x) which is everywhere ≤ f (x), and at x = x̄ the inequality becomes an equality.
I.1°. First of all, we easily can reduce the situation to the one when Dom f is full-dimensional.
Indeed, by shifting f we may make the affine span Aff(Dom f ) of the domain of f to be a
linear subspace L in Rn ; restricting f onto this linear subspace, we clearly get a proper
function on L. If we believe that our statement is true for the case when the domain of f is
full-dimensional, we can conclude that there exists an affine function
dT x − a [x ∈ L]
on L (d ∈ L) such that
f (x) ≥ dT x − a ∀x ∈ L; f (x̄) = dT x̄ − a.
The affine function we get clearly can be extended, by the same formula, from L to the entire R^n and is a minorant of f on the entire R^n – outside of L ⊃ Dom f, f simply is +∞! This minorant on R^n is exactly what we need.
I.2°. Now let us prove that our statement is valid when Dom f is full-dimensional, so that x̄ is an interior point of the domain of f. Let us look at the point y = (f(x̄), x̄). This is a point from the epigraph of f, and I claim that it is a point from the relative boundary of the epigraph. Indeed, if y were a relative interior point of Epi(f), then, taking y′ = y + e, we would get a segment [y′, y] contained in Epi(f); since the endpoint y of the segment is assumed to be relative interior for Epi(f), we could extend this segment a little through this endpoint, not leaving Epi(f); but this clearly is impossible, since the t-coordinate of the new endpoint would be < f(x̄), and the x-component of it still would be x̄.
Thus, y is a point from the relative boundary of Epi(f). Now I claim that y′ is an interior point of Epi(f). This is immediate: we know from Theorem C.4.1 that f is continuous at x̄, so that there exists a neighbourhood U of x̄ in Aff(Dom f) = R^n such that f(x) ≤ f(x̄) + 0.5 whenever x ∈ U, or, in other words, the set
{(t, x) : x ∈ U, t ≥ f(x̄) + 0.5}
is contained in Epi(f); this set clearly contains a neighbourhood of y′ = (f(x̄) + 1, x̄), so that y′ ∈ int Epi(f).
so that the right hand side is an affine minorant of f on Dom f and therefore – on Rn
(f = +∞ outside Dom f !). It remains to note that (#) is equality at x̄, since (&) is equality
at y.
II. We have proved that if F is the set of all affine functions which are minorants of f, then the function
f̄(x) = sup_{φ∈F} φ(x)
is equal to f on ri Dom f (and at x from the latter set, in fact, the sup in the right hand side can be replaced with max); to complete the proof of the Proposition, we should prove that f̄ is equal to f also outside ri Dom f.
II.1°. Let us first prove that f̄ is equal to f outside cl Dom f, or, which is the same, prove that f̄(x) = +∞ outside cl Dom f. This is easy: if x̄ is a point outside cl Dom f, it can be strongly separated from Dom f, see Separation Theorem (ii) (Theorem B.2.9). Thus, there exist z ∈ R^n and ζ > 0 such that
z^T x̄ ≥ z^T x + ζ  ∀x ∈ Dom f.
Besides this, we already know that there exists at least one affine minorant of f, or, which is the same, there exist a and d such that
f(x) ≥ d^T x − a  ∀x.
Consequently, for every λ ≥ 0 and every x ∈ Dom f,
f(x) ≥ d^T x − a + λ[z^T x − z^T x̄ + ζ] ≡ φ_λ(x).
This inequality clearly says that φ_λ(·) is an affine minorant of f on R^n for every λ > 0. The value of this minorant at x = x̄ is equal to d^T x̄ − a + λζ and therefore it goes to +∞ as λ → +∞. We see that the upper bound of affine minorants of f at x̄ indeed is +∞, as claimed.
II.2°. Thus, we know that the upper bound f̄ of all affine minorants of f is equal to f everywhere on the relative interior of Dom f and everywhere outside the closure of Dom f; all we should prove is that this equality is also valid at the points of the relative boundary of Dom f. Let x̄ be such a point. There is nothing to prove if f̄(x̄) = +∞, since by construction f̄ is everywhere ≤ f. Thus, we should prove that if f̄(x̄) = c < ∞, then f(x̄) = c. Since f̄ ≤ f everywhere, to prove that f(x̄) = c is the same as to prove that f(x̄) ≤ c. This is immediately given by lower semicontinuity of f: let us choose x′ ∈ ri Dom f and look what happens along a sequence of points x_i ∈ [x′, x̄) converging to x̄. All the points of this sequence are relative interior points of Dom f (Lemma B.1.1), and consequently
f(x_i) = f̄(x_i) ≤ (1 − t_i)f̄(x′) + t_i f̄(x̄)  [x_i = (1 − t_i)x′ + t_i x̄; f̄ is convex];
as i → ∞, x_i → x̄, i.e., t_i → 1, and the right hand side in our inequality converges to f̄(x̄) = c; since f is lower semicontinuous, we get f(x̄) ≤ c.
We see why “translation of mathematical facts from one mathematical language to another”
– in our case, from the language of convex sets to the language of convex functions – may
be fruitful: because we invest a lot into the process rather than run it mechanically.
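As a small numeric illustration of Proposition C.6.2 (again mine, with f(x) = x² chosen for simplicity): taking the maximum of the tangent lines of f over a grid of anchor points reconstructs f, since every tangent line is an affine minorant exact at its anchor:

    import numpy as np

    # f(x) = x^2; its tangent line at an anchor s is 2*s*x - s^2, an affine
    # minorant of f which is exact at s. Maximizing over many anchors recovers f.
    xs = np.linspace(-2.0, 2.0, 401)
    anchors = np.linspace(-3.0, 3.0, 121)
    rebuilt = np.max(np.array([2.0*s*xs - s**2 for s in anchors]), axis=0)
    print(np.max(np.abs(rebuilt - xs**2)))   # tiny (grid effects only)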
Proof. Without loss of generality we may assume that the domain of the function f is full-dimensional and that 0 is an interior point of the domain. According to Theorem C.4.1, there exists a neighbourhood U of the origin – which can be thought of as a ball of some radius r > 0 centered at the origin – where f is bounded from above by some C. Now, if R > 0 is arbitrary and x is an arbitrary point with |x| ≤ R, then the point
y = −(r/R) x
belongs to U, and we have
0 = (r/(r + R)) x + (R/(r + R)) y;
since f is convex, we conclude that
f(0) ≤ (r/(r + R)) f(x) + (R/(r + R)) f(y) ≤ (r/(r + R)) f(x) + (R/(r + R)) C,
and we get the lower bound
f(x) ≥ ((r + R)/r) f(0) − (R/r) C
for the values of f in the centered at 0 ball of radius R. In particular, f is below bounded on every bounded set.
Thus, we conclude that the closure of the epigraph of a convex function f is the epigraph of a certain function, let it be called the closure cl f of f. Of course, this latter function is convex (its epigraph is convex – it is the closure of a convex set), and since its epigraph is closed, cl f is proper. The following statement gives a direct description of cl f in terms of f:
In particular,
f (x) ≥ cl f (x)
for all x, and
f (x) = cl f (x)
whenever x ∈ ri Dom f, same as whenever x ∉ cl Dom f.
Thus, the "correction" f ↦ cl f may vary f only at the points from the relative boundary of Dom f; besides this,
Dom f ⊂ Dom cl f ⊂ cl Dom f,
whence also
ri Dom f = ri Dom cl f.
(ii) The family of affine minorants of cl f is exactly the family of affine minorants of f , so
that
cl f (x) = sup{φ(x) : φ is an affine minorant of f },
and the sup in the right hand side can be replaced with max whenever x ∈ ri Dom cl f =
ri Dom f .
[“so that” comes from the fact that cl f is proper and is therefore the upper bound of its
affine minorants]
C.6.2 Subgradients
Let f be a convex function, and let x ∈ Dom f . It may happen that there exists an affine
minorant dT x − a of f which coincides with f at x:
From the equality in the latter relation we get a = dT x − f (x), and substituting this repre-
sentation of a into the first inequality, we get
Thus, if f admits an affine minorant which is exact at x, then there exists d which gives rise
to inequality (C.6.3). Vice versa, if d is such that (C.6.3) takes place, then the right hand
side of (C.6.3), regarded as a function of y, is an affine minorant of f which is exact at x.
Now note that (C.6.3) expresses certain property of a vector d. A vector satisfying, for a
given x, this property – i.e., the slope of an exact at x affine minorant of f – is called a
subgradient of f at x, and the set of all subgradients of f at x is denoted ∂f (x).
Subgradients of convex functions play an important role in the theory and numerical methods of Convex Programming – they are quite reasonable surrogates of gradients. The most elementary properties of the subgradients are summarized in the following statement:
Proposition C.6.5 Let f be a convex function and x be a point from Dom f . Then
(i) ∂f (x) is a closed convex set which for sure is nonempty when x ∈ ri Dom f
(ii) If x ∈ int Dom f and f is differentiable at x, then ∂f (x) is the singleton comprised of
the usual gradient of f at x.
Proof. (i): Closedness and convexity of ∂f (x) are evident – (C.6.3) is an infinite system
of nonstrict linear inequalities with respect to d, the inequalities being indexed by y ∈ Rn .
Nonemptiness of ∂f (x) for the case when x ∈ ri Dom f – this is the most important fact
about the subgradients – is readily given by our preceding results. Indeed, we should prove
that if x ∈ ri Dom f, then there exists an affine minorant of f which is exact at x. But this is an immediate consequence of Proposition C.6.4: part (ii) of the proposition says that there exists an affine minorant of f which is equal to cl f(x) at the point x, and part (i) says that f(x) = cl f(x).
(ii): If x ∈ int Dom f and f is differentiable at x, then ∇f (x) ∈ ∂f (x) by the Gradient
Inequality. To prove that in the case in question ∇f (x) is the only subgradient of f at x,
note that if d ∈ ∂f (x), then, by definition,
f (y) − f (x) ≥ dT (y − x) ∀y
Substituting y − x = th, h being a fixed direction and t being > 0, dividing both sides of the
resulting inequality by t and passing to limit as t → +0, we get
hT ∇f (x) ≥ hT d.
This inequality should be valid for all h, which is possible if and only if d = ∇f (x).
Proposition C.6.5 explains why subgradients are good surrogates of gradients: at a point
where gradient exists, it is the only subgradient, but, in contrast to the gradient, a sub-
gradient exists basically everywhere (for sure in the relative interior of the domain of the
function). E.g., let us look at the simple function
f (x) = |x|
on the axis. It is, of course, convex (as maximum of two linear forms x and −x). Whenever
x 6= 0, f is differentiable at x with the derivative +1 for x > 0 and −1 for x < 0. At the point
x = 0 f is not differentiable; nevertheless, it must have subgradients at this point (since 0
is an interior point of the domain of the function). And indeed, it is immediately seen that
the subgradients of |x| at x = 0 are exactly the reals from the segment [−1, 1]. Thus,
∂|x| = {−1} for x < 0,  [−1, 1] for x = 0,  {+1} for x > 0.
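A short Python sketch of mine (the step sizes 1/k are one standard, though not the only, choice) of how subgradients serve as surrogates of gradients in the simplest subgradient method, run on f(x) = |x − 3|:

    import numpy as np

    def subgrad_abs(z):
        # a valid subgradient of |.|: sign(z); at z = 0 we pick 0 from [-1, 1]
        return np.sign(z)

    x = 10.0                                # minimize f(x) = |x - 3|
    for k in range(1, 5001):
        x -= (1.0 / k) * subgrad_abs(x - 3.0)
    print(x)                                # close to the minimizer 3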
Note also that if x is a relative boundary point of the domain of a convex function, even
a “good” one, the set of subgradients of f at x may be empty, as it is the case with the
function
f(y) = −√y for y ≥ 0,  +∞ for y < 0;
it is clear that there is no non-vertical supporting line to the epigraph of the function at the
point (0, f (0)), and, consequently, there is no affine minorant of the function which is exact
at x = 0.
A significant – and important – part of Convex Analysis deals with subgradient calculus –
with the rules for computing subgradients of “composite” functions, like sums, superposi-
tions, maxima, etc., given subgradients of the operands. These rules extend onto nonsmooth
convex case the standard Calculus rules and are very nice and instructive; the related con-
siderations, however, are beyond our scope.
An affine function d^T x − a is an affine minorant of f if and only if
f(x) ≥ d^T x − a, i.e., a ≥ d^T x − f(x),
for all x. We see that if the slope d of an affine function d^T x − a is fixed, then in order for the function to be a minorant of f we should have
a ≥ sup_{x∈R^n} [d^T x − f(x)].
The supremum in the right hand side of the latter relation is a certain function of d; this function is called the Legendre transformation of f and is denoted f∗:
f∗(d) = sup_{x∈R^n} [d^T x − f(x)].
Geometrically, the Legendre transformation answers the following question: given a slope d
of an affine function, i.e., given the hyperplane t = dT x in Rn+1 , what is the minimal “shift
down” of the hyperplane which places it below the graph of f ?
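The definition can be mimicked numerically by brute force; here is a sketch of mine computing the Legendre transformation of f(x) = x²/2 on a grid and comparing it with the theoretical answer f∗(d) = d²/2:

    import numpy as np

    xs = np.linspace(-10.0, 10.0, 20001)    # grid standing in for R
    fx = 0.5 * xs**2                        # f(x) = x^2/2

    def conj(d):
        return np.max(d * xs - fx)          # sup over the grid of [d*x - f(x)]

    err = max(abs(conj(d) - 0.5*d**2) for d in np.linspace(-3.0, 3.0, 13))
    print(err)                              # ~ 0: f* matches d^2/2 on the test slopes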
From the definition of the Legendre transformation it follows that f∗ is a proper function. Indeed, we lose nothing when replacing sup_{x∈R^n} [d^T x − f(x)] by sup_{x∈Dom f} [d^T x − f(x)], so that the Legendre transformation is the upper bound of a family of affine functions and is therefore convex and lower semicontinuous. Since this bound is finite at least at one point (namely, at every d coming from an affine minorant of f; we know that such a minorant exists), f∗ is proper, as claimed.
The most elementary (and the most fundamental) fact about the Legendre transformation
is its symmetry:
Proposition C.6.6 Let f be a convex function. Then twice taken Legendre transformation
of f is the closure cl f of f :
(f ∗ )∗ = cl f.
In particular, if f is proper, then it is the Legendre transformation of its Legendre transfor-
mation (which also is proper).
Proof. We have
(f∗)∗(x) = sup_d [d^T x − f∗(d)] = sup_{(d,a): a ≥ f∗(d)} [d^T x − a];
the second sup here is exactly the supremum of all affine minorants of f (this is the origin of the Legendre transformation: a ≥ f∗(d) if and only if the affine form d^T x − a is a minorant of f). And we already know that the upper bound of all affine minorants of f is the closure of f.
The Legendre transformation is a very powerful tool – this is a “global” transformation, so
that local properties of f∗ correspond to global properties of f. Thus, whenever we can compute explicitly the Legendre transformation of f, we get
a lot of “global” information on f . Unfortunately, the more detailed investigation of the
properties of Legendre transformation is beyond our scope; I simply list several simple facts
and examples:
From the definition of the Legendre transformation it follows that
f(x) + f∗(d) ≥ d^T x  ∀x, d.
Specifying here f and f∗, we get certain inequalities, e.g., the following one:
[Young's Inequality] If p and q are positive reals such that 1/p + 1/q = 1, then
|x|^p/p + |d|^q/q ≥ xd  ∀x, d ∈ R.
Now let 1 < p < ∞, so that also 1 < q < ∞. In this case we should prove that
∑_i |x_i y_i| ≤ (∑_i |x_i|^p)^{1/p} (∑_i |y_i|^q)^{1/q}. (C.6.5)
There is nothing to prove if one of the factors in the right hand side vanishes; thus, we can assume that x ≠ 0 and y ≠ 0. Now, both sides of the inequality are of homogeneity degree 1 with respect to x (when we multiply x by t, both sides are multiplied by |t|), and similarly with respect to y. Multiplying x and y by appropriate reals, we can make both factors in the right hand side equal to 1: ‖x‖_p = ‖y‖_q = 1. Now we should prove that under this normalization the left hand side in the inequality is ≤ 1, which is immediately given by the Young inequality:
∑_i |x_i y_i| ≤ ∑_i [|x_i|^p/p + |y_i|^q/q] = 1/p + 1/q = 1.
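A quick numeric check of the resulting Hölder inequality (random data, invented for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    x, y = rng.normal(size=20), rng.normal(size=20)
    p = 3.0
    q = p / (p - 1.0)                       # so that 1/p + 1/q = 1

    lhs = np.sum(np.abs(x * y))
    rhs = np.sum(np.abs(x)**p)**(1/p) * np.sum(np.abs(y)**q)**(1/q)
    print(lhs, rhs)                         # lhs <= rhs, as Holder's inequality states
    assert lhs <= rhs + 1e-9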
when p = q = 2, we get the Cauchy inequality. Now, inequality (C.6.5) is exact in the sense that for every x there exists y with ‖y‖_q = 1 such that
y^T x = ‖x‖_p;
it suffices to take
y_i = ‖x‖_p^{1−p} |x_i|^{p−1} sign(x_i)
(here x ≠ 0; the case of x = 0 is trivial – here y can be an arbitrary vector with ‖y‖_q = 1).
Combining our observations, we come to an extremely important, although simple, fact:
‖x‖_p = max{y^T x : ‖y‖_q ≤ 1}  [1/p + 1/q = 1]. (C.6.6)
It follows, in particular, that ‖x‖_p is convex (as an upper bound of a family of linear forms), whence
‖x′ + x″‖_p = 2‖(1/2)x′ + (1/2)x″‖_p ≤ 2(‖x′‖_p/2 + ‖x″‖_p/2) = ‖x′‖_p + ‖x″‖_p;
this is nothing but the triangle inequality. Thus, ‖x‖_p satisfies the triangle inequality; it clearly possesses two other characteristic properties of a norm – positivity and homogeneity. Consequently, ‖·‖_p is a norm – the fact that we announced twice and have finally proven now.
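A numeric sketch of (C.6.6) (mine): the vector y constructed above indeed lies on the unit ‖·‖_q-sphere and attains the maximum y^T x = ‖x‖_p:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=6)
    p = 1.5
    q = p / (p - 1.0)

    norm_p = np.sum(np.abs(x)**p)**(1/p)
    y = norm_p**(1.0 - p) * np.abs(x)**(p - 1.0) * np.sign(x)
    print(np.sum(np.abs(y)**q)**(1/q))      # 1.0: y lies on the unit q-sphere
    print(y @ x, norm_p)                    # equal: the maximum in (C.6.6) is attained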
• The Legendre transformation of the function
f(x) ≡ −a
is the function which is equal to a at the origin and is +∞ outside the origin; similarly, the Legendre transformation of an affine function d̄^T x − a is equal to a at d = d̄ and is +∞ when d ≠ d̄;
• The Legendre transformation of the strictly convex quadratic form
f(x) = (1/2) x^T A x
(A is a positive definite symmetric matrix) is the quadratic form
f∗(d) = (1/2) d^T A^{−1} d;
• The Legendre transformation of the Euclidean norm
f (x) = kxk2
is the function which is equal to 0 in the closed unit ball centered at the origin and is
+∞ outside the ball.
The latter example is a particular case of the following statement:
Let ‖x‖ be a norm on R^n, and let
‖d‖∗ = max{d^T x : ‖x‖ ≤ 1}
be its conjugate norm. The Legendre transformation of ‖x‖ is the characteristic function of the unit ball of the conjugate norm, i.e., the function of d equal to 0 when ‖d‖∗ ≤ 1 and to +∞ otherwise.
Exercise C.1 Prove that ‖·‖∗ is a norm, and that the norm conjugate to ‖·‖∗ is the original norm ‖·‖.
Hint: Observe that the unit ball of ‖·‖∗ is exactly the polar of the unit ball of ‖·‖.
E.g., (C.6.6) says that the norm conjugate to ‖·‖_p, 1 ≤ p ≤ ∞, is ‖·‖_q, 1/p + 1/q = 1; consequently, the Legendre transformation of the p-norm is the characteristic function of the unit ‖·‖_q-ball.
Appendix D
Convex Programming, Lagrange Duality, Saddle Points
– [below boundedness] the problem is called below bounded, if its optimal value is > −∞, i.e., if the objective is below bounded on the feasible set;
– [optimal solution] a point x is called an optimal solution to the problem, if x is feasible and f(x) is the optimal value, i.e., if
x ∈ Argmin_{x′∈X: g(x′)≤0, h(x′)=0} f(x′).
To solve the problem exactly means to find its optimal solution or to detect that no optimal
solution exists.
The problem is called convex, if
• X is a convex subset of Rn;
• f, g1, ..., gm are real-valued convex functions on X;
• there are no equality constraints at all.
Note that instead of saying that there are no equality constraints, we could say that there are
constraints of this type, but only linear ones; this latter case can be immediately reduced to the
one without equality constraints by replacing Rn with the affine subspace given by the (linear)
equality constraints.
We started with the observation that the fact that a point x∗ is an optimal solution can
be expressed in terms of solvability/unsolvability of certain systems of inequalities: in our now
terms, these systems are
x ∈ G, f(x) ≤ c, gj(x) ≤ 0, j = 1, ..., m (D.2.1)
and
x ∈ G, f(x) < c, gj(x) ≤ 0, j = 1, ..., m; (D.2.2)
here c is a parameter. Optimality of x∗ for the problem means exactly that for appropriately
chosen c (this choice, of course, is c = f (x∗ )) the first of these systems is solvable and x∗ is its
solution, while the second system is unsolvable. Given this trivial observation, we converted the
“negative” part of it – the claim that (D.2.2) is unsolvable – into a positive statement, using the
General Theorem on Alternative, and this gave us the LP Duality Theorem.
Now we are going to use the same approach. What we need is a “convex analogy” to
the Theorem on Alternative – something like the latter statement, but for the case when the
inequalities in question are given by convex functions rather than the linear ones (and, besides
it, we have a “convex inclusion” x ∈ X).
It is easy to guess the result we need. How did we come to the formulation of the Theorem on
Alternative? The question we were interested in was, basically, how to express in an affirmative
manner the fact that a system of linear inequalities has no solutions; to this end we observed
that if we can combine, in a linear fashion, the inequalities of the system and get an obviously
false inequality like 0 ≤ −1, then the system is unsolvable; this condition is certain affirmative
statement with respect to the weights with which we are combining the original inequalities.
Now, the scheme of the above reasoning has nothing in common with linearity (and even
convexity) of the inequalities in question. Indeed, consider an arbitrary inequality system of the
type (D.2.2):
(I)
f (x) < c
gj (x) ≤ 0, j = 1, ..., m
x ∈ X;
all we assume is that X is a nonempty subset in Rn and f, g1 , ..., gm are real-valued functions
on X. It is absolutely evident that
if (I) has a solution x, then, for all λ_j ≥ 0, x solves also the inequality
f(x) + ∑_{j=1}^m λ_j g_j(x) < c. (D.2.3)
Indeed, a solution to (I) clearly is a solution to (D.2.3) – the latter inequality is nothing but a
combination of the inequalities from (I) with the weights 1 (for the first inequality) and λj (for
the remaining ones).
Now, what does it mean that (D.2.3) has no solutions? A necessary and sufficient condition
for this is that the infimum of the left hand side of (D.2.3) in x ∈ X is ≥ c. Thus, we come to
the following evident
Proposition D.2.1 [Sufficient condition for insolvability of (I)] Consider a system (I) with arbitrary data and assume that the system
(II) inf_{x∈X} [f(x) + ∑_{j=1}^m λ_j g_j(x)] ≥ c,  λ_j ≥ 0, j = 1, ..., m,
with unknowns λ_1, ..., λ_m has a solution. Then (I) is infeasible.
Let me stress that this result is completely general; it does not require any assumptions on the
entities involved.
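Here is a tiny Python sketch of the Proposition at work (the program is invented for illustration): for min{x² : 1 − x ≤ 0} over X = R, whose optimal value is 1, every value of inf_x [f(x) + λg(x)] is ≤ 1, and λ = 2 certifies infeasibility of (I) with c = 1:

    import numpy as np
    from scipy.optimize import minimize_scalar

    f = lambda x: x**2                      # objective; optimal value is 1 at x = 1
    g = lambda x: 1.0 - x                   # single constraint g(x) <= 0

    def L_lower(lam):
        # inf over x in X = R of [f(x) + lam * g(x)], a 1-d convex minimization
        return minimize_scalar(lambda x: f(x) + lam * g(x)).fun

    for lam in [0.0, 1.0, 2.0, 3.0]:
        print(lam, L_lower(lam))            # every value is <= 1; L(2) = 1 = c*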
The result we have obtained, unfortunately, does not help us: the actual power of the
Theorem on Alternative (and the fact used to prove the Linear Programming Duality Theorem)
is not the sufficiency of the condition of the Proposition for infeasibility of (I), but the necessity of
this condition. Justification of necessity of the condition in question has nothing in common with
the evident reasoning which gives the sufficiency. The necessity in the linear case (X = Rn , f ,
g1 , ..., gm are linear) can be established via the Homogeneous Farkas Lemma. Now we shall prove
the necessity of the condition for the convex case, and already here we need some additional,
although minor, assumptions; and in the general nonconvex case the condition in question simply
is not necessary for infeasibility of (I) [and this is very bad – this is the reason why there exist
difficult optimization problems which we do not know how to solve efficiently].
The just presented “preface” explains what we should do; now let us carry out our plan. We
start with the aforementioned “minor regularity assumptions”.
Definition D.2.1 [Slater Condition] Let X ⊂ Rn and g1 , ..., gm be real-valued functions on X.
We say that these functions satisfy the Slater condition on X, if there exists x ∈ X such that
gj (x) < 0, j = 1, ..., m.
An inequality constrained program
(IC) min {f(x) : gj(x) ≤ 0, j = 1, ..., m, x ∈ X}
(f, g1, ..., gm are real-valued functions on X) is said to satisfy the Slater condition, if g1, ..., gm satisfy this condition on X.
We are about to establish the following fundamental fact:
Theorem D.2.1 [Convex Theorem on Alternative]
Let X ⊂ Rn be convex, let f, g1 , ..., gm be real-valued convex functions on X, and let g1 , ..., gm
satisfy the Slater condition on X. Then system (I) is solvable if and only if system (II) is
unsolvable.
Proof. The first part of the statement – “if (II) has a solution, then (I) has no solutions” –
is given by Proposition D.2.1. What we need is to prove the inverse statement. Thus, let us
assume that (I) has no solutions, and let us prove that then (II) has a solution.
Without loss of generality we may assume that X is full-dimensional: ri X = int X (indeed,
otherwise we could replace our “universe” Rn with the affine span of X).
1°. Let us set
F(x) = (f(x), g_1(x), ..., g_m(x)) ∈ R^{m+1}
and consider the sets
S = {u = (u_0, ..., u_m) | ∃x ∈ X : F(x) ≤ u}
and
T = {(u_0, ..., u_m) | u_0 < c, u_1 ≤ 0, u_2 ≤ 0, ..., u_m ≤ 0}.
I claim that
• (i) S and T are nonempty convex sets;
• (ii) S ∩ T = ∅.
Indeed, nonemptiness and convexity of T are evident, and so is nonemptiness of S. To verify convexity of S, let u′, u″ ∈ S, so that there exist x′, x″ ∈ X with F(x′) ≤ u′, F(x″) ≤ u″. For λ ∈ [0, 1] we then have
λF(x′) + (1 − λ)F(x″) ≤ λu′ + (1 − λ)u″.
The left hand side in this inequality, due to convexity of X and f, g_1, ..., g_m, is ≥ F(y), y = λx′ + (1 − λ)x″. Thus, for the point v = λu′ + (1 − λ)u″ there exists y ∈ X with F(y) ≤ v, whence v ∈ S. Thus, S is convex.
The fact that S ∩ T = ∅ is an evident equivalent reformulation of the fact that (I) has no
solutions.
2°. Since S and T are nonempty convex sets with empty intersection, by the Separation Theorem (Theorem B.2.9) they can be separated by a linear form: there exists a = (a_0, ..., a_m) ≠ 0 such that
inf_{u∈S} ∑_{j=0}^m a_j u_j ≥ sup_{u∈T} ∑_{j=0}^m a_j u_j. (D.2.4)
3°. Let us look at what can be said about the vector a. I claim that, first,
a≥0 (D.2.5)
and, second,
a0 > 0. (D.2.6)
Indeed, to prove (D.2.5) note that if some ai were negative, then the right hand side in (D.2.4)
would be +∞ 2) , which is forbidden by (D.2.4).
Thus, a ≥ 0; with this in mind, we can immediately compute the right hand side of (D.2.4):
sup_{u∈T} ∑_{j=0}^m a_j u_j = sup_{u_0<c, u_1,...,u_m≤0} ∑_{j=0}^m a_j u_j = a_0 c.
Since for every x ∈ X the point F(x) belongs to S, the left hand side in (D.2.4) is not greater than
inf_{x∈X} [a_0 f(x) + ∑_{j=1}^m a_j g_j(x)];
combining these observations, we conclude that
inf_{x∈X} [a_0 f(x) + ∑_{j=1}^m a_j g_j(x)] ≥ a_0 c. (D.2.7)
2) Look what happens when all coordinates in u, except the i-th one, are fixed at values allowed by the description of T and u_i is a negative real with large absolute value.
Now let us prove that a_0 > 0. This crucial fact is an immediate consequence of the Slater condition. Indeed, let x̄ ∈ X be the point given by this condition, so that g_j(x̄) < 0. From (D.2.7) we conclude that
a_0 f(x̄) + ∑_{j=1}^m a_j g_j(x̄) ≥ a_0 c.
If a_0 were 0, then the right hand side of this inequality would be 0, while the left one would be the combination ∑_{j=1}^m a_j g_j(x̄) of negative reals g_j(x̄) with nonnegative coefficients a_j not all equal to 0 3), so that the left hand side is strictly negative, which is the desired contradiction.
4°. Now we are done: since a_0 > 0, we are in our right to divide both sides of (D.2.7) by a_0 and thus get
inf_{x∈X} [f(x) + ∑_{j=1}^m λ_j g_j(x)] ≥ c, (D.2.8)
where λ_j = a_j/a_0 ≥ 0. Thus, (II) has a solution.
The result of the Convex Theorem on Alternative brings to our attention the function
L(λ) = inf_{x∈X} [f(x) + ∑_{j=1}^m λ_j g_j(x)], (D.2.9)
or, better, the aggregate
L(x, λ) = f(x) + ∑_{j=1}^m λ_j g_j(x) (D.2.10)
from which this function comes. The aggregate (D.2.10) is called the Lagrange function of the inequality constrained optimization program
(IC) min {f(x) : gj(x) ≤ 0, j = 1, ..., m, x ∈ X}.
The Lagrange function of an optimization program is a very important entity: most of optimality
conditions are expressed in terms of this function. Let us start with translating of what we
already know to the language of the Lagrange function.
3) Indeed, from the very beginning we know that a ≠ 0, so that if a_0 = 0, then not all a_j, j ≥ 1, are zeros.
Theorem D.2.2 Consider an arbitrary inequality constrained optimization program (IC). Then
(i) The infimum
L(λ) = inf_{x∈X} L(x, λ)
of the Lagrange function in x ∈ X is, for every λ ≥ 0, a lower bound on the optimal value in (IC), so that the optimal value in the optimization program
(IC∗) max_{λ≥0} L(λ)
also is a lower bound for the optimal value in (IC);
(ii) [Convex Duality Theorem] If (IC)
• is convex,
• is below bounded
and
• satisfies the Slater condition,
then the optimal value in (IC∗) is attained and is equal to the optimal value in (IC).
Proof. (i) is nothing but Proposition D.2.1 (why?). It makes sense, however, to repeat here the corresponding one-line reasoning:
L(λ) ≡ inf_{x∈X} L(x, λ) ≤ c∗  [L(x, λ) = f(x) + ∑_{j=1}^m λ_j g_j(x)],
where c∗ is the optimal value in (IC); indeed, if x is feasible for (IC), then evidently L(x, λ) ≤ f(x), so that the infimum of L over x ∈ X is ≤ the infimum c∗ of f over the feasible set of (IC).
(ii): By definition of c∗, the system
f(x) < c∗, g_j(x) ≤ 0, j = 1, ..., m
has no solutions in X, and by the above Theorem the system (II) associated with c = c∗ has a
solution, i.e., there exists λ∗ ≥ 0 such that L(λ∗ ) ≥ c∗ . But we know from (i) that the strict
inequality here is impossible and, besides this, that L(λ) ≤ c∗ for every λ ≥ 0. Thus, L(λ∗ ) = c∗
and λ∗ is a maximizer of L over λ ≥ 0.
(the variables λ of the dual problem are called the Lagrange multipliers of the primal problem).
The Theorem says that the optimal value in the dual problem is ≤ the one in the primal, and
under some favourable circumstances (the primal problem is convex below bounded and satisfies
the Slater condition) the optimal values in the programs are equal to each other.
In our formulation there is some asymmetry between the primal and the dual programs.
In fact both of the programs are related to the Lagrange function in a quite symmetric way.
Indeed, consider the program
min_{x∈X} [ sup_{λ≥0} L(x, λ) ].
The objective in this program clearly is +∞ at every point x ∈ X which is not feasible for
(IC) and is f (x) on the feasible set of (IC), so that the program is equivalent to (IC). We see
that both the primal and the dual programs come from the Lagrange function: in the primal
problem, we minimize over X the result of maximization of L(x, λ) in λ ≥ 0, and in the dual
program we maximize over λ ≥ 0 the result of minimization of L(x, λ) in x ∈ X. This is a
particular (and the most important) example of a zero sum two person game – the issue we will
speak about later.
We have seen that under certain convexity and regularity assumptions the optimal values in (IC) and (IC∗) are equal to each other. There is also another way to say when these optimal values are equal – this is always the case when the Lagrange function possesses a saddle point, i.e., there exists a pair x∗ ∈ X, λ∗ ≥ 0 such that at the pair L(x, λ) attains its minimum as a function of x ∈ X and attains its maximum as a function of λ ≥ 0:
L(x, λ∗) ≥ L(x∗, λ∗) ≥ L(x∗, λ)  ∀x ∈ X, ∀λ ≥ 0.
Proposition D.2.2 (x∗ , λ∗ ) is a saddle point of the Lagrange function L of (IC) if and only if
x∗ is an optimal solution to (IC), λ∗ is an optimal solution to (IC∗ ) and the optimal values in
the indicated problems are equal to each other.
Our current goal is to extract from what we already know optimality conditions for convex programs.
Theorem D.2.3 [Saddle point formulation of optimality conditions] Let (IC) be an optimization program, L(x, λ) be its Lagrange function, and let x∗ ∈ X. Then
(i) A sufficient condition for x∗ to be an optimal solution to (IC) is the existence of a vector of Lagrange multipliers λ∗ ≥ 0 such that (x∗, λ∗) is a saddle point of the Lagrange function L(x, λ), i.e., a point where L(x, λ) attains its minimum as a function of x ∈ X and attains its maximum as a function of λ ≥ 0:
L(x, λ∗) ≥ L(x∗, λ∗) ≥ L(x∗, λ)  ∀x ∈ X, ∀λ ≥ 0; (D.2.11)
(ii) if the problem (IC) is convex and satisfies the Slater condition, then the above condition
is necessary for optimality of x∗ : if x∗ is optimal for (IC), then there exists λ∗ ≥ 0 such that
(x∗ , λ∗ ) is a saddle point of the Lagrange function.
Proof. (i): assume that for a given x∗ ∈ X there exists λ∗ ≥ 0 such that (D.2.11) is satisfied, and let us prove that then x∗ is optimal for (IC). First of all, x∗ is feasible: indeed, if gj(x∗) > 0 for some j, then, of course, sup_{λ≥0} L(x∗, λ) = +∞ (look what happens when all λ's, except λj, are fixed, and λj → +∞); but sup_{λ≥0} L(x∗, λ) = +∞ is forbidden by the second inequality in (D.2.11).
Since x∗ is feasible, sup_{λ≥0} L(x∗, λ) = f(x∗), and we conclude from the second inequality in
(D.2.11) that L(x∗, λ∗) = f(x∗). Now the first inequality in (D.2.11) reads
f(x) + ∑_{j=1}^m λ∗_j g_j(x) ≥ f(x∗)  ∀x ∈ X.
This inequality immediately implies that x∗ is optimal: indeed, if x is feasible for (IC), then the
left hand side in the latter inequality is ≤ f (x) (recall that λ∗ ≥ 0), and the inequality implies
that f (x) ≥ f (x∗ ).
(ii): Assume that (IC) is a convex program, x∗ is its optimal solution and the problem
satisfies the Slater condition; we should prove that then there exists λ∗ ≥ 0 such that (x∗ , λ∗ )
is a saddle point of the Lagrange function, i.e., that (D.2.11) is satisfied. As we know from
the Convex Programming Duality Theorem (Theorem D.2.2.(ii)), the dual problem (IC∗ ) has a
solution λ∗ ≥ 0 and the optimal value of the dual problem is equal to the optimal value in the
primal one, i.e., to f (x∗ ):
f(x∗) = L(λ∗) ≡ inf_{x∈X} L(x, λ∗). (D.2.12)
In particular,
f(x∗) ≤ L(x∗, λ∗) = f(x∗) + ∑_{j=1}^m λ∗_j g_j(x∗),
whence ∑_j λ∗_j g_j(x∗) ≥ 0. The terms λ∗_j g_j(x∗) in the sum are nonpositive (since x∗ is feasible for (IC)), and the sum itself is nonnegative due to our inequality; this is possible if and only if all the terms in the sum are zero, and this is exactly the complementary slackness
λ∗_j g_j(x∗) = 0, j = 1, ..., m.
From the complementary slackness we immediately conclude that f (x∗ ) = L(x∗ , λ∗ ), so that
(D.2.12) results in
L(x∗ , λ∗ ) = f (x∗ ) = inf L(x, λ∗ ).
x∈X
On the other hand, since x∗ is feasible for (IC), we have L(x∗ , λ) ≤ f (x∗ ) whenever λ ≥ 0.
Combining our observations, we conclude that
L(x∗, λ) ≤ L(x∗, λ∗) ≤ L(x, λ∗)  ∀x ∈ X, ∀λ ≥ 0,
so that (x∗, λ∗) is a saddle point of the Lagrange function, as required.
M1 = {h : ∃t > 0 : x∗ + th ∈ X},  M2 = {h : h^T ∇g_j(x∗) ≤ 0, j ∈ I(x∗)}.
We claim that
(I): M2 is a closed cone which has a nonempty intersection with the interior of the convex cone
M1 ;
(II): the vector ∇f (x∗ ) belongs to the cone dual to the cone M = M1 ∩ M2 .
Postponing for the time being the proofs, let us derive from (I), (II) the existence of the required vector of Lagrange multipliers. Applying the Dubovitski-Milutin Lemma (Theorem B.2.7), which is legitimate due to (I), (II), we conclude that there exists a representation
∇f(x∗) = u + v, u ∈ M1′, v ∈ M2′,
where Mi′ is the cone dual to the cone Mi. By the Homogeneous Farkas Lemma, we have
v = −∑_{j∈I(x∗)} λ∗_j ∇g_j(x∗)
with some λ∗_j ≥ 0. Setting λ∗_j = 0 for j ∉ I(x∗), we get a vector λ∗ ≥ 0 such that
∇_x L(x, λ∗)|_{x=x∗} = ∇f(x∗) + ∑_j λ∗_j ∇g_j(x∗) = ∇f(x∗) − v = u,
λ∗_j g_j(x∗) = 0, j = 1, ..., m. (D.2.13)
Since the function L(x, λ∗) is convex in x ∈ X and differentiable at x∗, the first relation in (D.2.13) combines with the inclusion u ∈ M1′ and Proposition C.5.1 to imply that x∗ is a minimizer of L(x, λ∗) over x ∈ X. The second relation in (D.2.13) is the complementary slackness
which, as we remember from the proof of Theorem D.2.3.(ii), combines with the feasibility of x∗
to imply that λ∗ is a maximizer of L(x∗ , λ) over λ ≥ 0. Thus, (x∗ , λ∗ ) is a saddle point of the
Lagrange function, as claimed.
It remains to verify (I) and (II).
(I): the fact that M2 is a closed cone is evident (M2 is a polyhedral cone). The fact that M1 is a convex cone with a nonempty interior is an immediate consequence of the convexity of X and the relation int X ≠ ∅. By assumption, there exists a point x̄ ∈ int X such that g_j(x̄) ≤ 0 for all j. Since x̄ ∈ int X, the vector h = x̄ − x∗ clearly belongs to int M1; since g_j(x∗) = 0, j ∈ I(x∗), and g_j(x̄) ≤ 0, from the Gradient Inequality it follows that h^T ∇g_j(x∗) ≤ g_j(x̄) − g_j(x∗) ≤ 0 for j ∈ I(x∗), so that h ∈ M2. Thus, h belongs to the intersection of int M1 and M2, so that this intersection is nonempty.
(II): Assume, on the contrary to what should be proven, that there exists a vector d ∈ M1 ∩ M2 such that d^T ∇f(x∗) < 0. Let h be the same vector as in the proof of (I). Since d^T ∇f(x∗) < 0, we can choose ε > 0 such that with d_ε = d + εh one has d_ε^T ∇f(x∗) < 0. Since both d and h belong to M1, there exists δ > 0 such that x_t = x∗ + t d_ε ∈ X for 0 ≤ t ≤ δ; since d_ε^T ∇f(x∗) < 0, we may further assume that f(x_t) < f(x∗) when 0 < t ≤ δ. Let us verify that for every j ≤ m one has
(∗_j) there exists δ_j > 0 such that g_j(x∗ + t d_ε) ≤ 0 for 0 ≤ t ≤ δ_j.
This will yield the desired contradiction, since, setting t = min[δ, min_j δ_j], we would have x_t ∈ X, g_j(x_t) ≤ 0, j = 1, ..., m, f(x_t) < f(x∗), which is impossible, since x∗ is an optimal solution of (IC).
To prove (∗_j), consider the following three possibilities:
j ∉ I(x∗): here (∗_j) is evident, since g_j(x) is negative at x∗ and is continuous in x ∈ X at the point x∗ (recall that all g_j are assumed even to be differentiable at x∗).
j ∈ I(x∗) and j ≤ k: For j in question, the function g_j(x) is affine and vanishes at x∗, while ∇g_j(x∗) has nonpositive inner products with both d (due to d ∈ M2) and h (due to g_j(x∗) = 0, g_j(x∗ + h) = g_j(x̄) ≤ 0); it follows that ∇g_j(x∗) has nonpositive inner product with d_ε, and since the function is affine, we arrive at g_j(x∗ + t d_ε) ≤ g_j(x∗) = 0 for t ≥ 0.
j ∈ I(x∗) and j > k: In this case, the function γ_j(t) = g_j(x∗ + t d_ε) vanishes at t = 0 and is differentiable at t = 0 with the derivative γ′_j(0) = (d + εh)^T ∇g_j(x∗). This derivative is negative, since d^T ∇g_j(x∗) ≤ 0 due to d ∈ M2 and j ∈ I(x∗), while by the Gradient Inequality εh^T ∇g_j(x∗) ≤ ε[g_j(x∗ + h) − g_j(x∗)] = ε[g_j(x̄) − g_j(x∗)] ≤ εg_j(x̄) < 0. Since γ_j(0) = 0, γ′_j(0) < 0, γ_j(t) is negative for all small enough positive t, as required in (∗_j).
and
∇f(x∗) + ∑_{j=1}^m λ∗_j ∇g_j(x∗) ∈ N_X(x∗) (D.2.15)
(that is, (x − x∗)^T [∇f(x∗) + ∑_{j=1}^m λ∗_j ∇g_j(x∗)] ≥ 0 for all x ∈ X).
Proof. (i) is readily given by Theorem D.2.3.(i); indeed, it is immediately seen that under the premise of Theorem D.2.5 the Karush-Kuhn-Tucker condition is sufficient for (x∗, λ∗) to be a saddle point of the Lagrange function.
(ii) is contained in the proof of Theorem D.2.4.
Note that the optimality conditions stated in Theorem C.5.2 and Proposition C.5.1 are
particular cases of the above Theorem corresponding to m = 0.
differentiable at x∗) is exactly the condition that (x∗, λ∗ = (λ∗_1, ..., λ∗_m)) is a saddle point of the Lagrange function
L(x, λ) = f(x) + ∑_{j=1}^m λ_j g_j(x); (D.3.1)
(D.2.14) says that L(x∗, λ) attains its maximum in λ ≥ 0 at λ = λ∗, and (D.2.15) says that L(x, λ∗) attains its minimum in x at x = x∗.
Now consider the particular case of (IC) where X = Rn is the entire space, the objective
f is convex and everywhere differentiable and the constraints g1 , ..., gm are linear. For this
case, Theorem D.2.5 tells us that the KKT (Karush-Kuhn-Tucker) condition is necessary and
sufficient for optimality of x∗ ; as we just have explained, this is the same as to say that the
necessary and sufficient condition of optimality for x∗ is that x∗ along with certain λ∗ ≥ 0 form
a saddle point of the Lagrange function. Combining these observations with Proposition D.2.2,
we get the following simple result:
Proposition D.3.1 Let (IC) be a convex program with X = Rn , everywhere differentiable
objective f and linear constraints g1, ..., gm. Then x∗ is an optimal solution to (IC) if and only if
there exists λ∗ ≥ 0 such that (x∗ , λ∗ ) is a saddle point of the Lagrange function (D.3.1) (regarded
as a function of x ∈ Rn and λ ≥ 0). In particular, (IC) is solvable if and only if L has saddle
points, and if it is the case, then both (IC) and its Lagrange dual
of (IC) and to minimize it in x ∈ R^n; this will give us the dual objective. In our case the minimization in x is immediate: the minimal value is −∞ if c − ∑_{j=1}^m λ_j a_j ≠ 0, and is ∑_{j=1}^m λ_j b_j otherwise. We see that the Lagrange dual is
(D) max_λ { b^T λ : ∑_{j=1}^m λ_j a_j = c, λ ≥ 0 }.
The problem we get is the usual LP dual to (P ), and Proposition D.3.1 is one of the equivalent
forms of the Linear Programming Duality Theorem which we already know.
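This can be checked numerically; below is a minimal sketch of mine with invented data, solving (P) and (D) with scipy.optimize.linprog and comparing the optimal values:

    import numpy as np
    from scipy.optimize import linprog

    # (P): min c^T x s.t. A x >= b;  (D): max b^T lam s.t. A^T lam = c, lam >= 0
    A = np.array([[1.0, 1.0], [1.0, -1.0], [0.0, 1.0]])
    b = np.array([1.0, -1.0, 0.0])
    c = np.array([1.0, 2.0])

    primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)
    dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 3)
    print(primal.fun, -dual.fun)            # equal optimal values (here both are 1.0)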
where the objective is a strictly convex quadratic form, so that D = D^T is a positive definite matrix: x^T D x > 0 whenever x ≠ 0. It is convenient to rewrite the constraints in the vector-matrix form
Ax − b ≥ 0,
where A is the matrix with the rows a_1^T, ..., a_m^T and b = (b_1, ..., b_m). In order to get the Lagrange dual, we should form the Lagrange function
L(x, λ) = c^T x + (1/2) x^T D x + λ^T [b − Ax]
and minimize it in x. Since the function is convex and differentiable in x, the minimum, if it exists, is given by the Fermat rule
∇x L(x, λ) = 0,
which in our situation becomes
Dx = [AT λ − c].
Since D is positive definite, it is nonsingular, so that the Fermat equation has a unique solution
which is the minimizer of L(·, λ); this solution is
x = x(λ) = D^{−1}[A^T λ − c].
Substituting this value of x into the expression for the Lagrange function, we get the dual objective:
objective:
L(λ) = −(1/2)[A^T λ − c]^T D^{−1}[A^T λ − c] + b^T λ,
and the dual problem is to maximize this objective over the nonnegative orthant. Usually people rewrite this dual problem equivalently by introducing additional variables
t = −D^{−1}[A^T λ − c];
with these variables, the dual problem becomes
(D) max_{λ,t} { b^T λ − (1/2) t^T D t : A^T λ + Dt = c, λ ≥ 0 }.
A pair (x; (λ, t)) of feasible solutions to the problems is comprised of optimal solutions to them
(i) if and only if the primal objective at x is equal to the dual objective at (λ, t) [“zero duality
gap” optimality condition]
same as
(ii) if and only if
λi (Ax − b)i = 0, i = 1, ..., m, and t = −x. (D.3.2)
Proof. (i): we know from Proposition D.3.1 that the optimal value in minimization problem
(P ) is equal to the optimal value in the maximization problem (D). It follows that the value
of the primal objective at any primal feasible solution is ≥ the value of the dual objective at
any dual feasible solution, and equality is possible if and only if these values coincide with the
optimal values in the problems, as claimed in (i).
(ii): Let us compute the difference ∆ between the values of the primal objective at primal
feasible solution x and the dual objective at dual feasible solution (λ, t):
∆ = c^T x + (1/2)x^T Dx − [b^T λ − (1/2)t^T Dt]
  = [A^T λ + Dt]^T x + (1/2)x^T Dx + (1/2)t^T Dt − b^T λ  [since A^T λ + Dt = c]
  = λ^T [Ax − b] + (1/2)[x + t]^T D[x + t].
Since Ax − b ≥ 0 and λ ≥ 0 due to primal feasibility of x and dual feasibility of (λ, t), both terms in the resulting expression for ∆ are nonnegative. Thus, ∆ = 0 (which, by (i), is equivalent to optimality of x for (P) and optimality of (λ, t) for (D)) if and only if ∑_{j=1}^m λ_j (Ax − b)_j = 0 and (x + t)^T D(x + t) = 0. The first of these equalities, due to λ ≥ 0 and Ax ≥ b, is equivalent to λ_j (Ax − b)_j = 0, j = 1, ..., m; the second, due to positive definiteness of D, is equivalent to x + t = 0.
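The algebraic identity for ∆ used in this proof is easy to verify numerically; here is a sketch of mine with random (invented) data, where (λ, t) is made dual feasible by solving A^T λ + Dt = c for t:

    import numpy as np

    rng = np.random.default_rng(3)
    n, m = 3, 4
    M = rng.normal(size=(n, n))
    D = M @ M.T + np.eye(n)                 # a positive definite symmetric D
    A = rng.normal(size=(m, n))
    b = rng.normal(size=m)
    c = rng.normal(size=n)

    x = rng.normal(size=n)                  # an arbitrary point
    lam = rng.uniform(size=m)               # an arbitrary lam >= 0
    t = np.linalg.solve(D, c - A.T @ lam)   # enforce dual feasibility A^T lam + D t = c

    delta = (c @ x + 0.5 * x @ D @ x) - (b @ lam - 0.5 * t @ D @ t)
    decomp = lam @ (A @ x - b) + 0.5 * (x + t) @ D @ (x + t)
    print(np.isclose(delta, decomp))        # True: the identity from the proof holds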
Let X ⊂ R^n and Λ ⊂ R^m be nonempty sets, and let
L(x, λ) : X × Λ → R
be a real-valued function of x ∈ X and λ ∈ Λ. A point (x∗, λ∗) ∈ X × Λ is called a saddle point of L, if L attains at this point its minimum in x ∈ X and its maximum in λ ∈ Λ:
L(x, λ∗) ≥ L(x∗, λ∗) ≥ L(x∗, λ)  ∀x ∈ X, ∀λ ∈ Λ. (D.4.1)
The notion of a saddle point admits natural interpretation in game terms. Consider what is
called a two person zero sum game where player I chooses x ∈ X and player II chooses λ ∈ Λ;
after the players have chosen their decisions, player I pays to player II the sum L(x, λ). Of
course, I is interested to minimize his payment, while II is interested to maximize his income.
What is the natural notion of the equilibrium in such a game – what are the choices (x, λ)
of the players I and II such that every one of the players is not interested to vary his choice
independently on whether he knows the choice of his opponent? It is immediately seen that the
equilibria are exactly the saddle points of the cost function L. Indeed, if (x∗, λ∗) is such a point, then player I is not interested in passing from x∗ to another choice, given that II keeps his choice λ∗ fixed: the first inequality in (D.4.1) shows that such a choice cannot decrease the payment of
I. Similarly, player II is not interested to choose something different from λ∗ , given that I keeps
his choice x∗ – such an action cannot increase the income of II. On the other hand, if (x∗ , λ∗ ) is
not a saddle point, then either the player I can decrease his payment passing from x∗ to another
choice, given that II keeps his choice at λ∗ – this is the case when the first inequality in (D.4.1)
is violated, or similarly for the player II; thus, equilibria are exactly the saddle points.
The game interpretation of the notion of a saddle point motivates deep insight into the
structure of the set of saddle points. Consider the following two situations:
(A) player I makes his choice first, and player II makes his choice already knowing the choice
of I;
(B) vice versa, player II chooses first, and I makes his choice already knowing the choice of
II.
In the case (A) the reasoning of I is: If I choose some x, then II of course will choose λ which maximizes, for my x, my payment L(x, λ), so that I shall pay the sum
L̄(x) = sup_{λ∈Λ} L(x, λ).
Consequently, my policy should be to choose x which minimizes my loss function L̄, i.e., the one which solves the optimization problem
(I) min_{x∈X} L̄(x).
In the case (B), similar reasoning of II enforces him to choose λ maximizing his profit function
L̲(λ) = inf_{x∈X} L(x, λ),
i.e., the one which solves the optimization problem
(II) max_{λ∈Λ} L̲(λ).
Note that these two reasonings relate to two different games: the one with priority of II
(when making his decision, II already knows the choice of I), and the one with similar priority
of I. Therefore we should not, generally speaking, expect that the anticipated loss of I in (A) is
equal to the anticipated profit of II in (B). What can be guessed is that the anticipated loss of I
in (B) is less than or equal to the anticipated profit of II in (A), since the conditions of the game
(B) are better for I than those of (A). Thus, we may guess that independently of the structure of the function L(x, λ), there is the inequality
sup_{λ∈Λ} inf_{x∈X} L(x, λ) ≤ inf_{x∈X} sup_{λ∈Λ} L(x, λ). (D.4.2)
This inequality indeed is true, which is seen from the following reasoning: for every y ∈ X we have
inf_{x∈X} L(x, λ) ≤ L(y, λ)  ∀λ ∈ Λ,
whence
sup_{λ∈Λ} inf_{x∈X} L(x, λ) ≤ sup_{λ∈Λ} L(y, λ) = L̄(y);
consequently, the quantity sup_{λ∈Λ} inf_{x∈X} L(x, λ) is a lower bound for the function L̄(y), y ∈ X, and is therefore a lower bound for the infimum of the latter function over y ∈ X, i.e., is a lower bound for inf_{y∈X} sup_{λ∈Λ} L(y, λ).
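On finite grids, (D.4.2) is a one-line observation; here is a sketch of mine where a random array plays the role of the values L(x_i, λ_j):

    import numpy as np

    rng = np.random.default_rng(4)
    L = rng.normal(size=(50, 60))           # L[i, j] plays the role of L(x_i, lambda_j)

    lower = L.min(axis=0).max()             # sup over lambda of inf over x
    upper = L.max(axis=1).min()             # inf over x of sup over lambda
    print(lower, upper)                     # lower <= upper; typically a strict gap
    assert lower <= upper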
Now let us look what happens when the game in question has a saddle point (x∗, λ∗). We claim that in this case
(∗) (A) x∗ is an optimal solution to (I), λ∗ is an optimal solution to (II), and (B) the optimal values in these two problems are equal to each other.
Indeed, by definition of a saddle point
L(x, λ∗) ≥ L(x∗, λ∗) ≥ L(x∗, λ)  ∀x ∈ X, ∀λ ∈ Λ,
whence
L̄(x∗) = L(x∗, λ∗) = L̲(λ∗)
and, of course,
inf_{x∈X} L̄(x) ≤ L̄(x∗) = L(x∗, λ∗) = L̲(λ∗) ≤ sup_{λ∈Λ} L̲(λ);
the very first quantity in the latter chain is ≥ the very last quantity by (D.4.2), so that all the inequalities in the chain are equalities, which is exactly what is said by (A) and (B).
Thus, if (x∗ , λ∗ ) is a saddle point of L, then (*) takes place. We are about to demonstrate
that the inverse also is true:
Theorem D.4.1 [Structure of the saddle point set] Let L : X × Λ → R be a function. The set
of saddle points of the function is nonempty if and only if the related optimization problems (I)
and (II) are solvable and the optimal values in the problems are equal to each other. If it is the
case, then the saddle points of L are exactly all pairs (x∗ , λ∗ ) with x∗ being an optimal solution
to (I) and λ∗ being an optimal solution to (II), and the value of the cost function L(·, ·) at every
one of these points is equal to the common optimal value in (I) and (II).
Proof. We already have established “half” of the theorem: if there are saddle points of L, then
their components are optimal solutions to (I), respectively, (II), and the optimal values in these
two problems are equal to each other and to the value of L at the saddle point in question.
To complete the proof, we should demonstrate that if x∗ is an optimal solution to (I), λ∗ is an
optimal solution to (II) and the optimal values in the problems are equal to each other, then
(x∗, λ∗) is a saddle point of L. This is immediate: we have
sup_{λ∈Λ} L(x∗, λ) = L̄(x∗) = inf_{x∈X} L̄(x) = sup_{λ∈Λ} L̲(λ) = L̲(λ∗) = inf_{x∈X} L(x, λ∗),
whence
L(x, λ∗) ≥ L(x∗, λ)  ∀x ∈ X, λ ∈ Λ;
substituting λ = λ∗ in the right hand side of this inequality, we get L(x, λ∗) ≥ L(x∗, λ∗), and substituting x = x∗ in the left hand side, we get L(x∗, λ∗) ≥ L(x∗, λ); thus, (x∗, λ∗) indeed is a saddle point of L.
so that the optimal value in (I) is 1/4, and the optimal value in (II) is 0; according to Theorem D.4.1 this means that L has no saddle points.
On the other hand, there are generic cases when L has a saddle point, e.g., when
L(x, λ) = f(x) + ∑_{i=1}^m λ_i g_i(x) : X × R^m_+ → R
is the Lagrange function of a solvable convex program satisfying the Slater condition. Note that in this case L is convex in x for every λ ∈ Λ ≡ R^m_+ and is linear (and therefore concave) in λ for every fixed x. As we shall see in a while, these are the structural properties of L which take upon themselves the "main responsibility" for the fact that in the case in question the saddle
points exist. Namely, there exists the following
Theorem D.4.2 [Sion-Kakutani] Let X and Λ be convex compact sets in R^n and R^m, respectively, and let
L(x, λ) : X × Λ → R
be a continuous function which is convex in x ∈ X for every fixed λ ∈ Λ and concave in λ ∈ Λ for every fixed x ∈ X. Then
• (i) the optimization problems (I) and (II) are solvable;
• (ii) the optimal values in (I) and (II) are equal to each other,
so that, by Theorem D.4.1, L possesses a saddle point.
Lemma [Minmax Lemma] Let X be a convex compact set in R^n and f_0, ..., f_N be convex continuous functions on X. Then the minmax
min_{x∈X} max_{i=0,...,N} f_i(x)
of the collection is equal to the minimum in x ∈ X of certain convex combination of the functions: there exist nonnegative µ_i, i = 0, ..., N, with unit sum such that
min_{x∈X} max_{i=0,...,N} f_i(x) = min_{x∈X} ∑_{i=0}^N µ_i f_i(x).
Note that for every collection of weights µ_i ≥ 0, ∑_i µ_i = 1, one clearly has
min_{x∈X} ∑_{i=0}^N µ_i f_i(x) ≤ min_{x∈X} max_{i=0,...,N} f_i(x).
The Minmax Lemma says that if the f_i are convex and continuous on a convex compact set X, then for properly chosen weights the indicated inequality is in fact equality; you can easily verify that this is nothing but the claim that the function
M(x, µ) = ∑_{i=0}^N µ_i f_i(x)
of x ∈ X and µ in the standard simplex possesses a saddle point. Thus, the Minmax Lemma is in fact a particular case of the Sion-Kakutani Theorem; we are about to give a direct proof of this particular case of the Theorem and then to derive the general case from this particular one.
5) For those not too familiar with Analysis, I wish to stress the difference between the usual continuity and the uniform continuity: continuity of L means that given ε > 0 and a point (x, λ), it is possible to choose δ > 0 such that (D.4.4) is valid; the corresponding δ may depend on (x, λ), not only on ε. Uniform continuity means that this positive δ may be chosen as a function of ε only. The fact that a function continuous on a compact set automatically is uniformly continuous on the set is one of the most useful features of compact sets.
Proof of the Minmax Lemma. Consider the optimization program
(S) min_{t,x} {t : f_i(x) − t ≤ 0, i = 0, ..., N, x ∈ X, t ∈ R}
(note that (t, x) is a feasible solution for (S) if and only if x ∈ X and t ≥ max_{i=0,...,N} f_i(x)). The problem clearly satisfies the Slater condition and is solvable (since X is a compact set and f_i, i = 0, ..., N, are continuous on X; therefore their maximum also is continuous on X and thus attains its minimum on the compact set X). Let (t∗, x∗) be an optimal solution to the problem.
According to Theorem D.2.3, there exists λ∗ ≥ 0 such that ((t∗ , x∗ ), λ∗ ) is a saddle point of the
corresponding Lagrange function
L(t, x; λ) = t + ∑_{i=0}^N λ_i (f_i(x) − t) = t(1 − ∑_{i=0}^N λ_i) + ∑_{i=0}^N λ_i f_i(x),
and the value of this function at ((t∗, x∗), λ∗) is equal to the optimal value in (S), i.e., to t∗.
Now, since L(t, x; λ∗) attains its minimum in (t, x) over the set {t ∈ R, x ∈ X} at (t∗, x∗), we should have
∑_{i=0}^N λ∗_i = 1
(otherwise the minimum of L in t ∈ R would be −∞), whence
[min_{x∈X} max_{i=0,...,N} f_i(x) =] t∗ = min_{t∈R, x∈X} [t × 0 + ∑_{i=0}^N λ∗_i f_i(x)],
so that
min_{x∈X} max_{i=0,...,N} f_i(x) = min_{x∈X} ∑_{i=0}^N λ∗_i f_i(x)
with some λ∗_i ≥ 0, ∑_{i=0}^N λ∗_i = 1, as claimed.
From the Minmax Lemma to the Sion-Kakutani Theorem. We should prove that the
optimal values in (I) and (II) (which, by (i), are well defined reals) are equal to each other, i.e.,
that
\[
\inf_{x\in X}\,\sup_{\lambda\in\Lambda}L(x,\lambda)\;=\;\sup_{\lambda\in\Lambda}\,\inf_{x\in X}L(x,\lambda).
\]
We know from (D.4.4) that the first of these two quantities is greater than or equal to the second,
so that all we need is to prove the opposite inequality. For me it is convenient to assume that the
right quantity (the optimal value in (II)) is 0, which, of course, does not restrict generality; and
all we need to prove is that the left quantity – the optimal value in (I) – cannot be positive.
1⁰. What does it mean that the optimal value in (II) is zero? It means that the function
L(λ) ≡ min_{x∈X} L(x, λ) is nonpositive for every λ or, which is the same, that for every λ ∈ Λ the
convex continuous function x ↦ L(x, λ) has nonpositive minimal value over x ∈ X. Since X is compact,
this minimal value is achieved, so that the set

\[
X(\lambda)\;=\;\{x\in X\;|\;L(x,\lambda)\le 0\}
\]

is nonempty; and since X is convex and L is convex in x ∈ X, the set X(λ) is convex (as a level
set of a convex function, Proposition C.1.4). Note also that the set is closed (since X is closed
and L(x, λ) is continuous in x ∈ X).
2⁰. Thus, if the optimal value in (II) is zero, then the set X(λ) is a nonempty convex compact
set for every λ ∈ Λ. And what does it mean that the optimal value in (I) is nonpositive? It
means exactly that there is a point x ∈ X where the function sup_{λ∈Λ} L(x, λ) is nonpositive, i.e., a
point x ∈ X with L(x, λ) ≤ 0 for all λ ∈ Λ. In other words, to prove that the optimal value in (I) is
nonpositive is the same as to prove that the sets X(λ), λ ∈ Λ, have a point in common.
3⁰. With the above observations we see that the situation is as follows: we are given a family
of closed nonempty convex subsets X(λ), λ ∈ Λ, of a compact set X, and we should prove that
these sets have a point in common. To this end, in turn, it suffices to prove that every finite
number of sets from our family have a point in common (to justify this claim, I can refer to the
Helly Theorem II, which gives a much stronger result: to prove that all X(λ) have a point in
common, it suffices to prove that every (n + 1) sets of this family, n being the affine dimension
of X, have a point in common). Let X(λ_0), ..., X(λ_N) be N + 1 sets from our family; we should
prove that the sets have a point in common. In other words, let

\[
f_i(x)\;=\;L(x,\lambda_i),\qquad i=0,\dots,N;
\]

all we should prove is that there exists a point x ∈ X where all these functions are nonpositive, or,
which is the same, that the minmax of our collection of functions – the quantity

\[
\alpha\;=\;\min_{x\in X}\,\max_{i=0,\dots,N}f_i(x)
\]

– is nonpositive.
The proof of the inequality α ≤ 0 is as follows. According to the Minmax Lemma (which
can be applied in our situation – since L is convex and continuous in x, all fi are convex and
continuous, and X is compact), α is the minimum in x ∈ X of a certain convex combination

\[
\varphi(x)\;=\;\sum_{i=0}^{N}\nu_i f_i(x)
\]

of the functions f_i(x). We have
\[
\varphi(x)\;=\;\sum_{i=0}^{N}\nu_i f_i(x)\;\equiv\;\sum_{i=0}^{N}\nu_i L(x,\lambda_i)\;\le\;L\Big(x,\sum_{i=0}^{N}\nu_i\lambda_i\Big)
\]
(the last inequality follows from concavity of L in λ; this is the only – and crucial – point where
we use this assumption). We see that φ(·) is majorized by L(·, λ̂) with the properly chosen
λ̂ = ∑_{i=0}^{N} ν_i λ_i, which belongs to Λ since Λ is convex; it follows that the minimum of φ in x ∈ X
– and we already know that this minimum is exactly α – is nonpositive (recall that the minimum
of L(·, λ) over x ∈ X is nonpositive for every λ ∈ Λ).
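For a concrete instance of the Theorem, consider the bilinear case L(x, λ) = x^T Aλ with x and λ ranging over standard simplices – the classical setting of matrix games. The following minimal sketch (the 2×2 matrix A and the grids are made-up choices) checks numerically that inf sup and sup inf coincide, as the Theorem guarantees:

\begin{verbatim}
import numpy as np

# L(x, lam) = x^T A lam on the standard simplices of R^2: linear (hence
# convex) in x, linear (hence concave) in lam, both sets convex compact.
A = np.array([[1.0, -1.0],
              [-2.0, 3.0]])

p = np.linspace(0.0, 1.0, 1001)   # x = (p, 1 - p)
q = np.linspace(0.0, 1.0, 1001)   # lam = (q, 1 - q)
P, Q = np.meshgrid(p, q, indexing="ij")
L = (A[0, 0] * P * Q + A[0, 1] * P * (1 - Q)
     + A[1, 0] * (1 - P) * Q + A[1, 1] * (1 - P) * (1 - Q))

inf_sup = L.max(axis=1).min()     # min over x of max over lam
sup_inf = L.min(axis=0).max()     # max over lam of min over x
print(inf_sup, sup_inf)           # both ~1/7: the saddle (game) value
\end{verbatim}

For this particular A the common value is 1/7, attained at x* = (5/7, 2/7), λ* = (4/7, 3/7).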
The next theorem lifts the assumption of boundedness of X and Λ in Theorem D.4.2 – now
only one of these sets should be bounded – at the price of some weakening of the conclusion.
Theorem D.4.3 [Swapping min and max in convex-concave saddle point problem (Sion-
Kakutani)] Let X and Λ be convex sets in Rn and Rm , respectively, with X being compact,
and let
L(x, λ) : X × Λ → R

be a continuous function which is convex in x ∈ X for every fixed λ ∈ Λ and is concave in λ ∈ Λ
for every fixed x ∈ X. Then

\[
\inf_{x\in X}\,\sup_{\lambda\in\Lambda}L(x,\lambda)\;=\;\sup_{\lambda\in\Lambda}\,\inf_{x\in X}L(x,\lambda).
\tag{D.4.6}
\]
Proof. By general theory, in (D.4.6) the left hand side is ≥ the right hand side, so that there
is nothing to prove when the right hand side is +∞. Assume that this is not the case. Since X
is compact and L is continuous in x ∈ X, we have inf_{x∈X} L(x, λ) > −∞ for every λ ∈ Λ, so that
the right hand side in (D.4.6) cannot be −∞; since it is not +∞ as well, it is a real, and by shifting
L by a constant we can assume w.l.o.g. that this real is 0:

\[
\sup_{\lambda\in\Lambda}\,\inf_{x\in X}L(x,\lambda)\;=\;0.
\]
All we need to prove now is that the left hand side in (D.4.6) is nonpositive. Assume, on the
contrary, that it is positive and thus is > c for some c > 0. Then for every x ∈ X there exists
λ_x ∈ Λ such that L(x, λ_x) > c. By continuity, there exists a neighborhood V_x of x in X such that
L(x′, λ_x) ≥ c for all x′ ∈ V_x. Since X is compact, we can find finitely many points x_1, ..., x_n in X
such that the sets V_{x_1}, ..., V_{x_n} cover X, implying that max_{1≤i≤n} L(x, λ_{x_i}) ≥ c
for every x ∈ X. Now let Λ̄ be the convex hull of {λ_{x_1}, ..., λ_{x_n}} – a convex compact subset
of Λ – so that max_{λ∈Λ̄} L(x, λ) ≥ c for every x ∈ X. Applying Theorem D.4.2 to L and the
convex compact sets X, Λ̄, we get the equality in the following chain:

\[
c\;\le\;\min_{x\in X}\,\max_{\lambda\in\bar\Lambda}L(x,\lambda)\;=\;\max_{\lambda\in\bar\Lambda}\,\min_{x\in X}L(x,\lambda)\;\le\;\sup_{\lambda\in\Lambda}\,\inf_{x\in X}L(x,\lambda)\;=\;0,
\]

which contradicts c > 0. Thus, the left hand side in (D.4.6) is nonpositive, and (D.4.6) follows.
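As an illustration of the semi-bounded case, here is a minimal numerical sketch (the function and the grids are made-up choices): X = [−1, 1] is compact while Λ = R is not, L(x, λ) = x² + 2λx − λ² is convex in x and concave in λ, and the two sides of (D.4.6) agree (the common value is 0, with the saddle at the origin). The λ-grid is necessarily truncated, which is harmless here since L(x, ·) → −∞ as |λ| → ∞:

\begin{verbatim}
import numpy as np

# Semi-bounded instance: X = [-1, 1] compact, Lambda = R unbounded;
# L(x, lam) = x**2 + 2*lam*x - lam**2, convex in x and concave in lam.
x = np.linspace(-1.0, 1.0, 1001)
lam = np.linspace(-5.0, 5.0, 1001)         # truncated grid for Lambda = R
Xg, Lg = np.meshgrid(x, lam, indexing="ij")
L = Xg**2 + 2.0 * Lg * Xg - Lg**2

inf_sup = L.max(axis=1).min()   # inf over x of sup over lam
sup_inf = L.min(axis=0).max()   # sup over lam of inf over x
print(inf_sup, sup_inf)         # both ~0, matching (D.4.6)
\end{verbatim}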
Theorem D.4.4 [Existence of saddle point in convex-concave saddle point problem (Sion-
Kakutani, Semi-Bounded case)] Let X and Λ be closed convex sets in Rn and Rm , respectively,
with X being compact, and let
L(x, λ) : X × Λ → R
be a continuous function which is convex in x ∈ X for every fixed λ ∈ Λ and is concave in λ ∈ Λ
for every fixed x ∈ X. Assume that for every a ∈ R there exists a collection x^a_1, ..., x^a_{n_a} ∈ X such
that the set

\[
\{\lambda\in\Lambda\;:\;L(x^a_i,\lambda)\ge a,\ 1\le i\le n_a\}
\]

is bounded⁶. Then L has saddle points on X × Λ.
Proof. Since X is compact and L is continuous, the function L(λ) := min_{x∈X} L(x, λ) is real-
valued and continuous on Λ. Further, for every a ∈ R, the set {λ ∈ Λ : L(λ) ≥ a} clearly is con-
tained in the set {λ ∈ Λ : L(x^a_i, λ) ≥ a, 1 ≤ i ≤ n_a} and thus is bounded. Thus, L(λ) is a continuous
function on the closed set Λ with bounded level sets {λ ∈ Λ : L(λ) ≥ a}, implying that L(·) at-
tains its maximum on Λ, i.e., that problem (II) is solvable. Invoking Theorem D.4.3, it follows that
inf_{x∈X} [L(x) := sup_{λ∈Λ} L(x, λ)] is finite, implying that the function L(·) is not identically +∞
in x ∈ X. Since L is continuous, the function L(x), being the supremum of a family of continuous
functions of x, is lower semicontinuous on X; a lower semicontinuous function which is finite somewhere
on the compact set X attains its minimum on X, so that problem (I) is solvable as well. Since, by
Theorem D.4.3, the optimal values in (I) and (II) are equal, Theorem D.4.1 implies that L has saddle
points on X × Λ.
6)
this definitely is the case when L(x̄, λ) is coercive in λ for some x̄ ∈ X, meaning that the sets {λ ∈ Λ :
L(x̄, λ) ≥ a} are bounded for every a ∈ R or, equivalently, that whenever λ_i ∈ Λ and ‖λ_i‖₂ → ∞ as i → ∞, we have
L(x̄, λ_i) → −∞ as i → ∞.
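To make the coercivity condition concrete, here is a minimal numerical sketch (the program and all numbers are made-up choices): the Lagrange function L(x, λ) = x² + λ(1 − x) of the convex program min{x² : 1 − x ≤ 0, x ∈ [0, 2]} has X = [0, 2] compact and Λ = R_+ closed but unbounded, and it is coercive in λ at x̄ = 2, since L(2, λ) = 4 − λ → −∞; by Theorem D.4.4, saddle points exist, and indeed (x*, λ*) = (1, 2) is one:

\begin{verbatim}
import numpy as np

# Lagrange function of min{x**2 : 1 - x <= 0, x in [0, 2]} (Slater holds);
# coercive in lam at xbar = 2, so Theorem D.4.4 guarantees a saddle point.
L = lambda x, lam: x**2 + lam * (1.0 - x)

x_star, lam_star = 1.0, 2.0
xs = np.linspace(0.0, 2.0, 2001)
lams = np.linspace(0.0, 10.0, 2001)   # truncated grid for unbounded Lambda

# Saddle point inequalities L(x*, lam) <= L(x*, lam*) <= L(x, lam*),
# checked on the grids up to a small numerical tolerance.
assert np.all(L(x_star, lams) <= L(x_star, lam_star) + 1e-9)
assert np.all(L(xs, lam_star) >= L(x_star, lam_star) - 1e-9)
print("saddle value:", L(x_star, lam_star))   # 1.0, the optimal value
\end{verbatim}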