MA412 Final
Optimization
Giorgio Consigli
giorgio.consigli@ku.ac.ae
Fall Term a.y. 2022-2023
Khalifa University
Overview
The course is structured in four main parts:
• Part I Mathematical Review: matrix theory, convexity, multivariable
calculus.
• Part II Unconstrained optimization: Gradient descent, Newton methods,
Quasi-Newton methods.
• Part III Linear programming: Simplex method, Interior point method.
• Part IV Nonlinear constrained optimization: Lagrange method,
Karush-Kuhn-Tucker conditions, second order conditions, convex
optimization, algorithms, multi-objective optimization.
Material
This document will only trace the main topics, while many exercises and
applications will be handed out separately and posted on our BB course page,
where you will be able to collect:
• These notes, which will evolve over the course, always including previous
sections.
• Exercise questions and, separately, solutions.
• Simple Excel or MATLAB applications to explain some methods and results.
The main reference textbook of the course is the 4th edition of An Introduction to
Optimization by Edwin K. P. Chong and Stanislaw H. Zak, Wiley 2013.
Part I.a: Linear algebra
Through several exercises we wish to summarize the main relevant
concepts from linear algebra needed in our course, specifically:
• linear independence
• vector spaces and subspaces
• transformations, linear systems and matrix definiteness.
We denote with x ∈ Rn an n-dimensional vector with elements
{xi }, i = 1, 2, ..., n. We use capital letters for matrices: A ∈ Rm,n is a
linear mapping from Rn to Rm , whose rank is denoted by rk(A) ≤ min(m, n), while
a subspace, say W ⊆ V, is often defined and should satisfy
the conditions for a vector space. It will also include the null vector 0
of all 0’s.
Exercises 1
Inner product operator and inequalities
Let v and w be two vectors in Rn . Their inner product satisfies:
$\langle v, w \rangle = v^T w = \sum_i v_i w_i = \cos(\theta)\,\|v\| \cdot \|w\|$, where θ ∈ [0, π]
and ||.|| is the Euclidean norm. Thus
$\cos(\theta) = \dfrac{\langle v, w \rangle}{\|v\| \cdot \|w\|}$ and $\theta = \arccos(\cdot)$.
For two vectors to be orthogonal you need the inner product to be 0.
→Schwarz inequality, Triangle inequality
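The relations above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `inner`, `norm` and `angle` and the two sample vectors are illustrative assumptions, not part of the course material.

```python
import math

def inner(v, w):
    """Inner product <v, w> = sum_i v_i * w_i."""
    return sum(vi * wi for vi, wi in zip(v, w))

def norm(v):
    """Euclidean norm ||v||."""
    return math.sqrt(inner(v, v))

def angle(v, w):
    """Angle theta in [0, pi] between two nonzero vectors."""
    c = inner(v, w) / (norm(v) * norm(w))
    # Clamp against rounding so that acos stays inside its domain.
    return math.acos(max(-1.0, min(1.0, c)))

# Hypothetical sample vectors.
v, w = [1.0, 0.0], [1.0, 1.0]
theta = angle(v, w)
# Schwarz inequality: |<v, w>| <= ||v|| ||w||
schwarz_ok = abs(inner(v, w)) <= norm(v) * norm(w)
# Triangle inequality: ||v + w|| <= ||v|| + ||w||
triangle_ok = norm([a + b for a, b in zip(v, w)]) <= norm(v) + norm(w)
```

Two vectors are orthogonal exactly when `inner` returns 0, for instance for (1, 2) and (2, −1).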
Exercises 2
Linear systems
We define the set of linear equations Ax = b with A ∈ Rm,n , a vector of
unknowns x ∈ Rn and a vector of coefficients b ∈ Rm , where for
b = 0 we have a homogeneous system.
→Rouche-Capelli, Gaussian elimination, Cramer
→Transformations
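Consider the system Ax = λx. We look for the set of λ’s (the eigenvalues of A) that allow
such equality to hold for some x ≠ 0, that is (A − λI)x = 0, which requires:
det(A − λI) = 0, or equivalently det(λI − A) = 0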
Exercises 3
Diagonalization
Key results in Rn :
• $\det(A) = \prod_{i=1}^{n} \lambda_i$ and $\mathrm{tr}(A) := \sum_i a_{ii} = \sum_i \lambda_i$
Exercises 4
Orthogonal decomposition
Let V be a vector space and U, W two subspaces. The two spaces are
orthogonal if their elements (vectors) are orthogonal. We define the
orthogonal complement U ⊥ := {x|xT u = 0, ∀u ∈ U }.
The orthogonal complements span a given space, due to the following
relationship. Take V to be a subspace of Rn , then we would have:
Rn = V + V ⊥ =⇒ dim(V) + dim(V ⊥ ) = n. Same relationship would
hold for the subspaces of V.
Every vector x ∈ Rn admits a unique orthogonal decomposition
x = x1 + x2 , with x1 ∈ V, x2 ∈ V ⊥ .
Projections
Consider a matrix A ∈ R(m,n) .
Let R(A) := {Ax|x ∈ Rn } define the range or image of matrix A and
N (A) := {x ∈ Rn |Ax = 0} the null space of A.
We have the following results:
• Let A be a given matrix, then R(A)⊥ = N (AT ) and
N (A)⊥ = R(AT ).
• A matrix P is an orthogonal projector onto the subspace
V = R(P) iff P^2 = P = P^T . For any x ∈ V, Px = x.
Of the above results, the first establishes a relationship between two
relevant subspaces and the second defines a Necessary and Sufficient
Condition (N.S.C.) for orthogonal projection.
Orthogonal projections and Gram-Schmidt method
Let V be a vector space and U a subspace endowed with orthonormal
basis u(1) , ..., u(n) , then for any v ∈ V its orthogonal projection v⊥
onto U is given by
v⊥ =< v, u(1) > u(1) + < v, u(2) > u(2) + ...+ < v, u(n) > u(n) .
From this representation we can derive the popular Gram-Schmidt
method: this allows, from any given basis in a vector space, to derive
an associated orthonormal basis. In particular let x(1) , x(2) , ..., x(n) be a
basis in V ⊆ Rn . We want to construct a new basis of orthonormal
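vectors v(1) , ..., v(n) , then:
$v^{(1)} = \dfrac{x^{(1)}}{\|x^{(1)}\|}, \quad v^{(2)} = \dfrac{x^{(2)} - \langle x^{(2)}, v^{(1)} \rangle v^{(1)}}{\|x^{(2)} - \langle x^{(2)}, v^{(1)} \rangle v^{(1)}\|}, \quad \ldots$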
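The Gram-Schmidt construction can be sketched in a few lines of plain Python. This is an illustrative implementation under the assumption that the input vectors are linearly independent; the function name `gram_schmidt` and the sample basis are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(basis):
    """Return an orthonormal basis spanning the same space as `basis`.

    Assumes the input vectors are linearly independent.
    """
    ortho = []
    for x in basis:
        # Subtract from x its projection onto each vector built so far.
        r = list(x)
        for v in ortho:
            c = dot(x, v)
            r = [ri - c * vi for ri, vi in zip(r, v)]
        n = math.sqrt(dot(r, r))
        ortho.append([ri / n for ri in r])  # normalize the remainder
    return ortho

# Hypothetical basis of R^3.
V = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
```

Each new vector is the normalized remainder of x(k) after removing its orthogonal projection onto the span of the previous v(j), which is exactly the projection formula given above.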
Exercises 5
Definiteness
We say that f : Rn → R is a quadratic form if f (x) = xT · Q · x with
Q an (n, n) real matrix, assumed to be symmetric. Then
the quadratic form is said to be:
• positive definite (p.d.) if xT · Q · x > 0 for all x ≠ 0
• positive semi definite (p.s.d.) if xT · Q · x ≥ 0 for any x
• negative definite (n.d.) if xT · Q · x < 0 for all x ≠ 0
• negative semi definite (n.s.d.) if xT · Q · x ≤ 0 for any x
• indefinite if neither (p.s.d.) nor (n.s.d.). We see later on that the
sign of a matrix is of primary relevance when studying the
behaviour of a function in Rn and to determine its convexity.
Definiteness: a few results
We give two relevant results which help to determine the sign of a
matrix directly.
Result 1: Q is p.d. iff all its eigenvalues are positive, n.d. iff all are
negative, p.s.d. iff all are nonnegative and n.s.d. iff all are nonpositive. The
matrix is indefinite if it has both positive and negative eigenvalues.
Result 2 (see Sylvester criterion): Let A be symmetric of order n. Then
it is p.d. iff all its north-west (leading principal) minors are positive: Det(a11 ) > 0, Det(A22 ) > 0,
..., Det(Akk ) > 0, ..., Det(A) > 0. It is n.d. iff Det(a11 ) < 0,
Det(A22 ) > 0, ..., (−1)^k Det(Akk ) > 0, ..., (−1)^n Det(A) > 0. To assess
p.s.d. and n.s.d. it is necessary to introduce principal minors, whose
sign will determine the condition in case one of the sub-matrices has zero determinant.
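Result 2 lends itself to a direct numerical check. The sketch below is a plain-Python illustration (the helper names `det`, `leading_minors`, `is_positive_definite`, `is_negative_definite` and the three test matrices are assumptions made for the example); it covers only the strict p.d. and n.d. cases, not the semi-definite ones.

```python
def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [list(map(float, row)) for row in M]
    n, d = len(A), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[p][c] == 0.0:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d  # a row swap changes the sign of the determinant
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d

def leading_minors(Q):
    """North-west minors Det(A_kk), k = 1, ..., n."""
    return [det([row[:k] for row in Q[:k]]) for k in range(1, len(Q) + 1)]

def is_positive_definite(Q):
    return all(m > 0 for m in leading_minors(Q))

def is_negative_definite(Q):
    # Signs must alternate: (-1)^k Det(A_kk) > 0.
    return all((-1) ** k * m > 0
               for k, m in enumerate(leading_minors(Q), start=1))

# Hypothetical symmetric matrices.
Q_pd = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
Q_nd = [[-2, 1], [1, -2]]
Q_ind = [[1, 2], [2, 1]]
```

For `Q_ind` both tests fail, which is consistent with Result 1: its eigenvalues are 3 and −1.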
Exercises 6
Part I.b Geometry
Let x, y be vectors in Rn and z be a vector lying in the segment joining
x and y: z = αx + (1 − α)y for α ∈ [0, 1].
Let u1 , u2 , ..., un , elements of u, and v be in R. The set
H := {x ∈ Rn |uT · x = v} defines a hyperplane of dimension n − 1.
For n = 2 : u1 x1 + u2 x2 = v is a straight line; for
n = 3 : u1 x1 + u2 x2 + u3 x3 = v defines a plane (a subspace only for
v = 0). We define positive and negative half-spaces in Rn as:
H+ := {x|uT x ≥ v} H− := {x|uT x ≤ v}
The vector u ∈ Rn is said to be normal to H under the following
conditions. Let a be in H, then uT · a = v and
(uT · x − v) − (uT · a − v) = uT · (x − a) = 0 for u ⊥ (x − a). Then:
H+ := {x|uT · (x − a) ≥ 0} H− := {x|uT · (x − a) ≤ 0}
Convex sets
Let A ∈ Rm,n , b ∈ Rm , a linear variety in Rn is a set
{x ∈ Rn |A · x = b}.
• The empty set, a single point, a line, a subspace, a hyperplane, a
linear variety, a half space, Rn are convex sets in their k − dim
spaces for k = 0, 1, 2, 3, .....
• Assume a convex subset Θ ⊆ Rn , then:
• βΘ := {x|x = βv, v ∈ Θ} is convex
• If Θ1 , Θ2 are convex, then
Θ1 + Θ2 := {x|x = v1 + v2 , vi ∈ Θi , i = 1, 2} is convex
• V = ∩i=1,2,..,n Θi : the intersection of convex sets is convex.
• The set Bε (x) := {y ∈ Rn | ||y − x|| < ε} defines a
neighbourhood of x, it is convex. A set S is open if it contains a
neighbourhood of every point in the set. It is closed if it contains its
boundary. If it is contained in a ball of finite radius the set is bounded, if closed and bounded it
is said to be compact.
Functions
A set that can be expressed as the intersection of a finite number of
half-spaces is a polyhedron, when bounded and non-empty this is a
polytope.
Consider two disjoint convex sets S and T : then there exists a
u ≠ 0 such that uT x ≤ uT y for every x ∈ S and y ∈ T. Such a vector defines a
separating hyperplane for the elements of the two sets.
Let S ⊆ Rn be convex, f : Rn → R is:
• convex if for any x, x′ ∈ S and α ∈ (0, 1): f (αx + (1 − α)x′ ) ≤ αf(x) + (1 − α)f(x′ )
• concave if for any x, x′ ∈ S, x ≠ x′ and α ∈ (0, 1):
f (αx + (1 − α)x′ ) ≥ αf(x) + (1 − α)f(x′ ) (thus −f is convex)
• strictly convex: if x ≠ x′ and LHS < RHS.
• strictly concave: if x ≠ x′ and LHS > RHS (again −f is strictly
convex)
Exercises 7
Part I.c Calculus
Few useful results on limits of vector sequences in Rn can be
summarised:
• The sequence $\{x^{(k)}\}_{k=1}^{\infty} \to x^*$ if there exists a k s.t.
Exercises 8
Level sets
We define the level set of f : Rn → R as the set S := {x|f (x) = c} for
any c ∈ R in the function domain.
For varying coefficient c = {c1 , c2 , ...} we identify different level curves
that can be studied in the (n − 1) dimension.
Consider an arbitrary c and a mapping g : R → Rn , with g(t0 ) = x0
and Dg(t0 ) = v ≠ 0 so that v is tangent to the level set at x0 .
Consider the composite function h(t) = f (g(t)) = c with f : Rn → R,
we apply the chain rule:
h′(t0 ) = Df (g(t0 )) · Dg(t0 ) = Df (x0 ) · v. But since h(t) = f (g(t)) = c,
h′(t) = 0, thus Df (x0 ) · v = 0
The tangent vector at x0 and the gradient are orthogonal and this is
true for any point on the level set.
As we define different level curves we can study the directions on the
surface. The gradients provide the direction of maximal rate of increase
over the surface in Rn .
Exercises 9
Taylor expansion in Rn
We now have everything we need to extend Taylor expansion to Rn .
Let f : Rn → R be in C 2 . Then
$f(x) = f(x_0) + Df(x_0)(x - x_0) + \frac{1}{2!}(x - x_0)^T D^2 f(x_0)(x - x_0) + o(\|x - x_0\|^2)$
dT ∇f (x∗ ) ≥ 0
Exercises 11
1-d search methods
Assume now that f : X ⊂ R → R is a unimodal function. We consider
three iterative approaches to identify the minimum of a continuous
function: find
x∗ = argmin_{x∈R} f (x)
• Fibonacci search method: it is based on the celebrated Fibonacci
numbers and relies only on function evaluation steps.
• First-order methods: in this generic approach (with several specific
versions) the problem is solved relying on first derivatives.
• Newton’s method implements instead an optimum search based
on first and second derivatives. We’ll see that this has then been
extended to Rn in more recent times.
Fibonacci method
This is an algorithm inspired by the ancient Greek Golden section method. It is
globally convergent and employs a range reduction for function evaluation
based on $\rho_k = 1 - \dfrac{F_{N-k+1}}{F_{N-k+2}}$, for k = 1, .., N, where Fk are Fibonacci numbers
satisfying $F_{k+1} = F_k + F_{k-1}$ with $F_{-1} = 0$ and $F_0 = F_1 = 1$.
1. Initial conditions: the initial range [a0 , b0 ], the number of evaluations
N such that $\dfrac{1+2\epsilon}{F_{N+1}} \le \dfrac{\text{final range}}{\text{initial range}} = c$, leading to $F_{N+1} \ge \dfrac{1+2\epsilon}{c}$.
2. Iteration 1: say N = 5, $1 - \rho_1 = \frac{8}{13}$, a1 = a0 + ρ1 (b0 − a0 ),
b1 = a0 + (1 − ρ1 )(b0 − a0 ), f (a1 ), f (b1 ) evaluations, assume
f (a1 ) < f (b1 ), then range reduced to [a0 , b1 ].
3. ... iteration k
4. Last (fifth) iteration: 1 − ρ5 = 1/2, a5 = ak + (ρ5 − ε)(b4 − ak ), say
b5 = ak+1 , then f (a5 ), f (b5 ) and take the minimum, STOP.
Develop example.
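A possible way to develop the example is the following plain-Python sketch of the range reduction. It is a simplified illustration, not the slide's worked example: the function name `fibonacci_search`, the test function (x − 1)^2, the interval [0, 3] and ε = 0.01 are assumptions, and for clarity both interior points are re-evaluated at every iteration, whereas the actual Fibonacci method reuses one of the two evaluations.

```python
def fibonacci_search(f, a, b, N, eps=0.01):
    """Reduce [a, b] around the minimizer of a unimodal f in N iterations.

    Uses rho_k = 1 - F_{N-k+1} / F_{N-k+2}, with F_0 = F_1 = 1 and the
    last rho_N = 1/2 replaced by 1/2 - eps.
    """
    F = [1, 1]                      # F[0] = F_0, F[1] = F_1
    while len(F) < N + 2:
        F.append(F[-1] + F[-2])     # F_{k+1} = F_k + F_{k-1}
    for k in range(1, N + 1):
        rho = 1 - F[N - k + 1] / F[N - k + 2]
        if k == N:
            rho = 0.5 - eps         # avoid two coincident evaluation points
        x1 = a + rho * (b - a)
        x2 = a + (1 - rho) * (b - a)
        if f(x1) < f(x2):
            b = x2                  # minimizer lies in [a, x2]
        else:
            a = x1                  # minimizer lies in [x1, b]
    return a, b

# Hypothetical unimodal test function with minimizer x* = 1.
a_N, b_N = fibonacci_search(lambda x: (x - 1.0) ** 2, 0.0, 3.0, N=10)
```

With N = 10 the final range is the initial range multiplied by (1 + 2ε)/F11 = 1.02/144, in agreement with the bound in step 1.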
First order algorithms
Assume now f : R → R, f ∈ C 1 . Adopt the following algorithm (based
on FONC):
1. Find all stationary points of f (x) by solving df/dx = 0
2. Evaluate f at each such point
3. Evaluate the limits of f (x) as x → +∞ and x → −∞
4. Select the least of the values of f in steps 2 and 3. This is
min_x f (x).
In this simple algorithm, we are just introducing a function evaluation
step over a selected finite number of stationary points, making
sure that the function is bounded (step 3).
The algorithm is surely convergent to a local minimum in the interior of
the function domain if the function is bounded. An issue may arise in
presence of saddle points.
1-d Newton method
f : R → R, f ∈ C2 and we consider now an iterative procedure. Given a value
x(k) we assume that f (x(k) ), f ′(x(k) ), f ′′(x(k) ) are defined: we then introduce
a quadratic approximation
$q(x) = f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)}) + \frac{1}{2} f''(x^{(k)})(x - x^{(k)})^2$
Instead of minimizing f we minimize q at each iteration. Choose x(0) , then
• $0 = q'(x) = f'(x^{(k)}) + f''(x^{(k)})(x - x^{(k)})$ is now the FONC
• Letting $x = x^{(k+1)}$ we have $x^{(k+1)} = x^{(k)} - \dfrac{f'(x^{(k)})}{f''(x^{(k)})}$. This is the Newton step.
The method is accurate and fast as long as the second derivative is positive;
otherwise it may not converge.
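The Newton step can be illustrated with a short plain-Python sketch. The function name `newton_1d`, the stopping rule on the step size and the test function f(x) = x²/2 − sin(x) are assumptions chosen for the example; its second derivative 1 + sin(x) is positive near the minimizer, so the iteration is well behaved from the chosen starting point.

```python
import math

def newton_1d(df, d2f, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x - f'(x) / f''(x) until the step is below tol."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical example: f(x) = x**2 / 2 - sin(x)
# f'(x) = x - cos(x), f''(x) = 1 + sin(x)
x_star = newton_1d(lambda x: x - math.cos(x),
                   lambda x: 1.0 + math.sin(x),
                   x0=0.5)
```

At the returned point the FONC f ′(x) = x − cos(x) = 0 holds to within rounding.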
Exercises 12
Part II.b: Gradient method
Consider f : Rn → R, f ∈ C 1 : we saw that any x0 |f (x0 ) = c on the level set
of c is an element of a tangent hyperplane to c for which tg(x0 ) ⊥ ∇f (x0 ).
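Then we define a direction of increase along the surface through
$\langle \nabla f(x), d \rangle$ with $d = \dfrac{\nabla f(x)}{\|\nabla f(x)\|}$, then $\left\langle \nabla f(x), \dfrac{\nabla f(x)}{\|\nabla f(x)\|} \right\rangle = \|\nabla f(x)\|$ is the
maximal rate of increase of f at x.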
$x^{(k+1)} = x^{(k)} - \dfrac{g^{(k)T} g^{(k)}}{g^{(k)T} Q g^{(k)}}\, g^{(k)}$
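The steepest descent iteration with this exact step size can be sketched as follows. This is an illustrative plain-Python version under the assumption of a quadratic objective f(x) = (1/2) xT Q x − bT x with Q symmetric p.d., so that g = Qx − b; the function name `steepest_descent_qp` and the data Q, b are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(Q, x):
    return [dot(row, x) for row in Q]

def steepest_descent_qp(Q, b, x0, tol=1e-10, max_iter=1000):
    """Minimize 0.5 x^T Q x - b^T x (Q symmetric p.d.) by steepest descent."""
    x = list(x0)
    iterations = 0
    for _ in range(max_iter):
        g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]  # g = Qx - b
        gg = dot(g, g)
        if gg ** 0.5 < tol:
            break
        alpha = gg / dot(g, matvec(Q, g))  # step g^T g / g^T Q g
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        iterations += 1
    return x, iterations

# Hypothetical data: the minimizer solves Qx = b, i.e. x* = (1, 1).
Q = [[2.0, 0.0], [0.0, 10.0]]
b = [2.0, 10.0]
x_min, n_iter = steepest_descent_qp(Q, b, x0=[0.0, 0.0])
```

The number of iterations grows with the spread of the eigenvalues of Q (here 2 and 10), which is the zig-zag behaviour typical of this method.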
Exercises 13
Convergence analysis
Convergence analysis focuses on:
• its characterization as global or local convergence.
• results characterizing the convergence to an optimum x(k) → x∗ so that
f (x∗ ) is the associated optimal value of f ,
• the speed of convergence to that value.
As for the first point, an algorithm is said to be globally convergent when from
any x(0) it will surely reach a point x∗ where FONC are satisfied. When such
condition requires the starting point to be close to x∗ , in a neighbourhood of,
then we speak of local convergence. We focus for simplicity on the QP case
with f (x) = (1/2) xT · Q · x with Q p.d. and symmetric.
Convergence analysis
In the case of steepest descent two results help characterizing its convergence.
3. Qp = ∞ otherwise.
The following result holds: let Qp be Q factors of x(k) , then one of the
following is true: (a) Qp = 0 for any p ∈ [1, ∞); (b) Qp = ∞ for any
p ∈ [1, ∞); (c) There is a p0 ∈ [1, ∞) such that Qp = 0 for p ∈ [1, p0 ) and
Qp = ∞ for p ∈ (p0 , ∞). We complete this section with few additional
definitions.
Convergence analysis, p = 1, 2
Consider the case p = 1. We have:
• If Q1 = 0, thus $\lim_{k\to\infty} \dfrac{e_{k+1}}{e_k} = 0$, we say that the speed of convergence
is Q-superlinear,
• If 0 < Q1 < 1 under a given norm, we say that in that norm the speed
of convergence is Q-linear,
• For Q1 ≥ 1 in the given norm, we can say that x(k) has speed of
convergence Q-sublinear.
The distinction generalizes to p = 2: we can distinguish accordingly between
Q-superquadratic, Q-quadratic and Q-subquadratic speed of convergence.
The steepest descent algorithm is at least Q-linearly convergent under the
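norm ||x||Q = (xT · Q · x)^{1/2} since
$\|x^{(k+1)} - x^*\|_Q \le \dfrac{\lambda_M - \lambda_m}{\lambda_M + \lambda_m} \cdot \|x^{(k)} - x^*\|_Q$
where λM and λm are the largest and smallest eigenvalues of Q.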
Part II.c: Newton and Quasi-Newton methods
Consider now Newton method in Rn . The extension is quite straightforward,
f : Rn → R, with f ∈ C 2 :
1. Let again g (k) = ∇f (x(k) ) and assume the approximation of f with a
quadratic function
$q(x) = f(x^{(k)}) + (x - x^{(k)})^T g^{(k)} + \frac{1}{2}(x - x^{(k)})^T F(x^{(k)})(x - x^{(k)})$
2. From FONC: $0 = \nabla q(x) = g^{(k)} + F(x^{(k)})(x - x^{(k)})$, thus
$x^{(k+1)} = x^{(k)} - F(x^{(k)})^{-1} g^{(k)}$
is the Newton step in Rn .
The Newton step is actually and conveniently decomposed in two steps: (i)
F (x(k) ) · d(k) = −g (k) and (ii) x(k+1) = x(k) + d(k) . Positive definiteness of
the Hessian at every iteration is key to the convergence (in a minimization
problem).
Newton in Rn
We can characterize the convergence properties of this algorithm. In the
(QP) case we have f (x) = (1/2) xT · Q · x − xT b, thus g(x) = ∇f (x) = Qx − b
and F (x) = Q which is symmetric and assumed to be invertible. Then in 1
single step:
$x^{(1)} = x^{(0)} - F(x^{(0)})^{-1} g^{(0)} = x^{(0)} - Q^{-1}\left[Q x^{(0)} - b\right] = Q^{-1} b = x^*$
In the general case, there are two relevant results we can quote:
1. Let f ∈ C 2 , x∗ ∈ Rn such that ∇f (x∗ ) = 0 and F (x∗ ) invertible. Then
for any x(0) sufficiently close to x∗ the Newton step is well defined for all
k and converges to x∗ with an order of convergence at least 2 (quadratic
convergence).
2. Let x(k) be the sequence generated by Newton method to minimize f (x).
If F (x(k) ) > 0 (p.d.) and g (k) = ∇f (x(k) ) ≠ 0, then
d(k) = −F (x(k) )−1 g (k) = x(k+1) − x(k) is a descent direction for f and
there exists an α0 > 0 such that for any α ∈ (0, α0 )
f (x(k) + αd(k) ) < f (x(k) ).
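The two-step form of the Newton iteration, (i) solve F d = −g and (ii) update x, can be sketched in plain Python. The helper names `solve` and `newton` and the test function f(x) = exp(x1 + x2) + x1² + x2² are assumptions made for illustration; this function is strictly convex, so its Hessian is p.d. at every iterate.

```python
import math

def solve(A, rhs):
    """Solve A d = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(map(float, A[i])) + [float(rhs[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    d = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * d[j] for j in range(i + 1, n))
        d[i] = (M[i][n] - s) / M[i][i]
    return d

def newton(grad, hess, x0, tol=1e-12, max_iter=50):
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break
        d = solve(hess(x), [-gi for gi in g])   # (i)  F(x) d = -g
        x = [xi + di for xi, di in zip(x, d)]   # (ii) x <- x + d
    return x

# Hypothetical test function f(x) = exp(x1 + x2) + x1**2 + x2**2.
def grad(x):
    s = math.exp(x[0] + x[1])
    return [s + 2 * x[0], s + 2 * x[1]]

def hess(x):
    s = math.exp(x[0] + x[1])
    return [[s + 2, s], [s, s + 2]]

x_star = newton(grad, hess, x0=[0.0, 0.0])
```

Solving the linear system in step (i) avoids forming the inverse Hessian explicitly, which is the numerical concern that motivates the quasi-Newton methods.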
Exercises 14
Quasi-Newton methods
Quasi-Newton methods (QNM) are motivated by the numerical problems
associated with the direction d(k) = −F (x(k) )−1 · g (k) and the associated
Hessian inversion. We look for an approximation Hk of the Hessian inverse
and a step α leading to the iterations: x(k+1) = x(k) − αk Hk g (k) for
k = 1, 2, ... until termination.
The definition of Hk is simple in the case of (QP) problems, in which the
Hessian is independent of the iterations, F (x) = Q for any x(k) . Then
g (k+1) − g (k) = Q(x(k+1) − x(k) ), or ∆g (k) = Q∆x(k) and
Q−1 ∆g (k) = ∆x(k) .
We may then require in the general case that Hk+1 ∆g (i) = ∆x(i) for all
i ≤ k.
Rank 1 algorithm
5. k = k + 1 go to 2.
The algorithm is based on satisfying at every iteration the system
Hk+1 ∆g (k) = ∆x(k) . It turns out that this is sufficient to have
Hk+1 ∆g (i) = ∆x(i) for i = 0, 1, ..., k.
DFP Hessian update
The DFP algorithm is the first one extending the rank 1 methods in the early
60’s.
1. k = 0: select x(0) and a real symmetric p.d. matrix H0 .
2. if g (k) = 0 stop, else d(k) = −Hk · g (k) .
3. compute αk = argminα≥0 f (x(k) + αd(k) ), x(k+1) = x(k) + αk d(k)
4. Derive ∆x(k) = αk d(k) , ∆g (k) = g (k+1) − g (k) ,
$H_{k+1} = H_k + \dfrac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \dfrac{[H_k \Delta g^{(k)}][H_k \Delta g^{(k)}]^T}{\Delta g^{(k)T} H_k \Delta g^{(k)}}$
5. k = k + 1 go to 2.
The DFP algorithm is indeed a QNM: when applied to a QP problem we have
Hk+1 ∆g (i) = ∆x(i) for all i = 0, 1, .., k
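The DFP steps can be traced on a small quadratic problem. The plain-Python sketch below assumes f(x) = (1/2) xT Q x − bT x, for which the exact line search has the closed form α = −gT d / dT Q d; the function name `dfp_quadratic` and the data Q, b, x(0) are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def dfp_quadratic(Q, b, x0, steps):
    """Run `steps` DFP iterations on f(x) = 0.5 x^T Q x - b^T x, with H0 = I."""
    n = len(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    x = list(x0)
    g = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
    for _ in range(steps):
        if dot(g, g) == 0.0:
            break
        d = [-v for v in matvec(H, g)]                    # d = -H g
        alpha = -dot(g, d) / dot(d, matvec(Q, d))         # exact line search
        dx = [alpha * di for di in d]
        x = [xi + dxi for xi, dxi in zip(x, dx)]
        g_new = [qi - bi for qi, bi in zip(matvec(Q, x), b)]
        dg = [a - c for a, c in zip(g_new, g)]
        Hdg = matvec(H, dg)
        s1, s2 = dot(dx, dg), dot(dg, Hdg)
        # Rank-two DFP update of the inverse Hessian approximation.
        H = [[H[i][j] + dx[i] * dx[j] / s1 - Hdg[i] * Hdg[j] / s2
              for j in range(n)] for i in range(n)]
        g = g_new
    return x, H

# Hypothetical data: Q^{-1} = [[0.5, -0.5], [-0.5, 1]], x* = Q^{-1} b = (-1, 1.5).
Q = [[4.0, 2.0], [2.0, 2.0]]
b = [-1.0, 1.0]
x_dfp, H_dfp = dfp_quadratic(Q, b, x0=[0.0, 0.0], steps=2)
```

On this two-dimensional problem two iterations reach the minimizer and the final H coincides with Q−1, as the quasi-Newton condition on a QP suggests.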
BFGS update
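From the DFP update we have
$H_{k+1} = H_k + \dfrac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \dfrac{[H_k \Delta g^{(k)}][H_k \Delta g^{(k)}]^T}{\Delta g^{(k)T} H_k \Delta g^{(k)}}$.
The BFGS update relies on the direct approximation of the Hessian through matrix
Bk at the k-th iteration with:
$B_{k+1} = B_k + \dfrac{\Delta g^{(k)} \Delta g^{(k)T}}{\Delta g^{(k)T} \Delta x^{(k)}} - \dfrac{[B_k \Delta x^{(k)}][B_k \Delta x^{(k)}]^T}{\Delta x^{(k)T} B_k \Delta x^{(k)}}$
$H_{k+1}^{BFGS} = H_k + \left(1 + \dfrac{\Delta g^{(k)T} H_k \Delta g^{(k)}}{\Delta g^{(k)T} \Delta x^{(k)}}\right) \dfrac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \dfrac{H_k \Delta g^{(k)} \Delta x^{(k)T} + \left(H_k \Delta g^{(k)} \Delta x^{(k)T}\right)^T}{\Delta g^{(k)T} \Delta x^{(k)}}$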
Exercises 15
Least square approximation
Let A ∈ Rm,n , b ∈ Rm , m ≥ n, rk(A) = n, b ∉ R(A), thus Ax = b is
inconsistent. We address the problem: which xo minimizes
||Ax − b||2 , so that for any other x ∈ Rn the squared
difference in norm would be greater? Then xo is referred to as the least
squares solution (l.s.s.) to Ax = b.
Theorem: The unique vector xo that minimizes ||Ax − b||2 is the
solution to the system AT · A · x = AT · b, thus xo = (AT · A)−1 · AT · b
proof: assume xo = (AT · A)−1 · AT · b. Then
||Ax − b||2 = ||A(x − xo ) + (Axo − b)||2 =
[A(x − xo ) + (Axo − b)]T · [A(x − xo ) + (Axo − b)] =
||A(x − xo )||2 + ||Axo − b||2 + 2(A(x − xo ))T · (Axo − b)
By construction AT (Axo − b) = 0, so the last product is 0 and we have
||Ax − b||2 = ||A(x − xo )||2 + ||Axo − b||2 with the first term on the
RHS surely positive for x ≠ xo with rk(A) = n, which completes the
proof.
Least square approximation
Alternatively, we can derive the minimum squared error by explicitly
applying the optimality conditions from the gradient. We have
f (x) = ||Ax − b||2 = (Ax − b)T · (Ax − b), from which we derive:
∇f (x) = 2AT Ax − 2AT b = 0 with solution xo = (AT A)−1 AT b.
Rationale: consider A ∈ Rm,n , m ≥ n. The column vectors of A span
R(A), an n-dim subspace of Rm when rk(A) = n. For Ax = b to be solved we need
b ∈ R(A), which is equivalent to having Ax = b consistent (it admits a
solution). Which vector h ∈ R(A) is the closest to b? The orthogonal
projection of b onto R(A). Then we can write:
h ∈ R(A)|(h − b) ⊥ R(A), h = Axo = A(AT · A)−1 AT b
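The normal equations AT A x = AT b can be solved directly on a small example. The plain-Python sketch below is illustrative only: the helper names `solve` and `least_squares` and the data (fitting a line through three points) are assumptions, and forming AT A explicitly is done here for transparency rather than numerical robustness.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(map(float, A[i])) + [float(rhs[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def least_squares(A, b):
    """Return x_o solving the normal equations A^T A x = A^T b."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
           for i in range(n)]
    Atb = [sum(A[k][i] * b[k] for k in range(m)) for i in range(n)]
    return solve(AtA, Atb)

# Hypothetical inconsistent system (m = 3, n = 2, rk(A) = 2).
A = [[1, 0], [1, 1], [1, 2]]
b = [6, 0, 0]
x_o = least_squares(A, b)
# The residual b - A x_o is orthogonal to the columns of A.
residual = [bi - sum(aij * xj for aij, xj in zip(row, x_o))
            for row, bi in zip(A, b)]
At_r = [sum(A[k][i] * residual[k] for k in range(3)) for i in range(2)]
```

The vanishing of `At_r` is the numerical counterpart of (h − b) ⊥ R(A) with h = A xo.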
Exercises 16
Part III: Linear programming
Consider the problem of minimizing a linear cost function subject to linear
constraints with c ∈ Rn , A ∈ Rm,n , b ∈ Rm and decision vector x ∈ Rn , x ≥ 0:
minx≥0 cT x s.t. A · x = b
The above is a linear program (LP) due to the linearity of both the objective
function and the constraints. Any x satisfying the constraints is said to be
feasible: among them we seek the one that leads to a minimal cost: the
optimal solution of the problem.
The following are stylized school applications of LP:
• manufacturing problem: maximize overall production time under technical and
operational constraints, with x to represent the time allocated to each
production line.
• transportation problem: minimize transportation costs to dispatch goods and
materials across several origin-destination pairs.
• newsvendor problem: decide how many journals to buy daily to maximize
profit under demand uncertainty.
Standard form LP
In the presence of inequality constraints, an LP is brought back to standard form
(equality constraints and minimization) by introducing surplus or slack variables. With no
changes in the objective, we have:
• surplus vector: Ax − Im y = b, y ≥ 0
• slack vector: Ax + Im y = b, y ≥ 0
Basic solution: let in the general case A ∈ R(m,n) with m < n and rk(A) = m.
We wish to define the sub-matrix B ∈ R(m,m) of A whose columns are l.i., so
that A = [B|D]. We have xB = B−1 b and $x = [x_B^T, 0^T]^T$ is the basic
solution to the problem. B is referred to as the basis of the LP.
If some of the elements of the basic solution are 0, we talk of a degenerate
basic solution.
Fundamental theorem of LP
Consider a linear program in standard form. The fundamental theorem of LP
establishes that:
• If there exists a feasible solution, then there exists a basic feasible
solution.
• If there exists an optimal feasible solution, then there exists a basic
optimal feasible solution.
The following theorem relates basic solutions to extreme points of the
feasibility set:
Th: Let Ω be the convex set consisting of all feasible solutions, that is all x
s.t. Ax = b, x ≥ 0 with A ∈ R(m,n) , m < n. Then x is an extreme point of Ω
iff x is a basic feasible solution to Ax = b, x ≥ 0.
The search of an optimal solution can then be limited to extreme points.
Part III.a: Simplex
The introduction of slack variables in an LP with Ax ≤ b leads to a simple
updating rule based on elementary matrices E1 , E2 , ..., Et and the definition of
the augmented coefficient matrix A = [B, D] where basic and nonbasic
variables are considered. The following result sets the grounds for the simplex
method.
Th.: Let A ∈ Rn,n . Then A is non-singular iff there exist elementary matrices
E1 , E2 , .., Et such that Et · ... · E2 · E1 · A = I, in which case
Et · ... · E2 · E1 = A−1 .
Several implications:
1. From x∗ = A−1 b = (Et · ... · E2 · E1 )b = E b = b̃
2. [A, b] → [I, D, E b] where I = E · A ∈ Rm,m , D ∈ Rm,n−m , E b = b̃
Simplex
Let x solve Ax = b, then x = [xB , xD ], and xB + DxD = b̃ → xB = b̃ − DxD
then $x^* = [\tilde{b}^T, 0^T]^T$ is a solution of the system. We consider a system in
canonical form.
This system has basic solution: $x = (y_{10}, ..., y_{m0}, 0, ..., 0)^T = (x_B^T, 0^T)^T$.
Then b = y10 a1 + y20 a2 + ... + ym0 am in the current basis and the cost
function z = cT · xB = c1 y10 + ... + cm ym0 . We also have that any nonbasic
vector aq = y1q a1 + y2q a2 + ... + ymq am .
To evaluate the impact on the objective of a different basis, assume the qth
vector to enter then zq = c1 y1q + ... + cm ymq . For any i we denote with
ri = ci − zi the reduced cost coefficients, and through ri we can evaluate the
induced change of the cost function.
The algorithm
Theorem: A b.f.s is optimal iff the corresponding reduced cost coefficients are all
nonnegative.
The simplex algorithm:
1. Form a canonical augmented matrix corresponding to the initial b.f.s.,
2. compute rj associated to nonbasic var’s
3. If rj ≥ 0 for all j: STOP, the b.f.s. is optimal. Else
4. select q with least rq
5. If no yiq > 0 STOP: the problem is unbounded. Else $p = \arg\min_i \left\{ \dfrac{y_{i0}}{y_{iq}} \;\middle|\; y_{iq} > 0 \right\}$
6. Update the augmented matrix with pivot the (p, q) element. Go to 2.
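The six steps above can be sketched as a small tableau routine. This is an illustrative plain-Python implementation, not the course's reference code: the function name `simplex`, the use of exact fractions, the absence of any anti-cycling rule and the example data are all assumptions. The input is assumed to be already in canonical form, with the columns listed in `basis` forming an identity and b ≥ 0.

```python
from fractions import Fraction

def simplex(c, A, b, basis):
    """Minimize c^T x s.t. Ax = b, x >= 0, from a canonical starting tableau."""
    m, n = len(A), len(c)
    # Augmented matrix [A | b]; column n holds the y_i0 values.
    T = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    basis = list(basis)
    while True:
        # Step 2: reduced costs r_j = c_j - z_j, z_j = sum_i c_B(i) * y_ij.
        r = [Fraction(c[j]) - sum(Fraction(c[basis[i]]) * T[i][j]
                                  for i in range(m)) for j in range(n)]
        q = min(range(n), key=lambda j: r[j])        # step 4: least r_q
        if r[q] >= 0:
            break                                    # step 3: optimal b.f.s.
        rows = [i for i in range(m) if T[i][q] > 0]
        if not rows:
            raise ValueError("problem is unbounded")  # step 5
        p = min(rows, key=lambda i: T[i][n] / T[i][q])  # ratio test
        # Step 6: pivot on the (p, q) element.
        piv = T[p][q]
        T[p] = [v / piv for v in T[p]]
        for i in range(m):
            if i != p and T[i][q] != 0:
                f = T[i][q]
                T[i] = [vi - f * vp for vi, vp in zip(T[i], T[p])]
        basis[p] = q
    x = [Fraction(0)] * n
    for i, j in enumerate(basis):
        x[j] = T[i][n]
    return x, sum(Fraction(cj) * xj for cj, xj in zip(c, x))

# Hypothetical example: max 2 x1 + 5 x2 s.t. x1 <= 4, x2 <= 6, x1 + x2 <= 8,
# written as a minimization with slack variables x3, x4, x5.
c = [-2, -5, 0, 0, 0]
A = [[1, 0, 1, 0, 0], [0, 1, 0, 1, 0], [1, 1, 0, 0, 1]]
b = [4, 6, 8]
x_opt, z_opt = simplex(c, A, b, basis=[2, 3, 4])
```

The slack columns provide the initial basis; two pivots bring x2 and then x1 into the basis, after which all reduced costs are nonnegative.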
Exercises 17
Simplex in matrix form – the Tableau
The basis updating and final linear cost function evaluation can be effectively
interpreted as follows. Consider the block matrices:
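$\begin{bmatrix} A & b \\ c^T & 0 \end{bmatrix} = \begin{bmatrix} B & D & b \\ c_B^T & c_D^T & 0 \end{bmatrix}$ (2)
Through elementary row operations we derive the Tableau in final form (RHS
of (4)):
[I] $\begin{bmatrix} B^{-1} & 0 \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} B & D & b \\ c_B^T & c_D^T & 0 \end{bmatrix} = \begin{bmatrix} I_m & B^{-1} D & B^{-1} b \\ c_B^T & c_D^T & 0 \end{bmatrix}$ (3)
[II] $\begin{bmatrix} I_m & 0 \\ -c_B^T & 1 \end{bmatrix} \begin{bmatrix} I_m & B^{-1} D & B^{-1} b \\ c_B^T & c_D^T & 0 \end{bmatrix} = \begin{bmatrix} I_m & B^{-1} D & B^{-1} b \\ 0^T & c_D^T - c_B^T B^{-1} D & -c_B^T B^{-1} b \end{bmatrix}$ (4)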
Revised simplex
The last set of matrices in (4) shows the optimal solution of the problem
xB = B−1 b and the reduced cost vector $r_D^T = c_D^T - c_B^T B^{-1} D$. Let
$\lambda^T = c_B^T B^{-1}$, then we can write $r_D^T = c_D^T - \lambda^T D$: indeed the vector λ is the
dual vector of the DLP, which we will consider shortly.
The revised simplex develops from observing that all we need to perform
simplex iterations are the current basis with associated coefficient matrix and
the reduced coefficient vector. The updating of the Tableau and the simplex
iterations are then based on $[B^{-1} \; y_0 \; y_q]$. We have:
• If all reduced costs are non-negative, the optimal solution has been
reached.
• If a negative reduced cost is found so that no yiq > 0 then the problem
is unbounded.
• O.w. iterate until all reduced costs are non negative.
Part III.b: Duality
We introduce duality for LPs and then this will represent a key topic also for
nonlinear programs. Consider the linear program
minx≥0 cT x s.t. A · x ≥ b
where c ∈ Rn , A ∈ Rm,n , b ∈ Rm and x ∈ Rn . We refer to this formulation
as LP in primal form. The corresponding dual problem is defined by
maxλ≥0 λT b s.t. λT · A ≤ cT
in which λ ∈ Rm is referred to as the dual vector. The two problems define
the so-called symmetric form of duality.
We define the asymmetric form of duality by deriving the dual problem
associated with a primal in standard form:
Primal                Dual
minimize cT x         maximize λT b
subject to Ax = b     subject to λT A ≤ cT        (5)
x ≥ 0                 λ unrestricted
Duality
The possibility of an unrestricted (free) λ arises from the representation of the
feasibility region of the primal in standard form as Ax ≥ b and −Ax ≥ −b:
then we introduce non negative dual vectors u and v for b and −b and define
λ = u − v.
The dual as an LP problem in the symmetric form reads:
maxλ≥0 λ1 b1 + λ2 b2 + ... + λm bm
s.t. λ1 a11 + λ2 a21 + ... + λm am1 ≤ c1
λ1 a12 + λ2 a22 + ... + λm am2 ≤ c2
...
λ1 a1n + λ2 a2n + ... + λm amn ≤ cn
Exercises 18
Part III.c: Interior point methods
The rationale of Interior Point Methods (IPM) and key difference with simplex
method is that in the former the search for optimality starts in the strict
interior of the feasible region and iteration by iteration seeks a direction within
the interior to the optimum. Consider the following LP in so-called Karmarkar
(1984) canonical form (KcnclLP):
minx∈Rn {cT x| x ∈ Ω ∩ ∆}
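where Ω = N (A) = {x ∈ Rn |Ax = 0}, ∆ = {x ∈ Rn |eT x = 1, x ≥ 0}. ∆ is
a simplex whose center is $a_0 = \frac{e}{n} = [1/n, ..., 1/n]^T$. The feasible region can
be specified as $\Omega \cap \Delta = \left\{ x \in R^n \,\middle|\, \begin{bmatrix} A \\ e^T \end{bmatrix} x = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \; x \ge 0 \right\}$
Where: $A' = [A \;\; -b]$, $c' = [c^T \;\; 0]^T$, $y_i = \dfrac{x_i}{x_1 + ... + x_n + 1}$ for i = 1, .., n, and $y_{n+1} = \dfrac{1}{x_1 + ... + x_n + 1}$.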
K artificial problem: a P/D interpretation of KcnclLP
From LP duality we know that the (PLP) problem minx≥0 {cT x|Ax ≥ b} has
the same optimal value as the (DLP) maxλ≥0 {λT b|λT A ≤ cT }. We combine
them to define the set of conditions for optimality:
cT x − bT λ = 0
Ax ≥ b
AT λ ≤ c
x≥0 λ≥0
from which, by including slack and surplus vectors v, u and a set of vector
values (x0 , λ0 , u0 , v0 ) ≥ 0 we define the Karmarkar artificial problem
(KartLP)
minimize z
subject to cT x − bT λ + (−cT x0 + bT λ0 ) · z = 0
Ax − v + (b − Ax0 + v0 ) · z = b (6)
AT λ + u + (c − AT λ0 ) · z = c
x, λ, u, v, z ≥ 0
The algorithm
KartLP leads to the set of linear equations on x, λ, u, v for optimality:
cT x − bT λ = 0, Ax − v = b, AT λ + u = c, (x, λ, u, v) ≥ 0.
This problem and set of conditions are fully equivalent to the original KcnclLP
in restricted form and the conditions are used to follow a central path to
optimality.
Steps:
1. Initialize: k = 0, x(0) = a0 = e/n, set q
2. set x(k+1) = Ψ(x(k) )
3. if $\dfrac{c^T x^{(k)}}{c^T x^{(0)}} \le 2^{-q}$ STOP, else
4. k = k + 1 and return to 2.
We focus on the Ψ update, which employs the descent direction within the
feasible region towards the optimum at 0 through a set of orthogonal
projections.
The update Ψ
Consider k = 1:
• x(1) = x(0) + αd(0) , α ∈ (0, 1).
• Let −c be the max rate of decrease of the obj: select d(0) = −r ĉ(0) as
the direction of the orthogonal projection of −c onto N (B0 ),
$B_0 = \begin{bmatrix} A \\ e^T \end{bmatrix}$: $\hat{c}^{(0)} = \dfrac{P_0 c}{\|P_0 c\|}$, $r = \dfrac{1}{\sqrt{n(n-1)}}$.
As k = 2, .. the projection is updated relying on $D_k = \mathrm{diag}\{x_i^{(k)}\}$, i = 1, .., n,
Exercises 19
Part IV: Constrained nonlinear programming
We consider minimization (maximization) problems specified as follows:
min f (x) s.t. h(x) = 0, g(x) ≤ 0
where: x ∈ Rn , f : Rn → R, h : Rn → Rm , g : Rn → Rp .
This will be further qualified but in absence of additional details, f (x) is
continuously differentiable once, f ∈ C1 (.), or at least twice, f ∈ C2 (.).
The constraints are also assumed continuous and differentiable and
h = (h1 (x) h2 (x) ...hm (x))T defines m equality constraints, each one
continuous and differentiable. While g = (g1 (x) g2 (x) ...gp (x))T defines p
inequality constraints, each one continuous and differentiable.
The Jacobian matrix associated with h is defined as an (m, n) matrix with
rows the gradients ∇hi (x) transpose, for i = 1, 2, .., m.
We focus initially on nonlinear problems with equality constraints.
Part IV.a: NLN optimization with equality constraints
Prior to the explanation of the optimization methods to be adopted in this
context, we need several definitions to frame properly the topic. We define:
1. A point x∗ for which hi (x∗ ) = 0 for all i = 1, .., m is said to be a regular
point of the constraints if the gradient vectors
∇h1 (x∗ ), ∇h2 (x∗ ), ..., ∇hm (x∗ ) are l.i.
2. A surface in Rn as the set S = {x ∈ Rn |h1 (x) = 0, ..., hm (x) = 0}.
3. A curve C on S as the set of points
{x(t) ∈ S, t ∈ (a, b)}, with x : (a, b) → S, x(t) a continuous function. This
curve C is differentiable if for all t there exists
$x'(t) = \dfrac{dx(t)}{dt} = (x_1'(t) \; x_2'(t) \; ... \; x_n'(t))^T$. It is twice differentiable if there
exists $x''(t) = \dfrac{d^2 x(t)}{dt^2} = (x_1''(t) \; x_2''(t) \; ... \; x_n''(t))^T$.
Tangent and Normal spaces
We also need the following definitions:
1. The tangent space at x∗ on S = {x ∈ Rn |h(x) = 0} is the set
T (x∗ ) = {y|Dh(x∗ ) · y = 0}, thus this is the Null space of the
differential N (Dh(x∗ )).
2. We distinguish T (x∗ ) from the tangent plane at x∗ , namely
TP(x∗ ) = T (x∗ ) + x∗ .
3. The normal space at x∗ on the surface S is the set
N(x∗ ) = {x ∈ Rn |x = Dh(x∗ )T · z, z ∈ Rm }, or range R(Dh(x∗ )T ). It is
the subspace of Rn spanned by the gradients:
$x = \sum_j z_j \nabla h_j(x^*)$, zj ∈ R, j = 1, .., m.
Ax∗ = AQ −1 AT λ∗ = b
Exercises 20
Part IV.b: NLN optimization with inequality
constraints
Let’s go back to the nonlinearly constrained problem introduced at the
beginning of this section with both equality and inequality constraints:
min f (x) s.t. h(x) = 0, g(x) ≤ 0
where: x ∈ Rn , f : Rn → R, h : Rn → Rm , g : Rn → Rp .
We generalize the concept of a regular point in this new setting, by
distinguishing active versus inactive constraints in case of inequality.
Let x∗ satisfy h(x∗ ) = 0 and g(x∗ ) ≤ 0: we denote with
J(x∗ ) := {j|gj (x∗ ) = 0}.
Then x∗ is a regular point if ∇hi (x∗ ) and ∇gj (x∗ ) are l.i. for all i and
j ∈ J(x∗ ).
We specify the Lagrange function in this case as:
L(x, λ, µ) = f (x) + λT h(x) + µT g(x), expecting λ to be free and µ non
negative.
Karush-Kuhn-Tucker (KKT) conditions
The KKT theorem provides FONC for a local optimum.
Karush-Kuhn-Tucker theorem: Let f , g, h ∈ C 1 and x∗ be a regular point
which minimizes f under h(x∗ ) = 0 and g(x∗ ) ≤ 0. Then there exist a
λ∗ ∈ Rm and a µ∗ ∈ Rp such that:
1. µ∗ ≥ 0
2. $Df(x^*) + \lambda^{*T} Dh(x^*) + \mu^{*T} Dg(x^*) = 0^T$
3. $\mu^{*T} g(x^*) = 0$.
λ∗ is referred to as the vector of Lagrange multipliers, µ∗ as the vector of
KKT multipliers.
Remarks:
• The KKT multipliers are 0 for non binding constraints and non-negative
for active constraints at the optimum.
• The optimality condition 2 then implies, given x∗ regular, that the
gradient of f is a l.c. of the gradients of the m equality and p inequality
constraints, with weights given by the multipliers.
KKT conditions
We refer to 1., 2., 3., and h(x∗ ) = 0, g(x∗ ) ≤ 0 all together as KKT
necessary conditions for a local optimum under equality and inequality
constraints.
• If we are maximizing, rather than minimizing the KKT conditions do not
change but we are now considering a Lagrange function specified as
L(x, λ, µ) = f (x) − λT h(x) − µT g(x)
• If KKT multipliers are associated with the constraints g(x∗ ) ≥ 0 instead
of ≤ then the last condition needs to change: g(x∗ ) ≥ 0 and the KKT
multipliers are non-positive µ∗ ≤ 0.
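The KKT conditions can be verified numerically at a candidate point. The following plain-Python sketch covers inequality constraints g(x) ≤ 0 only; the function name `kkt_check`, the tolerance and the example problem (minimize (x1 − 2)² + (x2 − 1)² subject to x1 + x2 − 2 ≤ 0 and −x1 ≤ 0) are assumptions made for illustration.

```python
def kkt_check(grad_f, grads_g, g_vals, mu, tol=1e-9):
    """Check conditions 1-3 plus feasibility for constraints g(x) <= 0.

    grad_f  : gradient of f at the candidate point
    grads_g : list of gradients of g_j at the candidate point
    g_vals  : list of values g_j at the candidate point
    mu      : candidate KKT multipliers
    """
    n = len(grad_f)
    dual_feasible = all(m >= -tol for m in mu)                       # 1.
    stationarity = all(
        abs(grad_f[i] + sum(m * gg[i] for m, gg in zip(mu, grads_g))) <= tol
        for i in range(n))                                           # 2.
    complementarity = abs(sum(m * gv for m, gv in zip(mu, g_vals))) <= tol  # 3.
    primal_feasible = all(gv <= tol for gv in g_vals)
    return dual_feasible and stationarity and complementarity and primal_feasible

# Candidate x* = (1.5, 0.5): the first constraint is active, the second is not.
x = (1.5, 0.5)
grad_f = [2 * (x[0] - 2), 2 * (x[1] - 1)]     # (-1, -1)
grads_g = [[1.0, 1.0], [-1.0, 0.0]]
g_vals = [x[0] + x[1] - 2, -x[0]]             # (0, -1.5)
ok_at_candidate = kkt_check(grad_f, grads_g, g_vals, mu=[1.0, 0.0])

# The unconstrained minimizer (2, 1) violates g1 <= 0, so the check fails there.
ok_at_unconstrained = kkt_check([0.0, 0.0], grads_g, [1.0, -2.0], mu=[0.0, 0.0])
```

The multiplier of the inactive constraint is 0 and that of the active one is positive, in line with the first remark on the KKT multipliers.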
Second order conditions
Let’s now consider second order necessary and sufficient conditions for an
optimum in Rn under equality and inequality constraints.
Theorem (SONC): Let x∗ be a local minimizer of f : Rn → R subject to
h(x) = 0, g(x) ≤ 0, h : Rn → Rm , m ≤ n, g : Rn → Rp , f , h, g ∈ C2 and x∗
regular. Then there exist a λ∗ ∈ Rm and a µ∗ ∈ Rp such that:
1. µ∗ ≥ 0, $Df(x^*) + \lambda^{*T} Dh(x^*) + \mu^{*T} Dg(x^*) = 0^T$, $\mu^{*T} g(x^*) = 0$ and
2. For all y ∈ T (x∗ ), $y^T L(x^*, \lambda^*, \mu^*) y \ge 0$.
We have the first-order stationary conditions and the second order p.s.d.
condition for a local minimum, now restricted to the tangent space
T (x ∗ ) ⊂ Rn . Namely:
• T (x∗ ) := {y ∈ Rn |Dh(x∗ )y = 0, Dgj (x∗ )y = 0, j ∈ J(x∗ )} and
• $L(x^*, \lambda^*, \mu^*) = F(x^*) + \sum_k \lambda_k^* H_k(x^*) + \sum_{j \in J(x^*)} \mu_j^* G_j(x^*)$ where
F, Hk and Gj denote the Hessians of f , hk and gj .
Exercises 21
Part IV.c: Convex optimization
Convex optimization problems are of particular relevance due to their unique
properties and relevant application domains. We introduce few definitions.
• The graph of f : Ω → R, Ω ⊂ Rn is the set of points in Ω × R ⊂ Rn+1
given by $\{[x^T \; f(x)]^T, x \in \Omega\}$.
• The epigraph is denoted by $epi(f) = \{[x^T \; \beta]^T : x \in \Omega, \beta \in R, \beta \ge f(x)\}$
• a function f : Ω → R is convex on Ω if its epigraph is convex. If f is
convex on Ω then Ω is convex.
• Jensen inequality in Rn is a N.S.C. for convexity and strict convexity.
• f , f1 , f2 convex imply, for α ≥ 0, αf convex and f1 + f2 convex.
• f : Ω → R, f ∈ C1 defined on an open convex set Ω ⊂ Rn , then f is
convex iff for all x, y ∈ Ω: f (y) ≥ f (x) + ∇f (x)T (y − x).
Convexity: FOC are sufficient
Here come the key results:
• f : Ω → R, f ∈ C 2 defined on an open convex set Ω ⊂ Rn . Then f is
convex iff for each x ∈ Ω the Hessian matrix of f at x is positive
semi-definite. This can be easily extended to closed convex sets.
• Let f : Ω → R be a convex function f ∈ C1 defined on a convex set.
Suppose x∗ ∈ Ω is such that for all x ∈ Ω, x ≠ x∗ : ∇f (x∗ )T (x − x∗ ) ≥ 0,
then x∗ is a global minimizer of f over Ω. Under the same assumptions,
the same holds if for any feasible direction d at x∗ , dT ∇f (x∗ ) ≥ 0. FONC
are sufficient for global optimality if the problem is convex.
• Similarly under f ∈ C1 convex over a convex set
Ω = {x ∈ Rn |h(x) = 0} with h : Rn → Rm , h ∈ C1 , then x∗ is a global
minimizer if $Df(x^*) + \lambda^{*T} Dh(x^*) = 0^T$. (Lagrange theorem)
• Global optimality extends naturally to KKT conditions.
Examples
We have gone through several examples and problems during the course to be
classified as convex programs. Notice that concavity of f leads to convexity of
−f and all the above results apply.
• spaces Rn for n = 1, ..., are all convex domains, half spaces are also
convex. Intersections of convex sets are convex; unions, in general, are not.
• Quadratic programs xT Qx are convex on Ω ⊂ Rn iff for any x, y ∈ Ω,
(x − y)T Q(x − y) ≥ 0: with Q p.s.d., minx {xT Qx| Ax = b} is a convex program.
• For f : R → R you need convexity to apply 1d search methods. If f ∈ C 1
then if f 0 (x) is increasing from negative to positive over the domain, f is
convex.
• In Rn if the feasible region is convex and f (x) is convex then the
problem is convex. Again here you can study convexity through the
partial derivatives $\dfrac{\partial f}{\partial x_j}$ and the directional derivatives $\langle \nabla f(x), d \rangle$.
• Saddle points are the primary causes of lack of convexity.
MATH 412 OPTIMIZATION
END