An Optimization Primer
AMS Classification:
Date:
PROLOGUE
The primordial objective of these lectures is to prepare the reader to deal
with a wide variety of applications that include optimal allocation of (limited) resources, finding best estimates or best fits, etc. Optimization problems of this type arise in almost all areas of human industry: engineering,
economics, agriculture, logistics, ecology, finance, information and communication technology, and so on. Because the solution may have to respond
to an evolutionary system, in time and/or space, or must take into account
uncertainty about some of the problem's data, we usually end up having to solve large-scale optimization problems, and this, to a large extent,
conditions our overall approach. Consequently, the layout of the material
doesn't follow the pattern of more traditional introductory optimization
textbooks.
The main thrust won't be on the detailed analysis of specific algorithms,
but on setting up the tools to deal with these large-scale applications. This
doesn't mean that, eventually, we won't describe, justify and even establish
convergence of some basic algorithmic procedures. To achieve our goals, we
proceed, more or less, on three parallel tracks:
(i) modeling,
(ii) theoretical foundations that will allow us to analyze the properties of
solutions as well as hone our modeling skills to help us build stable,
easier to solve optimization problems, and
(iii) some numerical experimentation that will highlight some of the difficulties inherent in numerical implementation, but that mostly illustrates the
use of elementary algorithmic procedures as building blocks of more
sophisticated solution schemes.
The lectures are designed to serve as an introduction to the field of optimization for students who have a background roughly equivalent to a bachelor's degree in science, engineering or mathematics. More specifically, it's
expected that the reader has a good foundation in Differential Calculus,
Linear Algebra, and is familiar with the abstract notion of function. The
presentation also includes an introduction to a plain version of Probability
that will enable the non-initiated reader to follow the sections dealing with
stochastic programming.
A novel feature of this book is that decision making under uncertainty models are an integral part of the exposition. There are two basic reasons
for this. Stochastic optimization motivates the study of linear and non-linear
optimization, large-scale optimization, non-differentiable functions and variational analysis. But, more significantly, given our concern with intelligent
modeling, one is bound to realize that very few important decision problems
do not involve some level of uncertainty about some of their parameters. It's
thus imperative that, from the outset, the reader be aware of the potential
pitfalls of simplifying a model so as to skirt uncertainty. Nonetheless, it's
possible to skip the chapters or sections dealing with stochastic programming
without compromising the continuity of the presentation, but not without being shortchanged on the insight that comes from including uncertainty in the
formulation of optimization models.
There are 16 chapters, each one corresponding to what could be covered
in about a week's lectures (three to four hours). Proofs, constructive in nature whenever possible, have been provided so that (i) an instructor doesn't
have to go through them in meticulous detail but can limit the discussion
to the main idea(s) accompanied with some relevant examples and counterexamples, and (ii) the argumentation can serve as a guide to solving
the exercises. The theoretical side comes with almost no compromises, but
there are a few rare exceptions that would have required lengthy mathematical detours that are not germane to the subject at hand, and are more
appropriately dealt with in other texts or lectures.
Numerical software
As already mentioned, although we end up describing a significant
number of algorithmic procedures, we don't concern ourselves directly with
implementation issues. These are best dealt with in specialized courses and textbooks such as [16, 10, 20]. For example, in the case of linear programming,
there is a description of both the simplex and the interior point methods
in Chapters 7 and 6, but from the outset it's assumed that packages to
solve mathematical programs, of various types, including linear programs,
are available (CPLEX, IBM Solutions, LOQO, ...). To allow for some experimentation with these solution procedures, it's assumed that the reader
has access to Matlab, in particular to the functions found in the Matlab
Optimization Toolbox, and will be able to use them to solve the numerical
exercises. These Matlab functionalities were used to solve the examples, and
in a number of instances, the corresponding m-file has been supplied.
Contents

1 PRELUDE
1.1 Mathematical curtain rise
1.2 Curve fitting I
1.3 Steepest Descent and Newton methods
1.4 The Quasi-Newton methods
1.5 Integral functionals
1.6 In conclusion

2 FORMULATION
2.1 A product mix problem
2.2 Curve fitting II
2.3 A network capacity expansion problem
2.4 Discrete decision variables
2.5 The Broadway producer problem

3 PRELIMINARIES
3.1 Variational analysis I
3.2 Variational analysis II
3.3 Plain probability distributions
3.4 Expectation functionals I
3.5 Analysis of the producer's problem

4 LINEAR CONSTRAINTS
4.1 Linearly constrained programs
4.2 Variational analysis III
4.3 Variational analysis IV
4.4 Lagrange multipliers
4.5 Karush-Kuhn-Tucker conditions I

6 LAGRANGIANS
6.1 Saddle functions
6.2 Primal and Dual problems
6.3 A primal-dual interior-point method
6.4 Monitoring functions
6.5 Lake Stoopt I
6.6 Separable simple recourse II
6.7 The Lagrangian finite generation method
6.8 Lake Stoopt II

7 POLYHEDRAL CONVEXITY
7.1 Polyhedral sets
7.2 Full duality: linear programs
7.3 Variational analysis V
7.4 The simplex method

9 LINEAR RECOURSE
9.1 Fixed recourse and fixed costs
9.2 A Manufacturing model
9.3 Feasibility
9.4 Stochastic linear programs with recourse
9.5 Optimality conditions
9.6 Network capacity expansion II
9.7 Preprocessing
9.8 A summary
9.9 Practical probability II
9.10 Expectation functionals II
9.11 Disintegration Principle
9.12 Stochastic programming: duality

10 DECOMPOSITION
10.1 Lagrangian relaxation
10.2 Sequential linear programming
10.3 The L-shaped method
10.4 Dantzig-Wolfe decomposition
10.5 An optimal control problem
10.6 A targeting problem
10.7 Linear-quadratic control models
10.8 A hydro-power generation problem

11 APPROXIMATION THEORY
11.1 Formulation
11.2 Epi-convergence
11.3 Barrier & Penalty methods, exact?
11.4 Infinite dimensional theory
11.5 Approximation of control problems
11.6 Approximation of stochastic programs
11.7 Approximation of statistical estimation problems
11.8 Augmented Lagrangians
11.9 Variational Analysis ??
11.10 Proximal point algorithm
11.11 Method of multipliers: equalities
11.12 Method of multipliers: inequalities
11.13 Application to engineering design

12 NONLINEAR OPTIMIZATION
12.1 Statistical estimation: An introduction
12.1.1 The discrete case
12.2 Statistical estimation: parametric
12.3 Statistical estimation: non-parametric
12.4 Non-convex optimization
12.5 KKT-optimality conditions
12.6 Sequential quadratic programming
12.7 Trust regions

13 EQUILIBRIUM PROBLEMS
13.1 Convex-type equilibrium problems
13.2 Variational inequalities
13.3 Monotone Operators
13.4 Complementarity problem
13.5 Application in Mechanics
13.6 Pricing an American option
13.7 Market Equilibrium: Walras
13.8 Application to traffic, transportation
13.9 Non-cooperative games: Nash
13.10 Energy? Communications pricing

14 NON-DIFFERENTIABLE OPTIMIZATION
14.1 Bundle method
14.2 Example of the bundle method
14.3 Stochastic quasi-gradient method
14.4 Application: urn problem
14.5 Sampled gradient (Burke, Lewis & Overton)
14.6 Eigenvalues calculations

15 DYNAMIC PROBLEMS
15.1 Optimal control problems
15.2 Hamiltonian, Pontryagin's
15.3 Polak's minmax approach
15.4 Multistage stochastic programs
15.5
15.6
15.7
15.8
Chapter 1
PRELUDE
"How simple and clear this is," thought Pierre. "How could I not have known
this before." (War and Peace, Leo Tolstoy)
1.1 Mathematical curtain rise
The modern theory of optimization starts in the middle of the 14th Century
with Nicholas Oresme (1323-1382), part-time mathematician and full-time
Bishop of Lisieux (France). In his treatise [15], he remarks that near
a minimum, the increment of a variable quantity becomes 0. A present-day
formulation of this observation reads:

    Oresme Rule:  x* ∈ argmin f  =⇒  df(x*; w) = 0 for all w,

where df(x*; w) denotes the derivative of f at x* in the direction w.
About three centuries later, Pierre de Fermat (1601-1665), another part-time mathematician and full-time ??-lawyer at the Royal Court in Toulouse
(France), while working on the long-standing 'tangent problem', observed that
for x* to be a minimizer of a function f, the tangent to the graph of the
function f at the point (x*, f(x*)) must be parallel to the x-axis. In the
notation of Differential Calculus², one would express this as

    Fermat Rule:  x* ∈ argmin f  =⇒  f′(x*) = 0,

where

    f′(x̄) := lim_{τ→0} (1/τ)[ f(x̄ + τ) − f(x̄) ]

(called the Gâteaux derivative when it's necessary to distinguish it from some alternative definitions of derivative)

²whose development can be viewed as a continuation and a formalization of Fermat's
work on the tangent problem
[Figure: the tangent to the graph of f at (x*, f(x*)), parallel to the x-axis]
is the slope of the tangent at x̄. Implicit in the formulation of these optimality criteria is the assumption: f is smooth, i.e., continuously differentiable;
in those days, only smooth functions were considered to be of any interest.
And for smooth functions, one has

    ∀w ∈ IR:  df(x̄; w) = f′(x̄)w.

For smooth functions defined on IRⁿ, the gradient

    ∇f(x̄) = ( ∂f/∂x₁(x̄), ∂f/∂x₂(x̄), ..., ∂f/∂xₙ(x̄) ),

with the partial derivatives defined by, for i = 1, ..., n,

    ∂f/∂xᵢ(x̄) := lim_{τ→0} (1/τ)[ f(x̄ + τeᵢ) − f(x̄) ] = df(x̄; eᵢ),

yields

    df(x̄; w) = ⟨∇f(x̄), w⟩ = Σ_{j=1}^n ∂f/∂xⱼ(x̄) wⱼ,

and so, also for smooth functions defined on IRⁿ, one can derive Fermat's
rule from Oresme's rule and vice versa. However, the fact that one can rely
on either one of these rules to check for optimality turns out to be quite
efficient.
1.2 Curve fitting I
With

    Z = | z₁ⁿ  ...  z₁  1 |
        | z₂ⁿ  ...  z₂  1 |
        |  ⋮   ...  ⋮   ⋮ |
        | z_Lⁿ ...  z_L 1 |

and

    y = ( h(z₁), h(z₂), ..., h(z_L) ),

the problem of fitting a polynomial of degree n, with coefficient vector
a ∈ IRⁿ⁺¹, to the data points amounts to minimizing the sum of the squared errors,
or still, min_{a∈IRⁿ⁺¹} f(a) = |Za − y|², i.e., the least squares solution will minimize
the square of the norm of the error. Applying Fermat's rule, we see that the
minimizer(s) a* must satisfy

    ∇f(a*) = 2Zᵀ(Za* − y) = 0,

or equivalently, the so-called normal equation,

    ZᵀZ a* = Zᵀ y.

If we assume, as might be expected, that n + 1 ≤ L, and recalling that
the points z₁, ..., z_L are distinct, the columns of the matrix Z are linearly
independent. Hence, ZᵀZ is invertible and

    a* = (ZᵀZ)⁻¹ Zᵀ y.
This is the solution calculated by the Matlab-function polyfit. In Figure
1.3, a 5th degree polynomial has been fitted to the given data points; the
plot has been obtained by polyval, another Matlab-function.
[Figure 1.3: a 5th degree polynomial fitted to the given data points]
1.1 Exercise (polyfit and polyval functions). Let x = (0, 0.05, ..., 0.95, 1)
and y = (0.95, 0.23, 0.61, 0.49, 0.89, 0.76, 0.46, 0.02, 0.82, 0.44, 0.62, 0.79, 0.92,
0.74, 0.18, 0.41, 0.94, 0.92, 0.41, 0.89, 0.06); use the function polyfit to obtain
a polynomial fit of degree n = 1, 2, ..., 11 and n = 21. For each n, calculate
the mean square error and plot your results so that you can visually see the
fit and the graph of the polynomial.
Guide. For given n, let p = polyfit(x, y, n) be the coefficients of the polynomial of degree n. To graph the resulting polynomial and check the fit, use the
command plot(z, polyval(p, z), x, y, 'xm') with z = (-0.2 : 0.001 : 1.2)⁴.
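A minimal sketch of the whole experiment, assuming the vectors x and y above have been entered; the degree list, error measure and plotting commands follow the Guide:

    % Exercise 1.1: polynomial fits of increasing degree (a sketch)
    for n = [1:11 21]
        p = polyfit(x, y, n);                  % coefficients of the degree-n fit
        mse = mean((polyval(p, x) - y).^2);    % mean square error at the data points
        fprintf('n = %2d:  mse = %8.5f\n', n, mse);
        z = -0.2:0.001:1.2;
        plot(z, polyval(p, z), x, y, 'xm'); title(sprintf('degree n = %d', n));
        pause                                  % inspect each fit before moving on
    end %for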
1.3 Steepest Descent and Newton methods
When we applied Fermat's rule to the polynomial fit problem, the minimizer
could be found by solving a system of linear equations, and there are very
efficient procedures available to solve (n × n)-linear systems. But, in general, the function to be minimized won't be quadratic, and consequently, the
Fermat rule may result in a system of n nonlinear equations. For
example, when

    f(x₁, x₂) = (x₂ − x₁²)² + (1 − x₁)²,

Fermat's rule yields

    2x₁³ − 2x₂x₁ + x₁ = 1,
    −x₁² + x₂ = 0.

And that system is not any easier to solve than minimizing f. In fact, procedures for solving nonlinear equations and for minimizing functions on IRⁿ go hand
in hand⁵. In this section, and the next, we outline algorithmic procedures
to find a point x* that satisfies Fermat's rule. Such points will minimize the
function f, at least locally, when the function f is locally convex; convexity
is dealt with in Chapter 3, and locally convex means that f is convex in a
neighborhood of x*, say on a ball IB(x*, ρ) with ρ > 0.
⁴In a Matlab figure, use the Export functionality to obtain printable files, for example,
EPS-files.
⁵Roughly speaking, a system of n equations, linear or nonlinear, in n variables can be
thought of as the gradient of some nonlinear function defined on IRⁿ.
[Figure 1.4: Local and global minimizers]
The first step in the design of algorithmic procedures to minimize a function f is to identify directions of descent.
1.3 Lemma (direction of descent). Let f: IRⁿ → IR be smooth. Whenever
df(x̄; d) < 0, d is a direction of descent for f at x̄, i.e.,

    ∃ λ̄ > 0 such that ∀λ ∈ (0, λ̄):  f(x̄ + λd) < f(x̄).

Because f is smooth at x̄, every d ∈ IRⁿ such that ⟨∇f(x̄), d⟩ < 0 is a
direction of descent at x̄.

Proof. If df(x̄; d) < 0, the definition of the derivative immediately implies
that for some λ̄ > 0, f(x̄ + λd) − f(x̄) < 0 for all λ ∈ (0, λ̄). The assertion
involving the gradient simply follows from df(x̄; d) = ⟨∇f(x̄), d⟩, when f is
smooth.
Steepest Descent Method.
Step 0. x⁰ ∈ IRⁿ, ν := 0.
Step 1. Stop if ∇f(x^ν) = 0; otherwise, d^ν := −∇f(x^ν).
Step 2. λ^ν := argmin_{λ≥0} [ f(x^ν + λd^ν) − f(x^ν) ].
Step 3. x^{ν+1} := x^ν + λ^ν d^ν, ν := ν + 1, go to Step 1.
As stated, the method presupposes that the exact minimizer in Step 2 can be computed.
That's an unrealistic premise. The best one can hope for is an approximating
minimizer. When implementing this algorithm, the step size in Step 2 is
commonly calculated as follows: for parameters α, β ∈ (0, 1) selected at the
outset in Step 0,

A-Step 2. λ^ν := max_{k=0,1,...} { β^k | f(x^ν + β^k d^ν) − f(x^ν) ≤ −α |d^ν|² β^k }.

One then refers to λ^ν as the Armijo step size; β^k is β to the power k, so,
in particular, β⁰ = 1. The convergence proof can easily be adjusted to cover
this choice of step size.
[Figure: selecting the Armijo step size on the graph of λ ↦ f(x^ν + λd^ν) − f(x^ν)]
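The method, with the Armijo step size, translates almost line for line into Matlab. The following is a sketch, not the book's implementation; f and g are handles for the function and its gradient, and alpha, beta, tol are the user's choices:

    function x = steepdesc(f, g, x, alpha, beta, tol)
    % Steepest Descent with the Armijo step size (a sketch)
    while norm(g(x)) > tol                 % surrogate for the test grad f = 0
        d = -g(x);                         % steepest descent direction
        k = 0;                             % find the largest beta^k passing A-Step 2
        while f(x + beta^k*d) - f(x) > -alpha*norm(d)^2*beta^k
            k = k + 1;
        end
        x = x + beta^k*d;                  % take the step
    end

For the example f(x₁, x₂) = (x₂ − x₁²)² + (1 − x₁)² above, one could call

    f = @(x) (x(2) - x(1)^2)^2 + (1 - x(1))^2;
    g = @(x) [2*(2*x(1)^3 - 2*x(1)*x(2) + x(1) - 1); 2*(x(2) - x(1)^2)];
    x = steepdesc(f, g, [-1.2; 1], 1e-4, 0.5, 1e-6);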
The Newton method replaces the steepest descent direction by the Newton
direction

    d^ν = −[∇²f(x^ν)]⁻¹ ∇f(x^ν),

which leads to:

Newton Method.
Step 0. x⁰ ∈ IRⁿ, ν := 0.
Step 1. Stop if ∇f(x^ν) = 0; otherwise, d^ν := −[∇²f(x^ν)]⁻¹ ∇f(x^ν).
Step 2. λ^ν := argmin_{λ≥0} [ f(x^ν + λd^ν) − f(x^ν) ].
Step 3. x^{ν+1} := x^ν + λ^ν d^ν, ν := ν + 1, go to Step 1.
The convergence proof for the Newton method is essentially the same as
that for the Steepest Descent method; the only difference is that a different
descent direction is calculated in Step 1. And when implementing the Newton
method, one would again replace Step 2 by A-Step 2, i.e., determine λ^ν by
calculating the Armijo step size.

What really makes the Newton method very attractive is that it comes
with particularly desirable local convergence properties! The proof that follows is for the classical version of Newton's method that doesn't include a
line minimization step, i.e., with λ^ν ≡ 1.
⁸A matrix C is positive definite if ⟨x, Cx⟩ > 0 for all x ≠ 0. When the Hessian at x̄ of
a twice continuously differentiable function f: IRⁿ → IR is positive definite, f is strictly
convex on a neighborhood of x̄, cf. 3.15. We kick off our study of convexity in Chapter 3.
[Figure: the Newton direction compared with the steepest descent direction on the level sets of f]
Theorem (local quadratic convergence). Suppose that, near the minimizer x̄,
the Hessian ∇²f is Lipschitz continuous and, for some l > 0, ⟨z, ∇²f(x)z⟩ ≥ l|z|²
for all z ∈ IRⁿ. Then, there is a δ > 0 such that if the Newton's method is started at a point
x⁰ ∈ IB(x̄, δ), the iterates {x^ν}_{ν∈IN} will converge quadratically to x̄, i.e.,

    0 ≤ lim sup_ν |x^{ν+1} − x̄| / |x^ν − x̄|² < ∞.
Proof. From the assumptions follows the existence of β > 0 and λ < ∞
such that

    ‖∇²f(x′) − ∇²f(x)‖ ≤ λ|x′ − x|  ∀x, x′ ∈ IB(x̄, β),
    ⟨z, ∇²f(x)z⟩ ≥ l|z|²  ∀x ∈ IB(x̄, β), z ∈ IRⁿ,

where ‖A‖ = ( Σ_{i,j} a_{ij}² )^{1/2} denotes the norm of the matrix A. For
x^ν ∈ IB(x̄, β) and with z^ν = x^ν − x̄,

    ∇²f(x^ν)(x^{ν+1} − x^ν) = −∇f(x^ν) + ∇f(x̄) = −[ ∫₀¹ ∇²f(x^ν − t z^ν) dt ] z^ν,

and hence

    ∇²f(x^ν)(x^{ν+1} − x̄) = ∇²f(x^ν) z^ν − [ ∫₀¹ ∇²f(x^ν − t z^ν) dt ] z^ν.

Since |x^{ν+1} − x̄| ≤ (1/l)|∇²f(x^ν)(x^{ν+1} − x̄)| and |∫ g(t) dt| ≤ ∫ |g(t)| dt,

    |x^{ν+1} − x̄| ≤ (1/l) |z^ν| ∫₀¹ ‖∇²f(x^ν) − ∇²f(x^ν − t z^ν)‖ dt ≤ (λ/2l) |z^ν|²,

and quadratic convergence follows as soon as δ ≤ min{β, l/λ}, so that the
iterates remain in IB(x̄, δ).
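The quadratic decay of the error is easy to observe numerically. A sketch, for the example f(x₁, x₂) = (x₂ − x₁²)² + (1 − x₁)² of this section, whose Hessian at the minimizer x̄ = (1, 1) is positive definite; the errors |x^ν − x̄| are roughly squared at each classical Newton step (λ^ν ≡ 1):

    % classical Newton method on f(x) = (x2 - x1^2)^2 + (1 - x1)^2
    g = @(x) [2*(2*x(1)^3 - 2*x(1)*x(2) + x(1) - 1); 2*(x(2) - x(1)^2)];
    H = @(x) [12*x(1)^2 - 4*x(2) + 2, -4*x(1); -4*x(1), 2];
    x = [0.8; 0.5];                          % starting point near xbar = (1,1)
    for nu = 1:6
        x = x - H(x) \ g(x);                 % Newton step with lambda = 1
        fprintf('nu = %d:  |x - xbar| = %e\n', nu, norm(x - [1; 1]));
    end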
1.6 Exercise (Rosenbrock function). Apply the Steepest Descent method, with the Armijo step size, to the Rosenbrock function f(x₁, x₂) = 100(x₂ − x₁²)² + (1 − x₁)².

Guide. The Rosenbrock function has boomerang-shaped level sets. The
minimum occurs at (1, 1). Starting at x⁰ = (−1.2, 1), the Steepest Descent
method may require as many as 1000 steps to reach the neighborhood of
the solution. As can be observed, the method of Steepest Descent can not
only be quite inefficient but, due to numerical round-off, it might even get
stuck at a non-optimal point.
The Newton method, like the steepest descent method, can be viewed as
a procedure to find a solution to the system of n equations:

    ∂f/∂x₁(x) = 0,  ∂f/∂x₂(x) = 0, ...,  ∂f/∂xₙ(x) = 0.

Because these functions x ↦ ∂f/∂xⱼ(x) have, in principle, no preassigned properties, one could have described Newton's method as one for solving a system
of n non-linear equations in n unknowns, say

    G(x) = ( G₁(x), ..., Gₙ(x) ) = ( 0, ..., 0 ),

with

    ∇G(x) = [ ∂Gᵢ/∂xⱼ(x) ]_{i,j=1}^n.
1.4 The Quasi-Newton methods

Quasi-Newton Method¹⁰.
Step 0. x⁰ ∈ IRⁿ, ν := 0, pick B⁰ (= I, for example).
Step 1. Stop if ∇f(x^ν) = 0;
        otherwise, choose d^ν such that B^ν d^ν = −∇f(x^ν).
Step 2. λ^ν := argmin_{λ≥0} [ f(x^ν + λd^ν) − f(x^ν) ].
The update U^ν = B^{ν+1} − B^ν is required to satisfy the (secant) condition

    U^ν s^ν = c^ν − B^ν s^ν,

where s^ν = x^{ν+1} − x^ν and c^ν = ∇f(x^{ν+1}) − ∇f(x^ν). Looking for a
rank-one update U^ν = u ⊗ v, where

    u ⊗ v = | u₁v₁ u₁v₂ ... u₁vₙ |
            | u₂v₁ u₂v₂ ... u₂vₙ |
            | .................. |
            | uₙv₁ uₙv₂ ... uₙvₙ |

is the outer product of the vectors u and v, we must have

    [ u ⊗ v ] s^ν = ⟨v, s^ν⟩ u = c^ν − B^ν s^ν.
¹⁰Methods of this type are also known as variable metric methods. One can think of
the descent direction generated in Step 1 as the steepest descent but with respect to a
different metric on IRⁿ than the usual Euclidean metric.
Hence

    u = (1/⟨v, s^ν⟩) (c^ν − B^ν s^ν)

and

    U^ν = (1/⟨v, s^ν⟩) [ (c^ν − B^ν s^ν) ⊗ v ].

In the preceding expression all quantities are fixed except for v, and the only
restriction is that v shouldn't be orthogonal to s^ν. One can choose v so
as to restrict B^ν to a class of matrices that have some desirable properties.
For example, one may wish to have the matrices B^ν symmetric; the Hessian
∇²f(x^ν) is symmetric when it is defined. Choosing

    v^ν = c^ν − B^ν s^ν

yields

    B^{ν+1} = B^ν + (1/⟨v^ν, s^ν⟩) [ v^ν ⊗ v^ν ].
A popular alternative is the (BFGS) update

    B^{ν+1} = B^ν + (1/⟨c^ν, s^ν⟩) [ c^ν ⊗ c^ν ] − (1/⟨B^ν s^ν, s^ν⟩) [ B^ν s^ν ⊗ B^ν s^ν ].
Working instead with approximations D^ν to the inverse of the Hessian, so
that d^ν = −D^ν ∇f(x^ν) in Step 1, the corresponding update is

    D^{ν+1} = D^ν + (1/⟨c^ν, s^ν⟩) [ s^ν ⊗ s^ν ] − (1/⟨c^ν, D^ν c^ν⟩) [ D^ν c^ν ⊗ D^ν c^ν ].
The Matlab-function fminunc (with the LargeScale option set to 'off') implements the BFGS Quasi-Newton method. Again, the Rosenbrock function
could be used as a test function and the results compared to those obtained in
Exercise 1.6.
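For instance, a minimal run on the Rosenbrock function might look as follows; the options syntax is that of the (older) Optimization Toolbox releases referred to in the text:

    rosen = @(x) 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;  % Rosenbrock test function
    opts = optimset('LargeScale', 'off');               % selects the BFGS method
    [x, fval, flag, out] = fminunc(rosen, [-1.2; 1], opts);
    fprintf('x = (%g, %g) after %d iterations\n', x(1), x(2), out.iterations);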
1.5 Integral functionals

Let's now consider problems of the form

    min f(x) = ∫₀¹ L(t, x(t), ẋ(t)) dt,

where x ranges over X ⊂ fcns([0, 1], IR), a collection of functions with some specified
properties; for example X = ac-fcns([0, 1], IR), the space of absolutely continuous
real-valued functions defined on [0, 1]. For simplicity's sake, let's assume that
X = C¹([0, 1]; IR) is the space of real-valued, continuously differentiable functions defined on [0, 1]. One has,

    Oresme rule:  x* ∈ argmin f  =⇒  df(x*; w) = 0, ∀w ∈ W ⊂ X,

where the set W of admissible variations is such that

    x ∈ X, w ∈ W  =⇒  ∀τ ∈ IR:  (x + τw)(0) = α₀, (x + τw)(1) = α₁,

α₀ and α₁ being the prescribed boundary values; that is,

    W = { w ∈ X | w(0) = w(1) = 0 }.
It isn't straightforward to write down Fermat's rule. There isn't a ready-made
calculus to find the gradient of functions defined on an infinite dimensional
linear space. In the late 17th Century (?), the path going from Oresme's
to Fermat's rule for this problem was pioneered by a trio of mathematical
superstars: the Bernoulli brothers, Johann and Jacob, and Isaac Newton.
Let's sketch out their approach when

    L: [0, 1] × IR × IR → IR

is a really nice function. Here, this means that the partial derivatives

    L_x(t, x, v) = ∂L/∂x (t, x, v),   L_v(t, x, v) = ∂L/∂v (t, x, v)

exist and are continuous.
Then

    df(x; w) = lim_{τ→0} (1/τ) ∫₀¹ [ L(t, (x+τw)(t), (ẋ+τẇ)(t)) − L(t, x(t), ẋ(t)) ] dt
             = ∫₀¹ lim_{τ→0} (1/τ) [ L(t, (x+τw)(t), (ẋ+τẇ)(t)) − L(t, x(t), ẋ(t)) ] dt
             = ∫₀¹ [ L_x(t, x(t), ẋ(t)) w(t) + L_v(t, x(t), ẋ(t)) ẇ(t) ] dt.

Integrating the first term by parts, and using w(0) = w(1) = 0, one obtains

    df(x; w) = ∫₀¹ ẇ(t) [ L_v(t, x(t), ẋ(t)) − ∫₀ᵗ L_x(τ, x(τ), ẋ(τ)) dτ ] dt.

Because w ∈ W implies ∫₀¹ ẇ(t) dt = 0, and df(x*; w) = 0 must hold for all
such functions w: on [0, 1],

    t ↦ L_v(t, x*(t), ẋ*(t)) − ∫₀ᵗ L_x(τ, x*(τ), ẋ*(τ)) dτ

must be constant.
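When L and x* are smooth enough, differentiating this constancy relation yields the classical Euler equation, the form referred to below (a sketch of the step, under smoothness assumptions not spelled out here):

    d/dt [ L_v(t, x*(t), ẋ*(t)) ] = L_x(t, x*(t), ẋ*(t)),  t ∈ [0, 1].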
The most famous example is the brachistochrone problem¹⁴: among the curves
y(·) joining two given points, find the one along which a bead, sliding without
friction under gravity g, travels in the least time; the time to be minimized
works out to

    (1/√(2g)) ∫ √( (1 + y′(x)²) / y(x) ) dx.

¹⁴originally formulated by Galileo, the name is derived from the Greek, brachistos for
shortest and chronos for time
Rather than the Euler equation itself, let's rely on a variant, namely,

    d/dx [ L − y′ L_{y′} ] = L_x;   y(0) = 0, y(1) = 2.

Parametrizing the solution as t ↦ (x(t), y(t)), one finds

    dy/dx = (dy/dt)(dt/dx) = ẏ/ẋ = (1 − cos t)⁻¹ sin t.
1.6 In conclusion ...
We have seen that the rules of Oresme and Fermat, with the help of Differential Calculus, can be used effectively to identify potential minimizers
of a smooth function in a variety of situations. But many interesting optimization problems involve non-differentiable functions, and minimizers have
a predilection for being located at the cusps and kinks of such functions!
Moreover, the presence of constraints in a minimization problem comes with
an intrinsic lack of smoothness: there is a sharp discontinuity between points
that are admissible and those that are not.

To deal with this more inclusive class of functions, we need to enrich our
calculus. Our task, on the mathematical side, will thus be to set up a Subdifferential¹⁵ Calculus with rules that mirror those of Differential Calculus,
and that culminates in versions of Oresme's and Fermat's rules to ferret out
the minimizers of non-smooth, and even discontinuous, functions¹⁶.

¹⁵the prefix 'sub' has the meaning: requiring less than differentiability.
¹⁶For more comprehensive expositions of Subdifferential Calculus, one should consult
[4, 14, 1, 3, 18]; our notation and terminology will be consistent with that of [18].
Chapter 2
FORMULATION
Let's begin with a few typical (constrained) optimization problems that fit
under the mathematical programming umbrella. In almost all of these examples, we start with a deterministic version and then switch to a more
realistic model that makes a place for the uncertainty about some of the
parameters.

When we allow for data uncertainty, not only do we gain credibility for
the modeling process but we are also led to consider a number of issues
that are at the core of optimization theory and practice, namely, how to
deal with non-linearities, with lack of smoothness, and how to design solution procedures for large-scale problems. In addition, due to the introduction
of randomness (uncertainty), it's also necessary to clarify a number of the
basic modeling issues, in particular, how stochastic programs differ from the
simpler, but less realistic, deterministic formulations.

For all these reasons, we are going to rely rather extensively, but by no
means exclusively, on stochastic programming examples to motivate both the
theoretical development and the design of algorithmic procedures.
2.1 A product mix problem

A furniture maker can manufacture and sell four different dressers. Each
dresser requires a certain number t_cj of man-hours for carpentry, and a certain
number t_fj of man-hours for finishing, j = 1, ..., 4. In each period, there are
d_c man-hours available for carpentry, and d_f available for finishing. There is
a (unit) profit c_j per dresser of type j that's manufactured. The owner's goal
is to maximize total profit. With the first row listing the carpentry hours
and the second the finishing hours required per dresser of each type,

    T = | 4 9 7 10 |
        | 1 1 3 40 |
This is a linear program, i.e., an optimization problem in finitely many (real-valued) variables in which a linear function is to be minimized (or maximized) subject to a system of finitely many linear constraints: equations and
inequalities. A general formulation of a linear program could be

    min Σ_{j=1}^n c_j x_j
    so that Σ_{j=1}^n a_{ij} x_j ≤ b_i,  i = 1, ..., m,
            x_j ≥ 0,  j = 1, ..., n.
The objective and the constraints of our product mix problem are linear
and it may be written compactly as:

    min ⟨c, x⟩ so that Tx ≤ d, x ≥ 0.

As part of the ensuing development, many of the properties of linear programs will be brought to the fore, including optimality conditions, solution
procedures and the associated geometry. For now, let's simply posit that
such problems can be solved efficiently when they are not too large. The
(optimal) solution of our product mix problem is:

    x^d = (4000/3, 0, 0, 200/3).
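The deterministic solution can be reproduced with linprog; in the sketch below, d = (6000, 4000) is consistent with the solution x^d reported above, while the profit vector c is a hypothetical placeholder (the actual values were not recoverable here):

    T = [4 9 7 10; 1 1 3 40];
    d = [6000; 4000];                 % available carpentry and finishing hours
    c = [12; 25; 21; 40];             % hypothetical unit profits
    xd = linprog(-c, T, d, [], [], zeros(4,1), []);   % max <c,x> s.t. Tx <= d, x >= 0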
In a more realistic model, the man-hours required are random variables; say each
entry of T takes any one of four equally likely possible values:

    entry      possible values
    t_c1:   3.60   3.90   4.10   4.40
    t_c2:   8.25   8.75   9.25   9.75
    t_c3:   6.85   6.95   7.05   7.15
    t_c4:   9.25   9.75  10.25  10.75
    t_f1:   0.85   0.95   1.05   1.15
    t_f2:   0.85   0.95   1.05   1.15
    t_f3:   2.60   2.90   3.10   3.40
    t_f4:   37.0   39.0   41.0   43.0

Bold face will be used for random variables with normal print for their possible values.
We have 8 random variables, each taking four possible values; that yields a
total of 4⁸ = 65,536 possible T matrices (outcomes) and each one of these
has equal probability of occurring! In practice, this could be a discretization
that approximates a continuous distribution (e.g., a uniform distribution).
Let's denote the probability of a particular outcome Tˡ by pˡ = (0.25)⁸, for
l = 1, ..., 65,536.
Because the manufacturer must decide on the production plan before the
number of hours required for carpentry or finishing are known with certainty,
there is the possibility that they actually exceed the number of hours available. Therefore, the possibility of having to pay for overtime must be factored
in. The recourse costs are determined by: q_c per extra carpentry hour and
q_f per extra finishing hour, say q = (q_c, q_f) = (5, 10).

This recourse decision will only enter into play after the production plan x
has been selected and the time required, Tˡ, for each task, has been observed.
Our manufacturer will, at least potentially, make a different decision about
overtime when confronted with each one of these 65,536 possible different
outcomes for T. Let y_cˡ and y_fˡ denote the number of overtime hours
hired for carpentry and finishing when the matrix T turns out to be Tˡ. The
problem is then to choose (x_j ≥ 0, j = 1, ..., 4) to minimize

    Σ_{j=1}^4 c_j x_j + Σ_{l=1}^{65,536} pˡ ( q_c y_cˡ + q_f y_fˡ )

so that

    Σ_{j=1}^4 t_cjˡ x_j − y_cˡ ≤ d_c,  l = 1, ..., 65,536,
    Σ_{j=1}^4 t_fjˡ x_j − y_fˡ ≤ d_f,  l = 1, ..., 65,536,
    y_cˡ ≥ 0, y_fˡ ≥ 0,  l = 1, ..., 65,536.
Notice that the objective now being minimized is the sum of the immediate costs (actually, the negative profits) and the expected future costs, since
one must consider 65,536 possible outcomes; the constraints involving random quantities are written out explicitly for all 65,536 possible outcomes.
In addition to non-negativity for the decision variables x_j and the recourse
variables y_cˡ, y_fˡ, the constraints say that the number of man-hours it takes
for the carpentry of all dressers (Σ_{j=1}^4 t_cjˡ x_j) must not exceed the total number of hours made available for carpentry (d_c + y_cˡ), i.e., regular hours plus
overtime, and the same must hold for finishing.
Because there is the possibility of making a recourse decision yˡ = (y_cˡ, y_fˡ)
that will depend on the outcomes of the random elements, this type of problem is called a stochastic program with recourse. This class of problems will
be studied in more depth later. For now, it suffices to understand how the
decision/information process is evolving:

    decision: x ;  observation: Tˡ ;  recourse: yˡ.

In summary, the manufacturer makes 'today' a decision x of how much
of each dresser type to produce based on the knowledge that he will be
able 'tomorrow' to observe how many man-hours Tˡ it actually took to
manufacture the dressers, as well as to decide how much overtime labor yˡ
to hire based on this observation.
The problem is still a linear program, but of much larger size! Notice the
block-angular structure of the problem when written in the following way:

    min  ⟨c, x⟩ + p¹⟨q, y¹⟩ + p²⟨q, y²⟩ + ··· + p⁶⁵⁵³⁶⟨q, y⁶⁵⁵³⁶⟩
    so that  T¹x − Iy¹ ≤ d,
             T²x       − Iy² ≤ d,
              ⋮                ⋱
             T⁶⁵⁵³⁶x                − Iy⁶⁵⁵³⁶ ≤ d,
             x ≥ 0, y¹ ≥ 0, ..., y⁶⁵⁵³⁶ ≥ 0.

Later, we shall see how to solve these large-scale linear programs by exploiting
their structure. For now, it is enough to observe that these are indeed, large
scale problems.
Oftentimes, there is more than one source of uncertainty in a problem. For
example, due to employee absences, the available man-hours for carpentry
and finishing may also have to be modeled as random variables, say

    entry     possible values
    d_c:   5,873   5,967   6,033   6,127
    d_f:   3,936   3,984   4,016   4,064
We now need to replace d by dˡ = (d_cˡ, d_fˡ) and we must take into account
the 4² = 16 possible dˡ vectors; that gives a total of L = 4¹⁰ = 1,048,576
possible (T, d) realizations. With pˡ = 1/L, the problem reads:

    min  ⟨c, x⟩ + Σ_{l=1}^L pˡ⟨q, yˡ⟩
    so that  Tˡx − Iyˡ ≤ dˡ,  l = 1, ..., L,
             x ≥ 0, yˡ ≥ 0,  l = 1, ..., L.

The relatively small linear program we started out with, in the deterministic
setting, has now become almost enormous! Let's refer to this problem as the
(equivalent) extensive version of the (given) stochastic program.
The optimal solution is

    x* = (257, 0, 665.2, 33.8).
Because of its large size, this problem is more difficult to solve than its
deterministic counterpart, and any efficient solution procedure must exploit
the problem's special structure. But the solution x* is robust, meaning that
it has examined all one million plus possibilities, and has taken into account
the resulting recourse costs for overtime and the associated probabilities of
having to pay these costs.

With x^d = (4000/3, 0, 0, 200/3), the solution of the deterministic version,
the expected cost would have been $ -16,942; the expected overtime costs are
$ 1,725. Of course, x^d is not an optimal solution of the stochastic program,
but more significantly, x^d isn't getting us on the right track! The solution x*
suggests that a large number of dressers of type 3 should be manufactured,
while the production plan suggested by x^d doesn't even include any dresser of
type 3. This is exactly the information a decision maker would want to have,
viz., what are the activities that should be included in a (robust) optimal
solution.
2.1 Exercise (stochastic resources). Consider the product mix problem
when the only uncertainty is about the number of hours that will be available
for carpentry and finishing. Overtime will still be paid at the rates of $ 5 an
hour for carpentry and $ 10 an hour for finishing; the (equally likely) possible
values are:

    entry     possible values
    d_c:   4,800   5,500   6,050   6,150
    d_f:   3,936   3,984   4,016   4,064
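A sketch of how the extensive version for this exercise (16 equally probable scenarios) can be assembled and passed to linprog; T and q are as in the text, c is the same hypothetical profit vector as before, and the ordering x, y¹, ..., y¹⁶ of the variables is an implementation choice:

    T = [4 9 7 10; 1 1 3 40]; q = [5; 10]; p = 1/16;
    c = [12; 25; 21; 40];                        % hypothetical unit profits
    dc = [4800 5500 6050 6150]; df = [3936 3984 4016 4064];
    A = []; b = []; f = -c; l = 0;
    for i = 1:4
        for j = 1:4
            l = l + 1;                           % scenario l: T*x - y^l <= d^l
            row = zeros(2, 36); row(:, 1:4) = T;
            row(:, 3 + 2*l : 4 + 2*l) = -eye(2);
            A = [A; row]; b = [b; dc(i); df(j)];
            f = [f; p*q];                        % expected overtime costs
        end
    end
    xy = linprog(f, A, b, [], [], zeros(36,1), []);
    x = xy(1:4)                                  % the here-and-now decision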
2.2 Curve fitting II

Let's return to the curve fitting problem, but now build the fitting curve from
its second derivative:

    z(t) = z₀ + v₀ t + ∫₀ᵗ ( ∫₀ˢ a(r) dr ) ds,

where z₀ and v₀ play the role of integration constants and a: [0, 1] → IR,
the second derivative, is some function, not necessarily continuous. In fact,
to render the problem computationally manageable, let's restrict a to be
piecewise constant. That's not as serious a limitation as might appear at first
glance since any piecewise continuous function on [0, 1] can be approximated
arbitrarily closely by such a piecewise constant function.
Getting down to specifics: Partition (0, 1] in N sub-intervals (t_{k−1}, t_k], of
length Δ = 1/N, so that the points at which the function h is known are some
of the end points of these intervals, say {t_l, l ∈ L}. For k = 1, ..., N, set

    a(t) = x_k (a constant), for t ∈ (t_{k−1}, t_k];

then

    ż(t) = v₀ + ∫₀ᵗ a(s) ds = v₀ + Δ Σ_{j=1}^{k−1} x_j + (t − t_{k−1}) x_k,

and

    z(t) = z₀ + ∫₀ᵗ ż(s) ds = z₀ + Σ_{j=1}^{k−1} ∫_{t_{j−1}}^{t_j} ż(s) ds + ∫_{t_{k−1}}^{t} ż(s) ds
         = z₀ + v₀ t + Σ_{j=1}^{k−1} (t − t_j + Δ/2) Δ x_j + ½ (t − t_{k−1})² x_k.

In particular, when t = t_k,

    z(t_k) = z₀ + kΔ v₀ + Δ² Σ_{j=1}^{k} (k − j + ½) x_j.
The fit criterion is, as before, least squares minimization. With z = (z_l = z(t_l), l ∈ L) and h = (h_l = h(t_l), l ∈ L), one ends up with the following formulation,

    min |z − h|² = Σ_{l∈L} |z_l − h_l|²
    so that z_l = z₀ + lΔ v₀ + Δ² Σ_{k=1}^{l} (l − k + ½) x_k,  l ∈ L,
            −κ ≤ x_k ≤ κ,  k = 1, ..., N.

That's a quadratic program: the constraints are linear and the function to be
minimized is quadratic. One can write the equality constraints as

    z = Ax  where  x = (z₀, v₀, x₁, ..., x_N)

and A collects the corresponding coefficients.
function z = CurveFit(N, x, h, kappa)  % function name assumed; the header was lost
% (x,h): data points
% kappa: lower and upper bound on 2nd derivatives
xr = 1;                       % assumed range [0, xr] of the x-data
msh = xr/N; [m, m0] = size(x(:)); xidx = round(N*x);
N1 = N + 1; N2 = N + 2;
mx = 2 + max(abs(h)); ub = [kappa*ones(N,1); 100; mx]; lb = -ub;
%    generating the coefficients of matrix A
for i = 1:m
    for j = 1:xidx(i)
        A(i,j) = (xidx(i)-j+0.5)*msh^2;
    end %for
    A(i,N1) = xidx(i)*msh; A(i,N2) = 1;
end %for
xx = quadprog(A'*A, -A'*h(:), [], [], [], [], lb, ub);
%    z-curve calculation
for l = 1:N
    zd = 0;
    for k = 1:l
        zd = zd + (l-k+0.5)*xx(k);
    end %for
    z(l) = xx(N2) + xx(N1)*l*msh + zd*msh^2;
end %for
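A possible test run (a sketch; CurveFit is the name assumed above, and the data are synthetic):

    x = 0:0.05:1; h = sin(2*pi*x) + 0.1*randn(size(x));   % noisy sample points
    z = CurveFit(50, x, h, 40);                           % kappa = 40 bounds |a(t)|
    plot((1:50)/50, z, x, h, 'xm')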
[Figure: the fitted z-curve and the data points]
If instead one prefers the ℓ₁-criterion, one can find the minimum of Σ_{l∈L} |z_l − h_l| by minimizing

    Σ_{l∈L} θ_l  with  θ_l ≥ z_l − h_l, θ_l ≥ h_l − z_l for l ∈ L.

With the ℓ₁-norm criterion, the curve fitting problem takes the form,

    min Σ_{l∈L} θ_l
    so that θ_l ≥ z_l − h_l,  l ∈ L,
            θ_l ≥ −z_l + h_l,  l ∈ L,
            z_l = z₀ + lΔ v₀ + Δ² Σ_{k=1}^{l} (l − k + ½) x_k,  l ∈ L,
            −κ ≤ x_k ≤ κ,  k = 1, ..., N.

This is a linear program: the constraints are linear and the function to be
minimized is also linear.
2.2 Example (yield curve tracing). The spot rate of a Treasury Note that
matures in t months always includes a risk premium as well as a forecast component that represents the market's perception of future interest rates. Such
spot rates are quoted for Treasury Notes with specific maturities, t = 3, 6, ....
To evaluate financial instruments that generate cash flow (coupons, final
payments) at intermediate dates, one needs to have access to a yield curve
that supplies the spot rate for every possible date.

[Figure 2.2: the (historical) yield curve calculated by the program below]
The quoted rates are:

    term:  3     6    12    24    36    60    84   120   240   360
    r_t:  11.9  12.8  13.2  13.8  14.0  14.1  14.1  13.9  13.8  13.6
To trace the yield curve, the simplistic approach is to rely on linear interpolation. However, that's not really satisfactory. Financial markets make
continuous adjustments to the changing environment and this suggests that
the yield curve has to be quite smooth. Certainly, there shouldn't be an
abrupt change in the slope of the yield curve, and a fortiori, this shouldn't
occur at t = 3, 6, .... So, let's fit a smooth curve to the data. Because the
spot rates are nonnegative⁴, one can express the yield curve as s(t) = e^{−z(t)},
in which case we need to search for a smooth z-curve that will fit the pairs
{(3, −ln r₃), (6, −ln r₆), ...}. The following Matlab-file generates the coefficients of the linear program and then relies on linprog to calculate the
solution. Figure 2.2 graphs the (historical) yield curve calculated by our program.

⁴and it's expedient to have an expression for the spot rates that makes calculating
forward rates and discount factors particularly easy
function spots = YieldCurve(N, x, r, kappa)
% N: # of months, range [0, N]; (x,r): data points
% kappa: lower and upper bound on 2nd derivative
[m, m0] = size(x(:)); N1 = N + 1; N2 = N + 2;
ub = [kappa*ones(N,1); 0; 0; 10*ones(m,1)];
lb = [-ub(1:N); -1; -3.25; zeros(m,1)];
%    generating the coefficients of linear program
for i = 1:m
    i2 = 2*i; i1 = i2-1;
    b(i2) = log(r(i)); b(i1) = -b(i2);
    A(i1,:) = zeros(1,N2+m);
    for j = 1:x(i)
        A(i1,j) = (x(i)-j+0.5);
    end %for
    A(i1,N1) = x(i); A(i1,N2) = 1; A(i1,N2+i) = -1;
    A(i2,:) = -A(i1,:); A(i2,N2+i) = -1;
end %for
c = [zeros(1,N2) ones(1,m)];
xx = linprog(c, A, b, [], [], lb, ub);
%    yield curve calculation
for l = 1:N
    zd = 0;
    for k = 1:l
        zd = zd + (l-k+0.5)*xx(k);
    end %for
    z(l) = xx(N2) + xx(N1)*l + zd;
end %for
spots = exp(-z);
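With the data of the table above, a run might look like this (a sketch; the value of kappa is a guess that controls the smoothness of the curve):

    term = [3 6 12 24 36 60 84 120 240 360];
    rt = [11.9 12.8 13.2 13.8 14.0 14.1 14.1 13.9 13.8 13.6];
    spots = YieldCurve(360, term, rt, 0.001);
    plot(1:360, spots, term, rt, 'xm')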
2.3 A network capacity expansion problem

Let's consider a power transmission network, Figure 2.3, with e_i the external flow at node i, i.e., the difference between demand and supply at node i.
The internal flow y_j on arc j is limited by the capacity κ_j of the transmission
line. Total supply exceeds total demand, but the capacity of the transmission
lines needs to be expanded from κ_j to κ_j + x_j, with β_j an upper bound on x_j,
in order to render the problem feasible⁵. The total cost of such an expansion
is Σ_{j=1}^n γ_j(x_j).

[Figure 2.3: a power transmission network; the flow on arc j satisfies |y_j| ≤ κ_j]

⁵In the 2001 California energy crisis, some of the blackouts were blamed on the lack of
capacity of the transmission lines between South and North California.
CHAPTER 2. FORMULATION
The deterministic version of this capacity expansion problem would be:
min
n
X
j (xj )
j=1
so that 0 xj j ,
j = 1, . . . , n,
|yj | j + xj , j = 1, . . . , n,
X
X
yj ei , i = 1, . . . , m;
yj
i
P
yj stands for the (internal) flow into i whereas i yj is the flow from
i to the other nodes. Since the constraint |yj | j + xj can be split in
the two linear constraints yj j + xj and yj j xj , this is again a
linear programming problem if the cost functions j are linear. Usually, the
functions j are nonlinear, and the problem then belongs to a more general
class of optimization problems.
A nonlinear program is an optimization problem in finitely many (real-valued) variables in which a function is to be minimized (or maximized)
subject to a system of finitely many constraints: equations and inequalities.
A general formulation of a nonlinear program could be

    min f₀(x)
    so that f_i(x) ≤ 0,  i = 1, ..., s,
            f_i(x) = 0,  i = s+1, ..., m.

[Figure: a small example network with expanded arc capacities of the form |y_j| ≤ κ_j + x_j]
In the stochastic version, the external flows are random, and a shortfall ζ at
node i is penalized via a function θ_i measuring the (lack of) demand satisfaction:

    θ_i(ζ) = 0                    if ζ < 0  (supply exceeds demand),
           = τ_i ζ²/(2ε_i)        if ζ ∈ [0, ε_i]  (low level excess demand),
           = τ_i ζ − τ_i ε_i/2    if ζ > ε_i  (high level excess demand).

Let's suppose that the budget available for the capacity expansion is fixed,
say Σ_{j=1}^n γ_j(x_j) ≤ b. With ξˡ = (ξ_iˡ, i = 1, ..., m), l = 1, ..., L, the
possible supply/demand realizations, occurring with probabilities pˡ, the
problem becomes:

    min Σ_{l=1}^L pˡ [ Σ_{i=1}^m θ_i ( ξ_iˡ − Σ_{j∈in(i)} y_jˡ + Σ_{j∈out(i)} y_jˡ ) ]
    so that Σ_{j=1}^n γ_j(x_j) ≤ b,
            0 ≤ x_j ≤ β_j,  j = 1, ..., n,
            y_jˡ − x_j ≤ κ_j,  j = 1, ..., n, l = 1, ..., L,
            −y_jˡ − x_j ≤ κ_j,  j = 1, ..., n, l = 1, ..., L,
again minimizing an expected cost. We can make this even more explicit by
rewriting the problem in the following form:

    min_x Σ_{l=1}^L pˡ q(ξˡ, x)
    so that Σ_{j=1}^n γ_j(x_j) ≤ b,
            0 ≤ x_j ≤ β_j,  j = 1, ..., n,

where

    q(ξ, x) = inf_{y∈IRⁿ} { Σ_{i=1}^m θ_i ( ξ_i − Σ_{j∈in(i)} y_j + Σ_{j∈out(i)} y_j ) | |y_j| ≤ κ_j + x_j, j = 1, ..., n }.

Like the product mix problem, this is a (two-stage) stochastic program with
recourse. The first stage decision x, what we're really interested in, has to
be selected now, before the supply/demand situation can be observed. The
second stage or recourse decision, i.e., the decision about the flows y, is made
after the supply/demand situation is known. The function q is called the
recourse cost function.
In another version of this problem, the joint distribution function P for ξ
could actually be continuous with (joint) density function p: IR^N → IR₊.
For a function φ: IR^N → IR, the expected value of φ(ξ) is given by

    ∫_{IR^N} φ(ξ) p(ξ) dξ = ∫ dξ₁ ∫ dξ₂ ··· ∫ dξ_N φ(ξ) p(ξ)    (N integrals).

If this is the case, again with the recourse cost function q as defined above,
our optimization problem takes the form:

    min ∫_{IR^N} q(ξ, x) p(ξ) dξ
    so that Σ_{j=1}^n γ_j(x_j) ≤ b,
            0 ≤ x_j ≤ β_j,  j = 1, ..., n.
Notice that we don't have an explicit expression for ∫_{IR^N} q(ξ, x) p(ξ) dξ as a
function of x. For each x, an N-dimensional integration must be performed,
and that's computationally expensive; and anyway, we don't even have an
explicit representation for the recourse cost function q. The major challenge
in stochastic programming is to find ways to deal with this situation.
2.4 Discrete decision variables
2.4 Example (the traveling salesman problem). Given a set of cities (nodes)
V = {1, . . . , n} and connecting routes (arcs) A with cij the travel time between cities i and j, the problem is to find a tour that starts and terminates
in city 1, visits all other cities exactly once, and takes the least total travel
time.
Details. Let x_ij = 1 if the tour includes the route from city i to city j,
otherwise x_ij = 0; note that we don't identify traveling from i to j with
traveling from j to i. Together with the objective and the requirement that each
city be left and entered exactly once, these relations determine a linear program,
but the binary restriction, 'a route must be traveled or not', hasn't been included
in the formulation of the problem. So, one must add the constraint:

    x_ij ∈ {0, 1}  ∀(i, j) ∈ A.

The resulting problem is no longer a linear program, but a so-called integer
program; problems where some of the variables are restricted to be integer-valued and others are allowed to be real-valued are mixed integer programs
(MIP).
2.5 Exercise (mini-traveling tour). Formulate, and solve, the following traveling salesman problem where the traveling time between cities {1, ..., 4} is
given by the following matrix (entry c_ij: from city i, row, to city j, column):

    i\j    1     2     3     4
    1      ?     2     5     3
    2      3     ?     4     ?
    3      6     ?     ?     6
    4     3.5    6     4     ?

'?' indicates that there is no usable route from city i (row) to city j (column).

Guide. For the numerical solution use the Matlab function bintprog.
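A sketch of one possible set-up for bintprog (available in the older releases of the Optimization Toolbox): the variables are the ten usable arcs; each city must be left and entered exactly once; and, with only four cities, it suffices to forbid the two-city subtours:

    % usable arcs (i,j) read off the matrix above, with their travel times
    arcs = [1 2; 1 3; 1 4; 2 1; 2 3; 3 1; 3 4; 4 1; 4 2; 4 3];
    cost = [2; 5; 3; 3; 4; 6; 6; 3.5; 6; 4];
    n = 4; na = size(arcs, 1);
    Aeq = zeros(2*n, na); beq = ones(2*n, 1);
    for k = 1:na
        Aeq(arcs(k,1), k) = 1;          % leave city i exactly once
        Aeq(n + arcs(k,2), k) = 1;      % enter city j exactly once
    end
    A = []; b = [];                     % x(i,j) + x(j,i) <= 1 kills 2-city subtours
    for k = 1:na
        kk = find(arcs(:,1) == arcs(k,2) & arcs(:,2) == arcs(k,1));
        if ~isempty(kk) && kk > k
            row = zeros(1, na); row([k kk]) = 1;
            A = [A; row]; b = [b; 1];
        end
    end
    x = bintprog(cost, A, b, Aeq, beq);
    tour = arcs(x > 0.5, :)             % the arcs in the optimal tour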
2.5 The Broadway producer problem

A (Broadway) producer must commit to a level x at a unit cost α; if the
realized demand ξ exceeds x, the shortfall must be made up at a higher unit
cost γ > α, so that the expected cost includes the term ∫ q(ξ − x) p(ξ) dξ.
In this case, a simple calculation shows that the optimal solution x̄ must
satisfy

    0 = α − γ [1 − P(x̄)].

More generally, if we define P(ζ⁻) := lim_{η↗ζ} P(η), one must have

    P(x̄⁻) ≤ (γ − α)/γ ≤ P(x̄).

[Figure: the distribution function P and the optimal x̂ = P⁻¹((γ − α)/γ)]
unique solution x̄ = 15 5/7. If ξ has a discrete distribution with equal probability on the points {10, 11, ..., 20}, write down an expression for the expected
costs. What is the distribution function P? Show that x̄ = 16 is an optimal
solution in this discrete case⁷.
2.7 Example (the newsboy problem). A newsboy (a firm) must place an
order for x newspapers (perishable items) to meet a demand ξ that's known
only in a probabilistic sense, i.e., ξ is a random variable with known distribution function P(ζ) = prob[ξ ≤ ζ]. A paper costs α cents and is sold for γ
cents; unsold papers can't be returned. The newsboy wants to choose x to
maximize expected profit.

Detail. The newsboy problem, a classic inventory problem, is a problem of
the same type as the producer's except that it is one of maximizing instead
of minimizing. The optimality conditions are essentially the same as those
for the Broadway show producer problem.
The stochastic programming model of the producer's problem starts from
a deterministic version of the problem:

    min αx such that x = ξ,

with the random demand replaced by a fixed estimate.

⁷Remark: The solution of the producer's problem may not be unique. Uniqueness will
fail whenever there is an interval on which P takes the value (γ − α)/γ. With α = 3 and
γ = 7, again, and ξ uniformly distributed on [10, 14] ∪ [17, 20], the solution x̄ is any point in
the interval [14, 17]. Since ξ can never take on any value between 14 and 17, one might be
tempted to choose the lowest possible value for x̄, namely 14. But in terms of the stated
objective, minimization of expected costs, it doesn't matter which point gets selected.
of the contract, and the second term evaluating the decision in terms of
expected costs to come after the random event is observed. The second term
is called the expected recourse cost:

    E{q(ξ − x)}  where  q(y) = 0 when y ≤ 0,  q(y) = γy when y ≥ 0.

The recourse cost function is q(ξ − x).

[Figure: the graph of the function q]
2.8 Exercise (alternative expression for recourse cost). Show that the cost
function q (with γ > 0), a function commonly used to define recourse costs,
admits the alternative representations:

    q(y) = max[ 0, γy ]
         = min { γy⁺ | y⁺ − y⁻ = y, y⁺ ≥ 0, y⁻ ≥ 0 }.
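A quick way to see the second representation (a sketch of the argument): for fixed y, feasibility forces y⁺ = y + y⁻ with y⁻ ≥ 0, so y⁺ ranges over all values ≥ max[0, y], and hence

    min{ γy⁺ | y⁺ − y⁻ = y, y⁺ ≥ 0, y⁻ ≥ 0 } = γ min{ y⁺ | y⁺ ≥ max[0, y] } = γ max[0, y] = max[0, γy],

the last equality because γ > 0.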
With E{·} denoting expectation with respect to the distribution function
P, the stochastic programming formulation of the producer's problem is:

    min_x αx + E{ q(ξ − x) }.
Chapter 3
PRELIMINARIES
At the congenital level, one makes a distinction between two classes of optimization problems, namely those that are convex and those that are non-convex¹. Fortunately, a major portion of the optimization problems that
have to be dealt with in practice are convex; all examples in Chapter 2 fall in
this class. In the first two sections of this chapter, we build the basic tools to
analyze convex optimization problems, and in particular, what's needed to
generalize the Oresme and Fermat rules. The three last sections set up a minimal
probabilistic framework that will allow us to deal with (convex) stochastic
optimization problems and commence the study of expectation functionals.
3.1 Variational analysis I

¹Of course, non-convex problems have large subfamilies that possess particular properties that can be exploited effectively in the design of solution procedures, e.g., combinatorial optimization, optimization problems with integer variables, complementarity
problems, equilibrium problems, and so on.

Convex sets are the main objects of study in this
section. The next one will be devoted to a more detailed analysis of convex
functions and their (sub)differentiability properties.
A subset C of IRⁿ is convex if for all x⁰, x¹ ∈ C, the line segment [x⁰, x¹] ⊂
C, i.e.,

    x^λ = (1−λ)x⁰ + λx¹ ∈ C for all λ ∈ [0, 1].

Note that x⁰, x¹ don't have to be distinct, and thus if C consists of a single
point it's convex; the condition is also vacuously satisfied when C = ∅, the
empty set. Balls, lines, line segments, cubes, planes are all examples of
convex sets. Sets with dents or holes are typical examples of sets that fail to
be convex, cf. Figure 3.1.
A point x is a convex combination of x¹, ..., x^L if

    x = Σ_{l=1}^L λ_l xˡ  for some λ_l ≥ 0, l = 1, ..., L, such that Σ_{l=1}^L λ_l = 1;

the convex hull of x¹, ..., x^L collects all such convex combinations:

    C = con(x¹, ..., x^L) = { x = Σ_{l=1}^L λ_l xˡ | Σ_{l=1}^L λ_l = 1, λ_l ≥ 0, l = 1, ..., L }.

[Figures: the convex hull of the points x⁰, x¹, ..., x^L; the set λ₁C₁ + λ₂C₂ formed from two convex sets C₁, C₂]

3.2 Exercise (affine transformations). Under an affine mapping x ↦ L(x) = Ax + b,
CHAPTER 3. PRELIMINARIES
the set L(C) = z = Ax + b x C is convex whenever C IRn is convex.
In particular, L(C) is convex when its the projection of the convex set C on
a subspace of IRn .
[Figure: the projection proj C of a convex set C on a subspace S of IR²]
Guide. The first statement only requires a simple application of the definition of convexity. For the second, simply write the projection as an affine
mapping.

Warning: projections preserve convexity but the projection of a closed convex
set is not necessarily closed. A simple example: let C = { (x₁, x₂) | x₂ ≥ 1/x₁, x₁ > 0 };
then the projection of C on the x₁-axis is the open interval
(0, ∞). This can't occur if C is also bounded, cf. Proposition 8.9.
A particularly important subclass of convex sets are those that are also
cones. A ray is a closed half-line emanating from the origin, i.e., a set of the
type { λx̄ | λ ≥ 0 } for some 0 ≠ x̄ ∈ IRⁿ. A set K ⊂ IRⁿ is a cone if 0 ∈ K
and λx ∈ K for all x ∈ K and λ > 0. Aside from the zero cone {0}, the
cones K in IRⁿ are characterized as the sets expressible as nonempty unions
of rays. The following sets are all convex cones: {0}, IRⁿ₊, IRⁿ and (closed)
half-spaces of the type { x ∈ IRⁿ | ⟨a, x⟩ ≤ 0 } with a ≠ 0.

3.3 Exercise (convex cones). A nonempty set C is a convex cone if and
only if x¹, x² ∈ C and λ₁, λ₂ ≥ 0 imply λ₁x¹ + λ₂x² ∈ C.
A function f: IRⁿ → IR̄ = IR ∪ {∞} is convex if for all x⁰, x¹ and λ ∈ (0, 1),

    f((1−λ)x⁰ + λx¹) ≤ (1−λ)f(x⁰) + λf(x¹).

A (real-valued) convex function f₀ given on a convex set C ⊂ IRⁿ is identified
with its extension f defined by

    f(x) = f₀(x) when x ∈ C,  f(x) = ∞ otherwise,
[Figure: a function f₀ on C and its extension f; C = dom f]
i.e., the minimizers of the two functions are the same and so is the value of
their minimum (= infimum)². Addition and multiplication in the extended
reals follow the natural conventions.

²If we had opted for maximization as our canonical set-up, then the extension would
have assigned the value −∞ to the points outside the effective domain.
The epigraph of f, epi f = { (x, α) ∈ IRⁿ × IR | α ≥ f(x) }, consists of all
points on or above the graph of f.

[Figure: the epigraph epi f of a function f]
Symmetrically, hypo f = { (x, α) ∈ IRⁿ × IR | α ≤ f(x) }, i.e., hypo f consists of all points in IRⁿ⁺¹ that lie on or below the graph of f.
3.5 Proposition (convexity of epigraphs). A function f: IRⁿ → IR̄ is convex if and only if epi f ⊂ IRⁿ⁺¹ is convex.
It's concave if and only if hypo f ⊂ IRⁿ⁺¹ is convex.

Proof. The convexity of epi f means that whenever (x⁰, α⁰), (x¹, α¹) ∈ epi f
and λ ∈ (0, 1), the point (x^λ, α^λ) := (1−λ)(x⁰, α⁰) + λ(x¹, α¹) belongs to
epi f. This is the same as saying that whenever f(x⁰) ≤ α⁰ and f(x¹) ≤ α¹,
one has f(x^λ) ≤ α^λ.

The assertion about concavity follows by symmetry, passing from f to
−f.
The epigraph is not the only convex set associated with a convex function:
for every α ∈ IR, the level set lev_≤α f = { x ∈ IRⁿ | f(x) ≤ α } of a convex
function f is convex (3.6).
3.7 Exercise (convexity of max-functions). Let { fᵢ, i ∈ I } be a collection
of convex functions. Then the function f(x) = sup_{i∈I} fᵢ(x) is convex.

[Figure: f as the max of the functions f₁, f₂, f₃, f₄]

Guide. epi f = ∩_{i∈I} epi fᵢ.

3.8 Proposition (convexity of indicator functions). The indicator function
ψ_C of a set C ⊂ IRⁿ, with ψ_C(x) = 0 if x ∈ C and ψ_C(x) = ∞ otherwise, is
convex if and only if C is convex.

Proof. Follows from 3.1(b), 3.5, and 3.6 since epi ψ_C = C × IR₊ and C = lev₀ ψ_C.
3.9 Proposition (inf-projection of convex functions). Let f: IRⁿ → IR̄ be
the inf-projection of the convex function g: IRᵐ × IRⁿ → IR̄, i.e., for all
x ∈ IRⁿ,

    f(x) = inf_{u∈IRᵐ} g(u, x).

Then f is a convex function.

[Figure 3.13: Inf-projection of a convex function]

Proof. Follows from 3.2 and 3.5 since epi f is the 'vertical closure' of the
image of epi g under the projection (u, x, α) ↦ (x, α).
f(x¹). But then, strict convexity would imply that f(x) < f(x⁰) for every
point x ∈ (x⁰, x¹).
3.2 Variational analysis II
This section continues the study of the properties of convex functions but we
are now mostly concerned with their (sub)differentiability properties. The
class of functions to which we can apply the classical optimality conditions
of Chapter 1 doesn't include many that come up in the mathematical programming context. Restricting the development to models involving only
differentiable (convex) functions would leave by the wayside all constrained
optimization models, and they include the large majority of the applications.
One needs a calculus that applies to functions that are not necessarily differentiable. Eventually, this will enable us to formulate Oresme's and Fermat's
rules for convex functions that aren't necessarily differentiable, or even continuous. This Subdifferential Calculus is introduced in this section and
will be expanded throughout the ensuing development.

For the sake of exposition, and so that the readers can drill their intuition,
1-dimensional functions are featured prominently in this section. In some
instances, for the sake of simplicity, the proof of a statement is only provided
for 1-dimensional convex functions³.
3.13 Proposition (continuity of convex functions). A real-valued convex
function f defined on IRⁿ is continuous.

Proof. The proof is for n = 1. It will be sufficient to show that f: IR → IR is
continuous at 0; continuity at any other point x follows from the continuity
at 0 of the function g(z) = f(z + x). By symmetry, it suffices to show
that f(0) = lim_ν f(x^ν) for any sequence x^ν ↘ 0 with x^ν ∈ (0, 1]. From the
convexity of f, one has for all ν:

    f(x^ν) ≤ (1 − x^ν) f(0) + x^ν f(1),
    f(0) ≤ (x^ν/(x^ν + 1)) f(−1) + (1/(x^ν + 1)) f(x^ν);

passing to the limit in these inequalities yields lim sup_ν f(x^ν) ≤ f(0) ≤ lim inf_ν f(x^ν).

³A complete proof would have required additional background material that would let
us stray too far from the objectives of these lectures; for proofs in full generality, one can
consult [3], [18, Chapter 2], for example.
For a convex function f: IR → IR and points x⁰ < y < x¹, the slopes of the
secants are ordered:

    (f(y) − f(x⁰))/(y − x⁰) ≤ (f(x¹) − f(x⁰))/(x¹ − x⁰) ≤ (f(x¹) − f(y))/(x¹ − y).

[Figure: the increasing slopes s₁ ≤ s₂ ≤ s₃ of the secants of a convex function]
3.14 Proposition (one-dimensional derivative tests). When f is a differentiable function on an open interval O ⊂ IR, each one of the following
conditions is both necessary and sufficient for f to be convex on O:
(a) f′ is nondecreasing on O, i.e., f′(x⁰) ≤ f′(x¹) when x⁰ < x¹ in O;
(b) f(y) ≥ f(x) + f′(x)(y − x) for all x and y in O;
(c) f″(x) ≥ 0 for all x in O (assuming twice differentiability).

Proof. The equivalence between (a) and (c) when f is twice differentiable
is well known from elementary calculus. The proof can be limited to showing
A sufficient, but not necessary, condition for strict convexity is that the
Hessian matrix is positive definite for all x ∈ IRⁿ.

Proof. From the definition of convexity, f is convex on IRⁿ if and only if
it's convex on every line segment. This is equivalent to the property that for
every choice of y and z, the function τ ↦ g(τ) = f(y + τz) is convex, and

    g′(τ) = ⟨z, ∇f(y + τz)⟩,   g″(τ) = ⟨z, ∇²f(y + τz) z⟩.

The asserted conditions for convexity and strict convexity of f are equivalent
to requiring in each case that the corresponding condition in Proposition 3.14
is satisfied for all such functions g.
the notion of tangent, at least in the classical sense, isn't defined; a 'slope'
can't be defined uniquely. One then must be satisfied with something less
than derivatives and gradients, namely with what we are going to call subderivatives and subgradients⁴.

3.18 Definition (subdifferentiation of convex functions). For f: IRⁿ → IR̄
convex, the subderivative at x̄ ∈ dom f is the function df(x̄; ·) defined by

    df(x̄; w) = lim_{τ↘0} ( f(x̄ + τw) − f(x̄) ) / τ;

a vector v is a subgradient of f at x̄, written v ∈ ∂f(x̄), if

    f(x) ≥ f(x̄) + ⟨v, x − x̄⟩ for all x ∈ IRⁿ.

When f is differentiable at x̄, the limit defining df(x̄; w) can be taken with
τ ↗ 0 as well, with the same value, which implies that the subderivative is then
linear. More generally one has:

⁴The prefix 'sub' must be interpreted as identifying 'something less than'. Subderivatives and subgradients can be defined for any function, not just convex ones. The definitions are then somewhat more involved [18, Chapter 8], but in the convex case, they
coincide with those introduced here.
63
f
_
x
_
df(x;. )
_
x
3.19 Exercise (sublinearity of the subderivative). Given f : IR n IR convex and x such that f (
x) is finite, the subderivative w 7 df (
x; w) at x is
sublinear.
Hence, the epigraphs of the subderivatives of convex functions are convex
cones.
Guide. Show that df (
x; ) is convex and positively homogeneous.
Its immediate from the definition that v is subgradient of f at x if and
only if the affine function x 7 a(x) = hv, xi + [f (
x) hv, xi] is majorized by
f and a(
x) = f (
x), cf. Figure 3.18 The set of subgradients could be empty.
Indeed one could very well be at a point x at which there is no affine function
thats majorized bypf and takes on the value f (
x) at x. For example, the
2
function f (x) = 1 (x + 1) + [2,0] at x = 0 has infinite slope and
thus there is no affine function majorized by f that takes on the value 0 at
x = 0. Thus, f (0) = , cf. Figure 3.19; note df (0; 1) = , df (0; 1) =
while df (0; 0) = 0.
3.20 Proposition (subderivatives and subgradients). Let f : IRn IR be
64
CHAPTER 3. PRELIMINARIES
<v, .
>+
[f(x_
)
_
x
<v, _
x>]
_
x
f
x
df (
x; w) = hf (
x), wi w IRn .
Proof. If df (
x; w) = for some w, there is no v such that hv, wi
and consequently f (
x) must be empty. So, in the remainder of the proof
65
if and only if
f (
x + w) f (
x) + hv, x + w xi w IRn , 0,
or equivalently, if and only if
f (x) f (
x) + hv, x xi x IRn ,
since every x IRn can be written as x + w for some w IRn and 0.
And, this is the case if and only if v f (
x).
Certainly f (
x) is convex when its empty, otherwise from the definition
it follows readily that for all [0, 1], v = (1)v 0 +v 1 f (
x) whenever
0 1
v , v f (
x).
If f is differentiable at x, then for all w IRn ,
df (
x; w) = lim 1 [ f (
x + w) f (
x) ] = hf (
x), wi.
0
&0
f 0 (x ) = lim
f (x + ) f (x)
.
&0
f 0 (x+ ) = lim
Since 1 [ f (
x + w) f (
x) ] is non-increasing as & 0, refer also to Fig0
ure 3.15, both f (x ) and f 0 (x+ ) are well defined since monotone sequences
always have limits in the extended reals, and f 0 (x ) f 0 (x+ ).
Because left and right dont have a meaning in IR n for n 2, the expression for
1-dimensional subderivatives and subgradients cant be trivially generalized to the ndimensional case.
5
66
CHAPTER 3. PRELIMINARIES
3.21 Exercise (1-dimensional subderivatives and subgradients). For a convex function f : IR IR,
df (x; 1) = f 0 (x ),
df (x; 1) = f 0 (x+ ),
and
f (
x) = [ f 0 (
x ), f 0 (
x+ ) ],
i.e., the set of subgradients is the interval spanned by the left and right hand
derivatives. Moreover, the subderivative is the max-function:
df (x; w) = max [ f 0 (x )w, f 0 (x+ )w ] = max [ vw v f (x) ].
f (
x ) f (
x) + v(
x x),
w if x < 0,
f (x) = |x| : df (x; w) = |w| if x = 0,
w
if x > 0;
67
2(3x + 2)w
max [ 2w, 4w ]
2(3x + 2)w
f (x) = ex :
if
if
if
if
if
x < 3,
x = 3,
3 < x < 0,
x = 0,
x > 0;
df (x; w) = ex w.
n
Y
fj (
xj ).
j=1
f (x) f (
x) + hv, x xi, x IRn ,
n
n
n
X
X
X
fj (xj )
fj (
xj ) +
vj (xj xj ), x IRn ,
j=1
for all j:
for all j:
j=1
j=1
fj (xj ) fj (
xj ) + vj (xj xj ), xj IR,
vj fj (
xj ).
68
CHAPTER 3. PRELIMINARIES
Clearly, the penultimate assertion implies the previous one. For the converse,
consider those x IRn that are such that for i 6= j, xi = xi and xj IR.
To complete the proof, when the functions fj are differentiable, one simply
appeals to 3.20 and 3.21.
What does all of this lead up to? Subdifferential Calculus allows us
characterize the minimizers of any convex functions. Moreover, in view of
Theorem 3.12, these conditions become necessary and sufficient for a point
x to be a minimizer.
3.24 Theorem (generalized Oresme and Fermat rules). Let f : IR n IR
be convex with f (x ) IR. Then
Oresme rule: x argmin f df (x ; ) 0,
One also has,
Fermat rule: x argmin f 0 f (x ).
Proof. Indeed, x argmin f f (x ) f (x)+h0, xx i for all x IRn
and this occurs if and only if df (x ; w) 0 for all w, and in view of 3.20, if
and only if 0 f (x ).
3.3
69
This expedient abuse of notation never creates any problems because in the
latter case the argument is a point in IRN , and in the former its a subset
of IRN . The support of a probability distribution P on IRN is the smallest
closed set IRN such that P () = 1; this will be made precise below.
Two important classes of probability distributions on IRN are the discrete
and the continuous distributions. We refer to a random variable as being discrete if its support consists of a countable number of points. The distribution
function Pd of such a random variable will have at most a countable number
of jumps and the random variable is said to be discretely distributed. For
example, a random variable with a Poisson distribution,
prob[ = k] = e
k
,
k!
k = 0, 1, . . . , for > 0 ,
70
CHAPTER 3. PRELIMINARIES
1
= prob[ < z]
P(z)
10
12
14
16
18
20
z, d
p ,
Pd (A) =
p .
Ad
When the support is a finite set, the random vector is said to be finitely
supported , or equivalently, to have finite support6 . For example, a random
6
In this book, whenever we deal with random variables that are discrete, they will
nearly always be finitely supported
71
10
12
14
16
18
20
The support of Pc is the closure of the set where the density function is
positive
c := cl{ p() > 0}
In some instances, one may have to distinguish between random vectors with
bounded support, like a uniformly distributed random variable with
(
1/( ) if [, ],
p() =
0
otherwise,
72
CHAPTER 3. PRELIMINARIES
p
p
1
1
2
2
e() /2
2
where
P () = Pd (d ) + Pc (c ) = 1,
and = d c is the union of the support of the discrete and continuous parts. The overriding reason for considering this class of probability
73
distributions is that it includes both the discrete and the continuous cases!
Consequently, anything proved for plain distributions applies to discrete as
well as to continuous distributions. Plain distributions are all whats needed
to deal with all the applications we have in mind. Moreover, its low technology! We dont need to go through to the full foundations of Probability
Theory7 , a superb, but substantial, monument.
All probability distributions on IR are plain, but that isnt the case for all
probability distributions on IRN ; simply think of the probability distribution
associated with a random variable uniformly distributed on the perimeter of
a circle of radius 1 in IR2 . Restricting ourselves to plain probability distributions simplifies the definition of what is to be understood by expectation and,
as already indicated, bypasses the measure theoretic technicalities required
to deal with the general case. So, in this book, all probability distributions
will be assumed to be of this (practical) plain type. With p the probability
associated with the points in the support of the discrete part and p() the
density function not necessarily summing up to 1 associated with the
continuous part, the probability of a set A is given by
Z
X
P (A) =
p +
p() d,
Ad
assuming that A is nice enough so that one can carry out the integration.
The expectation of a function h : IRN IR with respect to such a probability distribution P is then
Z
Z
X
h()p +
h()p() d
E{h()} =
h() P (d) :=
IRN
IRN
All the results for expectation functionals defined by plain distributions, can actually
be proved for expectation functionals defined by (general) probability distributions, often
with the same arguments.
74
CHAPTER 3. PRELIMINARIES
h() P (d) =
IRN
N Z
X
k=1
h()p() d;
k
in this expression, its taken for granted that those pieces k are nice (=
measurable in Measure Theory).
A function h pc-fcns(; IR) is said to be summable, with respect to
a plain distribution P , if E{h()} is finite. A vector-valued mapping v
pc-fcns(, IRn ) is summable, if for each index i, the function vi : IR is
summable.
3.25 Proposition
(properties of the integral).
Let P be a plain probability
h h
h () P (d)
h2 () P (d).
lim
h () P (d) = h() P (d)
Proof. The assertions in (a) and (b) follow immediately from the properties
of sums when P is a discrete distribution, and of standard integration when
P is a continuous distribution. The combination of these two facts yields (a)
and (b) for plain distributions.
The same approach can be used to obtain the Dominated Convergence
(c) when P is a discrete distribution or when P is a continuous distribution
75
and h is continuous on . This still works when h and the h are piecewise
continuous functions and the pieces on which these functions are continuous
are the same, i.e., dont depend on . The statement remains valid without
this additional restriction, but the proof requires more background material
about integration than whats at our disposal.
3.4
Expectation functionals I
defined on IRn and, in general, with values in IR; here P is the distribution of
the random variable . In our present state of affairs, P is a plain probability
distribution and f (, x) pc-fcns(; IR). Usually, we shall denote expectation
functional by the name of the function being integrated, the integrand,
preceded by an E.
Until Chapter 9, we only deal with expectation functionals Ef that are
real-valued and whose integrands (, x) 7 f (, x) are convex in x for all
. This section is concerned with the convexity and the subdifferentiability properties of expectation functionals. We conclude with a remarkable
characterization of the minimizers of expectation functionals.
3.26 Proposition (convex expectation functionals). Let f : IR n IR.
Assume that for all , x 7 f (, x) is convex and that Ef is finite-valued,
where
Z
Ef (x) :=
f (, x) P (d),
76
CHAPTER 3. PRELIMINARIES
In Figure 3.25 Ef is the expectation function of the integrand
(
x
if x ,
f (, x) = 1
2
when x .
2 (x )
f(.,1)
4
3
f(.,2)
4
3
f(.,1)
1
0
2
Ef(.)
0
2
and
Ef (x) =
nZ
o
v() P (d) v() f (, x), v : IRn summable .
Z
= [ lim 1 (f (, x + w) f (, x)) ] P (d).
&0
77
f( . ,x )
v( .)
This interchange of limits and integration is justified by the Dominated Convergence property
assumption.
Ef is finite-valued on IR, by
R 3.25(c), since
n
Let D =
v() P (d) v() f (, x), v : IR summable .
To show that D Ef (
x), simply observe that v D implies Rthat there is
a summable vector-valued function v : IRn such that v = v() P (d)
and
hv(), x xi f (, x) f (, x
), x IRn , .
Integrating on both sides with respect to yields
h
v , x xi Ef (x) Ef (
x),
x IRn ,
78
CHAPTER 3. PRELIMINARIES
v () f (, x) = [f 0 (, x ), f 0 (, x+ )],
v () P (d) = v.
R
n
or equivalently, v : IR such that v() P (d) = 0 and
: x argmin f (, x) hv(), xi .
xIRn
On
R the other hand, if 0 Ef (x ), the existence of a function v such
that v() P (d) = 0 and v() f (, x ) is guaranteed by 3.27. The
equivalence between the two conditions x argminx f (, x) hv(), xi and
v() f (, x ) can be validated via 3.24.
8
To fully appreciate its significance, one really needs to see how it brings to light the
relationship between deterministic and stochastic optimization models that can then be
exploited, at least conceptually, to build a number of solutions procedures for stochastic
optimization models.
3.5
79
where
q(y) = max [ 0, y ].
if x < ,
( )w
( )
df (, x; w) = max [ ( )w, w ], f (, x) = [( ), ]
if x = ,
if x > .
The subderivatives of the expectation functional Ef are then obtained by
integration using the formula in 3.27:
Z
X
dEf (x; w) =
df (, x; w)p + df (, x; w)p() d.
d
A similar formula can be written down for the subgradients, but its informative to go through a more detailed analysis. If
(i) x
/ d , then P ({x}) = 0 and the distribution function of is continuous at x:
Ef (x) = ( )(1 P (x)) + P (x)
= ( ) + P (x).
80
CHAPTER 3. PRELIMINARIES
f(, )
P (x ) px
P (x );
recall that px = P ({x }). Not surprisingly, this is the same optimality
condition as the one obtained in 2.4.
81
if < x,
v() =
v(
x) [ , ],
v() P (d) = 0.
if > x,
Then x is an optimal solution of the producers problem.
Guide. Appeal to 3.28.
3.30 Exercise (expected monitoring costs). Let f (, x) = ( x) where
is the monitoring function introduced in 2.3, i.e.,
if < 0,
0
2
( ) = /2
if [0, ],
/2 if > .
Let be uniformly distributed on [0, 1], i.e., the density function is given by
(
1 if [0, 1],
p() =
0 if
/ [0, 1].
Find analytic expressions for the subderivatives and subgradients of Ef .
Conclude from these calculations that Ef is differentiable.
82
CHAPTER 3. PRELIMINARIES
Chapter 4
LINEAR CONSTRAINTS
Since one can always substitute the constraint fi (x) 0 for fi (x) 0, and
one can replace max f0 (x) by min f0 (x), with no loss of generality, we
choose
min f0 (x)
so that fi (x) 0,
i = 1, . . . , s,
fi (x) = 0, i = s + 1, . . . , m,
x X IRn ;
fi (x) = 0, i = s + 1, . . . , m ,
otherwise.
84
s
\
i=1
lev0 fi
m
\
i=s+1
its the intersection of a finite number of closed convex sets, and thus closed
and convex, cf. 3.1(a).
The epigraph of the objective function f is closed since its the intersection
of the epigraph of a continuous function and the set S IR = epi S . And its
convex for basically the same reason: the epigraph of f0 is convex, cf. 3.5,
and S IR is the product of two convex sets, cf. 3.1(b). Because its epigraph
is convex, the function f itself is convex, again by Proposition 3.5.
So, the condensed version of a convex program,
min f (x)
for
x IRn .
85
4.1
are affine,
86
4.2 Proposition (polyhedral sets). Polyhedral sets are closed and convex.
Proof. A polyhedral set can be expressed as
C=
r
\
l=l
lev0 fl
m
\
l=r+1
where the functions fl are affine. Convexity now follows from 3.1 and 3.6.
Since affine functions are continuous their level sets are closed, and hence C
is closed since its the intersection of a finite number of closed sets.
4.3 Example (box constraints). A set X IRn is a box if its the product
of n closed intervals Xj IR, not necessarily bounded. For instance, the
nonnegative orthant
IRn+ := x = (x1 , . . . , xn ) xj 0 for all j = [0, )n
is a box; and so is IRn itself. Boxes are polyhedral sets.
8698686
9 86
9 86
9 89
8696
8 96
8 96
8 96
8 98
8696
8 96
8 96
8 96
8 98
8696
8 96
8 96
8 96
8 98
8696
8 96
8 96
8 96
8 98
8696
8 96
8 96
8 96
8 98
8696
8 96
8 96
8 96
8 98
5 76
5 76
5 76
5 76
5 76
5 76
5 75
76
5 76
5 76
5 76
5 76
5 76
5 75
75676
5 76
5 76
5 76
5 76
5 76
5 75
75676
5 76
5 76
5 76
5 76
5 76
5 75
75676
5 76
5 76
5 76
5 76
5 76
5 75
75676
87
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
<6
= <=
=<6=<
=<6=<
=<6=<
=<6=<
=<6=<
=<6=<
f (x) =
n
X
j=1
Of course, affine functions are separable and so are quadratic functions when
the matrix Q is a diagonal matrix. A mathematical program (in canonical
form) is a
separable program if for i = 0, . . . , s, all the functions fi are separable,
P
i.e., are of the form fi (x) = nj=1 fij (xj ) with fij : IR IR and the
set X IRn is a box. A linear program is a separable program. For
a quadratic program, one can always find an equivalent separable program; this requires a change of variables predicated by the diagonalization of the positive (semi-)definite matrix, cf. the comments preceding
Theorem 3.15.
4.2
Our goal in this section and the next, is to develop the tools that will enable
us to write down optimality conditions for the linearly constrained convex
88
program:
min f0 (x)
so that hAi , xi bi ,
i = 1, . . . , s,
hAi , xi = bi , i = s + 1, . . . , m,
x X IRn ;
where X is a polyhedral set, not necessarily bounded. One can think of the
vector Ai as the ith row of a matrix A and of bi as the ith entry of a vector
b, and a compact version of this program would then read
min f0 (x) so that Ax ./ b, x X,
where ./ would consist of s less than or equal to-inequalities and m s
equalities. We already know that the feasible set S is polyhedral and that
the effective objective, f : IRn IR, with
f (x) = f0 (x) + S (x),
is convex. Recall that S denotes the indicator function of S; refer to 3.8 for
the properties of indicator functions.
We also know that a minimizer x of the effective objective f , i.e., an optimal solution of our linearly constrained convex program, must obey Oresmes
and Fermats rules 3.24:
df (x ; ) 0,
0 f (x ).
89
4.4 Theorem (sum of subgradients: inclusion). Given any two proper convex functions f, g : IRn IR, for x dom(f + g):
df (x; ) + dg(x; ) = d(f + g)(x; ),
f (x) + g(x) (f + g)(x).
Proof. Because the ratio 1 f (x + w) f (x) monotonically decreases as
& 0, one has
f (x + w) f (x) + g(x + w) g(x)
d(f + g)(x; w) = lim & 0
g(x + w) g(x)
f (x + w) f (x)
+ lim & 0
= lim & 0
hu, x xi f (x) f (
x),
hv, x xi g(x) g(
x),
imply
x IRn :
hu + v, x xi (f + g)(x) (f + g)(
x),
but
(f + g)(0) = IR.
This simple example already pinpoints the cause of the problem, there is no
substantial overlap of dom f and dom g. Their intersection consists exclusively of boundary points of these sets. We need a sum rule that will exclude
such situations.
90
f+g
0
1
0
0.5
2
f
g 0.5
0
91
0.5
4.7 Definition (normals and normal cone). Let C IRn be convex and
x C. A vector v is a normal to C at x, written v NC (
x), if
hv, x xi 0, x C.
T
Since NC (
x) = zCx v IRn hv, zi 0 is the intersection of closed
half-spaces, its a closed convex cone, called the normal cone to C at x.1
> @?
> @?
> @?
> @?
> @?
> @?
> @?
> @>
@?
> @?
> @?
> @?
> @?
> @?
> @?
> @>
@>?@?
> @?
> @?
> @?
> @?
> @?
> C@?
> @>
@>?@?
> @?
> @?
> @?
> @?
> @?
> @?
> @>
@>?@?
A B?
A B?
A B?
A B?
A B?
A B?
A BA
B?
A B?
A B?
A B?
A B?
A B?
A BA
BA?B?
A B?
A B?
A B?
A B?
A C B?
A BA
BA?B?
A B?
A B?
A B?
A B?
A B?
A BA
BA?B?
v
v
0
C D?
C D?
C D?
C D?
C DC
D?
C D?
C D?
C D?
C DC
DC?D?
C D?
C D?
C C D?
C DC
DC?D?
C D?
C D?
C D?
C DC
DC?D?
F?F?F?F?F
F?
F?
F?
F?
FE _
E?
E?
E?
E?
F?
?
F
?
F
?
F
E?
E?E?
E?FEF NC (x)
F?
E?F?
E?F?
E?F?
E?E
E?E?E?0 E?E
One can also define a normal to a set that is not necessarily convex. Instead of
hv, x x
i 0, x C, one requires that hv, x x
i o(|x x|), x C, cf. [18, Chapter
6]. Its easy to show that this more general definition is consistent, in the convex case,
with that used here.
92
Moreover,
C (
x) = NC (
x).
Indeed, since C (
x) = 0,
C (
x) = v hv, x xi C (x), x IRn ,
equivalently,
x argmin f (x) v f (x ) such that v NC (x ).
xC
A versatile chain rule is at the heart of Subdifferential Calculus just like the standard
chain rule is the key to a rich Differential Calculus, but his doesnt come cheap, cf. [18,
Chapter 10]. Because we deal with convex functions and linear transformations, the simple
chain rule stated here will do.
93
4.10 Theorem (chain rule: linear case). Let f (x) = g(Ax + b) where g :
IRm IR is proper, convex and A is a (m n)-matrix, b IRm . Then
f : IRn IR is convex and for x dom f and
(a) u g(Ax + b) = A>u f (x);
Proof. Appeal to 3.10 for the convexity of f . For (a) and (b), one simply
goes back to the Definition 3.18 of the set of subgradients recalling that
hv, Axi = hA>v, xi.
The ingredients missing for a complete proof of (c) are the same as those
required for a proof of the Moreau-Rockafellar sum rule. By relying on 3.21
we could easily sketch a 1-dimensional proof, i.e., when f (x) = g(ax + b)
and a, b are scalars. But thats already covered by (b) when a 6= 0, and the
identity is trivially satisfied when a = 0.
g
y
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IG
IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G _IH
G IH
G IH
G IG
IGHIH
G IH
G IH
G IH
G IH
G 1IH
G IH
G IH
G fIH
G =IH
G 1IH
G IH
G IG
IGHIH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IG
IGHIH
x2
x1
94
4.3
Variational analysis IV
When our convex program is linearly constrained, the feasible set C is polyhedral. This section is concerned with computing the normal cones of polyhedral sets. As we shall see, the optimality conditions of the next couple of
sections are pretty straightforward consequences of Proposition 4.9 and these
formulas for the normal cones of polyhedral sets.
4.11 Exercise (normals of hyperplanes and half-spaces). Let 0 6= a IR n
and IR. When,
C = x IRn ha, xi =
is a hyperplane and for x C:
v NC (
x) IR such that v = a.
When,
C = x IRn ha, xi
v NC (
x) 0 such that v = a, (ha, x
i ) = 0.
POP
_
NC (x)
NMN
_
x
_
NC (x)
RQR
JKJK
L JK
L JK
L JK
L JK
L JK
L JL
JKLK
J LK
J LK
J LK
J LK
J LK
J LJ
JKLK
J LK
J LK
J LK
J LK
J CLK
J LJ
JKLK
J LK
J T LK
J LK
J LK
J LK
J LJ
JKLK
J LK
J STS LK
J LK
J LK
J LK
J LJ
JKLK
J LK
J _ LK
J LK
J LK
J LK
J LJ
JKLK
J LK
J x LK
J LK
J LK
J LK
J LJ
IR . Its immediate that S = a IR NS (0). There remains only
to show that if v
/ S , i.e. v = a + v for some R and 0 6= v S, there
is a x S such that hv, x
i > 0.
95
and x C. Then,
v NC (
x) 1 0, . . . , s 0, s+1 IR, . . . m IR,
such that
v=
m
X
i A i
and for i = 1, . . . , s,
i=1
i (hAi , xi bi ) = 0.
Even more precisely, if x is such that among the s inequalities those with
index i I {1, . . . , s} are satisfied with equality, the remaining ones
holding with strict inequality, i.e., hAi , x
i < bi for i I = {1, . . . , s} \ I, then
{i IR, i > s},
v NC (
x) {i 0, i I}, {i = 0, i I},
such that
v=
m
X
i=1
i A i
and for i I,
i (hAi , xi bi ) = 0.
96
_
NC (x)
_
x
C
0
Figure 4.8: Normal cone to a polyhedral set at x
The conditions
i (hAi , xi bi ) = 0
for
i = 1, . . . , s,
97
IR if x = ,
NC (
x) = 0
if < x < ,
IR+ if x = .
Hence,
(
u 0, u( x) = 0,
w N[,] (
x) u, v such that w = u + v,
v 0, v( x) = 0.
v
NC(x)
98
Qn
j=1
v NC (
x) j : vj NCj (
xj ).
In particular, if C = IRn+ ,
v NIRn+ (
x) j :
vj 0 and vj xj = 0.
Detail. Again this can be derived from 4.12. But more information is gained
from an alternative approach: use the identity NCj (
xj ) = Cj (
xj ) and
Qn
Proposition 3.23 to conclude that when C is a box, NC (
x) = j=1 NCj (
xj ).
The assertion then follow from 4.14.
4.4
Lagrange multipliers
Lets begin our analysis of linearly constrained problems with those that
only involve equality constraints. Optimality conditions are obtained first by
relying on the generalized Fermat rule and the calculus for normals developed
in the previous section. We then take a more classical approach based on
Fermats rule for smooth functions, see 1.1. This allows us to highlight the
demarcation between what can be dealt with in the classical framework and
what falls by the wayside.
4.16 Proposition (convex program: equality constraints). Consider the
convex program,
min f0 (x) so that Ax = b;
where A is an mn-matrix and f0 : IRn IR. Then x is an optimal solution
if and only if
(a) Ax = b,
and there is a y IRm such that
(b) A> y f0 (x ).
In particular, if f0 is also differentiable at x , one must have
(b) A> y = f0 (x ).
Proof. With C = x Ax = b , it follows from 4.9 that x minimizes f0
on C if and only if one can find v f0 (x ) such that v NC (x ). The
99
m
X
i A i .
i=1
Our derivation of the preceding optimality conditions relies on the generalized Fermat rule, but its also possible to deduce these conditions from
the classical version of Fermats rule at least when f0 is smooth, m n, and
the rows Ai of A are linearly independent (the matrix A has rank m); this
last condition is always satisfied after deletion of the redundant equations.
When A is of rank m, one can always, by pivoting for example, find a system
equivalent to Ax = b of the form IxB + N xN = b where xB consists of m
of the n variables x1 , . . . , xn , and xN of the remaining n m variables. So,
without loss of generality, lets assume that actually A = [ I, N ] is of that
form and then xB = (x1 , . . . , xm ), xN = (xm+1 , . . . , xn ). Hence
xB
b
N
x=
=
+
xN .
xN
0
I
With d := (b, 0), D > := [ N > , I ] and g(xN ) = f (d+DxN ), our optimization
problem can be solved as follows: find xN that minimizes g on IRnm and set
x = d + DxN . Because f0 is smooth, so is g and
g(xN ) = D > f0 (d + DxN ) = D > f0 (x ).
Fermats rule tells us that for xN to minimize g, one must have
0 = g(xN ) = D > f0 (x ).
In other words, f0 (x ) must belong to the null space of D > and this can
only occur if f0 (x ) is a linear combination of the rows of [ I, N ] = A, i.e.,
if there exists y IRm such that f0 (x ) = A> y.
This is our condition 4.16(b), except that here we havent used the convexity
of f0 which allowed us to claim that Fermats rule yields a sufficient as well
as a necessary condition for optimality.
100
Except for the matrix notation, the tools used in our second derivation
were those available to J.L. Lagrange3 , when he came up with the preceding
optimality conditions. Actually, Lagrange considered a more general class of
optimization problems, namely,
min f0 (x) so that G(x) = 0,
where f0 : IRn IR and G : IRn IRm are smooth, m n and the mnJacobian of G,
m,n
Gi
G(x) =
,
(x)
xj
i,j=1
has (full) rank m at x . An argument almost identical to the one above4
leads us to the following necessary conditions for optimality:
(i) G(x ) = 0,
(ii) y IRm such that f0 (x ) = G(x )> y.
Taking into account our smoothness assumptions, condition (ii) can also be
written:
y IRm such that x argminx f0 (x) G(x)> y.
Each yi plays the role of a multiplier; the magnitude of yi suggesting how
much the ith constraint affects the choice of the optimal solution. To remind
us that Lagrange was responsible for this breakthrough in characterizing the
optimal solution of equality constrained problems, one refers to the multipliers, y1 , . . . , ym , as Lagrange multipliers.
Excluding some technical embellishments, this is as far as the classical
tools will take us: the minimization of a smooth function f0 on a manifold
M that is differentiable, has no boundaries and is determined
by a number of
smooth equality constraints, i.e., of the form M = x G(x) = 0 . Once we
have to cope with inequality constraints, or even with equality constraints
that dont describe a nice manifold, we have to switch to the more potent
tools of Subdifferential Calculus.
3
Joseph Louis Lagrange (1736-1813) was the leading mathematician of his time. Among
many other achievements one must count the key role he played in the development of the
Metric System.
4
relying on the Implicit Function Theorem to express m of the variables in terms of
the remaining nm variables
4.5
101
Karush-Kuhn-Tucker conditions I
For a twenty five years span, these conditions were known as the Kuhn-Tucker conditions on the basis of a landmark paper of H. Kuhn and A. Tucker [13]. However, in
1976 H. Kuhn became aware that W. Karush, in his unpublished 1939 Master thesis [12],
at the University of Chicago, had actually scooped them. During the 1930s and 40s,
the University of Chicago was a hotbed for research in Optimization, in particular on the
Calculus of Variations. Quite a number of the leaders in the development of Optimal
Control Theory in the 50s and 60s had either been students at the University of Chicago
or had been closely associated with that school.
6
In the literature one finds reference, somewhat indiscriminately, to these multipliers
as KKT-multipliers or Lagrange multipliers. One of the reasons for making a distinction
between Lagrange and KKT-multipliers is to underscore the difference between results
that can be derived via the classical tools of Differential Calculus and those that require
the use of Subdifferential Calculus.
102
(c) for j = r + 1, . . . , n:
f0
Pnxj
(x ) = 0.
f0
(x )
xj
by
lb x ub,
where the matrix A and the vectors h, lb, ub are defined in 2.2. Exploit the
optimality conditions to analyze the properties of the least squares fit.
Guide. To compute the normals of C =
Qn
j=1 [lbj ,
103
All the results in the remainder of this section have to do with the linearly
constrained convex program:
min f0 (x)
so that hAi , xi bi ,
i = 1, . . . , s,
(P)
hAi , xi = bi , i = s + 1, . . . , m,
x X IRn .
>
(c) x argmin f0 (x) hA y, xi x X .
Another version of (c): v NX (x ) such that f0 (x ) 3 A>y + v.
Proof. Let
X0 = x IRn hAi , xi bi , i = 1, . . . , s,
hAi , xi = bi , i = s + 1, . . . , m .
104
X = IRr+ IRnr
105
From 2.8 we already know that there is more than one way to express
a piecewise linear convex function, at least when it consists of two pieces.
The fact that when , one could also write g(x) = max [ y, y ] as a the
optimal value of some minimization problem, namely,
g(x) = min y + y + y + y = x, y + 0, y 0 .
allows us to use this latter expression and integrate it, when convenient, in
a minimization problem; this will be convincingly illustrated in 5.4.
f
s0
sL
s1
1
4.22 Exercise (representation of piecewise linear functions). Let the piecewise linear convex function g : IR IR be defined by
0 + s0 x when x 1 ,
g(x) = l + sl x
when l x l+1 , l = 1, . . . , L 1, (L finite),
L + sL x when x L ,
l=1
l=1
where
S = (z0 , . . . , zL ) 0 z0 , 0 zl l+1 l for l = 1, . . . , L 1, zL 0 .
106
z0 = 1 x;
for l = 1, . . . , L 1,
l x < l+1 :
x L :
zl = 0, l = 1, . . . , L;
(
z0 = 0; zk = k+1 k , k = 1, . . . , l 1;
zl = x l ; zk = 0, k = l + 1, . . . , L;
z0 = 0; zL = x L ;
zl = l+1 l , l = 1, . . . , L 1.
Check that z is feasible and satisfies the optimality conditions with appropriately selected KKT-multipliers; for example, use 4.19 and to expand
condition 4.19(c) appeal to 4.18. Finally, show that the optimal value of this
linear program, as a function of x, matches the definition of g.
Chapter 5
SIMPLE RECOURSE: RHS
Ignoring, conveniently, uncertainty in the formulation of a decision model,
seldom comes without impunity! The solutions suggested by the simplified
deterministic version of the examples in Chapter 2, were misleading the decision maker, even about the type of activities that should be part of the
optimal activity mix. When uncertainty is included in the the formulation of
the product mix problem (2.1), the capacity expansion problem (2.3) and
the Broadway producer problem (2.5), one ends up with a problem whose
decision process fits the following pattern:
decision: x ; observation: ; recourse cost evaluation.
We refer to such problems as stochastic programs with recourse. The cost
evaluation may or may not be simple and its useful to make this distinction
in the selection of solutions procedures. In some instances, this cost evaluation may require solving another optimization problem. For example, in the
capacity expansion problem, the recourse costs were computed by solving a
certain network flow problem. If evaluating the recourse cost doesnt involve
costs associated with making additional (recourse) decisions, the problem is
said to be one with simple recourse. More precisely, a stochastic program
with simple recourse is an optimization problem of the following type:
min f0 (x) + E{Q(, x)}
xSIRn
S = IR+ ,
Q(, x) = max [ 0, ( x) ].
107
108
5.1 Exercise (product mix problem: simple recourse). With = (T, d), the
product mix problem is a stochastic program with simple recourse. Set
X
f0 (x) = hc, xi, S = IR4+ , Q(, x) =
max [ 0, qi (di hTi , xi) ].
i=c,f
5.1
m2
X
i=1
max[ qi yi , qi+ yi ]
q( T x) P (d).
109
110
(SRtndr)
where
(, ) = q( ),
and with
E() =
(, ) P (d),
The functions E and EQ essentially have the same properties since is just
a linear transformation of x. In particular, E is convex. Its finite-valued
when EQ is finite-valued and then,
E() = E{q( )},
as follows from 4.10(c) and 3.27. And, the optimality conditions for (SRtndr)
are easy translations of those for (SRrhs). So, why bother?
In many applications it turns out that , and thus also E, is separable
while EQ is not, i.e.,
(, ) =
m2
X
i=1
i (i , i ), =
m2
X
i=1
qi (i i ),
E() =
m2
X
Ei (i ).
i=1
111
Separability renders all operations that need to be carried out much easier because one is basically dealing with a juxtaposition of 1-dimensional
cases. Subgradients can be calculated with the help of 3.23 and 3.21, one
can get away with just 1-dimensional integration when evaluating integrals,
and one needs only information about the (marginal) distribution of the random variables i rather than about the joint distribution of , usually much
more difficult to estimate.
Before we pursue the analysis of linear stochastic programs with separable simple recourse and outline solution procedures, lets go through one
more example.
5.2
This example is a historical one. It was the first non-trivial, published example of a stochastic program with recourse [9], and to top this, with a solution
procedure that the authors carried out, successfully, by hand1 !
Facing uncertain passenger demand, an airline wishes to allocate airplanes
of various types among its routes in such a way as to maximize expected
revenue. Equivalently, one can minimize (i) operating costs, assumed to be
independent of the number of passengers on the plane, plus (ii) expected
revenue loss from flying with less than a full load of passengers on any given
route. Let
- routes: j = 1, . . . , m2 .
- bi the number of aircraft of type i available, i = 1, . . . , m1 ;
1
We wont review their procedure that relies on the special transportation structure
of the problem for which highly efficient solution procedures are available, but its a very
elegant approach and [9] is certainly on the recommended reading list.
112
SFO
DNV
LAX
so that
Xm2
j=1
Xm1
i=1
aij xij
j=1 l=1
bi , i = 1, . . . , m1 ,
xij 0, i = 1, . . . , m1 , j = 1, . . . , m2 ,
yjl 0, j = 1, . . . , m2 , l = 1, . . . , L.
113
xij 0, i = 1, . . . , m1 , j = 1, . . . , m2 .
The decision variables xij determine the number of aircraft of type i to allocate to route j. The variables j represent the capacity (= the total number
of seats) made available on route j. There is no revenue loss if the demand j
114
on route j exceeds the capacity j made available, but for each empty seat
there is a loss of potential revenue rj , or a cost rj .
In terms of our simple recourse model (SRtndr), the first group of inequalities corresponds to the deterministic constraints2 . In compact notation,
the deterministic equivalent program has the form:
min hc, xi +
x,
m2
X
j=1
Ej (j ) so that Ax b, T x = , x 0.
(AA)
5.3
Our model for a stochastic program with separable simple recourse, random
rhs, will be:
min hc, xi + E
x,
m2
nX
i ( i , i )
i=1
(SSRtndr)
so that Ax = b, T x = , x 0,
where for i = 1, . . . , m2 ,
i (i , i ) = qi (i i ),
qi : IR IR is convex, and Pi is the (marginal) probability distribution of
the random variable i , its support is i IR. With
Z
Ei (i ) = E{i (i , i )} =
i (, i ) Pi (d),
i
i=1
If so desired, by adding a (non-negative) slack variable one can always turn an inequality constraint into an equality constraint. Namely, for the inequality ha, xi one
can substitute ha, xi + s = and s 0
115
5.3 Exercise (piecewise linear recourse cost function). The recourse cost
functions of the producer problem (2.4 & 3.5), the product mix problem
(2.1 & Exercise 5.1) and the aircraft allocation problem (5.2) are all of the
following type (ignoring subscripts, for now):
(
y if y 0,
(, ) = q( ) where q(y) = max [ y, y ] =
y if y 0;
with . Show that
Z
Z
E() =
( ) P (d) +
( ) P (d)
Z
= (E{} ) + ( ) P ()
P (d) ,
where
d() = ( )P () .
116
xj
j
j = [ j,)
5.5 Exercise (piecewise linear recourse, discrete distribution). As in Exercise 5.3, let (, ) = max [ y, y ] with with a discretely distributed
random variable such that
prob [ = l ] = pl ,
l = 1, . . . , L,
with
[ ( )R( l )] + [( )P ( l ) ]
E() =
117
= prob[ l ] and
if < 1 ,
if l < l+1 ,
l = 1, . . . , L 1,
if L ,
1
2
y 2 if y < ,
= 12 1 y 2
if y [ , ],
1
y 2 2 if y > .
Detail. With (, ) = q( ), one has
if < + ,
n d(, ) o
1
(, ) =
() = ( ) if + + ,
if > + .
118
x playing the role of an index ( the index i in 3.7). Functions of this type
can be viewed as monitoring compliances with certain constraints. They
are special instances of a rich class of functions, called monitoring functions, that will be explored in 6.4. Of course, not all simple recourse costs
functions are linear-quadratic, e.g., reliability consideration might lead to
q(y) = max [ 0, (ey 1) ] with , > 0. In fact, in our separable simple recourse model, we only assume that the functions qi are convex and
finite-valued.
Anyway, returning to our stochastic program with separable simple recourse (SSRtndr), since by separability, cf. 3.23,
Xm2
i=1
Ei (i ) =
m2
Y
E(i ),
i=1
the optimality conditions will follow from those derived in the preceding
chapter.
5.7 Proposition (KKT-conditions: separable simple recourse, rhs). As
long as for i = 1, . . . , m2 , Eqi is finite-valued, a pair (x , ) is an optimal
solution of (SSRtndr) if and only if one can find KKT-multipliers u IRm1
and for i = 1, . . . , m2 , summable KKT-multipliers vi : i IR such that
(a) Ax = b, T x = ;
(b) for i = 1, . . ., m2 : i , vi () q i ( i );
(c) x argmin hc A>u T >v, xi x RIRn+
where for i = 1, . . . , m2 , vi = E{vi (i )} = i vi () Pi(d).
Proof. Again we go straightforward to Corollary 4.20 making use of the
separability of the objective
R function and the (general) identity for the subgradients Ei (i ) = { i vi () Pi(d)} with vi () qi ( i ).
119
Proof. By 5.8, the functions Eqi are finite-valued hence, the optimality conditions of the proposition apply. Conditions (c) of the proposition and that
of the corollary are equivalent by 4.17. As to conditions (b), the equivalence
comes via Proposition 3.27 since for i ,
(
i if < i ,
and vi () [ i , i ] if = i ,
vi () =
i if > i ,
as follows from 5.3.
Lets note that condition (b) can also be stated: for i = 1, . . . , m2 ,
Pi (i ) pi
i vi
Pi (i ),
i i
which should bring to mind the optimality criterion in 2.4 for the Broadway
producer problem!
5.4
Lets now return to the aircraft allocation problem of 5.2 and exploit what
we have learned about stochastic programs with separable simple recourse
to design a solution procedure when the passenger demand is, or is approximated by, a discrete distribution. The recourse costs are then given by
j (j , j ) = max [ rj (j j ), 0 ],
Ej (j ) =
Lj
X
l=1
j (jl , j )pjl ,
120
where for l = 1, . . . , Lj ,
prob [ j = jl ] = pjl
with
Pl
k=1
0
Ej (j ) = rj Rj (jl ) + rj Pj (jl )j
rj j + rj j
Pl
k=1
if j < j1 ,
if jl j < jl+1 , l = 1, . . . , Lj 1,
L
if j j j .
Lj
P
X
L
j
1 +
zjLj 0,
j
l=1 zjl
rj
Pj (jl )zjl j
Ej (j ) = min
0 zjl jl , l = 1, . . . , Lj 1,
z IRLj
j
l=1
where jl = jl+1 jl , Introducing this representation for Ej in our formulation of aircraft allocation problem AA in 5.2, and with
djl =
rj Pj (jl )
= rj
l
X
pjk ,
k=1
one has,
min
x,z
so that
Xm1 ,m2
i,j=1
Xm2
j=1
Xm1
i=1
cij xij +
Lj
m2 X
X
bi , i = 1, . . . , m1 ,
aij xij
tij xij
djl zjl
j=1 l=1
Lj
X
l=1
zjl j1 , j = 1, . . . , m2 ,
xij 0, i = 1, . . . , m1 , j = 1, . . . , m2 .
0 zjl jl , l = 1, . . . , Lj 1, zjLj 0,
j = 1, . . . , m2 .
121
5.2, this is a much preferable situation than having to deal with the extenQ 2
sive version of the problem that involves m1 + m2 ( m
j=1 Lj ) constraints. In
fact, if we recklessly decided to replace the given problem by a deterministic version, say, by replacing the random variables j by their expectations
or some other estimates, the resulting linear program would also be one
with m1 + m2 linear constraints. So, our analysis has paid off extremely well.
Solving the stochastic version of the problem will essentially require the same
computational effort as solving a questionable deterministic version!
5.10 Exercise (aircraft allocation: numerical example). Set up and solve
the following mini-example3 of the aircraft allocation problem:
b = [10, 19, 25, 15],
aircraft
A
B
C
D
routes:
16
?
?
9
1
capacities: tij
15 28 23
10 14 15
5
?
7
11 22 17
2
3
4
81
57
29
55
5
200 : 0.2
50 : 0.3
140 : 0.1
10 : 0.2
580 : 0.1
demand : probability
220 : 0.05 250 : 0.35 270 : 0.2
150 : 0.7
160 : 0.2
180 : 0.4 200 : 0.2
50 : 0.2
80 : 0.3
100 : 0.2
600 : 0.8
620 : 0.1
300 : 0.2
220 : 0.1
340 : 0.1
10
?
?
7.34
1
solution: xij
0
0
0
12.85 0.82 5.33
4.31
?
0
0
7.66
0
2
3
4
0
0
20.69
0
5
This example was used by Ferguson and Dantzig to illustrate their method [9].
122
This solution assigns each available aircraft to some route, and the capacities
made available on routes j = 1, . . . , 5, are
= T x = [ 226.07, 150, 180, 80, 600 ].
Check the optimality conditions of Corollary 5.9 with
u = [ 138, 39.84, 17.42, 70.75 ],
the KKT-multipliers. These multipliers are part of the output available when
solving linear programs; the Matlab function linprog returns these multipliers as part of the lambda-structure.
One can compare the solution with that obtained when the random variables are replaced by their expectations, = E{}, and one solves the resulting linear program:
x0 .
xd = argmin hc, xi Ax b, T x = ,
P 2
d
5.5
Approximations I
Following our approach, it takes less than the blink of an eye, to set up
and solve the aircraft allocation problem of the previous section, of course,
not counting the time to enter the data. This naturally raises the following
question: If some or all of the random variables are not discretely distributed,
wouldnt it be possible to substitute for the given distribution functions,
approximating discrete distributions, so that the solution of (SSRtndr) with
these approximating distributions could be substituted, at little cost, for x .
This will be a recurring issue that will have to be dealt with at different
levels. For now, lets just deal with the specific situation at hand. Dropping
subscripts for now, let P : IR [0, 1] is a continuous distribution thats
strictly increasing on the interval IR, the support of . From 5.3, with
and = E{},
Z
E() := ( ) + ( ) P ()
P (d) ;
5.5. APPROXIMATIONS I
123
v
=: ,
The space of distribution functions with a fixed number of jumps is not a linear space.
mesh refers here to the maximum of the distance between two adjacent quantiles
6
This is not a proof but its certainly believable. One can actually prove that the
optimal solutions converge, but we wont go into this until Chapter ??
5
124
P
Q
What our analysis has also revealed is that approximating the given problem by replacing continuous distributions by discrete ones has its drawback.
For Q to be a good approximation of P with respect to the quantile-distance,
one should use a very fine mesh, and consequently Q will come with many
(small) jumps. And this will result in having to solve a linear program with
many variables, although still with only m1 + m2 linear constraints. So, that
doesnt really invalidate this approach, but one wonders if another approach
might not yield better results.
Instead of approximating P by a piecewise constant distribution, lets
consider a piecewise linear approximation as illustrated in Figure 5.5. Its
obvious that when we are limited to a prescribed number of pieces, we can
more easily match the quantiles of P with a piecewise linear Q that with
a piecewise constant one; in fact, dramatically so. But the question is now
what type of problem will we have to solve if the distribution functions are
piecewise linear. It turns out to be a separable quadratic program that can
be solved almost as efficiently as linear programs of the same size7 .
5.5. APPROXIMATIONS I
125
P
Q
when 1 ,
0
P () = P (l ) + sl ( l ) when l l+1 ,
1
when L ,
l = 1, . . . , L 1,
with 1 < 2 < < L ; 1 , . . . , L are called the nodes of the distribution
functions P . Note that the coefficients sl are necessarily nonnegative, otherwise P would not be monotone nondecreasing. Let = E{}; its always
finite. Then, with = ( 1 ),
P
L
L1
= 1 z0 + Ll=1 zl ,
X
X
a l zl +
bl zl2
E() = + min
z0 0, zL 0,
z
l=0
l=1
0 zl l , l = 1, . . . , L 1,
where
al = + ( )P (l ), l = 0, . . . , L,
bl = ( )sl /2, l = 1, . . . , L 1,
l = l+1 l , l = 1, . . . , L 1.
Detail. One begins with 5.3 and then proceed as in 5.4 or 5.5. Finally,
one appeals to the same arguments as those used in 4.22 to obtain the representation of E as the optimal value of a separable quadratic program.
Furthermore, one can also find explicit expressions for dE and E.
126
More general approximations schemes for stochastic and other optimization problems will be explored in Chapters 11 and 16.
296
Appendix A
Notation and terminology
The notation is pretty much that of Analysis and Linear Algebra textbooks
with one possible exception: h, i for the vector-inner product.
IR the real numbers, IR = [ , ] the extended reals, IRn the ndimensional vector space.
IN = {1, 2, . . . } the natural numbers. Sequences are usually be indexed
by IN , subsequences by k with k as k .
Generally, lower case letters x, y, a, b, . . . are be reserved for vectors,
upper case letters A, T, W, . . . for matrices, and Greek letters , , . . .
for scalars.
The inner product of two vectors a, x IRn is denoted by ha, xi =
Pn
j=1 aj xj . By adopting this notation, one never has to specify if vectors
are row or column vectors and ha, xi = hx, ai.
Matrix multiplication: Ax is the vector in IRm resulting from multiplying the vector x IRn by the m n-matrix A. For IR, x is the
-multiple of x and the entries of A are those of A multiplied by .
Aj is the jth column, while Ai denotes the ith row of A; A> is the
transpose of A. One has hy, Axi = hA> y, xi.
Unless otherwise specified, the norm of x IRn is the euclidean norm,
i.e., |x| = hx, xi1/2 . The (closed) ball of radius centered at x is
297
298
I
R
such
that |x| = 1. When C = (x1 , x2 , x3 ) x1 = 0, x22 + x23 1 , the set C
is closed, but int C = and then bdry C = C.
A.1
299
a fence and then bisect the desert by a fence running (say) north and south.
The lion is in one half or the other; bisect this half by a fence running east
to west. The lion is now is one of two quarters; bisect this by a fence and so
on: the lion ultimately becomes enclosed by an arbitrarily small enclosure.
in applying this idea to our problem is that a sequence
The essential point
x C, IN lies in a bounded set (contained in C). Because C is
bounded it is enclosed in a (square) box whose sides are less than or equal
to > 0. Rather than bisecting C, let us split it in 2n parts, each one
contained in a box whose sides are less than or equal to /2. Then at least
one of the parts
must contain an infinite number of members of the sequence
x , IN . Let C1 be the portion of C containing an infinite number of
points x and split C1 . Again one of the parts, say C2 , must contain an
infinite number of points of the subsequence in C1 . Continue this process.
We obtain a nested sequence of subset of C (C C1 C2 ), each
containing an infinite number of points of the sequence and the sides of Ck of
diameter less than or equal to /2k . The low (coordinate-wise) end points of
all boxes Ck are bounded above (again coordinate-wise, using the fact that C
is bounded) and so there is a least upper bound (again coordinate-wise),
say x. Every neighborhood of x contains some Ck since the length of the
sides of the sets Ck goes to 0. That x is a cluster point of the sequence
x C, IN .
A.3 Theorem (Existence of a minimizer) Given any real-valued continuous
function f defined on a compact set C IRn one can find a point x C at
which the minimum of f on C is actually attained; i.e., such that f (x )
f (x) for all x C. One then writes x argminxC f (x).
A.2
Function expansions
X
f (k) (x0 )
k=0
k!
(x x0 )k
f 000 (x0 )
f 00 (x0 )
2
(x x0 ) +
(x x0 )3 +
= f (x0 ) + f (x0 )(x x0 ) +
2
6
0
300
for some x1 such that x0 < x1 < x or x < x1 < x0 (if x < x0 ).
(b) Multivariate Taylors formula. For f : IRn IR of class C r
f (x + y) = f (x) + hf (x), yi +
hy, 2 f (x)yi + + D r f (x + y)
2
r!
D f (x) =
n X
n
X
i1 =1 i2 =1
n
X
ir =1
y i1 y i2 y ir
r f
(x).
xi1 xi2 xir
Bibliography
[1] J.-P. Aubin and H. Frankowska. Set-Valued Analysis. Birkhauser Boston
Inc., 1990.
[2] R.G. Bland. New pivoting rules for the simplex method. Mathematics
of Operations Research, 2:103107, 1977.
[3] J.M. Borwein and A.S. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2000.
[4] F.H. Clarke. Optimization and Nonsmooth Analysis. Wiley, 1983.
[5] R.W. Cottle.
On dodging degeneracy in linear programming.
Manuscript, Stanford University, 200?
[6] R.W. Cottle, J.-S. Pang, and R.E. Stone. The Linear Complementarity
Problem. Academic Press, 1992.
[7] G.B. Dantzig. Linear Programming and Extensions. Princeton University Press, 1963.
[8] J.H. Dula and F.J. Lopez. Algorithms for the frame of a finitely generated unbounded polyhedra. Manuscript, University of Mississppi, 2002.
[9] A.R. Ferguson and G.B. Dantzig. The allocation of aircraft to routes. an
example of linear programming under uncertain demand. Management
Sciences, 1:4573, 1958.
[10] P.E. Gill, W. Murray, and M.H. Wright. Practical Optimization. Academic Press, 1981.
301
302
BIBLIOGRAPHY
Index
active constraints, 95
affine mapping, 49
affine set, 158, 181
argmin, 2
convex, 54
strictly, 51, 53
convex combination, 48
convex hull, 48, 161
convex polyhedron, 85
convex program, 84
convex system, 194
convex-concave bifunction, 128
ball, 298
barrier method, 134
bifunction, 128
convex-concave, 128
effective domain, 128
bivariate function, 128
Blands rule, 179
boundary, 90, 298
set, 183
density function, 71
derivative, 2
diag, 134
direction of descent, 7
discrete distribution, 69
discrete random variable, 69
dual problem, 131
effective domain, 52
bifunction, 128
effective objective, 83
epi-multiplication, 203
epi-polyhedral function, 169
epi-sum, 203
epigraph, 53
convex, 54
Euler equation, 19
expectation functional, 68, 75
extended arithmetic, 53
extensive version, 28, 112
extreme point, 173
closed
set, 183
closure, 90, 183, 298
cluster, 298
complementary slackness, 96
concave function, 51
cone, 50, 158
constraint qualification, 194
continuity, 185
continuous distribution, 71
convex
function, 51
locally, 6
set, 48
304
feasible set, 83
Fermat rule, 2, 3
finite generation, 162
finite support, 70
frame, 163
function
affine, 53, 85
continuous, 185, 298
convex, 51
epi-polyhedral, 169
improper, 88
inf-compact, 186
linear-quadratic, 117, 139
Lipschitz continuous, 298
monitoring, 118, 139
piecewise continuous, 74
proper, 88
quadratic, 61
separable, 67, 87
smooth, 3
sublinear, 57
summable, 74
support, 139
INDEX
interior, 90, 298
interior-point method, 134, 136
Jacobian, 13, 100
KKT-conditions, 101
KKT-multipliers, 101
half-space, 157
closed, 94
Hessian matrix, 10
hyperplane, 94, 157
hypograph, 53
indicator function, 55
inf-compact function, 186
inf-projection, 56
integrand, 75
game
zero-sum, 130
gradient, 3
INDEX
local, 7
Minkowski theorem, 161
monitoring function, 37, 118, 139
Moreau-Rockafellar sum rule, 90
Newton direction, 10
Newton method, 10
Newton-Raphson method, 13
nonlinear program, 36
norm, 297
normal, 91
normal cone, 91
normal equation, 5
Oresme rule, 2, 3
outer product, 15
penalty function, 37
plain distribution, 72
polar, 160
polyhedral set, 85, 157
polytope, 162
positive definite, 60
positive hull, 158
positive semi-definite, 60
positively homogeneous, 57
primal problem, 131
primal-dual interior-point method,
134
probability distribution, 69
product
sets, 49
proper subsets, 298
quadratic program, 31, 36, 86, 104,
132
Quasi-Newton
condition, 15
305
method, 15
random technology, 145
ray, 50
saddle function, 129
saddle point, 129
semicontinuity
lower, 185
upper, 185
separable program, 87
set
boundary, 298
bounded, 298
closed, 298
compact, 298
interior, 298
open, 298
simple recourse, 107, 145
slack variable, 114
steepest descent, 7
step size
Armijo, 9
stochastic program
with recourse, 27
stochastic program with recourse,
39
subderivative, 62
subgradient, 62
sublinear function, 57
sublinear polyhedral, 169
sum rule
Moreau-Rockafellar, 90
support function, 139
unit ball, 298
upper limit, 185
upper semicontinuity, 184, 185
306
usc
function, 184
usc at a point, 185
usc function, 185
value function, 170
vertex, 173
vertical closure, 56
Weyl theorem, 158
zero-sum game, 130
INDEX