Lecture Notes Stochastic Optimization-Koole
Lecture Notes Stochastic Optimization-Koole
http://www.math.vu.nl/obp/edu/so
Ger Koole
Department of Mathematics, Vrije Universiteit Amsterdam, The Netherlands
http://www.math.vu.nl/koole, koole@few.vu.nl
22nd January 2006
1 Introduction
This course is concerned with dynamic decision problems. Dynamic refers to the notion of
time, i.e., decisions taken at some point in time have consequences at later time instances.
This is, historically speaking, the crucial difference with linear programming. For this reason
the main solution technique is called dynamic programming (Bellman [2]). Often the evolu-
tion of the problem is subject to randomness, hence the name stochastic dynamic programming
(cf. Ross [14]). Nowadays the usual term is (semi-)Markov decision theory, emphasizing the
connection with (semi-)Markov chains. Note also that dynamic programming is usually identi-
ed with a solution technique, and not with a class of problems.
The focus is on practical aspects (especially on the control of queueing systems), what we
can do with it.
Some typical examples of Markov decision chains are:
- admission control to a queue. Objective: minimize queue length and rejections. Dynamics:
Decisions inuence future queue length.
- routing in a call center. Objective: minimize waiting of calls. Dynamics: routing decisions
inuence future availabilities.
- investment decisions, when to buy a new car, etc.
During most of this course we impose the following restrictions on the problems considered:
- there is a single decision maker (no games or decentralized control);
- discrete state spaces and thus discrete events (continuous state space and thus also possibly
continuously changing states: system theory)
Prior knowledge: basic probability theory (e.g., Poisson process) and some programming
experience.
1
Koole Lecture notes Stochastic Optimization 22nd January 2006 2
2 Refresher probability theory
(copied from lecture notes modeling of business processes)
The Poisson process and the exponential distribution play a crucial role in many parts of these
lecture notes. Recall that for X exponentially distributed with parameter holds:
F
X
(t) =P(X t) = 1e
t
, f
X
(t) = F
/
X
(t) = e
t
,
EX =
1
, and
2
(X) =E(X EX)
2
=
1
2
.
We start with some properties of minX,Y if both X and Y are exponentially distributed (with
parameters and ) and independent:
P(minX,Y t) = 1P(minX,Y >t) = 1P(X >t,Y >t) =
1P(X >t)P(Y >t) = 1e
t
e
t
= 1e
(+)t
.
Thus minX,Y is again exponentially distributed with as rate the sum of the rate. Repeating this
argument shows that the minimum of any number of exponentially distributed random variables
has again an exponential distribution. We also have:
P(X Y[ minX,Y t) =
P(X Y, minX,Y t)
P(minX,Y t)
=
P(X Y, X t,Y t)
P(X t,Y t)
=
P(X Y, X t)
P(X t)P(Y t)
=
_
t
_
x
e
x
e
y
dydx
e
t
e
t
=
_
t
e
x
e
x
dx
e
t
e
t
=
+
e
t
e
t
dx
e
t
e
t
=
+
.
This means that the probability that the minimum is attained by X in minX,Y is proportional
to the rate of X, independent of the value of minX,Y.
A nal extremely important property of the exponential distribution is the fact that it is mem-
oryless:
P(X t +s[X >t) =
P(X t +s, X >t)
P(X >t)
=
P(X t +s) P(X t)
e
t
=
e
t
e
(t+s)
e
t
= 1e
s
=P(X s).
We continue with characterizing the Poisson process with rate . A Poisson process is a
counting process on R
+
, meaning that it counts events. The Poisson process is commonly
dened by taking N(s, t), the number of events in [s, t], equal to a Poisson distribution with
parameter (t s):
P(N(s, t) = k) = e
(ts)
((t s))
k
k!
.
Koole Lecture notes Stochastic Optimization 22nd January 2006 3
Next to that we assume that the numbers of arrivals in disjunct intervals are stochastically inde-
pendent.
One of its many equivalent characterizations is by the interevent times, which are independent
and exponentially distributed with parameter . That this is equivalent can be seen by looking at
the probability that there are no events in [s, t]:
P(next event after s +t[event at s) =P(N(s, s +t) = 0) = e
t
.
Thus the time until the kth event has as distribution a sum of exponentially distributed random
variables, which is commonly known as Gamma or Erlang distribution with shape parameter k
and scale parameter .
Note that, thanks to the properties of the exponential distribution, the superposition of two
Poisson processes is again a Poisson process with as rate the sum of the rates.
Finally a few words on conditioning. Let A and B be events in some probability space. Then
P(A[B), the probability of A given B, is dened as
P(A[B) =
P(AB)
P(B)
.
Now P(A) = P(AB) +P(AB
c
) = P(A[B)P(B) +P(A[B
c
)P(B
c
). This is called the law of total
probability. It can be generalized as follows: let B
1
, B
2
, . . . be events such that B
i
B
j
= / 0, and
k=1
B
k
A. Then
P(A) =
k=1
P(A[B
k
)P(B
k
).
Exercise 2.1 (thinning Poisson processes) Consider a Poisson process with rate . Construct
a new point process by selecting independently each point of the Poisson process with the same
probability p.
a. Show that the interarrival times of the new process are exponentially distributed, and give the
parameter (hint: use the law of total probability).
b. Prove that the new process is again a Poisson process (check all conditions!).
Exercise 2.2 Let X and Y be i.i.d. exponentially() distributed.
a. Compute P(X t, X +Y >t).
b. Explain why P(N(0, t) = 1) =P(X t, X +Y >t) with N as above.
c. Verify the answer of a) using the Poisson distribution.
3 Markov chains: forward recursion
AMarkov chain is a dynamic process taking values in its state space X. Fromone time to the next
it changes state occurding to its transition probabilities p(x, y) 0, x, y X,
yX
p(x, y) = 1.
Let the r.v. X
t
be the state at t, with distribution
t
. Then for t > 0
t
(x) =
t1
(y)p(y, x),
Koole Lecture notes Stochastic Optimization 22nd January 2006 4
0
is given. Thus p(x, y) should be interpreted as the probability that the chain moves to state y
given it is in state x.
We will make the following three assumptions. Relaxing any of these is possible, but usually
leads to additional constraints or complications. Moreover, in most practical situations all three
constraints are satised. Before formulating the assumptions it is convenient to dene the notion
of a path in a Markov chain.
Denition 3.1 (path) A sequence of states z
0
, z
1
, . . . , z
k1
, z
k
X with the property that
p(z
0
, z
1
), . . . , p(z
k1
, z
k
) > 0 is called a path from z
0
to z
k
of length k.
Assumption 3.2 [X[ < .
Assumption 3.3 There is at least one state x X, such that there is a path from any state to x.
If this is the case we call the chain unichain, state x is called recurrent.
Assumption 3.4 The gcd of all paths from x to x is 1, for some recurrent state x. If this is the
case we call the chain aperiodic.
Dene the matrix P as follows: P
xy
= p(x, y). Then
t
=
t1
P, and it follows immediately
that
t
=
0
P
t
. For this reason we call p
t
(x, y) = P
t
xy
the t-step transition probabilities.
Theorem 3.5 Under Assumptions 3.23.4, and for
0
some arbitrary distribution, lim
t
t
=
P, (1)
independent of
0
.
Theorem 3.5 gives an iterative procedure to compute
, then
t
=
for all t.
Writing out the matrix equation
P gives
(x) =
(x) = 1 is added.
If we skip one of the assumptions, then Theorem 3.5 does not hold anymore. We give three
examples, each one violating one of the assumptions.
Example 3.6 (necessity Assumption 3.2) Let X = N
0
,
0
(0) = 1, p(0, 1) = 1 and p(x, x +1) = p(x, x
1) = 1/2 for all x > 0. Any solution to Equation (1) has
0
(x +1) =
0
(x). This, combined with the
countable state space, leads to the fact that the normalizing assumption cannot be satised.
Example 3.7 (necessity Assumption 3.3) Let X =0, 1 and p(x, x) = 1. There is no path from 0 to 1 or
vice versa. It holds that
=
t
=
0
, and thus the limiting distribution depends on
0
.
Koole Lecture notes Stochastic Optimization 22nd January 2006 5
Example 3.8 (necessity Assumption 3.4) Let X =0, 1 and p(0, 1) = p(1, 0) = 1. Then all paths from
0 to 0 and from 1 to 1 have even length, the minimum being 2: the chain is periodic with period 2. We
nd
2t
=
0
and
2t+1
(x) = 1
0
(x). There is no limiting distribution, unless
0
(0) = 1/2.
Forward recurrence can be interpreted as the simultaneous simulation of all possible paths
of the Markov chain. Compare this with regular simulation where only one path of the Markov
chain is generated. From this single path the stationary probabilities can also be obtained:
T
1
T1
t=0
IX
t
= x
(x) a.s., because of the law of large numbers. Thus with probabil-
ity 1 every path of the Markov chain visits a state x a fraction of times that converges to
(x)
in the long run. The stationary or limiting distribution
.
4 Markov reward chains: the Poisson equation
Quite often we have direct rewards attached to states. Then we are interested in the limiting
expected rewards.
Now we have, next to X and p(x, y), r(x) R, the direct reward that is obtained each time
state x is visited. Thus, instead of the distribution of X
t
, we are interested in Er(X
t
), and especially
in its limit for t . This number g is given by g =
T1
t=0
r(X
t
)/T g a.s.
It is not possible to generalize this method to include actions: actions depend on future
behavior, and under simulation/forward recursion only the history is known.
Thus a backward recursion method is needed for optimization, one that already takes the
(possible) future(s) into account. We need some notation to develop this crucial idea.
Let V
T
(x) be the total expected reward in 0, . . . , T 1 when starting at 0 in x:
V
T
(x) =
T1
t=0
yX
p
t
(x, y)r(y) = E
T1
t=0
r(X
t
) with X
0
= x. Note that
x
(x)V
T
(x) =
(x)
T1
t=0
y
p
t
(x, y)r(y) =
T1
t=0
(x)p
t
(x, y)r(y) =
T1
t=0
(y)r(y) = gT.
Let V(x) = lim
T
[V
T
(x) gT]. Then V(x) is the total expected difference in reward between
starting in x and starting in stationarity.
Koole Lecture notes Stochastic Optimization 22nd January 2006 6
We calculate V
T+1
(x) in two different ways. V
T+1
(x) = V
T
(x) +
T
(y)r(y) for
0
with
0
(x) = 1. As
T
and
(x)r(x) = g, V
T+1
(x) =V
T
(x) +g +o(1), where o(1) means
that this term disappears if t . On the other hand, for V
T+1
the following recursive formula-
tion exists:
V
T+1
(x) = r(x) +
y
p(x, y)V
T
(y). (2)
Thus
V
T
(x) +g+o(1) = r(x) +
y
p(x, y)V
T
(y).
Subtract gT from both sides, and take T :
V(x) +g = r(x) +
y
p(x, y)V(y). (3)
This equation is also known as the Poisson equation. Note that V represents the information on
the future.
Note however that Equation (3) does not have a unique solution: If V is a solution, then so is
V
/
(x) =V(x) +C. There are two possible solutions: either take V(0) = 0 for some reference
state 0, or add the additional condition
x
(x)V(x) = 0.
5 Markov decision chains: policy iteration
Finally we introduce decisions. Next to the state space X we have an action set A. The idea is
that depending on the state X
t
an action A
t
is selected, according to some policy R : X A. Thus
A
t
= R(X
t
).
Evidently the transition probabilities also depend on the action: p(x, a, y) is the probability
of going from x to y when a is chosen. Also the rewards depend on the actions: r(x, a).
Assumption 5.1 [A[ < .
We also have to adapt the assumptions we made earlier. Assumption 3.2, which states that
[X[ <, remains unchanged. For the other two assumption we have to add that they should hold
for any policy R.
Koole Lecture notes Stochastic Optimization 22nd January 2006 7
Assumption 5.2 For every policy R there is at least one state x X (that may depend on R),
such that there is a path from any state to x. If this is the case we call the chain unichain, state x
is called recurrent.
Assumption 5.3 For every policy R the gcd of all paths from x to x is 1, for some recurrent state
x. If this is the case we call the chain aperiodic.
Let V
R
t
(x) be the total expected reward in 0, . . . , t 1, when starting at 0 in x, under policy R.
We are interested in nding argmax
R
lim
T
V
R
T
(x)/T, the maximal average expected long-run
reward. This maximum is well dened because the number of different policies is equal to
[X[[A[, and thus nite.
How to compute the optimal policy?
1. Take some R.
2. Compute g
R
and V
R
(x) for all x.
3. Find a better R
/
. If none exists: stop.
4. R := R
/
and go to step 2.
This algorithm is called policy iteration. Step 2 can be done using the Poisson equation: for
a xed policy R the direct rewards are r(x, R(x)) and the transition probabilities are p(x, R(x), y).
How to do step 3? Take
R
/
(x) = argmax
a
r(x, a) +
y
p(x, a, y)V
R
(y)
in each state. If the maximum is attained by R for each x then no improvement is possible.
Example 5.4 (Replacement decisions) Suppose we have a system that is subject to wear-out, for example
a car. Every year we have to pay maintenance costs, that are increasing in the age of the system. Every
year we have the option to replace the system at the end of the year by a new one. After N years we are
obliged to replace the system. Thus X =1, . . . , N, A =1, 2, action 1 meaning no replacement, action
2 replacement. Thus p(x, 1, x +1) = 1 for x < N and p(x, 2, 1) = p(N, 1, 1) = 1 for all x. The rewards are
given by r(x, 1) =C(x) for x < N and r(x, 2) = r(N, 1) =C(x) P for all x, with P the price of a new
system. Consider a policy R dened by R(1) = 1 and R(x) = 2 for x > 1, thus we replace the system if its
age is two years or higher. The Poisson equations are as follows:
V
R
(1) +g
R
=C(1) +V
R
(2), V
R
(x) +g
R
=C(x) P+V
R
(1) for x > 1.
We take V(1) = 0. Then the solution is g
R
=
C(1)C(2)P
2
and V
R
(x) =g
R
C(x) P, x > 1. Next we
do the policy improvement step, giving
R
/
(x) = argmaxC(x) +V
R
(x +1), C(x) P+V
R
(1) = argmaxg
R
C(x +1), 0 for x < N.
If we assume that C is increasing, then R
/
is 1 up to some x and 2 above it.
Take, for example, C(x) = x and P = 10. Then g
R
=6.5 and V
R
(x) =3.5x, x > 1. From this it
follows that R
/
(1) = = R
/
(5) = 1, R
/
(6) = = R
/
(N) = 2, with average reward g
R
/
= (123
45610)/6 5.17 >6.5 = g
R
. Note that there are two optimal policies, that replace after 4 or
5 years, with average cost 5.
Koole Lecture notes Stochastic Optimization 22nd January 2006 8
For the optimal policy R
it holds that
r(x, R
(x))) +
y
p(x, R
(x), y)V
R
(y) = max
a
r(x, a) +
y
p(x, a, y)V
R
(y).
At the same time, by the Poisson equation:
V
R
(x) +g
R
= r(x, R
(x))) +
y
p(x, R
(x), y)V
R
(y).
Combining these two gives
V
R
(x) +g
R
= max
a
r(x, a) +
y
p(x, a, y)V
R
(y).
This equation is called the optimality equation or Bellman equation. Often the superscript is left
out: g and V are simply the average reward and value function of the optimal policy.
Exercise 5.1 Consider Example 5.4, with N = 4, P = 3, C(1) = 5, C(2) = 10, C(3) = 0 and
C(4) = 10. Start with R(x) = 2 for all x. Apply policy iteration to obtain the optimal policy.
6 Markov reward chains: backward recursion
In this section we go back to the Markov reward chains to obtain an alternative method for
deriving V. Recall that V
T+1
(x) V
T
(x) g, and note that V
T
(x) V
T
(y) V(x) V(y). Thus
simply by computing V
T
for T big we can obtain all values we are interested in. To compute V
T
we can use the recursion (2). Initially one usually takes V
0
= 0, although a good initial value can
improve the performance signicantly. One stops iterating if the following holds:
span(V
t+1
(x) V
t
(x)) for all x X,
which is equivalent to
there exists a g such that g
2
V
t+1
(x) V
t
(x) g+
2
for all x X.
Value iteration algorithm pseudo code Let [X[ = N, E(x) y[p(x, y) > 0, and some
small number (e.g., 10
6
).
Vector V[1..N], V
/
[1..N]
Float min, max
V 0
do
V
/
V
Koole Lecture notes Stochastic Optimization 22nd January 2006 9
for(x = 1, .., N) % iterate
V(x) r(x)
for(y E(x)) V(x) V(x) + p(x, y)V
/
(y)
max 10
10
min 10
10
for(x = 1, .., N) % compute span(V V
/
)
if(V(x) V
/
(x) < min) min V(x) V
/
(x)
if(V(x) V
/
(x) > max) max V(x) V
/
(x)
while(max min > )
Some errors to avoid:
- Avoid taking E(x) =X, but use the sparseness of the transition matrix P;
- Compute P online, instead of calculating all entries of P rst;
- Insert the code for calculating P at the spot, do not use a function or subroutine;
- Span(V V
/
) need not be calculated at every iteration, but once every 10th iteration for example.
7 Markov decision chains: backward recursion
Value iteration works again in the situation of a Markov decision chain, by including actions in
the recursion of Equation (2):
V
t+1
(x) = max
a
r(x, a) +
y
p(x, a, y)V
t
(y). (4)
We use the same stop criterion as for the Markov reward case.
Remark 7.1 (terminology) The resulting algorithm is known under several different names. The same
holds for that part of mathematics that studies stochastic dynamic decision problems. The backward re-
cursion method is best known under the name value iteration, which stresses the link with policy iteration.
The eld is known as Markov decision theory, although stochastic dynamic programming is also used.
Note that in a deterministic setting dynamic programming can best be described by backward recursion.
Thus the eld is identied by its main solution method. We will call the eld Markov decision theory, and
we will mainly use value iteration for the backward recursion method.
Remark 7.2 V
t
(x) is interesting by itself, not just because it helps in nding the long-run average optimal
policy: it is the maximal reward over an horizon t. For studying these nite-horizon rewards we do not
need Assumptions 3.33.4: they were only necessary to obtain limit results.
Example 7.3 Consider a graph with nodes V =1, . . . , N, arc set E V V, and distances d(x, y) > 0
for all (x, y) E. What is the shortest path from 1 to N? This can be solved using backward recursion as
follows. Take X = A = V. For all x < N we dene r(x, a) = d(x, a) if (x, a) E, otherwise, and
p(x, a, a) =1. Dene also r(N, a) =0 and p(N, a, N) =1. Start with V
0
=0. Then V
N1
gives the minimal
distances from all points to N.
Exercise 7.1 Prove the claim of Example 7.3.
Koole Lecture notes Stochastic Optimization 22nd January 2006 10
Exercise 7.2 Consider again Example 7.3.
a. Formulate the Poisson equation for the general shortest path problem.
b. Give an intuitive explanation why V(N) = 0.
c. Give an interpretation of the values V(x) for x ,= N.
d. Verify this using the Poisson equation.
Exercise 7.3 Consider a model with X =0, . . . , N, A =0, 1, p(x, 0, x) = p(N, 1, N) = for
all x > 0, p(x, 1, x +1) = for all x < N, p(0, 0, 0) = 1, p(0, 1, 0) = p(x, a, x 1) = 1 for
all a A and x > 0, r(x, 0) = x c for some c > 0, r(N, 1) = N c, and r(x, 1) = x for x < N.
Implement the value iteration algorithm in some suitable programming environment and report
for different choices of the parameters (with N at least 10) on the optimal policy and values.
8 Continuous time: semi-Markov processes
Consider a Markov chain where the time that it takes to move from a state x to the next state is
not equal to 1 anymore, but some random variable T(x). This is called a semi-Markov process
(C inlar [6], Ch. 10). We assume that 0 < (x) =ET(x) < .
If we study the semi-Markov process only at the moments it changes state, the jump times,
then we see what is called the embedded Markov chain. This Markov chain has stationary distri-
bution
. Consider now the stationary distribution over time, i.e., the time-limiting distribution
that the chain is in a certain state. This distribution
is specied by:
(x)
(y)
=
(x)
(x)
(y)
(y)
,
from which it follows that
(x) =
(x)(x)
(y)(y)
. (5)
Example 8.1 (Repair process) Take a model with X = 0, 1, p(0, 1) = p(1, 0) = 1, and some arbitrary
T(0) and T(1). This could well model the repair of a system, with T(1) the time until failure, and T(0)
the repair time. Note that the embedded Markov chain is periodic, but the stationary distribution exists:
(0) =
(1) =
1
2
. Using Equation (5) we nd
P(system up in long run) =
(1) =
(1)(1)
(0)(0) +
(1)(1)
=
(1)
(0) +(1)
.
The same result can be obtained from renewal theory.
Exercise 8.1 Calculate
0
,
1
,
2
,
, and
z
(x, z). Thus, Markov processes, dened through their rates (x, y), are
indeed special cases of semi-Markov processes.
Sometimes it is convenient to have T(x) equally distributed for all x. Let be such that
y
(x, y) for all x. We construct a new process with rates
/
(x, y) as follows. First, take
/
(x, y) = (x, y) for all x ,= y. In each state x with
y
(x, y) < , add a ctituous or dummy
transition from x to x such that the rates sum up to :
/
(x, x) =
y,=x
(x, y) for all x X.
This new process has expected transition times
/
as follows:
/
(x) = 1/. Because
/
(x) =
/
(y)
it follows that
/
=
/
, from Equation (5). The idea of adding dummy transitions to make the
rates out of states constant is called uniformization.
Using uniformization we can derive the standard balance equations for Markov processes
from Equation (1). To do so, let us write out Equation (1), in terms of the rates. Note that for the
transition probabilities we have p
/
(x, y) =
/
(x, y)/:
yX
/
(x, y)
(x) =
/
(x) =
yX
(y)p
/
(y, x) =
y,=x
(y)
/
(y, x)
+
/
(x)
/
(x, x)
.
Multiplying by , and subtracting
/
(x)
/
(x, x) from both sides leads to
y,=x
/
(x, y)
/
(x) =
y,=x
(y)
/
(y, x), (6)
the standard balance equations.
Example 9.1 (M/M/1 queue) An M/M/1 queue has X = N
0
, and is dened by its arrival rate and its
service rate , thus (x, x +1) = for all x 0 and (x, x 1) = for all x > 0. All other transition rates
are 0. (We assume that < for reasons to be explained later.) Filling in the balance equations (6) leads
to
(0) =
(1), (+)
(x) =
(x 1) +
(x +1), x > 0.
It is easily veried that the solution is
(x) = (1 )
x
, with = /. Note that Assumption 3.2 is
violated. For
(1) =
+
(0),
(x) =
+
_
x1
(0) =
2
,
(1) =
2
+
(x) =
2
+
_
x1
for x > 0.
The denominator of Equation (5) is equal to 1/(2), leading to
(0) = 2
(0)(0) = 2
2
1
= 1,
equal to what we found above. In a similar way we can nd
(x)r(x).
Example 11.1 Take the repair process studied earlier, suppose a reward R for each unit of time the system
is up and labour costs C for each unit of time the system is in repair. Then the expected long-run stationary
reward is given by
g =
(0)C+(1)R
(0) +(1)
.
A simple algorithm to compute g would consists of computing the stationary distribution of
the embedded chain
:
g =
(x)r(x) =
(x)(x)r(x)
(y)(y)
. (7)
The denominator of g has the following interpretation: it is the expected time between two jumps
of the process.
An alternative method is to simulate the embedded chain X
t
, and then to compute
T
t=0
r(X
t
)(X
t
)
T
t=0
(X
t
)
g a.s.
As for the discrete-time case, we move next to backward recursion. Let V
t
(x) be the total
expected reward in [0, t] when starting at 0 in x. We are again interested in lim
t
V
t
(x)
t
, the
average expected long-run rewards.
Koole Lecture notes Stochastic Optimization 22nd January 2006 14
Using similar arguments as for the discrete-time case we nd
V
T
(x) +(x)g+o(1) = r(x)(x) +
y
p(x, y)V
T
(y).
Subtracting gT from both sides and taking T leads to:
V(x) +(x)g = r(x)(x) +
y
p(x, y)V(y) (8)
Note that again this equation does not have a unique solution.
Example 11.2 (Repair process) Consider again the repair process. We get the following set of equations:
V(0) +(0)g =C(0) +V(1), V(1) +(1)g = R(1) +V(0)
All solutions are given by:
g =
C(0) +R(1)
(0) +(1)
,
V(1) =V(0) +
(0)(1)(R+C)
(0) +(1)
Exercise 11.1 We add a reward component to the semi-Markov process that we studied for some
of the models of Exercises 8.1 and 9.1. Compute the expected stationary reward using two differ-
ent methods: by utilizing the stationary distribution
y
p(x, a, y)V
R
(y).
Koole Lecture notes Stochastic Optimization 22nd January 2006 15
Exercise 12.1 Consider again Exercise 8.1. As in Exercise 11.1, there is a reward R for each
time unit the machine is up. The costs for repairing are equal to C per unit of time. Suppose
there is the option to repair in 3 hours for costs C
/
per unit of time.
a. Formulate this as a semi-Markov decision process.
b. Use policy iteration to determine the minimal value of C
/
for which it would be attractive to
choose the short repair times.
13 Semi-Markov reward processes: backward recursion
To derive the backward recursion algorithm for semi-Markov reward processes we note rst that
the optimal policy depends only on T(x) through (x): the distribution of T(x) does not play
a role. This means that we can choose T(x) the way we like: with 0 < (x) for all x, we
take T(x) = G(x), where G(x) has a geometric distribution with parameter q(x) = /(x). This
means that P(G(x) = 1) = q
x
, P(G(x) = 2) = (1 q(x))q(x), etc. Then
k
kP(G(x) = k) =
k
kq(x)(1q(x))
k1
= q(x)
1
, and thus indeed ET(x) = EG(x) = (x).
Note that G(x) is memoryless: after each interval of length the sojourn time in state x
nishes with probability q(x). Thus the original system is equivalent to one with sojourn times
equal to and dummy transitions with probability 1q(x). This leads to the following backward
recursion:
V
(t+1)
(x) = r(x) +q(x)
y
p(x, y)V
t
(y) +(1q(x))V
t
(x).
From V
(t+1)
(x) =V
t
(x) +g+o(1) it follows that V
(t+1)
(x) V
t
(x) g. Thus the value
iteration algorithm consists of taking V
0
= 0, and then computing V
,V
2
, . . .. The stop criterion
is equivalent to the one for the discrete-time case.
14 Semi-Markov decision processes: backward recursion
Value iteration can again be generalized to the case that includes decisions. This leads to the
following value function:
V
(t+1)
(x) = max
a
r(x, a) +q(x, a)
y
p(x, a, y)V
t
(y) +(1q(x, a))V
t
(x).
The last policy is optimal, and [V
(t+1)
(x) V
t
(x)]/ for t sufciently large gives the maximal
average rewards.
Remark 14.1 In the discrete-time setting V
t
(x) had an interpretation: it is the total expected maximal
reward in t time units. In the continuous-time case V
t
(x) does not have a similar interpretation, due to the
randomness of T(x, a).
Exercise 14.1 Consider the repair process of Example 11.2. Take (1) = 5, (0) = 2, R = 2 and
C = 0. Here there is also the additional option to shorten repair times to 1, for costs 1.
Koole Lecture notes Stochastic Optimization 22nd January 2006 16
a. Formulate the value function.
b. Solve it using a suitable computer program of package.
c. Find the optimal policy as a function of the parameter t. Can you explain what you found?
15 Other criterion: discounted rewards
Average rewards are often used, but sometimes there are good reasons to give a lower value to a
reward obtained in the future than the same reward right now.
Example 15.1 A reward of 1 currently incurred will be 1+ after one year if put on a bank account with
an interest rate of . Thus a reward of 1 after 1 year values less than a reward of 1 right now.
In the example we considered a yearly payment of interest of rate . To make the step to
continuous-time models, we assume that each year is divided in m periods, and after each period
we received an interest of /m. Thus after t years our initial amount 1 has grown to (1 +
m
)
tm
.
This converges to e
t
as m . Thus, in a continuous-time model with interest , an amount of
1 values e
t
after t years. By dividing by e
t
we also obtain: Reward 1 at 1 is evaluated at 0 as
e
0
e
t
dt < the total expected discounted rewards are well dened. Let V
(x) be
the total expected discounted rewards, starting in x. Note that the starting state is crucial here,
unlike for the average reward model.
To determine V
(x), we rst have to derive the total expected discounted rate rewards if the
model is in x from 0 to T(x). If r(x) = 1, then this is equal to
E
_
T(x)
0
e
s
ds.
We write Ef (T(x)) =
_
0
f (t)dT(x)(t), irrespective of the type of distribution of T(x). (This
notation comes from measure theory.) Then
E
_
T(x)
0
e
s
ds =
_
0
_
t
0
e
s
dsdT(x)(t) =
1
(1(x)),
with (x) = Ee
T(x)
, the so-called Laplace-Stieltjes transform of T(x) in . From T(x) on the
discounted rewards are equal to V
(x) =
1
(1(x))r(x) +(x)
y
p(x, y)V
(y).
This can of course be utilized as part of a policy improvement algorithm. The improvement step
is then given by:
R
/
(x) = argmax
a
1
(1(x, a))r(x, a) +(x, a)
y
p(x, a, y)V
R
(y).
Koole Lecture notes Stochastic Optimization 22nd January 2006 17
The optimality equation becomes
V
(x) = max
a
1
(1(x, a))r(x, a) +(x, a)
y
p(x, a, y)V
(y),
and also value iteration works for the discounted model.
Remark 15.2 It is interesting to note that discounting is equivalent to taking total rewards up to T with
T random and exponential. Indeed, let r(t) be a function indicating the rate reward at t. Let T exp().
Then c(t)e
t
is the expected reward at t, equal to the discounted reward.
Exercise 15.1 Determine the Laplace-Stieltjes tranforms of the exponential and gamma distri-
butions.
Exercise 15.2 Repeat exercise 12.1 for the discounted reward case. Consider the cases in which
the transition times are constant and exponentially distributed, for some well-chosen .
Exercise 15.3 Consider a Markov reward process with state space X = 0, 1, p(0, 1) =
p(1, 0) = 1, T
0
is exponentially distributed, T
1
is constant, and r = (1, 0). Assume that the
reward at t is discounted with a factor e
t
for some > 0. Let V
.
c. Give an interpretation for the results you found for b.
16 (Semi-)Markov decision processes: literature
Some literature on the theory of (semi-)Markov decision chains/processes: Bertsekas [3], Kallen-
berg [10], Puterman [13], Ross [14], Tijms [16].
17 Modeling issues
Suppose you have some system or model that requires dynamic optimization. Can it be
(re)formulated as a (semi-)Markov decision problem, and if so, how to do this in the best way?
Different aspects of this question will be answered under the heading modeling issues.
Modeling issues: dependence on next state
Let us start by introducing some simple generalizations to the models that can sometimes be
quite helpful.
Sometimes it is more appropriate to work with costs instead or rewards. This is completely
equivalent, by multiplying all rewards with 1 and replacing max by min everywhere.
Koole Lecture notes Stochastic Optimization 22nd January 2006 18
It sometimes occurs that the direct rewards r
/
depend also on the next state: r
/
(X
t
, A
t
, X
t+1
).
We are interested in the sum of the expected rewards, E
T1
t=0
r
/
(X
t
, A
t
, X
t+1
), which gives
E
T1
t=0
r
/
(X
t
, A
t
, X
t+1
) =
T1
t=0
Er
/
(X
t
, A
t
, X
t+1
) =E
T1
t=0
yX
p(X
t
, A
t
, y)r
/
(X
t
, A
t
, y).
Thus we can replace the direct rewards r
/
(x, a, y) by r(x, a) =
y
p(x, a, y)r
/
(x, a, y), which ts
within our framework.
A similar reasoning can be applied to the case where the transitions times depend
also on y, notation T
/
(x, a, y) with
/
(x, a, y) = ET
/
(x, a, y). In this case, take (x, a) =
y
p(x, a, y)
/
(x, a, y).
Modeling issues: lump rewards
We presented the theory for continuous-time models assuming that rewards are obtained in a
continuous fashion, so-called rate rewards. Sometimes rewards are obtained in a discrete fashion,
once you enter or leave a state (and after having chosen an action): lump rewards. In the long run
lump rewards r
l
(x, a) are equivalent to rewards r
l
(x, a)/(x, a). If we write out Equation (7) for
lump rewards instead of rate rewards, then we get the following somewhat simpler expression:
g =
(x)r
l
(x)
(y)(y)
.
Note that now also the numerator has a simple interpretation: it is the expected lump reward per
jump of the process.
Modeling issues: aperiodicity
Forward and backward recursion do not need to converge in the case of Markov (decision) chains
that are periodic. This we illustrate with an example.
Example 17.1 Consider a Markov reward chain with X = 0, 1, r = (1, 0), p(0, 1) = p(1, 0) = 1. This
chain is periodic with period 2. If we apply value iteration, then we get:
V
n+1
(0) = 1+V
n
(1), V
n+1
(1) = 1+V
n
(0).
Take V
0
(0) =V
0
(1) = 0. Then
V
n
(0) =
_
n
2
if n even;
n+1
2
if n odd,
V
n
(1) =
_
n
2
if n even;
n1
2
if n odd.
From this it follows that
V
n+1
(0) V
n
(0) =
_
1 if n even;
0 if n odd,
V
n+1
(1) V
n
(1) =
_
0 if n even;
1 if n odd.
But in this case the stop criterion is never met: span(V
n+1
V
n
) = 1 for all n.
Koole Lecture notes Stochastic Optimization 22nd January 2006 19
A simple trick to avoid this problem is to introduce a so-called aperiodicity transformation.
It consists in fact of adding a dummy transition to each state, by replacing P by P+ (1 )I,
0 < < 1. Thus with probability 1 the process stays in the current state, irrespective of the
state and action; with probability the transition occurs according to the original transition prob-
abilities (which could also mean a transition to the current state). Because it is now possible to
stay in every state, the current chain is aperiodic, and forward and backward recursion converge.
This model has the following Poisson equation:
V +g = r +(P+(1)I)V = r +PV +(1)V V +g = r +PV. (9)
If (V, g) is a solution of V +g = r +PV, then (V/, g) is a solution of Equation (9). Thus
the average rewards remain the same, and V is multiplied by a constant. This is intuitively
clear, because introducing the aperiodicity transformation can be interpreted as slowing down
the system, making it longer for the process to reach stationarity.
Modeling issues: states
The rst and perhaps most important choice when modeling is how to choose the states of the
model. When the states are chosen in the wrong way, then sometimes the model cannot be put
into our framework. This fact is related to the Markov property, which we discuss next.
Denition 17.2 (Markov property) A Markov decision chain has the Markov property if
P(X
t+1
= x[X
t
= x
t
, A
t
= a
t
) = P(X
t+1
= x[X
s
= a
s
, A
s
= a
s
, s = 1, . . . , t) for all t, x, and x
s
, a
s
for s = 1, . . . , t.
Thus the Markov property implicates that the history does not matter for the evolution of the
process, only the current state does. It also shows us how to take the transition probabilities p:
p(x, a, y) =P(X
t+1
=y[X
t
=x, A
t
=a
t
), where we made the additional assumption that X
t+1
[X
t
, A
t
does not depend on t, i.e., the system is time-homogeneous. If the Markov property does not hold
for a certain sytem then there are no transition probabilities that describe the transition law, and
therefore this system cannot be modeled as a Markov decision chain (with the given choice of
states). For semi-Markov models we assume the Markov property for the embedded chain. Note
that for Markov (decision) processes the Markov property holds at all times, because of the
memoryless property of the exponentially distributed transition times.
It is important to note that whether the Markov property holds might depend on the choice of
the state space. Thus the state space should be chosen such that the Markov property holds.
Example 17.3 Consider some single-server queueing system for which we are interested in the number
of customers waiting. If we take as state the number of customers in the queue, then information on
previous states gives information on the state of the server which inuences the transitions of the queue.
Therefore the Markov property does not hold. Instead, one should take as state the number of customers
in the system. Under suitable conditions on service and interarrival times the Markov property holds. The
queue length can be derived from the number of customers in the system.
Koole Lecture notes Stochastic Optimization 22nd January 2006 20
Our conclusion is that the state space should be chosen such that the Markov property holds.
Next to that, note that policies are functions of the states. Thus, for a policy to be implementable,
the state should be observable by the decision maker.
Remark 17.4 A choice of state space satisfying the Markov property that always works is taking all
previous observations as state, that is, the observed history is the state. Disadvantages are the growing size
of the state space, and the complexity of the policies.
Modeling issues: decision epochs
Strongly related to the choice of states is the choice of decision epoch. Several choices are
possible: does the embedded point represent the state before or after the transition or decision?
This choice often has consequences for the states.
Example 17.5 Consider a service center to which multiple types of customers can arrive and where de-
cisions are made on the basis of the arrival. If the state represents the state after the arrival but before the
decision, then the type of arrival should be part of the state space. On the other hand, if the state represents
the state before the potential arrival then the type of arrival will not be part of the state; the actions however
will be more-dimensional, with for each possible arrival an entry.
The general rule when choosing the state space and the decision epochs is to do it such that
the size of the state space is minimized. To make this clear, consider the following examples
from queueing theory.
The M/M/1 queue can be modeled as a Markov process with as decision epochs all events
(arrivals and departures) in the system, and as states the number of customers. In the case of the
M/G/1 queue this is impossible: the remaining service time depends not only on the number of
customers in the system, and thus the Markov property does not hold. There are two solutions:
extend the state space to include the attained or remaining service time (the supplementary vari-
able approach) or choose the decision epoch in such a way that the state space remains simple.
We discuss methods to support both approaches, the latter rst.
Example 17.6 Consider a model with s servers that each work with rate , two types of customers that
arrive with rates
1
and
2
, no queueing, and the option to admit an arriving customer. Blocking costs
c
i
for type i, if all servers are busy the only option is blocking. How to minimize blocking costs? When
modeling this as a Markov decision process we have to choose the states and decision epochs. The problem
here is that the optimal action depends on the type of event (e.g., an arrival of type 1) that is happening.
There are two solutions to this: either we take state space 0, . . . , s a1, a2, d, the second dimension
indicating the type of event. This can be seen as epoch the state after the transition but before a possible
assignment. We can also take as state space 0, . . . , s, but then we have as action 0, 10, 1 with the
interpretation that action (a
1
, a2) means that if an arrival of type i occurs then action a
i
is selected. This
has the advantage of a smaller state space and is preferable.
Exercise 17.1 Give the transition probabilities and direct rewards for both choices of decision
epochs of Example 17.6.
Koole Lecture notes Stochastic Optimization 22nd January 2006 21
Modeling issues: Poisson arrivals in interval of random length
Consider again the M/G/1 queue. To keep the state space restricted to the number of customers
in the system the decision epochs should lie at the beginning and end of the service times, which
have a duration S. This means that (x) =ES for x >0. We take (0) =1/, and thus p(0, 1) =1,
p(x, x1+k) =P(k arrivals in S) for x >0. Let us consider how to calculate these probabilities.
First note that
q
k
=P(k arrivals in S) =
_
0
(t)
k
k!
e
t
dS(t).
Let us calculate the generating function Q of the q
k
s:
Q() :=
k=0
q
k
k
=
_
0
e
t(1)
dS(t) = g((1)),
with g thus the Laplace-Stieltjes transform of S. The coefcients q
k
can be obtained from Q in
the following way: Q
(k)
(0) = ()
k
g
(k)
() = q
k
k!. In certain cases we can obtain closed-form
expressions for this.
Example 17.7 (S exponential) Suppose S Exp(), then g(x) =
1
1+x/
. From this it follows that Q() =
(1+(1)/)
1
, and thus
Q
(n)
(0) = n!(
)
n
(1+/)
(n+1)
= n!(
+
)
n
+
.
From this it follows that q
k
= (
+
)
k
+
, and thus the number of arrivals is geometrically distributed.
If there is no closed-form expression for the q
k
then some numerical approximation has to be
used.
Modeling issues: Phase-type distributions
As said when discussing the choice of decision epochs, another option to deal with for example
the M/G/1 queue is adding an additional state variable indicating the attained service time. This
not only adds an additional dimension to the state space, but this extra dimension describes a
continuous-time variable. This can be of use for theoretical purposes, but from an algorithmic
point of view this variable has to be discretized in some way. A method to do so with a clear
interpretation is the use of phase-type distributions.
The class of phase-types distributions can be dened as follows.
Denition 17.8 (Phase-type distributions) Consider a Markov process with a single absorbing
state 0 and some initial distribution. The time until absorbtion into 0 is called to have a Phase-
type (PH) distribution.
Koole Lecture notes Stochastic Optimization 22nd January 2006 22
Example 17.9 The exponential, gamma (also called Erlang or Laplace), and hyperexponential distribu-
tions (the latter is a mixture of 2 exponential distributions) are examples of PH distributions.
An interested property of PH distributions is the fact that the set of all PH distributions is
closed in the set of all non-negative distributions, i.e., any distribution can be approximated
arbitrarily close by a PH distribution. This holds already for a special class of PH distributions,
namely for mixtures of gamma distributions (all with the same rate).
Consider some arbitrary distribution function F, and write E(k, m) for the distribution func-
tion of the gamma distribution with k phases (the shape parameter) and rate m.
Theorem 17.10 For m N, take
m
(k) = F(
k
m
) F(
k1
m
) and
m
(m
2
) = 1F(
m
2
1
m
). Then for
F
m
with F
m
(x) =
m
2
k=1
k
(m)E(k, m)(x) it holds that lim
m
F
m
(x) = F(x) for all x 0.
Proof The intuition behind the result is that E(km, m)(x) Ik x and that
m
2
k=1
k
(m)Ik/m x
F(x). An easy formal proof is showing that the Laplace-Stieltjes transforms of F
m
converge to the trans-
form of F (which is equivalent to convergence in distribution).
Modeling issues: Littles law and PASTA
Sometimes we are interested in maximizing performance measures that cannot be formulated
directly as long-run average direct rewards, e.g., minimizing the average waiting time in some
queueing system. There are two results that are helpful in translating performance measures such
that they can be obtained through immediate rewards: Littles law and PASTA.
Littles law is an example of a cost equation, in which performance at the customer level
is related to performance at the system level. E.g., in a system with Poisson arrivals it gener-
ally holds that EL = EW with L the stationary queue length, W the stationary waiting time of
customers, and the arrival rate. See El-Taha & Stidham [8] for an extensive treatment of cost
equations.
PASTA stands for Poisson arrivals see time averages. It means that an arbitrary arrival sees
the system in stationarity. For general arrival processes this is not the case.
Example 17.11 (Waiting times in the M/M/1 queue) Suppose we want to calculate the waiting time in
the M/M/1 queue by backward recursion. Denote the waiting time by W
q
, and the queue length by L
q
.
Then Littles law states that EW
q
= EL
q
/. If the state x denotes the number in the system, then taking
immediate reward r(x) = (x 1)
+
/ will give g =EW
q
.
We can calculate EW
q
also using PASTA. PASTA tells us that EW
q
= EL/: we have to wait for all
the customers to leave, including the one in service. Thus taking r(x) = x/ also gives g =EW
q
.
The equivalence of both expressions for EW
q
can be veried directly from results for the M/M/1
queue, using EL = /(1) and EL =EL
q
+ with = /:
EL
Q
=
EL
=
1
=
()
=
EL
.
Koole Lecture notes Stochastic Optimization 22nd January 2006 23
Modeling issues: Countable state spaces
In our physical world all systems are nite, but there are several reasons why we would be inter-
ested in systems with an innite state space: innite systems behave nicer than nite systems,
bounds cannot exactly be given, the state space does not represent something physical but some
concept of an unbounded nature, and so forth.
Example 17.12 (M/M/1 vs. M/M/1/N queue) For the M/M/1 queue there are nice expressions for the
most popular performance measures, for the M/M/1/N queue these expressions are less attractive.
Example 17.13 (Work in process in production systems) In certain production systems the amount of
work of process that can be stocked is evidently bounded. In other production systems such as adminis-
trative processes this is less clear: how many les that are waiting for further processing can a computer
system store? This question is hard to answer if not irrelevant, given the current price of computer disk
space.
Example 17.14 (Unobserved repairable system) Suppose we have some system for which we have no
immediate information whether it is up or not, but at random times we get a signal when it is up. A natural
candidate for the states are the numbers of time units ago since we last got a signal. By its nature this is
unbounded.
Although we have good reasons to prefer in certain cases innite state spaces, we need a
nite state space as soon as we want to compute performance measures or optimal policies. A
possible approach is as follows.
Consider a model with countable state space X. Approximate it by a series of models with -
nite state spaces X
n
, such that X
n+1
X
n
and lim
n
X
n
=X. Let the nth model have transition
probabilities p
(n)
. These should be changed with respect to p such that there are no transitions
from X
n
to XX
n
. This can for example be done by taking for each x X
n
p
(n)
(x, y) = p(x, y) for x ,= y X
n
, p
(n)
(x, y) = 0 for y , X
n
, p
(n)
(x, x) = p(x, x) +
y,X
n
p(x, y).
Having dened the approximating model it should be such that the performance measure(s) of
interest converge to the one(s) of the original model. There is relatively little theory about these
types of results, in practice one compares for example g
(n)
and g
(n+1)
for different values of n.
Example 17.15 (M/M/1 queue) As nite nth approximation for the M/M/1 queue we could take the
M/M/1/n queue. If and only if the queue is stable (i.e., < ) then
(n)
(x)
N
n=1
K
n
, with
N
n=1
K
n
k=1
(B
nk
+1) the number of states. This would be sufcient
if all machines had a production time of 1 and no costs for switching from one product to the next. If this
is the case (as it is often in job shops) then additional state variables indicating the states of the machines
have to be added.
Example 18.2 (Service center) Consider a service center (such as a call center) where customers of dif-
ferent types arrive, with multiple servers that can each process a different subset of all customer types.
To model this we need at least a variable for each customer class and a variable for each server (class).
Service centers with 5 or 10 customer and server classes are no exception, leading to 10 or 20 dimensions.
In the following sections we will discuss a number of approximation methods that can be
used (in certain cases) to solve high-dimensional problems.
Exercise 18.1 Consider a factory with m machines and n different types of products. Each prod-
uct has to be processed on each machine once, and each type of product has different processing
requirements. There is a central place for work-in-process inventory that can hold a total of k
items, including the products that are currently in service. Management wants to nd optimal
Koole Lecture notes Stochastic Optimization 22nd January 2006 25
dynamic decisions concerning which item to process when on which machine. Suppose one con-
siders to use backward recursion.
a. What is the dimension of the state space?
b. Give a description of the state space.
c. Give a formula for the number of states.
d. Give a rough estimate of this number for m = n = k = 10.
19 One-step improvement
The one-step improvement works only for models for which the value function of a certain xed
policy R can be obtained, without being hindered by the curse of dimensionality. Crucial is the
fact that practice shows that policy improvement gives the biggest improvement during the rst
steps. One-step improvement consist of doing the policy improvement step on the basis of the
value of R. It is guaranteed to give a better policy R
/
, but how good R
/
is cannot be obtained in
general, because of the curse of dimensionality.
Note that it is almost equally demanding to store a policy in memory than it is to store a value
function in memory. For this reason it is often better to compute the action online. That is, if the
current state is x, then V
R
(y) is calculated for each y for which p(x, a, y) > 0 for some a, and on
the basis of these numbers R
/
(x) is computed. Note that the sparseness of the matrix P is crucial
in keeping the computation time low.
The most challenging step in one-step improvement is computing V
R
for some policy R. The
practically most important case is where R is such that the Markov process splits up in several
independent processes. We show how to nd V
R
on the basis of the value functions of the
components.
Suppose we have n Markov reward processes with states x
i
X
i
, rewards r
i
, transition rates
i
(x
i
, y
i
), uniformization parameters
i
, average rewards g
i
, and value functions V
i
. Consider now
a model with states x = (x
1
, . . . , x
n
) X =X
1
X
n
, rewards r(x) =
n
i=1
r
i
(x
i
), transition
rates (x, y) =
i
(x
i
, y
i
) if x
j
= y
j
for all j ,= i, average reward g, and value function V.
Theorem 19.1 g =
i
g
i
and V(x) =
i
V
i
(x
i
) for all x X.
Proof Consider the Poisson equation of component i:
V
i
(x
i
)
y
i
i
(x
i
, y
i
) +g
i
= r
i
(x
i
) +
y
i
i
(x
i
, y
i
)V
i
(y
i
).
Sum over i and add
y
i
i
(x
i
, y
i
)
j,=i
V
j
(x
j
) to both sides. Then we get the optimality equation of the
system as specied, with g =
i
g
i
and V =
i
V
i
.
Example 19.2 Consider a fully connected communication network, where nodes are regional switches
and links consists of several parallel communication lines. The question to be answered in this network
is how to route if all direct links are occupied? To this model we apply one-step optimization with initial
policy R that rejects calls if all direct links are occupied. This implies that all link groups are independent,
Koole Lecture notes Stochastic Optimization 22nd January 2006 26
and Theorem 19.1 can be used to calculate the value function for each link group. If a call arrives for a
certain connection and all direct links are occupied, then online it is calculated if and how this call should
be redirected. Note that in this case it will at least occupy two links, thus it might be optimal to reject the
call, even if routing over multiple links is possible.
Calculating the value for each link group can be done very rapidly, because these are one-dimensional
problems. But they even allow for an exact analysis, if we assume that they behave like Erlang loss
models.
We derive the value function of an Erlang loss model with parameters , = 1, s, and reward 1 per
lost call. The Poisson equation is as follows:
sV(s) +g = +sV(s 1);
x < s : (x +)V(x) +g = V(x +1) +xV(x 1).
Note that we are only interested in differences of value functions, not in the actual values. These differ-
ences and g are given by g = B(s, a) and V(k +1) V(k) = B(s, a)/B(k, a) with a = / and
B(k, a) =
a
k
/k!
k
j=0
a
j
/ j!
,
the Erlang blocking formula. This example comes from Ott & Krishnan [12].
Example 19.3 (value function M/M/1 queue) The M/M/1 queue with the queue length as immediate
reward is another example for which we can compute the value function explicitly. Instead of solving
the Poisson equation we give a heuristic argument. It is known that g = /( ). By using a coupling
argument it is readily seen that V(x +1) V(x) = (x +1)EB, with B the length of a busy period. For EB
we can obtain the following relation by conditioning on the rst event in a busy period:
EB =
1
+
+
+
2EB EB = ()
1
,
and thus
V(x) V(0) =
x
i=1
(V(i) V(i 1)) =
x(x +1)
2()
.
Exercise 19.1 Consider an M/M/s/s queue with two types of customers, arrival rates
1
,
2
,
and
1
=
2
. Blocking costs are different, we are interested in the average weighted long-run
blocking costs. Give g and V for this queue. Now we have the possibility to block customers,
even if there are servers free. What will be the formof the policy after one step of policy iteration?
Compute it for some non-trivial parameter values.
Exercise 19.2 We consider a single arrival stream with rate 2 that can be routed to two parallel
single-server queues, with service rates 1 and 2. Consider rst a routing policy that assigns all
customers according to i.i.d. Bernoulli experiments, so-called Bernoulli routing. We are inter-
ested in the average total long-run number of customers in the system. Compute the optimal
routing probability and the g and V belonging to this policy. Show how to compute the one-step
improved policy. How can you characterize this policy?
Koole Lecture notes Stochastic Optimization 22nd January 2006 27
20 Approximate dynamic programming
Approximate dynamic programming is another, more ambitious method to solve high-
dimensional problems. It was introduced in the dynamic programming community and further
formalized in Bertsekas & Tsitsiklis [4].
The central idea is that V
R
(x) can be written as or estimated by a function with a known
structure (e.g., quadratic in the components). Let us call this approximation W
R
(r, x), with r the
vector of parameters of this function. Thus the problem of computing V
R
(x) for all x is replaced
by computing the vector r, which has in general much less entries.
Example 20.1 Let W
R
(r, x) represent the value function for the long-run average weighted queue length
in a single-server 2-class preemptive priority queue with Poisson arrivals and exponential service times.
Then W
R
(r, x) = r
0
+r
1
x
1
+r
2
x
2
+r
11
x
2
1
+r
22
x
2
2
+r
12
x
1
x
2
, with x
i
the number of customers of type i in
the system. Instead of computing V
R
(x) for all possible x we only have to determine the six coefcients.
(Groenevelt et al. [9])
A summary of the method is as follows (compare with policy iteration):
0. Choose some policy R.
1. Simulate/observe model to estimate V
R
by V
R
in states x X
/
X;
2. Approximate
V
R
(x) by W
R
(r, x) (with r such that it minimizes
xX
/ [
V
R
(x) W
R
(r, x)[);
3. Compute new policy R
/
minimizing W
R
(r, x), go to step 1 with R = R
/
.
21 LP approach
For Markov reward chains we saw in Section 4 that the average reward g is equal to
(x)r(x)
with
(y)p(y, x)
(x) = 0
and
x
(x) = 1,
(x) 0.
Of course,
(x, a)r(x, a)
subject to
(y, a)p(y, a, x)
(x, a) = 0
Koole Lecture notes Stochastic Optimization 22nd January 2006 28
and
x
(x, a) = 1,
(x, a) 0.
Now the
(x, a) are called the (stationary) state-action frequencies, they indicate which frac-
tion of the time the state is x and action a is chosen at the same time,
(x, a) > 0 for at least one a A, but there might be more. Any action a for which
(x, a) > 0
is optimal in x. However, the standard simplex method will nd only solutions with
(x, a) > 0
for one a per x. To see this, consider the number of equations. This is X +1. However, one
inequation is redundant, because the rst X rows are dependent:
x
[
(y, a)p(y, a, x)
(x, a)] = 0.
Thus we can leave out one of the inequalities, giving a total of X. Now the simplex method
rewrites the constraint matrix such that each solution it evaluates consists of a number of non-
negative basic variables (equal to the number of constraints) and a number of non-basic variables
equal to 0. In our case there are thus X basic variables, one corresponding to each state.
Exercise 21.1 Give the LP formulation for semi-Markov decison processes.
22 Multiple objectives
Up to now we looked at models with a single objective function. In practice however we of-
ten encounter problems with more than one objective function, which are often formulated as
maximizing one objective under constraints on the other objectives.
There are two solution methods for this type of problems: linear programming and Lagrange
multipliers. A general reference to constrained Markov decision chains is Altman [1].
Let us rst consider the LP method. Constraints of the form
(x, a) > 0 for more than one a, then the optimal policy chooses action a
/
with
probability
(x, a
/
)/
(x, a). The next example shows that this type of randomization now
plays a crucial role in obtaining optimal policies.
Example 22.1 Take [X[ = 1, A =1, 2, 3, r = (0, 1, 2), c = (0, 1, 4), = 2. The LP formulation is:
max(1, 2) +2(1, 3)
s.t.
(1, 1) +(1, 2) +(1, 3) = 1,
(1, 2) +4(1, 3) 2.
It is readily seen that the optimal solution is given by (0, 2/3, 1/3), thus a randomization between action
2 and 3.
Koole Lecture notes Stochastic Optimization 22nd January 2006 29
Let us now discuss the Lagrange multiplier approach. The main disadvantage of the LP
approach is that for a problem with K constraints one has to solve an LP with |X| · |A| decision
variables and |X| + K constraints.
Instead, we use backward recursion together with Lagrange multipliers. Assume we have a
single constraint Σ_{x,a} c(x, a)π(x, a) ≤ α, and introduce the Lagrange multiplier γ ≥ 0.
The crucial idea is to replace the direct reward r by r − γc. We will use the following notation:
the average reward for a policy R is written as usual as g^R, and the average constraint value is
written as f^R.
Theorem 22.2 Suppose there is a γ ≥ 0 such that R(γ) = argmax_R { g^R − γ f^R } satisfies f^{R(γ)} = α.
Then R(γ) is constrained optimal.

Proof Take some R with f^R ≤ α. Then g^R − γ f^R ≤ g^{R(γ)} − γ f^{R(γ)}, and thus
g^R ≤ g^{R(γ)} − γ(α − f^R) ≤ g^{R(γ)}.
The function f^{R(γ)} is decreasing in γ, but not continuous. The function g^{R(γ)} − γ f^{R(γ)} is piece-
wise linear, and non-differentiable in those values of γ for which multiple policies are optimal:
there the derivative changes. How this can be used to construct an optimal policy is first illus-
trated in the next example, and then formalized in the algorithm that follows the example.
Example 22.3 Consider again |X| = 1, A = {1, 2, 3}, r = (0, 1, 2), c = (0, 1, 4). Now g − γ f = (0, 1 − γ, 2 − 4γ).
Policy 3 is optimal up to γ = 1/3, then policy 2 is optimal until γ = 1.
Suppose that α = 2. γ < 1/3 gives R(γ) = 3 and f = 4; 1 > γ > 1/3 gives R(γ) = 2 and f = 1. For
γ = 1/3 both policies 2 and 3 are optimal. By randomizing between them we find an R with f = α,
which is therefore optimal.
We introduce the following notation: let R*(γ) be the set of optimal policies for Lagrange
parameter γ, and by R = pR_1 + (1 − p)R_2 we mean the policy that in every state chooses with probability p the
action according to R_1 and with probability 1 − p the action according to R_2.
Then we have the following algorithm to construct optimal policies.
Algorithm:
1. If there is an R ∈ R*(0) with f^R ≤ α, then R is optimal.
2. Else: vary γ until one of 2 situations occurs:
   A. R ∈ R*(γ) with f^R = α: then R is optimal;
   B. R_1, R_2 ∈ R*(γ) with f^{R_1} < α and f^{R_2} > α: take R = pR_1 + (1 − p)R_2 such that f^R = α,
      then R is optimal (a small numerical sketch follows below).
Exercise 22.1 Consider the M|M|1|3 queue with admission control, i.e., every customer can be
rejected on arrival. The objective is to maximize the productivity of the server under a constraint
on the number of customers that are waiting.
a. Formulate this problem as an LP with general parameter values.
b. Solve the LP for λ = μ = 1 and constraint level α = 0.5. Interpret the results: what is the optimal policy?
c. Reformulate the problem with general parameter values using a Lagrange multiplier approach.
d. Give for each γ the value of f and g, for the same parameter values as used for the LP method.
e. What is the optimal value of γ?
23 Dynamic games
In this section we restrict ourselves to non-cooperative games (i.e., no coalitions are allowed) and 2 players.
We distinguish between two situations: at each decision epoch the players reveal their decisions
without knowing the other's decision, or the players play one by one after having witnessed the
action of the previous player and its consequences (perfect information).
We start with the first situation. Now the action a is two-dimensional: a = (a_1, a_2) ∈ A_1 × A_2,
with a_i the action of player i. This looks like a trivial extension of the 1-player framework,
but also the reward is two-dimensional: r(x, a) = (r_1(x, a), r_2(x, a)), and every player has as
objective maximizing its own long-run average expected reward. A special case are zero-sum
games, for which r_2(x, a) = −r_1(x, a); these can be seen as problems with a single objective,
one player maximizing the objective and the other minimizing it. For each state the value is a
2-dimensional vector. In the one-dimensional situation choosing an action means looking, for given x,
at r(x, a) + Σ_y p(x, a, y)V(y) for various a; now we have to consider for given x the vector

( r_1(x, a) + Σ_y p(x, a, y)V_1(y), r_2(x, a) + Σ_y p(x, a, y)V_2(y) ) =: ( Q_1(x, a), Q_2(x, a) )

for vectors a = (a_1, a_2). This is called a bi-matrix game, and it is already interesting to solve this
by itself. It is not immediately clear how to solve these bi-matrix games. An important concept
is that of the Nash equilibrium. An action vector a' is called a Nash equilibrium if no player has
an incentive to deviate unilaterally, which is in the two-player setting equivalent to

Q_1(x, (a_1, a'_2)) ≤ Q_1(x, a') and Q_2(x, (a'_1, a_2)) ≤ Q_2(x, a')

for all a_1 ∈ A_1 and a_2 ∈ A_2.
Example 23.1 (prisoners' dilemma) The prisoners' dilemma is a bi-matrix game with the following value
or pay-off matrix:

( (−5, −5)   (0, −10) )
( (−10, 0)   (−1, −1) ).

Its interpretation is as follows: if both players choose action 2 (keeping silent), then they each get 1 year
of prison. If they both talk (action 1), then they each get 5 years. If one of them talks, then the one who
remains silent gets 10 years and the other one is released. The Nash equilibrium is given by a' = (1, 1),
while (2, 2) gives a higher pay-off for both players.
Two-player zero-sum games, also called matrix games, have optimal policies if we allow for
randomization in the actions. This result is due to von Neumann (1928). Let player 1 (2) be
maximizing (minimizing) the pay-off, let p_i be the policy of player i (thus a distribution on
A_i), and let Q be the pay-off matrix. Then the expected pay-off is given by p_1 Q p_2. Von Neumann
showed that

max_{p_1} min_{p_2} p_1 Q p_2 = min_{p_2} max_{p_1} p_1 Q p_2 :

the equilibrium value always exists and is unique, and knowing the opponent's distribution does not improve
the pay-off.
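As an illustration, the maximizing player's optimal mixed strategy can be computed with the standard LP formulation of a matrix game; the sketch below does this with scipy's linprog for the matching-pennies matrix, which is not an example from these notes.

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(Q):
    """Optimal mixed strategy of the maximizing (row) player of the matrix
    game Q, via the standard LP: max v s.t. p^T Q >= v, sum(p) = 1, p >= 0."""
    m, n = Q.shape
    # Variables: p_1..p_m and the game value v; linprog minimizes, so use -v.
    c = np.concatenate([np.zeros(m), [-1.0]])
    # For every column j:  v - sum_i p_i Q[i, j] <= 0
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.array([np.concatenate([np.ones(m), [0.0]])])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[m]

# Matching pennies as a test case: value 0, the row player mixes (1/2, 1/2).
Q = np.array([[1.0, -1.0], [-1.0, 1.0]])
p1, value = solve_matrix_game(Q)
print("row player's strategy:", p1, " game value:", value)
```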
We continue with games where the players play one by one, and we assume a zero-sum
setting. Good examples of these types of games are board games, see Smith [15] for an overview.
We assume that A_1 ∩ A_2 = ∅. An epoch consists of two moves, one of each player. The value
iteration equation (4) is now replaced by

V_{t+1}(x) = max_{a_1∈A_1} { r(x, a_1) + Σ_y p(x, a_1, y) min_{a_2∈A_2} [ r(y, a_2) + Σ_z p(y, a_2, z)V_t(z) ] }.
Quite often the chains are not unichain, but r(x, a) = 0 for all states except a number of different absorbing
states that correspond to winning or losing for player 1. The goal for player 1 is to reach a
winning end state by solving the above equation. Only for certain simple games can this equation
be solved completely; for games such as chess other methods have to be used.
Exercise 23.1 Determine the optimal policies for the matrix game with pay-off matrix

( 1  2 )
( 0  3 ).

Does this game have an equilibrium if we do not allow for randomization?
Exercise 23.2 Determine the optimal starting move for the game of Tic-tac-toe.
24 Disadvantages of average optimality
If there are multiple average optimal policies, then it might be interesting to consider also the
transient reward. Take the following example.
Example 24.1 X = A = {1, 2}, p(i, a, 2) = 1 for i ∈ X and a ∈ A, r(1, 1) = 10^6, all other rewards are 0.
Then all policies are average optimal, but action 1 in state 1 is evidently better.
We formalize the concept of "better" within the class of average optimal policies. Let R* be the
set of average optimal policies. Then the policy R ∈ R* that has the highest bias (= value function
normalized w.r.t. the stationary reward) is called bias optimal, a refinement of average optimality.
Example 24.2 Consider the model in the figure below, where the two arrow types distinguish action 1 from
action 2 and the numbers next to the arrows denote the transition probabilities. The direct rewards are given by r(0, 1) = 0,
r(1, 1) = 1, and r(1, 2) = 2/3. This model is communicating, with 2 average optimal policies and 1 bias
optimal policy, as we shall see next.
[Figure: states 0 and 1. In state 0 the single action leads to states 0 and 1 with probability 1/2 each; in state 1, action 1 leads to state 0 with probability 1, and action 2 leads to states 0 and 1 with probability 1/2 each.]
The optimality equation is as follows:

V(0) + g = (1/2)V(0) + (1/2)V(1);
V(1) + g = max{ 1 + V(0), 2/3 + (1/2)V(0) + (1/2)V(1) }.

The solutions are given by:

g = 1/3, V(0) = c, V(1) = 2/3 + c, c ∈ ℝ.
The maximum is attained by both actions 1 and 2, thus both possible policies (R_1 = (1, 1) and R_2 = (1, 2)) are
average optimal. Do both policies have the same bias?
The bias B^{R_i} is the solution of the optimality equation with, additionally, the condition Σ_x π^{R_i}(x)B^{R_i}(x) = 0.
This assures that the expected bias in stationarity is equal to 0. Reformulating the last
condition gives

B^{R_i} = V − <π^{R_i}, V> e,

with e the vector of ones. Let us calculate the bias. Take V = (0, 2/3); the stationary distributions under the two policies are
π^{R_1} = (2/3, 1/3) and π^{R_2} = (1/2, 1/2), thus B^{R_1} ≠ B^{R_2}: we find B^{R_1} = (−2/9, 4/9) and
B^{R_2} = (−1/3, 1/3), so R_1 is the bias optimal policy.
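The bias calculation can be checked numerically; the sketch below recomputes the stationary distributions and biases of R_1 and R_2 with numpy.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible transition matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

V = np.array([0.0, 2.0 / 3.0])          # a solution of the optimality equation
P1 = np.array([[0.5, 0.5], [1.0, 0.0]]) # policy R1: action 1 in state 1
P2 = np.array([[0.5, 0.5], [0.5, 0.5]]) # policy R2: action 2 in state 1

for name, P in [("R1", P1), ("R2", P2)]:
    pi = stationary(P)
    bias = V - pi @ V                    # B = V - <pi, V> e
    print(name, "pi =", pi, "bias =", bias)
```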
Next we formulate a method to determine the bias optimal policy. Let (g, V) be a solution of

V + g = max_{R: X→A} { r(R) + P(R)V },

and let R* = argmax{ r + PV } be the set of average optimal policies. We then have to look for a policy in
R* that minimizes <π^R, V>.

Average optimality also ignores the variability of the rewards around g^R. A measure for this variability is the average variance

σ²(R) = lim_{t→∞} E( r(X_t(R), A_t(R)) − g^R )².
In terms of state-action frequencies:

σ²(R) = Σ_{x,a} ( r(x, a) − g^R )² π^R(x, a).

Because g^R = Σ_{x,a} r(x, a)π^R(x, a), we find

σ²(R) = Σ_{x,a} [ r²(x, a) − 2r(x, a)g^R + (g^R)² ] π^R(x, a) = Σ_{x,a} r²(x, a)π^R(x, a) − ( Σ_{x,a} r(x, a)π^R(x, a) )².
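Given the state-action frequencies, g^R and σ²(R) are straightforward to evaluate; the numbers in the sketch below are purely hypothetical.

```python
import numpy as np

# Hypothetical state-action frequencies pi(x, a) and rewards r(x, a)
# for a 2-state, 2-action chain (rows: states, columns: actions).
pi = np.array([[0.5, 0.0],
               [0.2, 0.3]])
r = np.array([[1.0, 0.0],
              [2.0, 4.0]])

g = np.sum(r * pi)                       # average reward
sigma2 = np.sum(r**2 * pi) - g**2        # average variance
print("g =", g, " sigma^2 =", sigma2)
```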
Now we have a multi-criteria decision problem. We have the following possibilities for
choosing the objective:
- max_R { g^R | σ²(R) ≤ α };
- min_R { σ²(R) | g^R ≥ α };
- max_R { g^R − γ σ²(R) } for some weight γ > 0.
Note that all are risk-averse, i.e., the utility is concave. All problems are constrained Markov
decision problems with either a quadratic objective or a quadratic constraint, thus we have to
rely on mathematical programming to solve these problems.
Exercise 24.1 Consider a discrete-time Markov decision process with states {0, 1, 2}, 2 ac-
tions in state 0 and 1 action in states 1 and 2, and transition probabilities p given by p(0, 1, 1) = p(0, 2, 2) = 2/3,
p(0, 1, 2) = p(0, 2, 1) = 1/3, p(1, 1, 0) = p(2, 1, 0) = 1, and rewards r given by r(1, 1) = 1,
r(0, 2) = 1/3, r(0, 1) = r(2, 1) = 0.
Compute the average reward, the bias and the average variance for both possible policies.
Which policy would you prefer?
25 Monotonicity
For several reasons it can be useful to show certain structural properties of value functions.
This can be used to characterize (partially) optimal policies, or it can be used as a first step in
comparing the performance of different systems.
Admission control
Consider the value function of the M/M/1 queue with admission control (assuming that λ + μ ≤ 1),
with rejection costs r and direct costs C(x) if there are x customers in the system:

V_{n+1}(x) = C(x) + λ min{ r + V_n(x), V_n(x + 1) } + μ V_n((x − 1)^+) + (1 − λ − μ)V_n(x).

A threshold policy is a policy that admits customers up to a certain threshold value; above that
value customers are rejected. Whether or not the optimal policy for a certain n + 1 is a threshold
policy depends on the form of V_n. The next theorem gives a sufficient condition.
Theorem 25.1 If V_n is convex, then a threshold policy is optimal for V_{n+1}.

Proof V_n convex means:

2V_n(x + 1) ≤ V_n(x) + V_n(x + 2) for all x ≥ 0,

and thus

V_n(x + 1) − r − V_n(x) ≤ V_n(x + 2) − r − V_n(x + 1) for all x ≥ 0.

If rejection is optimal in x, thus 0 ≤ V_n(x + 1) − r − V_n(x), then also 0 ≤ V_n(x + 2) − r − V_n(x + 1), and
thus rejection is also optimal in x + 1.
Theorem 25.2 If C and V_0 are convex and increasing (CI), then V_n is CI for all n.

Proof By induction on n. Suppose V_n is CI. This means

V_n(x) ≤ V_n(x + 1) for all x ≥ 0,
2V_n(x + 1) ≤ V_n(x) + V_n(x + 2) for all x ≥ 0.

These inequalities can be used to show that the same inequalities hold for V_{n+1}.
Corollary 25.3 If C is CI, then a threshold policy is optimal.

Proof V_n(x) − V_n(0) converges as n → ∞ to a solution V of the optimality equation; thus V is CI, and
therefore the average optimal policy is of threshold type.

A standard choice is C(x) = cx, which amounts to holding costs of c for each unit of time that a
customer is in the system. If we take C(x) = c(x − 1)^+, then only the customers in the queue count.
Note that c(x − 1)^+ is convex (if c ≥ 0).
Remark 25.4 This result holds also for the M/M/s queue.
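The threshold structure can also be observed numerically. The following sketch runs (relative) value iteration for the admission-control model with illustrative parameter values (λ = 0.4, μ = 0.5, r = 5, C(x) = x) and a truncated state space, and reports the resulting threshold; the truncation and the parameter choices are assumptions of the sketch.

```python
import numpy as np

# Value iteration for the M/M/1 queue with admission control, uniformized
# so that lam + mu <= 1.  Illustrative parameter values.
lam, mu, rej = 0.4, 0.5, 5.0         # arrival rate, service rate, rejection cost
N = 50                               # truncation of the state space
c = 1.0                              # holding costs: C(x) = c*x

V = np.zeros(N + 1)
for n in range(2000):
    Vnew = np.empty_like(V)
    for x in range(N + 1):
        admit = V[min(x + 1, N)]     # cost-to-go if the arrival is admitted
        reject = rej + V[x]          # cost-to-go if it is rejected
        Vnew[x] = (c * x + lam * min(admit, reject)
                   + mu * V[max(x - 1, 0)] + (1 - lam - mu) * V[x])
    V = Vnew - Vnew[0]               # normalize to keep the values bounded

# The optimal decision in state x: admit iff V(x+1) <= r + V(x).
policy = ["admit" if V[min(x + 1, N)] <= rej + V[x] else "reject"
          for x in range(N)]
threshold = policy.index("reject") if "reject" in policy else N
print("admit up to state", threshold - 1, "; reject from state", threshold)
```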
A server-assignment model
Consider next a model with m classes of customers, arrival rates λ_i and service rates μ_i
(with μ = max_i μ_i and Σ_i λ_i + μ ≤ 1), a single server, and dynamic preemptive server assignment.
The value function is as follows:

V_{n+1}(x) = C(x) + Σ_i λ_i V_n(x + e_i) + min_i { μ_i V_n((x − e_i)^+) + (μ − μ_i)V_n(x) } + (1 − Σ_i λ_i − μ)V_n(x).
Theorem 25.5 If C and V_0 satisfy

μ_i f(x − e_i) + (μ − μ_i) f(x) ≤ μ_j f(x − e_j) + (μ − μ_j) f(x)

for all x and i < j with x_i > 0 and x_j > 0, and

f(x) ≤ f(x + e_i)

for all x and i, then so do V_n for all n > 0.
Corollary 25.6 Under the above conditions a preemptive priority policy is optimal. In the spe-
cial case that C(x) = Σ_i c_i x_i the conditions are equivalent to c_i ≥ 0 and μ_1 c_1 ≥ ··· ≥ μ_m c_m,
resulting in the so-called μc-rule.

With this method many one- and two-dimensional systems can be analyzed, but few multi-
dimensional ones.
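For the two-class case the μc-rule can be checked numerically. The sketch below runs value iteration for illustrative parameters with μ_1 c_1 ≥ μ_2 c_2 and a truncated state space, and prints which class the optimal policy serves when both are present; all numbers are assumptions of the sketch.

```python
import numpy as np
from itertools import product

# Two customer classes; illustrative rates with lam1 + lam2 + mu <= 1
# after uniformization (mu = max_i mu_i).
lam = [0.15, 0.15]
mu_i = [0.5, 0.3]
c = [1.0, 1.0]                      # holding costs; here mu_1*c_1 >= mu_2*c_2
mu = max(mu_i)
N = 15                              # truncate each queue at N customers

V = np.zeros((N + 1, N + 1))
for n in range(1000):
    Vnew = np.zeros_like(V)
    for x1, x2 in product(range(N + 1), repeat=2):
        x = np.array([x1, x2])
        cost = c[0] * x1 + c[1] * x2
        arr = sum(lam[i] * V[tuple(np.minimum(x + np.eye(2, dtype=int)[i], N))]
                  for i in range(2))
        # Serving class i: departure with rate mu_i, dummy event otherwise.
        serve = [mu_i[i] * V[tuple(np.maximum(x - np.eye(2, dtype=int)[i], 0))]
                 + (mu - mu_i[i]) * V[x1, x2] for i in range(2)]
        Vnew[x1, x2] = cost + arr + min(serve) + (1 - sum(lam) - mu) * V[x1, x2]
    V = Vnew - Vnew[0, 0]

# Check the decision when both classes are present: the mu*c-rule says class 1.
x1, x2 = 3, 3
x = np.array([x1, x2])
serve = [mu_i[i] * V[tuple(np.maximum(x - np.eye(2, dtype=int)[i], 0))]
         + (mu - mu_i[i]) * V[x1, x2] for i in range(2)]
print("serve class", int(np.argmin(serve)) + 1, "in state", (x1, x2))
```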
Remark 25.7 Note that other cost functions might lead to the same optimal policy. Whether or not this
is the case can be checked by verifying whether the conditions hold for the chosen cost function.
Exercise 25.1 Consider the M/M/s/s queue with admission control and two classes of cus-
tomers. Both classes have the same average service time, but differ in the reward per admitted
customer.
a. Formulate the backward recursion value function.
b. Is the value function concave or convex? Show it.
c. Consider a similar system, but with a reward for each finished customer. How could you analyze
that?
Exercise 25.2 Consider 2 parallel single-server queues with exponential service times, with
equal rates. Customers arrive according to a Poisson process and are assigned in a dynamic
way to the 2 queues. The objective is to minimize the average number of customers in the sys-
tem.
a. Formulate the backward recursion equation.
b. Which relation on V_n must hold for shortest-queue routing to be optimal?
c. Show that the value function is symmetric (i.e., V_n(x, y) = V_n(y, x)).
d. Prove by induction on n that shortest-queue routing is optimal.
26 Incomplete information
The curse of dimensionality is not the only reason why Markov decision theory (or mathematical
modeling in general) is little used in practice. There are two more reasons, which are both related
to the same general concept of incomplete or partial information. The first is that our methods
require all parameters to be given. This is in many cases an unrealistic assumption: parameters
vary, depending on many other known and unknown underlying parameters. Therefore it is often
not possible to give reliable estimates of parameters.
Example 26.1 A crucial parameter when optimizing a call center is the rate at which customers arrive.
This parameter depends on many variables, some of which are known at the moment the planning is done
(time of day, week of the year, internal events influencing the rate such as advertisement campaigns, etc.)
and some of which are unknown (mainly external events such as weather conditions). This estimation
issue is of major importance to call centers.
Next to unknown parameters it might occur that the state is not (completely) observed. In
principle the observation at t can be any random function of the whole history of the system (thus
including models with delay in observations), but usually it is a random function of the state at t.
Example 26.2 In a telecommunication network decisions are made at the nodes of the system. In each
node we often have delayed and incomplete information on the other nodes. How do we make, for example,
decisions concerning the routing of calls?
Example 26.3 The state of a machine deteriorates when it is functioning, but we only know whether it
is up or down. Timely preventive maintenance prevents expensive repairs after failure. When to schedule
preventive maintenance? What are the advantages of condition monitoring?
Example 26.4 In most card games we have partial information on the other hands.
The standard method in the case of unknown parameters is to observe the system first, esti-
mate the parameters (the learning phase), and then control the system on the basis of the estimates
(the control phase). The disadvantages are that we do not improve the system during the learning
phase, and that we do not improve the parameter estimates during the control phase. This method
performs even worse if the parameters change over time, which is the rule in practice. Thus we need
more sophisticated methods.
There are several methods, each with its own advantages and disadvantages. Crucial
to all these methods is that they do not first estimate and then control on the basis of these
estimates, but that, while the system is being controlled, the parameter estimates are improved
and with that the decisions. The simplest method, useful for example in the case of unknown
arrival parameters, consists of a standard statistical estimation procedure giving the most likely
value, followed by the execution of, for example, backward recursion at each time epoch using the
most recent estimates. There are other methods in which the estimation and optimization steps
cannot so easily be separated. One is Q-learning, in which the value function is updated using
ideas from stochastic approximation. It is useful in the case that nothing is known about the
transition probabilities; it is a method that makes no initial assumptions. A mathematically more
sophisticated and numerically more demanding method is Bayesian dynamic programming, for
which an initial (prior) distribution of the unknown parameters is needed.
Note that approximate dynamic programming can also be used for problems with incomplete
information. We made no assumptions on the transition structure, and therefore it can also be
used for the partial-information case. As for Q-learning, no initial distribution is needed.
Remark 26.5 If we consider just the long-run average reward, then learning over a (very) long period and
then controlling is (almost) optimal. The disadvantage is the loss of reward during the learning phase; see
the discussion of the disadvantages of the average reward criterion in Section 24. For this reason we mainly
look at discounted rewards. Another problem is that of parameters varying slowly over time. Then
we can never stop learning, and learning and control have to be done simultaneously.
Exercise 26.1 Consider an M/M/1 queue with admission control, thus arriving customers may
be rejected. Every customer in the queue incurs holding costs of 1 per unit of time, but every admitted
customer gives a reward r. The objective is to maximize the discounted revenue minus holding
costs. The arrival rate is 1, but the service rate is unknown.
a. Give a procedure to estimate the parameter of the service-time distribution on the basis of the
realizations up to now.
b. How would you use this to design a control algorithm for this model?
c. Implement a computer program that repeats the following experiment a number of times: draw
the service rate from a uniform distribution on [0, 2], and simulate the queue with admission control
using the algorithm of part b. Report on it for a few choices of the parameters.
d. Compare the results of c to the model where the service rate is known from the beginning.
Q-learning
Q-learning is a method that is suitable for systems of which the state is observed, but no infor-
mation is known about the transition probabilities, not even the structure.
We will work with a model with discounting (with discount factor β ∈ [0, 1)), in discrete time.
We allow the reward to depend on the new state, thus r is of the form r(x, a, y), with x the current
state, a the action, and y the new state.
The value function is then given by:

V(x) = max_a Σ_y p(x, a, y)[ r(x, a, y) + βV(y) ].

Define

Q(x, a) = Σ_y p(x, a, y)[ r(x, a, y) + βV(y) ].

Then

Q(x, a) = Σ_y p(x, a, y)[ r(x, a, y) + β max_{a'} Q(y, a') ].

This gives an alternative way of calculating optimal policies.
Now we move to systems with unknown parameters, and we assume that a simulator exists:
we have a system for which it is possible to draw transitions according to the correct probability
laws, but it is too time-consuming to calculate the transition probabilities (otherwise standard value
iteration would be preferable).
Let ξ_n(x, a) ∈ X be the outcome of the simulation at stage n, for state x and action a. Let b_n
be a sequence with

Σ_n b_n = ∞ and Σ_n b_n^2 < ∞.
The Q-learning algorithm works as follows:

Q_{n+1}(x, a) = (1 − b_n)Q_n(x, a) + b_n [ r(x, a, ξ_n(x, a)) + β max_{a'} Q_n(ξ_n(x, a), a') ].

Then Q_n → Q, and thus in the limit the algorithm gives the optimal actions.
Remark 26.6 If the system is controlled in real time, then the algorithm is executed in an asynchronous
way: only Q(x, a) is updated for the current x and a. This has the risk that certain (x, a) combinations never
get updated. Therefore sub-optimal actions should also be chosen to give the algorithm the possibility to
learn on all (x, a) combinations.
Remark 26.7 Why did we take Σ_n b_n = ∞ and Σ_n b_n^2 < ∞? If Σ_n b_n < ∞ then there is no guaranteed
convergence to the right value (take, e.g., b_n ≡ 0). On the other hand, if Σ_n b_n^2 = ∞, then there need not be
convergence at all (take, e.g., b_n ≡ 1). The usual choice of b_n is b_n = 1/n.
See also the Robbins-Monro stochastic approximation algorithm.
Bertsekas & Tsitsiklis [4] gives more information on Q-learning.
Exercise 26.2 Consider the model of Exercise 26.1. Is it useful to apply Q-learning to this
model? Motivate your answer.
Bayesian dynamic programming
In Bayesian dynamic programming we make a distinction between the state of the model (with
state space X) and the state of the algorithm (the information state) that represents the informa-
tion we have. The information state is actually a distribution on the set of model states. This
information state holds all information from all observations up to the current time. Thus it is ob-
servable, it depends only on the observations, and the Markov property holds: all information
about the past is used. Thus (see Section 17) the information state can be used to solve the
problem.
Starting with some well-chosen prior, the information state is updated every time unit using
Bayes' rule. This gives the optimal policy, given the initial distribution. Note however that
for state space X the information state space is given by [0, 1]^{|X|}, thus an |X|-dimensional state
space, with each variable a probability. Thus the information state space is not even countable, and we will certainly have
to discretize it to make computations possible. Even for small state spaces and a
coarse discretization of the interval [0, 1] this quickly leads to infeasible state space sizes.
Let us formalize the framework. Next to the state space X we have an observation space
Z. With q(x, z) we denote the probability of observing z ∈ Z in x ∈ X. Next we define the
information states, which are thus distributions on X: P = [0, 1]^X (in case X is countable).
Consider u ∈ P. For each z ∈ Z there is a transition to a new state v ∈ P; for observation z
the distribution v is defined by

v(y) = P(now at y | before at u, a chosen, z observed) = Σ_x u(x)p(x, a, y)q(y, z) / Σ_{x,y} u(x)p(x, a, y)q(y, z),

and thus the transition probabilities are given by

p'(u, a, v) = Σ_{x,y} u(x)p(x, a, y)q(y, z).

Here we assumed that for u different observations lead to different v; if this is not the case then
the corresponding transition probabilities should be added.
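The update of the information state is a standard Bayes step; the sketch below implements the formula for v(y) and p'(u, a, v) for a hypothetical model with two model states, one action and two observations.

```python
import numpy as np

def bayes_update(u, a, z, p, q):
    """One information-state update: prior u over X, action a, observation z.
    p[x, a, y] are the transition probabilities, q[y, z] the observation
    probabilities.  Returns the posterior v and the probability of observing z."""
    joint = (u[:, None] * p[:, a, :]).sum(axis=0) * q[:, z]   # over next states y
    prob_z = joint.sum()                                       # p'(u, a, v)
    return joint / prob_z, prob_z

# A tiny hypothetical example: 2 model states, 1 action, 2 observations.
p = np.array([[[0.7, 0.3]],
              [[0.2, 0.8]]])          # p[x, 0, y]
q = np.array([[0.9, 0.1],
              [0.3, 0.7]])            # q[y, z]
u = np.array([0.5, 0.5])              # current information state

v, prob = bayes_update(u, a=0, z=1, p=p, q=q)
print("posterior v =", v, " P(observe z=1) =", prob)
```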
Example 26.8 A service center is modeled as a single-server queue. We have the option to send customers
to another center when the queue gets too long (admission control). We assume that the capacity of
the queue is N: above N customers balk automatically. The partial-information aspect is that we do
not observe the queue; we only observe whether or not the server is busy. The state is the number of
customers in the system, X = {0, 1, 2, . . . , N}, and the action set is A = {0, 1}, representing rejection and admission.
The observation is 1 (0) if the server is busy (idle), Z = {0, 1}, and q(x, 1) = 1 for x > 0, q(0, 0) = 1.
Every information state is a vector with N + 1 probabilities. Let u ∈ P, and let λ and μ be the uniformized transition
parameters, λ + μ = 1. Then

p'(u, a, e_0) = I{a = 0}λu(0) + μ(u(0) + u(1)).

The second possible transition is p'(u, a, v) = 1 − p'(u, a, e_0), with

v(y) = [ I{a = 0}λu(y) + I{a = 1}λu(y − 1) + μu(y + 1) ] / [ 1 − I{a = 0}λu(0) − μ(u(0) + u(1)) ]

for y > 0 (we assume that a = 0 is chosen in state N). It can probably be shown that v is stochastically
increasing in the number of times that 1 has been observed since the last observation of 0. From this we conclude
that an optimal policy rejects from a certain number of consecutive observations of 1 on. See [11] for a
similar result.
Example 26.9 Suppose there are several different medical treatments for a certain illness, each with an un-
known success probability. How do we decide which treatment to use for each patient?
At first sight this can be translated into a one-state model. However, the reward depends on the success
probability of the chosen treatment. Thus the success probabilities should be part of the state space.
Suppose there are two possible treatments; then the state is represented by the tuple (θ_1, θ_2), giving the
success probabilities of the two treatments. However, this state is unobserved: instead we have as
information state a tuple of distributions on [0, 1]. Each time unit a new patient arrives, and the question is how
to treat this patient. This question can be answered by solving a Markov decision problem with as state
space all possible tuples of independent distributions on [0, 1]. Thus X = [0, 1]^2, A = {1, 2} (the treatment
to be used), and Z = {0, 1} (the result of the treatment, where 1 means success). Because X is a continuous set,
information states are densities: P = (ℝ_+^{[0,1]}, ℝ_+^{[0,1]}), tuples of densities, both on the probabilities in [0, 1].
Take u = (u_1, u_2). For action a only u_a is updated. Consider a = 1 (the case a = 2 is equivalent). Then a
success occurs with probability ∫_0^1 x u_1(x)dx, and the resulting information state is v = (v_1, v_2) with v_2 = u_2
and v_1 defined by

v_1(y) = y u_1(y) / ∫_0^1 x u_1(x)dx

in case of a success (which thus occurs with probability ∫_0^1 x u_1(x)dx), and

v_1(y) = (1 − y) u_1(y) / ∫_0^1 (1 − x) u_1(x)dx

in case of a failure (which occurs with probability ∫_0^1 (1 − x) u_1(x)dx).
Using this in an optimization algorithm is computationally infeasible; therefore we have to look for a
method to reduce the size of the state space. This we will discuss next. See [5] for more information on
this example.
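A discretized sketch of the update of u_1 in Example 26.9 is given below, starting from the uniform prior. It already hints at the Beta-distribution result that follows: after one success the posterior is proportional to y, with mean about 2/3. The grid size is an arbitrary choice of the sketch.

```python
import numpy as np

# Discretized information state for one treatment: a density on a grid of
# success probabilities, starting from the uniform prior.
grid = np.linspace(0.0, 1.0, 1001)
dx = grid[1] - grid[0]
u1 = np.ones_like(grid)
u1 /= u1.sum() * dx                        # normalize the density

def update(u, success):
    """Bayesian update of the density after one observed treatment outcome."""
    w = u * grid if success else u * (1.0 - grid)
    return w / (w.sum() * dx)

p_success = (grid * u1).sum() * dx         # probability of a success (1/2 here)
v1 = update(u1, success=True)
print("P(success) =", p_success)
print("posterior mean after one success:", (grid * v1).sum() * dx)  # about 2/3
```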
In the examples we saw that in partial-information problems the state space very quickly
becomes so big that direct computation of optimal policies is infeasible. In the first example we
saw that instead of working with all distributions we can work with the time since the last moment at
which the system was observed to be empty. This is a much simpler representation of the state space. For systems with
unknown parameters such a simpler representation of the information states sometimes exists
as well. In such cases the densities that occur as information states fall into a certain class of
parametrized families, such as the Beta distributions. The crucial property is that the family
should be closed under the Bayesian update. We will show that this is the case in the setting of
Example 26.9.
Let us introduce the class of Beta distributions. A Beta(k, l) distribution has density f(x) ∝ x^k (1 − x)^l.
Note that for k = l = 0 we find the uniform distribution on [0, 1].
Theorem 26.10 Consider a Bernoulli random variable X whose parameter Θ has a Beta(k, l)
distribution. Then Θ | X = 1 (the a posteriori distribution) has a Beta(k + 1, l) distribution, and Θ | X =
0 has a Beta(k, l + 1) distribution.
Proof We have:

P(x ≤ Θ ≤ x + h | X = 1) = P(x ≤ Θ ≤ x + h, X = 1) / P(X = 1) = P(x ≤ Θ ≤ x + h) P(X = 1 | x ≤ Θ ≤ x + h) / P(X = 1).

Dividing by h and taking the limit as h → 0 gives (with f_Θ denoting the density of Θ):

f_{Θ|X=1}(x) = f_Θ(x) P(X = 1 | Θ = x) / P(X = 1) = x f_Θ(x) / P(X = 1),
because f