Chapter 8

Markov Decision Processes

Martin L. Puterman*
Faculty of Commerce, The University of British Columbia, Vancouver, B.C., Canada V6T 1Y8

© Elsevier Science Publishers B.V. (North-Holland) 1990
1. Introduction
* This research has been supported by Natural Sciences and Engineering Research Council
(Canada) Grant A-5527.
2. Problem formulation
This section defines the basic elements of a Markov decision process and
presents notation.
There are two consequences of choosing action a when the system is in state
s at time t; the decision maker receives an immediate reward and the
probability distribution for the state of the system at the next stage is
determined. The reward is denoted by the real valued function r_t(s, a); when it
is positive it can be thought of as income and when negative as cost. In some
applications, it is convenient to think of r_t(s, a) as the expected reward received
at time t. This will be the case if the reward for the current period depends on
the state of the system at the next decision epoch. In such situations r_t(s, a, j)
is the reward received in period t if the state of the system at time t is s, action
a ∈ A_s is selected and the system is in state j at time t + 1. Then the expected
reward in period t is

    r_t(s, a) = Σ_{j∈S_{t+1}} r_t(s, a, j) p_t(j|s, a),

where p_t(j|s, a) is defined below. The example in Section 3.2 illustrates this
alternative.
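A minimal computational sketch of this alternative reward specification follows (in Python; the function name and the illustrative numbers are assumptions made here, not data from the chapter). It forms r_t(s, a) by averaging the next-state dependent rewards r_t(s, a, j) with the transition probabilities p_t(j|s, a).

    # Sketch: computing the expected one-period reward r_t(s, a) from the
    # next-state-dependent rewards r_t(s, a, j), as in the formula above.
    # The numbers here are purely illustrative.

    def expected_reward(r_saj, p_sa):
        """r_saj[j] = reward if next state is j; p_sa[j] = p_t(j | s, a)."""
        return sum(r_saj[j] * p_sa[j] for j in p_sa)

    p_sa = {"s1": 0.3, "s2": 0.7}          # p_t(j | s, a)
    r_saj = {"s1": 10.0, "s2": -2.0}       # r_t(s, a, j)
    print(expected_reward(r_saj, p_sa))    # 0.3*10 - 0.7*2 = 1.6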
The function p_t(j|s, a) denotes the probability that the system is in state
j ∈ S_{t+1} if action a ∈ A_{s,t} is chosen in state s at time t; p_t(j|s, a) is called the
transition probability function. When S_t is not discrete, p_t(j|s, a) is a density, if
it exists; otherwise the problem formulation is in terms of a distribution
function. In most applications it is convenient to assume that

    Σ_{j∈S_{t+1}} p_t(j|s, a) = 1.    (2.1)
Similarly, r_t(s, d(s)) and p_t(j|s, d(s)) denote the reward and the transition
probability function if the system is in state s and the action corresponding to
decision rule d(s) is used. Note that if d is a randomized decision rule, then

    r_t(s, d(s)) = Σ_{a∈A_{s,t}} q_{d(s)}(a) r_t(s, a)

and

    p_t(j|s, d(s)) = Σ_{a∈A_{s,t}} q_{d(s)}(a) p_t(j|s, a),

where q_{d(s)}(a) denotes the probability that the randomized decision rule d selects action a in state s.
3. Examples
This section presents two very simple examples of Markov decision processes. The reader is referred to White (1985b) for a recent survey of
applications of MDP's.
Fig. 3.1. Symbolic representation of the two state Markov process.
The above problem is stationary, i.e., the set of states, the sets of actions,
the rewards and transition probabilities do not depend on the stage in which
the decision is made. Thus, the t subscript on these quantities is unnecessary
and will be deleted in subsequent references to the problem. A formal
description follows.
Decision epochs:
T = {1, 2, ..., N},  N ≤ ∞.
States:
S_t = {s_1, s_2},  t ∈ T.
Actions:
Rewards:
Transition probabilities:
The functions c(u) and h(u) are increasing in u. For finite horizon problems,
the inventory on hand after the last decision epoch has value g(u). Finally, if j
T = {1, 2, ..., N},  N ≤ ∞.

A_{s,t} = {0, 1, 2, ..., M − s},  t = 1, 2, ..., N.

r_{N+1}(s, a) = g(s),  t = N + 1.

p_t(j|s, a) = 0            if M ≥ j > s + a,
            = p_{s+a−j}    if M ≥ s + a ≥ j > 0,
            = q_{s+a}      if j = 0 and s + a ≤ M,

where

q_{s+a} = P{D_t ≥ s + a} = Σ_{d=s+a}^∞ p_d.
If the demand exceeds s + a units, then the inventory at the start of period
t + 1 is 0 units. This occurs with probability q_{s+a}. Finally, the probability that
the inventory level at the start of period t + 1 exceeds s + a units is 0, since demand is non-negative.
As a consequence of assumption (b) above, the inventory on hand through-
out the month is s + a so that the total monthly holding cost is h(s + a). If
instead, the demand is assumed to arrive at the beginning of a month h(s + a)
is the expected holding cost.
The decision sets consist of all rules which assign the quantity of inventory to
be ordered each month to each possible starting inventory position in a month.
A policy is a sequence of such ordering rules. An example of a decision rule is:
order only if the inventory level is below 3 units at the start of the month and
order the quantity which raises the stock level to 10 units. In month t this
decision rule is given by:
d_t(s) = 10 − s,  s < 3,
       = 0,       s ≥ 3.

Such a policy is called an (s, S) policy (see Chapter 12 for more details).
A numerical example is now provided. It will be solved in subsequent
sections using dynamic programming methods. The data for the problem are as
follows: K = 4, c(u) = 2u, g(u) = 0, h(u) = u, M = 3, N = 3, f(u) = 8u and

p_d = 1/4  if d = 0,
    = 1/2  if d = 1,
    = 1/4  if d = 2.
Table 3.1

u    F(u)
0    0
1    0 × 1/4 + 8 × 3/4 = 6
2    0 × 1/4 + 8 × 1/2 + 16 × 1/4 = 8
3    0 × 1/4 + 8 × 1/2 + 16 × 1/4 = 8
Combining the expected revenue with the ordering and holding costs gives
the expected profit in period t if the inventory level is s at the start of the
period and an order for a units is placed. If a = 0, the ordering and holding cost
equals s and if a is positive, it equals 4 + s + 3a. The results are summarized in Table 3.2.
Table 3.2
        r_t(s, a)                           p_t(j|s, a)
        a=0    a=1    a=2    a=3            j=0    j=1    j=2    j=3
s = 0   0      -1     -2     -5    s+a=0    1      0      0      0
s = 1   5      0      -3     x     s+a=1    3/4    1/4    0      0
s = 2   6      -1     x      x     s+a=2    1/4    1/2    1/4    0
s = 3   5      x      x      x     s+a=3    0      1/4    1/2    1/4
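The following Python sketch tabulates r(s, a) and p(j|s, a) for this numerical example. The helper names (order_cost, expected_revenue, reward, transition) are introduced here only for illustration and are not the chapter's notation; the data assumed are those stated above (K = 4, c(u) = 2u, h(u) = u, f(u) = 8u, M = 3, demand 0, 1, 2 with probabilities 1/4, 1/2, 1/4). Its output should agree with Table 3.2.

    from fractions import Fraction as F

    # Sketch: tabulating r(s, a) and p(j | s, a) for the numerical inventory
    # example of this subsection.
    M = 3
    p_demand = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}

    def order_cost(a):            # K + c(a) for a positive order, 0 otherwise
        return 4 + 2 * a if a > 0 else 0

    def expected_revenue(u):      # F(u) = E[f(min(u, D))] with f(u) = 8u
        return sum(8 * min(u, d) * q for d, q in p_demand.items())

    def reward(s, a):             # r(s, a) = F(s + a) - ordering cost - holding cost
        return expected_revenue(s + a) - order_cost(a) - (s + a)

    def transition(j, s, a):      # p(j | s, a): inventory falls by the demand
        u = s + a
        if j == 0:
            return sum(q for d, q in p_demand.items() if d >= u)
        return p_demand.get(u - j, F(0))

    for s in range(M + 1):
        for a in range(M + 1 - s):
            print(s, a, reward(s, a), [transition(j, s, a) for j in range(M + 1)])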
4. The finite horizon case
This section presents and analyzes finite horizon, discrete time Markov
decision problems. It introduces a concept of optimality and discusses the
structure of optimal policies and their computation. The Principle of Optimality which underlies the backward induction procedure is shown to be the basis
for analysis. The section concludes with a numerical example.
4.1. Optimality criteria
Each policy yields a stream of random rewards over the decision making
horizon. In order to determine which policy is best, a method of comparing
these reward streams is necessary. Most of the dynamic programming literature
assumes that the decision maker has a linear, additive and risk neutral utility
function over time and uses expected utility as an evaluation function. Con-
sequently, the expected total reward over the decision making horizon is used
for reward stream evaluation and comparison.
The results in this section require a formal definition of the history of a
Markov decision process. Let H_t denote the history up to epoch t, t =
1, 2, ..., N + 1. Define A_t = ×_{s∈S_t} A_{s,t}. Then

    H_1 = {S_1},                                              (4.1a)
    H_t = {S_1, A_1, S_2, ..., A_{t−1}, S_t}
        = {H_{t−1}, A_{t−1}, S_t},   t = 2, ..., N + 1.        (4.1b)
The expected total reward of policy π = (d_1, ..., d_N) when the initial state is s is

    v_N^π(s) = E_{π,s} { Σ_{t=1}^N r_t(X_t, d_t(H_t)) + r_{N+1}(X_{N+1}) },    (4.2)

where E_{π,s} denotes expectation with respect to the joint probability distribution
of the stochastic process determined by π conditional on the state of the system
prior to the first decision being s. If the policy is randomized, this distribution
also takes into account the realization of the action selection process at each
decision epoch.
Under the assumption that r_t(s, a) is bounded for (s, a) ∈ S_t × A_{s,t}, v_N^π(s)
exists and is bounded for each π ∈ Π and each N < ∞. If rewards are
discounted, that is, a reward received in a subsequent period is worth less than a
reward received in the current period, a discount factor λ^{t−1}, 0 < λ < 1, is
included inside the summation in (4.2). This will not alter any results in this
section but will be important in the infinite horizon case.
The decision maker's objective is to specify (at decision epoch 1) a policy
π ∈ Π with the largest expected total reward. When both S_t and A_{s,t} are finite
there are only finitely many policies so such a policy is guaranteed to exist and
can be found by enumeration. In this case, the decision maker's problem is that
of finding a π* with the property that

    v_N^{π*}(s) = max_{π∈Π} v_N^π(s) ≡ v_N^*(s),   s ∈ S_1.    (4.3)

The policy π* is called an optimal policy and v_N^*(s) is the optimal value function
or value of the finite horizon Markov decision problem. Theory in the finite
horizon case is concerned with characterizing π* and computing v_N^*(s).
When the problem is such that the maximum in (4.3) is not attained, the
maximum is replaced by a supremum and the value of the problem is given by

    v_N^*(s) = sup_{π∈Π} v_N^π(s),   s ∈ S_1.    (4.4)

In such cases, the decision maker's objective is to find an ε-optimal policy, that
is, for any ε > 0, a policy π* with the property that

    v_N^{π*}(s) + ε > v_N^*(s),   s ∈ S_1.
For a policy π and any history h_t ∈ H_t, define the expected total reward from decision epoch t onward by

    u_t^π(h_t) = E_{π,h_t} { Σ_{n=t}^N r_n(X_n, d_n(H_n)) + r_{N+1}(X_{N+1}) }.    (4.5)
The idea leading to equation (4.6) for general t is as follows. The expected
value of policy π over periods t, t + 1, ..., N + 1 if the history at epoch t is h_t
is equal to the immediate reward received if action d_t(h_t) is selected plus the
expected reward over the remaining periods. The second term contains the
product of the probability of being in state j at epoch t + 1 if action d_t(h_t) is
used, and the expected reward obtained using policy π over periods t + 1,
..., N + 1 if the history at epoch t + 1 is h_{t+1} = (h_t, d_t(h_t), j). Summing over
all possible j gives the desired expectation expressed in terms of u_{t+1}^π instead of
in terms of the reward functions and conditional probabilities required to
explicitly write out (4.5).
The quantity u_t^*(h_t) is the supremal return over the remainder of the decision
horizon when the history up to time t is h_t. When minimizing costs instead of
maximizing rewards this is sometimes called a cost-to-go function (Bertsekas,
1987).
The optimality equations of dynamic programming are the fundamental
entities in the theory of Markov decision problems. They are often referred to
as functional equations or Bellman equations and are the basis for the backward
induction algorithm. They are given by
    u_t(h_t) = sup_{a∈A_{s_t,t}} { r_t(s_t, a) + Σ_{j∈S_{t+1}} p_t(j|s_t, a) u_{t+1}(h_t, a, j) },   t = 1, ..., N.    (4.8)

Theorem 4.1. Suppose u_t, t = 1, ..., N, are solutions of (4.8) and u_{N+1}(h_{N+1}) = r_{N+1}(s_{N+1}) for all h_{N+1} ∈ H_{N+1}. Then
(a) u_t(h_t) = u_t^*(h_t) for all h_t ∈ H_t, t = 1, ..., N + 1, and
(b) u_1(s_1) = v_N^*(s_1) for all s_1 ∈ S_1.
Result (a) means that solutions of the optimality equation are the optimal
value functions from period t onward for each t and result (b) means that the
solution to the first equation is the value function for the MDP. Note that no
assumptions have been imposed on the state space and the result is valid
whenever the summation in (4.8) is defined. In particular, the results hold for
finite and countable state problems.
Result (b) is the statement that the optimal value from epoch 1 onward is the
optimal value function for the N period problem. It is an immediate con-
sequence of (a). The proof of (a) is based on the backward induction argument;
it appears in the references above.
The next theorem shows how the optimality equation can be used to find
optimal policies when the maximum is attained on the right hand side of the
optimality equation. Theorem 4.3 considers the case of a supremum.
Theorem 4.2. Suppose u_t^*, t = 1, ..., N + 1, are solutions of (4.8) and the policy π* = (d_1^*, ..., d_N^*) satisfies (4.12) below. Then:
(a) π* is an optimal policy and

    v_N^{π*}(s) = v_N^*(s),   s ∈ S_1.    (4.10)
The operation 'arg max' corresponds to choosing an action which attains the
maximum on the right hand side of (4.12). It is not necessarily unique.
The theorem means that an optimal policy is found by first solving the
optimality equations and then for each history choosing a decision rule which
selects any action which attains the maximum on the right hand side of (4.9).
When using these equations in computation, the right hand side is evaluated
for all a ∈ A_{s_t,t} and the maximizing actions are recorded. An optimal policy is
one which for each history selects any of these maximizing actions.
Part (b) of this theorem is known as 'The Principle of Optimality', and is
considered to be the basic paradigm of dynamic programming. It first appeared
formally in Bellman (1957, p. 83) as:
"An optimal policy has the property that whatever the initial state and initial
decision are, the remaining decisions must constitute an optimal policy with
regard to the state resulting from the first decision."
An equivalent statement that yields further insight appears in Denardo
(1982, p. 15). It can be paraphrased in the language of this chapter as:
There exists at least one policy that is optimal for the remainder of the decision
making horizon for each state at every stage.
The policy π* has these properties.
In case the supremum in (4.8) is not attained, the decision maker can use an
e-optimal policy which is found as follows.
Theorem 4.3. Suppose that for each t and each h_t ∈ H_t the decision rule d_t^ε satisfies

    r_t(s_t, d_t^ε(h_t)) + Σ_{j∈S_{t+1}} p_t(j|s_t, d_t^ε(h_t)) u_{t+1}^ε(h_t, d_t^ε(h_t), j) + ε/N
        ≥ sup_{a∈A_{s_t,t}} { r_t(s_t, a) + Σ_{j∈S_{t+1}} p_t(j|s_t, a) u_{t+1}^ε(h_t, a, j) },    (4.13)

and let π^ε = (d_1^ε, ..., d_N^ε). Then:
(a) π^ε is an ε-optimal policy with

    v_N^{π^ε}(s) + ε ≥ v_N^*(s),   s ∈ S_1.    (4.14)
The result follows since the policy obtained in these two theorems is
deterministic. When the maximum is attained, the policy defined by (4.12) is
optimal; otherwise the policy defined by (4.13) is ε-optimal.
The following important theorem states that the optimal value functions and
policies depend only on the state of the system at decision epochs and not on
the past, that is, when rewards and transition probabilities are Markov, the
optimal policy depends on the past only through the current state.

Theorem 4.5. Let u_t, t = 1, ..., N, with u_{N+1} = r_{N+1}, be solutions of (4.8). Then u_t(h_t) depends on h_t only through the current state s_t for each t.
A formal proof is based on induction. The key idea in establishing this result
is that if for some n, u,+ 1 depends on the history only through the current
state, then the maximizing action and u n depend on the history only through
the current state.
4.5. Computational results
The backward induction algorithm is used to solve the numerical version of
the stochastic inventory example of Section 3.2. (Since the data are stationary,
the time index may be deleted.) Define u_t(s, a) by

    u_t(s, a) = r(s, a) + Σ_{j∈S} p(j|s, a) u_{t+1}(j).

Since u_4(j) = r_4(j) = g(j) = 0 for all j,

    u_3(s) = max_{a∈A_s} { r(s, a) + Σ_{j∈S} p(j|s, a) u_4(j) } = max_{a∈A_s} { r(s, a) }.

In each state the maximizing action is 0. Thus for all s, A^*_{s,3} = {0} and
u_3(0) = 0, u_3(1) = 5, u_3(2) = 6 and u_3(3) = 5.
3. Since t ≠ 1, continue. Set t = 2 and compute u_2(s, a) = r(s, a) + Σ_{j∈S} p(j|s, a) u_3(j) for each s and a ∈ A_s.
The quantities u_2(s, a), u_2(s) and A^*_{s,2} are summarized in Table 4.1 where x's
denote non-existent actions.
Table 4.1
        u_2(s, a)                            u_2(s)    A^*_{s,2}
        a=0      a=1     a=2      a=3
s = 0   0        1/4     2        1/2       2         2
s = 1   6 1/4    4       2 1/2    x         6 1/4     0
s = 2   10       4 1/2   x        x         10        0
s = 3   10 1/2   x       x        x         10 1/2    0
The quantities u_1(s, a), u_1(s) and A^*_{s,1} are summarized in Table 4.2.

Table 4.2

        u_1(s, a)                              u_1(s)    A^*_{s,1}
        a=0       a=1      a=2      a=3
s = 0   2         2 1/16   4 1/8    4 3/16    4 3/16    3
s = 1   8 1/16    6 1/8    6 3/16   x         8 1/16    0
s = 2   12 1/8    8 3/16   x        x         12 1/8    0
s = 3   14 3/16   x        x        x         14 3/16   0
5. Since t = 1, stop.
This procedure has produced the optimal expected total reward function
v_3^*(s) and optimal policy π* = (d_1^*(s), d_2^*(s), d_3^*(s)) which are reproduced in
Table 4.3.
Table 4.3

s    d_1^*(s)   d_2^*(s)   d_3^*(s)   v_3^*(s)
0    3          2          0          67/16
1    0          0          0          129/16
2    0          0          0          194/16
3    0          0          0          227/16
The quantity v_3^*(s) gives the expected total reward obtained using this policy
when the inventory at the start of month 1 is s units.
This policy has a particularly simple form: if at the start of month 1 the
inventory is 0 units, order 3 units, otherwise do not order; if at the start of
month 2 the inventory is 0 units, order 2 units, otherwise do not order; and do
not order in month 3. This is an example of an (s, S) policy.
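The backward induction computation behind Tables 4.1-4.3 can be sketched in a few lines of Python. The snippet below reuses the reward() and transition() helpers sketched after Table 3.2 and is an illustration under those assumptions, not the chapter's own code; the printed values should match Table 4.3.

    from fractions import Fraction as F

    # Sketch: backward induction for the 3-period inventory example
    # (terminal reward g = 0, states 0..3, actions a = 0..3-s).
    N, states = 3, range(4)
    u = {s: F(0) for s in states}                 # u_{N+1}(s) = g(s) = 0
    policy = {}
    for t in range(N, 0, -1):                     # t = 3, 2, 1
        u_t, d_t = {}, {}
        for s in states:
            q = {a: reward(s, a) + sum(transition(j, s, a) * u[j] for j in states)
                 for a in range(4 - s)}
            d_t[s] = max(q, key=q.get)
            u_t[s] = q[d_t[s]]
        u, policy[t] = u_t, d_t
    print(u)         # expected: {0: 67/16, 1: 129/16, 2: 194/16, 3: 227/16}
    print(policy)    # expected: d_1 = (3,0,0,0), d_2 = (2,0,0,0), d_3 = (0,0,0,0)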
5. Infinite horizon models: optimality criteria

This section introduces several optimality criteria for infinite horizon MDP's
and discusses the relationship between them. It provides an introduction to the
material in Sections 6-8. We assume the data of the problem are stationary and
S is either finite or countable.
(a) The expected total reward of policy π is defined by

    v^π(s) = lim_{N→∞} v_N^π(s),

where v_N^π is defined by (4.2) with r_{N+1} = 0. Several special cases are distinguished for which this criterion is appropriate. These are introduced in the
next subsection and are the subject of Section 7.
(b) The expected discounted reward of policy π is defined by

    v_λ^π(s) = E_{π,s} { Σ_{t=1}^∞ λ^{t−1} r(X_t, d_t(H_t)) }    (5.3)

for 0 ≤ λ < 1. Note that v^π = lim_{λ↑1} v_λ^π when the limit exists. Condition (5.1)
ensures that |v_λ^π(s)| < (1 − λ)^{−1} M for all s ∈ S and π ∈ Π.
In (5.3), the present value of the reward received in the first period has value
r. This is equivalent to assuming that rewards are received at the beginning of
the period immediately after the decision rule is applied.
(c) The average reward or gain of policy π is given by

    g^π(s) = lim_{N→∞} (1/N) v_N^π(s),    (5.4)

where v_N^π(s) is defined by (4.2). In this case, it is not necessary that r_{N+1} equals
zero since averaging removes the effect of the terminal reward. When the limit
in (5.4) does not exist, the limit is replaced by the limit inferior to account for
the worst possible limiting behavior under policy π.
The quantity g^π is the average expected reward per period obtained using
policy π. Justification for calling it the gain is provided in Section 8.
If under policy π the chain ends up in a recurrent class in which the rewards are zero, then
v^π(s) is bounded.
Negative problems arise in the context of minimization of expected total
costs when immediate costs are non-negative. Changing signs converts all costs
to negative rewards and minimization to maximization. The condition that at
least one policy has v^π(s) > −∞ is equivalent to the existence of a policy with
finite total expected cost. Such problems also arise in the context of minimizing
the probability of reaching an undesirable state, minimizing the expected time
to reach a desirable state (Demko and Hill, 1981) and optimal stopping with
minimum expected total cost criterion.
Restricting rewards to be negative ensures that v^π(s) is well defined;
however, it may be infinite for many policies. The restriction that at least one
policy has v^π(s) finite ensures that the expected total reward criterion is useful.
Theoretically this problem is more challenging than the positive case because it
permits policies with infinite rewards.
The discounted case is the most important in economic applications and the
best understood theoretically and computationally. It will be studied in detail
in Section 6. Discounting arises naturally in an economic context when the time
values of the rewards are taken into account. The discount factor λ is the
present value of one unit of currency received in the subsequent period so that
v_λ^π is the expected total present value of the income stream obtained using
policy π. Allowing λ to be non-constant leads to non-stationary problems.
Derman (1970, pp. 31-32) shows that discounting is equivalent to a problem
with expected total reward criteria and a random termination time, 7, that is
independent of the actions of the decision maker and geometrically distributed
with parameter λ.
Generalizations of the discounted case include the transient case (Veinott,
1969, Hordijk, 1974, Pliska, 1978 and Whittle, 1983) and problems in which
there is a single absorbing state and the expected time until absorption is
bounded for all policies (Blackwell, 1962, Mine and Osaki, 1968 and van
Dawen, 1986a).
    v^*(s) = sup_{π∈Π} v^π(s),    (5.7)

and a policy π* is optimal with respect to the corresponding criterion if v^{π*}(s) = v^*(s) for all s ∈ S.
When the limits defining g^π(s) do not exist, two notions of gain optimality have
been considered (Flynn, 1976, Federgruen, Hordijk and Tijms, 1979, and
Federgruen, Schweitzer and Tijms, 1983). A policy π* is said to be average
optimal in the strong sense if its smallest limit point is at least as great as any
limit point of any other policy. That is, for each s ∈ S,

    lim inf_{N→∞} N^{−1} v_N^{π*}(s) ≥ lim sup_{N→∞} N^{−1} v_N^π(s)   for all π ∈ Π.

A policy π* is said to be average optimal in the weak sense if its largest limit
point is at least as great as any limit point of any other policy. That is, for each
s ∈ S,

    lim sup_{N→∞} N^{−1} v_N^{π*}(s) ≥ lim sup_{N→∞} N^{−1} v_N^π(s)   for all π ∈ Π.
Example 5.1. Let S = {1, 2, 3} and suppose the action sets, rewards and
transition probabilities are as follows. For s = 1, A_s = {a, b}, r(s, a) = 1,
r(s, b) = 0 and p(2|s, a) = 1, p(3|s, b) = 1. For s = 2, A_s = {a}, r(s, a) = 0,
p(1|s, a) = 1 and for s = 3, A_s = {b}, r(s, b) = 1 and p(1|s, b) = 1. Clearly the
stationary policies which always use action a or b yield average rewards of 1/2. It
is easy to see that these policies are average optimal.
This example shows that the average reward criterion does not distinguish
between policies which might have different appeal to the decision maker.
Starting in state 1, policy a with reward stream (1,0, 1 , 0 , . . . ) is clearly
superior to b with reward stream (0, 1, 0, 1 , . . . ) because it provides 1 unit in
the first period which can be put to alternative use. Denardo and Miller (1968)
called a criterion such as the average reward unselective because it depends only
on the tail behavior of the sequence of rewards and does not distinguish
policies with returns which differ only in a finite number of periods. Several
more selective criteria have been proposed. They are based on either
(a) the comparative finite horizon expected total reward as the number of
periods becomes large, or
(b) the comparative expected discounted reward as the discount factor λ
increases to 1; a numerical illustration for Example 5.1 is sketched below.
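As a rough numerical illustration of why discounting is more selective than the average reward, the following Python sketch compares the two reward streams of Example 5.1 from state 1 under both criteria. The names and the truncation horizon are illustrative assumptions, not part of the chapter.

    # Sketch: both policies of Example 5.1 have average reward 1/2, but the
    # discounted criterion (b) distinguishes them for every 0 <= lam < 1.
    lam, horizon = 0.9, 10_000

    def discounted(stream):
        return sum(lam ** t * r for t, r in enumerate(stream))

    stream_a = [1, 0] * (horizon // 2)     # policy using action a in state 1
    stream_b = [0, 1] * (horizon // 2)     # policy using action b in state 1

    print(sum(stream_a) / horizon, sum(stream_b) / horizon)   # both ~ 0.5
    print(discounted(stream_a))            # ~ 1 / (1 - lam**2)   = 5.263...
    print(discounted(stream_b))            # ~ lam / (1 - lam**2) = 4.736...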
Those based on v_N^π are discussed first. Denardo and Miller (1968) called a
policy π* overtaking optimal if for each s ∈ S,

    lim inf_{N→∞} [ v_N^{π*}(s) − v_N^π(s) ] ≥ 0   for all π ∈ Π,    (5.9)

and showed with an example that this criterion is overselective, that is, there
need not exist an optimal policy with respect to this criterion. The following
criterion (Veinott, 1966) is less selective. A policy π* is said to be average
overtaking optimal if for each s ∈ S,

    lim inf_{N→∞} (1/N) Σ_{n=1}^N [ v_n^{π*}(s) − v_n^π(s) ] ≥ 0   for all π ∈ Π.    (5.10)
Blackwell (1962) proposed choosing a policy π* for which there exists a λ*(s) < 1 such that

    v_λ^{π*}(s) − v_λ^π(s) ≥ 0   for all π ∈ Π for λ*(s) ≤ λ < 1.    (5.11)

Such policies are now referred to as Blackwell optimal. Blackwell proposed this
criterion in the context of S finite in which case λ* = sup_{s∈S} λ*(s) is attained. In
countable state problems this supremum might equal 1. Dekker (1985) distinguishes cases when λ* < 1 as strongly Blackwell optimal.
Veinott (1969) generalized Blackwell optimality by proposing the following
family of sensitive optimality criteria. A policy π* is said to be n-discount
optimal if for each s ∈ S,

    lim inf_{λ↑1} (1 − λ)^{−n} [ v_λ^{π*}(s) − v_λ^π(s) ] ≥ 0   for all π ∈ Π.
This criterion unified several optimality criteria based on the expected dis-
counted reward including average or gain optimality, bias and Blackwell
optimality. It has been shown that (-1)-discount optimality is equivalent to
average optimality, 0-discount optimality is equivalent to bias optimality and
~-discount optimality is equivalent to Blackwell optimality. These equivalences
are discussed in more detail in Section 8.9 where the Laurent series expansion
on which these are based is presented.
Blackwell optimality is the most selective of the n-discount optimality
criteria as it implies n-discount optimality for all finite n. It implies gain and
bias optimality. In general, n-discount optimality implies m-discount optimality
for all m < n so that bias optimality (n = 0) is more selective than gain
optimality (n = - 1).
Optimality criteria have also been based on the asymptotic behavior of
policies for finite horizon problems as the horizon gets large. Morton (1978)
calls a policy forecast horizon optimal if it is the pointwise limit of optimal
policies for finite horizon problems. Since the limit need not exist, Hopp, Bean
and Smith (1988) have introduced a weaker criterion, periodic forecast horizon
optimality. A policy is said to be periodic forecast horizon optimal if it is the
limit of a subsequence of optimal policies for finite problems in an appropriate
metric. These two criteria are of particular importance in nonstationary
problems.
When the assumption that Σ_{j∈S} p(j|s, a) ≤ 1 is not satisfied the above
criteria are inappropriate. Rothblum (1984) showed that problems for which
Σ_{j∈S} p(j|s, a) > 1 include Markov decision processes with multiplicative
utilities (Howard and Matheson, 1972) and controlled branching processes
(Mandl, 1967, Pliska, 1976). Optimality criteria in this case are based on
choosing policies which maximize the spectral radius (Bellman, 1957, p. 329).
More general optimality criteria have been proposed by Rothblum and Veinott
(1975).
6. The discounted case

This section analyzes infinite horizon MDP's under the expected total
discounted reward optimality criterion. The optimality equation is introduced and
its fundamental role in Markov decision process theory and computation is
demonstrated. Several algorithms for solving the optimality equation are
presented and discussed; the section concludes with a numerical example and a
discussion of discounted Markov decision problems with unbounded rewards.
Throughout this section it is assumed that the problem is stationary and, as
before, S is assumed to be discrete.
The optimality equation

    v(s) = sup_{a∈A_s} { r(s, a) + Σ_{j∈S} λ p(j|s, a) v(j) },   s ∈ S,    (6.1)

plays a key role in the theory of Markov decision problems. In vector notation
it can be written as

    v = sup_{d∈D} { r_d + λ P_d v }.    (6.2)

Define the operators T and T_d on V by

    Tv ≡ sup_{d∈D} { r_d + λ P_d v }    (6.3)

and

    T_d v ≡ r_d + λ P_d v.    (6.4)
Comparing (6.2) and (6.3) shows that the optimality equation can be expressed
as v = Tv. Thus, a solution v of the optimality equation is a fixed point of T.
This observation will be fundamental to Section 6.2.
The main properties of the optimality equation to be presented below
include:
(1) If a solution of the optimality equation exists, it equals the value of the
discounted M D P (Theorem 6.3).
(2) The value of the discounted MDP satisfies the optimality equation
(Corollary 6.7).
(3) The solution of the optimality equation is unique (Corollary 6.3).
(4) The optimality equation characterizes optimal policies (Theorem 6.4).
A recursive equation to compute the expected total discounted reward of a
fixed policy π is now developed. Let π = (d_1, d_2, ...) be an arbitrary policy.
Its expected total discounted reward v_λ^π(s) was given by (5.3). The expectation
in (5.3) can be expressed in terms of transition probabilities as follows:

    v_λ^π = Σ_{n=1}^∞ λ^{n−1} P_{d_1} P_{d_2} ⋯ P_{d_{n−1}} r_{d_n}    (6.5)
         = r_{d_1} + λ P_{d_1} r_{d_2} + λ^2 P_{d_1} P_{d_2} r_{d_3} + ⋯
         = r_{d_1} + λ P_{d_1} v_λ^{π'},    (6.6)

where π' = (d_2, d_3, ...) and the limit implicit in (6.5) is componentwise. The
relationship in (6.6) holds for arbitrary policies; however, if π is stationary,
π' = π. Denote the stationary policy π = (d, d, ...) by d. Rewriting the above
relationship yields

    v_λ^d = r_d + λ P_d v_λ^d ≡ T_d v_λ^d.    (6.7)

Solving for v_λ^d gives

    v_λ^d = (I − λ P_d)^{−1} r_d = Σ_{n=1}^∞ λ^{n−1} P_d^{n−1} r_d.    (6.9)
The following is the fundamental result about the optimality equation and its
solutions. A proof appears in Blackwell (1962).
Since any solution of the optimality equation must satisfy both of the above
inequalities, it is equal to the optimal value function and consequently is
unique. Alternatively, uniqueness can be established using the contraction
mapping methods of Section 6.3.
Theorem 6.3. If the equation v = Tv has a solution, it is unique and equals v_λ^*.
or equivalently,

    v(s) = max_{a∈A_s} { r(s, a) + λ Σ_{j∈S} p(j|s, a) v(j) }.
    v_λ^{d*} = max_{π∈Π} v_λ^π,

where

    d* = arg max_{d∈D} { r_d + λ P_d v_λ^* }    (6.14)

or, equivalently, d*(s) attains the maximum in (6.1) with v = v_λ^* for each s ∈ S. A decision rule satisfying (6.14) is called conserving.
Corollary 6.5. Suppose a conserving decision rule d* exists. Then the de-
terministic Markov stationary policy which uses d* every period is optimal in the
class of all policies.
Theorem 6.6. The operator T has a unique fixed point v* ∈ V and for every
v ∈ V, the sequence {v^n} defined by v^{n+1} = Tv^n converges in norm to v*, where

    Tv = max_{d∈D} { r_d + λ P_d v }.
When the supremum is not attained, only ε-optimal policies are possible.
This result is summarized as follows.

4. Set

    d^ε(s) = arg max_{a∈A_s} { r(s, a) + Σ_{j∈S} λ p(j|s, a) v^{n+1}(j) }    (6.19)

and stop. If the arg max in (6.19) is not unique, any action achieving this
maximum can be selected.
The main step in the algorithm is step 2, which gives the recursion v^{n+1} = Tv^n in
component notation. Theorem 6.6 guarantees the convergence of the algorithm
to the optimal value function and that the stopping criterion is satisfied in
finitely many iterations. When the stopping criterion in step 3 is met, the
stationary policy corresponding to a v^{n+1}-improving decision rule is ε-optimal.
Improved stopping rules are discussed in Section 6.7.
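A compact Python sketch of the value iteration algorithm is given below. The data layout (states indexed 0, ..., n−1; r[s] a dict of rewards over the actions available in s; P[s][a] the corresponding row of transition probabilities) and the function name are assumptions introduced here for illustration. The stopping test is the sup-norm criterion ||v^{n+1} − v^n|| < ε(1 − λ)/2λ used in the numerical example of this section, under which the returned stationary policy is ε-optimal. Applied to the inventory data with λ = 0.9 it should reproduce the policy (3, 0, 0, 0) reported in Table 6.1.

    # Sketch of value iteration with a sup-norm stopping rule.
    def value_iteration(r, P, lam, eps=0.1):
        n = len(r)
        v = [0.0] * n
        while True:
            v_new, policy = [0.0] * n, [0] * n
            for s in range(n):
                q = {a: r[s][a] + lam * sum(P[s][a][j] * v[j] for j in range(n))
                     for a in r[s]}
                policy[s] = max(q, key=q.get)       # v-improving action
                v_new[s] = q[policy[s]]
            if max(abs(v_new[s] - v[s]) for s in range(n)) < eps * (1 - lam) / (2 * lam):
                return v_new, policy                # policy is eps-optimal
            v = v_new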
Theorem 6.6 ensures convergence of value iteration for arbitrary state spaces
provided that the appropriate norm is selected so that the value functions together with the
norm form a Banach space. This means that value iteration will converge in
norm if S is finite, countable, compact or Borel. Unfortunately direct implementation of the maximization in (6.20) is only practical when S is finite.
For more general state spaces, the maximization can only be carried out by
using special structure of the rewards, transition probabilities and value
functions to determine the structure of maximizing decision rules. If it can be
established that a property of v n is preserved by induction, for example
unimodality, and that this property ensures that the optimizing decision rule is
of a certain form, i.e., control limit, then if the property of v n holds in the
limit, as a consequence of Corollary 6.7 there exists an optimal stationary
policy with the special structure. This idea has been used extensively in
inventory theory (Chapter 12), replacement theory and queueing control to
determine the structure of optimal policies for infinite horizon problems.
The value iteration algorithm as defined above terminates in a finite number
of iterations when the stopping criterion in step 3 is satisfied. Consequently,
there is no guarantee that the resulting policy is optimal. In special cases,
action elimination procedures discussed in Section 6.7.3 can be used to ensure
termination with an optimal policy. Note that the policy determined in step 4
is ε-optimal in the norm sense, that is

    || v_λ^{d^ε} − v_λ^* || < ε.
When λ is close to one, the above bounds suggest that the convergence of this
algorithm will be quite slow. The subsequent subsections discuss other more
efficient methods for solving discounted MDP's.
Using standard arguments, the following error bound for the iterates of
value iteration can be obtained;
    || v_λ^{d^n} − v_λ^* || ≤ (2λ^n / (1 − λ)) || v^1 − v^0 ||.    (6.23)
By specifying e a priori and performing one value iteration step, (6.23) can be
used to estimate the number of additional iterations required to obtain the
desired precision.
where w* is the normalized optimal value function and α_n is the modulus of the
subdominant eigenvalue (the second largest eigenvalue in modulus) of the
transition matrix of the v^n-improving decision rule. The advantage of this
approach is that if the subdominant eigenvalues for most policies (especially
the optimal one) are considerably smaller than 1, the rate of convergence of
this normalized or relative value iteration will be considerably faster than that
for value iteration.
When the transition matrices of all policies are irreducible, relative value
iteration can be implemented by selecting an arbitrary state s_0 and defining w^n for
each s ∈ S by

    w^n(s) = v^n(s) − v^n(s_0).

When the policies have more general chain structure, a different normalization
which requires identification of recurrent classes can achieve this improved rate
of convergence.
Another modification that will accelerate computations is to use the Gauss-Seidel variant of value iteration (Hastings, 1969). In it, updated values of
v^{n+1}(s) are substituted into the recursive equation as soon as they are
evaluated. Suppose that the states are labelled s_1, s_2, ..., s_K and are evaluated
in order of their subscripts. Then the Gauss-Seidel iterative recursion is

    v^{n+1}(s_k) = max_{a∈A_{s_k}} { r(s_k, a) + λ [ Σ_{i<k} p(s_i|s_k, a) v^{n+1}(s_i) + Σ_{i≥k} p(s_i|s_k, a) v^n(s_i) ] }.
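A sketch of one Gauss-Seidel sweep, with the same assumed data layout and assumed function name as the value iteration sketch above; states updated earlier in the sweep contribute their new values immediately.

    # Sketch of one Gauss-Seidel value iteration sweep.
    def gauss_seidel_sweep(r, P, lam, v):
        v_new = list(v)                       # states not yet updated keep old values
        for s in range(len(r)):               # states processed in a fixed order
            v_new[s] = max(
                r[s][a] + lam * sum(P[s][a][j] * v_new[j] for j in range(len(r)))
                for a in r[s])
        return v_new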
Theorem 6.10. There exists an N* such that for any n ≥ N*, the optimal
decision in a finite horizon problem when there are n periods remaining is in
D*, the set of optimal stationary policies for the infinite horizon problem.
    r_{d_{n+1}} + λ P_{d_{n+1}} v_{d_n} = max_{d∈D} { r_d + λ P_d v_{d_n} }.    (6.25)
The above algorithm yields a sequence of policies {d_n} and value functions
{v_{d_n}}. It terminates when the maximizing policy in step 3 repeats. This occurs
with certainty in a finite number of iterations in finite state and action problems
but not in compact action problems for which the number of stationary policies
is infinite.
Step 2 is called the policy evaluation step because in it, (6.24) is solved to
obtain the expected discounted reward of stationary policy d_n. This equation is
usually solved by Gauss elimination. In step 3, a v_{d_n}-improving decision rule is
selected. Since the decision rule is not necessarily unique, the condition that
d_{n+1} = d_n is included to avoid cycling and ensure termination.
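A Python sketch of the policy iteration algorithm under the same assumed data layout as the earlier sketches (the function name is illustrative). The evaluation step solves (I − λP_d)v = r_d directly, and ties in the improvement step are broken in favour of the current action to avoid cycling, as described above.

    import numpy as np

    # Sketch of policy iteration for a finite discounted MDP.
    def policy_iteration(r, P, lam, d=None):
        n = len(r)
        d = d or [min(r[s]) for s in range(n)]            # arbitrary initial rule
        while True:
            P_d = np.array([P[s][d[s]] for s in range(n)], dtype=float)
            r_d = np.array([r[s][d[s]] for s in range(n)], dtype=float)
            v = np.linalg.solve(np.eye(n) - lam * P_d, r_d)      # evaluation step
            d_new = []
            for s in range(n):                                   # improvement step
                q = {a: r[s][a] + lam * np.dot(P[s][a], v) for a in r[s]}
                best = max(q.values())
                # keep the current action whenever it attains the maximum
                d_new.append(d[s] if q[d[s]] >= best - 1e-12 else max(q, key=q.get))
            if d_new == d:
                return v, d
            d = d_new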
To carry out step 3, the set of all v_{d_n}-improving decision rules is required
before selecting a particular decision rule. An alternative specification of the
algorithm would retain the entire set of v_{d_n}-improving decision rules and
terminate when it repeats. This modification is unnecessary since at termination, v_{d_n} = v_λ^*, so that all conserving decision rules are available.
Alternatively, one might implement step 3 by just finding a decision rule
d_{n+1} with the property that

    r_{d_{n+1}} + λ P_{d_{n+1}} v_{d_n} ≥ v_{d_n}

with strict inequality for at least one component. If this specification is used,
the algorithm will still converge in the finite action case, but at a much slower
rate than using the implementation in step 3. If the set of actions is compact,
convergence to v_λ^* is not guaranteed.
    v_{d_{n+1}} ≥ v_{d_n}.

Since there are only finitely many deterministic stationary policies, the
algorithm must terminate in a finite number of iterations. At termination,
d_{n+1} = d_n, so that v_{d_n} satisfies the optimality equation and hence equals v_λ^*.
Theorem 6.12. Suppose S is finite and for each s ∈ S, A_s is finite. Then the
policy iteration algorithm terminates in a finite number of iterations and the
policy d* is discount optimal.
    Bv ≡ max_{d∈D} { r_d + (λP_d − I) v }.    (6.26)

In terms of B, the optimality equation becomes

    Bv = 0.    (6.27)

For v ∈ V, let

    D_v = arg max_{d∈D} { r_d + (λP_d − I) v }.    (6.28)

Note that the I in (6.28) does not affect the maximization, so that D_v is the set
of v-improving decision rules.
This result follows easily from the definitions of the quantities in (6.29). It is
called the 'support inequality' and is a vector space generalization of the
gradient inequality which defines convex functions in R^n. Thus in a generalized
sense, the operator B is 'convex' and λP_{d_v} − I is the 'support' of B at v.
Figure 6.1 illustrates the convexity and construction of Bv. In the situation
depicted, there are four policies. For each, the function r_i + (λP_i − I)v is given.
At each v ∈ V, Bv is the maximum of these functions. With the exception of
r_4 + (λP_4 − I)v, all are supports for some v in the illustrated portion of V. Note
Bv is convex.
The following proposition provides a closed form representation for the
sequence of values generated by policy iteration and is fundamental to this
analysis.
    v^{n+1} = v^n − (λP_{d_{v^n}} − I)^{−1} B v^n.    (6.30)
Fig. 6.1. Construction of Bv.
Theorem 6.15. Suppose Bv ≥ 0 and there exists a unique v* such that Bv* = 0.
Then the sequence of iterates {v^n} defined by (6.30) converges monotonically
and in norm to the zero of B, v*.
Then

    || v^{n+1} − v_λ^* || ≤ (Kλ / (1 − λ)) || v^n − v_λ^* ||^2.    (6.32)
This theorem says that if (6.31) holds, policy iteration converges quadratically to the optimal value function. This accounts for the fast convergence of
policy iteration in practice. In contrast, value iteration and its variants converge
linearly.
In terms of the data of the problem, sufficient conditions for (6.31) to hold
are that for each s ∈ S:
(a) A_s is compact and convex,
(b) p(j|s, a) is affine in a, and
(c) r(s, a) is strictly concave and twice continuously differentiable in a.
When A_s is finite, (6.31) need not hold because the support is not unique at several
v ∈ V; however, if a rule such as that in step 3 of the policy iteration algorithm
is used to break ties, the algorithm provides a unique support at each v. Thus
convergence will be quadratic although K might be large. Other conditions
which imply (6.31) can be derived from selection theorems in Fleming and
Rishel (1975).
Then

    lim_{n→∞} || v^{n+1} − v_λ^* || / || v^n − v_λ^* || = 0.    (6.34)
Fig. 6.2. Illustration of modified policy iteration of order 2.
    u^0(s) = max_{a∈A_s} { r(s, a) + Σ_{j∈S} λ p(j|s, a) v^n(j) }.    (6.36)
This algorithm combines features of both policy iteration and value iteration.
Like value iteration, it is an iterative algorithm. The stopping criterion used in
step 3b is identical to that of value iteration; when it is satisfied, the resulting
policy is ε-optimal. The computation of u^0 in step 3a requires no additional
v "+1 = ( r dn+l
)"+%"
Equation (6.39) shows that modified policy iteration includes value iteration
and policy iteration as extreme cases; modified policy iteration of order 0 is
value iteration and of infinite order is policy iteration. The modified policy
iteration algorithm corresponds to performing one value iteration step in which
the maximum in (6.36) is computed and then m successive approximation steps
with the fixed decision rule dn+ 1. Figure 6.2 illustrates this for modified policy
iteration of order 2.
The quantity v^{n+1} is the expected total discounted reward obtained by using
the stationary policy d_{n+1} in a problem with finite horizon m and terminal
reward v^n. Alternatively, v^{n+1} is the expected total discounted reward of the
policy which uses d_{n+1} for the first m periods, d_n for the next m periods and so
forth, in an (n + 1)m period problem with terminal reward v^0.
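One pass of modified policy iteration of order m can be sketched as follows, with the same assumed data layout as the earlier sketches; mpi_step is an illustrative name, not the chapter's notation. Setting m = 0 reduces the step to value iteration, while large m approaches the policy evaluation of policy iteration, mirroring (6.39).

    # Sketch of one modified policy iteration pass of order m.
    def mpi_step(r, P, lam, v, m):
        n = len(r)
        q = [{a: r[s][a] + lam * sum(P[s][a][j] * v[j] for j in range(n))
              for a in r[s]} for s in range(n)]
        d = [max(q[s], key=q[s].get) for s in range(n)]     # v-improving rule
        u = [q[s][d[s]] for s in range(n)]                  # u^0 as in (6.36)
        for _ in range(m):                                  # m partial evaluation passes
            u = [r[s][d[s]] + lam * sum(P[s][d[s]][j] * u[j] for j in range(n))
                 for s in range(n)]
        return u, d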
The convergence of the algorithm has been demonstrated by Puterman and
Shin (1978) and Rothblum (1979) and can be summarized as follows.
One might conjecture that the iterates of modified policy iteration of order
m + k (k ≥ 0) always dominate those for MPI of order m when started at the
same initial value. An example of van der Wal and van Nunen (1977) which
appears in Puterman and Shin (1978) indicates that this conjecture is false.
Puterman and Shin (1978) provide the following result regarding the rate of
convergence.
Theorem 6.19. I f
(6.40)
then
This result demonstrates the appeal of this algorithm. When the policy is
close to optimal, the convergence rate of the algorithm is close to that of m + 1
steps of value iteration. Computationally this represents a major savings over
value iteration because MPI avoids the maximization at each pass through the
algorithm. Conditions which imply (6.40) were given in the previous section. It
always holds for finite state and action problems in which a rule is used to
uniquely choose the v^n-improving policy in step 2.
The MPI algorithm will converge in fewer iterations than value iteration and
at least as many iterations as policy iteration; however the computational effort
per iteration exceeds that for value iteration and is less than that for policy
iteration. Computational results in Puterman and Shin (1978) suggest that it is
a more computationally efficient method for solution of practical Markov
decision problems than either value iteration or policy iteration. Determining
an efficient procedure for selecting m is still an open problem although results
of Dembo and Haviv (1984) provide insight.
It can be shown that if v satisfies

    v ≥ r_d + λ P_d v

for all d ∈ D, then v is an upper bound for the value of the MDP, v_λ^*. Since v_λ^*
also satisfies this inequality, it must be the smallest such solution. This is the
basis for the following linear program.
Minimize  Σ_{j∈S} α_j v(j)

subject to

    v(s) − λ Σ_{j∈S} p(j|s, a) v(j) ≥ r(s, a),   s ∈ S, a ∈ A_s.

The constants α_j are arbitrary positive quantities which are assumed without
loss of generality to satisfy Σ_{j∈S} α_j = 1. Its dual is:

Maximize  Σ_{s∈S} Σ_{a∈A_s} r(s, a) x(s, a)

subject to

    Σ_{a∈A_j} x(j, a) − λ Σ_{s∈S} Σ_{a∈A_s} p(j|s, a) x(s, a) = α_j,   j ∈ S,    (6.41)
    x(s, a) ≥ 0,   s ∈ S, a ∈ A_s.
Theorem 6.20. The dual problem is always feasible and bounded. For a
randomized stationary policy d, the corresponding quantities x_d(s, a) satisfy (6.41).

The quantity x(s, a) defined in (6.41) is the discounted joint probability that
the system is in state s and action a is selected, averaged over the initial
distribution {α_j}.
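The primal linear program can be passed to an off-the-shelf LP solver. The sketch below uses scipy.optimize.linprog with the same assumed data layout as the earlier sketches; it is an illustration of the formulation above rather than a recommended solution method, and the function name is an assumption.

    from scipy.optimize import linprog

    # Sketch: minimize sum_j alpha_j v(j) subject to
    # v(s) >= r(s, a) + lam * sum_j p(j|s, a) v(j) for every state-action pair.
    def solve_primal_lp(r, P, lam, alpha):
        n = len(alpha)
        A_ub, b_ub = [], []
        for s in range(n):
            for a in r[s]:
                row = [lam * P[s][a][j] for j in range(n)]
                row[s] -= 1.0                    # coefficient of v(s): lam*p(s|s,a) - 1
                A_ub.append(row)                 # row . v <= -r(s, a)
                b_ub.append(-r[s][a])
        res = linprog(c=list(alpha), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * n)
        return res.x                             # approximates the optimal value function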
Corollary 6.21. Any basic feasible solution has the property that for each s ∈ S,
x(s, a) > 0 for only one a ∈ A_s. If x* is an optimal basic feasible solution, an
optimal deterministic stationary policy is obtained by setting d*(s) = a whenever
x*(s, a) > 0.
The matrix defining the constraints in the dual problem is a Leontief matrix
(Veinott, 1968), that is, each column has exactly one positive entry and for any
non-negative right hand side the linear system has a non-negative solution. A
consequence of this observation is that for any non-negative vector α the dual
linear program has the same optimal basic feasible solution.
The relationship between the simplex algorithm and the dynamic programming algorithms is as follows. When the dual problem is solved by the simplex
algorithm with block pivoting, it is equivalent to policy iteration. When policy
iteration is implemented by changing only the action which gives the maximum
improvement over all states, it is equivalent to solving the dual problem by the
usual simplex method. Modified policy iteration is equivalent to a variant of
linear programming in which the basic feasible solution is evaluated by
relaxation instead of direct solution of the linear system.
    sp(Bv) = U(Bv) − L(Bv) < ((1 − λ)/λ) ε,    (6.44)

where U(Bv) and L(Bv) denote the largest and smallest components of Bv.
Then
When ε is small and (6.44) is satisfied, Bv is nearly constant so that the value
of a v-improving decision rule differs from v by nearly a constant amount. Thus
at the subsequent value, Bv will again be nearly constant. It can be shown that
this constant must be close to 0. When this constant has been added to v, the
resulting value is closer to v_λ^* as shown by (6.45).
As a consequence of Proposition 6.23, (6.44) provides an alternative stop-
ping rule for value iteration and modified policy iteration. It replaces the
stopping rule in step 3 of the value iteration algorithm or step 3b of the
modified policy iteration algorithm. Note that in both of these algorithms,
Bv^n = Tv^n − v^n is available prior to testing whether the stopping criteria are
met. Determining U(Bv^n) and L(Bv^n) requires little additional effort. Although policy iteration is a finite algorithm, the above stopping criterion can be
included after step 3, if all that is required is an ε-optimal policy.
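A small Python sketch of the span-based test and the lower bound extrapolation discussed below (the function name and data layout are assumptions); it can replace the norm-based test in the value iteration or modified policy iteration sketches above.

    # Sketch: span stopping test (6.44) and the lower bound extrapolation,
    # given v and Tv (so that Bv = Tv - v is already available).
    def span_stop_and_extrapolate(v, Tv, lam, eps):
        Bv = [Tv[s] - v[s] for s in range(len(v))]
        U, L = max(Bv), min(Bv)
        done = (U - L) < (1 - lam) / lam * eps            # test (6.44)
        v_lower = [v[s] + Bv[s] + lam * L / (1 - lam) for s in range(len(v))]
        return done, v_lower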
The quantity v + Bv + (1 − λ)^{−1} λ L(Bv) 1 in (6.48) is called a lower bound
extrapolation by Porteus and Totten (1978) and gives an improved approximation to v_λ^* upon termination of the algorithm. One might conjecture that
convergence of the algorithms would be faster if such extrapolations could be
incorporated at each iteration. Unfortunately this is not the case because the
set of maximizing actions is unaffected by addition of a scalar. Such extrapola-
tions are more useful in semi-Markov decision problems in which transition
matrices do not have equal row sums. Other extrapolations are also available
(Porteus, 1980a).
Once identified, suboptimal actions are eliminated from the action set at subsequent iterations, so the
maximization must be evaluated for fewer actions. Also, the identification and elimination of
suboptimal actions is the only way of determining an optimal (as opposed to an
ε-optimal) policy when using a non-finite iterative algorithm such as value
iteration or modified policy iteration. When all but one action is eliminated in
each state, the stationary policy which uses the remaining decision rule is
necessarily optimal.
Action elimination procedures are based on the following observation of
MacQueen (1967).
Proposition 6.24. If

    r(s, a) + λ Σ_{j∈S} p(j|s, a) v_λ^*(j) < v_λ^*(s),    (6.47)

then any stationary policy which uses action a in state s cannot be optimal.

Referring to Figure 6.1 shows why this result holds. In it, policy 3 is optimal
and the one-step improvement functions of all other policies are bounded
above by 0 at v_λ^*.
Since v_λ^* is unknown, the result in Proposition 6.24 cannot be used directly to
identify suboptimal actions. Instead, Proposition 6.24 permits upper and lower
bounds for v~ to be substituted into (6.47) to obtain an implementable
elimination rule.
In this subsection the inventory example is solved numerically using value iteration, modified policy iteration and policy iteration, and the results are discussed. The data are the same as those
used in the finite horizon case in Section 4.5; the discount rate λ is 0.9. The
objective is to determine the stationary policy that maximizes the expected
total infinite horizon discounted reward. All calculations were carried out using
MDPLAB (Lamond, 1984).
Value iteration is applied with v^0 = 0 and ε = 0.1, terminating when the stopping criterion ||v^{n+1} − v^n|| < ε(1 − λ)/2λ
is satisfied. The value functions v^n and the maximizing actions obtained in step
2 at several iterations are provided in Table 6.1.
Table 6.1
Value iteration results
n     v^n(s)                                      d^n(s)               Δ^n
      s=0       s=1       s=2       s=3           s=0  s=1  s=2  s=3
0     0         0         0         0             0    0    0    0
1     0         5.0       6.0       5.0           2    0    0    0    6.0000
2     1.6       6.125     9.6       9.95          2    0    0    0    3.8250
3     3.2762    7.4581    11.2762   12.9368       3    0    0    0    1.6537
4     4.6632    8.8895    12.6305   14.6632       3    0    0    0    0.3721
5     5.9831    10.1478   13.8914   15.9831       3    0    0    0    0.0616
6     7.1306    11.3218   15.0383   17.1306       3    0    0    0    0.0271
7     8.1690    12.3605   16.0828   18.1690       3    0    0    0    0.0061
10    10.7071   14.8966   18.6194   20.7071       3    0    0    0
15    13.5019   17.6913   21.4142   23.5019       3    0    0    0
30    16.6099   20.7994   24.5222   26.6099       3    0    0    0
50    17.4197   21.6092   25.3321   27.4197       3    0    0    0
56    17.4722   21.6617   25.3845   27.4722       3    0    0    0
57    17.4782   21.6676   25.3905   27.4782       3    0    0    0
58    17.4836   21.6730   25.3959   27.4836       3    0    0    0
Estimates based on the error bound in (6.23) indicate that 68 iterations are
required to obtain a 0.1-optimal policy. In fact, using the stopping criterion
above leads to termination after 58 iterations when ||v^58 − v^57|| = 0.0054. The
0.1-optimal stationary policy is d^ε = (3, 0, 0, 0). This is the policy which orders
only when the stock level is 0, and in that case orders 3 units. Observe that the
optimal policy was first identified at iteration 3 but the algorithm did not
terminate until iteration 58.
    Δ^n ≡ U(Bv^n) − L(Bv^n) < ((1 − λ)/λ) ε = (0.1/0.9) × 0.1 = 0.0111.    (6.49)
To apply this stopping rule, note that Bv^n = v^{n+1} − v^n. Observe from the last
column of Table 6.1 that when using this stopping rule, the algorithm terminates with a 0.1-optimal policy after only 7 iterations.
The policy iteration algorithm is applied as follows.
1. Set d^0 = (0, 0, 0, 0) and n = 0.
2. Solve the evaluation equations obtained by substituting the transition
probabilities and rewards corresponding to policy d^0 into (6.24) to obtain
v^0 = (0, 6.4516, 11.4880, 14.9951).
3. For each s the quantities

    r(s, a) + λ Σ_{j∈S} p(j|s, a) v^0(j)

are computed for a = 0, ..., 3 − s and the actions which achieve the maximum
are placed into A^*_{s,0}. In this example there is a unique maximizing action in
each state so that d^1 = (3, 2, 0, 0).
The detailed step by step calculations for the remainder of the algorithm are
omitted. The value functions and corresponding maximizing actions are presented below. Since there is a unique maximizing action in the improvement
step at each iteration, A^*_{s,n} is equivalent to d^n(s) and only the latter is
displayed in Table 6.2.
The algorithm terminates in three iterations with the optimal policy d* =
(3, 0, 0, 0). Observe that an evaluation was unnecessary at iteration 3 since
d^2 = d^3 terminated the algorithm prior to the evaluation step. Unlike value
iteration, the algorithm has produced an optimal policy as well as its expected
total discounted reward v^3 = v_λ^*. Note in this example that the ε-optimal policy
found using value iteration is optimal but it could not be recognized as such
without using action elimination.
Table 6.2
Policy iteration results
n     v^n(s)                              d^n(s)
      s=0   s=1   s=2   s=3               s=0  s=1  s=2  s=3
The modified policy iteration algorithm of order 5 is applied as follows.
1. Set v^0 = (0, 0, 0, 0), n = 0 and ε = 0.1.
2. Observe that

    r(s, a) + Σ_{j=0}^3 λ p(j|s, a) v^0(j) = r(s, a),

so that the first improvement step reduces to maximizing r(s, a) over a ∈ A_s.
Table 6.3
Iterates of modified policy iteration
n     v^n(s)                                      d(s)                 Δ^n
      s=0       s=1       s=2       s=3           s=0  s=1  s=2  s=3
0     0         0         0         0             0    0    0    0
1     0         6.4507    11.4765   14.9200       3    2    0    0    6.0000
2     7.1215    9.1215    14.6323   17.1215       3    0    0    0    4.9642
3     11.5709   15.7593   19.4844   21.5709       3    0    0    0    2.6011
4     14.3639   18.5534   22.2763   24.3639       3    0    0    0    0.0022
5     15.8483   20.0377   23.7606   25.8483       3    0    0    0
10    17.4604   21.6499   25.3727   27.4604       3    0    0    0
11    17.4938   21.6833   25.4062   27.4938       3    0    0    0
had to be computed for each action at each iteration. Thus in problems with
large action sets, this step of the algorithm would be time consuming. Modified
policy iteration performs this maximization far less frequently so that one
would expect a considerable improvement in efficiency, especially when the
maximizing actions do not change often. Tables 6.1 and 6.3 show that after the
third iteration there was no change in the v^n-improving decision rule so that
the value iteration algorithm performed many unnecessary maximizations.
Based on results using the span based stopping rule, value iteration required 8
maximizations while modified policy iteration of order 5 required 5. While not
a dramatic savings in this small problem, it illustrates the potential for
considerable improvement in larger problems.
Note also that the sequence of v^n-improving decision rules obtained using
value iteration and MPI was different. Comparing Tables 6.1 and 6.2 shows
that those of modified policy iteration and policy iteration were identical. This
is to be expected, since modified policy iteration does an approximate policy
evaluation at each pass through the algorithm which is usually adequate to
identify an improved decision rule. In the example above, when the span based
stopping rule was used with modified policy iteration, MPI required 5 maximi-
zations while policy iteration required 4. Thus because Gaussian elimination
was avoided, MPI probably required fewer multiplications and divisions. In
this small problem, such improvements are unimportant but in problems with
large state spaces, MPI can be considerably more efficient than policy iteration.
This was demonstrated numerically in Puterman and Shin (1978). An open
question in implementing modified policy iteration is how best to choose the
order which can be varied from iteration to iteration.
Calculations using action elimination are not given here. The reader is
referred to Puterman and Shin (1982) and Ohno and Ichiki (1987) for results
using such methods. In Puterman and Shin, one-step ahead action-elimination
algorithms were shown to be most efficient and are recommended for use
together with a Gauss-Seidel version of modified policy iteration whenever
solving a large discounted MDP. It is expected that increased efficiency can be
attained by incorporating other methods of Section 6.3.3.
(1) There exists a finite constant M such that for all d ∈ D,

    || r_d ||_w ≤ M.    (6.51)

(2) There exists a finite non-negative constant L such that for all d ∈ D,

    P_d w ≤ w + L.    (6.52)
Theorem 6.26. Suppose that (6.51) and (6.52) hold. Then the optimality
equation
The condition that ||r_d||_w ≤ M is equivalent to |r(s, a)| ≤ M w(s) for all
a ∈ A_s and s ∈ S, which implies that r grows at most at rate w in s. Based on this,
a suitable choice for w is

    w(s) = max{ 1, sup_{a∈A_s} |r(s, a)| }.

Condition (6.52) can be restated as

    Σ_{j∈S} p(j|s, a) w(j) = E{ w(X_1) } ≤ w(s) + L
for all a ∈ A_s and s ∈ S, where X_1 is the random variable with values in S and
probability distribution given by p(·|s, a). This means that when w is chosen so
that (6.52) holds, under any decision rule the expected value of w at the next
state cannot exceed w(s) by more than L units. This condition places restrictions
on allowable transitions but does allow transitions to distant states with small
weighted probability.
A countable state version of the inventory model of Section 3.3 can be
shown to satisfy (6.51) and (6.52) under the reasonable assumption that the
7. The expected total reward criterion

The next two sections are concerned with MDP's without discounting. In this
section the focus is problems with the expected total reward criterion. Assumptions
are imposed so that the expected total reward is well defined for all policies
and finite for some policies. Implicit in these formulations are restrictions on
the reward functions. When the expected total reward is unbounded or not
well defined for all policies, the average and sensitive optimality criteria of
Section 8 are of greater practical significance.
In Section 6, a complete theory for discounted problems was presented.
Crucial to this theory was the existence of a discount factor λ < 1 which
ensured that
(a) the operator T defined in (6.3) was a contraction,
(b) (I − λP_d)^{−1} existed and was non-negative for all d, and
(c) for bounded v,

    lim_{n→∞} λ^n P_π^n v = 0.

The criterion studied in this section is

    v^π(s) = E_{π,s} { Σ_{t=1}^∞ r(X_t, d_t(H_t)) },

which is the expected total discounted reward with λ = 1. Without additional
assumptions, there is no guarantee that the above limit exists. Also, (a)-(c)
above are not valid. This situation necessitates restricting attention to cases
when v^π is well defined in addition to using different methods of analysis.
Strauch (1966) and Blackwell (1967) distinguished the positive and negative
cases in which the expected total reward is well defined. In the
positive case all rewards are non-negative and in the negative case all rewards
are non-positive. These two cases are distinct because maximization is done in
each so that one cannot be obtained from the other by sign reversal.
The key mathematical properties that will be used in this section are that v^π
is well defined, the optimal return operator is monotone and monotone
sequences are convergent. The behavior of P_π^n v for large n is crucial and
results below about its properties will indicate why the positive and negative
cases are distinguished. To paraphrase Strauch (1966), analysis in the total
reward cases is based on improving the tails while that in the discounted case is
based on ignoring the tails.
Several authors including Hordijk (1974), Schal (1975), van Hee (1978), van
der Wal (1984) and van Dawen (1986a, 1986b) have analyzed MDP's with total
reward without distinguishing the positive and negative cases; however, in this
section they will be discussed separately. The state space will be assumed to be
general except where noted.
for all π ∈ Π. For some results, the uniform boundedness assumption on v^π
can be relaxed (van Hee, 1978).
When S is finite and Σ_{j∈S} p(j|s, a) = 1 for each a ∈ A_s and s ∈ S, (7.1)
holds for each π if and only if r = 0 on the recurrent classes of π. When S is
infinite, r = 0 at each positive recurrent state implies (7.1).
The objectives in analyzing the positive bounded case are to characterize the
value of the MDP,

    v*(s) = sup_{π∈Π} v^π(s),

to determine when optimal or ε-optimal stationary policies exist, and to provide methods for their computation. Throughout, T denotes the optimal return operator with λ = 1:

    Tv(s) = max_{a∈A_s} { r(s, a) + Σ_{j∈S} p(j|s, a) v(j) }.
Theorem 7.1. Suppose there exists a v ∈ V^+ for which v ≥ Tv. Then v ≥ v*.
This example also illustrates another important feature of the positive case,
that the optimality equation does not possess a unique solution, since v* + c
satisfies T(v* + c) = v* + c for any constant vector c.
The following result adapted from Schal (1975) and van Dawen (1986a, b)
provides sufficient conditions for the above implication to be valid.
Theorem 7.3. Suppose there exists a v ∈ V^+ such that v ≤ Tv. Then v ≤ v* if

    lim inf_{N→∞} E_{π,s}{ v(X_N^π) } = lim inf_{N→∞} P_π^N v = 0   for all π ∈ Π.    (7.3)
Theorem 7.4. Suppose there exists a v ∈ V^+ such that v = Tv and (7.3) holds.
Then v = v*.
Theorem 7.5. The optimal return v* is the smallest solution of the optimality
equation Tv = v in V^+ and v* = lim_{n→∞} T^n 0.
Corollary 7.6. The stationary policy based on d' is optimal if and only if v^{d'} is a
fixed point of T.

Van Dawen (1986b) refers to the decision rule described in the above
corollary as unimprovable, that is, d' is unimprovable if Tv^{d'} = v^{d'}. Corollary
7.6 is equivalent to: d' is optimal if and only if d' is unimprovable.
In the discounted case it was shown that an optimal stationary policy existed
under very weak assumptions. In particular, any condition which guaranteed
the existence of a conserving decision rule was sufficient to ensure the existence
of a stationary optimal policy. Unfortunately, this is not the situation in the
positive case as demonstrated by examples of Strauch (1966) and Blackwell
(1967).
The following theorem is based on a subtle argument allowing generalization
of results in the discounted case.
Theorem 7.7. Suppose that S is finite and for each s ∈ S, A_s is finite. Then there
exists an optimal stationary policy.
Value iteration: Theorem 7.5 showed that a solution of the optimality equation
can in principle be obtained by the value iteration scheme v^n = Tv^{n−1} where
v^0 = 0. Unfortunately, stopping rules, bounds and action elimination criteria
are not available for this procedure so its significance is primarily theoretical.
This convergence result can also be used to characterize the structure of an
optimal policy (Kreps and Porteus, 1977) so that search for optima can be
carried out in a smaller class of policies with special form.
Linear programming: Linear programming for finite state and action positive
MDP's has been studied by Kallenberg (1983). The formulation is similar to
that in the discounted case; the constraints of the primal problem are identical
to those in the discounted case with λ set equal to 1. However, the additional
condition that v ≥ 0 is added and consequently the equality constraints in the
dual problem are replaced by inequalities. That v* is a solution of the primal
problem follows from Theorem 7.5. The dual problem is feasible because
x(s, a) = 0 satisfies its inequality constraints. If the dual program has a finite
optimum, then the problem can be solved by the simplex method and an
optimal stationary policy is given by
$$d(s) = \begin{cases} a & \text{if } x(s, a) > 0 \text{ and } s \in S^* , \\ \text{arbitrary} & \text{if } s \in S - S^* , \end{cases}$$
where S* = {s ∈ S: x(s, a) > 0 for some a ∈ A_s}. When the dual is unbounded
a more complicated procedure is provided by Kallenberg.
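As an illustration of the primal formulation just described (the discounted primal constraints with λ = 1 together with v ≥ 0), the sketch below uses scipy.optimize.linprog. It is only a sketch under assumed layouts P[a, s, j] and r[s, a]; instead of reading the dual variables x(s, a), it recovers a greedy decision rule from the computed v*, which is a simplification of the rule stated above.

import numpy as np
from scipy.optimize import linprog

def solve_positive_mdp_lp(P, r, alpha=None):
    # min_v sum_s alpha(s) v(s)  s.t.  v(s) >= r(s,a) + sum_j p(j|s,a) v(j),  v >= 0
    nA, nS, _ = P.shape
    alpha = np.ones(nS) if alpha is None else alpha
    A_ub, b_ub = [], []
    for s in range(nS):
        for a in range(nA):
            A_ub.append(-np.eye(nS)[s] + P[a, s])   # -(v(s) - sum_j p(j|s,a) v(j)) <= -r(s,a)
            b_ub.append(-r[s, a])
    res = linprog(c=alpha, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * nS, method="highs")
    v = res.x
    d = np.argmax(r + np.einsum("asj,j->sa", P, v), axis=1)   # greedy rule with respect to v
    return v, d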
assumption that v^π is finite for at least one policy π ∈ Π; otherwise v^π = −∞ for all π ∈ Π
so that all policies are equivalent under the total expected reward criterion. If
this is the case, the average reward criterion can be used to discriminate
between policies.
The most natural setting for the negative case is cost minimization with
non-negative costs. In such problems the reward function r is interpreted as
negative cost; maximization of expected total reward corresponds to minimiza-
tion of expected total cost. Contributors to theory in the negative case include
Blackwell (1961), Strauch (1966), Kreps and Porteus (1977), Schal (1978),
Whittle (1979, 1980a,b), Hartley (1980), Demko and Hill (1981) and van
Dawen (1985).
Analysis in this section will parallel that in Section 7.1 as closely as possible.
S will be arbitrary except where noted.
Theorem 7.8. Suppose there exists a v ∈ V^- satisfying v ≤ Tv. Then v ≤ v*.
For the implication with the inequalities reversed, additional conditions are
required as illustrated by Example 7.11 below. One such sufficient condition is
given in the following result.
$$\limsup_{N\to\infty} E_s^{\pi}\{v(X_N)\} = \limsup_{N\to\infty} P_{\pi}^{N} v = 0 . \qquad (7.4)$$
Sufficient conditions for (7.4) to hold are identical to those which imply (7.3)
in the positive case. Combining Theorems 7.8 and 7.9 gives the following
important result.
Example 7.11. Let S = {1, 2}, A_1 = {a, b}, A_2 = {a}, r(1, a) = 0, r(1, b) = −1,
r(2, a) = 0, p(1|1, a) = 1, p(2|1, b) = 1, p(2|2, a) = 1 and p(j|s, a') = 0 otherwise.
The following result is the analog of Theorem 7.5 in the positive case. It
provides an alternative characterization of the optimal return to that of
Theorem 7.10.
Theorem 7.12. The optimal expected total reward v* is the largest solution of
Tv = v in V^- and v* = lim_{n→∞} T^n 0.
Theorem 7.13. Suppose d' is conserving. Then the stationary policy d' is
optimal.
Theorem 7.14. Suppose for each s ∈ S, A_s is finite. Then there exists an optimal
stationary policy.
7.2.3. Computation
Value iteration: As a consequence of Theorem 7.12, value iteration is convergent
provided v^0 = 0. As in the positive case, the absence of bounds on
v^n − v* makes solution by value iteration impractical. Van Dawen (1985)
shows that if S is finite and v* > −∞, the convergence rate is geometric.
Policy iteration: In the negative case the situation regarding policy iteration is
the reverse of the positive case. That is, the improvement step is valid while
the termination criterion is not. If d is the current stationary policy and d' is
chosen in the improvement step, then v^{d'} ≥ v^d, so successive iterates are
monotone. But d' can satisfy the stopping criterion (7.5) without being optimal;
in the example referred to above,
$$v^e(2) = Tv^e(2) ,$$
so that (7.5) is satisfied. Thus, the algorithm will terminate in the improvement
step with the suboptimal policy e.
Linear programming: The optimal expected return and optimal policies cannot
be determined by a direct application of linear programming in the negative
case. The primal linear programming problem derived from Theorem 7.12 is
given by
Maximize Σ_{j∈S} α_j v(j)
subject to
$$v(s) \leq r(s, a) + \sum_{j\in S} p(j \mid s, a)\, v(j) , \qquad a \in A_s ,\ s \in S ,$$
and
$$v(s) \leq 0 , \qquad s \in S .$$
Ch. 8. Markov Decision Processes 393
This section presents the theory of Markov decision problems with average
and sensitive optimality criteria. Whittle (1983, p. 118) summarizes the difficul-
ty in analyzing such problems as follows:
"The field of average cost optimization is a strange one. Counterexamples
exist to almost all the natural conjectures and yet these conjectures are the
basis of a proper intuition and are valid if reformulated right or if natural
conditions are imposed."
Because results for these criteria are highly dependent on the structure of the
Markov chains induced by stationary policies, the reader is referred to Chapter
2 on stochastic processes or to a basic text such as Kemeny and Snell (1960) for
basic definitions. Emphasis throughout Section 8 will be on problems with finite
state spaces, but extensions to countable state problems will also be considered.
$$P^* = \lim_{N\to\infty} P^N ,$$
where lim inf or lim sup replaces the limit in (8.3) when necessary (see Section
5). Evaluating the expectation in (8.3) and expressing the result in matrix
terms yields
Combining (8.4) and the above results on the structure of P* yields the
following important result about the form of g.
Proposition 8.1. If s and j are in the same recurrent class, g(s) = g(j). Further,
if the chain is irreducible or unichain, g(s) is constant.
Consequently, in the irreducible and unichain cases, the average reward can
be expressed as gl where g is a scalar and 1 is a vector of ones. In the
multichain case, the gain is constant on each recurrent class. This distinction
has a major influence on the resulting theory.
In the discounted case, Proposition 6.1 shows that the expected total
discounted reward is the unique solution of the linear system v = r + λPv and
consequently v = (I − λP)^{-1} r. Analogous results are now developed for the
average reward case. They are based on the relationship between the dis-
counted and average reward of Blackwell (1962) and Veinott (1969).
For this analysis, it is convenient to parameterize the problem in terms of the
interest rate ρ instead of in terms of the discount rate λ. They are related by
λ = (1 + ρ)^{-1} or ρ = (1 − λ)λ^{-1}. The quantity 1 + ρ is the value of 1 unit one
period in the future.
Theorem 8.2. Let ν be the eigenvalue of P less than one with largest modulus. If
0 < ρ < 1 − |ν|, then
$$v_\lambda = (1 + \rho) \sum_{n=-1}^{\infty} \rho^{n} y_n , \qquad (8.7)$$
where
$$y_{-1} = P^* r \quad\text{and}\quad y_n = (-1)^n H_P^{\,n+1} r , \quad n = 0, 1, \ldots ,$$
with H_P = (I − P + P*)^{-1}(I − P*) the deviation matrix of P. Summing the rewards
over the first N periods gives
$$D_N = \sum_{n=0}^{N-1} P^n r = N P^* r + \sum_{n=0}^{N-1} (P^n - P^*)\, r ,$$
so that
$$D_N = N g + h_P + o(1) , \qquad (8.9)$$
where h_P = H_P r.
Corollary 8.3.
$$v_\lambda = (1 - \lambda)^{-1} g + h + f(\lambda) . \qquad (8.10)$$
The following corollary is useful for extending existence results from the
discounted case to the average reward case (Derman, 1970, pp. 25-28) and
establishing structural results for average reward problems from those for
discounted problems. It follows immediately by multiplying both sides of (8.10)
by 1 - A and passing to the limit.
Corollary 8.4.
$$g = \lim_{\lambda \uparrow 1}\, (1 - \lambda)\, v_\lambda . \qquad (8.11)$$
Substituting the expansion (8.7) into
$$r + [(P - I) - \rho I]\{(1 + \rho)^{-1} v_\lambda\} = 0$$
and equating terms with like powers of ρ yields the following result (Veinott,
1969). The converse follows by multiplying the (n + 1)th equation by P* and
adding it to the nth equation.
Theorem 8.5. The coefficients of the Laurent series expansion of v_λ satisfy the
system of equations
$$(P - I)\, y_{-1} = 0 , \qquad (8.12)$$
$$r - y_{-1} + (P - I)\, y_0 = 0 , \qquad (8.13)$$
$$-y_n + (P - I)\, y_{n+1} = 0 , \quad n \geq 0 . \qquad (8.14)$$
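For a unichain transition matrix these coefficients can be computed directly. The sketch below is not from the text: it assumes the unichain case, forms P* from the stationary distribution, uses the deviation matrix H = (I − P + P*)^{-1}(I − P*), and returns y_{-1} = P*r together with y_n = (−1)^n H^{n+1} r, which can then be checked against (8.12)-(8.14).

import numpy as np

def laurent_coefficients(P, r, n_terms=3):
    # Unichain case: P* = 1 pi^T, pi the stationary distribution of P,
    # and H = (I - P + P*)^{-1} (I - P*) is the deviation matrix.
    nS = P.shape[0]
    I = np.eye(nS)
    A = np.vstack([(I - P).T, np.ones(nS)])        # pi^T (I - P) = 0, sum(pi) = 1
    b = np.concatenate([np.zeros(nS), [1.0]])
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    Pstar = np.outer(np.ones(nS), pi)
    H = np.linalg.inv(I - P + Pstar) @ (I - Pstar)
    ys = [Pstar @ r]                               # y_{-1}
    for n in range(n_terms):
        ys.append((-1) ** n * np.linalg.matrix_power(H, n + 1) @ r)
    # e.g. r - ys[0] + (P - I) @ ys[1] is numerically zero, as in (8.13)
    return ys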
The following two corollaries give the reduced form of Theorem 8.5 that will
be directly applicable to average reward problems. The first gives the equations
to be solved to determine the average reward for transition probability
matrices with general chain structure.
Corollary 8.6. The gain g and the bias h of P satisfy
$$(P - I)\, g = 0 \qquad (8.15)$$
and
$$r - g + (P - I)\, h = 0 . \qquad (8.16)$$
Conversely, if g' and h' satisfy (8.15) and (8.16), then g' = g and h' = h + w
where (P − I)w = 0.
Corollary 8.7. Suppose P is unichain. Then the average reward and the bias
satisfy
$$r - g\mathbf{1} + (P - I)\, h = 0 . \qquad (8.17)$$
Conversely, if the scalar g' and the vector h' satisfy (8.17), then g' = g and
h' = h + w where (P − I)w = 0.
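Corollary 8.7 gives a direct way to evaluate a fixed unichain decision rule numerically: impose h(s_0) = 0 and solve the resulting square linear system for the scalar g and the remaining components of h. The following sketch is only an illustration; the function name and the choice of the normalizing state s0 are hypothetical.

import numpy as np

def gain_and_bias_unichain(P, r, s0=0):
    # Solve r - g 1 + (P - I) h = 0 with h(s0) = 0; P is the (unichain)
    # transition matrix of a fixed decision rule and r its reward vector.
    nS = P.shape[0]
    A = np.zeros((nS, nS))
    A[:, 0] = 1.0                                  # coefficient of g
    cols = [j for j in range(nS) if j != s0]
    A[:, 1:] = (np.eye(nS) - P)[:, cols]           # coefficients of h(j), j != s0
    x = np.linalg.solve(A, r)
    g = x[0]
    h = np.zeros(nS)
    h[cols] = x[1:]
    return g, h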
When the limits above do not exist, either weak or strong average optimal
policies (Section 5) are sought.
The optimality equation in a unichain average reward MDP is given by
$$0 = \max_{d \in D}\,\{\, r_d - g\mathbf{1} + (P_d - I)h \,\} . \qquad (8.19)$$
Suppose instead that g and h satisfy
$$0 \geq \max_{d \in D}\,\{\, r_d - g\mathbf{1} + (P_d - I)h \,\} . \qquad (8.20)$$
Then
$$g \geq \sup_{\pi \in \Pi}\Big[\, \limsup_{n \to \infty}\, n^{-1} v_n^{\pi} \,\Big] .$$
If g and h satisfy (8.19), then
(a)
$$g^* = \sup_{\pi \in \Pi}\Big[\, \lim_{n \to \infty}\, n^{-1} v_n^{\pi} \,\Big]$$
and
(b) g is unique and equals g*.
or equivalently
for all s ∈ S and 0 < λ < 1. Then there exist an h* ∈ V and a scalar g* which
satisfy (8.19) and
$$g^* = \lim_{\lambda \uparrow 1}\, (1 - \lambda)\, v_\lambda(s_0) . \qquad (8.24)$$
Theorem 8.12 is valid in the countable state case under weaker conditions
than (8.23) (Federgruen, Hordijk and Tijms, 1978, 1979). Related work
includes Hordijk (1974), Wijngaard (1977), Federgruen, Schweitzer and Tijms
(1983) and Deppe (1984).
The assumption that solutions to the optimality equation are bounded is
crucial for the existence of average optimal stationary policies. Counterexamples
have been provided by Fisher and Ross (1968) and Ross (1983), Bather
(1973), Sheu and Farn (1980) and Schweitzer (1985).
constant has no effect on the maximizing decision rule in (8.26) since for any h
satisfying (8.25) and any constant c, the decision rule attaining the maximum in
(8.26) for h + c1 is the same as that for h. Possible specifications of h_{d_n} include
$$h_{d_n}(s_0) = 0 , \qquad (8.27)$$
$$r_{d_n} = Q_{s_0} w , \qquad (8.28)$$
and Blackwell's specification
$$P^*_{d_n} h_{d_n} = 0 , \qquad (8.29)$$
in which case h_{d_n} = H_{P_{d_n}} r_{d_n} = h^B_{d_n}, the bias of d_n.
Theorem 8.14. If all states are recurrent under every stationary policy and the
sets of states and actions are finite, then policy iteration converges in a finite
number of iterations.
When there are transient states associated with some (or all) stationary
policies, additional analysis is based on:
(a) $$h^B_{d_{n+1}} = h^B_{d_n} - P^*_{d_{n+1}} h^B_{d_n} + H_{d_{n+1}} B(g_{d_n}, h_{d_n}) , \qquad (8.31)$$
(c) if B(g_{d_n}, h_{d_n})(s) = 0 for all s that are recurrent under d_{n+1} and
B(g_{d_n}, h_{d_n})(s) = 0 for all s which are transient under d_{n+1}, then
h^B_{d_{n+1}} = h^B_{d_n}.
Theorem 8.16. Suppose all stationary policies are unichain and the sets of states
and actions are finite. Then policy iteration converges in a finite number of
iterations.
The above results provide additional insight into the behavior of the iterates of
policy iteration. If s is recurrent and j is transient under action a, then p(j|s, a) = 0.
Consequently, once the optimality equation is satisfied on all states that are
recurrent under a decision rule δ which attains the maximum in the improvement
step, there will be no further changes in the gain, and any stationary policy
which agrees with δ on its recurrent states is average optimal. Since in
subsequent iterations the bias is increased in transient states of the maximizing
policy until no further improvement is possible, one might suspect that policy
iteration terminates with a policy that is bias-optimal, that is, one with the
largest bias among all policies with the same gain as δ. This supposition is false
(see Example 2 of Denardo (1973), in which two policies have the same gain but
different recurrent classes). What is attained in this case is the policy with the
largest bias among policies which have the same recurrent class as δ. To find
bias optimal policies requires solution of an additional optimality equation
(Section 8.9).
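Under the unichain assumption the whole algorithm is short. The following sketch is not the algorithm as stated in the text but a minimal illustration of it: evaluation uses the helper gain_and_bias_unichain sketched after Corollary 8.7, improvement maximizes r_d + P_d h componentwise, and the iteration stops when the current rule already attains the maximum.

import numpy as np

def policy_iteration_unichain(P, r, s0=0, max_iter=100):
    # P: (nA, nS, nS) with P[a, s, j] = p(j|s,a); r: (nS, nA).
    # Assumes every stationary policy is unichain.
    nA, nS, _ = P.shape
    d = np.zeros(nS, dtype=int)                       # initial decision rule
    for _ in range(max_iter):
        Pd = P[d, np.arange(nS)]                      # rows p(.|s, d(s))
        rd = r[np.arange(nS), d]
        g, h = gain_and_bias_unichain(Pd, rd, s0)     # evaluation (Corollary 8.7)
        q = r + np.einsum("asj,j->sa", P, h)          # r(s,a) + sum_j p(j|s,a) h(j)
        if np.all(q[np.arange(nS), d] >= q.max(axis=1) - 1e-10):
            break                                     # d already attains the maximum
        d = q.argmax(axis=1)                          # improvement step
    return g, h, d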
8.4.1. Convergence
Value iteration in the undiscounted case is based on the operator
$$Tv = \max_{d \in D}\,[\, r_d + P_d v \,] . \qquad (8.32)$$
For a fixed decision rule d,
$$v^n = \sum_{m=0}^{n-1} P_d^m r_d + P_d^n v^0 ,$$
so that
$$v^n = n g_d + H_d r_d + P_d^n v^0 + o(1) . \qquad (8.34)$$
This suggests that
$$L = \lim_{n \to \infty}\,\{\, v^n - n g^* \,\} \qquad (8.35)$$
always exists. The following simple example shows this conjecture is false.
Example 8.17. Let S = {1, 2} and suppose there is a single decision rule d, with
$$r_d = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \quad\text{and}\quad P_d = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} .$$
If
$$v^0 = \begin{pmatrix} a \\ b \end{pmatrix} ,$$
then
$$v^n = P^n v^0 = \begin{pmatrix} a \\ b \end{pmatrix} \ \text{for } n \text{ even}, \qquad v^n = \begin{pmatrix} b \\ a \end{pmatrix} \ \text{for } n \text{ odd}.$$
Thus unless a = b, lim_{n→∞} {v^n − ng*} does not exist, but for any choice of
a and b, both lim_{n→∞} v^{2n} and lim_{n→∞} v^{2n+1} exist.
In this example, states 1 and 2 are both recurrent but each is periodic with
period 2. This suggests that periodicity causes problems for the convergence of
value iteration in average reward problems.
When the limit in (8.35) exists, value iteration can be used to solve the MDP
because:
(a) for N sufficiently large,
$$v^N - v^{N-1} \approx L + Ng^* - (L + (N-1)g^*) = g^* ,$$
(b) v^N − Ng* ≈ h*, as the sketch below illustrates.
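A minimal sketch of this use of value iteration follows; it is not from the text and assumes layouts P[a, s, j] and r[s, a] together with an aperiodic unichain model so that the limit in (8.35) exists. It stops when the gap between the bounds of (8.39) falls below a tolerance and then forms the estimates in (a) and (b).

import numpy as np

def undiscounted_value_iteration(P, r, eps=1e-3, max_iter=100000):
    # v^{n+1} = Tv^n with min_s (Tv^n - v^n)(s) <= g* <= max_s (Tv^n - v^n)(s).
    nA, nS, _ = P.shape
    v = np.zeros(nS)
    for n in range(1, max_iter + 1):
        q = r + np.einsum("asj,j->sa", P, v)
        Tv = q.max(axis=1)
        diff = Tv - v                                 # Tv^{n-1} - v^{n-1}
        v = Tv
        if diff.max() - diff.min() < eps:
            break
    g_est = diff                                      # v^N - v^{N-1}, estimate of g*
    h_est = v - n * g_est                             # v^N - N g*, estimate of h*
    return g_est, h_est, q.argmax(axis=1)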
Theorem 8.18. Let S be finite and let v^{n+1} = Tv^n. Then the limit in (8.35) exists
for any v^0 ∈ V if any of the following conditions hold:
(a) For all s ∈ S, p(j|s, a) > 0 for all a ∈ A_s and j ∈ S (Bellman, 1957).
(b) There exists a state s_0 and an integer ν ≥ 1 such that

then condition (c) is satisfied and v^n is (trivially) convergent for any v^0. In
practice, conditions (a) and (c) are easiest to check.
Schweitzer and Federgruen (1977) provide necessary and sufficient condi-
tions for the limit (8.35) to exist for all v^0, generalizing Brown (1965) and
Lanery (1967).
$$r_{d_n} + P_{d_n} v^n = Tv^n ,$$
then
$$L(v^{n-1}) \leq L(v^n) \leq g_{d_n} \leq g^* \leq U(v^n) \leq U(v^{n-1}) . \qquad (8.39)$$
(a) w(s) = lim_{n→∞} w^n(s), s ∈ S, exists;
(b) g = w(s_0) and h(s) = w(s) satisfy B(g, h) = 0; and
(c) if (8.36) holds,
Minimize g
subject to
$$g + h(s) - \sum_{j \in S} p(j \mid s, a)\, h(j) \geq r(s, a) , \qquad a \in A_s ,\ s \in S .$$
The dual problem includes the constraints
$$\sum_{a \in A_j} x(j, a) - \sum_{s \in S}\sum_{a \in A_s} p(j \mid s, a)\, x(s, a) = 0 , \qquad j \in S ,$$
and an optimal decision rule is obtained from an optimal dual solution x by
$$d^*(s) = \begin{cases} a & \text{if } x(s, a) > 0 ,\ s \in S^* , \\ \text{arbitrary} & \text{if } s \in S - S^* , \end{cases}$$
where S* = {s ∈ S: x(s, a) > 0 for some a ∈ A_s}.
For any decision rule d* obtained in the above manner, x*(s, d*) is an
optimal solution to the dual problem and satisfies the equalities
and
Theorem 8.23. Suppose that the transition probability matrix of every stationary
policy is irreducible. Let x be any feasible solution to the dual problem, and
define the randomized stationary policy d by
Then
4. Obtain v^{n+1} by
$$v^{n+1} = (T_{d_{n+1}})^{m_n + 1} v^n \qquad (8.49)$$
where
Theorem 8.24. Suppose that S and A_s for each s ∈ S are finite, that for some
α > 0, P_d ≥ αI for all d ∈ D and that all stationary policies are unichain. Then if
{v^n} is generated by modified policy iteration:
(a) L(v^n) converges monotonically and exponentially fast to g*, and
(b) U(v^n) converges exponentially fast to g*.
Results are not as complete as in the unichain case and few are available
when the set of stationary policies is infinite. The assumption that the sets of
states and actions are finite is required for results in this section. Discussion will
focus on the optimality equation and policy iteration. The reader is referred to
Denardo and Fox (1968), Denardo (1970), Derman (1970), Dirickx and Rao
(1979) and Kallenberg (1983) for linear programming for multichain MDP's.
Little is known about value iteration and modified policy iteration for multi-
chain problems.
and
for all d ∈ D for which P_d g = g, with equality holding in (8.53) for at least one
such d.
In the unichain case the first optimality equation above is redundant and
E = D. This is because when all decision rules are unichain, P_d g = g implies
that g is a constant so that equation (8.51) is satisfied for all d ∈ D. The above
reduces to (8.18) in the unichain case. If D replaces E in (8.52) then possibly a
different decision rule attains the maximum in each equation.
Establishing that solutions to this pair of equations characterize average
optimal policies is not as straightforward as in the unichain case. A proof is
based on the existence of a Blackwell optimal stationary policy as defined in
(5.11). When S is finite, a policy π* is Blackwell optimal if there exists a
λ*, 0 ≤ λ* < 1, such that π* is discount optimal for all λ ∈ [λ*, 1). The
following important theorem is due to Blackwell (1962). An elegant non-constructive
proof using function theory was provided by Blackwell; a constructive
proof is based on the policy iteration algorithm in Section 8.9.
Theorem 8.27. Suppose (g*, h*) satisfies (8.51) and (8.52) and d* attains the
maximum in (8.52) at (g*, h*). Then for all π ∈ Π,
and
$$g_{d^*} \geq g^{\pi} .$$
The above algorithm yields a sequence of decision rules {d_n} and corresponding
gains {g_{d_n}}. The pair of matrix equations (8.55) uniquely determines
the gain, but the relative values {h_{d_n}} are only unique up to a u satisfying
(P_{d_n} − I)u = 0. If P_{d_n} has k recurrent classes, then h_{d_n} will be unique up to k
arbitrary constants which can be determined by setting h_{d_n}(s) = 0 for an
arbitrary s in each recurrent class of P_{d_n} (Howard, 1960). Blackwell's specification
(8.29) can also be used, but is computationally prohibitive. Veinott (1969)
provides a method for finding an h which satisfies (8.58). In practice, any h will
do.
The improvement step of the algorithm consists of two phases. First, improvement
is attempted through the first optimality equation (8.56), that is, a
g_{d_n}-improving decision rule is sought. (Call a decision rule d' g-improving if
d' = arg max_{d∈D} {P_d g}.) If no strict improvement is possible, an h_{d_n}-improving
decision rule is found among all g_{d_n}-improving rules, and if no strict improvement
is possible, the iterations are stopped. Otherwise, the improved policy is
evaluated at the subsequent iteration.
In the unichain case, the first equation in (8.55) and part (a) of the
improvement step are redundant so that the algorithm reduces to the unichain
policy iteration algorithm.
Proofs that policy iteration is convergent in the multichain case for finite
state and action problems have been provided by Blackwell (1962) using the
partial Laurent expansion and Denardo and Fox (1968) using detailed analysis
of the chain structure. The next result shows the monotone nature of the
multichain policy iteration algorithm and is the basis for a proof of con-
vergence.
until the algorithm terminates with d_{n+1} = d_n at which point the optimality
equations are satisfied. Finite convergence is ensured since there are only
finitely many policies and the algorithm is monotone in the sense of (8.59).
This yields the following result.
Theorem 8.29. Suppose A s for each s E S and S are finite. Then the policy
iteration algorithm terminates in a finite number of iterations with a gain optimal
stationary policy and a pair (g*, h*) which satisfy the optimality equations
(8.51) and (8.52).
Based on the above arguments, one might speculate that this implies that the
gains of successive policies are strictly increasing. This is not the case because
(8.59) does not exclude the possibility that the gains of two successive policies
are identical and improvement occurs in the bias term.
Howard (1960, pp. 69-74) and Denardo and Fox (1968, pp. 477-479) have
analyzed the behavior of the iterative process in detail and have shown that:
(a) the gains of successive policies are monotone non-decreasing,
(b) if improvement occurs in step 3a of the algorithm, then it can only be in
transient states of d_{n+1}, in which case g_{d_{n+1}}(s) > g_{d_n}(s) where s is transient
under d_{n+1},
(c) if no improvement occurs in step 3a of the algorithm and it occurs in a
recurrent state of d_{n+1} in step 3b of the algorithm, then g_{d_{n+1}}(s) > g_{d_n}(s) where
s is recurrent under d_{n+1}, and
(d) if no improvement occurs in step 3a of the algorithm and it occurs in a
transient state of d_{n+1} in part 3b of the algorithm, then h_{d_{n+1}}(s) > h_{d_n}(s) where s
is transient under d_{n+1}.
In the special case that all policies are communicating, Haviv and Puterman
(1990) provide a modification of the unichain policy iteration algorithm which
avoids the pair of optimality equations.
At present, non-trivial conditions which imply the convergence of policy
iteration in the non-finite case are not available. Dekker (1985, pp. 109-110)
provides an example with finite states and compact actions in which an infinite
number of improvements occur in step 3a and converge to a suboptimal policy.
In it, the limiting policy has different ergodic classes than the optimal policy
and since improvements through 3a cannot create new ergodic classes, the
algorithm will not converge to an optimal policy.
A policy with the largest bias among all average optimal policies is said to be
bias-optimal. Veinott (1966), Denardo (1970) and Kallenberg (1983) provide
methods for obtaining such policies in the finite state and action setting. Sheu
and Farn (1980) and Mann (1983) have also studied this criterion.
Since bias-optimal policies need not be unique, a decision maker might wish
to have some way of selecting a 'best' bias-optimal policy. Veinott (1969)
introduced the concept of sensitive discount optimality and using the Laurent
series expansion (8.7), showed that it provided a link between average
optimality, bias-optimality and Blackwell optimality. Contributors to this
theory include Miller and Veinott (1969), Veinott (1974), Hordijk and Sladky
(1977), Wijngaard (1977), van der Wal (1981), Federgruen and Schweitzer
(1984a) and Dekker (1985).
This section presents the theory of sensitive discount optimality in the finite
state and action case.
for all π ∈ Π. This criterion can be reexpressed in terms of the interest rate
ρ = (1 − λ)/λ (Section 8.1.2) as follows:
$$\liminf_{\rho \downarrow 0}\, \rho^{-n}\, [\, v_\rho^{\pi^*} - v_\rho^{\pi} \,] \geq 0 . \qquad (8.60)$$
if and only if g_d ≥ g_e. When g_d = g_e, the limit in (8.62) will be zero, in which
case
$$\liminf_{\rho \downarrow 0}\, [\, v_\rho^{d} - v_\rho^{e} \,] \geq 0 \qquad (8.63)$$
if and only if h_d^B ≥ h_e^B. If g_d(s) > g_e(s) for some s in S, then (8.63) will hold with
strict inequality in component s, regardless of the values of h_d^B(s) and h_e^B(s)
(these quantities are defined in Section 8.3.1).
Similarly, if g_d = g_e and h_d^B = h_e^B, then the lim inf's in both (8.62) and (8.63)
will be zero and
$$\liminf_{\rho \downarrow 0}\, \rho^{-1}\, [\, v_\rho^{d} - v_\rho^{e} \,] \geq 0 \qquad (8.64)$$
if and only if y_1^d ≥ y_1^e. Conversely, the sth component of (8.64) will be strictly
positive if either
(a) g_d(s) > g_e(s), or
(b) g_d(s) = g_e(s) and h_d^B(s) > h_e^B(s), or
(c) g_d(s) = g_e(s), h_d^B(s) = h_e^B(s) and y_1^d(s) > y_1^e(s).
These arguments can be repeated indefinitely to demonstrate that:
(1) The larger the value of n, the more selective the discount optimality
criteria. That is, if D_n^* denotes the set of n-discount optimal stationary policies,
then D_{n-1}^* ⊇ D_n^* for n = 0, 1, . . . .
Theorem 8.30. If A_s for each s ∈ S and S are finite, then there exists a stationary
n-discount optimal policy for each n.
$$r_{d_n}^{m} - y_{m-1} + (P_{d_n} - I)\, y_m = 0 , \qquad (8.68)$$
$$r_{d_n}^{m+1} - y_m + (P_{d_n} - I)\, y_{m+1} = 0 , \qquad (8.69)$$
subject to y_{m+1}(s) = 0 for one s in each recurrent class of P_{d_n}.
4. (Policy improvement)
(a) Choose a d_{n+1} ∈ D_m to satisfy
$$r_{d_{n+1}}^{m+1} + P_{d_{n+1}} y_{m+1} = \max_{d \in D_m}\,\{\, r_d^{m+1} + P_d y_{m+1} \,\} \qquad (8.71)$$

$$\liminf_{\rho \downarrow 0}\, \rho^{-m}\, [\, v_\rho^{d_{n+1}} - v_\rho^{d_n} \,] \geq 0 .$$
Since there are only finitely many stationary policies, this step terminates with
D_{m+1}, the set of m-discount optimal stationary policies. When m = N, the
algorithm terminates with the set of N-discount optimal policies.
Since Blackwell optimality corresponds to ∞-discount optimality, the above
suggests that an infinite number of passes through the above policy iteration
algorithm is necessary to obtain a Blackwell optimal stationary policy. Miller
and Veinott (1969) showed that this is not the case.
Veinott (1974) and Lamond (1986) showed that the N in Theorem 8.33 can
be replaced by N − k, where k is the number of recurrent classes in an (N − k)-
discount optimal policy. The following immediate corollary to the above
theorem ties together many of the results in Section 8.
Corollary 8.34. Suppose A_s for each s ∈ S and S are finite. Then there exists a
stationary Blackwell optimal policy which can be determined by the policy
iteration algorithm in a finite number of iterations.
Veinott (1966), Denardo and Miller (1968), Lippman (1968), Sladky (1974),
Hordijk and Sladky (1977), Denardo and Rothblum (1979) and van der Wal
(1981) have investigated the relationship between discount optimality, overtak-
ing optimality and average optimality as defined in Section 5.
under some policy. Consequently value iteration and modified policy iteration
are convergent. Since not all policies are unichain the multichain version of
policy iteration is required.
Table 8.1
Policy iteration results
n    g_{d_n}(s) (h_{d_n}(s))                                                          d_n(s)
     s=0                 s=1                 s=2                 s=3                  s=0  s=1  s=2  s=3
0    0       (0)         0       (-3.0)     0       (-1.0)      0       (5.0)         0    2    1    0
1    0       (0)         0       (6.6667)   0       (12.4444)   0       (17.1852)     0    0    0    0
2    1.6     (-5.08)     1.6     (-3.08)    1.6     (2.12)      1.6     (4.92)        3    2    0    0
3    2.2045  (-4.2665)   2.2045  (-0.5393)  2.2045  (3.2789)    2.2045  (5.7335)      3    0    0    0
4                                                                                     3    0    0    0
Since the average optimal policy is unique, it is n-discount optimal for all n.
Thus it is Blackwell optimal and discount optimal for all discount factors
sufficiently close to 1. It agrees with that found in the discounted case with
λ = 0.9.
Table 8.2
Value iteration results
n v"(s) d'(s) A"
s=0 s=l s=2 s=3 s=0 s=l s=2 s=3
0 0 0 0 0 0 0 0 0
1 0 5.0 6.0 5.0 2 0 0 0 6.0000
2 2.0 6.25 10.0 10.50 3 0 0 0 4.2500
3 4.1875 8.0625 12.125 14.1875 3 0 0 0 1.8750
4 6.625 10.1563 14.1094 16.625 3 0 0 0 0.4531
5 8.75 12.5078 16.2617 18.75 3 0 0 0 0.1465
6 10.9453 14.6895 18.5068 20.9453 3 0 0 0 0.0635
7 13.1621 16.8813 20.7078 23.1621 3 0 0 0 0.0164
8 15.3647 19.0919 22.9081 25.3647 3 0 0 0 0.0103
9 17.5682 21.2965 25.1142 27.5685 3 0 0 0 0.0025
Observe that after 9 iterations, the difference between the bounds is 0.0025
so that the decision rule ( 3 , 0 , 0 , 0) (which is the unique optimal policy
identified by policy iteration) is guaranteed to have a gain that is within 0.0025
of optimum. That is, 0 ≤ g* − g_{d_9} ≤ 0.0025. Estimates of g* and h* are
obtained using the method described in Section 8.4.1. That is,
$$g^* \approx v^9 - v^8 = \begin{pmatrix} 2.2035 \\ 2.2046 \\ 2.2060 \\ 2.2035 \end{pmatrix} \quad\text{and}\quad h^* \approx v^9 - 9 g^* = \begin{pmatrix} -2.2633 \\ 1.4551 \\ 5.2602 \\ 7.7370 \end{pmatrix} .$$
For a specified policy the system evolves by remaining in a state for a random
amount of time and then jumping to a different state. These models are called
semi-Markov because for fixed Markov policies the system states evolve
according to a semi-Markov process.
Analysis of SMDP's depends on the set of admissible policies or controls.
When actions are allowed at any time, continuous time control theory methods
are appropriate. If actions can be chosen only immediately following transitions
and the time horizon is infinite, or a fixed number of transitions, discrete
time MDP methods can be adapted. When actions are allowed only after
transitions, the models are often referred to as Markov renewal programs
(MRP's); however, not all authors distinguish MRP's and SMDP's in this way.
When the time horizon is finite and fixed, MRP's also require control theory
methods for analysis. This section presents results for infinite horizon Markov
renewal programs.
MRP's are most widely used to model equipment replacement, queueing
control and inventory control problems. Notable special cases are continuous
time MDP's (CTMDP's) in which the transition times are exponentially
distributed and MDP's in which all transition times are constant and equal.
Semi-Markov decision processes were introduced by Jewell (1963); other
contributors to the theory include Howard (1963), de Cani (1964), Schweitzer
(1965, 1971), Fox (1966) and Denardo (1971). They are also treated in the
books of Kallenberg (1983) and Heyman and Sobel (1984). Numerous papers,
most notably in the infinite horizon average reward case, have been based on
applying a transformation (Schweitzer, 1971) to convert SMDP's to MDP's. A
good reference on semi-Markov processes is Çinlar (1975).
q(t|s, d(s), j), k_d(s) = k(s, d(s)) and c_d(s) = c(s, d(s)). To avoid technicalities,
it will be assumed that the decision making period begins at the time of the first
transition and no rewards are received until that time. Let π = (d_1, d_2, . . .) be
an arbitrary policy which corresponds to using d_n at the time of the nth
transition. Corresponding to this policy is a semi-Markov process {Y_t; t ≥ 0}
which gives the state of the system at time t and a process {U_t; t ≥ 0} which
gives the action chosen at time t. This process can be further described in terms
of a sequence of jointly distributed random variables {(X_n, τ_n); n = 1, 2, . . .}
where X_n is the state of the system immediately following the nth transition
and τ_n is the length of time the system is in state X_n. It is convenient to define σ_n
as the total time until the nth transition starting at the time of the first
transition, that is
$$\sigma_n = \sum_{j=1}^{n-1} \tau_j$$
and σ_1 = 0.
The first term in (9.2) corresponds to the continuous portion of the reward and
the second term to the fixed reward received only at decision epochs.
The objective in this problem is to characterize
$$v^*(s) = \sup_{\pi \in \Pi} v^{\pi}(s)$$
for all s ∈ S and to find a policy π* with the property that
$$v^{\pi^*}(s) = v^*(s)$$
for all s ∈ S.
This problem is transformed to a discrete time problem by allowing the
discount factor to be state and action dependent and analyzing the problem in
terms of its embedded chain. Define r_d(s) to be the expected total discounted
reward until the next transition if the system just entered state s and decision
rule d is selected. It is given by
$$r_d(s) = k_d(s) + c_d(s)\, E_s^{d}\Big[ \int_0^{\tau_1} e^{-\alpha t}\, dt \Big] ,$$
where τ_1 is the time until the first transition given that the system just entered
state s and decision rule d is used. Define the expected discounted holding time
in state s if action a is selected by
$$\lambda(s, a) = E_s^{a}\big[ e^{-\alpha \tau_1} \big] = \int_0^{\infty} e^{-\alpha t}\, dF(t \mid s, a) .$$
For d ∈ D, define λ_d(s) = λ(s, d(s)). Note that λ_d(s) is the Laplace transform of
τ_d.
Thus
$$v^{\pi}(s) = r_{d_1}(s) + E_s^{d_1}\big[ e^{-\alpha \tau_1}\, v^{\pi'}(X_2) \big] = r_{d_1}(s) + \sum_{j \in S} \lambda_{d_1}(s)\, p_{d_1}(j \mid s)\, v^{\pi'}(j) ,$$
where π' = (d_2, d_3, . . .). For a stationary policy d, the infinite horizon expected
total discounted reward can be obtained by solving the equation
$$v^{d}(s) = r_d(s) + \sum_{j \in S} \lambda_d(s)\, p_d(j \mid s)\, v^{d}(j) ,$$
$$v = r_d + M_d v \qquad (9.5)$$
where M_d is the matrix with entries λ_d(s) p_d(j|s). This differs from the discrete
time evaluation equation (6.8) by the state and action dependent discount rate
(9.4). From a computational point of view this causes the matrix M_d to have
unequal row sums bounded by λ* = sup_{s,a} λ(s, a). The efficiency of numerical
methods for solution of (9.5) has been investigated by Porteus (1980a, 1983).
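Once r_d and λ_d have been obtained from the model data, (9.5) is just a linear system. The sketch below is illustrative only; the input names p (embedded transition matrix p_d(j|s)), lam (the factors λ_d(s)) and r_d are assumptions of the example.

import numpy as np

def smdp_discounted_value(p, lam, r_d):
    # Solve v = r_d + M_d v, where M_d has entries lam[s] * p[s, j] and all
    # row sums are at most lambda* < 1, so I - M_d is nonsingular.
    nS = p.shape[0]
    M = lam[:, None] * p
    return np.linalg.solve(np.eye(nS) - M, r_d)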
The optimality equation for Markov renewal programs is given by
If (9.1) holds, λ* < 1, and if in addition ‖r_d‖ ≤ M < ∞ for all d ∈ D, T defined
in (9.6) is a contraction operator on the space of bounded real valued functions
(1) The optimality equation has a unique solution v*.
(2) There exists a stationary policy which is optimal.
(3) The problem can be solved by value iteration, policy iteration, modified
policy iteration, linear programming and their variants.
(4) Bounds and action elimination methods are valid.
(5) Extensions to unbounded rewards are possible.
I-t"T ~'% 1
:LJ0 z
rt=l n
(97
where v r is the random variable representing the number of decisions made up
to time T using policy 7r. For each zr E H, define the average expected reward
or gain by
~-.~inf ~v;(s),
g~(s) = lim 1 s~s.
The objective in the average reward case is to characterize the optimal average
expected reward
Let r_d(s) be the expected total reward until the next transition when the
system is in state s and decision rule d is used, and define the expected holding
time in state s under action a by
$$H(s, a) = \int_0^{\infty} t\, dF(t \mid s, a) . \qquad (9.9)$$
For d ∈ D, define H_d(s) = H(s, d(s)). Under (9.1), η ≡ inf_{s,a} H(s, a) > 0.
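For a fixed stationary policy whose embedded chain is unichain, its gain can be computed directly from r_d and H_d by the renewal reward theorem, g_d = Σ_s π(s) r_d(s) / Σ_s π(s) H_d(s), with π the stationary distribution of the embedded chain. This computation is not spelled out in the text; the sketch below is an illustration under that unichain assumption.

import numpy as np

def mrp_gain_unichain(p, r_d, H_d):
    # p: embedded transition matrix of the policy; r_d: expected reward per
    # transition; H_d: expected holding times (9.9).
    nS = p.shape[0]
    A = np.vstack([(np.eye(nS) - p).T, np.ones(nS)])   # pi^T (I - p) = 0, sum(pi) = 1
    b = np.concatenate([np.zeros(nS), [1.0]])
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return float(pi @ r_d) / float(pi @ H_d)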
Bibliography
Bather, J. (1975). Optimal decision procedures for finite Markov chains. Adv. Appl. Probab. 5,
328-339, 521-540, 541-553.
Bellman, R.E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.
Bertsekas, D.P. (1987). Dynamic Programming, Deterministic and Stochastic Models. Prentice-
Hall, Englewood Cliffs, NJ.
Blackwell, D. (1961). On the functional equation of dynamic programming. J. Math. Anal. Appl.
2, 273-276.
Blackwell, D. (1962). Discrete dynamic programming. Ann. Math. Statist. 35, 719-726.
Blackwell, D. (1965). Discounted dynamic programming. Ann. Math. Statist. 36, 226-235.
Blackwell, D. (1967). Positive dynamic programming. Proc. 5th Berkeley Symp. Mathematical
Statistics and Probability 1, 415-418.
Brown, B.W. (1965). On the iterative method of dynamic programming on a finite space discrete
Markov process. Ann. Math. Statist. 36, 1279-1286.
Çinlar, E. (1975). Introduction to Stochastic Processes. Prentice-Hall, Englewood Cliffs, NJ.
De Cani, J.S. (1964). A dynamic programming algorithm for embedded Markov chains when the
planning horizon is at infinity. Management Sci. 10, 716-733.
Dekker, R. (1985). Denumerable Markov decision chains: Optimal policies for small interest
rates. Unpublished Ph.D. Dissertation, University of Leiden.
Dembo, R. and M. Haviv (1984). Truncated policy iteration methods. Oper. Res. Lett. 3,
243-246.
Demko, S. and T.P. Hill (1981). Decision processes with total cost criteria. Ann. Probab. 9,
293-301.
Denardo, E.V. (1967). Contraction mappings in the theory underlying dynamic programming.
SIAM Rev. 9, 169-177.
Denardo, E.V. and B. Fox (1968). Multichain Markov renewal programming. SIAM J. Appl.
Math. 16, 468-487.
Denardo, E.V. and B.L. Miller (1968). An optimality condition for discrete dynamic programming
with no discounting. Ann. Math. Statist. 39, 1220-1227.
Denardo, E.V. (1970). Computing a bias-optimal policy in a discrete-time Markov decision
problem. Oper. Res. 18, 279-289.
Denardo, E.V. (1971). Markov renewal programs with small interest rates. Ann. Math. Statist. 42,
477-496.
Denardo, E.V. (1973). A Markov decision problem. In: T.C. Hu and S.M. Robinson (Eds.),
Mathematical Programming. Academic Press, New York.
Denardo, E.V. and U.G. Rothblum (1979). Overtaking optimality for Markov decision chains
Math. Oper. Res. 4, 144-152.
Denardo, E.V. (1982). Dynamic Programming, Models and Applications. Prentice-Hall, Engle-
wood Cliffs, NJ.
D'Epenoux, F. (1963). Sur un problème de production et de stockage dans l'aléatoire. Rev.
Française Automat. Informat. Rech. Opér. 14 (English Transl.: Management Sci. 10, 98-108).
Deppe, H. (1984). On the existence of average optimal policies in semi-regenerative decision
models. Math. Oper. Res. 9, 558-575.
Derman, C. (1966). Denumerable state Markovian decision processes--Average cost criterion.
Ann. Math. Statist. 37, 1545-1554.
Derman, C. and R. Strauch (1966). A note on memoryless rules for controlling sequential decision
processes. Ann. Math. Statist. 37, 276-278.
Derman, C. (1970). Finite state Markovian decision processes. Academic Press, New York.
Dirickx, Y.M.J. and M.R. Rao (1979). Linear programming methods for computing gain-optimal
policies in Markov decision models. Cah. Centre d'Etudes Rech. Oper. 21, 133-142.
Dubins, L.E. and L.J. Savage (1965). How to Gamble if You Must: Inequalities for Stochastic
Processes. McGraw-Hill, New York.
Eagle, J.E. (1975). A Utility Criterion for the Markov Decision Process. Unpublished Ph.D.
Dissertation, Dept. of Engineering-Economic Systems, Stanford University.
430 M.L. Puterman
Federgruen, A., A. Hordijk and H.C. Tijms (1978). Recurrence conditions in denumerable state
Markov decision processes. In: M.L. Puterman (Ed.), Dynamic Programming and Its Applica-
tions. Academic Press, New York, 3-22.
Federgruen, A. and P.J. Schweitzer (1978). Discounted and undiscounted value-iteration in
Markov decision problems: A survey. In: M.L. Puterman (Ed.), Dynamic Programming and Its
Applications. Academic Press, New York, 23-52.
Federgruen, A. and H.C. Tijms (1978). The optimality equation in average cost denumerable state
semi-Markov decision problems, recurrency conditions and algorithms. J. Appl. Probab. 15,
356-373.
Federgruen, A., A. Hordijk and H.C. Tijms (1979). Denumerable state semi-Markov decision
processes with unbounded costs, average cost criteria. Stochastic Process. Appl. 9, 223-235.
Federgruen, A. and P.J. Schweitzer (1980). A survey of asymptotic value-iteration for undis-
counted Markovian decision processes. In: R. Hartley, L.C. Thomas and D.J. White (Eds.),
Recent Developments in Markov Decision Processes. Academic Press, New York, 73-109.
Federgruen, A., P.J. Schweitzer and H.C. Tijms (1983). Denumerable undiscounted semi-Markov
decision processes with unbounded rewards. Math. Oper. Res. 8, 298-313.
Federgruen, A. and J.P. Schweitzer (1984a). Successive approximation methods for solving nested
functional equations in Markov decision problems. Math. Oper. Res. 9, 319-344.
Federgruen, A. and J.P. Schweitzer (1984b). A fixed point approach to undiscounted Markov
renewal programs. SIAM J. Algebraic Discrete Methods 5, 539-550.
Fisher, L. and S.M. Ross (1968). An example in denumerable decision processes. Ann. Math.
Statist. 39, 674-676.
Fleming, W.H. and R. Rishel (1975). Deterministic and Stochastic Optimal Control. Springer-
Verlag, New York.
Flynn, J. (1976). Conditions for the equivalence of optimality criteria in dynamic programming.
Ann. Statist. 4, 936-953.
Fox, B.L. (1966). Markov renewal programming by linear fractional programming. SIAM J. Appl.
Math. 14, 1418-1432.
Grinold, R. (1973). Elimination of suboptimal actions in Markov decision problems. Oper. Res.
21, 848-851.
Harrison, J.M. (1972). Discrete dynamic programming with unbounded rewards. Ann. Math.
Statist. 43, 636-644.
Hartley, R., L.C. Thomas and D.J. White (Eds.) (1980). Recent Developments in Markov Decision
Processes. Academic Press, New York.
Hartley, R. (1980). A simple proof of Whittle's bridging condition in dynamic programming. J.
Appl. Probab. 17, 1114-1116.
Hastings, N.A.J. (1968). Some notes on dynamic programming and replacement. Oper. Res.
Quart. 19, 453-464.
Hastings, N.A.J. (1969). Optimization of discounted Markov decision problems. Oper. Res.
Quart. 20, 499-500.
Hastings, N.A.J. (1976). A test for suboptimal actions in undiscounted Markov decision chains.
Management Sci. 23, 87-91.
Hastings, N.A.J. and J.A.E.E. van Nunen (1977). The action elimination algorithm for Markov
decision processes. In: H.C. Tijms and J. Wessels (eds.), Markov Decision Theory, Mathemati-
cal Centre Tract No. 93. Mathematical Centre, Amsterdam, 161-170.
Haviv, M. and M.L. Puterman (1990). Improved policy iteration methods for communicating
Markov decision processes. Annals of Operations Research, Special Issue on Markov Decision
Processes, to appear.
Hernández-Lerma, O. (1989). Adaptive Markov Control Processes. Springer-Verlag, New York.
Heyman, D.P. and M.J. Sobel (1984). Stochastic Models in Operations Research, Vol. II.
McGraw-Hill, New York.
Hille, E. and R.S. Phillips (1957). Functional Analysis and Semi-Groups, American Mathematical
Society Colloquium Publications, Vol. 31. AMS, Providence, RI.
Hinderer, K. (1970). Foundations of Non-Stationary Dynamic Programming with Discrete Time
Parameter. Springer-Verlag, New York.
Ch. 8. Markov Decision Processes 431
Hopp, W.J., J.C. Bean and R.L. Smith (1988). A new optimality criterion for non-homogeneous
Markov decision processes. Oper. Res. 35, 875-883.
Hordijk, A. (1974). Dynamic Programming and Markov Potential Theory. Mathematical Centre
Tract No. 51. Mathematical Centre, Amsterdam.
Hordijk, A. and K. Sladky (1977). Sensitive optimality criteria in countable state dynamic
programming. Math. Oper. Res. 2, 1-14.
Hordijk, A. and L.C.M. Kallenberg (1979). Linear programming and Markov decision chains.
Management Sci. 25, 352-362.
Hordijk, A. and L.C.M. Kallenberg (1980). On solving Markov decision problems by linear
programming. In: R. Hartley, L.C. Thomas and D.J. White (Eds.), Recent Developments in
Markov Decision Processes. Academic Press, New York, 127-143.
Hordijk, A. and M.L. Puterman (1987). On the convergence of policy iteration in undiscounted
finite state Markov decision processes; the unichain case. Math. Oper. Res. 12, 163-176.
Howard, R. (1960). Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA.
Howard, R.A. (1963). Semi-Markovian decision processes. Proc. Internat. Statist. Inst., Ottawa,
Canada.
Howard, R.A. and J.E. Matheson (1972). Risk sensitive Markov decision processes. Management
Sci. 8, 356-369.
Hubner, G. (1977). Improved procedures for eliminating suboptimal actions in Markov programming
by the use of contraction properties. Transactions of the Seventh Prague Conference on
Information Theory, Statistical Decision Functions, Random Processes, 257-263.
Hubner, G. (1988). A unified approach to adaptive control of average reward decision processes.
OR Spektrum 10, 161-166.
Jacquette, S.C. (1973). Markov decision processes with a new optimality condition: Discrete time.
Ann. Statist. 3, 496-505.
Jewell, W.S. (1963). Markov-renewal programming I: Formulation, finite return models; II:
Infinite return models, example. Oper. Res. 11, 938-971.
Kallenberg, L.C.M. (1983). Linear Programming and Finite Markov Control Problems, Mathe-
matical Centre Tract No. 148. Mathematical Centre, Amsterdam.
Kantorovich, L.V. (1952). Functional Analysis and Applied Mathematics, Translated by C.D.
Benster, NBS Report 1509, National Bureau of Standards, Los Angeles, CA.
Kemeny, J.G. and J.L. Snell (1960). Finite Markov Chains. Van Nostrand-Reinhold, New York.
Kreps, D.M. and E. Porteus (1977). On the optimality of structured policies in countable stage
decision processes, II: Positive and negative problems. SIAM J. Appl. Math. 32, 457-466.
Lamond, B.L. (1984). MDPLAB, an interactive computer program for Markov dynamic program-
ming. Working Paper 1068, Faculty of Commerce, University of British Columbia.
Lamond, B.L. (1986). Matrix methods in queueing and dynamic programming. Unpublished
Ph.D. Dissertation, Faculty of Commerce, University of British Columbia.
Lamond, B.L. and M.L. Puterman (1989). Generalized inverses in discrete time Markov decision
processes. SIAM J. Matrix Anal. Appl. 10, 118-134.
Lanery, E. (1967). Étude asymptotique des systèmes Markoviens à commande. Rev. Française
Inform. Rech. Opér. 1, 3-56.
Lippman, S.A. (1968). Criterion equivalence in discrete dynamic programming. Oper. Res. 17,
920-923.
Lippman, S.A. (1975). On Dynamic Programming with Unbounded Rewards. Management. Sci.
21, 1225-1233.
Liusternik, L. and V. Sobolev (1961). Elements of Functional Analysis. Ungar, New York.
MacQueen, J. (1966). A modified dynamic programming method for Markov decision problems. J.
Math. Anal. Appl. 14, 38-43.
Mandl, P. (1967). An iterative method for maximizing the characteristic root of positive matrices.
Rev. Roumaine Math. Pures Appl. 12, 1312-1317.
Mandl, P. (1974). Estimation and control in Markov chains. Adv. in Appl. Probab. 6, 40-60.
Mann, E. (1983). Optimality equations and bias optimality in bounded Markov decision processes.
Preprint No. 574, University of Bonn.
Manne, A. (1960). Linear programming and sequential decisions. Management Sci. 6, 259-267.
432 M.L. Puterman
Miller, B.L. and A.F. Veinott, Jr. (1969). Discrete dynamic programming with a small interest
rate. Ann. Math. Statist. 40, 366-370.
Mine, H. and S. Osaki (1968). Some remarks on a Markovian decision process with an absorbing
state. J. Math. Anal. Appl. 23, 327-333.
Monahan, G.E. (1982). A survey of partially observable Markov decision processes: Theory,
models, and algorithms. Management Sci. 28, 1-16.
Morton, T.E. (1971). On the asymptotic convergence rate of cost differences for Markovian
decision processes. Oper. Res. 19, 244-248.
Morton, T.E. and W.E. Wecker (1977). Discounting ergodicity and convergence for Markov
decision processes. Management Sci. 23, 890-900.
Morton, T. (1978). The non-stationary infinite horizon inventory problem. Management Sci. 24,
1474-1482.
Odoni, A.R. (1969). On finding the maximal gain for Markov decision processes. Oper. Res. 17,
857-860.
Ohno, K. (1985). Modified policy iteration algorithm with nonoptimality tests for undiscounted
Markov decision processes. Working Paper, Dept. of Information System and Management
Science, Konan University, Japan.
Ohno, K. and K. Ichiki (1987). Computing optimal policies for tandem queueing systems. Oper.
Res. 35, 121-126.
Ornstein, D. (1969). On the existence of stationary optimal strategies. Proc. Amer. Math. Soc. 20,
563-569.
Ortega, J.M. and W.C. Rheinboldt (1970). Iterative Solutions of Nonlinear equations in Several
Variables. Academic Press, New York.
Platzman, L. (1977). Improved conditions for convergence in undiscounted Markov renewal
programming. Oper. Res. 25, 529-533.
Pliska, S.R. (1976). Optimization of multitype branching processes. Management Sci. 23,
117-125.
Pliska, S.R. (1978). On the transient case for Markov decision processes with general state spaces.
In: M.L. Puterman (Ed.), Dynamic Programming and Its Application. Academic Press, New
York, 335-350.
Porteus, E. (1971). Some bounds for discounted sequential decision processes. Management Sci.
18, 7-11.
Porteus, E. and J. Totten (1978). Accelerated computation of the expected discounted return in a
Markov chain. Oper. Res. 26, 350-358.
Porteus, E. (1980a). Improved iterative computation of the expected discounted return in Markov
and semi-Markov chains. Z. Oper. Res. 24, 155-170.
Porteus, E. (1980b). Overview of iterative methods for discounted finite Markov and semi-Markov
decision chains. In: R. Hartley, L.C. Thomas and D.J. White (Eds.), Recent Developments in
Markov Decision Processes. Academic Press, New York, 1-20.
Porteus, E. (1981). Computing the discounted return in Markov and semi-Markov chains. Naval
Res. Logist. Quart. 28, 567-578.
Porteus, E. (1983). Survey of numerical methods for discounted finite Markov and semi-Markov
chains. Presented at Twelfth Conference on Stochastic Processes and Their Applications, Ithaca,
NY.
Puterman, M.L. (Ed.) (1978). Dynamic Programming and Its Applications. Academic Press, New
York.
Puterman, M.L. and S.L. Brumelle (1978). The analytic theory of policy iteration. In: M.L.
Puterman (ed.), Dynamic Programming and Its Application. Academic Press, New York.
Puterman, M.L. and M.C. Shin (1978). Modified policy iteration algorithms for discounted
Markov decision problems. Management Sci. 24, 1127-1137.
Puterman, M.L. and S.L. Brumelle (1979). On the convergence and policy iteration in stationary
dynamic programming. Math. Oper. Res. 4, 60-69.
Puterman M.L. and M.C. Shin (1982). Action elimination procedures for modified policy
iteration algorithms. Oper. Res. 30, 301-318.
Puterman, M.L. (1991). Markov Decision Processes. Wiley, New York.
Ch. 8. Markov Decision Processes 433
Ross, S. (1968a). Non-Discounted denumerable Markovian decision models. Ann. Math. Statist.
39, 412-423.
Ross, S.M. (1968b). Arbitrary state Markovian decision processes. Ann. Math. Statist. 39,
2118-2122.
Ross, S.M. (1983). Introduction to Stochastic Dynamic Programming. Academic Press, New York.
Rothblum, U.G. and A.F. Veinott, Jr. (1975). Cumulative average optimality for normalized
Markov decision chains. Working Paper, Dept. of Operations Research, Stanford University.
Rothblum, U.G. (1979). Iterated successive approximation for sequential decision processes. In:
J.W.B. van Overhagen and H.C. Tijms, (Eds.), Stochastic Control and Optimization. Vrije
Universiteit, Amsterdam, 30-32.
Rothblum, U.G. (1984). Multiplicative Markov decision chains. Math. Oper. Res. 9, 6-24.
Schal, M. (1975). Conditions for optimality in dynamic programming and for the limit of n-stage
optimal policies to be optimal. Z. Wahrsch. Verw. Gebiete 32, 179-196.
Schweitzer, P.J. (1965). Perturbation theory and Markov decision chains. Unpublished Ph.D.
Dissertation, Massachusetts Institute of Technology.
Schweitzer, P.J. (1971). Iterative solution of the functional equations of undiscounted Markov
renewal programming. J. Math. Anal. Appl. 34, 495-501.
Schweitzer, P.J. and A. Federgruen (1977). The asymptotic behavior of undiscounted value
iteration in Markov decision problems. Math. Oper. Res. 2, 360-381.
Schweitzer, P.J. and A. Federgruen (1978). The functional equations of undiscounted Markov
renewal programming. Math. Oper. Res. 3, 308-321.
Schweitzer, P.J. and A. Federgruen (1979). Geometric convergence of value iteration in multich-
ain Markov decision problems. Adv. in Appl. Probab. 11, 188-217.
Schweitzer, P.J. (1985). On undiscounted Markovian decision processes with compact action
spaces. RAIRO Rech. Opér. 19, 71-86.
Seneta, E. (1981). Non-negative Matrices and Markov Chains. Springer-Verlag, New York.
Shapiro, J. (1968). Turnpike planning horizons for a Markovian decision model. Management Sci.
14, 292-300.
Shapley, L.S. (1953). Stochastic games. Proc. Nat. Acad. Sci. U.S.A. 39, 1095-1100.
Sheu, S.S. and K.-J. Farn (1980). A sufficient condition for the existence of a stationary 1-optimal
plan in compact action Markovian decision processes. In: R. Hartley, L.C. Thomas and D.J.
White (Eds.), Recent Developments in Markov Decision Processes. Academic Press, New York,
111-126.
Sladky, K. (1974). On the set of optimal controls for Markov chains with rewards. Kybernetika 10,
350-367.
Smallwood, R. and E. Sondik (1973). The optimal control of partially observable Markov
processes over a finite horizon. Oper. Res. 21, 1071-1088.
Sobel, M.J. (1982). The variance of discounted Markov decision processes. J. Appl. Probab. 19,
794-802.
Sondik, E.J. (1971). The optimal control of partially observable Markov processes. Ph.D.
Dissertation, Department of Engineering-Economic Systems, Stanford University.
Sondik, E. (1978). The optimal control of partially observable Markov processes over the infinite
horizon: Discounted costs. Oper. Res. 26, 282-304.
Strauch, R. (1966). Negative dynamic programming. Ann. Math. Statist. 37, 871-890.
Taylor, H.M. (1965). Markovian sequential replacement processes. Ann. Math. Statist. 36,
1677-1694.
Tijms, H.C. and J. Wessels (eds.) (1977). Markov Decision Theory. Tract 93, Mathematical
Centre, Amsterdam.
van Dawen, R. (1986a). Finite state dynamic programming with the total reward criterion. Z.
Oper. Res. 30, A1-A14.
van Dawen, R. (1986b). Pointwise and uniformly good stationary strategies in dynamic programming
models. Math. Oper. Res. 11, 521-535.
van der Wal, J. and J.A.E.E. van Nunen (1977). A note on the convergence of the value oriented
successive approximations method. COSO Note R 77-05, Department of Mathematics, Eindhoven
University of Technology.
434 M.L. Puterman
van der Wal, J. (1984). Stochastic Dynamic Programming. Tract 139, Mathematical Centre,
Amsterdam.
van der Wal, J. (1984). On stationary strategies in countable state total reward Markov decision
processes. Math. Oper. Res. 9, 290-300.
van Hee, K. (1978). Markov strategies in dynamic programming. Math. Oper. Res. 3, 37-41.
van Nunen, J.A.E.E. (1976a). A set of successive approximation methods for discounted Marko-
vian decision problems. Z. Oper. Res. 20, 203-208.
van Nunen, J.A.E.E. (1976b). Contracting Markov Decision Processes. Tract 71, Mathematical
Centre, Amsterdam.
van Nunen, J.A.E.E. and J. Wessels (1978). A note on dynamic programming with unbounded
rewards. Management Sci. 24, 576-560.
Veinott, Jr., A.F. (1966). On finding optimal policies in discrete dynamic programming with no
discounting. Ann. Math. Statist. 37, 1284-1294.
Veinott, Jr., A.F. (1968). Extreme points of Leontief substitution systems. Linear Algebra AppI. 1,
181-194.
Veinott, Jr., A.F. (1969). On discrete dynamic programming with sensitive discount optimality
criteria. Ann. Math. Statist. 40, 1635-1660.
Veinott, Jr., A.F. (1974). Markov decision chains. In: G.B. Dantzig and B.C. Eaves (Eds.),
Studies in Optimization. American Mathematical Association, Providence, RI.
White, D.J. (1963). Dynamic programming, Markov chains, and the method of successive
approximations. J. Math. Anal. Appl. 6, 373-376.
White, D.J. (1978). Elimination of non-optimal actions in Markov decision processes. In: M.L.
Puterman (Ed.), Dynamic Programming and Its Applications. Academic Press, New York,
131-160.
White, D.J. (1985a). Monotone value iteration for discounted finite Markov decision processes. J.
Math. Anal. Appl. 109, 311-324.
White, D.J. (1985b). Real applications of Markov decision processes. Interfaces 15, 73-83.
White, D.J. (1988). Mean, variance, and probabilistic criteria in finite Markov decision processes:
A review. J. Optim. Theory Appl. 56, 1-29.
Whittle, P. (1979). A simple condition for regularity in negative programming. J. Appl. Probab.
16, 305-318.
Whittle, P. (1980a). Stability and characterisation condition in negative programming. J. Appl.
Probab. 17, 635-645.
Whittle, P. (1980b). Negative programming with unbounded costs: A simple condition for
regularity. In: R. Hartley, L.C. Thomas, D.J. White (Eds.), Recent Developments in Markov
Decision Processes. Academic Press, New York, 23-34.
Whittle, P. (1983). Optimization Over Time, Dynamic Programming and Stochastic Control, Vol.
II. J. Wiley and Sons, New York.
Wijngaard, J. (1977). Sensitive optimality in stationary Markov decision chains on a general state
space. In: H.C. Tijms and J. Wessels (Eds.), Markov Decision Theory, Mathematical Centre
Tracts No. 93. Mathematical Centre, Amsterdam, 85-94.
Yosida, K. (1968). Functional Analysis. Springer-Verlag, New York.