CONTROL OPTIMIZATION
WITH STOCHASTIC DYNAMIC
PROGRAMMING
1. Chapter Overview
This chapter focuses on a problem of control optimization, in
particular the Markov decision problem (or process). Our discussions
will be at a very elementary level, and we will not attempt to prove
any theorems. The central aim of this chapter is to introduce the
reader to classical dynamic programming in the context of solving
Markov decision problems. In the next chapter, the same ideas will be
presented in the context of simulation-based dynamic programming.
The main concepts presented in this chapter are (1) Markov chains,
(2) Markov decision problems, (3) semi-Markov decision problems,
and (4) classical dynamic programming methods.
2. Stochastic Processes
We begin with a discussion on stochastic processes. A stochastic
(or random) process, roughly speaking, is an entity that has a prop-
erty which changes randomly with time. We refer to this changing
property as the state of the stochastic process. A stochastic process
is usually associated with a stochastic system. Read Chap. 2 for a
definition of a stochastic system. The concept of a stochastic process
is best understood with an example.
Consider a queue of persons that forms in a bank. Let us assume
that there is a single server (teller) serving the queue. See Fig. 6.1.
The queuing system is an example of a stochastic system. We need
to investigate further the nature of this queuing system to identify
properties, associated with the queue, that change randomly with time.
Let us denote
The number of customers in the queue at time t by X(t) and
The number of busy servers at time t by Y (t).
Then, clearly, X(t) will change its value from time to time and so
will Y (t). By its definition, Y (t) will equal 1 when the teller is busy
serving customers, and will equal 0 when it is idle.
Figure 6.1. A single-server queue: customers waiting in the queue, the server, and the customer being served
Now if the state of the system is recorded after unit time, X(t)
could take on values such as: 3, 3, 4, 5, 4, 4, 3 . . . The set {X(t)|t =
1, 2, · · · , ∞}, then, defines a stochastic process. Mathematically, the
sequence of values that X(t) assumes in this example is a stochastic
process.
Similarly, {Y (t)|t = 1, 2, · · · , ∞} denotes another stochastic process
underlying the same queuing system. For example, Y (t) could take on
values such as 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, . . .
It should be clear now that more than one stochastic process may be
associated with any given stochastic system. The stochastic processes
X and Y differ in their definition of the system state. For X, the
state is the number of customers in the queue and for Y , the state is
the number of busy servers.
An analyst selects the stochastic process that is of interest to
him/her. For example, an analyst interested in studying the utilization of
the server (i.e., the proportion of time the server is busy) will choose Y,
while an analyst interested in studying the length of the queue will
choose X. See Fig. 6.2 for a pictorial explanation of the word "state."
In general, choosing the appropriate definition of the state of a
system is a part of “modeling.” The state must be defined in a manner
suitable for the optimization problem under consideration. To under-
stand this better, consider the following definition of state. Let Z(t)
denote the total number of persons in the queue with black hair. Now,
Figure 6.2. A queue in two different states: the "state" is defined by the number in the queue
Figure 6.3. Schematic of a two-state Markov chain, where circles denote states
later.) Hence, after unit time, the system either switches (moves) to
a new state or else the system returns to the current state. We will
refer to this phenomenon as a state transition.
To understand this phenomenon better, consider Fig. 6.3. The figure
shows two states, which are denoted by circles, numbered 1 and 2.
The arrows show the possible ways of transiting. This system has
two states: 1 and 2. Assuming that we first observe the system when
it is in state 1, it may for instance follow the trajectory given by:
1, 1, 2, 1, 1, 1, 2, 2, 1, 2, . . .
A state transition in a Markov process is usually a probabilistic, i.e.,
random, affair. Consider the Markov process in Fig. 6.3. Let us further
assume that in its first visit to state 1, from state 1 the system jumped
to state 2. In its next visit to state 1, the system may not jump to
state 2 again; it may jump back to state 1. This should clarify that
the transitions in a Markov chain are “random” affairs.
We now need to discuss our convention regarding the time needed
for one jump (transition). In a Markov process, how much time is spent
in one transition is really irrelevant to its analysis. As such, even if the
time is not always unity, or even if it is not a constant, we assume it
to be unity for our analysis. If the time spent in the transition becomes
an integral part of how the Markov chain is analyzed, then the Markov
process is not an appropriate model. In that case, the semi-Markov
process becomes more appropriate, as we will see below.
When we study real-life systems using Markov processes, it usu-
ally becomes necessary to define a performance metric for the real-life
system. It is in this context that one has to be careful with how the
unit time convention is interpreted. A common example of a perfor-
mance metric is: average reward per unit time. In the case of a Markov
process, the phrase “per unit time” in the definition of average reward
actually means “per jump” or “per transition.” (In the so-called semi-
Markov process that we will study later, the two phrases have different
meanings.)
Another important property of the Markov process needs to be
studied here. In a Markov process, the probability that the process
jumps from a state i to a state j does not depend on the states vis-
ited by the system before coming to i. This is called the memoryless
property. This property distinguishes a Markov process from other
stochastic processes, and as such it needs to be understood clearly.
Because of the memoryless property, one can associate a probability
with a transition from a state i to a state j, that is,
i −→ j.
Consider, for example, a Markov process with three states, numbered 1, 2, and 3, that follows the trajectory:
1, 3, 2, 1, 1, 1, 2, 1, 3, 1, 1, 2, . . .
Assume that: P (3, 1) = 0.2 and P (3, 2) = 0.8. When the system visits
3 for the first time in the above, it jumps to 2. Now, the probability of
jumping to 2 is 0.8, and that of jumping to 1 is 0.2. When the system
revisits 3, the probability of jumping to 2 will remain at 0.8, and that
of jumping to 1 at 0.2. Whenever the system comes to 3, its probability
of jumping to 2 will always be 0.8 and that of jumping to 1 will always be 0.2. In
other words, when the system comes to a state i, the state to which
it jumps depends only on the transition probabilities: P (i, 1), P (i, 2)
and P (i, 3). These probabilities are not affected by the sequence of
states visited before coming to i. Thus, when it comes to jumping to a
new state, the process does not “remember” what states it has had to
go through in the past. The state to which it jumps depends only on
the current state (say i) and on the probabilities of jumping from that
state to other states, i.e., P (i, 1), P (i, 2) and P (i, 3). In general, when
the system is ready to leave state i, the next state j depends only on
P (i, j). Furthermore, P (i, j) is completely independent of where the
system has been before coming to i.
We now give an example of a non-Markovian process. Assume that
a process has three states, numbered 1, 2, and 3. X(t), as before,
denotes the system state at time t. Assume that the law governing
this process is given by:
where f (i, j) is the probability that the next state is j given that the
current state is i. Also f (i, j) is a constant for given values of i and j.
Carefully note the difference between Eqs. (6.2) and (6.1). Where
the process resided one step before reaching its current state has no influence on
the transitions of a Markov process. It should be obvious that in the Markov process,
the transition probability (the probability of going from one state to another
in one step) depends on two quantities:
the present state (i) and the next state (j). In a non-Markovian pro-
cess, such as the one defined by Eq. (6.1), the transition probability
depends on the current state (i), the next state (j), and the previous
state (l). An implication is that even if both processes have the
same number of states, we will have to deal with many more probabil-
ities in the two-step (non-Markovian) stochastic process.
The quantity f (i, j) is an element of a two-dimensional matrix. Note
that f (i, j) is actually P (i, j), the one-step transition probability of
jumping from i to j, which we have defined earlier.
All the transition probabilities of a Markov process can be conve-
niently stored in a matrix. This matrix is called the one-step tran-
sition probability matrix or simply the transition probability ma-
trix, usually abbreviated as TPM. An example of a TPM with three
states is:
P = [0.7, 0.2, 0.1; 0.4, 0.2, 0.4; 0.6, 0.1, 0.3].    (6.3)
P (i, j) here denotes the (i, j)th element of the matrix, P, i.e., the
element in the ith row and the jth column of P. In other words, P (i, j)
denotes the one-step transition probability of jumping from state i to
state j. Thus, for example, P (3, 1), which is 0.6 above, denotes the
one-step transition probability of going from state 3 to state 1.
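As a quick illustration, the TPM of Eq. (6.3) can be stored and sampled with a few lines of Python. The sketch below is only illustrative: the function name, the random seed, and the 0-based state indexing (state 1 becomes index 0) are our own choices. It checks that every row of the TPM sums to one and then generates a random trajectory by repeatedly sampling the next state from the row of the current state, which mimics the memoryless jumps discussed above.

import numpy as np

# The TPM of Eq. (6.3), with states indexed 0, 1, 2 (state 1 -> index 0, etc.).
P = np.array([[0.7, 0.2, 0.1],
              [0.4, 0.2, 0.4],
              [0.6, 0.1, 0.3]])

assert np.allclose(P.sum(axis=1), 1.0)   # every row of a TPM must sum to one

rng = np.random.default_rng(0)

def simulate(P, start, n_jumps):
    """Generate a trajectory of the Markov chain defined by the TPM P."""
    states = [start]
    for _ in range(n_jumps):
        current = states[-1]
        # The next state depends only on the current state (memoryless property).
        states.append(rng.choice(len(P), p=P[current]))
    return states

print(simulate(P, start=0, n_jumps=10))   # a trajectory of 11 states
print(P[2, 0])                            # P(3, 1) = 0.6 under 0-based indexing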
We will also assume that a finite amount of time is taken in any
transition and that no time is actually spent in a state. This is one
convention (there are others), and we will stick to it in this book. Also,
note that by our convention, the time spent in a transition is unity (1).
In summary, a Markov process possesses three important properties:
(1) the jumpy property, (2) the memoryless property, and (3) the unit
time property (by our convention).
P = [0.7, 0.3; 0.4, 0.6].

Figure 6.4. Schematic of a two-state Markov chain, where circles denote states,
arrows depict possible transitions, and the numbers on the arrows denote the prob-
abilities of those transitions
Figures 6.5 and 6.6 show some more examples of Markov chains
with three and four states respectively. In this book, we will consider
Markov chains with a finite number of states.
Estimating the values of the elements of the TPM is often quite
difficult. This is because, in many real-life systems, the TPM is
very large, and evaluating any given element in the TPM requires
the setting up of complicated expressions, which may involve multiple
integrals. In subsequent chapters, this issue will be discussed in depth.
For the Markov process, the time taken in every transition is the same
(unity by our convention), and hence the limiting probability of a state
also denotes the long-run proportion of time spent in transitions to that
particular state.
We will now show how we can obtain the limiting probabilities from
the TPM without raising the TPM to large powers. The following
important result provides a very convenient way for obtaining the
limiting probabilities.
Theorem 6.2 Let Π(i) denote the limiting probability of state i, and
let S denote the set of states in the Markov chain. Then the limiting
probabilities for all the states in the Markov chain can be obtained
from the transition probabilities by solving the following set of linear
equations:
Σ_{i=1}^{|S|} Π(i) P(i, j) = Π(j)  for every j ∈ S,    (6.4)

and  Σ_{j=1}^{|S|} Π(j) = 1.    (6.5)
Equations (6.4) and (6.5) are often collectively called the invariance
equation, since they help us determine the invariant (limiting) proba-
bilities. Equation (6.4) is often expressed in the matrix form as ΠP = Π,
where Π = [Π(1), Π(2), . . . , Π(|S|)] denotes the row vector of limiting probabilities.
Transient and recurrent states are illustrated in Figs. 6.7 and 6.8. Another type of state is the absorbing state. Once the system
enters any absorbing state, it can never get out of that state and it
remains there.
An ergodic Markov chain is one in which all states are recurrent
and no absorbing states are present. Ergodic chains are also called
irreducible chains. All regular Markov chains are ergodic, but the
converse is not true. (Regular chains were defined in Sect. 3.1.1.) For
instance, a chain that is not regular may be ergodic. Consider the
Markov chain with the following TPM:
[0, 1; 1, 0].
This chain is not regular, but ergodic. It is ergodic because both states
are visited infinitely many times in an infinite viewing.
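A chain is regular if some power of its TPM has all strictly positive entries; a simple computational check of this property is sketched below (the cutoff on the number of powers examined is an arbitrary practical choice, and the function name is ours). Applied to the TPM just shown, the check fails, because its powers keep alternating; applied to the two-state TPM of Fig. 6.4, it succeeds.

import numpy as np

def is_regular(P, max_power=100):
    """Return True if some power of the TPM P (up to max_power) has all positive entries."""
    Q = np.eye(len(P))
    for _ in range(max_power):
        Q = Q @ P
        if np.all(Q > 0):
            return True
    return False

P_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])   # the ergodic-but-not-regular chain above
P_regular  = np.array([[0.7, 0.3],
                       [0.4, 0.6]])   # the chain of Fig. 6.4

print(is_regular(P_periodic))   # False: its powers alternate between two matrices
print(is_regular(P_regular))    # True: already the first power is strictly positive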
Figures 6.7 and 6.8. Schematics of Markov chains containing transient and recurrent states
called the embedded Markov chain. The main difference between the
semi-Markov process and the Markov process lies in the time taken in
transitions.
In general, when the distributions for the transition times are
arbitrary, the process goes by the name semi-Markov. If the time in
every transition is an exponentially distributed random variable, the
stochastic process is referred to as a continuous time Markov process.
Some authors refer to what we have called the continuous time
Markov process as the “Markov process,” and by a “Markov chain,”
they mean what we have referred to as the Markov process.
There is, however, a critical difference between the Markov chain
underlying a Markov process and that underlying a semi-Markov pro-
cess. In a semi-Markov process, the system jumps, but not necessarily
after unit time, and when it jumps, it jumps to a state that is different
than the current state. In other words, in a semi-Markov process, the
system cannot jump back to the current state. (However, in a semi-
Markov decision process, which we will discuss later, jumping back to
the current state is permitted.) In a Markov process, on the other
hand, the system can return to the current state after one jump.
If the time spent in the transitions is a deterministic quantity, the
semi-Markov process has a transition time matrix analogous to the
TPM, e.g.,
[—, 17.2; 1, —].
For an example of the most general model in which some or all of the
transition times are random variables from any given distributions,
consider the following transition time matrix:
[—, unif(5, 6); expo(5), —],
where unif (min, max) denotes a random number from the uniform
distribution with parameters, min and max, and expo(µ) denotes the
same from the exponential distribution with parameter µ.
When we analyze a semi-Markov process, we begin by analyzing
the Markov chain embedded in it. The next step usually is to analyze
the time spent in each transition. As we will see later, the semi-
Markov process is more powerful than the Markov process in modeling
real-life systems, although very often its analysis can prove to be more
complicated.
Now consider a policy µ̂ = (2, 1). The TPM associated with this
policy will contain the transition probabilities of action 2 in state 1
and the transition probabilities of action 1 in state 2. The TPM of
policy µ̂ is thus
Pµ̂ = [0.1, 0.9; 0.4, 0.6].
Figure 6.10. Schematic showing how the TPM of policy (2, 1) is constructed from
the TPMs of actions 1 and 2
R1 = [11, −4; −14, 6]  and  R2 = [45, 80; 1, −23].
Now consider a policy µ̂ = (2, 1). Like in the TPM case, the TRM
associated with this policy will contain the immediate reward of action
2 in state 1 and the immediate rewards of action 1 in state 2. Thus
the TRM of policy µ̂ can be written as
Rµ̂ = [45, 80; −14, 6].
The TPM and the TRM of a policy together contain all the informa-
tion one needs to evaluate the policy in an MDP. In terms of notation,
we will denote the immediate reward, earned in going from state i to
state j, under the influence of action a, by:
r(i, a, j).
Under a policy µ̂, the immediate reward earned in going from state i to
state j will be denoted by r(i, µ(i), j), because µ(i) is the action that
will be selected in state i when policy µ̂ is used.
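The construction of the TPM and the TRM of a policy is easy to automate. In the sketch below, the rows actually selected by policy (2, 1) are taken from the matrices shown above, while the remaining rows of the per-action TPMs, which do not appear in the text, are illustrative placeholders (assumptions); the helper function simply copies, for each state i, the ith row of the matrices of action µ(i).

import numpy as np

# Action-wise TPMs and TRMs. The TRMs are from the text; for the TPMs, only the rows
# used by policy (2, 1) are from the text, the other rows being placeholders.
P = {1: np.array([[0.7, 0.3],     # row 1 of action 1: placeholder (assumption)
                  [0.4, 0.6]]),   # row 2 of action 1: matches row 2 of P_mu-hat above
     2: np.array([[0.1, 0.9],     # row 1 of action 2: matches row 1 of P_mu-hat above
                  [0.8, 0.2]])}   # row 2 of action 2: placeholder (assumption)
R = {1: np.array([[ 11,  -4],
                  [-14,   6]]),
     2: np.array([[ 45,  80],
                  [  1, -23]])}

def policy_matrices(policy, P, R):
    """Assemble the TPM and TRM of a policy: row i comes from the matrices of action policy[i]."""
    n = len(policy)
    P_mu = np.array([P[policy[i]][i] for i in range(n)])
    R_mu = np.array([R[policy[i]][i] for i in range(n)])
    return P_mu, R_mu

P_mu, R_mu = policy_matrices((2, 1), P, R)   # policy mu-hat = (2, 1)
print(P_mu)   # [[0.1 0.9], [0.4 0.6]], matching the TPM shown above
print(R_mu)   # [[45 80], [-14 6]], matching the TRM shown above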
Performance metric. To compare policies, one must define a perfor-
mance metric (objective function). Naturally, the performance metric
should involve reward and cost elements. To give a simple analogy, in
a linear programming problem, one judges each solution on the basis
of the value of the associated objective function. Any optimization
problem has a performance metric, which is also called the objective
function. In this book, for the most part, the MDP will be studied
with respect to two performance metrics. They are:
1. Expected reward per unit time calculated over an infinitely long
trajectory of system states: We will refer to this metric as the
average reward.
2. Expected total discounted reward calculated over an infinitely long
trajectory of system states: We will refer to this metric as the
discounted reward.
Of the two performance metrics, average reward
is easier to understand, although the average reward MDP is more
difficult to analyze in terms of its convergence properties. Hence, we will begin
our discussion with the average reward performance criterion. Dis-
counted reward will be defined later.
We first need to define the expected immediate reward of a state
under the influence of a given action. Consider the following scenario.
An action a is selected in state i. Under the influence of this action,
the system can jump to three states, 1, 2, and 3, with probabilities
0.2, 0.3, and 0.5 and immediate rewards 10, 12, and −14, respectively
(in the accompanying schematic, each arrow is labeled (x, y), where
x = transition probability and y = transition reward). The expected
immediate reward of state i under action a is then

r̄(i, a) = Σ_j p(i, a, j) r(i, a, j) = 0.2(10) + 0.3(12) + 0.5(−14) = −1.4.
earned in each visit to state i under policy µ̂ is r̄(i, µ(i)). Then the
total long-run expected reward earned in k transitions for this MDP
can be written as k Σ_{i∈S} Πµ̂(i) r̄(i, µ(i)), where Πµ̂(i) denotes the
limiting probability of state i when the system (and hence the underlying
Markov chain) is run with the policy µ̂.
Assumption 6.1 The state space S and the action spaces A(i) for
every i ∈ S are finite (although possibly quite large).
R1 = [6, −5; 7, 12]  and  R2 = [10, 17; −14, 13].
µ̂1 = (1, 1), µ̂2 = (1, 2), µ̂3 = (2, 1), and µ̂4 = (2, 2).
The TPMs and TRMs of these policies are constructed from the
individual TPMs and TRMs of each action. The TPMs are:
Rµ̂3 = [10, 17; 7, 12];   Rµ̂4 = [10, 17; −14, 13].
Schematic of the two-state MDP used in Example A: each arrow is labeled (a, p, r), where a = action, p = transition probability, and r = immediate reward
From the TPMs, using Eqs. (6.4) and (6.5), one can find the limiting
probabilities of the states associated with each policy. They are:
r̄(1, µ1 (1)) = p(1, µ1 (1), 1)r(1, µ1 (1), 1) + p(1, µ1 (1), 2)r(1, µ1 (1), 2)
r̄(2, µ1 (2)) = p(2, µ1 (2), 1)r(2, µ1 (2), 1) + p(2, µ1 (2), 2)r(2, µ1 (2), 2)
r̄(1, µ2 (1)) = p(1, µ2 (1), 1)r(1, µ2 (1), 1) + p(1, µ2 (1), 2)r(1, µ2 (1), 2)
r̄(2, µ2 (2)) = p(2, µ2 (2), 1)r(2, µ2 (2), 1) + p(2, µ2 (2), 2)r(2, µ2 (2), 2)
r̄(1, µ3 (1)) = p(1, µ3 (1), 1)r(1, µ3 (1), 1) + p(1, µ3 (1), 2)r(1, µ3 (1), 2)
r̄(2, µ3 (2)) = p(2, µ3 (2), 1)r(2, µ3 (2), 1) + p(2, µ3 (2), 2)r(2, µ3 (2), 2)
r̄(1, µ4 (1)) = p(1, µ4 (1), 1)r(1, µ4 (1), 1) + p(1, µ4 (1), 2)r(1, µ4 (1), 2)
r̄(2, µ4 (2)) = p(2, µ4 (2), 1)r(2, µ4 (2), 1) + p(2, µ4 (2), 2)r(2, µ4 (2), 2)
Thus:
ρµ̂1 = Πµ̂1 (1)r̄(1, µ1 (1)) + Πµ̂1 (2)r̄(2, µ1 (2))
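Putting the pieces together, exhaustive evaluation of the four policies of Example A can be carried out in a few lines. The sketch below reads the transition probabilities and rewards off the schematic above, computes r̄(i, µ(i)) and the limiting probabilities for each policy, and prints the resulting ρµ̂ values; the policy with the largest value is the optimal one.

import numpy as np

# Data for Example A, as recovered from the schematic and reward matrices above.
P = {1: np.array([[0.7, 0.3], [0.4, 0.6]]),
     2: np.array([[0.9, 0.1], [0.2, 0.8]])}
R = {1: np.array([[6.0, -5.0], [7.0, 12.0]]),
     2: np.array([[10.0, 17.0], [-14.0, 13.0]])}

def limiting_probabilities(P_mu):
    n = len(P_mu)
    A = P_mu.T - np.eye(n)
    A[-1, :] = 1.0
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

def average_reward(policy):
    """Evaluate rho for one policy via its limiting probabilities (exhaustive enumeration)."""
    P_mu = np.array([P[policy[i]][i] for i in range(2)])
    R_mu = np.array([R[policy[i]][i] for i in range(2)])
    r_bar = (P_mu * R_mu).sum(axis=1)   # r_bar(i, mu(i)) = sum_j p(i, mu(i), j) r(i, mu(i), j)
    Pi = limiting_probabilities(P_mu)
    return Pi @ r_bar                    # rho = sum_i Pi(i) r_bar(i, mu(i))

for policy in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    print(policy, round(average_reward(policy), 4))   # the largest value identifies the optimal policy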
If possible, one should set µk+1 (i) = µk (i) for each i. The signifi-
cance of ∈ in the above needs to be understood clearly. There may
be more than one action that satisfies the argmax operator. Thus
there may be multiple candidates for µk+1 (i). However, the latter
is selected in a way such that µk+1 (i) = µk (i) if possible.
Step 4. If the new policy is identical to the old one, that is, if
µk+1 (i) = µk (i) for each i, then stop and set µ∗ (i) = µk (i) for
every i. Otherwise, increment k by 1, and go back to the second
step.
Table 6.1. Calculations in policy iteration for average reward MDPs on Example A
mechanism here employs the span seminorm, also called the span.
We will denote the span seminorm of a vector in this book by sp(·)
and define it as: sp(x) = max_{i∈S} x(i) − min_{i∈S} x(i).
Step 3: If

sp(J^{k+1} − J^k) < ε,

go to Step 4. Otherwise increase k by 1, and go back to Step 2.

Step 4: For each i ∈ S, choose

d(i) ∈ arg max_{a∈A(i)} [ r̄(i, a) + Σ_{j=1}^{|S|} p(i, a, j) J^k(j) ],

and stop. The ε-optimal policy is d̂.
The implication of ε-optimality (in Step 4 above) needs to be
understood. The smaller the value of ε, the closer we get to the
optimal policy. Usually, for small values of ε, one obtains policies
very close to optimal. The span of the difference vector (J^{k+1} − J^k)
keeps getting smaller in every iteration, and hence for a
given positive value of ε, the algorithm terminates in a finite number
of iterations.
After calculations in this step for all states are complete, set ρ =
J^{k+1}(i*).
Table 6.2. Calculations in value iteration for average reward MDPs: Note that the
values grow unbounded but the span of the difference vector gets smaller with every
iteration. We start with J^1(1) = J^1(2) = 0
Step 4: If

sp(J^{k+1} − J^k) < ε,

go to Step 5. Otherwise increase k by 1 and go back to Step 2.

Step 5: For each i ∈ S, choose

d(i) ∈ arg max_{a∈A(i)} [ r̄(i, a) + Σ_{j=1}^{|S|} p(i, a, j) J^k(j) ],

and stop. The ε-optimal policy is d̂.
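A compact sketch of relative value iteration on the average reward MDP of Example A is given below. It uses one standard form of the update, in which the value computed at a distinguished state i* is subtracted from the vector after every sweep; the subtracted quantity serves as a running estimate of ρ*, and the stopping rule is the span test described above. The tolerance, the choice of i*, and the function names are illustrative.

import numpy as np

# Example A data; states are indexed 0 and 1, actions are labeled 1 and 2.
P = {1: np.array([[0.7, 0.3], [0.4, 0.6]]),
     2: np.array([[0.9, 0.1], [0.2, 0.8]])}
R = {1: np.array([[6.0, -5.0], [7.0, 12.0]]),
     2: np.array([[10.0, 17.0], [-14.0, 13.0]])}
r_bar = {a: (P[a] * R[a]).sum(axis=1) for a in P}   # expected immediate rewards r_bar(i, a)

def span(x):
    return x.max() - x.min()          # the span seminorm sp(x)

def relative_value_iteration(eps=0.001, i_star=0, max_iter=1000):
    """One standard form of relative value iteration for average reward MDPs."""
    J = np.zeros(2)
    for k in range(max_iter):
        Q = np.array([[r_bar[a][i] + P[a][i] @ J for a in (1, 2)] for i in (0, 1)])
        J_new = Q.max(axis=1)
        rho_est = J_new[i_star]       # the subtracted offset converges to the optimal average reward
        J_new = J_new - rho_est       # keep values bounded by subtracting the value at i*
        if span(J_new - J) < eps:
            policy = tuple(1 + int(np.argmax(Q[i])) for i in (0, 1))
            return policy, rho_est, k + 1
        J = J_new
    raise RuntimeError("did not converge")

print(relative_value_iteration())     # expected: policy (2, 1) with a rho estimate near 10.56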
k denotes the number of transitions (or time assuming that each tran-
sition takes unit time) over which the system is observed, xs denotes
the state from where the sth jump or state transition occurs under the
policy µ̂, and E denotes the expectation operator over all trajectories
that start under the condition specified within the square brackets.
If you have trouble understanding why we use lim inf here, you may
replace it by lim at this stage.
It can be shown that for policies with regular Markov chains, the
average reward is independent of the starting state i, and hence, ρ(i)
can be replaced by ρ. Intuitively, the above expression says that the
average reward for a given policy is

(the expected sum of rewards earned in a very long trajectory) / (the number of transitions in the same trajectory).
In the above, we assume that the associated policy is pursued within
the trajectory. We now discuss the other important performance
metric typically studied with an MDP: the discounted reward.
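Before doing so, it is worth noting that the definition above suggests a direct, simulation-based estimate of the average reward of a fixed policy: simulate a long trajectory under the policy, add up the rewards, and divide by the number of transitions. A minimal sketch for policy (2, 1) of Example A follows; the trajectory length and the random seed are arbitrary choices, and this is, in essence, the route taken by the simulation-based methods of the next chapter.

import numpy as np

# Estimate the average reward of the fixed policy (2, 1) of Example A by simulation.
P_mu = np.array([[0.9, 0.1], [0.4, 0.6]])       # TPM of policy (2, 1)
R_mu = np.array([[10.0, 17.0], [7.0, 12.0]])    # TRM of policy (2, 1)

rng = np.random.default_rng(1)

def estimate_rho(P_mu, R_mu, n_jumps=100_000, start=0):
    total = 0.0
    i = start
    for _ in range(n_jumps):
        j = rng.choice(len(P_mu), p=P_mu[i])    # sample the next state
        total += R_mu[i, j]                      # accumulate the immediate reward
        i = j
    return total / n_jumps                       # reward per transition (= per unit time here)

# Should be close to 10.56 (= 0.8*10.7 + 0.2*10), the value from the limiting-probability formula.
print(estimate_rho(P_mu, R_mu))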
Table 6.3. Calculations in relative value iteration for average reward MDPs: ε =
0.001; ε-optimal policy found at k = 12; J^1(1) = J^1(2) = 0
The idea of discounting is related to the fact that the value of money
reduces with time. To give a simple example: a dollar tomorrow is
worth less than a dollar today. The discounting factor is the fraction
by which money gets devalued in unit time. So for instance, if I earn $3
today, $5 tomorrow, $6 the day after tomorrow, and if the discounting
factor is 0.9 per day, then the present worth of my earnings will be:
3 + (0.9)(5) + (0.9)^2(6).

The reason for raising 0.9 to the power of 2 is that tomorrow, the
present worth of the day-after-tomorrow's earning will be 0.9(6). Hence
today, the present worth of this amount will be 0.9[0.9(6)] = (0.9)^2(6).
In general, if the discounting factor is λ, and if e(t) denotes the
earning in the tth period of time, then the present worth of earnings
over n periods of time can be denoted by Σ_{t=1}^{n} λ^{t−1} e(t).
In Eq. (6.19),

E[ Σ_{s=1}^{k} λ^{s−1} r(xs, µ(xs), xs+1) | x1 = i ] =
The above means that the optimal policy will have a value function
vector that satisfies the following property: each element of the vector
is greater than or equal to the corresponding element of the value
function vector of any other policy. This concept is best explained
with an example.
Consider a 2-state Markov chain with 4 allowable policies denoted
by µ̂1 , µ̂2 , µ̂3 , and µ̂4 . Let the value function vector be defined by
vµ̂1 (1) = 3; vµ̂2 (1) = 8; vµ̂3 (1) = −4; vµ̂4 (1) = 12;
vµ̂1 (2) = 7; vµ̂2 (2) = 15; vµ̂3 (2) = 1; vµ̂4 (2) = 42;
Now, from our definition of an optimal policy, policy µ̂4 should be the
optimal policy since the value function vector assumes the maximum
value for this policy for each state. Now, the following question should
arise in your mind at this stage. What if there is no policy for which
the value function is maximized for each state? For instance, consider
the following scenario:
vµ̂1 (1) = 3; vµ̂2 (1) = 8; vµ̂3 (1) = −4; vµ̂4 (1) = 12;
vµ̂1 (2) = 7; vµ̂2 (2) = 15; vµ̂3 (2) = 1; vµ̂4 (2) = −5;
In the above setting, there is no one policy for which the value function
is maximized for all the states. Fortunately, it has been proved that
under the assumptions we have made above, there exists an optimal
policy; in other words, there exists a policy for which the value function
is maximized in each state. The interested reader is referred to [30,
270], among other sources, for the proof of this.
The important point that we need to address next is: how does
one find the value function of any given policy? Equation (6.19) does
not provide us with any direct mechanism for this purpose. Like in
the average reward case, we will need to turn to the Bellman policy
equation.
By solving the Bellman equation, one can obtain the value func-
tion vector associated with a given policy. Clearly, the value function
vectors associated with each policy can be evaluated by solving the
respective Bellman equations. Then, from the value function vectors
obtained, it is possible to determine the optimal policy. This method
is called the method of exhaustive enumeration.
Like in the average reward case, the method of exhaustive enu-
meration is not a very efficient method to solve the MDP, since its
computational burden is enormous. For a problem of 10 states with
two allowable actions in each, one would need to evaluate 2^10 policies.
The method of policy iteration is considerably more efficient.
Step 1. Set k = 1. Here k will denote the iteration number. Let the
number of states be |S|. Select any policy in an arbitrary manner.
Let us denote the policy selected in the kth iteration by µ̂k . Let µ̂∗
denote the optimal policy.
Step 2. (Policy Evaluation) Solve the following linear system of equations. For i = 1, 2, . . . , |S|,

h^k(i) = r̄(i, µ_k(i)) + λ Σ_{j=1}^{|S|} p(i, µ_k(i), j) h^k(j).    (6.21)
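A sketch of the complete policy iteration scheme for discounted reward is shown below on the Example A data, assuming a discounting factor of λ = 0.8 for illustration. The policy evaluation step solves the linear system of Eq. (6.21) directly, and the improvement step selects, in each state, an action maximizing the right-hand side of the Bellman optimality equation; ties are broken here by simply choosing the lowest-numbered action, which is a simplification of the convention of retaining the previous action where possible.

import numpy as np

# Example A data; states indexed 0 and 1, actions labeled 1 and 2; lambda assumed 0.8.
P = {1: np.array([[0.7, 0.3], [0.4, 0.6]]),
     2: np.array([[0.9, 0.1], [0.2, 0.8]])}
R = {1: np.array([[6.0, -5.0], [7.0, 12.0]]),
     2: np.array([[10.0, 17.0], [-14.0, 13.0]])}
r_bar = {a: (P[a] * R[a]).sum(axis=1) for a in P}
lam = 0.8
n = 2

def evaluate(policy):
    """Solve the linear system of Eq. (6.21) for the value function of a policy."""
    P_mu = np.array([P[policy[i]][i] for i in range(n)])
    r_mu = np.array([r_bar[policy[i]][i] for i in range(n)])
    return np.linalg.solve(np.eye(n) - lam * P_mu, r_mu)   # (I - lambda P_mu) h = r_mu

def policy_iteration(start=(1, 1)):
    policy = start
    while True:
        h = evaluate(policy)
        # Policy improvement: in each state, pick an action maximizing the right-hand side.
        new_policy = tuple(
            1 + int(np.argmax([r_bar[a][i] + lam * P[a][i] @ h for a in (1, 2)]))
            for i in range(n))
        if new_policy == policy:
            return policy, h
        policy = new_policy

print(policy_iteration())   # the optimal policy and its value function under lambda = 0.8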
Like in the average reward case, we will next discuss the value
iteration method. The value iteration method is also called the method
of successive approximations (in the discounted reward case). This is
because the successive application of the Bellman operator in the dis-
counted case does lead one to the optimal value function. Recall that
in the average reward case, the value iteration operator may not keep
the iterates bounded. Fortunately this is not the case in the discounted
problem.
J*(i) = max_{a∈A(i)} [ r̄(i, a) + λ Σ_{j=1}^{|S|} p(i, a, j) J*(j) ]  for each i ∈ S.    (6.22)
The notation is similar to that defined for the average reward case.
Equation (6.22), i.e., the Bellman optimality equation for discounted
reward contains the max operator; hence, it cannot be solved using
linear algebra techniques, e.g., Gaussian elimination. However, the
value iteration method forms a convenient solution method. In value
iteration, one starts with some arbitrary values for the value function
vector. Then a transformation, derived from the Bellman optimality
equation, is applied on the vector successively until the vector starts
approaching a fixed value. The fixed value is also called a fixed point.
We will discuss issues such as convergence to fixed points in Chap. 11
in a more mathematically rigorous framework. However, at this stage,
it is important to get an intuitive feel for a fixed point.
If a transformation has a unique fixed point, then no matter what
vector you start with, if you keep applying the transformation repeat-
edly, you will eventually reach the fixed point. Several operations
research algorithms are based on such transformations.
We will now present step-by-step details of the value iteration algo-
rithm. In Step 3, we will need to calculate the max norm of a vector.
See Chap. 1 for a definition of max norm. We will use the notation ||.||
to denote the max norm.
Steps in value iteration for MDPs.
Step 1: Set k = 1. Select arbitrary values for the elements of a vector
of size |S|, and call the vector J^1. Specify ε > 0.

Step 2: For each i ∈ S, compute:

J^{k+1}(i) ← max_{a∈A(i)} [ r̄(i, a) + λ Σ_{j=1}^{|S|} p(i, a, j) J^k(j) ].

Step 3: If

||J^{k+1} − J^k|| < ε(1 − λ)/(2λ),

go to Step 4. Otherwise increase k by 1 and go back to Step 2.

Step 4: For each i ∈ S, choose

d(i) ∈ arg max_{a∈A(i)} [ r̄(i, a) + λ Σ_{j=1}^{|S|} p(i, a, j) J^k(j) ],

and stop.
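The steps above translate almost line by line into code. The sketch below runs value iteration on the Example A data, assuming λ = 0.8 and ε = 0.001 for illustration, and uses the max-norm stopping rule of Step 3; the variable names are our own.

import numpy as np

# Value iteration for discounted reward (Steps 1-4 above) on the Example A data.
P = {1: np.array([[0.7, 0.3], [0.4, 0.6]]),
     2: np.array([[0.9, 0.1], [0.2, 0.8]])}
R = {1: np.array([[6.0, -5.0], [7.0, 12.0]]),
     2: np.array([[10.0, 17.0], [-14.0, 13.0]])}
r_bar = {a: (P[a] * R[a]).sum(axis=1) for a in P}
lam, eps = 0.8, 0.001

J = np.zeros(2)                        # Step 1: arbitrary starting vector
k = 1
while True:
    Q = np.array([[r_bar[a][i] + lam * P[a][i] @ J for a in (1, 2)] for i in (0, 1)])
    J_new = Q.max(axis=1)              # Step 2: apply the Bellman optimality transformation
    if np.max(np.abs(J_new - J)) < eps * (1 - lam) / (2 * lam):   # Step 3: max-norm test
        break
    J, k = J_new, k + 1

d = tuple(1 + int(np.argmax(Q[i])) for i in (0, 1))   # Step 4: an epsilon-optimal policy
print(k, d, J_new)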
Table 6.5. Calculations in value iteration for discounted reward MDPs: The value
of ε is 0.001. The norm is checked with 0.5ε(1 − λ)/λ = 0.000125. When k = 53,
the ε-optimal policy is found; we start with J(1) = J(2) = 0

Table 6.6. Gauss-Seidel value iteration for discounted reward MDPs: Here ε =
0.001; the norm is checked with 0.5ε(1 − λ)/λ = 0.000125; the ε-optimal policy is found
at k = 33; we start with J^1(1) = J^1(2) = 0
Schematic: a transition from state i to state j under action d(i), earning the immediate reward r(i, d(i), j); v(i) and v(j) denote the values of the two states
Let us next discuss what happens when the state transitions are
probabilistic and there is a discounting factor λ. When the system
is in a state i, it may jump to any one of the states in the system.
Consider Fig. 6.14.
v_d̂(i) = Σ_{j=1}^{|S|} p(i, d(i), j) [ r(i, d(i), j) + λ v_d̂(j) ],

which turns out to be the Bellman equation for the policy d̂ for state i.
We hope that this discussion has served as an intuitive basis for the
Bellman policy equation.
The Bellman optimality equation has a similar intuitive explanation:
In each transition, to obtain the optimal value function at the current
state, i, one seeks to add the maximum over the sum of immediate
reward to the next state j and the “best” (optimal) value function from
state j. Of course, like in the policy equation, we must compute an
expectation over all values of j. We now discuss semi-Markov decision
problems.
Assume that the SMDP has two states numbered 1 and 2. Also,
assume that the time spent in a transition from state 1 is uniformly
distributed with a minimum value of 1 and a maximum of 2 (Unif(1,2)),
while the same from state 2 is exponentially distributed with a mean
of 3 (EXPO(3)); these times are the same for every action. Then, for
generating the TTMs, we need to use the following values. For all
values of a,
t̄(1, a, 1) = 1.5; t̄(1, a, 2) = 1.5; t̄(2, a, 1) = 3; t̄(2, a, 2) = 3.
Obviously, the time could follow any distribution. If the distributions
are not available, we must have access to the expected values of the
transition time, so that we have values for each t̄(i, a, j) term in the
model. These values are needed for solving the problem via dynamic
programming.
It could also be that the time spent depends on the action. Thus
for instance, we could also represent the distributions within the TTM
matrix. We will use Ta to denote the TTM for action a. For a 2-state,
2-action problem, consider the following data:
does not return to itself after one transition. This implies that the
natural process remains in a state for a certain amount of time and
then jumps to a different state.
The decision process has a different nature. It records only those
states in which an action needs to be selected by the decision-maker.
Thus, the decision process may come back to itself after one transition.
A decision-making state is one in which the decision-maker makes a
decision. All states in a Markov chain may not be decision-making
states; there may be several states in which no decision is made. Thus
typically a subset of the states in the Markov chain tends to be the set
of decision-making states. Clearly, as the name suggests, the decision-
making process records only the decision-making states.
For example, consider a Markov chain with four states numbered
1, 2, 3, and 4. States 1 and 2 are decision-making states, while 3 and
4 are not. Now consider the following trajectory:
1, 3, 4, 3, 2, 3, 2, 4, 2, 3, 4, 3, 4, 3, 4, 3, 4, 1.
In this trajectory, the NP (natural process) will look identical to what we see above.
The DMP (decision-making process), however, will be:
1, 2, 2, 2, 1.
This example also explains why the NP may change several times
between one change of the DMP. It should also be clear that the DMP
and NP coincide on the decision-making states (1 and 2).
We need to calculate the value functions of only the decision-making
states. In our discussions on MDPs, when we said “state,” we meant
a decision-making state. Technically, for the MDP, the non-decision-
making states enter the analysis only when we calculate the imme-
diate rewards earned in a transition from a decision-making state to
another decision-making state. In the SMDP, calculation of the tran-
sition rewards and the transition times requires taking into account the
non-decision-making states visited. This is because in the transition
from one decision-making state to another, the system may have vis-
ited non-decision-making states multiple times, which can dictate (1)
the value of the immediate reward earned in the transition and (2) the
transition time.
In simulation-based DP (reinforcement learning), the issue of iden-
tifying non-decision-making states becomes less critical because the
simulator calculates the transition reward and transition time in tran-
sitions between decision-making states; as such we need not worry
about the existence of non-decision-making states. However, if one
wishes to set up the model, i.e., the TRM and the TPM, careful
attention must be paid to this issue.
ρµ̂(i) ≡ lim inf_{k→∞}  E[ Σ_{s=1}^{k} r(xs, µ(xs), xs+1) | x1 = i ] / E[ Σ_{s=1}^{k} t̄(xs, µ(xs), xs+1) | x1 = i ],
where xs is the state from where the sth jump (or state transition)
occurs. The expectation is over the different trajectories that may be
followed under the conditions within the square brackets.
The notation inf denotes the infimum (and sup denotes the supre-
mum). An intuitive meaning of inf is minimum and that of sup is
maximum. Technically, the infimum (supremum) is not equivalent to
the minimum (maximum); however at this stage you can use the two
interchangeably. The use of the infimum here implies that the average
reward of a policy is the ratio of the minimum value of the total reward
divided by the total time in the trajectory. Thus, it provides us with
the lowest possible value for the average reward.
It can be shown that the average reward is not affected by the state
from which the trajectory of the system starts. Therefore, one can get
rid of i in the definition of average reward. The average reward on
the other hand depends on the policy used. Solving the SMDP means
finding the policy that returns the highest average reward.
where t̄(i, a, j) is the expected time spent in one transition from state
i to state j under the influence of action a. Now, the average reward
of an SMDP can also be defined as:
ρµ̂ = [ Σ_{i=1}^{|S|} Πµ̂(i) r̄(i, µ(i)) ] / [ Σ_{i=1}^{|S|} Πµ̂(i) t̄(i, µ(i)) ],    (6.23)
where
r̄(i, µ(i)) and t̄(i, µ(i)) denote the expected immediate reward
earned and the expected time spent, respectively, in a transition
from state i under policy µ̂ and
Πµ̂ (i) denotes the limiting probability of the underlying Markov
chain for state i when policy µ̂ is followed.
The numerator in the above denotes the expected immediate reward
in any given transition, while the denominator denotes the expected
time spent in any transition. The above formulation (see e.g., [30])
is based on the renewal reward theorem (see Johns and Miller [155]),
which essentially states that
ρ = average reward per unit time = (expected reward earned in a cycle) / (expected time spent in a cycle).    (6.24)
T1 = [1, 5; 120, 60];   T2 = [50, 75; 7, 2].

P1 = [0.7, 0.3; 0.4, 0.6];   P2 = [0.9, 0.1; 0.2, 0.8].

R1 = [6, −5; 7, 12];   R2 = [10, 17; −14, 13].

There are 4 possible policies that can be used to control the system:
µ̂1 = (1, 1), µ̂2 = (1, 2), µ̂3 = (2, 1), and µ̂4 = (2, 2).

Tµ̂1 = [1, 5; 120, 60];   Tµ̂2 = [1, 5; 7, 2];   Tµ̂3 = [50, 75; 120, 60];   Tµ̂4 = [50, 75; 7, 2].
The TPMs and TRMs were calculated in Sect. 3.3.2. The value of
each t̄(i, µ(i)) term can be calculated from the TTMs in a manner
similar to that used for calculation of r̄(i, µ(i)). The values are:
t̄(1, µ1 (1)) = p(1, µ1 (1), 1)t̄(1, µ1 (1), 1) + p(1, µ1 (1), 2)t̄(1, µ1 (1), 2)
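The entire evaluation in Eq. (6.23) can be reproduced with a short script. The sketch below computes, for each of the four policies of Example B, the expected immediate rewards r̄(i, µ(i)), the expected transition times t̄(i, µ(i)), and the limiting probabilities of the underlying Markov chain, and then prints the resulting average reward per unit time; the function names are illustrative.

import numpy as np

# Example B data: TPMs and TRMs as in Example A, plus the transition-time matrices above.
P = {1: np.array([[0.7, 0.3], [0.4, 0.6]]),
     2: np.array([[0.9, 0.1], [0.2, 0.8]])}
R = {1: np.array([[6.0, -5.0], [7.0, 12.0]]),
     2: np.array([[10.0, 17.0], [-14.0, 13.0]])}
T = {1: np.array([[1.0, 5.0], [120.0, 60.0]]),
     2: np.array([[50.0, 75.0], [7.0, 2.0]])}

def limiting_probabilities(P_mu):
    n = len(P_mu)
    A = P_mu.T - np.eye(n)
    A[-1, :] = 1.0
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

def smdp_average_reward(policy):
    """Evaluate Eq. (6.23) for one policy of the SMDP."""
    P_mu = np.array([P[policy[i]][i] for i in range(2)])
    r_mu = np.array([(P[policy[i]][i] * R[policy[i]][i]).sum() for i in range(2)])  # r_bar(i, mu(i))
    t_mu = np.array([(P[policy[i]][i] * T[policy[i]][i]).sum() for i in range(2)])  # t_bar(i, mu(i))
    Pi = limiting_probabilities(P_mu)
    return (Pi @ r_mu) / (Pi @ t_mu)   # expected reward per transition / expected time per transition

for policy in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    print(policy, round(smdp_average_reward(policy), 4))   # the largest value identifies the optimal policy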
Step 1. Set k = 1. Here k will denote the iteration number. Let the
number of states be |S|. Select any policy in an arbitrary manner.
Let us denote the policy selected by µ̂k . Let µ̂∗ denote the optimal
policy.
Step 2. (Policy Evaluation) Solve the following linear system of
equations.
h^k(i) = r̄(i, µ_k(i)) − ρ^k t̄(i, µ_k(i)) + Σ_{j=1}^{|S|} p(i, µ_k(i), j) h^k(j).    (6.26)
J*(i) = max_{a∈A(i)} [ r̄(i, a) − ρ* t̄(i, a) + Σ_{j=1}^{|S|} p(i, a, j) J*(j) ]  for each i ∈ S.    (6.27)
The following remarks will explain the notation.
The J ∗ terms are the unknowns. They are the components of the
optimal value function vector J ∗ . The number of elements in the
vector J ∗ equals the number of states in the SMDP.
The term t̄(i, a) denotes the expected time of transition from state
i when action a is selected in state i.
The term ρ∗ denotes the average reward associated with the optimal
policy.
Now in an MDP, although ρ∗ is unknown, it is acceptable to replace
ρ∗ by 0 (which is the practice in regular value iteration for MDPs),
or to replace it by the value function associated with some state of
the Markov chain (which is the practice in relative value iteration
for all (i, a) pairs, where J^k(i) denotes the estimate of the value func-
tion element for the ith state in the kth iteration of the value iteration
algorithm. Let us define W(i, a), which deletes the ρ* and also the time
Table 6.7. Calculations in policy iteration for average reward SMDPs (Example B)
Now, consider an SMDP with two actions in each state, where t̄(i, 1) ≠
t̄(i, 2). For this case, the above value iteration update can be writ-
ten as:

J^{k+1}(i) ← max{W(i, 1), W(i, 2)}.    (6.29)
If regular value iteration, as defined for the MDP, is used here, one
must not only ignore the ρ∗ term but also the time term. Then, an
update based on a regular value iteration for the SMDP will be (we
will show below that the following equation is meaningless):
J^{k+1}(i) ← max_{a∈A(i)} [ r̄(i, a) + Σ_{j=1}^{|S|} p(i, a, j) J^k(j) ],    (6.30)
Now, since t̄(i, 1) ≠ t̄(i, 2), it is entirely possible that using (6.29), one
obtains J^{k+1}(i) = W(i, 1), while using (6.31), one obtains J^{k+1}(i) =
W(i, 2).
In other words, the update in (6.31) will not yield the same maxi-
mizing action as (6.29), where (6.29) is based on the Bellman equation,
i.e., Eq. (6.28). (Note that both will yield the same maximizing action
if all the time terms are equal to 1, i.e., in the MDP). Thus regu-
lar value iteration, i.e., (6.31), which is based on Eq. (6.30), is not
an acceptable update for the SMDP in the manner shown above. It
should be clear thus that Eq. (6.28) cannot be modified to eliminate
ρ∗ without eliminating t̄(., .) at the same time; i.e., Eq. (6.30) has no
validity for SMDPs! And herein lies the difficulty with value iteration
for SMDPs. There is, however, a way around this difficulty, which we
now discuss.
where the replacements for r̄(i, a) and p(i, a, j) are denoted by r̄ϑ (i, a)
and pϑ (i, a, j), respectively, and are defined as:
pϑ(i, a, j) = ϑ p(i, a, j)/t̄(i, a)  if i ≠ j;   pϑ(i, a, j) = 1 + ϑ [p(i, a, j) − 1]/t̄(i, a)  if i = j.
In the above, ϑ is chosen such that:
hµ̂(i) = r̄(i, µ(i)) + Σ_{j=1}^{|S|} e^{−γ t̄(i,µ(i),j)} p(i, µ(i), j) hµ̂(j)

h^k(i) = r̄(i, µ_k(i)) + Σ_{j=1}^{|S|} e^{−γ t̄(i,µ_k(i),j)} p(i, µ_k(i), j) h^k(j).    (6.33)
Step 3: If

sp(J^{k+1} − J^k) < ε,

go to Step 4. Otherwise increase k by 1, and go back to Step 2.

Step 4: For each i ∈ S, choose

d(i) ∈ arg max_{a∈A(i)} [ r̄(i, a) + Σ_{j} e^{−γ t̄(i,a,j)} p(i, a, j) J^k(j) ],

and stop.
Σ_{j∈S} p(i, µ(i), j) r_L(i, µ(i), j) + R(i, µ(i)) + Σ_{j∈S} ∫_0^∞ e^{−γτ} f_{i,µ(i),j}(τ) Jµ(j) dτ,

where R(i, a) = Σ_{j∈S} r_C(i, a, j) ∫_0^∞ [(1 − e^{−γτ})/γ] f_{i,a,j}(τ) dτ

max_{a∈A(i)} [ Σ_{j∈S} p(i, a, j) r_L(i, a, j) + R(i, a) + Σ_{j∈S} ∫_0^∞ e^{−γτ} f_{i,a,j}(τ) J(j) dτ ].
Step 3b. If

||W^q − J^k|| < ε(1 − λ)/(2λ),

go to Step 4. Otherwise go to Step 3c.

Step 3c. If q = m_k, go to Step 3e. Otherwise, for each i ∈ S, compute:

W^{q+1}(i) ← r̄(i, µ_{k+1}(i)) + λ Σ_{j∈S} p(i, µ_{k+1}(i), j) W^q(j).
Step 3b. If

sp(W^q − J^k) ≤ ε,

go to Step 4. Otherwise go to Step 3c.

Step 3c. If q = m_k, go to Step 3e. Otherwise, for each i ∈ S, compute:

W^{q+1}(i) = r̄(i, µ_{k+1}(i)) + Σ_{j∈S} p(i, µ_{k+1}(i), j) W^q(j).
Minimize ρ subject to

ρ + v(i) − Σ_{j=1}^{|S|} p(i, µ(i), j) v(j) ≥ r̄(i, µ(i))   for i = 1, 2, . . . , |S| and all µ(i) ∈ A(i).
Σ_{i∈S} Σ_{a∈A(i)} x(i, a) = 1,    (6.35)
where x∗ (i, a) denotes the optimal value of x(i, a) obtained from solv-
ing the LP above, and d(i, a) will contain the optimal policy. Here
Σ_{i∈S} Σ_{a∈A(i)} x(i, a) t̄(i, a) = 1.
for all policies µ̂, then v(i) is an upper bound for the optimal value
v ∗ (i). This paves the way for an LP. The formulation, using the x and
v terms as decision variables, is:
Minimize Σ_{j=1}^{|S|} x(j) v(j) subject to

Σ_{j=1}^{|S|} x(j) = 1  and  v(i) − λ Σ_{j=1}^{|S|} p(i, µ(i), j) v(j) ≥ r̄(i, µ(i))  for i = 1, 2, . . . , |S| and all µ(i) ∈ A(i);

x(j) > 0 for j = 1, 2, . . . , |S|, and v(j) is URS for j = 1, 2, . . . , |S|.
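The LP above can be handed to any linear programming solver. The sketch below uses scipy.optimize.linprog on the Example A data, assuming λ = 0.8 for illustration and treating the weights x(j) as fixed positive constants that sum to one (here 1/|S|), which is the usual way this formulation is solved in practice rather than a statement from the text; the resulting v values solve the Bellman optimality equation.

import numpy as np
from scipy.optimize import linprog

# LP for the discounted MDP on the Example A data, with lambda = 0.8 (assumed).
P = {1: np.array([[0.7, 0.3], [0.4, 0.6]]),
     2: np.array([[0.9, 0.1], [0.2, 0.8]])}
R = {1: np.array([[6.0, -5.0], [7.0, 12.0]]),
     2: np.array([[10.0, 17.0], [-14.0, 13.0]])}
r_bar = {a: (P[a] * R[a]).sum(axis=1) for a in P}
lam, n = 0.8, 2

c = np.full(n, 1.0 / n)          # objective: minimize sum_j x(j) v(j) with x(j) = 1/|S|
A_ub, b_ub = [], []
for i in range(n):
    for a in (1, 2):
        # Constraint v(i) - lam * sum_j p(i,a,j) v(j) >= r_bar(i,a), rewritten as <= for linprog.
        row = lam * P[a][i]
        row[i] -= 1.0
        A_ub.append(row)
        b_ub.append(-r_bar[a][i])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=[(None, None)] * n)
print(res.x)    # the optimal values v*(1), v*(2); they satisfy the Bellman optimality equation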
Note that in the infinite horizon setting, the total expected reward is
usually infinite, but that is not the case here. As such, the total
expected reward is a useful metric in the finite horizon MDP.
In this setting, every time the Markov chain jumps, we will
assume that the number of stages (or time) elapsed since the start
Schematic of the finite horizon MDP: decision-making stages 1, 2, . . . , T, followed by the terminal (non-decision-making) stage T + 1
the optimal solution. We will now discuss the main idea underlying
the backward recursion technique for solving the problem.
The backward recursion technique starts with finding the values of
the states in the T th stage (the final decision-making stage). For this,
it uses (6.37). In the latter, one needs the values of the states in
the next stage. We assume the values in the (T + 1)th stage to be
known (they will all be zero by our convention). Having determined
the values in the T th stage, we will move one stage backwards, and
then determine the values in the (T − 1)th stage.
The values in the (T − 1)th stage will be determined by using the
values in the T th stage. In this way, we will proceed backward one
stage at a time and find the values of all the stages. During the evalua-
tion of the values, the optimal actions in each of the states will also be
identified using the Bellman equation. We now present a step-by-step
description of the backward recursion algorithm in the context of dis-
counted reward. The expected total reward algorithm is obtained by setting λ = 1
in the discounted reward algorithm.
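A minimal sketch of the backward recursion idea is given below, assuming a stationary TPM and TRM (the Example A data, reused purely for illustration), T = 5 decision-making stages, terminal values of zero at stage T + 1, and λ = 1 so that the metric is the expected total reward. The array layout and variable names are our own choices.

import numpy as np

# Backward recursion for a finite horizon problem with stationary transition data.
P = {1: np.array([[0.7, 0.3], [0.4, 0.6]]),
     2: np.array([[0.9, 0.1], [0.2, 0.8]])}
R = {1: np.array([[6.0, -5.0], [7.0, 12.0]]),
     2: np.array([[10.0, 17.0], [-14.0, 13.0]])}
r_bar = {a: (P[a] * R[a]).sum(axis=1) for a in P}
T, lam, n = 5, 1.0, 2

J = np.zeros((T + 2, n))            # J[t, i]: value of state i at stage t; J[T+1, :] = 0 by convention
policy = np.zeros((T + 1, n), dtype=int)

for t in range(T, 0, -1):           # move backwards from stage T to stage 1
    for i in range(n):
        q = [r_bar[a][i] + lam * P[a][i] @ J[t + 1] for a in (1, 2)]
        policy[t, i] = 1 + int(np.argmax(q))   # optimal action at stage t in state i
        J[t, i] = max(q)

print(J[1])           # optimal expected total reward over the T stages, from each starting state
print(policy[1:])     # the optimal (possibly stage-dependent) actions for stages 1 through T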
A Backward Recursion. Review notation provided for Eq. (6.37).
11. Conclusions
This chapter discussed the fundamental ideas underlying MDPs and
SMDPs. The focus was on a finite state and action space within
discrete-event systems. The important methods of value and policy
iteration (DP) were discussed, and the two forms of the Bellman equa-
tion, the optimality equation and the policy equation, were presented
for both average and discounted reward. The modified policy iteration
algorithm was also discussed. Some linear programming for solving
MDPs along with finite horizon control was covered towards the end
briefly. Our goal in this chapter was to provide some of the theory
underlying dynamic programming for solving MDPs and SMDPs that
can also be used in the simulation-based context of the next chapter.
as good as new. (5) Let i denote the number of days elapsed since
the last preventive maintenance or repair (subsequent to a failure);
then the probability of failure during the ith day can be modeled as
1 − ξψ^{i+2}, where ξ and ψ are scalars in the interval (0, 1), whose values
can be estimated from the data for time between successive failures of
the system.
We will use i to denote the state of the system, since this leads to a
Markov chain. In order to construct a finite Markov chain, we define
for any given positive value of ε ∈ (0, 1), ī to be the minimum integer
value of i such that the probability of failure on the īth day is greater than
or equal to (1 − ε). Since we will set ε to some pre-fixed value, we can
drop it from our notation. In theory, the line will have some probability
of not failing after any given day, making the state space infinite, but
our definition of ī permits truncation of the infinite state space to a
finite one. The resulting state space will be: S = {0, 1, 2, . . . , ī}. This
means that the probability of failure on the īth day (which is very
close to 1) will be assumed to equal 1.
Clearly, when a maintenance or repair is performed, i will be set
to 0. If a successful day of production occurs, i.e., the line does not
fail during the day, the state of the system is incremented by 1. The
action space is: {produce, maintain}. Cm and Cr denote the cost
of one maintenance and one repair respectively. Then, we have the
following transition probabilities for the system.
For action produce: For i = 0, 1, 2, . . . , ī − 1,

p(i, produce, i + 1) = ξψ^{i+2}   and   p(i, produce, 0) = 1 − ξψ^{i+2}.

For i = ī, p(i, produce, 0) = 1. For all other cases not specified above,
p(., produce, .) = 0. Further, for all values of i,