Unit 05 Dynamic Programming
The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal
policies given a perfect model of the environment as a Markov decision process (MDP).
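4.1 Policy Evaluation
First we consider how to compute the state-value function vπ for an arbitrary given policy π; this is known as policy evaluation, or the prediction problem. For all s ∈ S,
vπ(s) = Eπ[Rt+1 + γvπ(St+1) | St = s]
      = Σa π(a|s) Σs',r p(s', r | s, a) [r + γvπ(s')]   (4.4)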
where π(a|s) is the probability of taking action a in state s under policy π, and the expectations are subscripted by
π to indicate that they are conditional on π being followed. The existence and uniqueness of vπ are guaranteed as
long as either γ < 1 or eventual termination is guaranteed from all states under the policy π.
The initial approximation, v0, is chosen arbitrarily (except that the terminal state, if any, must be given value 0),
and each successive approximation is obtained by using the Bellman equation for vπ (4.4) as an update rule:
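vk+1(s) = Eπ[Rt+1 + γvk(St+1) | St = s]
        = Σa π(a|s) Σs',r p(s', r | s, a) [r + γvk(s')],   for all s ∈ S.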
Clearly, vk = vπ is a fixed point for this update rule because the Bellman equation for vπ assures us of equality in this
case. Indeed, the sequence {vk} can be shown in general to converge to vπ as, k→∞ under the same conditions that
guarantee the existence of vπ. This algorithm is called iterative policy evaluation.
To produce each successive approximation vk+1 from vk, iterative policy evaluation applies the same operation to
each state s: it replaces the old value of s with a new value obtained from the old values of the successor states of s,
and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated.
We call this kind of operation an expected update.
Each iteration of iterative policy evaluation updates the value of every state once to produce the new approximate
value function vk+1.
There are several different kinds of expected updates, depending on whether a state (as here) or a state–action pair
is being updated, and depending on the precise way the estimated values of the successor states are combined. All
the updates done in DP algorithms are called expected updates because they are based on an expectation over all
possible next states rather than on a sample next state.
A complete in-place version of iterative policy evaluation is sketched below. Note how it handles termination: formally, iterative policy evaluation converges only in the limit, but in practice it must be halted short of this, for example as soon as the largest change in any state's value during a sweep falls below a small threshold θ.
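A minimal Python sketch of in-place iterative policy evaluation, under an illustrative representation in which P[(s, a)] is a list of (probability, next_state, reward, done) transition tuples, states holds the nonterminal states, and policy[s] maps each action to its probability π(a|s) (these names are assumptions of the example, not a standard API):

def iterative_policy_evaluation(P, states, policy, gamma=0.9, theta=1e-6):
    """In-place iterative policy evaluation for a fixed policy.

    P[(s, a)] -- list of (prob, next_state, reward, done) transition tuples
    states    -- the nonterminal states; terminal states implicitly have value 0
    policy[s] -- dict mapping each action a to its probability pi(a|s)
    """
    V = {s: 0.0 for s in states}                      # arbitrary initial values
    while True:
        delta = 0.0
        for s in states:                              # one sweep, updating in place
            v_old = V[s]
            # expected update: average over actions and one-step transitions
            V[s] = sum(
                pi_a * sum(p * (r + (0.0 if done else gamma * V[s2]))
                           for p, s2, r, done in P[(s, a)])
                for a, pi_a in policy[s].items()
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:                             # halt when the sweep changed little
            return V

Because the updates are made in place, new values are used as soon as they become available within a sweep; this version typically converges faster than one that keeps two separate arrays.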
4.2 Policy Improvement
Suppose we have determined the value function vπ for an arbitrary deterministic policy π. For some state s we would like to know whether we should change the policy to deterministically choose an action a ≠ π(s). One way to answer this is to consider selecting a in s and thereafter following the existing policy π; the value of this way of behaving is
qπ(s, a) = E[Rt+1 + γvπ(St+1) | St = s, At = a]
         = Σs',r p(s', r | s, a) [r + γvπ(s')]   (4.6)
The key criterion is whether this is greater than or less than vπ(s). If it is greater—that is, if it is better to select a once in s and thereafter follow π than it would be to follow π all the time—then one would expect it to be better still to select a every time s is encountered, and that the new policy would in fact be a better one overall.
That this is true is a special case of a general result called the policy improvement theorem. Let π and π’
be any pair of deterministic policies such that, for all s ∈ S,
qπ(s, π'(s)) ≥ vπ(s).   (4.7)
Then the policy π’ must be as good as, or better than, π. That is, it must obtain greater or equal expected
return from all states s ∈ S:
vπ'(s) ≥ vπ(s).   (4.8)
The policy improvement theorem applies to the two policies that we considered at the beginning of this
section: an original deterministic policy, π, and a changed policy, π', that is identical to π except that π'(s) = a ≠ π(s).
The idea behind the proof of the policy improvement theorem is easy to understand. Starting from (4.7),
we keep expanding the qπ side with (4.6) and reapplying (4.7) until we get vπ'(s):
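vπ(s) ≤ qπ(s, π'(s))
      = E[Rt+1 + γvπ(St+1) | St = s, At = π'(s)]
      = Eπ'[Rt+1 + γvπ(St+1) | St = s]
      ≤ Eπ'[Rt+1 + γqπ(St+1, π'(St+1)) | St = s]
      = Eπ'[Rt+1 + γRt+2 + γ²vπ(St+2) | St = s]
      ≤ Eπ'[Rt+1 + γRt+2 + γ²Rt+3 + γ³vπ(St+3) | St = s]
      ...
      ≤ Eπ'[Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + ··· | St = s]
      = vπ'(s).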
So far we have seen how, given a policy and its value function, we can easily evaluate a change in the policy
at a single state. It is a natural extension to consider changes at all states, selecting at each state the action
that appears best according to qπ(s, a). In other words, to consider the new greedy policy, π’, given by
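π'(s) = argmaxa qπ(s, a)
      = argmaxa E[Rt+1 + γvπ(St+1) | St = s, At = a]
      = argmaxa Σs',r p(s', r | s, a) [r + γvπ(s')]   (4.9)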
where argmaxa denotes the value of a at which the expression that follows is maximized (with ties
broken arbitrarily).
The process of making a new policy that improves on an original policy, by making it greedy with respect
to the value function of the original policy, is called policy improvement.
Suppose the new greedy policy, π’, is as good as, but not better than, the old policy π. Then vπ = vπ’ , and
from (4.9) it follows that for all s ∈ S:
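vπ'(s) = maxa E[Rt+1 + γvπ'(St+1) | St = s, At = a]
       = maxa Σs',r p(s', r | s, a) [r + γvπ'(s')].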
But this is the same as the Bellman optimality equation (4.1), and therefore, vπ’ must be v*, and both π
and π’ must be optimal policies. Policy improvement thus must give us a strictly better policy except
when the original policy is already optimal.
4.3 Policy Iteration
Once a policy, π, has been improved using vπ to yield a better policy, π’, we can then compute vπ’ and
improve it again to yield an even better π’’. We can thus obtain a sequence of monotonically improving
policies and value functions:
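π0 --E--> vπ0 --I--> π1 --E--> vπ1 --I--> π2 --E--> ... --I--> π* --E--> v*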
where --E--> denotes a policy evaluation and --I--> denotes a policy improvement. Each policy is guaranteed to
be a strict improvement over the previous one (unless it is already optimal). Because a finite MDP has only
a finite number of deterministic policies, this process must converge to an optimal policy and the optimal
value function in a finite number of iterations.
This way of finding an optimal policy is called policy iteration. Note that each policy evaluation, itself an
iterative computation, is started with the value function for the previous policy. This typically results in a
great increase in the speed of convergence of policy evaluation (presumably because the value function
changes little from one policy to the next).
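Combining the two steps gives the policy iteration loop. The sketch below reuses the illustrative P representation and the iterative_policy_evaluation routine from the earlier sketch; the helper names are, again, assumptions of the example rather than a fixed API:

def greedy_action(P, actions, V, s, gamma):
    """Action maximizing the one-step lookahead value from state s under V."""
    def q(a):
        return sum(p * (r + (0.0 if done else gamma * V[s2]))
                   for p, s2, r, done in P[(s, a)])
    return max(actions, key=q)

def policy_iteration(P, states, actions, gamma=0.9):
    policy = {s: actions[0] for s in states}          # arbitrary initial deterministic policy
    while True:
        # policy evaluation: a deterministic policy puts probability 1 on one action
        V = iterative_policy_evaluation(
            P, states, {s: {policy[s]: 1.0} for s in states}, gamma)
        # policy improvement: make the policy greedy with respect to V
        policy_stable = True
        for s in states:
            best = greedy_action(P, actions, V, s, gamma)
            if best != policy[s]:
                policy[s], policy_stable = best, False
        if policy_stable:                             # greedy w.r.t. its own value function
            return policy, V

The loop ends exactly when the improvement step changes no action, that is, when the policy is greedy with respect to its own value function.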
4.4 Value Iteration
One drawback to policy iteration is that each of its iterations involves policy evaluation, which may itself
be a protracted iterative computation requiring multiple sweeps through the state set. If policy evaluation
is done iteratively, then convergence exactly to vπ occurs only in the limit.
In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the
convergence guarantees of policy iteration. One important special case is when policy evaluation is stopped
after just one sweep (one update of each state). This algorithm is called value iteration. It can be written as
a particularly simple update operation that combines the policy improvement and truncated policy
evaluation steps:
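vk+1(s) = maxa E[Rt+1 + γvk(St+1) | St = s, At = a]
        = maxa Σs',r p(s', r | s, a) [r + γvk(s')],   for all s ∈ S.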
Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule.
Also note how the value iteration update is identical to the policy evaluation update except that it requires
the maximum to be taken over all actions.
Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep
of policy improvement. Faster convergence is often achieved by interposing multiple policy evaluation
sweeps between each policy improvement sweep. In general, the entire class of truncated policy iteration
algorithms can be thought of as sequences of sweeps, some of which use policy evaluation updates and
some of which use value iteration updates. All of these algorithms converge to an optimal policy for
discounted finite MDPs.
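A corresponding Python sketch of value iteration, again under the illustrative P representation and reusing the greedy_action helper from the policy iteration sketch:

def value_iteration(P, states, actions, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            # one sweep combining truncated evaluation with improvement: max over actions
            V[s] = max(
                sum(p * (r + (0.0 if done else gamma * V[s2]))
                    for p, s2, r, done in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            # read out a deterministic policy greedy with respect to the final V
            return {s: greedy_action(P, actions, V, s, gamma) for s in states}, V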
4.5 Asynchronous Dynamic Programming
A major drawback to the DP methods so far is that they involve operations over the entire state set of the
MDP, that is, they require sweeps of the state set. If the state set is very large, then even a single sweep can
be prohibitively expensive.
Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of
systematic sweeps of the state set. These algorithms update the values of states in any order whatsoever,
using whatever values of other states happen to be available. The values of some states may be updated
several times before the values of others are updated once. To converge correctly, however, an
asynchronous algorithm must continue to update the values of all the states: it can’t ignore any state after
some point in the computation. Asynchronous DP algorithms allow great flexibility in selecting states to
update.
Of course, avoiding sweeps does not necessarily mean that we can get away with less computation. It just
means that an algorithm does not need to get locked into any hopelessly long sweep before it can make
progress improving a policy. We can try to take advantage of this flexibility by selecting the states to which
we apply updates so as to improve the algorithm’s rate of progress. We can try to order the updates to let
value information propagate from state to state in an efficient way. Some states may not need their values
updated as often as others. We might even try to skip updating some states entirely if they are not relevant
to optimal behaviour.
Asynchronous algorithms also make it easier to intermix computation with real-time interaction. To solve
a given MDP, we can run an iterative DP algorithm at the same time that an agent is actually experiencing
the MDP. The agent’s experience can be used to determine the states to which the DP algorithm applies its
updates. At the same time, the latest value and policy information from the DP algorithm can guide the
agent’s decision making.
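As a rough illustration of the idea (a sketch under the same illustrative P representation, not a prescribed algorithm), the systematic sweep can be replaced by value-iteration-style updates applied to states in an arbitrary order, for example states chosen at random or states the agent actually visits:

import random

def asynchronous_value_updates(P, states, actions, gamma=0.9, num_updates=100_000):
    """Value-iteration-style updates applied to one state at a time, in any order."""
    V = {s: 0.0 for s in states}
    for _ in range(num_updates):
        s = random.choice(states)          # states is a list here; could instead be the state just visited
        V[s] = max(
            sum(p * (r + (0.0 if done else gamma * V[s2]))
                for p, s2, r, done in P[(s, a)])
            for a in actions
        )
    return V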
4.6 Generalized Policy Iteration
Policy iteration consists of two simultaneous, interacting processes, one making the value function
consistent with the current policy (policy evaluation), and the other making the policy greedy with respect
to the current value function (policy improvement). In policy iteration, these two processes alternate, each
completing before the other begins, but this is not really necessary. In value iteration, for example, only a
single iteration of policy evaluation is performed in between each policy improvement. In asynchronous
DP methods, the evaluation and improvement processes are interleaved at an even finer grain. In some
cases a single state is updated in one process before returning to the other. As long as both processes
continue to update all states, the ultimate result is typically the same—convergence to the optimal value
function and an optimal policy.
We use the term generalized policy iteration (GPI) to refer to the general idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of
the two processes. Almost all reinforcement learning methods are well described as GPI. That is, all have
identifiable policies and value functions, with the policy always being improved with respect to the value
function and the value function always being driven toward the value function for the policy.
If both the evaluation process and the improvement process stabilize, that is, no longer produce
changes, then the value function and policy must be optimal. The value function stabilizes only when it is
consistent with the current policy, and the policy stabilizes only when it is greedy with respect to the
current value function. Thus, both processes stabilize only when a policy has been found that is greedy with
respect to its own evaluation function. This implies that the Bellman optimality equation holds, and thus
that the policy and the value function are optimal.
The evaluation and improvement processes in GPI can be viewed as both competing and
cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy
with respect to the value function typically makes the value function incorrect for the changed policy, and
making the value function consistent with the policy typically causes that policy no longer to be greedy.
One might picture the two goals as two lines in a plane. Each process drives the value function or policy toward the line representing a solution to one of the two goals. The goals interact because the two lines are not orthogonal: driving directly toward one goal causes some movement away from the other goal.
Inevitably, however, the joint process is brought closer to the overall goal of optimality. In this picture, the behaviour of policy iteration corresponds to steps that each take the system all the way to achieving one of the two goals completely.
Whether the steps are large and complete, as in policy iteration, or smaller and incomplete, as in value iteration and asynchronous DP, the two processes together achieve the overall goal of optimality even though neither is attempting to achieve it directly.
4.7 Efficiency of Dynamic Programming
DP may not be practical for very large problems, but compared with other methods for solving MDPs, DP
methods are actually quite efficient. If we ignore a few technical details, then, in the worst case, the time
that DP methods take to find an optimal policy is polynomial in the number of states and actions.
If n and k denote the number of states and actions, respectively, this means that a DP method takes a number of computational operations that is less than some polynomial function of n and k. A DP method is guaranteed to find an optimal policy in polynomial time even though the total number of (deterministic) policies is kⁿ. In this sense, DP is exponentially faster than any direct search in policy space could be.
DP is sometimes thought to be of limited applicability because of the curse of dimensionality, the fact that
the number of states often grows exponentially with the number of state variables. Large state sets do
create difficulties, but these are inherent difficulties of the problem, not of DP as a solution method. In fact,
DP is comparatively better suited to handling large state spaces than competing methods such as direct
search and linear programming.
In practice, DP methods can be used with today’s computers to solve MDPs with millions of states. Both
policy iteration and value iteration are widely used, and it is not clear which, if either, is better in general.
In practice, these methods usually converge much faster than their theoretical worst-case run times,
particularly if they are started with good initial value functions or policies.
On problems with large state spaces, asynchronous DP methods are often preferred. Asynchronous
methods and other variations of GPI can be applied in such cases and may find good or optimal policies
much faster than synchronous methods can.
4.8 Introduction to Monte Carlo Methods
Monte Carlo methods require only experience—sample sequences of states, actions, and rewards from
actual or simulated interaction with an environment. Learning from actual experience is striking because
it requires no prior knowledge of the environment’s dynamics, yet can still attain optimal behaviour.
Learning from simulated experience is also powerful. Although a model is required, the model need only generate sample transitions, not the complete probability distributions over all possible transitions that dynamic programming (DP) requires.
Monte Carlo methods sample and average returns for each state–action pair much like the bandit methods
sample and average rewards for each action. The main difference is that now there are multiple states, each
acting like a different bandit problem (like an associative-search or contextual bandit) and the different
bandit problems are interrelated.
That is, the return after taking an action in one state depends on the actions taken in later states in the
same episode. Because all the action selections are undergoing learning, the problem becomes
nonstationary from the point of view of the earlier state.
Monte Carlo Prediction
Consider Monte Carlo methods for learning the state-value function for a given policy. Recall that the value
of a state is the expected return—expected cumulative future discounted reward—starting from that state.
An obvious way to estimate it from experience, then, is simply to average the returns observed after visits
to that state. As more returns are observed, the average should converge to the expected value. This idea
underlies all Monte Carlo methods.
Suppose we wish to estimate vπ(s), the value of a state s under policy π, given a set of episodes obtained by
following π and passing through s. Each occurrence of state s in an episode is called a visit to s.
Of course, s may be visited multiple times in the same episode; let us call the first time it is visited in an
episode the first visit to s. The first-visit MC method estimates vπ(s) as the average of the returns following
first visits to s, whereas the every-visit MC method averages the returns following all visits to s.
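A minimal Python sketch of first-visit MC prediction; it assumes a generate_episode(policy) function that returns one episode as a list of (state, action, reward) triples (an assumption of the example):

from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate v_pi by averaging returns that follow first visits to each state."""
    returns = defaultdict(list)                        # returns observed after first visits
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode(policy)             # [(S0, A0, R1), (S1, A1, R2), ...]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):      # work backwards to accumulate the return G
            s, _, r = episode[t]
            G = gamma * G + r
            if s not in (step[0] for step in episode[:t]):   # first-visit check
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V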
Every-visit MC would be the same except without the check for St having occurred earlier in the episode.
Both first-visit MC and every-visit MC converge to vπ(s) as the number of visits (or first visits) to s goes to
infinity. This is easy to see for the case of first-visit MC: each return is an independent, identically distributed estimate of vπ(s) with finite variance, so by the law of large numbers the sequence of averages of these estimates converges to their expected value, vπ(s).
Every-visit MC is less straightforward, but its estimates also converge quadratically to vπ(s).