Module 04
1
Dynamic Programming
• The term dynamic programming (DP) refers to a collection of
algorithms that can be used to compute optimal policies given a
perfect model of the environment as a Markov decision process
(MDP).
• Classical DP algorithms are of limited utility in reinforcement learning
both because of their assumption of a perfect model and because of
their great computational expense, but they are still important
theoretically.
• The methods to be discussed can be viewed as attempts to achieve much the
same effect as DP, only with less computation and without assuming a
perfect model of the environment.
2
Dynamic Programming
• The key idea of DP, and of reinforcement learning generally, is the use of value
functions to organize and structure the search for good policies.
• We can easily obtain optimal policies once we have found the optimal value
functions, v* or q*, which satisfy the Bellman optimality equations:
v_*(s) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a \right]
       = \max_a \sum_{s',r} p(s', r \mid s, a)\left[ r + \gamma v_*(s') \right]    (4.1)

q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a \right]
          = \sum_{s',r} p(s', r \mid s, a)\left[ r + \gamma \max_{a'} q_*(s', a') \right]    (4.2)
3
Dynamic Programming
• As we shall see, DP algorithms are obtained by turning Bellman equations such as (4.1) and (4.2) into
assignments, that is, into update rules for improving approximations of the
desired value functions.
• Update rules: think of these as instructions for progressively
improving the agent's estimates of the value functions (v or q).
• Approximations and Improvement: In real-world applications, the
environment might be too complex to calculate the exact value functions
directly. So, DP algorithms rely on approximations. These approximations are
initially rough estimates of the true value functions.
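For example, treating the Bellman optimality equation (4.1) as an assignment, the estimate for each state is repeatedly replaced by the right-hand side evaluated with the current estimates; this is exactly the update that reappears later as value iteration, equation (4.10):

v_{k+1}(s) \leftarrow \max_a \sum_{s',r} p(s', r \mid s, a)\left[ r + \gamma v_k(s') \right]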
4
The Update Process
• The update rules based on the Bellman equations iteratively improve
these approximations.
• In each iteration, the agent considers the rewards received, the value
of the next state (according to the current approximation), and
updates the estimated value of the current state.
• This process continues until the approximations converge to a good
estimate of the actual value functions.
• Analogy: imagine feeling your way through a maze blindfolded, gradually refining your sense of how far each position is from the exit with every pass.
5
Policy Evaluation (Prediction)
6
Revision : What is a Policy?
• A policy defines the strategy an agent uses to navigate its environment and
make decisions.
• Function of a Policy:
• The policy acts as a mapping function. It takes the current state of the environment
(a set of features representing the situation) as input and outputs an action for the
agent to take.
• Ideally, the policy guides the agent towards actions that maximize its long-term
reward.
• Types of Policies:
• Deterministic Policies: These policies always recommend the same action for a given
state. Imagine a robot following a pre-programmed path in a factory.
• Stochastic Policies: These policies assign a probability distribution over possible
actions for each state. The agent randomly chooses an action based on these
probabilities. This can be useful for exploration in unknown environments or dealing
with uncertainty.
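As a small illustration of the two kinds of policy, here is a minimal Python sketch; the state and action names and the probabilities are made up for the example and are not from the text:

```python
import random

# Deterministic policy: a fixed mapping from state to action.
deterministic_policy = {"low_battery": "recharge", "high_battery": "search"}

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {
    "low_battery":  {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.1, "search": 0.9},
}

def act(policy, state, rng=random):
    """Pick an action: directly for a deterministic policy, by sampling otherwise."""
    choice = policy[state]
    if isinstance(choice, dict):                      # stochastic case
        actions, probs = zip(*choice.items())
        return rng.choices(actions, weights=probs)[0]
    return choice                                     # deterministic case
```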
7
Policy Evaluation (Prediction)
• Let us consider how to compute the state-value function vπ for an
arbitrary policy π.
• This is called policy evaluation in the DP literature.
• Also referred to as the prediction problem.
8
Policy Evaluation (Prediction)
• For all s ∈ S,

v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
         = \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right]
         = \mathbb{E}_\pi\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right]    (4.3)

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\left[ r + \gamma v_\pi(s') \right]    (4.4)

where
• Gt is the return, the discounted sum of rewards received after time step t.
• π(a|s) is the probability of taking action a in state s under policy π, and the expectations are
subscripted by π to indicate that they are conditional on π being followed.
9
Iterative Policy Evaluation
• Consider a sequence of approximate value functions v0, v1, v2, . . .,
each mapping S+ to R (the real numbers).
• The initial approximation, v0, is chosen arbitrarily (except that the
terminal state, if any, may be given value 0), and each successive
approximation is obtained by using the Bellman equation for vπ (4.4)
as an update rule:
v_{k+1}(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s \right]
           = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\left[ r + \gamma v_k(s') \right]    (4.5)
10
Iterative Policy Evaluation
• To produce each successive approximation vk+1 from vk, iterative
policy evaluation applies the same operation to each state s: it
replaces the old value of s with a new value obtained from the old
values of the successor states of s, and the expected immediate
rewards, along all the one-step transitions possible under the policy
being evaluated.
• We call this kind of operation an expected update.
• Each iteration of iterative policy evaluation updates the value of every
state once to produce the new approximate value function vk+1.
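A minimal sketch of one such expected update for a single state s, assuming the MDP is represented with illustrative `policy` and `transitions` dictionaries (these names are not from the book):

```python
# policy[s][a]      = pi(a|s)
# transitions[s][a] = list of (probability, next_state, reward) triples

def expected_update(s, V, policy, transitions, gamma=1.0):
    """Return the new value of s computed from the old values of its successors
    and the expected immediate rewards, as in (4.5)."""
    new_value = 0.0
    for a, action_prob in policy[s].items():
        for prob, s_next, reward in transitions[s][a]:
            new_value += action_prob * prob * (reward + gamma * V.get(s_next, 0.0))
    return new_value
```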
11
Iterative Policy Evaluation
• There are several different kinds of expected updates, depending on
whether a state (as here) or a state–action pair is being updated, and
depending on the precise way the estimated values of the successor
states are combined.
• All the updates done in DP algorithms are called expected updates
because they are based on an expectation over all possible next states
rather than on a sample next state.
12
Developing a computer program
• To write a sequential computer program to implement iterative policy
evaluation as given by (4.5) you would have to use two arrays, one for
the old values, vk(s), and one for the new values, vk+1(s).
• With two arrays, the new values can be computed one by one from
the old values without the old values being changed.
• Of course it is easier to use one array and update the values “in
place,” that is, with each new value immediately overwriting the old
one.
• Then, depending on the order in which the states are updated,
sometimes new values are used instead of old ones on the right-hand
side of (4.5).
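A sketch of the difference between the two sweep styles; `backup` is a hypothetical helper standing for the right-hand side of (4.5), supplied by the caller:

```python
from typing import Callable, Dict, Hashable

State = Hashable

def sweep_two_array(V: Dict[State, float],
                    backup: Callable[[State, Dict[State, float]], float]) -> Dict[State, float]:
    """One sweep using two arrays: every new value is computed purely from the
    old value function, then the whole array is swapped in at the end."""
    return {s: backup(s, V) for s in V}

def sweep_in_place(V: Dict[State, float],
                   backup: Callable[[State, Dict[State, float]], float]) -> Dict[State, float]:
    """One in-place sweep: each new value overwrites the old one immediately,
    so later backups in the same sweep may already use the new values."""
    for s in V:                # the iteration order now affects the rate of convergence
        V[s] = backup(s, V)
    return V
```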
13
Developing a computer program
• This in-place algorithm also converges to vπ; in fact, it usually
converges faster than the two-array version, because it uses new data
as soon as they are available.
• We think of the updates as being done in a sweep through the state
space.
• For the in-place algorithm, the order in which states have their values
updated during the sweep has a significant influence on the rate of
convergence.
• We usually have the in-place version in mind when we think of DP
algorithms.
14
Pseudocode
15
Pseudocode
• A complete in-place version of iterative policy evaluation is shown in
pseudocode in the box.
• Note how it handles termination.
• Formally, iterative policy evaluation converges only in the limit, but in
practice it must be halted short of this.
• The pseudocode tests the quantity max_{s∈S} |v_{k+1}(s) − v_k(s)| after each
sweep and stops when it is sufficiently small.
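A minimal, self-contained Python sketch of the in-place algorithm just described; the MDP representation (the `policy` and `transitions` dictionaries) and the names `theta` and `gamma` are illustrative assumptions, not the book's notation:

```python
def iterative_policy_evaluation(states, policy, transitions, gamma=1.0, theta=1e-6):
    """Evaluate a policy: returns V approximating v_pi.

    states            : iterable of non-terminal states
    policy[s][a]      : probability of taking action a in state s
    transitions[s][a] : list of (probability, next_state, reward) triples;
                        terminal states never appear as keys, so they keep value 0.
    """
    V = {s: 0.0 for s in states}          # v_0 chosen arbitrarily (here: zeros)
    while True:
        delta = 0.0
        for s in states:                  # one in-place sweep through the state space
            v_old = V[s]
            V[s] = sum(
                action_prob * prob * (reward + gamma * V.get(s_next, 0.0))
                for a, action_prob in policy[s].items()
                for prob, s_next, reward in transitions[s][a]
            )
            delta = max(delta, abs(V[s] - v_old))
        if delta < theta:                 # max_s |v_{k+1}(s) - v_k(s)| small enough
            return V
```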
16
Example 4.1
• Consider a 4 × 4 gridworld: nonterminal states 1–14, two terminal corner cells, four deterministic actions (up, down, right, left; moves that would leave the grid leave the state unchanged), reward −1 on every transition, an undiscounted episodic task, evaluated under the equiprobable random policy.
17
Example 4.1
18
Example 4.1
21
Example 4.1
• The final estimate is in fact vπ, which in this case gives for each state
the negation of the expected number of steps from that state until
termination.
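A sketch of how the gridworld of Example 4.1 could be encoded for the evaluation sketch shown earlier; the encoding details (cell tuples, dictionary layout) are my own choices for illustration:

```python
GRID = 4
TERMINALS = {(0, 0), (GRID - 1, GRID - 1)}                     # two terminal corner cells
ACTIONS = {"up": (-1, 0), "down": (1, 0), "right": (0, 1), "left": (0, -1)}

def step(cell, move):
    """Deterministic move; stepping off the grid leaves the cell unchanged."""
    r, c = cell[0] + move[0], cell[1] + move[1]
    return (r, c) if 0 <= r < GRID and 0 <= c < GRID else cell

states = [(r, c) for r in range(GRID) for c in range(GRID) if (r, c) not in TERMINALS]
policy = {s: {a: 0.25 for a in ACTIONS} for s in states}       # equiprobable random policy
transitions = {s: {a: [(1.0, step(s, m), -1.0)] for a, m in ACTIONS.items()}
               for s in states}

# V = iterative_policy_evaluation(states, policy, transitions)  # sketch from the earlier slide
```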
22
Exercise
• In Example 4.1, if π is the equiprobable random policy,
• What is qπ(11, down)?
• What is qπ(7, down)?
23
Exercise
24
Answer
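A worked sketch using (4.6); the value v_π(11) = −14 is taken from the book's final value function for Example 4.1 (Figure 4.1), which is not reproduced in these slides. Transitions are deterministic and the task is undiscounted, so each action value is just the immediate reward −1 plus the value of the resulting state:

q_\pi(11, \text{down}) = -1 + v_\pi(\text{terminal}) = -1 + 0 = -1

q_\pi(7, \text{down}) = -1 + v_\pi(11) = -1 + (-14) = -15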
25
Exercise
• Assume a game to be played as follows.
1. There are 2 options for the player: play or quit
2. If quit is chosen then 10 points are granted to the player and the game
ends.
3. If play is chosen, then 4 points are granted to the player and the game
continues with the rolling of a six-faced die.
i. If 1 or 2 occurs on the die, the game ends, with no additional points awarded to the player.
ii. If 3 to 6 occurs on the die, return to step 1.
• Question: Design this game as an MDP (one possible formulation is sketched below).
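One possible formulation, sketched in Python; the state and action names and the undiscounted episodic setting are my assumptions:

```python
# One possible MDP formulation of the game. Transition format:
# action -> list of (probability, next_state, reward) triples.

IN_GAME, END = "in_game", "terminal"

MDP = {
    IN_GAME: {
        "quit": [(1.0, END, 10.0)],          # quit: +10 points, game ends
        "play": [(2 / 6, END, 4.0),          # die shows 1 or 2: +4, game ends
                 (4 / 6, IN_GAME, 4.0)],     # die shows 3-6: +4, play again
    },
    # END is terminal: no actions available.
}
```

Under this formulation, the Bellman optimality equation for the single non-terminal state is v*(in_game) = max(10, 4 + (4/6) v*(in_game)), so always choosing play is optimal with value 12.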
27
Exercise
28
Policy Improvement
The reason for computing the value function for a policy is to help find better
policies.
29
What is Policy Improvement
• The process of creating a new policy that outperforms the original
policy.
• Achieved by evaluating the current policy and identifying actions that
lead to higher rewards in specific states.
• The new policy prioritizes these actions, leading to potentially better
outcomes.
30
Policy Improvement : Introduction
• Suppose we have determined the value function vπ for an arbitrary deterministic policy π.
• For some state s we would like to know whether or not we should change the policy to
deterministically choose an action a ≠ π(s).
• We know how good it is to follow the current policy from s, that is, vπ(s), but would it be better
or worse to change to the new policy?
• One way to answer this question is to consider selecting a in s and thereafter following the
existing policy, π.
• The value of this way of behaving is
q_\pi(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right]
            = \sum_{s',r} p(s', r \mid s, a)\left[ r + \gamma v_\pi(s') \right]    (4.6)
31
Policy Improvement : Introduction
Recall (4.6):  q_\pi(s, a) = \sum_{s',r} p(s', r \mid s, a)\left[ r + \gamma v_\pi(s') \right]
• The key criterion is whether this quantity is greater than or less than vπ(s).
• If it is greater
• that is, if it is better to select a once in s and thereafter follow π than it would
be to follow π all the time
• then one would expect it to be better still to select a every time s is
encountered, and that the new policy would in fact be a better one overall.
32
Policy Improvement Theorem
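The theorem box from the book is not reproduced in this text version; paraphrasing its statement (the numbering (4.7), referenced on a later slide, and (4.8) follows the book):

If \pi and \pi' are any pair of deterministic policies such that, for all s ∈ S,

q_\pi(s, \pi'(s)) \ge v_\pi(s),    (4.7)

then the policy \pi' must be as good as, or better than, \pi; that is, for all s ∈ S,

v_{\pi'}(s) \ge v_\pi(s).    (4.8)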
33
Policy Improvement Theorem : Proof
34
Policy Improvement
• We saw that, given a policy and its value function, we can easily
evaluate a change in the policy at a single state to a particular action.
• It is a natural extension to consider changes at all states and to all
possible actions, selecting at each state the action that appears best
according to qπ(s, a). In other words, to consider the new greedy
policy, π’, given by
\pi'(s) = \arg\max_a q_\pi(s, a)
        = \arg\max_a \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right]
        = \arg\max_a \sum_{s',r} p(s', r \mid s, a)\left[ r + \gamma v_\pi(s') \right]    (4.9)
35
Policy Improvement
• The greedy policy takes the action that looks best in the short term—
after one step of lookahead—according to vπ.
• By construction, the greedy policy meets the conditions of the policy
improvement theorem (4.7), so we know that it is as good as, or
better than, the original policy.
• The process of making a new policy that improves on an original
policy, by making it greedy with respect to the value function of the
original policy, is called policy improvement.
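A minimal sketch of the greedy-policy construction in (4.9), reusing the illustrative `transitions` representation from the earlier sketches (not the book's notation):

```python
def greedy_policy(states, V, transitions, gamma=1.0):
    """Return the deterministic policy that is greedy with respect to V."""
    pi = {}
    for s in states:
        # One step of lookahead: evaluate q(s, a) for every action using V.
        q = {a: sum(prob * (reward + gamma * V.get(s_next, 0.0))
                    for prob, s_next, reward in outcomes)
             for a, outcomes in transitions[s].items()}
        pi[s] = max(q, key=q.get)          # argmax over actions
    return pi
```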
36
Policy Improvement
37
Policy Iteration
38
Policy Iteration
39
Policy Iteration Algorithm
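The algorithm box from page 80 of the book is not reproduced in this text version; below is a minimal sketch of policy iteration (alternating evaluation and greedy improvement) under the same illustrative MDP representation as the earlier sketches. The strict-improvement test in the improvement step is one simple way to avoid the non-termination issue raised in Exercise 4.4 below.

```python
def policy_iteration(states, transitions, gamma=1.0, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    # Start from an arbitrary deterministic policy (here: the first listed action).
    pi = {s: next(iter(transitions[s])) for s in states}
    while True:
        # Policy evaluation for the current deterministic policy pi.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v_old = V[s]
                V[s] = sum(prob * (reward + gamma * V.get(s_next, 0.0))
                           for prob, s_next, reward in transitions[s][pi[s]])
                delta = max(delta, abs(V[s] - v_old))
            if delta < theta:
                break
        # Policy improvement: make pi greedy with respect to V.
        stable = True
        for s in states:
            q = {a: sum(prob * (reward + gamma * V.get(s_next, 0.0))
                        for prob, s_next, reward in outcomes)
                 for a, outcomes in transitions[s].items()}
            best = max(q, key=q.get)
            if q[best] > q[pi[s]] + 1e-12:   # switch only on a strict improvement,
                pi[s] = best                 # so ties cannot cause endless switching
                stable = False
        if stable:
            return pi, V
```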
40
Example 4.2 : Jack’s Car Rental
• Students are recommended to study Jack’s Car Rental problem
discussed on Page 81 of the book.
41
Exercise 4.4
• The policy iteration algorithm on page 80 has a subtle bug in that it
may never terminate if the policy continually switches between two
or more policies that are equally good.
• This is ok for pedagogy, but not for actual use.
• Modify the pseudocode so that convergence is guaranteed.
42
Solution
43
Exercise 4.5
• How would policy iteration be defined for action values?
• Give a complete algorithm for computing q*, analogous to that on
page 80 for computing v*.
• Please pay special attention to this exercise, because the ideas
involved will be used throughout the rest of the book.
44
Solution
45
Value Iteration
46
Value Iteration
• One drawback to policy iteration is that each of its iterations involves
policy evaluation, which may itself be a protracted iterative
computation requiring multiple sweeps through the state set.
• If policy evaluation is done iteratively, then convergence exactly to vπ
occurs only in the limit.
• Must we wait for exact convergence, or can we stop short of that?
• The example in Figure 4.1 certainly suggests that it may be possible to
truncate policy evaluation.
• In that example, policy evaluation iterations beyond the first three have no
effect on the corresponding greedy policy.
47
Value Iteration
• In fact, the policy evaluation step of policy iteration can be truncated
in several ways without losing the convergence guarantees of policy
iteration.
• One important special case is when policy evaluation is stopped after
just one sweep (one update of each state).
• This algorithm is called value iteration.
48
Value Iteration
• It can be written as a particularly simple update operation that combines
the policy improvement and truncated policy evaluation steps:
v_{k+1}(s) = \max_a \mathbb{E}\left[ R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a \right]
           = \max_a \sum_{s',r} p(s', r \mid s, a)\left[ r + \gamma v_k(s') \right]    (4.10)
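A minimal sketch of value iteration as repeated application of (4.10), again under the illustrative `transitions` representation used in the earlier sketches:

```python
def value_iteration(states, transitions, gamma=1.0, theta=1e-6):
    """Iterate the update (4.10) until the value function barely changes,
    then read off a greedy (approximately optimal) policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_old = V[s]
            V[s] = max(sum(prob * (reward + gamma * V.get(s_next, 0.0))
                           for prob, s_next, reward in outcomes)
                       for outcomes in transitions[s].values())
            delta = max(delta, abs(V[s] - v_old))
        if delta < theta:
            break
    # Greedy policy with respect to the final value estimate.
    pi = {s: max(transitions[s],
                 key=lambda a: sum(prob * (reward + gamma * V.get(s_next, 0.0))
                                   for prob, s_next, reward in transitions[s][a]))
          for s in states}
    return V, pi
```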
50
Value Iteration
51
Example 4.3 : Gambler’s Problem
• Students are recommended to study Gambler’s problem discussed on
Page 84 of the book.
52
Asynchronous Dynamic
Programming
53
Asynchronous Dynamic Programming
• A major drawback to the DP methods that we have discussed so far is
that they involve operations over the entire state set of the MDP, that
is, they require sweeps of the state set.
• If the state set is very large, then even a single sweep can be
prohibitively expensive.
• For example, the game of backgammon has over 10^20 states. Even if
we could perform the value iteration update on a million states per
second, it would take over a thousand years to complete a single
sweep.
54
Asynchronous Dynamic Programming
• Asynchronous Dynamic Programming (ADP) refers to a class of
algorithms used to solve Markov Decision Processes (MDPs) by
asynchronously updating the value function or policy of states.
• Unlike traditional dynamic programming algorithms, such as value
iteration and policy iteration, which update all states in a
synchronized manner, ADP updates states individually and in any
order.
• This asynchronous updating process offers several advantages in
reinforcement learning settings:
55
Advantages of ADP
• Efficiency:
• ADP can be more computationally efficient than synchronous dynamic
programming algorithms, especially in large-scale problems where updating
all states simultaneously may be computationally prohibitive.
• By focusing computational resources on states that are most in need of
updating, ADP can converge to an optimal solution more quickly.
• Scalability:
• Asynchronous updates allow for more scalable solutions to reinforcement
learning problems, as they enable the algorithm to handle large state spaces
more efficiently.
• By updating states independently, ADP can be applied to problems with
millions or even billions of states.
56
Advantages of ADP
• Exploration-Exploitation Trade-off:
• ADP can help balance exploration and exploitation in reinforcement learning.
By updating states asynchronously, the algorithm can focus on exploring
regions of the state space that are less explored or have higher uncertainty,
while exploiting known information in other regions.
• Incremental Updates:
• ADP typically performs incremental updates to the value function or policy,
updating only the affected states based on changes in neighboring states.
• This incremental update process can lead to faster convergence and reduced
computational overhead compared to recomputing the entire value function
or policy in each iteration.
57
Asynchronous Dynamic Programming
• Popular ADP algorithms in reinforcement learning include
asynchronous value iteration, asynchronous policy iteration, and
various variants of asynchronous Q-learning.
• These algorithms have been successfully applied to a wide range of
reinforcement learning tasks, including robotic control, game playing,
and autonomous decision-making.
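A minimal sketch of one asynchronous variant, asynchronous value iteration, which applies the update (4.10) to one state at a time in an arbitrary (here random) order instead of in systematic sweeps; the MDP representation is the same illustrative one as in the earlier sketches, and convergence still requires that every state keeps being selected:

```python
import random

def async_value_iteration(states, transitions, gamma=1.0, num_updates=100_000, seed=0):
    """Apply the value-iteration update to individual states, in any order.

    states : list of non-terminal states (indexable, so random.choice works)
    """
    rng = random.Random(seed)
    V = {s: 0.0 for s in states}
    for _ in range(num_updates):
        s = rng.choice(states)             # any selection rule works, e.g. prioritized
        V[s] = max(sum(prob * (reward + gamma * V.get(s_next, 0.0))
                       for prob, s_next, reward in outcomes)
                   for outcomes in transitions[s].values())
    return V
```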
58
Generalized Policy Iteration
59
Generalized Policy Iteration
• The term generalized policy iteration (GPI) refers to the
general idea of letting policy-evaluation and policy
improvement processes interact, independent of the
granularity and other details of the two processes.
• Almost all reinforcement learning methods are well
described as GPI.
• That is, all have identifiable policies and value functions,
with the policy always being improved with respect to
the value function and the value function always being
driven toward the value function for the policy, as
suggested by the diagram to the right.
60
Generalized Policy Iteration
• If both the evaluation process and the improvement process stabilize,
that is, no longer produce changes, then the value function and policy
must be optimal.
• The value function stabilizes only when it is consistent with the
current policy, and the policy stabilizes only when it is greedy with
respect to the current value function.
• Thus, both processes stabilize only when a policy has been found that
is greedy with respect to its own evaluation function.
• This implies that the Bellman optimality equation (4.1) holds, and
thus that the policy and the value function are optimal.
61
Generalized Policy Iteration
• The evaluation and improvement processes in GPI can be viewed as
both competing and cooperating.
• They compete in the sense that they pull in opposing directions.
• Making the policy greedy with respect to the value function typically
makes the value function incorrect for the changed policy, and
making the value function consistent with the policy typically causes
that policy no longer to be greedy.
• In the long run, however, these two processes interact to find a single
joint solution: the optimal value function and an optimal policy.
62
Generalized Policy Iteration
• Generalized Policy Iteration (GPI) is crucial in reinforcement learning due to
its ability to iteratively refine both the policy and the value function.
• By continuously evaluating and improving the agent's policy, GPI enables
adaptive learning and optimal decision-making in complex environments.
• Its flexibility allows for asynchronous updates, interleaving policy
evaluation and improvement, and accommodating various algorithms,
making it applicable to a wide range of reinforcement learning problems.
• GPI provides a unified framework that balances exploration and
exploitation, leading to faster convergence and more efficient learning.
63
End
64