Value Functions & Bellman Equations
Optimal Policies - Optimal Value Functions - Using Optimal Value Functions to Get
Optimal Policies.
Optimal policies:
An optimal policy is a policy that maximizes the expected return (the cumulative,
possibly discounted, reward) for an agent in a given environment.
An optimal policy can be found using a variety of methods, such as policy iteration,
value iteration, and policy gradient methods.
Policy iteration is an iterative algorithm that starts with an initial policy and then
alternates two steps: policy evaluation, which computes the value function of the
current policy, and policy improvement, which makes the policy greedy with
respect to that value function.
Value iteration is an iterative algorithm that starts with an initial value function
and then repeatedly replaces each state's value with the best one-step lookahead
value (the Bellman optimality backup); an optimal policy is read off greedily
from the converged values.
Policy gradient methods are a class of algorithms that directly adjust the
parameters of the policy in the direction that increases the expected return.
In a gridworld, an optimal policy might be to always move towards the goal state.
In a cartpole system, an optimal policy might be to push the cart in the direction
in which the pole is falling, so that the pole stays balanced.
In a frozen lake, an optimal policy might be to move towards the goal state along
a route that minimizes the risk of slipping into a hole.
In a mountain car system, an optimal policy might be to rock the car back and
forth to build enough momentum to reach the top of the hill.
Optimal policies are important in reinforcement learning because they are what the agent
is ultimately trying to learn: a rule for acting in each state that maximizes its return.
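The evaluate-then-improve loop of policy iteration described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the two-state MDP, its rewards, the discount factor, and the names `P`, `backup`, `evaluate`, and `policy_iteration` are all assumptions made up for the example.

```python
# Hypothetical two-state MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
GAMMA = 0.9  # assumed discount factor


def backup(V, s, a):
    """Expected one-step return of taking action a in state s, then valuing successors with V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def evaluate(policy, tol=1e-10):
    """Iterative policy evaluation: sweep the states until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = backup(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def policy_iteration():
    policy = {s: "stay" for s in P}  # arbitrary initial policy
    while True:
        V = evaluate(policy)
        # Improvement step: act greedily with respect to the evaluated values.
        improved = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
        if improved == policy:  # no state changed its action, so the policy is stable
            return policy, V
        policy = improved


policy, V = policy_iteration()
print(policy)  # 'go' in state 0, 'stay' in state 1
print(V)       # approximately {0: 19.0, 1: 20.0}
```

In this toy problem the loop stabilizes after one improvement: staying in state 1 earns 2 per step (worth 2 / (1 - 0.9) = 20), so the best move from state 0 is to go there.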
Optimal Value Functions
Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot
of reward over the long run. For finite MDPs, we can precisely define an optimal policy in
the following way. Value functions define a partial ordering over policies. A policy $\pi$ is
defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or
equal to that of $\pi'$ for all states. In other words, $\pi \ge \pi'$ if and only if
$V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s \in \mathcal{S}$. There is always at least one policy that is
better than or equal to all other policies. This is an optimal policy. Although there may be
more than one, we denote all the optimal policies by $\pi^*$. They share the same state-value
function, called the optimal state-value function, denoted $V^*$, and defined as
$$V^*(s) = \max_{\pi} V^{\pi}(s) \tag{1}$$
for all $s \in \mathcal{S}$.
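The ordering over policies is only partial: two policies can each be better in a different state, in which case neither is better than or equal to the other. A small sketch of the comparison follows; the helper name `better_or_equal` and the state values of the three policies are hypothetical numbers chosen purely to illustrate the definition.

```python
def better_or_equal(v_a, v_b):
    """True if policy a's value is at least policy b's value in every state."""
    return all(v_a[s] >= v_b[s] for s in v_a)


# Hypothetical state values of three policies on a two-state problem.
v_pi1 = {"s1": 1.0, "s2": 3.0}
v_pi2 = {"s1": 2.0, "s2": 1.0}
v_pi3 = {"s1": 2.0, "s2": 3.0}

# pi1 and pi2 are incomparable: each wins in one state.
print(better_or_equal(v_pi1, v_pi2), better_or_equal(v_pi2, v_pi1))  # False False
# pi3 is at least as good as both in every state.
print(better_or_equal(v_pi3, v_pi1), better_or_equal(v_pi3, v_pi2))  # True True
```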
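Optimal policies also share the same optimal action-value function, denoted $Q^*$, and
defined as
$$Q^*(s,a) = \max_{\pi} Q^{\pi}(s,a) \tag{2}$$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$. For the state-action pair $(s,a)$, this function gives the expected
return for taking action $a$ in state $s$ and thereafter following an optimal policy. Thus, we
can write $Q^*$ in terms of $V^*$ as follows:
$$Q^*(s,a) = E\{\, r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\ a_t = a \,\} \tag{3}$$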
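As a hedged aside, the expectation in (3) can be written out once the one-step dynamics are named. Assume $\mathcal{P}^a_{ss'}$ denotes the probability of moving from $s$ to $s'$ under action $a$, and $\mathcal{R}^a_{ss'}$ the expected reward on that transition. The numbers in the second line are hypothetical and serve only as a worked instance: two equally likely successor states with rewards 1 and 0, optimal values 10 and 4, and $\gamma = 0.9$.

```latex
Q^*(s,a) = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^*(s') \right]

% Worked instance with assumed numbers:
Q^*(s,a) = 0.5\,[\,1 + 0.9 \cdot 10\,] + 0.5\,[\,0 + 0.9 \cdot 4\,] = 5 + 1.8 = 6.8
```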
Example 3.10: Optimal Value Functions for Golf. The lower part of Figure 1 shows the
contours of a possible optimal action-value function $Q^*(s, \text{driver})$. These are the values
of each state if we first play a stroke with the driver and afterward select either the driver
or the putter, whichever is better. The driver enables us to hit the ball farther, but with
less accuracy. We can reach the hole in one shot using the driver only if we are already
very close; thus the $-1$ contour for $Q^*(s, \text{driver})$ covers only a small portion of the green.
If we have two strokes, however, then we can reach the hole from much farther away, as
shown by the $-2$ contour. In this case we don't have to drive all the way to within the
small $-1$ contour, but only to anywhere on the green; from there we can use the putter.
The optimal action-value function gives the values after committing to a
particular first action, in this case, to the driver, but afterward using whichever actions
are best. The $-3$ contour is still farther out and includes the starting tee. From the tee, the
best sequence of actions is two drives and one putt, sinking the ball in three strokes.
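Because $V^*$ is the value function for a policy, it must satisfy the self-consistency condition
given by the Bellman equation for state values. Because it is the optimal value
function, however, $V^*$'s consistency condition can be written in a special form without
reference to any specific policy. This is the Bellman equation for $V^*$, or the Bellman
optimality equation. Intuitively, the Bellman optimality equation expresses the fact that
the value of a state under an optimal policy must equal the expected return for the best
action from that state:
$$
\begin{aligned}
V^*(s) &= \max_{a \in \mathcal{A}(s)} Q^{\pi^*}(s,a) \\
       &= \max_{a \in \mathcal{A}(s)} E\{\, r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s,\ a_t = a \,\} \qquad (4) \\
       &= \max_{a \in \mathcal{A}(s)} \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^*(s') \right]. \qquad (5)
\end{aligned}
$$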
The last two equations are two forms of the Bellman optimality equation for $V^*$. The
Bellman optimality equation for $Q^*$ is
$$
\begin{aligned}
Q^*(s,a) &= E\{\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s,\ a_t = a \,\} \\
         &= \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right].
\end{aligned}
$$
The backup diagrams in Figure 1 show graphically the spans of future states and actions
considered in the Bellman optimality equations for $V^*$ and $Q^*$. These are the same as the
backup diagrams for $V^{\pi}$ and $Q^{\pi}$ except that arcs have been added at the agent's choice
points to represent that the maximum over that choice is taken rather than the expected
value given some policy. Figure 1a graphically represents the Bellman optimality
equation (5).
For finite MDPs, the Bellman optimality equation (5) has a unique solution independent
of the policy. The Bellman optimality equation is actually a system of equations, one for
each state, so if there are $N$ states, then there are $N$ equations in $N$ unknowns. If the
dynamics of the environment are known (the expected rewards $\mathcal{R}^a_{ss'}$ and the transition
probabilities $\mathcal{P}^a_{ss'}$), then in principle one can solve this system of equations for $V^*$
using any one of a variety of methods for solving systems of nonlinear equations. One can
solve a related set of equations for $Q^*$.
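One such method, shown here as a sketch rather than the approach the text prescribes, is value iteration: treat the right-hand side of the Bellman optimality equation as an update rule and apply it to every state until the values stop changing. The two-state MDP, its rewards, the discount factor, and the function name `value_iteration` are assumptions made up for the example.

```python
# Hypothetical MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    "s0": {"a": [(0.5, "s0", 0.0), (0.5, "s1", 1.0)], "b": [(1.0, "s0", 0.2)]},
    "s1": {"a": [(1.0, "s1", 1.0)], "b": [(1.0, "s0", 0.0)]},
}
GAMMA = 0.9  # assumed discount factor


def value_iteration(P, gamma, tol=1e-10):
    """Repeatedly apply the Bellman optimality backup until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Best expected one-step return over the actions available in s.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


V_star = value_iteration(P, GAMMA)
print(V_star)  # approximately {'s0': 9.0909, 's1': 10.0}
```

The result can be checked by hand: always taking `a` in `s1` is worth 1 / (1 - 0.9) = 10, and substituting that into the equation for `s0` gives 5 / 0.55 = 100/11.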
Once one has $V^*$, it is relatively easy to determine an optimal policy. For each state $s$, there
will be one or more actions at which the maximum is obtained in the Bellman optimality
equation. Any policy that assigns nonzero probability only to these actions is an optimal
policy. You can think of this as a one-step search. If you have the optimal value
function, $V^*$, then the actions that appear best after a one-step search will be optimal
actions. Another way of saying this is that any policy that is greedy with respect to the
optimal evaluation function $V^*$ is an optimal policy. The term greedy is used in computer
science to describe any search or decision procedure that selects alternatives based only
on local or immediate considerations, without considering the possibility that such a
selection may prevent future access to even better alternatives. Consequently, it
describes policies that select actions based only on their short-term consequences. The
beauty of $V^*$ is that if one uses it to evaluate the short-term consequences of actions
(specifically, the one-step consequences), then a greedy policy is actually optimal in the
long-term sense in which we are interested, because $V^*$ already takes into account the
reward consequences of all possible future behavior. By means of $V^*$, the optimal
expected long-term return is turned into a quantity that is locally and immediately
available for each state. Hence, a one-step-ahead search yields the long-term optimal
actions.
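The one-step search can be made concrete with a small sketch. The 2x2 gridworld below is a hypothetical example (reward of -1 per move, goal in the bottom-right corner, no discounting), and the names `V_STAR`, `step`, and `greedy_actions` are invented for illustration. Note that the search needs the model (`step`) to look one move ahead, and that a state can have more than one optimal action.

```python
# Hypothetical 2x2 gridworld: reward -1 per move, goal in the bottom-right corner,
# no discounting. Moving into a wall leaves the agent where it is.
GOAL = (1, 1)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Optimal state values for this grid: minus the number of moves needed to reach the goal.
V_STAR = {(0, 0): -2.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 0.0}


def step(state, action):
    """Deterministic model of the environment: return (next_state, reward)."""
    row, col = state
    d_row, d_col = MOVES[action]
    nxt = (row + d_row, col + d_col)
    if nxt not in V_STAR:  # bumped into a wall
        nxt = state
    return nxt, -1.0


def greedy_actions(state, tol=1e-9):
    """One-step search: every action whose backed-up value attains the maximum."""
    backed_up = {}
    for action in MOVES:
        nxt, reward = step(state, action)
        backed_up[action] = reward + V_STAR[nxt]
    best = max(backed_up.values())
    return sorted(a for a, q in backed_up.items() if best - q <= tol)


for s in V_STAR:
    if s != GOAL:
        print(s, greedy_actions(s))
# (0, 0) ['down', 'right']
# (0, 1) ['down']
# (1, 0) ['right']
```

Any policy that chooses only among the listed actions in each state is optimal for this grid; from the top-left corner, going down first or right first are equally good.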
Having $Q^*$ makes choosing optimal actions still easier. With $Q^*$, the agent does not even
have to do a one-step-ahead search: for any state $s$, it can simply find any action that
maximizes $Q^*(s,a)$. The action-value function effectively caches the results of all
one-step-ahead searches. It provides the optimal expected long-term return as a value that is
locally and immediately available for each state-action pair. Hence, at the cost of
representing a function of state-action pairs, instead of just of states, the optimal
action-value function allows optimal actions to be selected without having to know anything
about possible successor states and their values, that is, without having to know anything
about the environment's dynamics.
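In code, this amounts to a table lookup followed by an argmax, with no transition model in sight. The sketch below assumes the optimal action values are already available as a nested dictionary; the state names, action names, and numbers in `Q_STAR` are purely illustrative.

```python
# Hypothetical optimal action values; the numbers are illustrative only.
Q_STAR = {
    "s1": {"a1": 1.2, "a2": 2.5},
    "s2": {"a1": 3.1, "a2": 2.4},
}


def optimal_action(state):
    """Pick an action with the highest optimal action value; no model of the dynamics is needed."""
    actions = Q_STAR[state]
    return max(actions, key=actions.get)


print(optimal_action("s1"))  # a2
print(optimal_action("s2"))  # a1
```

The trade-off is storage: the table has one entry per state-action pair rather than one per state.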
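Example 3.11: Bellman Optimality Equations for the Recycling Robot. Using (5), we
can explicitly give the Bellman optimality equation for the recycling robot example.
To make things more compact, we abbreviate the states high and low, and the
actions search, wait, and recharge respectively by h, l, s, w, and re. Since there are only
two states, the Bellman optimality equation consists of two equations. With the actions
s and w available in state h, the equation for $V^*(h)$ can be written as follows:
$$
V^*(h) = \max \left\{
\begin{array}{l}
\mathcal{P}^{s}_{hh} \left[ \mathcal{R}^{s}_{hh} + \gamma V^*(h) \right] + \mathcal{P}^{s}_{hl} \left[ \mathcal{R}^{s}_{hl} + \gamma V^*(l) \right], \\
\mathcal{P}^{w}_{hh} \left[ \mathcal{R}^{w}_{hh} + \gamma V^*(h) \right] + \mathcal{P}^{w}_{hl} \left[ \mathcal{R}^{w}_{hl} + \gamma V^*(l) \right]
\end{array}
\right\}.
$$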
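As a numerical sketch of this two-equation system, the code below iterates both optimality equations until they are jointly satisfied. The dynamics assumed here follow the usual textbook description of the recycling robot: searching from high keeps the battery high with probability alpha; searching from low keeps it low with probability beta and otherwise ends in a rescue with reward -3 and a recharged battery; waiting leaves the level unchanged; recharging returns to high with zero reward. All numeric values (alpha, beta, the two rewards, gamma) are hypothetical, since the text does not give any.

```python
# Assumed (hypothetical) parameter values; the source gives no numbers.
ALPHA, BETA = 0.8, 0.6        # probability that a search leaves the battery level unchanged
R_SEARCH, R_WAIT = 2.0, 1.0   # expected rewards for searching and for waiting
R_RESCUE = -3.0               # reward when the battery runs flat and the robot is rescued
GAMMA = 0.9                   # discount factor


def action_values(v_h, v_l):
    """Right-hand sides of the two optimality equations, one entry per available action."""
    q_h = {
        "s": ALPHA * (R_SEARCH + GAMMA * v_h) + (1 - ALPHA) * (R_SEARCH + GAMMA * v_l),
        "w": R_WAIT + GAMMA * v_h,
    }
    q_l = {
        "s": BETA * (R_SEARCH + GAMMA * v_l) + (1 - BETA) * (R_RESCUE + GAMMA * v_h),
        "w": R_WAIT + GAMMA * v_l,
        "re": GAMMA * v_h,
    }
    return q_h, q_l


def solve(tol=1e-10):
    """Fixed-point iteration: apply both equations until neither value changes."""
    v_h = v_l = 0.0
    while True:
        q_h, q_l = action_values(v_h, v_l)
        new_h, new_l = max(q_h.values()), max(q_l.values())
        if max(abs(new_h - v_h), abs(new_l - v_l)) < tol:
            return new_h, new_l
        v_h, v_l = new_h, new_l


v_h, v_l = solve()
q_h, q_l = action_values(v_h, v_l)
print(round(v_h, 3), round(v_l, 3))
print(max(q_h, key=q_h.get), max(q_l, key=q_l.get))  # best action in h, best action in l
```

With these assumed numbers the maximizing actions are search in h and recharge in l; different parameter choices can change which action attains the maximum.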
Explicitly solving the Bellman optimality equation provides one route to finding an
optimal policy, and thus to solving the reinforcement learning problem. However, this
solution is rarely directly useful. It is akin to an exhaustive search, looking ahead at all
possibilities, computing their probabilities of occurrence and their desirabilities in terms
of expected rewards. This solution relies on at least three assumptions that are rarely
true in practice: (1) we accurately know the dynamics of the environment; (2) we have
enough computational resources to complete the computation of the solution; and (3) the
Markov property. For the kinds of tasks in which we are interested, one is generally not
able to implement this solution exactly because various combinations of these
assumptions are violated. For example, although the first and third assumptions present
no problems for the game of backgammon, the second is a major impediment. Since the
game has about $10^{20}$ states, it would take thousands of years on today's fastest computers
to solve the Bellman equation for $V^*$, and the same is true for finding $Q^*$. In reinforcement
learning one typically has to settle for approximate solutions.
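A back-of-the-envelope calculation gives a feel for the scale. It assumes roughly 10^20 states (the figure the textbook quotes for backgammon) and a hypothetical machine that performs one billion state updates per second; both the throughput and the variable names are assumptions for illustration only.

```python
NUM_STATES = 10**20          # approximate number of backgammon states
BACKUPS_PER_SECOND = 10**9   # hypothetical throughput of one machine
SECONDS_PER_YEAR = 365.25 * 24 * 3600

seconds_per_sweep = NUM_STATES / BACKUPS_PER_SECOND
years_per_sweep = seconds_per_sweep / SECONDS_PER_YEAR
print(f"{years_per_sweep:.0f} years for a single pass over the state space")
```

Under these assumptions a single sweep over every state takes on the order of three thousand years, and iterative solution methods need many sweeps.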
An optimal value function maps each state to the expected return obtained by
starting in that state and following an optimal policy thereafter.
An optimal policy can be found by taking the greedy action in each state, where
the greedy action is the action with the highest one-step lookahead value under
the optimal state-value function or, equivalently, the highest optimal action value.
The optimal value function itself can be computed with dynamic programming
methods such as policy iteration and value iteration, both of which require a model
of the environment's dynamics.
Policy gradient methods take a different route: they adjust the policy directly
and do not need the optimal value function.
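Here are some examples of what optimal value functions might look like and how they
lead to optimal policies:
In a gridworld with a reward of -1 per step, the optimal value of each state might
be minus the number of steps on the shortest path to the goal; the greedy policy
follows that path.
(Figure: gridworld)
In a cartpole system with a reward of +1 for every step the pole stays up, the
optimal value might be highest for states in which the pole is near upright and
moving slowly; the greedy policy pushes the cart so as to stay in such states.
(Figure: cartpole)
In a frozen lake with a reward of 1 for reaching the goal and no discounting, the
optimal value of each state might be the probability of reaching the goal before
falling into a hole; the greedy policy picks the action that maximizes it.
(Figure: frozen lake)
In a mountain car system with a reward of -1 per step, the optimal value of each
state might be minus the fewest steps needed to reach the top of the hill; the
greedy policy builds momentum accordingly.
(Figure: mountain car system)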
Optimal value functions are important in reinforcement learning because they can be
used to find optimal policies.