Reinforcement Learning

Module 2
Dr. D. Sathian
SCOPE
Markov Decision Process

• The agent and environment interact at each of a sequence of discrete time steps,
t = 0, 1, 2, 3, . . .
• At each time step t, the agent receives some representation of the environment’s state,
St ∈ S, and on that basis selects an action, At ∈ A(s).
• One time step later, in part as a consequence of its action, the agent receives a
numerical reward, Rt+1 ∈ ℛ ⊂ ℝ, and finds itself in a new state, St+1.
• The MDP and agent together thereby give rise to a sequence or trajectory that begins
like this: S0,A0,R1,S1,A1,R2,S2,A2,R3,...
Markov Decision Process – 5 elements
• A set of states (S) the agent can be in.
• A set of actions (A) that can be performed by an agent, for moving from
one state to another.
• A transition probability (Pᵃₛ₁ₛ₂), which is the probability of moving from
one state to another state by performing some action.
• A reward probability ( Rᵃₛ₁ₛ₂), which is the probability of a reward acquired
by the agent for moving from one state to another state by performing
some action.
• A discount factor (γ), which controls the importance of immediate and
future rewards.
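A minimal sketch of how these five elements could be written down in Python for a tiny, made-up two-state problem (all names and numbers below are illustrative, not taken from the slides):

```python
states = ["s1", "s2"]                      # S: set of states
actions = ["a1", "a2"]                     # A: set of actions

# P[(s, a)] maps each next state s' to the transition probability P^a_{s s'}
P = {
    ("s1", "a1"): {"s1": 0.7, "s2": 0.3},
    ("s1", "a2"): {"s1": 0.1, "s2": 0.9},
    ("s2", "a1"): {"s1": 0.4, "s2": 0.6},
    ("s2", "a2"): {"s1": 0.0, "s2": 1.0},
}

# R[(s, a, s')]: reward for moving from s to s' by performing action a
R = {
    ("s1", "a1", "s1"): 0.0, ("s1", "a1", "s2"): 1.0,
    ("s1", "a2", "s1"): 0.0, ("s1", "a2", "s2"): 2.0,
    ("s2", "a1", "s1"): 5.0, ("s2", "a1", "s2"): 0.0,
    ("s2", "a2", "s1"): 0.0, ("s2", "a2", "s2"): -1.0,
}

gamma = 0.9                                # discount factor

# Sanity check: for every (s, a), the transition probabilities sum to 1
assert all(abs(sum(p.values()) - 1.0) < 1e-9 for p in P.values())
```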
Markov Chain
• The Markov chain is a probabilistic model in which the next state depends solely on the current
state and not on the previous states; that is, the future is conditionally independent
of the past.
• A Markov chain, studied at the discrete time points 0, 1, 2, . . ., is characterized
by a set of states S and the transition probabilities pij between the states.
• Moving from one state to another is called transition and its probability is called
a transition probability.
• Here, pij is the probability that the Markov chain is in state j at the next time
point, given that it is in state i at the present time point.
• The matrix P with elements pij is called the transition probability matrix of the
Markov chain.
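As a small illustration, the sketch below builds a made-up 3-state transition probability matrix with NumPy, checks that each row sums to 1, and looks at matrix powers, whose rows approach the chain's long-run (equilibrium) distribution; all numbers are arbitrary:

```python
import numpy as np

# Transition probability matrix P of a made-up 3-state Markov chain
# (entry p_ij = probability of moving from state i to state j in one step).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# Each row is a probability distribution over next states, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# n-step transition probabilities are given by matrix powers P^n.
P2 = np.linalg.matrix_power(P, 2)       # two-step transition probabilities
print(np.round(P2, 4))

# For a well-behaved chain (see the conditions on the next slide), the rows of
# P^n converge to the same vector: the equilibrium (stationary) distribution.
P100 = np.linalg.matrix_power(P, 100)
print(np.round(P100[0], 4))
```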
Markov Chain
• Note that the definition of the pij implies that the row sums of P are equal to 1.
• Under the conditions that
• all states of the Markov chain communicate with each other (i.e., it is possible to go
from each state, possibly in more than one step, to every other state),
• the Markov chain is not periodic (a periodic Markov chain is a chain in which, e.g.,
you can only return to a state in an even number of steps), and
• the Markov chain does not drift away to infinity,
the chain has a unique equilibrium (stationary) distribution.



Markov Chain
• Imagine a person standing at a point on an infinite straight line. At each time
step, the person can take a step either to the left or to the right with equal
probabilities.
• Let's define the states of the system as the positions along the line where the
person can be found. The state space, in this case, is the set of all integers,
representing different positions along the line.



Markov Chain
• Here are some transition probabilities for the person's movement:
• If the person is at position 'x', the probability of moving left to position 'x-1' is 0.5.
• If the person is at position 'x', the probability of moving right to position 'x+1' is 0.5.

• Since the person's future position depends solely on their current position and
not on any previous steps, this system follows the Markov property.
• Let's say the person starts at position 0. The probabilities for moving left or right
from position 0 are both 0.5. So, after the first time step, the person has equal
chances of being at position -1 or 1.



Markov Chain
• Let's illustrate the first few steps of the Markov chain:
Time Step 0: Starting position = 0
Time Step 1: Possible positions = −1, +1 (with probabilities 0.5 each)
Time Step 2: Possible positions = −2, 0, +2 (with probabilities 0.25, 0.5, 0.25)
Time Step 3: Possible positions = −3, −1, +1, +3 (with probabilities 0.125, 0.375, 0.375, 0.125) ... and so on.

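A short Monte Carlo sketch of this random walk (the step count and sample size are arbitrary); it estimates the distribution of positions after 3 steps, which should be close to the probabilities listed above:

```python
import random
from collections import Counter

def random_walk(steps: int) -> int:
    """Final position after `steps` moves of +1 or -1 from position 0."""
    position = 0
    for _ in range(steps):
        position += random.choice([-1, +1])   # left or right, each with probability 0.5
    return position

counts = Counter(random_walk(3) for _ in range(100_000))
total = sum(counts.values())
for pos in sorted(counts):
    print(f"position {pos:+d}: {counts[pos] / total:.3f}")
# Expected at time step 3: -3 and +3 with probability 0.125, -1 and +1 with probability 0.375
```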


Markov Chain
• The probabilities for each position keep spreading out as time progresses, and
the person's position can become increasingly positive or negative,
demonstrating the random walk behaviour of the Markov chain.
• The Random Walk is a fundamental example of a Markov chain, and it finds
applications in various fields, including physics, finance (Brownian motion), and
algorithms for simulating random processes.



Markov Chain
• Discrete time
• Finite or countable state space
• Memoryless



Markov Decision Process
• A Markov decision process (MDP) is a stochastic decision-making process: a
mathematical framework for modelling the decisions of a dynamic system in
scenarios where the outcomes are either random or controlled by a decision
maker who makes sequential decisions over time.
• Markov processes include both discrete-time Markov chains and
continuous-time Markov processes.



Markov Decision Process
• MDPs rely on variables such as the environment, agent’s actions, and
rewards to decide the system’s next optimal action.
• It is used in scenarios where the outcomes are either random or controlled by
a decision maker who makes sequential decisions over time.
• MDPs evaluate which actions the decision maker should take considering
the current state and environment of the system.
• They are classified into four types — finite, infinite, continuous, or discrete
— depending on various factors such as sets of actions, available states,
and the decision-making frequency.



Markov Decision Process
• The M/M/1 Queue is a widely studied stochastic process used to model a
single-server queue in a queuing system, such as customers waiting in line
at a service center.
• Assumptions:
• There is one server (i.e., one resource) to serve customers.
• The arrival of customers follows a Poisson process (exponential inter-arrival
times).
• The service times for each customer follow an exponential distribution.
• The system can hold an infinite number of customers in the queue.
Markov Decision Process
• Let's define the states of the system as the number of customers in the system
at any given time. The state space, in this case, is the set of non-negative
integers (0, 1, 2, ...).

• Transition Rates:
• Arrival Rate (λ): The rate at which customers arrive at the queue, following a Poisson
process.
• Service Rate (μ): The rate at which the server serves customers; service times are
exponentially distributed with mean 1/μ.



Markov Decision Process
• Transition Rates:
• If the system is in state 'i' (i.e., 'i' customers in the system), the rate at which a new
customer arrives is λ, and the rate at which a customer completes service and leaves
the system is μ. These rates determine the probabilities of transitioning between
different states.
• The rate of moving from state 'i' to state 'i+1' (i.e., from 'i' customers to 'i+1'
customers) is λ.
• The rate of moving from state 'i' to state 'i−1' (i.e., from 'i' customers to 'i−1'
customers), for i ≥ 1, is μ.



Markov Decision Process
• For example, if the system is currently in state 2 (two customers in the
system), the rate at which a new customer arrives is λ, and the rate at
which the server serves a customer is μ.
• The rate of moving from state 2 to state 3 (adding one more
customer to the queue) is λ, and the rate of moving from state 2 to
state 1 (reducing the queue by serving a customer) is μ.

• The M/M/1 Queue model is essential for analyzing and understanding queuing systems, resource
utilization, and performance measures such as average queue length, waiting time, and system
throughput. It has applications in various fields, including telecommunications, computer networks, and
transportation systems.
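The short simulation below is one way to sanity-check this model; it tracks the number of customers in an M/M/1 queue with illustrative rates (λ = 2, μ = 3, chosen so the queue is stable) and compares the simulated average number in the system with the textbook value ρ/(1 − ρ):

```python
import random

lam, mu = 2.0, 3.0          # illustrative arrival rate λ and service rate μ (λ < μ)
t, t_end = 0.0, 100_000.0
n = 0                       # current number of customers in the system
area = 0.0                  # time-integral of n, used for the time-average

while t < t_end:
    # In state n the next event is an arrival (rate λ) or, if n > 0,
    # a service completion (rate μ); the total event rate is their sum.
    rate = lam + (mu if n > 0 else 0.0)
    dt = random.expovariate(rate)        # exponential time until the next event
    area += n * dt
    t += dt
    if random.random() < lam / rate:
        n += 1                           # arrival: state i -> i+1
    else:
        n -= 1                           # departure: state i -> i-1

rho = lam / mu
print("simulated mean number in system :", round(area / t, 3))
print("theoretical mean number, ρ/(1-ρ):", round(rho / (1 - rho), 3))
```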



Markov Decision Process
• Continuous-Time Markov Process Example: Radioactive Decay
• Consider a sample of radioactive material that undergoes a decay process
over time. The decay of each radioactive atom occurs randomly, and the
time it takes for each atom to decay follows an exponential distribution.
• Assumptions:
• Each radioactive atom has a constant decay rate, denoted by λ (lambda).
• The decay of each atom is independent of others, and there is no influence
from past decay events.



Markov Decision Process
• State Space: In this example, the state space is continuous and represents
the amount of radioactive material remaining at any given time. It can be
represented by non-negative real numbers.
• Transition Probabilities: The probability density function for the time until
an individual radioactive atom decays follows an exponential distribution
with a decay rate λ.
• This means that the probability of an atom decaying in a small time
interval 'dt' is approximately λ * dt.
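The small check below samples exponential decay times with an illustrative rate λ and confirms numerically that the probability of decaying within a small interval dt is approximately λ·dt:

```python
import math
import random

lam = 0.3            # illustrative decay rate (per unit time)
dt = 0.01            # small time interval
samples = 1_000_000

# Fraction of sampled atoms whose (exponentially distributed) decay time falls within dt.
frac = sum(random.expovariate(lam) < dt for _ in range(samples)) / samples

print("empirical P(decay within dt):", round(frac, 5))
print("first-order approx  λ * dt  :", lam * dt)
print("exact 1 - exp(-λ * dt)      :", round(1 - math.exp(-lam * dt), 5))
```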



Markov Decision Process
• In summary, the key difference between a Markov chain and a Markov
process is that:
• a Markov chain is a specific type of discrete-time Markov process with a
finite or countable state space, whereas a Markov process encompasses a
broader class of stochastic processes, including both discrete and
continuous-time models.
• The term "Markov process" is often used to refer to any stochastic process
that possesses the Markov property.



Rewards
• In reinforcement learning, the purpose or goal of the agent is formalized in
terms of a special signal, called the reward, passing from the environment
to the agent.
• At each time step, the reward is a simple number, Rt ∈ ℝ.
• Informally, the agent’s goal is to maximize the total amount of reward it
receives. This means maximizing not immediate reward, but cumulative
reward in the long run.
• Role:
• Reward Signal as Feedback
• Immediate vs. Delayed Rewards
• Motivation and Learning Signal
Reward signal as feedback
• In reinforcement learning, rewards serve as feedback to the agent's
actions.
• Rewards indicate how well the agent is performing a given task or
interacting with its environment.
• Positive rewards encourage the agent to repeat certain actions, while
negative rewards discourage undesirable actions.



Immediate vs Delayed Rewards
• Immediate rewards provide instant feedback after each action, guiding the
agent's short-term decisions.
• Delayed rewards are rewards that are obtained after a sequence of actions
and decisions. They require the agent to consider the long-term
consequences of its actions.



Motivation & Learning Signal
• Rewards act as the motivation for the agent to learn and improve its
behaviour over time.
• The reward signal provides a learning signal that helps the agent update its
policy or value function to make better decisions.
• By optimizing for higher cumulative rewards, the agent learns to make
choices that lead to achieving its goals more effectively.



Components of Rewards
• Reward Function:
• A reward function maps states and actions to numerical values,
quantifying the desirability of outcomes.
• It defines the goal and objectives of the reinforcement learning problem.
• The reward function guides the agent's decision-making process by
assigning values to different states and actions.
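As a sketch, a reward function for a hypothetical grid-navigation task might look like the following; the goal/pit coordinates and reward values are made up for illustration:

```python
GOAL = (3, 3)     # hypothetical goal cell
PIT = (3, 2)      # hypothetical pit cell

def reward(state, action, next_state) -> float:
    """Map a (state, action, next_state) outcome to a numerical reward."""
    if next_state == GOAL:
        return +1.0       # reaching the goal is desirable
    if next_state == PIT:
        return -1.0       # falling into the pit is undesirable
    return -0.04          # small step cost encourages shorter paths

print(reward((2, 3), "right", GOAL))    # +1.0
print(reward((0, 0), "up", (0, 1)))     # -0.04
```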



Components of Rewards
• Cumulative Reward (Return):
• Cumulative reward, also known as return, is the sum of rewards obtained
over a sequence of actions.
• It reflects the overall performance of the agent in achieving its long-term
objectives.
• Agents aim to maximize the cumulative reward by learning optimal
policies or strategies.



Components of Rewards
• Discount Factor (Gamma):
• The discount factor, denoted as γ (gamma), determines the importance of
future rewards relative to immediate rewards.
• The discount factor essentially determines how much the reinforcement
learning agent cares about rewards in the distant future relative to those
in the immediate future.
• It controls the agent's preference for short-term versus long-term rewards.
• γ ∈ [0, 1], where higher values give more weight to future rewards.
• The discounted return is
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ···
Components of Rewards
• Exploration-Exploitation Trade-off:
• Rewards influence the exploration-exploitation trade-off—the balance
between trying new actions (exploration) and exploiting known actions
(exploitation).
• High immediate rewards may lead to exploitation, while exploration helps
discover better actions.
• Well-designed reward functions encourage balanced exploration and
exploitation for effective learning.



Types of Rewards

• Dense vs. Sparse Rewards:


• Dense rewards provide feedback for nearly every action, facilitating faster
learning.
• Sparse rewards offer feedback infrequently, leading to slower convergence
and exploration challenges.
• Careful design of dense or shaped rewards can mitigate the impact of
sparse rewards.



Types of Rewards

• Shaping Rewards:
• Reward shaping involves modifying the reward signal to guide the agent's
behaviour.
• It can accelerate learning by providing intermediate objectives or
encouraging desired subtasks.
• Shaped rewards should align with the ultimate goal to avoid unintended
behaviours.



Types of Rewards

• Intrinsic vs. Extrinsic Rewards:


• Extrinsic rewards come from the environment and are directly related to
the task at hand.
• Intrinsic rewards are internally generated by the agent, promoting
curiosity and skill development.
• Balancing both types of rewards enhances the agent's learning and
exploration.



Types of Rewards
• Example: Consider a reinforcement learning agent training a robot to walk:
• Dense rewards could provide feedback based on posture and motion,
aiding quick learning.
• A shaped reward might encourage the robot to move forward
incrementally, promoting stability.
• Intrinsic rewards could be generated for novel actions, encouraging
exploration.



Returns
• The sequence of rewards received after time step t is denoted
Rt+1, Rt+2, Rt+3, . . .
• In general, we seek to maximize the expected return, where the return,
denoted Gt, is defined as some specific function of the reward sequence.
• Episodic Task: interaction breaks naturally into episodes, e.g., plays of a
game, trips through a maze
• In episodic tasks, we almost always use simple total reward:
Gt = Rt+1 + Rt+2 + Rt+3 +···+ RT
• where T is a final time step at which a terminal state is reached, ending an
episode.
Returns
• Each episode ends in a special state called the terminal state, followed by a
reset to a standard starting state or to a sample from a standard
distribution of starting states.



Returns
• Even if you think of episodes as ending in different ways, such as winning
and losing a game, the next episode begins independently of how the
previous one ended.
• Thus the episodes can all be considered to end in the same terminal state,
with different rewards for the different outcomes.
• Tasks with episodes of this kind are called episodic tasks.
• In episodic tasks we sometimes need to distinguish the set of all
nonterminal states, denoted S, from the set of all states plus the terminal
state, denoted S+.
• T (time of termination), is a random variable that normally varies from
episode to episode.
Returns
• On the other hand, in many cases the agent–environment interaction does not
break naturally into identifiable episodes, but goes on continually without limit.
• For example, this would be the natural way to formulate an on-going process-
control task, or an application to a robot with a long life span.
• These are known as continuing tasks.
• T = ∞
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ··· = Σ_{k=0}^∞ γ^k Rt+k+1
• where γ, 0 ≤ γ ≤ 1, is the discount rate.
• shortsighted 0 ← γ → 1 farsighted; typically, γ = 0.9

• The discount rate determines the present value of future rewards: a reward
received k time steps in the future is worth only γ^(k−1) times what it would be
worth if it were received immediately.
Returns
Gt = Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + ···
   = Rt+1 + γ(Rt+2 + γRt+3 + γ²Rt+4 + ···)
   = Rt+1 + γGt+1
• This works for all time steps t < T , even if termination occurs at t + 1, if we
define GT = 0. This often makes it easy to compute returns from reward
sequences.
• Note that although the return (in the previous slide) is a sum of an infinite number
of terms, it is still finite if the reward is nonzero and constant, provided γ < 1. If the
reward is a constant +1, then the return is
Gt = Σ_{k=0}^∞ γ^k = 1 / (1 − γ)



Example: Pole Balancing
• The objective in this task is to apply
forces to a cart moving along a track
so as to keep a pole hinged to the cart
from falling over:
• A failure is said to occur if the pole falls
past a given angle from vertical or if the
cart runs off the track.
• The pole is reset to vertical after each
failure.



Example: Pole Balancing
• This task could be treated as episodic,
where the natural episodes are the
repeated attempts to balance the pole.
• The reward in this case could be +1 for
every time step on which failure did not
occur, so that the return at each time
would be the number of steps until
failure.
• Successful balancing forever would
mean a return of infinity.



Example: Pole Balancing
• Alternatively, we could treat this task as
a continuing task, using discounting.
• In this case the reward would be -1 on
each failure and zero at all other times.
• The return at each time would then be
related to −γ^k, where k is the number of
time steps before failure.

• In either case, the return is maximized
by keeping the pole balanced for as long
as possible.
Example: Pole Balancing
• Suppose if we treated pole-balancing as
an episodic task but also used
discounting, with all rewards zero except
for -1 upon failure.
• What then would the return be at each
time?
• How does this return differ from that in
the discounted, continuing formulation
of this task?



Example: Pole Balancing
• Answer:
• In the episodic task setting with
discounting, where all rewards are zero
except for -1 upon failure, the return at
each time step would be the sum of
discounted rewards from that time step
until the episode terminates.
• The return at each time step can be
expressed as:
Return = −γ^k
Example: Pole Balancing
• Answer:
• In the continuing formulation, the discount
factor is helpful: it prevents the return
from blowing up to infinity.
• In the episodic formulation, however, we do not
need to worry about an infinite return, so using a
discount factor in this particular episodic
setting seems counterproductive:

• since our goal is to make the balancing last as long as possible, we should make the reward
of each additional time step equally valuable, or even make each additional time step
that the balance lasts more valuable than the one before it; we certainly do not want to
discount the value of an additional time step, as that discourages longer balancing.
Example: Maze Escape
• Imagine that you are designing a robot to run a maze. You decide to give
it a reward of +1 for escaping from the maze and a reward of zero at all
other times. The task seems to break down naturally into episodes–the
successive runs through the maze–so you decide to treat it as an episodic
task, where the goal is to maximize expected total reward. After running
the learning agent for a while, you find that it is showing no
improvement in escaping from the maze. What is going wrong? Have you
effectively communicated to the agent what you want it to achieve?



Example: Maze Escape
• Answer:
• The problem lies in the fact that the reward of +1 for escaping the maze and a reward
of zero at all other times doesn't provide sufficient guidance or feedback to the
learning agent.
• While you've defined the goal of the task (escaping the maze), the agent lacks the
necessary information to learn how to achieve that goal effectively.
• Without any intermediary rewards or feedback during the maze navigation process, the
agent has no way to discern which actions or strategies lead it closer to the goal of
escaping the maze.
• It's essentially operating in a "sparse reward" environment, where the reward signal
only appears at the end of an episode and provides no guidance on how to get there.



Example: Maze Escape
• Answer:
• We want to train the agent to escape from the maze as quickly as possible, but the agent
understands the reward as: as long as it eventually escapes, it does not matter
how long it takes.
• To communicate what we want effectively, we need to add a small negative reward for
each step the agent takes.



Example:
• Suppose 𝛾 = 0.5 and the following sequence of rewards is received R1 = -1, R2 =2, R3
=6,R4 =3, and R5 =2, with T =5. What are G0, G1,..., G5?

Gt= Rt+1 + 𝛾 Gt+1


G5= R6+ R7 + …… = 0
G4= R5+ 𝛾 G5 = 2 + (0.5 * 0) = 2
G3= R4+ 𝛾 G4 = 3 + (0.5 * 2) = 4
G2= R3+ 𝛾 G3 = 6 + (0.5 * 4) = 8
G1= R2+ 𝛾 G2 = 2 + (0.5 * 8) = 6
G0= R1+ 𝛾 G1 = -1 + (0.5 *6) = 2
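The same computation can be checked with a few lines of Python, working backwards from G5 = 0 using Gt = Rt+1 + γGt+1:

```python
gamma = 0.5
R = {1: -1, 2: 2, 3: 6, 4: 3, 5: 2}    # R1 ... R5, with T = 5
T = 5

G = {T: 0.0}                           # G_T = 0 at the terminal time
for t in range(T - 1, -1, -1):         # t = 4, 3, 2, 1, 0
    G[t] = R[t + 1] + gamma * G[t + 1]

print([G[t] for t in range(T + 1)])    # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```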
Example:
• Suppose γ = 0.5 and the reward sequence is an infinite sequence of 1s. What is G?

We know that if the reward is an infinite series of 1s,
Gt = Σ_{k=0}^∞ γ^k · 1 = 1 / (1 − γ) = 1 / 0.5 = 2


Example:
• Suppose γ = 0.5 and the reward sequence is R1 = 1 followed by an infinite sequence
of 13s. What are G2 and G0?

We know that R2 = R3 = R4 = ··· = 13

G2 = 13 / (1 − 0.5) = 26

G1 = 13 / (1 − 0.5) = 26

G0 = R1 + γG1
   = 1 + (0.5 * 26) = 14



Unified Notation for Episodic and Continuing Tasks
• In episodic tasks, we number the time steps of each episode starting from zero.
• We usually do not have to distinguish between episodes, so instead of writing St,i for
states in episode i, we write just St
• We have defined the return as a sum over a finite number of terms in one case (Gt =
Rt+1 + Rt+2 + Rt+3 + ··· + RT) and as a sum over an infinite number of terms in the other
(Gt = Σ_{k=0}^∞ γ^k Rt+k+1).
• These two can be unified by considering episode termination to be the entering of a
special absorbing state that transitions only to itself and that generates only rewards of
zero.



Unified Notation for Episodic and Continuing Tasks

• Starting from S0, we get the reward sequence +1, +1, +1, 0, 0, 0, . . ..
• Summing these, we get the same return whether we sum over the first T rewards (T =
3) or over the full infinite sequence.
• This remains true even if discounting is applied.
• So we can cover all cases by:
Gt = Σ_{k=t+1}^{T} γ^(k−t−1) Rk
• This is applicable for T = ∞ or γ = 1 (but not both)


Policies & Value Function
• A policy is a mapping from states to probabilities of selecting each possible action.
• It defines the agent’s behaviour. It can be either deterministic or stochastic:
• Deterministic: π(s) = a
• Stochastic: π(a|s) = Pπ[At = a | St = s]
• If the agent is following policy π at time t, then π(a|s) is the probability that At=a if St=s
• “|” in the middle of π(a|s) merely reminds that it defines a probability distribution over
a ∈ A(s) for each s ∈ S.

• A policy fully defines the behaviour of an agent


• MDP policies depend on the current state (not the history)
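A small sketch of the two kinds of policies, for a made-up state and action space (names and probabilities are illustrative):

```python
import random

states = ["s1", "s2"]
actions = ["left", "right"]

# Deterministic policy: π(s) = a
det_policy = {"s1": "right", "s2": "left"}

# Stochastic policy: π(a|s) = probability of selecting a in s
stoch_policy = {
    "s1": {"left": 0.2, "right": 0.8},
    "s2": {"left": 0.5, "right": 0.5},
}

def sample_action(policy: dict, state: str) -> str:
    """Sample At ~ π(·|St) from a stochastic policy."""
    probs = policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(det_policy["s1"])                    # always 'right' in s1
print(sample_action(stoch_policy, "s1"))   # 'right' about 80% of the time
```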



Policies
• One-step dynamics:
p(s′, r | s, a) = Pr{St = s′, Rt = r | St−1 = s, At−1 = a}
• This can also be written as:
p(s′ | s, a) = Pr{St = s′ | St−1 = s, At−1 = a} = Σr p(s′, r | s, a)
• p is the transition probability: if we start in state s and take action a, we end up in
state s′ with probability p(s′|s,a)



Recycling bot example: Search & Collect cans
• High-level decisions are made by a RL agent based on the current charge level of the
battery.
• S = {high, low} → 2 charge levels
• In each state, the agent can decide whether to
• actively search for a can for a certain period of time,
• remain stationary and wait for someone to bring it a can, or
• head back to its home base to recharge its battery.
• Action set → A(high) = {search, wait} and A(low) = {search, wait, recharge}.
• The rewards are zero most of the time, but become positive when the robot secures an
empty can, or large and negative if the battery runs all the way down.



Recycling bot example: Search & Collect cans
• Searching runs down the robot’s battery, whereas waiting does not.
• Whenever the robot is searching, there is a possibility that its battery will become
depleted. In this case the robot must shut down and wait to be rescued (producing a
low reward).
• If the energy level is high, then the robot can search for a long period of time
without the risk of depleting the battery.
• A period of searching that begins with a high energy level leaves the energy level high
with probability 𝞪 and reduces it to low with probability 1 - 𝞪.
• A period of searching undertaken when the energy level is low leaves it low with
probability β and depletes the battery with probability 1- β . (Robot must be rescued)



Recycling bot example: Search & Collect cans
• Let rsearch and rwait denote, respectively, the expected number of cans the robot will
collect (and hence the expected reward) while searching and while waiting. (rsearch >
rwait)
• No cans can be collected during a run home for recharging, or on a step in which the
battery is depleted.
• This system is a finite MDP
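One way to write down this MDP's one-step dynamics is sketched below; it follows the transitions described above (assuming the rescued robot ends up recharged, i.e., back in the high state), but α, β, rsearch, rwait and the rescue penalty are illustrative numbers, not values given in the slides:

```python
alpha, beta = 0.8, 0.6
r_search, r_wait = 2.0, 1.0    # expected reward (cans) while searching / waiting
r_rescue = -3.0                # assumed large negative reward when the battery runs down

# dynamics[(s, a)] -> list of (probability, next_state, reward)
dynamics = {
    ("high", "search"):   [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("low",  "search"):   [(beta,  "low",  r_search), (1 - beta,  "high", r_rescue)],
    ("high", "wait"):     [(1.0, "high", r_wait)],
    ("low",  "wait"):     [(1.0, "low",  r_wait)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}

# Marginalising out the reward gives p(s'|s,a), as in the one-step dynamics slide.
def p_next(s: str, a: str, s_next: str) -> float:
    return sum(p for p, sn, r in dynamics[(s, a)] if sn == s_next)

print(p_next("high", "search", "low"))    # 1 - alpha = 0.2
```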





Value Function
• There are two types of value functions:
• state value function vπ(s)
• action value function q π(s,a)

• The value function of a state s under a policy π, denoted vπ(s), is the expected return
when starting in s and following π thereafter.
• For MDPs, we can define vπ formally by
vπ(s) = Eπ[Gt | St = s] = Eπ[ Σ_{k=0}^∞ γ^k Rt+k+1 | St = s ], for all s ∈ S
• where Eπ[·] denotes the expected value of a random variable given that the agent
follows policy π, and t is any time step.
• vπ is called the state-value function for policy π
Value Function
• We define the value of taking action a in state s under a policy π, denoted qπ(s,a), as
the expected return starting from s, taking the action a, and thereafter following policy
π:
qπ(s, a) = Eπ[Gt | St = s, At = a] = Eπ[ Σ_{k=0}^∞ γ^k Rt+k+1 | St = s, At = a ]
• where qπ the action-value function for policy π



Bellman’s equation
• Richard Bellman was an American applied mathematician who derived the following
equations, which allow us to start solving these MDPs. The Bellman equations are
ubiquitous in RL and are necessary to understand how RL algorithms work.
• For the state-value function, the Bellman equation is
vπ(s) = Σa π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ], for all s ∈ S

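A sketch of iterative policy evaluation, which applies the Bellman equation above repeatedly as an update rule; it reuses the tiny, made-up two-state MDP format from the earlier sketch and an equiprobable random policy (all numbers are illustrative):

```python
states = ["s1", "s2"]
actions = ["a1", "a2"]
gamma = 0.9

# P[(s, a)][s'] = transition probability, R[(s, a, s')] = reward (illustrative numbers)
P = {("s1", "a1"): {"s1": 0.7, "s2": 0.3}, ("s1", "a2"): {"s1": 0.1, "s2": 0.9},
     ("s2", "a1"): {"s1": 0.4, "s2": 0.6}, ("s2", "a2"): {"s1": 0.0, "s2": 1.0}}
R = {("s1", "a1", "s1"): 0.0, ("s1", "a1", "s2"): 1.0,
     ("s1", "a2", "s1"): 0.0, ("s1", "a2", "s2"): 2.0,
     ("s2", "a1", "s1"): 5.0, ("s2", "a1", "s2"): 0.0,
     ("s2", "a2", "s1"): 0.0, ("s2", "a2", "s2"): -1.0}
pi = {s: {a: 1.0 / len(actions) for a in actions} for s in states}   # π(a|s) = 0.5

V = {s: 0.0 for s in states}
for _ in range(1000):   # sweep the Bellman equation until the values settle
    V = {s: sum(pi[s][a] * sum(P[(s, a)][s2] * (R[(s, a, s2)] + gamma * V[s2])
                               for s2 in states)
                for a in actions)
         for s in states}

print({s: round(v, 3) for s, v in V.items()})   # approximate v_π for the random policy
```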


Bellman’s equation
• Backup Diagram for v π

• Starting state s, the root node at the top, the agent could take any of the set of actions
(three are shown in the diagram)—based on its policy π.
• From each of these, the environment could respond with one of several next states, s’
(two are shown in the figure), along with a reward, r, depending on its dynamics given
by the function p.
• The Bellman equation averages over all the possibilities, weighting each by its
probability of occurring. It states that the value of the start state must equal the
(discounted) value of the expected next state, plus the reward expected along the way.
Optimal Policies and Optimal Value Functions
• Solving a reinforcement learning task means, roughly, finding a policy that achieves a
lot of reward over the long run.
• For finite MDPs, optimal policy can be defined precisely.
• A policy π is defined to be better than or equal to a policy π′ if its expected return is
greater than or equal to that of π′ for all states. In other words, π ≥ π′ if and only if
• vπ(s) ≥ vπ′(s) for all s ∈ S.
• There is always at least one policy that is better than or equal to all other policies. This
is an optimal policy.



Optimal Policies and Optimal Value Functions
• There may be more than one, and all the optimal policies are denoted by π*.
• They share the same state-value function, called the optimal state-value function,
denoted v*, and defined as
v*(s) = maxπ vπ(s), for all s ∈ S
• Optimal policies also share the same optimal action-value function, denoted q*, and
defined as
q*(s, a) = maxπ qπ(s, a), for all s ∈ S and a ∈ A(s)
• For the state–action pair (s,a), this function gives the expected return for taking action
a in state s and thereafter following an optimal policy. Thus, we can write q* in terms of
v* as follows:
q*(s, a) = E[ Rt+1 + γ v*(St+1) | St = s, At = a ]





Example: Golf
• Conditions:
• -1 reward for each stroke until the goal.
• State is the location of the ball.
• Value of a state is the negative of the number of
strokes to the hole from that location.
• Choice of club: putter or driver
• Putter: slow & precise shot
• Driver: powerful and long shot
• terminal state in-the-hole has a value of 0
• From anywhere on the green region, we assume we
can make a putt; these states have value -1
Example: Golf
• Off the green, we cannot reach the hole by putting,
and the value is greater.
• If we can reach the green from a state by putting,
then that state must have value one less than the
green’s value, that is, -2.
• all locations between that line (-2) and the green
require exactly two strokes to complete the hole
• Putting doesn’t get us out of sand traps, so they
have a value of -∞
• it takes us six strokes to get from the tee to the hole
by putting.



Example: Golf
• With the driver, the ball can be hit out of the sand.



Example: Golf
• Bellman’s optimality equations:
v*(s) = maxa Σ_{s′,r} p(s′, r | s, a) [ r + γ v*(s′) ]
q*(s, a) = Σ_{s′,r} p(s′, r | s, a) [ r + γ maxa′ q*(s′, a′) ]
• The backup diagram shows graphically the spans of
future states and actions considered in the Bellman
optimality equations for v* and q*

Backup diagrams for v* and q*
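A sketch of value iteration, which turns the Bellman optimality equation for v* into an update rule; it reuses the same illustrative two-state MDP as the earlier sketches (none of the numbers come from the slides):

```python
states = ["s1", "s2"]
actions = ["a1", "a2"]
gamma = 0.9

P = {("s1", "a1"): {"s1": 0.7, "s2": 0.3}, ("s1", "a2"): {"s1": 0.1, "s2": 0.9},
     ("s2", "a1"): {"s1": 0.4, "s2": 0.6}, ("s2", "a2"): {"s1": 0.0, "s2": 1.0}}
R = {("s1", "a1", "s1"): 0.0, ("s1", "a1", "s2"): 1.0,
     ("s1", "a2", "s1"): 0.0, ("s1", "a2", "s2"): 2.0,
     ("s2", "a1", "s1"): 5.0, ("s2", "a1", "s2"): 0.0,
     ("s2", "a2", "s1"): 0.0, ("s2", "a2", "s2"): -1.0}

def q(V, s, a):
    """One-step lookahead: expected return of taking a in s and then following V."""
    return sum(P[(s, a)][s2] * (R[(s, a, s2)] + gamma * V[s2]) for s2 in states)

V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: max(q(V, s, a) for a in actions) for s in states}   # v*(s) = max_a q(s, a)

pi_star = {s: max(actions, key=lambda a: q(V, s, a)) for s in states}   # greedy policy
print({s: round(v, 3) for s, v in V.items()}, pi_star)
```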



Example: GridWorld

[Grid figure: a gridworld with a +1 terminal state and a −1 terminal state.]


Example: GridWorld

[Grid figure: each non-terminal state shown has value V = 1; the +1 and −1 cells are the terminal states.]


Example: GridWorld

[Grid figure: the same gridworld with +1 and −1 terminal states.]


Example: GridWorld

[Grid figure: state values computed with γ = 0.9]
Top row: V = 0.81, 0.9, 1, and the +1 terminal
Middle row: V = 0.73, 0.9, and the −1 terminal
Bottom row: V = 0.66, 0.73, 0.81, 0.73


Example: GridWorld
• Q: The Bellman equation must hold for each state for the value function vπ shown in
the figure. Show numerically that this equation holds for the center state, valued at
+0.7, with respect to its four neighboring states, valued at +2.3, +0.4, −0.4, and +0.7.
(These numbers are accurate only to one decimal place.)
• Here γ = 0.9, each of the four neighboring states is reached with probability 0.25
(the equiprobable random policy with deterministic moves), and the reward is 0 on
these transitions.
• vπ(center) = 0.25 * (0 + 0.9 * 2.3) +
  0.25 * (0 + 0.9 * 0.4) +
  0.25 * (0 − 0.9 * 0.4) +
  0.25 * (0 + 0.9 * 0.7)
  = 0.5175 + 0.09 − 0.09 + 0.1575 = 0.675 ≅ 0.7

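The arithmetic can be confirmed in a couple of lines (γ = 0.9, each neighbour weighted by 0.25, reward 0 on these transitions):

```python
gamma = 0.9
neighbour_values = [2.3, 0.4, -0.4, 0.7]   # values of the four neighbouring states

v_center = sum(0.25 * (0 + gamma * v) for v in neighbour_values)
print(round(v_center, 4))    # 0.675, which agrees with +0.7 to one decimal place
```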