Module 2
Dr. D. Sathian
SCOPE
Markov Decision Process
• The agent and environment interact at each of a sequence of discrete time steps,
t = 0, 1, 2, 3, . . .
• At each time step t, the agent receives some representation of the environment’s state,
St ∈ S, and on that basis selects an action, At ∈ A(s).
• One time step later, in part as a consequence of its action, the agent receives a
numerical reward, Rt+1 ∈ ℛ ⊂ ℝ, and finds itself in a new state, St+1.
• The MDP and agent together thereby give rise to a sequence or trajectory that begins
like this: S0,A0,R1,S1,A1,R2,S2,A2,R3,...
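As a rough illustration of this interaction loop, here is a minimal Python sketch; the `env` and `policy` objects and their methods are hypothetical placeholders, not from any particular library:

```python
# Minimal sketch of the agent-environment interaction in an MDP.
# `env` (with reset/step) and `policy` are hypothetical placeholders.

def run_episode(env, policy, max_steps=100):
    """Generate a trajectory S0, A0, R1, S1, A1, R2, ..."""
    trajectory = []
    state = env.reset()                               # S0
    for t in range(max_steps):
        action = policy(state)                        # At, chosen from A(St)
        next_state, reward, done = env.step(action)   # Rt+1, St+1
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```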
Markov Decision Process – 5 elements
• A set of states (S) the agent can be in.
• A set of actions (A) that the agent can perform to move from one state to another.
• A transition probability (Pᵃₛ₁ₛ₂), which is the probability of moving from one state (s₁) to another state (s₂) by performing some action (a).
• A reward probability (Rᵃₛ₁ₛ₂), which gives the reward the agent expects to acquire for moving from one state (s₁) to another state (s₂) by performing some action (a).
• A discount factor (γ), which controls the importance of immediate and
future rewards.
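One way to make these five elements concrete is to spell out a toy MDP in Python; the states, actions, probabilities, and rewards below are invented purely for illustration:

```python
# A tiny, hand-made MDP illustrating the five elements (S, A, P, R, gamma).
states  = ["s1", "s2"]
actions = ["left", "right"]

# P[(s, a)] -> list of (next_state, probability)
P = {
    ("s1", "right"): [("s2", 0.9), ("s1", 0.1)],
    ("s1", "left"):  [("s1", 1.0)],
    ("s2", "right"): [("s2", 1.0)],
    ("s2", "left"):  [("s1", 0.8), ("s2", 0.2)],
}

# R[(s, a, s')] -> expected reward for that transition (unlisted transitions give 0)
R = {
    ("s1", "right", "s2"): +1.0,
    ("s2", "left",  "s1"): -1.0,
}

gamma = 0.9  # discount factor: weight of future vs. immediate rewards
```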
Markov Chain
• A Markov chain is a probabilistic model in which the next state depends solely on the current
state and not on the previous states; that is, the future is conditionally independent
of the past.
• A Markov chain, studied at the discrete time points 0, 1, 2, . . ., is characterized
by a set of states S and the transition probabilities pij between the states.
• Moving from one state to another is called transition and its probability is called
a transition probability.
• Here, pij is the probability that the Markov chain is in state j at the next time point,
given that it is in state i at the present time point.
• The matrix P with elements pij is called the transition probability matrix of the
Markov chain.
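As a sketch, a transition probability matrix can be written as a NumPy array and used to evolve a distribution over states one step at a time; the two-state chain below is an invented example:

```python
import numpy as np

# Transition probability matrix P for a hypothetical 2-state Markov chain
# (states: 0 and 1); each row sums to 1.
P = np.array([
    [0.8, 0.2],   # p00, p01
    [0.4, 0.6],   # p10, p11
])

# Distribution over states at time 0: start in state 0 with certainty.
dist = np.array([1.0, 0.0])

# One step of the chain: the distribution at time t+1 is dist @ P.
for t in range(3):
    dist = dist @ P
    print(f"t={t+1}, distribution over states: {dist}")
```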
Markov Chain
• Example (simple random walk): a person stands on an integer number line and, at each time
step, moves one position to the left or to the right with equal probability.
• Since the person's future position depends solely on their current position and
not on any previous steps, this system satisfies the Markov property.
• Say the person starts at position 0. The probabilities of moving left or right
from position 0 are both 0.5, so after the first time step the person is equally
likely to be at position -1 or position +1.
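A minimal simulation of this random walk (NumPy is used only for the coin flips; the number of steps is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulate a simple symmetric random walk starting at position 0.
position = 0
for t in range(10):
    step = rng.choice([-1, +1])   # move left or right with probability 0.5 each
    position += step
    print(f"t={t+1}, position={position}")
```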
• Example (M/M/1 queue): a single-server queue whose state, the number of customers in the
system, evolves as a Markov chain characterized by two transition rates:
• Arrival Rate (λ): The rate at which customers arrive at the queue, following a Poisson
process.
• Service Rate (μ): The rate at which the server serves customers; service times are
exponentially distributed with rate μ.
• The M/M/1 Queue model is essential for analyzing and understanding queuing systems, resource
utilization, and performance measures such as average queue length, waiting time, and system
throughput. It has applications in various fields, including telecommunications, computer networks, and
transportation systems.
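The standard closed-form M/M/1 results (utilization ρ = λ/μ, mean number in system L = ρ/(1 − ρ), mean time in system via Little's law W = L/λ) can be checked with a short script; the numeric rates below are arbitrary example values:

```python
# Closed-form M/M/1 performance measures for illustrative rates.
lam = 3.0   # arrival rate (customers per unit time), arbitrary example value
mu  = 5.0   # service rate (customers per unit time), arbitrary example value

rho = lam / mu                 # server utilization (must be < 1 for stability)
L   = rho / (1 - rho)          # mean number of customers in the system
W   = L / lam                  # mean time in system (Little's law: L = lam * W)
Lq  = rho**2 / (1 - rho)       # mean number waiting in the queue
Wq  = Lq / lam                 # mean waiting time in the queue

print(f"utilization rho = {rho:.2f}")
print(f"mean number in system L = {L:.2f}, mean time in system W = {W:.2f}")
print(f"mean queue length Lq = {Lq:.2f}, mean waiting time Wq = {Wq:.2f}")
```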
• Shaping Rewards:
• Reward shaping involves modifying the reward signal to guide the agent's
behaviour.
• It can accelerate learning by providing intermediate objectives or
encouraging desired subtasks.
• Shaped rewards should align with the ultimate goal to avoid unintended
behaviours.
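One common concrete form of reward shaping (not necessarily the one intended here) is potential-based shaping, where a term γΦ(s′) − Φ(s) is added to the environment reward. The sketch below assumes a hypothetical `distance_to_goal` helper for some maze-like environment:

```python
# Sketch of potential-based reward shaping: r_shaped = r + gamma*Phi(s') - Phi(s).
# `distance_to_goal` is a hypothetical helper, passed in for illustration.

gamma = 0.99

def potential(state, distance_to_goal):
    # Higher potential when closer to the goal (an illustrative choice of Phi).
    return -distance_to_goal(state)

def shaped_reward(r, state, next_state, distance_to_goal):
    phi_s  = potential(state, distance_to_goal)
    phi_s2 = potential(next_state, distance_to_goal)
    # The shaping term rewards progress toward the goal at every step.
    return r + gamma * phi_s2 - phi_s
```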
• The discount rate determines the present value of future rewards: a reward
received k time steps in the future is worth only γ^(k−1) times what it would be
worth if it were received immediately.
Returns
Gt = Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + ···
   = Rt+1 + γ(Rt+2 + γRt+3 + γ²Rt+4 + ···)
   = Rt+1 + γGt+1
• This works for all time steps t < T , even if termination occurs at t + 1, if we
define GT = 0. This often makes it easy to compute returns from reward
sequences.
• Note that although the return (in the previous slide) is a sum of an infinite number
of terms, it is still finite if the reward is nonzero and constant, provided γ < 1. If the
reward is a constant +1, then the return is
  Gt = 1 + γ + γ² + γ³ + ··· = 1/(1 − γ)
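As a sketch, the recursion Gt = Rt+1 + γGt+1 gives a simple way to compute returns backward from the end of an episode; the reward list below is an arbitrary example:

```python
# Compute discounted returns G_t backward from the end of an episode,
# using G_t = R_{t+1} + gamma * G_{t+1} with G_T = 0.
def compute_returns(rewards, gamma):
    returns = [0.0] * len(rewards)
    g = 0.0                      # G_T = 0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

# Example: an arbitrary reward sequence with gamma = 0.5.
print(compute_returns([-1, 2, 6, 3, 2], gamma=0.5))
```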
• Since our goal is to make the balance last as long as possible, we should make the reward
of each additional time step equally valuable, or even make each additional time step
that the balance lasts more valuable than the one before it. We certainly do not want to
discount the value of an additional time step, because that discourages the balance from
lasting longer.
Example: Maze Escape
• Imagine that you are designing a robot to run a maze. You decide to give
it a reward of +1 for escaping from the maze and a reward of zero at all
other times. The task seems to break down naturally into episodes–the
successive runs through the maze–so you decide to treat it as an episodic
task, where the goal is to maximize expected total reward. After running
the learning agent for a while, you find that it is showing no
improvement in escaping from the maze. What is going wrong? Have you
effectively communicated to the agent what you want it to achieve?
!"
G2 = !#$.& = 26
!"
G1 = !#$.& = 26
G0 = R1+ 𝛾 G1
= 1+ (0.5 * 26) = 14
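A quick check of this arithmetic in Python, assuming a first reward of 1 followed by a constant reward of 13 at every later step (this setup is inferred from the numbers above):

```python
# Verify the worked example: gamma = 0.5, R_1 = 1, constant reward 13 afterwards
# (the constant-13 assumption is inferred from G1 = 13 / (1 - 0.5) = 26).
gamma = 0.5
G1 = 13 / (1 - gamma)      # geometric series of constant rewards: 26.0
G0 = 1 + gamma * G1        # G_0 = R_1 + gamma * G_1 = 14.0
print(G1, G0)
```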
• Starting from S0, we get the reward sequence +1, +1, +1, 0, 0, 0, . . ..
• Summing these, we get the same return whether we sum over the first T rewards (T =
3) or over the full infinite sequence.
• This remains true even if discounting is applied.
• So we can cover all cases with a single notation by writing the return as
  Gt = Σ (from k = t+1 to T) of γ^(k−t−1) Rk,
  allowing the possibility that T = ∞ or γ = 1 (but not both).
• P is the transition probability: if we start in state s and take action a, we end up in
state s′ with probability p(s′ | s, a).
• The value function of a state s under a policy π, denoted vπ(s), is the expected return
when starting in s and following π thereafter.
• For MDPs, we can define vπ formally by
  vπ(s) = Eπ[Gt | St = s] = Eπ[ Σ (from k = 0 to ∞) of γᵏ Rt+k+1 | St = s ], for all s ∈ S,
• where Eπ[·] denotes the expected value of a random variable given that the agent
follows policy π, and t is any time step.
• We call vπ the state-value function for policy π.
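One way to read this definition operationally: vπ(s) can be estimated by averaging the returns observed after visiting s. A minimal first-visit Monte Carlo sketch, reusing the hypothetical `run_episode` loop sketched earlier:

```python
from collections import defaultdict

# Monte Carlo estimate of v_pi(s): average the return G_t observed from the
# first visit to s in each episode generated by following pi.
# `run_episode(env, policy)` is the hypothetical trajectory generator from before.

def mc_state_values(env, policy, gamma, num_episodes=1000):
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(num_episodes):
        trajectory = run_episode(env, policy)      # [(S_t, A_t, R_{t+1}), ...]
        # Compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}.
        g = 0.0
        returns = []
        for (_, _, r) in reversed(trajectory):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        seen = set()
        for (s, _, _), g_t in zip(trajectory, returns):
            if s not in seen:                      # first-visit Monte Carlo
                seen.add(s)
                total[s] += g_t
                count[s] += 1
    return {s: total[s] / count[s] for s in total}
```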
Value Function
• We define the value of taking action a in state s under a policy π, denoted qπ(s,a), as
the expected return starting from s, taking the action a, and thereafter following policy
π:
  qπ(s, a) = Eπ[Gt | St = s, At = a]
• Starting from state s, the root node at the top of the backup diagram, the agent could take
any of a set of actions (three are shown in the diagram), based on its policy π.
• From each of these, the environment could respond with one of several next states, s’
(two are shown in the figure), along with a reward, r, depending on its dynamics given
by the function p.
• The Bellman equation averages over all the possibilities, weighting each by its
probability of occurring. It states that the value of the start state must equal the
(discounted) value of the expected next state, plus the reward expected along the way:
  vπ(s) = Σₐ π(a|s) Σ over s′, r of p(s′, r | s, a) [ r + γ vπ(s′) ], for all s ∈ S.
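The Bellman equation also suggests an algorithm: repeatedly apply it as an update until the values stop changing (iterative policy evaluation). A minimal sketch for a finite MDP; the dictionary layout is an illustrative assumption, not a standard API:

```python
# Iterative policy evaluation: sweep the Bellman expectation update
#   v(s) <- sum_a pi(a|s) * sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))
# until the largest change falls below a threshold.
# Assumed layout: policy[s][a] -> pi(a|s); dynamics[(s, a)] -> [(s', r, p), ...].

def policy_evaluation(states, policy, dynamics, gamma, theta=1e-6):
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = 0.0
            for a, pi_a in policy[s].items():
                for s2, r, p in dynamics[(s, a)]:
                    new_v += pi_a * p * (r + gamma * v[s2])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    return v
```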
Optimal Policies and Optimal Value Functions
• Solving a reinforcement learning task means, roughly, finding a policy that achieves a
lot of reward over the long run.
• For finite MDPs, an optimal policy can be defined precisely.
• A policy π is defined to be better than or equal to a policy π′ if its expected return is
greater than or equal to that of π′ for all states. In other words, π ≥ π′ if and only if
• vπ(s) ≥ vπ′(s) for all s ∈ S.
• There is always at least one policy that is better than or equal to all other policies. This
is an optimal policy.
• Optimal policies also share the same optimal action-value function, denoted q*, and
defined as
  q*(s, a) = maxπ qπ(s, a), for all s ∈ S and a ∈ A(s).
• For the state–action pair (s,a), this function gives the expected return for taking action
a in state s and thereafter following an optimal policy. Thus, we can write q* in terms of
v* as follows:
  q*(s, a) = E[ Rt+1 + γ v*(St+1) | St = s, At = a ].
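A minimal sketch of this relationship in code, together with value iteration to approximate v* first; it uses the same illustrative tabular layout as the policy evaluation sketch above:

```python
# Value iteration: v(s) <- max_a sum_{s', r} p(s', r | s, a) * (r + gamma * v(s')),
# then recover q*(s, a) = sum_{s', r} p(s', r | s, a) * (r + gamma * v*(s')).
# Assumed layout: actions[s] -> available actions (non-empty);
#                 dynamics[(s, a)] -> [(s', r, p), ...].

def value_iteration(states, actions, dynamics, gamma, theta=1e-6):
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    return v

def q_from_v(v, s, a, dynamics, gamma):
    # Expected return for taking a in s and acting optimally afterwards.
    return sum(p * (r + gamma * v[s2]) for s2, r, p in dynamics[(s, a)])
```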
[Figure: gridworld example with discount γ = 0.9, terminal rewards +1 and −1, and state values V = 1, V = 0.9, and V = 0.73.]