An Introduction to Markov Decision Processes
Bob Givan (Purdue University) and Ron Parr (Duke University)
Outline
Markov Decision Processes defined (Bob)
• Objective functions
• Policies
Stochastic Automata with Utilities
A Markov Decision Process (MDP) model
contains:
• A set of possible world states S
• A set of possible actions A
• A real-valued reward function R(s, a)
• A description T of each action’s effects in each state.
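A minimal Python sketch of how these four components might be bundled; the field names and the dictionary layout for T are illustrative assumptions, not the tutorial's notation:

    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class MDP:
        states: List[str]                                    # S: possible world states
        actions: List[str]                                   # A: possible actions
        reward: Callable[[str, str], float]                  # R(s, a): real-valued reward
        transition: Dict[Tuple[str, str], Dict[str, float]]  # T: (s, a) -> {s': P(s' | s, a)}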
Representing Actions

Deterministic Actions:
• T : S × A → S. For each state and action we specify a new state.
[Diagram: a deterministic action leads to a single next state, with probability 1.0]

Stochastic Actions:
• T : S × A → Prob(S). For each state and action we specify a probability distribution over next states. This represents the distribution P(s′ | s, a).
[Diagram: a stochastic action leads to one of several next states, here with probabilities 0.6 and 0.4]
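As a hedged illustration, a stochastic T can be stored as one distribution per (state, action) pair and sampled from; the two-state example below is made up, and a deterministic action is just the special case of a single next state with probability 1.0:

    import random

    # Hypothetical two-state example: transition[(s, a)] is the distribution P(s' | s, a).
    transition = {
        ("s0", "go"): {"s0": 0.4, "s1": 0.6},   # stochastic action
        ("s1", "go"): {"s1": 1.0},              # deterministic as a special case
    }

    def sample_next_state(s, a):
        """Draw s' according to P(s' | s, a)."""
        dist = transition[(s, a)]
        next_states = list(dist.keys())
        weights = list(dist.values())
        return random.choices(next_states, weights=weights, k=1)[0]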
Representing Solutions
A policy π is a mapping from S to A
Following a Policy
Following a policy π:
1. Determine the current state s
2. Execute action π(s)
3. Go to step 1.
Evaluating a Policy
How good is a policy π in a state s?
Objective Functions
An objective function maps infinite sequences of rewards
to single real numbers (representing utility)
Options:
1. Set a finite horizon and just total the reward
2. Discounting to prefer earlier rewards
3. Average reward rate in the limit
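To make the three options above concrete, here is a hedged sketch that turns a (finite prefix of a) reward sequence into a single number under each objective; the function names and parameters are illustrative:

    from typing import Sequence

    def finite_horizon_total(rewards: Sequence[float], horizon: int) -> float:
        """Option 1: fix a horizon and total the reward."""
        return sum(rewards[:horizon])

    def discounted_total(rewards: Sequence[float], gamma: float) -> float:
        """Option 2: a reward n steps away is weighted by gamma**n."""
        return sum((gamma ** n) * r for n, r in enumerate(rewards))

    def average_reward_rate(rewards: Sequence[float]) -> float:
        """Option 3: average reward per step, approximating the limit on a finite prefix."""
        return sum(rewards) / len(rewards)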
Discounting
A reward n steps away is discounted by γ^n, for discount rate 0 < γ < 1.
• models mortality: you may die at any moment
• models preference for shorter solutions
• a smoothed out version of limited horizon lookahead
(Max value ≤ M + γ ⋅ M + γ² ⋅ M + … = M / (1 – γ))
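For example, with per-step rewards bounded by M = 1 and γ = 0.9 the bound is 1 / (1 – 0.9) = 10; a quick numeric check (values chosen only for illustration):

    gamma, M = 0.9, 1.0                                       # illustrative values
    partial_sum = sum((gamma ** n) * M for n in range(1000))  # M + gamma*M + gamma^2*M + ...
    bound = M / (1 - gamma)
    print(round(partial_sum, 6), bound)                       # both print (about) 10.0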
Value Functions
A value function V^π : S → ℜ represents the
expected objective value obtained
following policy π from each state in S .
Bellman Equations
Bellman equations relate the value function to itself via
the problem dynamics.
For the discounted objective function,
V^π(s) = R(s, π(s)) + ∑_{s′ ∈ S} T(s, π(s), s′) ⋅ γ ⋅ V^π(s′)

V*(s) = max_{a ∈ A} [ R(s, a) + ∑_{s′ ∈ S} T(s, a, s′) ⋅ γ ⋅ V*(s′) ]
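One standard way to solve the second (optimality) equation is value iteration, which belongs to the solution methods covered in the part of the tutorial not reproduced here; a minimal sketch under the dictionary-based representation assumed earlier:

    def value_iteration(states, actions, R, T, gamma=0.9, n_iters=100):
        """Repeatedly apply the Bellman optimality backup:
        V(s) <- max_a [ R(s, a) + sum_{s'} T(s, a, s') * gamma * V(s') ]."""
        V = {s: 0.0 for s in states}
        for _ in range(n_iters):
            V = {
                s: max(
                    R[(s, a)] + sum(gamma * p * V[s2] for s2, p in T[(s, a)].items())
                    for a in actions
                )
                for s in states
            }
        return V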
Finite-horizon Bellman Equations
Finite-horizon values at adjacent horizons are related by
the action dynamics
V^{π, 0}(s) = R(s, π(s))
V^{π, n}(s) = R(s, π(s)) + ∑_{s′ ∈ S} T(s, π(s), s′) ⋅ γ ⋅ V^{π, n – 1}(s′)
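A sketch of this recursion as dynamic programming over horizons, with R and T stored as dictionaries as before (an assumption of the sketch):

    def finite_horizon_value(states, policy, R, T, gamma, n):
        """Build V^{pi,n} from V^{pi,0} using the recursion above."""
        V = {s: R[(s, policy[s])] for s in states}    # V^{pi,0}(s) = R(s, pi(s))
        for _ in range(n):                            # one backup per horizon step
            V = {
                s: R[(s, policy[s])] + sum(gamma * p * V[s2]
                                           for s2, p in T[(s, policy[s])].items())
                for s in states
            }
        return V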
Relation to Model Checking
Some thoughts on the relationship
• MDP solution focuses critically on expected value
• Contrast with safety properties, which focus on the worst case
• This contrast allows MDP methods to exploit
sampling and approximation more aggressively
• At this point, Ron Parr spoke on solution methods for about half an hour, and then I continued.
Large State Spaces
In AI problems, the “state space” is typically
• astronomically large
• described implicitly, not enumerated
• decomposed into factors, or aspects of state
Issues raised:
• How can we represent reward and action behaviors
in such MDPs?
• How can we find solutions in such MDPs?
A Factored MDP Representation
• State Space S — assignments to state variables:
  On-Mars?, Need-Power?, Daytime?, etc.
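For instance, a single state might be written as an assignment to these variables (a sketch; the variable names are from the slide, the encoding is not):

    # One state of the factored MDP, as a truth assignment to the state variables.
    state = {"On-Mars?": True, "Need-Power?": False, "Daytime?": True}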
Factored Representations of Actions
• Assume: actions affect state variables independently (sketched in code below).
  e.g., Pr(Need-Power? ∧ On-Mars? | x, a) = Pr(Need-Power? | x, a) ⋅ Pr(On-Mars? | x, a)
• Represent the effect on each state variable as a labelled partition.
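A hedged sketch of the independence assumption in code; prob_true[var](x, a) is a hypothetical per-variable model of Pr(var = True | x, a):

    def factored_transition_prob(assignment, x, a, prob_true):
        """Pr(joint assignment | x, a) as a product of per-variable factors."""
        p = 1.0
        for var, value in assignment.items():
            p_var_true = prob_true[var](x, a)          # Pr(var = True | x, a)
            p *= p_var_true if value else (1.0 - p_var_true)
        return p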
Representing Blocks
• Identifying “irrelevant” state variables
• Decision trees
• DNF formulas
• Binary/Algebraic Decision Diagrams
Partial Observability
System state cannot always be determined
⇒ a Partially Observable MDP (POMDP)
POMDP to MDP Conversion
Belief state Pr(s) can be updated to Pr(s′ | o) using Bayes' rule:
Pr(s′ | s, o) = Pr(o | s, s′) Pr(s′ | s) / Pr(o | s)
             = U(s′, o) T(s, a, s′), normalized
Pr(s′ | o) = ∑_{s ∈ S} Pr(s′ | s, o) Pr(s)
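A sketch of this update for an enumerable state set, with the transition model T and observation model U stored as dictionaries (the layout is an assumption of the sketch):

    def belief_update(belief, a, o, T, U):
        """New belief over s' after taking action a and observing o:
        b'(s') is proportional to U(s', o) * sum_s T(s, a, s') * b(s)."""
        unnormalized = {}
        for s2 in belief:
            predicted = sum(T[(s, a)].get(s2, 0.0) * b for s, b in belief.items())
            unnormalized[s2] = U[(s2, o)] * predicted
        total = sum(unnormalized.values())   # Pr(o | belief, a): the normalizing constant
        return {s2: v / total for s2, v in unnormalized.items()}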
Belief State Approximation
Problem: When MDP state space is astronomical, belief
states cannot be explicitly represented.