An Introduction to Markov Decision Processes
Bob Givan (Purdue University) and Ron Parr (Duke University)
Outline
Markov Decision Processes defined (Bob)
• Objective functions
• Policies
Stochastic Automata with Utilities
A Markov Decision Process (MDP) model
contains:
• A set of possible world states S
• A set of possible actions A
• A real-valued reward function R(s, a)
• A description T of each action’s effects in each state.
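A minimal Python sketch of how these four components might be bundled; the field names and the dictionary layout for T are illustrative assumptions, not the tutorial's notation:

    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class MDP:
        states: List[str]                                    # S: possible world states
        actions: List[str]                                   # A: possible actions
        reward: Callable[[str, str], float]                  # R(s, a): real-valued reward
        transition: Dict[Tuple[str, str], Dict[str, float]]  # T: (s, a) -> {s': P(s' | s, a)}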
Representing Actions

Deterministic Actions:
• T : S × A → S. For each state and action we specify a new state.
[Diagram: a deterministic action leads to a single next state, with probability 1.0]

Stochastic Actions:
• T : S × A → Prob(S). For each state and action we specify a probability distribution over next states. This represents the distribution P(s′ | s, a).
[Diagram: a stochastic action leads to one of several next states, here with probabilities 0.6 and 0.4]
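As a hedged illustration, a stochastic T can be stored as one distribution per (state, action) pair and sampled from; the two-state example below is made up, and a deterministic action is just the special case of a single next state with probability 1.0:

    import random

    # Hypothetical two-state example: transition[(s, a)] is the distribution P(s' | s, a).
    transition = {
        ("s0", "go"): {"s0": 0.4, "s1": 0.6},   # stochastic action
        ("s1", "go"): {"s1": 1.0},              # deterministic as a special case
    }

    def sample_next_state(s, a):
        """Draw s' according to P(s' | s, a)."""
        dist = transition[(s, a)]
        next_states = list(dist.keys())
        weights = list(dist.values())
        return random.choices(next_states, weights=weights, k=1)[0]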
Representing Solutions
A policy π is a mapping from S to A
Following a Policy
Following a policy π:
1. Determine the current state s
2. Execute action π(s)
3. Go to step 1.
Evaluating a Policy
How good is a policy π in a state s?
Objective Functions
An objective function maps infinite sequences of rewards
to single real numbers (representing utility)
Options:
1. Set a finite horizon and just total the reward
2. Discounting to prefer earlier rewards
3. Average reward rate in the limit
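To make the three options above concrete, here is a hedged sketch that turns a (finite prefix of a) reward sequence into a single number under each objective; the function names and parameters are illustrative:

    from typing import Sequence

    def finite_horizon_total(rewards: Sequence[float], horizon: int) -> float:
        """Option 1: fix a horizon and total the reward."""
        return sum(rewards[:horizon])

    def discounted_total(rewards: Sequence[float], gamma: float) -> float:
        """Option 2: a reward n steps away is weighted by gamma**n."""
        return sum((gamma ** n) * r for n, r in enumerate(rewards))

    def average_reward_rate(rewards: Sequence[float]) -> float:
        """Option 3: average reward per step, approximating the limit on a finite prefix."""
        return sum(rewards) / len(rewards)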
Discounting
A reward n steps away is discounted by γ^n, for discount rate 0 < γ < 1.
• models mortality: you may die at any moment
• models preference for shorter solutions
• a smoothed out version of limited horizon lookahead
(Max value ≤ M + γ ⋅ M + γ² ⋅ M + … = M / (1 – γ))
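For example, with per-step rewards bounded by M = 1 and γ = 0.9 the bound is 1 / (1 – 0.9) = 10; a quick numeric check (values chosen only for illustration):

    gamma, M = 0.9, 1.0                                       # illustrative values
    partial_sum = sum((gamma ** n) * M for n in range(1000))  # M + gamma*M + gamma^2*M + ...
    bound = M / (1 - gamma)
    print(round(partial_sum, 6), bound)                       # both print (about) 10.0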
Value Functions
A value function V^π : S → ℜ represents the
expected objective value obtained
following policy π from each state in S .
Bellman Equations
Bellman equations relate the value function to itself via
the problem dynamics.
For the discounted objective function,
V^π(s) = R(s, π(s)) + ∑_{s′ ∈ S} T(s, π(s), s′) ⋅ γ ⋅ V^π(s′)

V*(s) = max_{a ∈ A} [ R(s, a) + ∑_{s′ ∈ S} T(s, a, s′) ⋅ γ ⋅ V*(s′) ]
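One standard way to solve the second (optimality) equation is value iteration, which belongs to the solution methods covered in the part of the tutorial not reproduced here; a minimal sketch under the dictionary-based representation assumed earlier:

    def value_iteration(states, actions, R, T, gamma=0.9, n_iters=100):
        """Repeatedly apply the Bellman optimality backup:
        V(s) <- max_a [ R(s, a) + sum_{s'} T(s, a, s') * gamma * V(s') ]."""
        V = {s: 0.0 for s in states}
        for _ in range(n_iters):
            V = {
                s: max(
                    R[(s, a)] + sum(gamma * p * V[s2] for s2, p in T[(s, a)].items())
                    for a in actions
                )
                for s in states
            }
        return V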
Finite-horizon Bellman Equations
Finite-horizon values at adjacent horizons are related by
the action dynamics
V^{π, 0}(s) = R(s, π(s))
V^{π, n}(s) = R(s, π(s)) + ∑_{s′ ∈ S} T(s, π(s), s′) ⋅ γ ⋅ V^{π, n – 1}(s′)
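A sketch of this recursion as dynamic programming over horizons, with R and T stored as dictionaries as before (an assumption of the sketch):

    def finite_horizon_value(states, policy, R, T, gamma, n):
        """Build V^{pi,n} from V^{pi,0} using the recursion above."""
        V = {s: R[(s, policy[s])] for s in states}    # V^{pi,0}(s) = R(s, pi(s))
        for _ in range(n):                            # one backup per horizon step
            V = {
                s: R[(s, policy[s])] + sum(gamma * p * V[s2]
                                           for s2, p in T[(s, policy[s])].items())
                for s in states
            }
        return V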
Relation to Model Checking
Some thoughts on the relationship
• MDP solution focuses critically on expected value
• Contrast with safety properties, which focus on the worst case
• This contrast allows MDP methods to exploit
sampling and approximation more aggressively
• At this point, Ron Parr spoke on solution methods for about half an hour, and then I continued.
Large State Spaces
In AI problems, the “state space” is typically
• astronomically large
• described implicitly, not enumerated
• decomposed into factors, or aspects of state
Issues raised:
• How can we represent reward and action behaviors
in such MDPs?
• How can we find solutions in such MDPs?
A Factored MDP Representation
• State Space S — assignments to state variables:
  On-Mars?, Need-Power?, Daytime?, etc.
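For instance, a single state might be written as an assignment to these variables (a sketch; the variable names are from the slide, the encoding is not):

    # One state of the factored MDP, as a truth assignment to the state variables.
    state = {"On-Mars?": True, "Need-Power?": False, "Daytime?": True}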
Factored Representations of Actions
• Assume: actions affect state variables independently (sketched in code below).
  e.g., Pr(Need-Power? ∧ On-Mars? | x, a) = Pr(Need-Power? | x, a) ⋅ Pr(On-Mars? | x, a)
• Represent the effect on each state variable as a labelled partition.
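A hedged sketch of the independence assumption in code; prob_true[var](x, a) is a hypothetical per-variable model of Pr(var = True | x, a):

    def factored_transition_prob(assignment, x, a, prob_true):
        """Pr(joint assignment | x, a) as a product of per-variable factors."""
        p = 1.0
        for var, value in assignment.items():
            p_var_true = prob_true[var](x, a)          # Pr(var = True | x, a)
            p *= p_var_true if value else (1.0 - p_var_true)
        return p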
Representing Blocks
• Identifying “irrelevant” state variables
• Decision trees
• DNF formulas
• Binary/Algebraic Decision Diagrams
Partial Observability
System state cannot always be determined
⇒ a Partially Observable MDP (POMDP)
POMDP to MDP Conversion
Belief state Pr(s) can be updated to Pr(s′ | o) using Bayes' rule:
Pr(s′ | s, o) = Pr(o | s, s′) Pr(s′ | s) / Pr(o | s)
             = U(s′, o) T(s, a, s′), normalized
Pr(s′ | o) = ∑_{s ∈ S} Pr(s′ | s, o) Pr(s)
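A sketch of this update for an enumerable state set, with the transition model T and observation model U stored as dictionaries (the layout is an assumption of the sketch):

    def belief_update(belief, a, o, T, U):
        """New belief over s' after taking action a and observing o:
        b'(s') is proportional to U(s', o) * sum_s T(s, a, s') * b(s)."""
        unnormalized = {}
        for s2 in belief:
            predicted = sum(T[(s, a)].get(s2, 0.0) * b for s, b in belief.items())
            unnormalized[s2] = U[(s2, o)] * predicted
        total = sum(unnormalized.values())   # Pr(o | belief, a): the normalizing constant
        return {s2: v / total for s2, v in unnormalized.items()}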
Belief State Approximation
Problem: When MDP state space is astronomical, belief
states cannot be explicitly represented.