Stochastic Process - Markov Property - Markov Chain - Markov Decision Process - Reinforcement Learning - RL Techniques - Example Applications
[Figure: random walk as an example of a stochastic process – http://en.wikipedia.org/wiki/Image:Random_Walk_example.png]
Markov Property
• Also thought of as the “memoryless” property
• A stochastic process is said to have the Markov property if the probability of state X_{n+1} having any given value depends only upon state X_n (formalized below)
• Whether the property holds depends very much on how the states are described
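In symbols, the standard formal statement of the property (using X_n for the state at step n) is:

Pr(X_{n+1} = x | X_n = x_n, X_{n-1} = x_{n-1}, ..., X_0 = x_0) = Pr(X_{n+1} = x | X_n = x_n)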
Markov Property Example
• Checkers:
• Current State: The current configuration of the board
• Contains all information needed for transition to next
state
• Thus, each configuration can be said to have the Markov
property
Markov Chain
• Discrete-time stochastic process with the Markov property
• Industry Example: Google’s PageRank algorithm
• A probability distribution representing the likelihood that following random links ends up on a given page (see the sketch below)
http://en.wikipedia.org/wiki/PageRank
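To make the PageRank example concrete, here is a rough power-iteration sketch over a tiny made-up link graph; the 3-page graph, the damping factor 0.85, and the fixed sweep count are assumptions for illustration, not Google's actual implementation.

```python
# PageRank as the stationary distribution of a "random surfer" Markov chain.
# Hypothetical 3-page link graph; values here are purely illustrative.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # page -> pages it links to
pages = list(links)
d = 0.85                                            # damping factor (assumed)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(100):                                # fixed number of sweeps
    new_rank = {}
    for p in pages:
        # probability mass flowing into p from every page q that links to it
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - d) / len(pages) + d * incoming
    rank = new_rank

print(rank)   # approximate likelihood of the random surfer ending up on each page
```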
Markov Decision Process (MDP)
• Discrete-time stochastic control process
• Extension of Markov chains
• Differences:
• Addition of actions (choice)
• Addition of rewards (motivation)
• If the actions are fixed, an MDP reduces to a
Markov chain
Description of MDPs
• Tuple (S, A, P_a(·,·), R(·))
• S -> state space
• A -> action space
• P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a)
• R(s) = immediate reward at state s
• Goal is to maximize some cumulative function of
the rewards
• Finite MDPs have finite state and action spaces
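One way to make the tuple concrete: a minimal sketch of a finite MDP stored as plain Python dictionaries; the two states, two actions, and all numbers are made up purely for illustration.

```python
# A finite MDP as the tuple (S, A, P_a(s, s'), R(s)).
states = ["s0", "s1"]                     # S: state space
actions = ["a0", "a1"]                    # A: action space

# P[a][s][s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a); each row sums to 1.
P = {
    "a0": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.0, "s1": 1.0}},
    "a1": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.5, "s1": 0.5}},
}

# R[s] = immediate reward received in state s (illustrative values).
R = {"s0": 0.0, "s1": 1.0}
```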
Simple MDP Example
• Recycling MDP Robot
• Can search for a trashcan, wait for someone to bring it a trashcan, or go home and recharge its battery
• Has two energy levels – high and low
• Searching runs down the battery, waiting does not, and depleting the battery yields a very low (negative) reward
Transition Probabilities
s = s_t    s' = s_{t+1}    a = a_t      P^a_{ss'}    R^a_{ss'}
high       high            search       α            R_search
high       low             search       1 - α        R_search
low        high            search       1 - β        -3
low        low             search       β            R_search
high       high            wait         1            R_wait
high       low             wait         0            R_wait
low        high            wait         0            R_wait
low        low             wait         1            R_wait
low        high            recharge     1            0
low        low             recharge     0            0
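The table can be encoded directly as transition and reward dictionaries. A sketch follows; α, β, R_search, and R_wait are left as placeholder constants since the slides do not fix their values, and the variable names are my own.

```python
# Recycling-robot model from the table above: P_robot[a][s][s'] = P^a_{ss'},
# R_robot[a][s][s'] = R^a_{ss'}. alpha, beta and the reward constants are placeholders.
alpha, beta = 0.8, 0.6
R_search, R_wait = 2.0, 1.0

P_robot = {
    "search":   {"high": {"high": alpha,    "low": 1 - alpha},
                 "low":  {"high": 1 - beta, "low": beta}},
    "wait":     {"high": {"high": 1.0, "low": 0.0},
                 "low":  {"high": 0.0, "low": 1.0}},
    "recharge": {"low":  {"high": 1.0, "low": 0.0}},   # recharge only listed from "low"
}
R_robot = {
    "search":   {"high": {"high": R_search, "low": R_search},
                 "low":  {"high": -3.0,     "low": R_search}},
    "wait":     {"high": {"high": R_wait, "low": R_wait},
                 "low":  {"high": R_wait, "low": R_wait}},
    "recharge": {"low":  {"high": 0.0, "low": 0.0}},
}
```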
Transition Graph
[Figure: transition graph of the recycling robot MDP, showing state nodes and action nodes]
Solution to an MDP = Policy π
• Gives the action to take from a given state
regardless of history
• Two arrays indexed by state
• V is the value function: the expected discounted sum of rewards obtained by following the policy
• π is an array of actions to be taken in each state (Policy)
• Two basic steps (sketched in code below):
  1. π(s) := argmax_a Σ_{s'} P_a(s, s') V(s')
  2. V(s) := R(s) + γ Σ_{s'} P_{π(s)}(s, s') V(s')
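A minimal, self-contained sketch of these two steps in Python, using the same dictionary layout as earlier; the tiny two-state MDP, γ = 0.9, and the fixed sweep count are all assumptions for illustration.

```python
# Alternate the two basic steps: greedy policy update, then value update.
states, actions = ["s0", "s1"], ["a0", "a1"]
P = {"a0": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.0, "s1": 1.0}},
     "a1": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.5, "s1": 0.5}}}
R = {"s0": 0.0, "s1": 1.0}
gamma = 0.9                                   # discount factor (assumed)

V = {s: 0.0 for s in states}
pi = {s: actions[0] for s in states}          # arbitrary initial policy

for _ in range(100):                          # fixed number of sweeps for simplicity
    # Step 1: pi(s) := argmax_a sum_{s'} P_a(s, s') V(s')
    for s in states:
        pi[s] = max(actions, key=lambda a: sum(P[a][s][s2] * V[s2] for s2 in states))
    # Step 2: V(s) := R(s) + gamma * sum_{s'} P_{pi(s)}(s, s') V(s')
    for s in states:
        V[s] = R[s] + gamma * sum(P[pi[s]][s][s2] * V[s2] for s2 in states)

print(pi, V)   # converged policy and its value estimates
```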
Variants
• Value iteration, policy iteration, and related variants differ mainly in how the two basic steps above are ordered and interleaved
First-visit MC method
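A minimal sketch of first-visit Monte Carlo prediction, assuming episodes arrive as lists of (state, reward received after leaving that state) pairs; the function name and interface are my own.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the average return following the first visit to s in each episode."""
    returns = defaultdict(list)                # state -> list of sampled returns
    for episode in episodes:                   # episode: [(state, reward), ...]
        # compute the return G_t following every time step, working backwards
        G, Gs = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            Gs[t] = G
        # record the return only at each state's first visit
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(Gs[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```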
Estimation of Action Values
• State values are not enough without a model – we
need action values as well
• Q^π(s, a) = the expected return when starting in state s, taking action a, and thereafter following policy π
• Exploration vs. Exploitation
• Exploring starts
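One common way to trade off exploration against exploitation (an alternative to exploring starts) is ε-greedy action selection. A minimal sketch, with ε = 0.1 as an assumed default and Q stored as a dict keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise pick the greedy action under Q."""
    if random.random() < epsilon:
        return random.choice(actions)                               # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))       # exploit
```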
Example Monte Carlo Algorithm
Temporal-Difference (TD) Learning
• TD(0) is the simplest TD method
• It uses a sample backup from a single successor state or state-action pair instead of the full backup of DP methods
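A minimal sketch of the corresponding TD(0) value update; the step size α = 0.1 and discount γ = 0.9 are assumed values.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup: nudge V(s) toward the sampled target r + gamma * V(s_next)."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```

Each interaction step (s, r, s') triggers one such update, which is the sample backup from a single successor state mentioned above.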
SARSA – On-policy Control