Examples and Videos of Markov Decision Processes (MDPs) and Reinforcement Learning
[Figure: agent-environment interaction loop. Labels: Environment, state, action, reward, Agent]
• complete agent
• temporally situated
• continual learning & planning
• object is to affect environment
• environment stochastic & uncertain
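The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: the `TwoStateEnv` class, its 0.8 success probability, and the two fixed policies are hypothetical and do not come from the slides.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment with two states and noisy actions."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Assumed noise level: the action takes effect with probability 0.8,
        # otherwise the state is unchanged (a stochastic environment).
        if self.rng.random() < 0.8:
            self.state = action
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward


def run(env, policy, steps=100):
    """Agent-environment loop: observe state, act, receive reward."""
    state, total = env.state, 0.0
    for _ in range(steps):
        action = policy(state)            # agent chooses an action
        state, reward = env.step(action)  # environment responds
        total += reward
    return total


def always_one(state):
    return 1


def always_zero(state):
    return 0


total_one = run(TwoStateEnv(seed=0), always_one)
total_zero = run(TwoStateEnv(seed=0), always_zero)
```

The policy that steers toward state 1 collects reward while the other collects none, which is the sense in which the agent's object is to affect its environment.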
States, Actions, and Rewards
Hajime Kimura’s RL Robots
[Videos: Before, After]
–Plato, Protagoras
Backgammon
a “big” game
Tesauro, 1992-1995
TD-Gammon
[Figure: TD-Gammon. Labels: Value; Action selection by 2-3 ply search; TD Error $V_{t+1} - V_t$]
Six weeks later it’s the best player of backgammon in the world
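To make the TD error $V_{t+1} - V_t$ concrete, here is a minimal tabular sketch on a small random walk. It is not TD-Gammon, which used a neural network rather than a table. The five-state walk, the step size, and the function name `td0_random_walk` are assumptions chosen for illustration. Intermediate rewards are zero and there is no discounting, so the error reduces to the difference of successive predictions, with the final outcome (0 or 1) standing in for the last prediction.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.02, n_states=5, seed=0):
    """Tabular TD(0) prediction on a symmetric random walk.

    Episodes start in the middle state; stepping off the left end gives
    outcome 0 and stepping off the right end gives outcome 1.
    """
    rng = random.Random(seed)
    V = [0.5] * n_states  # initial predictions for non-terminal states
    for _ in range(episodes):
        s = n_states // 2
        while True:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next >= n_states
            if s_next < 0:
                v_next = 0.0          # left terminal outcome
            elif s_next >= n_states:
                v_next = 1.0          # right terminal outcome
            else:
                v_next = V[s_next]    # next prediction
            td_error = v_next - V[s]  # V_{t+1} - V_t
            V[s] += alpha * td_error  # move prediction toward the next one
            if done:
                break
            s = s_next
    return V


values = td0_random_walk()
```

For this walk the true probability of finishing on the right from state i (counting from 0) is (i + 1) / 6, so the learned values should rise from left to right and sit near 0.5 in the middle.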
The Mountain Car Problem
Goal
Minimum-Time-to-Goal Problem
Moore, 1990
Value Functions Learned
while solving the Mountain Car problem
Goal region
Minimize Time-to-Goal
Value = estimated time to goal
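A short simulation shows what "time to goal" means here. The dynamics and constants below follow the commonly used textbook formulation of mountain car and are an assumption; the slides do not give equations, and Moore's original setup may differ. The `pump` and `full_throttle` policies are hypothetical hand-written examples, not learned ones.

```python
import math


def step(x, v, a):
    """One step of the assumed mountain-car dynamics; a is -1, 0, or +1."""
    v = v + 0.001 * a - 0.0025 * math.cos(3 * x)
    v = max(-0.07, min(0.07, v))  # velocity limits
    x = x + v
    if x < -1.2:                  # left wall: stop the car
        x, v = -1.2, 0.0
    return x, v


def time_to_goal(policy, x=-0.5, v=0.0, max_steps=1000):
    """Return the number of steps to reach x >= 0.5, or None if never reached."""
    for t in range(max_steps):
        if x >= 0.5:
            return t
        x, v = step(x, v, policy(x, v))
    return None


def pump(x, v):
    # Push in the direction of motion to build up energy.
    return 1 if v >= 0 else -1


def full_throttle(x, v):
    # Always push toward the goal.
    return 1


steps_pump = time_to_goal(pump)
steps_full = time_to_goal(full_throttle)
```

Under these assumed constants the engine is too weak to climb directly, so pushing right from rest never arrives, while rocking back and forth does. A learned value function assigns each (position, velocity) pair an estimate of the step count that `time_to_goal` measures for a given policy.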
[Video labels: Random, Learned, Hand-coded, Hold]
[Figure labels: Temporal-difference (TD) error; World or world model]
“Autonomous helicopter flight via Reinforcement Learning”
Ng (Stanford), Kim, Jordan, & Sastry (UC Berkeley) 2004
Reason as RL over Imagined Experience
1. Learn a predictive model of the world’s dynamics
transition probabilities, expected immediate rewards
2. Use model to generate imaginary experiences
internal thought trials, mental simulation (Craik, 1943)
3. Apply RL as if experience had really happened
vicarious trial and error (Tolman, 1932)
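The three steps above can be sketched with a Dyna-Q style loop. Choosing Dyna-Q is an assumption made for illustration, since the slides do not name a specific algorithm. The six-state corridor, the deterministic table model, and all parameter values are likewise hypothetical.

```python
import random

N, GOAL = 6, 5      # hypothetical corridor: states 0..5, goal at the right end
ACTIONS = (-1, +1)  # move left or right


def env_step(s, a):
    """Real environment: deterministic move, reward 1 on reaching the goal."""
    s2 = max(0, min(N - 1, s + a))
    return s2, (1.0 if s2 == GOAL else 0.0)


def q_update(Q, s, a, r, s2, alpha, gamma):
    """One Q-learning backup, used for real and imagined experience alike."""
    best_next = 0.0 if s2 == GOAL else max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


def dyna_q(episodes=30, planning_steps=20, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
    model = {}  # learned model: (state, action) -> (next state, reward)
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # Epsilon-greedy action choice with random tie-breaking.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s2, r = env_step(s, a)
            q_update(Q, s, a, r, s2, alpha, gamma)  # learn from real experience
            model[(s, a)] = (s2, r)                 # step 1: learn the model
            for _ in range(planning_steps):
                ps, pa = rng.choice(list(model))    # step 2: imagined experience
                ps2, pr = model[(ps, pa)]
                q_update(Q, ps, pa, pr, ps2, alpha, gamma)  # step 3: apply RL
            s = s2
    return Q


Q = dyna_q()
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
```

The same `q_update` function handles both real and imagined transitions, which is the point of step 3: the learner does not distinguish experience that happened from experience the model generated.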
GridWorld Example
Summary:
RL’s Computational Theory of Mind
[Figure labels: Reward, Policy, Value Function, Predictive Model]
It’s all created from
the scalar reward signal