Reinforcement Learning I: The Setting and Classical Stochastic Dynamic Programming Algorithms
Reinforcement Learning
(Ch. 17.1-17.3, Ch. 20)
Learner: passive vs. active
Sequential decision problems
Approaches:
1. Learn values of states (or state histories) & try to maximize the utility of their outcomes.
   Needs a model of the environment: which operators (actions) there are & what states they lead to.
2. Learn values of state-action pairs.
   Does not require a model of the environment (except legal moves).
   Cannot look ahead.
Reinforcement Learning
Deterministic transitions vs. stochastic transitions
[Figure: the 4x3 grid world with terminal states +1 and -1 and the start state at (1,1).]
(Temporal) credit assignment problem: the reinforcement signal is sparse.
Offline alg: action sequence determined ex ante.
Online alg: action sequence is conditional on observations along the way; important in stochastic environments (e.g., flying a jet).
Reinforcement Learning
Transition model M: 0.8 in the direction you want to go, 0.1 to the left, 0.1 to the right (0.2 perpendicular in total).
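A minimal sketch (an assumed encoding, not from the slides) of this transition model for the 4x3 world in Python; the wall at (2,2) and the terminal states follow the standard Russell & Norvig example:

# Hypothetical encoding of the 4x3 grid world and its stochastic transition model.
# States are (col, row) pairs, 1-indexed; (2,2) is a wall; (4,3)=+1 and (4,2)=-1 are terminal.
ACTIONS = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}

def move(state, delta):
    """Deterministic effect of a move; bumping into the wall or the grid edge stays put."""
    nxt = (state[0] + delta[0], state[1] + delta[1])
    return nxt if nxt in STATES else state

def transition_model(state, action):
    """Return {next_state: probability}: 0.8 intended direction, 0.1 each perpendicular."""
    left = {'up': 'left', 'left': 'down', 'down': 'right', 'right': 'up'}[action]
    right = {'up': 'right', 'right': 'down', 'down': 'left', 'left': 'up'}[action]
    dist = {}
    for a, p in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        s2 = move(state, ACTIONS[a])
        dist[s2] = dist.get(s2, 0.0) + p
    return dist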
Policy: mapping from states to actions
An optimal policy for the stochastic environment, and the utilities of the states:
[Figure: the optimal policy arrows for the 4x3 world, plus the state utilities:
  0.812   0.868   0.912    +1
  0.762           0.660    -1
  0.705   0.655   0.611   0.388 ]
Observable MDPs
Utility function on histories: assume additivity (almost always true in practice):
$U_h([S_0, S_1, \ldots, S_n]) = R_0 + U_h([S_1, \ldots, S_n])$
$\text{Policy}^*(i) = \arg\max_a \sum_j M^a_{ij}\, U(j)$
DP: $O(n|A||S|)$, where |S| = # of possible states
[Figure: the utility values for selected states at each iteration step in the application of VALUE-ITERATION to the 4x3 world in our example.]
Thrm: As t → ∞, value iteration converges to the exact U even if updates are done asynchronously & i is picked randomly at every step.
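A minimal synchronous value-iteration sketch, reusing the transition_model sketch above; R(i) = -0.04 for non-terminal states is an assumption matching the standard 4x3 example:

def value_iteration(reward=-0.04, gamma=1.0, iterations=100):
    """Iterate U(i) <- R(i) + gamma * max_a sum_j M^a_ij U(j)."""
    U = {s: 0.0 for s in STATES}
    for s, r in TERMINALS.items():
        U[s] = r                            # terminal utilities are fixed at their rewards
    for _ in range(iterations):
        new_U = dict(U)
        for s in STATES:
            if s in TERMINALS:
                continue
            best = max(sum(p * U[s2] for s2, p in transition_model(s, a).items())
                       for a in ACTIONS)
            new_U[s] = reward + gamma * best
        U = new_U
    return U

With enough iterations this should reproduce utilities close to those shown above (0.812, 0.868, 0.912, ...).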
$U_{t+1}(i) \leftarrow R(i) + \sum_j M^{\text{Policy}(i)}_{ij}\, U_t(j)$
using the current utility estimates from policy iteration as the initial values. (Here Policy(i) is the action suggested by the policy in state i.)
While this can work well in some environments, it will often take a very long time to converge in the early stages of policy iteration. This is because the policy will be more or less random, so many steps can be required to reach terminal states.
$U(i) = R(i) + \sum_j M^{P(i)}_{ij}\, U(j)$
For example, suppose P is the policy shown in Figure 17.2(a). Then using the
transition model M, we can construct the following set of equations:
U(1,1) = 0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1)
U(1,2) = 0.8 U(1,3) + 0.2 U(1,2)
and so on. This gives a set of 11 linear equations in 11 unknowns, which can
be solved by linear algebra methods such as Gaussian elimination. For small
state spaces, value determination using exact solution methods is often the
most efficient approach.
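A hedged sketch of value determination as an exact linear solve (names are assumptions; it uses numpy's solver rather than hand-rolled Gaussian elimination, and reuses the transition_model sketch above; policy is assumed to be a dict mapping non-terminal states to action names):

import numpy as np

def value_determination(policy, reward=-0.04):
    """Solve U(i) = R(i) + sum_j M^{P(i)}_ij U(j) exactly as a linear system."""
    idx = {s: k for k, s in enumerate(STATES)}
    n = len(STATES)
    A = np.eye(n)
    b = np.zeros(n)
    for s in STATES:
        k = idx[s]
        if s in TERMINALS:
            b[k] = TERMINALS[s]          # terminal utility is just its reward
            continue
        b[k] = reward
        for s2, p in transition_model(s, policy[s]).items():
            A[k, idx[s2]] -= p           # move sum_j M_ij U(j) to the left-hand side
    return dict(zip(STATES, np.linalg.solve(A, b)))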
Policy iteration converges to the optimal policy, and the policy improves monotonically for all states.
The asynchronous version converges to the optimal policy if all states are visited infinitely often.
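A minimal policy-iteration loop built from the sketches above (value determination + greedy policy improvement); the names and the initial policy are assumptions:

def policy_iteration(reward=-0.04):
    """Alternate value determination and greedy policy improvement until stable."""
    policy = {s: 'up' for s in STATES if s not in TERMINALS}   # arbitrary initial policy
    while True:
        U = value_determination(policy, reward)
        stable = True
        for s in policy:
            best = max(ACTIONS, key=lambda a: sum(p * U[s2]
                       for s2, p in transition_model(s, a).items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, U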
Discounting
Infinite horizon ⇒ infinite U ⇒ policy & value iteration fail to converge.
Also, what is it rational to prefer: one infinite reward stream vs. another?
Solution: discounting.
$U(H) = \sum_i \gamma^i R_i$, which is finite if $0 \le \gamma < 1$.
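Why the discounted sum is finite (a one-line bound, assuming the rewards are bounded by some $R_{\max}$):

$U(H) \;=\; \sum_{i=0}^{\infty} \gamma^{i} R_i \;\le\; \sum_{i=0}^{\infty} \gamma^{i} R_{\max} \;=\; \frac{R_{\max}}{1-\gamma} \qquad \text{for } 0 \le \gamma < 1$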
Passive learning
[Figure: (a) A simple stochastic environment with start state (1,1) and terminal states +1 and -1. (b) The transition model: each arrow is labeled with its probability (0.33, 0.5, or 1.0). (c) The exact utilities of the non-terminal states: -0.0380, 0.0886, 0.2152, -0.1646, -0.4430.]
LMS updating
[Widrow & Hoff 1960]

function LMS-UPDATE(U, e, percepts, M, N) returns an updated U
  if TERMINAL?[e] then reward-to-go <- 0
  for each e_i in percepts (starting at the end) do            ; batch mode
    reward-to-go <- reward-to-go + REWARD[e_i]
    U[STATE[e_i]] <- RUNNING-AVERAGE(U[STATE[e_i]], reward-to-go, N[STATE[e_i]])   ; simple average
  end
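A hedged Python sketch of the same idea (direct utility estimation: average the observed reward-to-go per state over complete epochs); the names are assumptions:

def lms_update(U, N, epoch):
    """epoch: list of (state, reward) pairs ending in a terminal state.
    Updates U[s] as the running average of the observed reward-to-go from s."""
    reward_to_go = 0.0
    for state, r in reversed(epoch):         # walk the epoch backwards
        reward_to_go += r
        N[state] = N.get(state, 0) + 1
        U[state] = U.get(state, 0.0) + (reward_to_go - U.get(state, 0.0)) / N[state]
    return U, N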
[Figure: an example where LMS-updating does poorly. A NEW state (U = ?) is reached for the first time and then follows the path marked by the dashed lines, reaching the terminal state with reward +1; yet with probability 0.9 the NEW state leads to an OLD state with known utility U = -0.8, and only with probability 0.1 toward the +1 terminal, so averaging the single observed reward-to-go greatly overestimates the NEW state's utility.]
Adaptive DP (ADP)
Idea: use the constraints (state transition probabilities) between
states to speed learning.
Solve
$U(i) = R(i) + \sum_j M_{ij}\, U(j)$
using DP (= value determination).
No maximization over actions because the agent is passive, unlike in value iteration.
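A minimal sketch of the ADP idea under assumed names: estimate $M_{ij}$ from observed transition counts, then rerun value determination on the learned model:

from collections import defaultdict

class PassiveADP:
    """Passive ADP: learn transition probabilities from experience,
    then solve U(i) = R(i) + sum_j M_ij U(j) on the learned model."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # counts[i][j]
        self.R = {}

    def observe(self, i, j, reward_i):
        self.counts[i][j] += 1
        self.R[i] = reward_i

    def M(self, i):
        total = sum(self.counts[i].values())
        return {j: n / total for j, n in self.counts[i].items()}

The utilities would then be recomputed with the same value-determination solve sketched earlier, using the estimated M in place of the true model.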
Large state space
e.g. Backgammon: 10^50 equations in 10^50 variables
$U(i) \leftarrow U(i) + \alpha\,[\,R(i) + U(j) - U(i)\,]$
Thrm: The average value of U(i) converges to the correct value.
Thrm: If $\alpha$ is decreased appropriately as a function of the number of times a state has been visited ($\alpha = \alpha(N[i])$), then U(i) itself converges to the correct value.
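A hedged sketch of this temporal-difference update (names assumed; called once per observed transition i → j with reward R(i), and undiscounted as in the slide):

def td_update(U, N, i, j, reward_i):
    """TD update: U(i) <- U(i) + alpha * (R(i) + U(j) - U(i)),
    with alpha decayed as 1/N[i] so that U(i) itself converges."""
    N[i] = N.get(i, 0) + 1
    alpha = 1.0 / N[i]
    U.setdefault(i, 0.0)
    U[i] += alpha * (reward_i + U.get(j, 0.0) - U[i])
    return U, N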
Algorithm TD(λ)
(not in Russell & Norvig book)
Idea: update from the whole epoch, not just from the latest state transition.
Special cases:
  λ = 1: LMS
  λ = 0: TD
An intermediate choice of λ (between 0 and 1) is best.
Interplay with
Convergence of TD(λ)
Thrm: Converges w.p. 1 under certain boundary conditions.
Decrease $\alpha_i(t)$ s.t.
$\sum_t \alpha_i(t) = \infty$ and $\sum_t \alpha_i^2(t) < \infty$
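One common way to implement TD(λ) is with eligibility traces; this is a standard formulation rather than something from the slides, and the names and constants are assumptions:

from collections import defaultdict

def td_lambda_epoch(U, epoch, lam=0.7, alpha=0.1):
    """Replay one epoch of (state, reward) pairs with accumulating eligibility
    traces: every previously visited state gets a share (decayed by lambda)
    of each new TD error. lam=0 reduces to TD, lam=1 behaves like LMS."""
    trace = defaultdict(float)
    for (i, reward_i), (j, _) in zip(epoch, epoch[1:]):
        U.setdefault(i, 0.0)
        U.setdefault(j, 0.0)
        delta = reward_i + U[j] - U[i]     # TD error for this transition
        trace[i] += 1.0                    # accumulating trace
        for s in list(trace):
            U[s] += alpha * delta * trace[s]
            trace[s] *= lam                # decay after applying the update
    return U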
$M^a_{ij}$ unknown ⇒ the TD update is unchanged!
Tradeoff
Model-based (learn M)
Model-free (e.g. Q-learning)
Q-learning
Q-values Q(a, i), with $U(i) = \max_a Q(a, i)$
TD-style update after moving from state i to state j:
$Q(a, i) \leftarrow Q(a, i) + \alpha\,[\,R(i) + \max_{a'} Q(a', j) - Q(a, i)\,]$
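A hedged tabular Q-learning sketch (assumed names; it uses the undiscounted update above together with an ε-greedy explorer, which is one possible exploration choice, not the one prescribed by the slides):

import random
from collections import defaultdict

def q_learning_step(Q, i, a, reward_i, j, actions, alpha=0.1, epsilon=0.1):
    """One Q-learning step after taking action a in state i and landing in j.
    Returns the next action, chosen epsilon-greedily from the updated Q."""
    best_next = max(Q[(a2, j)] for a2 in actions)
    Q[(a, i)] += alpha * (reward_i + best_next - Q[(a, i)])
    if random.random() < epsilon:                      # explore
        return random.choice(list(actions))
    return max(actions, key=lambda a2: Q[(a2, j)])     # exploit

# Q as a defaultdict so unseen (action, state) pairs start at 0:
Q = defaultdict(float)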
Exploration
Tradeoff between exploitation (control) and exploration (identification)
Extremes: greedy vs. random acting
(n-armed bandit models)
Q-learning converges to the optimal Q-values if
* every state-action pair is tried infinitely often (due to exploration),
* the action selection becomes greedy as time approaches infinity, and
* the learning rate is decreased fast enough but not too fast
  (as we discussed in TD learning).
2. Boltzmann exploration: choose action a with probability
   $P(a) \propto e^{\sum_j M^a_{ij} U(j)}$
3. Optimistic utility estimates via an exploration function applied to $\sum_j M^a_{ij} U(j)$:
   $f(u, n) = \begin{cases} R^{+} & \text{if } n < N \\ u & \text{otherwise} \end{cases}$
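A tiny sketch of the optimistic exploration function above (the values of R_PLUS and N_E are assumptions):

R_PLUS = 2.0   # optimistic reward estimate (assumed value)
N_E = 5        # try each state-action pair at least this many times (assumed value)

def exploration_f(u, n):
    """Return an optimistic value while the state-action pair has been tried < N_E times."""
    return R_PLUS if n < N_E else u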
Generalization
With a table lookup representation (of U, M, R, Q): up to 10,000 states or more
Chess ~ 10^120
Industrial problems
Backgammon ~ 10^50
Generalization
Could use any supervised learning algorithm for the
generalization part:
[Diagram: sensation (input) → generalization (function approximator) → estimate of Q or U; the estimate is adjusted by the update from RL.]
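For instance, with a linear function approximator the RL update becomes a gradient step on the feature weights; a hedged sketch (the feature function and all names are assumptions), undiscounted to match the updates above:

import numpy as np

def linear_q_update(w, features, i, a, reward_i, j, actions, alpha=0.01):
    """Gradient-style Q-learning update for a linear approximator
    Q(a, i) = w . features(i, a); features is assumed to return a numpy vector."""
    phi = features(i, a)
    q_sa = w @ phi
    target = reward_i + max(w @ features(j, a2) for a2 in actions)
    return w + alpha * (target - q_sa) * phi      # gradient of Q w.r.t. w is phi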
Convergence of RL with function approximation:
  state aggregation: converges to Q*
    (error in Q at most $\max_{i,j \text{ in same class}} |Q_d(i) - Q_d(j)| \,/\, (1-\gamma)$)
  general averagers: converge to Q*
  linear function approximation:
                  on-policy                   off-policy
    prediction    converges to Q              diverges
    control       chatters, bound unknown     diverges
Applications of RL
TD-Gammon
[Plot: TD-Gammon's performance against Gammontool as a function of the number of hidden units, compared with Neurogammon (trained on 15,000 supervised learning examples).]
Multiagent RL
Each agent as a Q-table entry, e.g. in a communication network
Each agent as an intentional entity
The opponent's behavior varies for a given sensation of the agent:
  The opponent uses a different sensation than the agent, e.g. a longer window or different features (stochasticity in steady state)
  The opponent learned: sensation → Q-values (nonstationarity)
  The opponent's exploration policy (Q-values → action probabilities) changed
  The opponent's action selector chose a different action (stochasticity)
Sensation at step n: $\langle a^{\text{me}}_{n-1}, a^{\text{opponent}}_{n-1} \rangle$, reward from step n-1
[Diagram: Q-storage holds Q_coop and Q_def; a deterministic mapping gives p(coop) and p(def); the explorer / random process then selects the action a_n.]
Future research in RL
Macros
Advantages
Reduce complexity of learning by learning subgoals (macros) first
Can be learned by TD(λ)
Problems
Selection of macro actions
Learning models of macro actions (predicting their outcomes)
How do you come up with subgoals?