converge. There are single-agent learning algorithms capable of playing stochastic policies [Jaakkola et al., 1994; Baird and Moore, 1999]. In general, though, just the ability to play stochastic policies is not sufficient for convergence, as will be shown in Section 4.
Joint Action Learners. JALs [Claus and Boutilier, 1998] observe the actions of the other agents. They assume the other players are selecting actions based on a stationary policy, which they estimate. They then play optimally with respect to this learned estimate. Like single-agent learners, they are rational but not convergent, since they also cannot converge to mixed equilibria in self-play.
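As a concrete illustration of the JAL idea, here is a minimal Python sketch for a repeated matrix game: the learner keeps empirical counts of the opponent's actions, treats the resulting frequencies as the opponent's (assumed stationary) policy, and plays a best response to that estimate. The payoff matrix and the Laplace-smoothed counts are illustrative choices, not details taken from [Claus and Boutilier, 1998].

```python
import numpy as np

class JointActionLearner:
    """Best-respond to an empirical model of one opponent's stationary policy.

    payoff[a_self, a_opp] is this agent's reward for the joint action.
    """

    def __init__(self, payoff):
        self.payoff = np.asarray(payoff, dtype=float)
        # Laplace-smoothed counts of the opponent's observed actions.
        self.opp_counts = np.ones(self.payoff.shape[1])

    def act(self):
        opp_policy = self.opp_counts / self.opp_counts.sum()  # estimated opponent policy
        expected = self.payoff @ opp_policy                   # expected payoff of each own action
        return int(np.argmax(expected))                       # play optimally w.r.t. the estimate

    def observe(self, opp_action):
        self.opp_counts[opp_action] += 1

# Rock-Paper-Scissors against an opponent that almost always plays "Rock":
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])  # learner's rows: Rock, Paper, Scissors
jal = JointActionLearner(rps)
for _ in range(500):
    jal.observe(0)  # opponent plays Rock
print(jal.act())    # prints 1 (Paper), the best response to the estimated policy
```

Against a stationary opponent the estimate converges, so the learner's greedy response is eventually a best response; in self-play, however, both estimates keep moving, which is why JALs need not converge to a mixed equilibrium.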
Minimax-Q. Minimax-Q [Littman, 1994] and Hu & Wellman's extension of it to general-sum SGs [1998] take a different approach. These algorithms observe both the actions and rewards of the other players and try to learn a Nash equilibrium explicitly. The algorithms learn and play the equilibrium independent of the behavior of the other players. These algorithms are convergent, since they always converge to a stationary policy. However, these algorithms are not rational. This is most obvious when considering a game of Rock-Paper-Scissors against an opponent that almost always plays "Rock". Minimax-Q will still converge to the equilibrium solution, which is not optimal given the opponent's policy.
In this work we are looking for a learning technique that is rational, and therefore plays a best-response in the obvious case where one exists. Yet its policy should still converge. We want the rational behavior of single-agent learners and JALs, and the convergent behavior of Minimax-Q.
4 A New Algorithm

In this section we contribute an algorithm towards the goal of a rational and convergent learner. We first introduce an algorithm that is rational and capable of playing mixed policies, but does not converge in experiments. We then introduce a modification to this algorithm that results in a rational learner that does, in experiments, converge to mixed policies.

4.1 Policy Hill-Climbing
A simple extension of Q-learning to play mixed strategies is policy hill-climbing (PHC), as shown in Table 1. The algorithm, in essence, performs hill-climbing in the space of mixed policies. Q-values are maintained just as in normal Q-learning. In addition, the algorithm maintains the current mixed policy. The policy is improved by increasing the probability that it selects the highest valued action, according to a learning rate $\delta \in (0,1]$. Notice that when $\delta = 1$ the algorithm is equivalent to Q-learning, since with each step the policy moves to the greedy policy, executing the highest valued action with probability 1 (modulo exploration).

1. Let $\alpha \in (0,1]$ and $\delta \in (0,1]$ be learning rates. Initialize,
   $$Q(s,a) \leftarrow 0, \qquad \pi(s,a) \leftarrow \frac{1}{|A_i|}.$$
2. Repeat,
   (a) From state $s$ select action $a$ according to the mixed policy $\pi(s)$ with suitable exploration.
   (b) Observing reward $r$ and next state $s'$, update $Q(s,a)$ as in normal Q-learning.
   (c) Update $\pi(s,a)$ and constrain it to a legal probability distribution,
   $$\pi(s,a) \leftarrow \pi(s,a) + \begin{cases} \delta & \text{if } a = \operatorname{argmax}_{a'} Q(s,a') \\ \dfrac{-\delta}{|A_i|-1} & \text{otherwise.} \end{cases}$$

Table 1: Policy hill-climbing algorithm (PHC) for player $i$.

This technique, like Q-learning, is rational and will converge to an optimal policy if the other players are playing stationary strategies. The proof follows from the proof of Q-learning, which guarantees the $Q$ values will converge to $Q^*$ with a suitable exploration policy.¹ Similarly, $\pi$ will converge to a policy that is greedy according to $Q$, which is converging to $Q^*$, the optimal response Q-values. Despite the fact that it is rational and can play mixed policies, it still doesn't show any promise of being convergent. We show examples of its convergence failures in Section 5.

¹The issue of exploration is not critical to this work. See [Singh et al., 2000a] for suitable exploration policies for online learning.
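To make step (c) concrete, here is a minimal Python sketch of the PHC policy step for a single state. The clip-and-renormalise projection used to keep the policy a legal probability distribution is an illustrative choice; Table 1 only states the constraint, not how to enforce it.

```python
import numpy as np

def phc_policy_update(pi, q, delta):
    """One PHC policy-improvement step (Table 1, step (c)) for a single state.

    pi    : 1-D array, the current mixed policy over this state's actions.
    q     : 1-D array, the current Q-values for the same actions.
    delta : policy learning rate in (0, 1].
    """
    n = len(pi)
    best = int(np.argmax(q))             # highest-valued action
    step = np.full(n, -delta / (n - 1))  # move probability away from the other actions...
    step[best] = delta                   # ...and toward the greedy action
    pi = np.clip(pi + step, 0.0, 1.0)    # constrain to a legal probability distribution
    return pi / pi.sum()

# With delta = 1 the policy jumps straight to the greedy policy, as noted above.
pi = np.array([0.4, 0.3, 0.3])
q = np.array([0.0, 1.0, 0.2])
print(phc_policy_update(pi, q, delta=1.0))   # -> [0. 1. 0.]
print(phc_policy_update(pi, q, delta=0.1))   # a small step toward action 1
```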
4.2 WoLF Policy Hill-Climbing

We now introduce the main contribution of this paper. The contribution is two-fold: using a variable learning rate, and the WoLF principle. We demonstrate these ideas as a modification to the naive policy hill-climbing algorithm.

The basic idea is to vary the learning rate used by the algorithm in such a way as to encourage convergence, without sacrificing rationality. We propose the WoLF principle as an appropriate method. The principle has a simple intuition: learn quickly while losing and slowly while winning. The specific method for determining when the agent is winning is to compare the current policy's expected payoff with that of the average policy over time. This principle aids in convergence by giving more time for the other players to adapt to changes in the player's strategy that at first appear beneficial, while allowing the player to adapt more quickly to other players' strategy changes when they are harmful.

The required changes for WoLF policy hill-climbing are shown in Table 2. Practically, the algorithm requires two learning rates, $\delta_l > \delta_w$: the smaller rate $\delta_w$ is used when the algorithm is winning, and the larger rate $\delta_l$ when it is losing. Previous work has not examined this principle as essentially involving a variable learning rate, nor has such an approach been applied to learning in stochastic games.

1. Let $\alpha$, $\delta_l > \delta_w$ be learning rates. Initialize,
   $$Q(s,a) \leftarrow 0, \qquad \pi(s,a) \leftarrow \frac{1}{|A_i|}, \qquad C(s) \leftarrow 0.$$
2. Repeat,
   (a,b) Same as PHC in Table 1.
   (c) Update the estimate of the average policy, $\bar{\pi}$,
   $$C(s) \leftarrow C(s) + 1, \qquad \forall a' \quad \bar{\pi}(s,a') \leftarrow \bar{\pi}(s,a') + \frac{1}{C(s)}\bigl(\pi(s,a') - \bar{\pi}(s,a')\bigr).$$
   (d) Update $\pi(s,a)$ and constrain it to a legal probability distribution,
   $$\pi(s,a) \leftarrow \pi(s,a) + \begin{cases} \delta & \text{if } a = \operatorname{argmax}_{a'} Q(s,a') \\ \dfrac{-\delta}{|A_i|-1} & \text{otherwise,} \end{cases}$$
   where,
   $$\delta = \begin{cases} \delta_w & \text{if } \sum_{a'} \pi(s,a')\,Q(s,a') > \sum_{a'} \bar{\pi}(s,a')\,Q(s,a') \\ \delta_l & \text{otherwise.} \end{cases}$$

Table 2: WoLF policy hill-climbing algorithm (WoLF-PHC) for player $i$.
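To make the winning test concrete, the sketch below gives a minimal Python rendering of steps (c) and (d) of Table 2 for a single state. The incremental average-policy update and the clip-and-renormalise projection are illustrative choices; only the comparison of expected payoffs and the switch between $\delta_w$ and $\delta_l$ are essential to the WoLF principle.

```python
import numpy as np

def wolf_phc_update(pi, pi_avg, count, q, delta_w, delta_l):
    """One WoLF-PHC step (Table 2, steps (c) and (d)) for a single state.

    pi      : current mixed policy over this state's actions.
    pi_avg  : running average of the policies played in this state.
    count   : how many times this state has been updated, i.e. C(s).
    q       : current Q-values for this state's actions.
    delta_w : small learning rate, used while winning.
    delta_l : larger learning rate (delta_l > delta_w), used while losing.
    Returns the updated (pi, pi_avg, count).
    """
    # (c) Update the estimate of the average policy.
    count += 1
    pi_avg = pi_avg + (pi - pi_avg) / count

    # Winning: the current policy expects to do better than the average policy.
    winning = np.dot(pi, q) > np.dot(pi_avg, q)
    delta = delta_w if winning else delta_l

    # (d) Step the policy toward the greedy action, exactly as in PHC.
    n = len(pi)
    step = np.full(n, -delta / (n - 1))
    step[int(np.argmax(q))] = delta
    pi = np.clip(pi + step, 0.0, 1.0)
    return pi / pi.sum(), pi_avg, count
```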
5 Results

In this section we show results of applying policy hill-climbing and WoLF policy hill-climbing to a number of different games from the multiagent reinforcement learning literature. The domains include two matrix games that help to show how the algorithms work and the effect of the WoLF principle on convergence. The algorithms were also applied to two multi-state SGs. One is a general-sum grid world domain used by Hu & Wellman [1998]. The other is a zero-sum soccer game introduced by Littman [1994].

The experiments involve training the players using the same learning algorithm. Since PHC and WoLF-PHC are rational, we know that if they converge against themselves, then the joint policy they converge to must be a Nash equilibrium, since each player's policy is then a best response to the other's.
(a) Matching Pennies Game (b) Rock-Paper-Scissors Game
Figure 2: (a) Results for matching pennies: the policy for one of the players as a probability distribution while learning with
PHC and WoLF-PHC. The other player’s policy looks similar. (b) Results for rock-paper-scissors: trajectories of one player’s
policy. The bottom-left shows PHC in self-play, and the upper-right shows WoLF-PHC in self-play.
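The style of experiment behind Figure 2(a) can be reproduced in a few lines. The following self-contained Python sketch runs WoLF-PHC in self-play on matching pennies; the payoff convention, the constant learning rates, and the number of iterations are assumptions for illustration, not the settings used to produce the figure.

```python
import numpy as np

# Matching pennies: the row player wins when the actions match, the column player otherwise.
PAYOFF = np.array([[1, -1],
                   [-1, 1]])

def wolf_step(pi, pi_avg, count, q, delta_w, delta_l):
    """WoLF-PHC policy step for a single-state (matrix) game."""
    count += 1
    pi_avg = pi_avg + (pi - pi_avg) / count
    delta = delta_w if np.dot(pi, q) > np.dot(pi_avg, q) else delta_l
    step = np.full(len(pi), -delta / (len(pi) - 1))
    step[int(np.argmax(q))] = delta
    pi = np.clip(pi + step, 0.0, 1.0)
    return pi / pi.sum(), pi_avg, count

rng = np.random.default_rng(0)
alpha, delta_w, delta_l = 0.1, 0.005, 0.02          # assumed hyperparameters
pi = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]   # deliberately skewed starting policies
avg = [np.full(2, 0.5), np.full(2, 0.5)]
cnt = [0, 0]
q = [np.zeros(2), np.zeros(2)]

for t in range(100_000):
    a0, a1 = rng.choice(2, p=pi[0]), rng.choice(2, p=pi[1])
    r = PAYOFF[a0, a1]
    for i, (ai, ri) in enumerate(((a0, r), (a1, -r))):
        q[i][ai] += alpha * (ri - q[i][ai])  # running estimate of each action's payoff
        pi[i], avg[i], cnt[i] = wolf_step(pi[i], avg[i], cnt[i], q[i], delta_w, delta_l)

print(pi[0], pi[1])  # both policies should hover near the mixed equilibrium (0.5, 0.5)
```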
and the other trying to move North. Hence the game requires that the players coordinate their behaviors.

WoLF policy hill-climbing successfully converges to one of these equilibria. Figure 3(a) shows an example trajectory of the players' strategies for the initial state while learning over 100,000 steps. In this example the players converged to the equilibrium where player one moves East and player two moves North from the initial state. This is evidence that WoLF policy hill-climbing can learn an equilibrium even in a general-sum game with multiple equilibria.

5.3 Soccer
The final domain is a comparatively large zero-sum soccer game introduced by Littman [1994] to demonstrate Minimax-Q. An example of an initial state in this game is shown in Figure 3(b), where player 'B' has possession of the ball. The goal is for the players to carry the ball into the goal on the opposite side of the field. The actions available are the four compass directions and the option to not move. The players select actions simultaneously, but they are executed in a random order, which adds non-determinism to their actions. If a player attempts to move to the square occupied by its opponent, the stationary player gets possession of the ball and the move fails. Unlike the grid world domain, the Nash equilibrium for this game requires a mixed policy. In fact, any deterministic policy (and therefore anything learned by a single-agent learner or JAL) can always be defeated [Littman, 1994].
Our experimental setup resembles that used by Littman in order to compare with his results for Minimax-Q. Each player was trained for one million steps. After training, its policy was fixed and a challenger using Q-learning was trained against the player. This determines the learned policy's worst-case performance, and gives an idea of how close the player was to the equilibrium policy, which would perform no worse than losing half its games to its challenger. Unlike Minimax-Q, WoLF-PHC and PHC generally oscillate around the target solution. To account for this in the results, training was continued for another 250,000 steps and the policy was evaluated after every 50,000 steps. The worst performing policy was then used for the value of that learning run.
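The challenger evaluation can be phrased as a small routine. The sketch below is only meant to show the shape of the protocol: the two-player environment interface (reset and step returning per-player rewards) and the epsilon-greedy tabular Q-learning challenger are hypothetical stand-ins, not Littman's actual soccer implementation or training settings.

```python
import random
from collections import defaultdict

def train_challenger(env, fixed_policy, steps, alpha=0.1, gamma=0.9, eps=0.2):
    """Train an epsilon-greedy Q-learning challenger against a frozen policy.

    `env` is a hypothetical two-player game with reset() -> state and
    step(a_fixed, a_challenger) -> (next_state, r_fixed, r_challenger, done);
    `fixed_policy(state)` returns the frozen player's action.
    Returns the challenger's greedy policy as a dict mapping state -> action.
    """
    q = defaultdict(lambda: [0.0] * env.n_actions)
    s = env.reset()
    for _ in range(steps):
        if random.random() < eps:
            a = random.randrange(env.n_actions)          # explore
        else:
            a = max(range(env.n_actions), key=lambda i: q[s][i])  # exploit
        s2, _, r, done = env.step(fixed_policy(s), a)
        target = r if done else r + gamma * max(q[s2])
        q[s][a] += alpha * (target - q[s][a])
        s = env.reset() if done else s2
    return {state: max(range(env.n_actions), key=lambda i: vals[i])
            for state, vals in q.items()}
```

Playing the frozen policy against the challenger's greedy policy and recording the fraction of games it wins then gives the worst-case measure reported in Figure 3(b).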
Figure 3(b) shows the percentage of games won by the different players when playing their challengers. "Minimax-Q" represents Minimax-Q when learning against itself (the results were taken from Littman's original paper). "WoLF" represents WoLF policy hill-climbing learning against itself. "PHC(L)" and "PHC(W)" represent policy hill-climbing with $\delta = \delta_l$ and $\delta = \delta_w$, respectively. "WoLF(2x)" represents WoLF policy hill-climbing with twice the training (i.e. two million steps). The performance of the policies was averaged over fifty training runs, and the standard deviations are shown by the lines beside the bars. The relative ordering by performance is statistically significant.

WoLF-PHC does extremely well, performing equivalently to Minimax-Q with the same amount of training² and continuing to improve with more training. The exact effect of the WoLF principle can be seen in its out-performance of PHC, using either the larger or the smaller learning rate. This shows that the success of WoLF-PHC is not simply due to changing learning rates, but rather to changing the learning rate at the appropriate time to encourage convergence.

6 Conclusion

In this paper we present two properties, rationality and convergence, that are desirable for a multiagent learning algorithm. We present a new algorithm that uses a variable learning rate based on the WoLF ("Win or Learn Fast") principle. We then showed how this algorithm takes large steps towards achieving these properties on a number and variety of stochastic games. The algorithm is rational and is shown empirically to converge in self-play to an equilibrium, even in games with multiple or mixed policy equilibria, which previous multiagent reinforcement learners have not achieved.

²The results are not directly comparable due to the use of a different decay of the learning rate. Minimax-Q uses an exponential decay that decreases too quickly for use with WoLF-PHC.
Figure 3: (a) Gridworld game. The dashed walls represent the actions that are uncertain. The results show trajectories of two
players’ policies while learning with WoLF-PHC. (b) Soccer game. The results show the percentage of games won against a
specifically trained worst-case opponent after one million steps of training.
Acknowledgements. Thanks to Will Uther for ideas and discussions. This research was sponsored by the United States Air Force under Grant Nos. F30602-00-2-0549 and F30602-98-2-0135. The content of this publication does not necessarily reflect the position or the policy of the sponsors and no official endorsement should be inferred.

References

[Baird and Moore, 1999] L. C. Baird and A. W. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11. The MIT Press, 1999.

[Blum and Burch, 1997] A. Blum and C. Burch. On-line learning and the metrical task system problem. In Tenth Annual Conference on Computational Learning Theory, 1997.

[Bowling and Veloso, 2001] M. Bowling and M. Veloso. Convergence of gradient dynamics with a variable learning rate. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001. To appear.

[Claus and Boutilier, 1998] C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. AAAI Press, 1998.

[Fink, 1964] A. M. Fink. Equilibrium in a stochastic n-person game. Journal of Science of the Hiroshima University, Series A-I, 28:89–93, 1964.

[Hu and Wellman, 1998] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 242–250, 1998.

[Jaakkola et al., 1994] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems 6, 1994.

[Littman, 1994] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163, 1994.

[Nash, Jr., 1950] J. F. Nash, Jr. Equilibrium points in n-person games. PNAS, 36:48–49, 1950.

[Shapley, 1953] L. S. Shapley. Stochastic games. PNAS, 39:1095–1100, 1953.

[Singh et al., 2000a] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 2000.

[Singh et al., 2000b] S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 541–548, 2000.

[Stone and Veloso, 2000] P. Stone and M. Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3), 2000.

[Sutton and Barto, 1998] R. S. Sutton and A. G. Barto. Reinforcement Learning. The MIT Press, 1998.

[Weibull, 1995] J. W. Weibull. Evolutionary Game Theory. The MIT Press, 1995.

[Weiß and Sen, 1996] G. Weiß and S. Sen, editors. Adaptation and Learning in Multiagent Systems. Springer, 1996.