Rational and Convergent Learning in Stochastic Games
receiving their payoffs the players are transitioned to another state (or matrix game) determined by their joint action. We can see that SGs therefore contain both MDPs and matrix games as subsets of the framework.

Mixed Policies. Unlike in single-agent settings, deterministic policies in multiagent settings can often be exploited by the other agents. Consider the matching pennies matrix game shown in Figure 1. If the column player were to play either action deterministically, the row player could win a payoff of one every time. This requires us to consider mixed strategies or policies. A mixed policy, π : S → PD(A_i), is a function that maps states to mixed strategies, which are probability distributions over the player's actions.

Nash Equilibria. Even with the concept of mixed strategies there are still no optimal strategies that are independent of the other players' strategies. We can, though, define a notion of best-response. A strategy is a best-response to the other players' strategies if it is optimal given their strategies. The major advancement that has driven much of the development of matrix games, game theory, and even stochastic games is the notion of a best-response equilibrium, or Nash equilibrium [Nash, Jr., 1950].

A Nash equilibrium is a collection of strategies, one for each player, such that each player's strategy is a best-response to the other players' strategies. So, no player can get a higher payoff by changing strategies given that the other players keep their strategies fixed. What makes the notion of equilibrium compelling is that all matrix games have such an equilibrium, possibly having multiple equilibria. In the zero-sum examples in Figure 1, both games have an equilibrium consisting of each player playing the mixed strategy in which all actions have equal probability.

The concept of equilibria also extends to stochastic games. This is a non-trivial result, proven by Shapley [1953] for zero-sum stochastic games and by Fink [1964] for general-sum stochastic games.
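To make best responses and equilibria concrete, the following sketch (ours, not the paper's; it assumes the usual matching pennies payoffs of +1 to the row player on a match and -1 otherwise) computes the value of the best response to a fixed column strategy and confirms that the uniform mixed strategy admits no profitable deviation:

    import numpy as np

    # Matching pennies payoffs for the row player (the column player receives the negative).
    # Rows/columns: action 0 = Heads, action 1 = Tails.
    R = np.array([[+1.0, -1.0],
                  [-1.0, +1.0]])

    def best_response_value(payoff, opponent_strategy):
        """Value of the best pure response to a fixed (possibly mixed) opponent strategy."""
        expected = payoff @ opponent_strategy      # expected payoff of each of my actions
        return expected.max()

    uniform = np.array([0.5, 0.5])

    # Against the uniform column strategy every row action earns 0, so the row player
    # cannot gain by deviating; by symmetry the same holds for the column player.
    print(best_response_value(R, uniform))        # 0.0
    print(best_response_value(-R.T, uniform))     # 0.0 (the column player's point of view)

    # Against a deterministic column player ("always Heads") the row player wins 1 every time.
    print(best_response_value(R, np.array([1.0, 0.0])))   # 1.0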
3 Motivation

The multiagent learning problem is one of a "moving target." The best-response policy changes as the other players, who are outside of our control, change their policies. Equilibrium solutions do not solve this problem, since the agent does not know which equilibrium the other players will play, or even whether they will tend toward an equilibrium at all.

Devising a learning algorithm for our agent is also challenging because we don't know which learning algorithms the other learning agents are using. Assuming the general case, in which the other players may be changing their policies in a completely arbitrary manner, is neither useful nor practical. On the other hand, making restrictive assumptions about the other players' specific methods of adaptation is not acceptable either: the other learners are outside of our control, so we don't know which restrictions to assume.

We address this multiagent learning problem by defining two properties of a learner that place requirements on its behavior in concrete situations. After presenting these properties we examine previous multiagent reinforcement learning techniques, showing that they fail to simultaneously achieve both properties.

3.1 Properties

We contribute two desirable properties of multiagent learning algorithms: rationality and convergence.

Property 1 (Rationality) If the other players' policies converge to stationary policies, then the learning algorithm will converge to a policy that is a best-response to their policies.

This is a fairly basic property, requiring the player to behave optimally when the other players play stationary strategies, i.e. to learn a best-response policy in this case, where one indeed exists. Algorithms that are not rational often opt to learn some policy independent of the other players' policies, such as their part of some equilibrium solution. This fails completely in games with multiple equilibria, where the agents cannot independently select and play an equilibrium.

Property 2 (Convergence) The learner will necessarily converge to a stationary policy. This property will usually be conditioned on the other agents using an algorithm from some class of learning algorithms.

The second property requires that, against some class of other players' learning algorithms (ideally a class encompassing most "useful" algorithms), the learner's policy will converge. For example, one might refer to convergence with respect to players with stationary policies, or convergence with respect to rational players.

In this paper we focus on convergence in the case of self-play. That is, if all the players use the same learning algorithm, do the players' policies converge? This is a crucial and difficult step towards convergence against more general classes of players. In addition, ignoring the possibility of self-play makes the naive assumption that the other players are inferior, since they cannot be using an identical algorithm.

In combination, these two properties guarantee that the learner will converge to a stationary strategy that is optimal given the play of the other players. There is also a connection between these properties and Nash equilibria. When all players are rational, if they converge, then they must have converged to a Nash equilibrium: since all players converge to a stationary policy, each player, being rational, must converge to a best response to the others' policies, and since this is true of every player, their policies by definition form an equilibrium. In addition, if all players are rational and convergent with respect to the other players' algorithms, then convergence to a Nash equilibrium is guaranteed.

3.2 Other Reinforcement Learners

There are few RL techniques that directly address learning in a multiagent system. We examine three: single-agent learners, joint-action learners (JALs), and minimax-Q.

Single-Agent Learners. Although not truly a multiagent learning algorithm, one of the most common approaches is to apply a single-agent learning algorithm (e.g. Q-learning, TD(λ), prioritized sweeping, etc.) to a multi-agent domain.
They, of course, ignore the existence of other agents, assuming their rewards and the transitions are Markovian; they essentially treat the other agents as part of the environment. This naive approach does satisfy one of the two properties. If the other agents play, or converge to, stationary strategies then the Markovian assumption holds and the learner converges to an optimal response. So, single-agent learning is rational. On the other hand, it is not generally convergent in self-play. This is obvious for algorithms that learn only deterministic policies: since they are rational, if they converge it must be to a Nash equilibrium, so in games where the only equilibria are mixed equilibria (e.g. matching pennies) they cannot converge. There are single-agent learning algorithms capable of playing stochastic policies [Jaakkola et al., 1994; Baird and Moore, 1999]. In general, though, the ability to play stochastic policies alone is not sufficient for convergence, as will be shown in Section 4.
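As an illustration (a sketch, not code from the paper), a standard tabular Q-learner dropped into a multiagent domain looks exactly like its single-agent counterpart; the other player's influence is hidden inside the observed reward and next state:

    import random
    from collections import defaultdict

    class QLearner:
        """Ordinary single-agent Q-learning; other agents are treated as part of the environment."""

        def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
            self.actions = list(actions)
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
            self.Q = defaultdict(float)          # keyed by (state, action)

        def choose(self, state):
            if random.random() < self.epsilon:   # epsilon-greedy exploration
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.Q[(state, a)])

        def update(self, state, action, reward, next_state):
            # Note: reward and next_state actually depend on the *joint* action, but the
            # learner never sees the other player's choice -- it assumes the world is Markovian.
            target = reward + self.gamma * max(self.Q[(next_state, a)] for a in self.actions)
            self.Q[(state, action)] += self.alpha * (target - self.Q[(state, action)])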
Joint Action Learners. JALs [Claus and Boutilier, 1998] observe the actions of the other agents. They assume the other players are selecting actions based on a stationary policy, which they estimate, and they then play optimally with respect to this learned estimate. Like single-agent learners, they are rational but not convergent, since they too cannot converge to mixed equilibria in self-play.
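A rough sketch of this idea for a repeated matrix game (illustrative only; Claus and Boutilier's algorithm also addresses exploration and value estimation in more detail): the learner keeps Q-values over joint actions, estimates the other player's policy from empirical action frequencies, and best-responds to that estimate.

    from collections import defaultdict

    class JointActionLearner:
        """Sketch of a joint-action learner for a (single-state) matrix game."""

        def __init__(self, my_actions, other_actions, alpha=0.1):
            self.my_actions = list(my_actions)
            self.other_actions = list(other_actions)
            self.alpha = alpha
            self.Q = defaultdict(float)      # keyed by (my_action, other_action)
            self.counts = defaultdict(int)   # observed frequencies of the other player's actions

        def other_policy(self, a_other):
            total = sum(self.counts.values())
            return self.counts[a_other] / total if total else 1.0 / len(self.other_actions)

        def expected_value(self, a_mine):
            # Expected payoff of my action under the estimated (assumed stationary) opponent policy.
            return sum(self.other_policy(b) * self.Q[(a_mine, b)] for b in self.other_actions)

        def choose(self):
            return max(self.my_actions, key=self.expected_value)   # best response to the estimate

        def update(self, a_mine, a_other, reward):
            self.counts[a_other] += 1
            self.Q[(a_mine, a_other)] += self.alpha * (reward - self.Q[(a_mine, a_other)])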
Minimax-Q. Minimax-Q [Littman, 1994] and Hu & Wellman's extension of it to general-sum SGs [1998] take a different approach. These algorithms observe both the actions and the rewards of the other players and try to learn a Nash equilibrium explicitly, learning and playing the equilibrium independently of the behavior of the other players. These algorithms are convergent, since they always converge to a stationary policy. However, they are not rational. This is most obvious when considering a game of Rock-Paper-Scissors against an opponent that almost always plays "Rock". Minimax-Q will still converge to the equilibrium solution, which is not optimal given the opponent's policy.
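The per-state equilibrium computation at the heart of minimax-Q can be sketched as a small linear program over the player's mixed strategy (an illustrative sketch using scipy, not Littman's implementation, which embeds this computation inside a Q-learning loop). For Rock-Paper-Scissors it returns the uniform strategy no matter how the opponent actually plays, which is exactly why the approach is convergent but not rational:

    import numpy as np
    from scipy.optimize import linprog

    def maximin_strategy(payoff):
        """Mixed strategy maximizing the worst-case expected payoff in a zero-sum matrix game."""
        n_mine, n_other = payoff.shape
        # Variables: [pi_0, ..., pi_{n-1}, v]; objective: maximize v  ->  minimize -v.
        c = np.zeros(n_mine + 1)
        c[-1] = -1.0
        # For every opponent action j:  v - sum_i pi_i * payoff[i, j] <= 0.
        A_ub = np.hstack([-payoff.T, np.ones((n_other, 1))])
        b_ub = np.zeros(n_other)
        A_eq = np.hstack([np.ones((1, n_mine)), np.zeros((1, 1))])   # probabilities sum to 1
        b_eq = np.array([1.0])
        bounds = [(0, 1)] * n_mine + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:n_mine], res.x[-1]

    # Rock-Paper-Scissors payoffs for the row player (0 = Rock, 1 = Paper, 2 = Scissors).
    rps = np.array([[ 0, -1,  1],
                    [ 1,  0, -1],
                    [-1,  1,  0]], dtype=float)

    strategy, value = maximin_strategy(rps)
    print(strategy, value)   # ~[1/3, 1/3, 1/3] with value ~0, independent of the opponent's actual play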
In this work we are looking for a learning technique that is rational, and therefore plays a best-response in the obvious case where one exists, yet whose policy still converges. We want the rational behavior of single-agent learners and JALs, and the convergent behavior of minimax-Q.
4 A New Algorithm

In this section we contribute an algorithm towards the goal of a rational and convergent learner. We first introduce an algorithm that is rational and capable of playing mixed policies, but does not converge in experiments. We then introduce a modification to this algorithm that results in a rational learner that does, in experiments, converge to mixed policies.

4.1 Policy Hill Climbing

A simple extension of Q-learning to play mixed strategies is policy hill-climbing (PHC), as shown in Table 1. The algorithm, in essence, performs hill-climbing in the space of mixed policies. Q-values are maintained just as in normal Q-learning. In addition, the algorithm maintains the current mixed policy. The policy is improved by increasing the probability that it selects the highest valued action, according to a learning rate δ ∈ (0, 1]. Notice that when δ = 1 the algorithm is equivalent to Q-learning, since with each step the policy moves to the greedy policy, executing the highest valued action with probability 1 (modulo exploration).

Table 1: Policy hill-climbing algorithm (PHC) for player i.

1. Let α ∈ (0, 1] and δ ∈ (0, 1] be learning rates. Initialize,
       Q(s, a) ← 0,    π(s, a) ← 1/|A_i|.
2. Repeat,
   (a) From state s select action a with probability π(s, a), with some exploration.
   (b) Observing reward r and next state s′, update
       Q(s, a) ← (1 − α) Q(s, a) + α ( r + γ max_{a′} Q(s′, a′) ).
   (c) Update π(s, a) and constrain it to a legal probability distribution,
       π(s, a) ← π(s, a) + δ                  if a = argmax_{a′} Q(s, a′),
       π(s, a) ← π(s, a) − δ / (|A_i| − 1)    otherwise.

This technique, like Q-learning, is rational and will converge to an optimal policy if the other players are playing stationary strategies. The proof follows from the proof of Q-learning, which guarantees the Q values will converge to Q* with a suitable exploration policy.¹ Similarly, π will converge to a policy that is greedy according to Q, which is converging to Q*, the optimal-response Q-values. Despite the fact that it is rational and can play mixed policies, PHC shows no promise of being convergent; we show examples of its convergence failures in Section 5.

¹ The issue of exploration is not critical to this work. See [Singh et al., 2000a] for suitable exploration policies for online learning.
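A minimal Python rendering of Table 1 (a sketch; the class structure, ε-greedy exploration, and the re-normalization used to keep π a legal distribution are our choices):

    import random
    from collections import defaultdict

    class PHCAgent:
        """Policy hill-climbing (a sketch of Table 1; variable names are ours)."""

        def __init__(self, n_actions, alpha=0.1, delta=0.01, gamma=0.9, epsilon=0.05):
            self.n = n_actions
            self.alpha, self.delta, self.gamma, self.epsilon = alpha, delta, gamma, epsilon
            self.Q = defaultdict(lambda: [0.0] * n_actions)
            self.pi = defaultdict(lambda: [1.0 / n_actions] * n_actions)   # current mixed policy

        def choose(self, state):
            if random.random() < self.epsilon:                 # a little exploration
                return random.randrange(self.n)
            return random.choices(range(self.n), weights=self.pi[state])[0]

        def update(self, state, action, reward, next_state):
            # (b) ordinary Q-learning update
            target = reward + self.gamma * max(self.Q[next_state])
            self.Q[state][action] += self.alpha * (target - self.Q[state][action])
            # (c) move the policy toward the greedy action by the step size delta
            self._hill_climb(state, self.delta)

        def _hill_climb(self, state, delta):
            greedy = self.Q[state].index(max(self.Q[state]))
            pi = self.pi[state]
            for a in range(self.n):
                pi[a] += delta if a == greedy else -delta / (self.n - 1)
            # constrain to a legal probability distribution
            pi = [max(p, 0.0) for p in pi]
            total = sum(pi)
            self.pi[state] = [p / total for p in pi]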
4.2 WoLF Policy Hill-Climbing

We now introduce the main contribution of this paper. The contribution is two-fold: using a variable learning rate, and the WoLF principle. We demonstrate these ideas as a modification to the naive policy hill-climbing algorithm.

The basic idea is to vary the learning rate used by the algorithm in such a way as to encourage convergence, without sacrificing rationality. We propose the WoLF principle as an appropriate method. The principle has a simple intuition: learn quickly while losing and slowly while winning. The specific method for determining when the agent is winning is to compare the current policy's expected payoff with that of the average policy over time. This principle aids convergence by giving the other players more time to adapt to changes in the player's strategy that at first appear beneficial, while allowing the player to adapt more quickly to the other players' strategy changes when they are harmful.

The required changes for WoLF policy hill-climbing are shown in Table 2. Practically, the algorithm requires two learning rates, δ_l > δ_w: the larger rate δ_l is used when the agent is losing and the smaller rate δ_w when it is winning, with winning determined by comparing the expected value of the current policy against that of the average policy, as in step 2(d) below.

Table 2: WoLF policy hill-climbing algorithm for player i.

1. Let α and δ_l > δ_w be learning rates. Initialize,
       Q(s, a) ← 0,    π(s, a) ← 1/|A_i|,    C(s) ← 0.
2. Repeat,
   (a, b) Same as PHC in Table 1.
   (c) Update the estimate of the average policy, π̄,
       C(s) ← C(s) + 1,
       ∀a′:  π̄(s, a′) ← π̄(s, a′) + (1/C(s)) ( π(s, a′) − π̄(s, a′) ).
   (d) Update π(s, a) and constrain it to a legal probability distribution,
       π(s, a) ← π(s, a) + δ                  if a = argmax_{a′} Q(s, a′),
       π(s, a) ← π(s, a) − δ / (|A_i| − 1)    otherwise,
       where
       δ = δ_w   if Σ_{a′} π(s, a′) Q(s, a′) > Σ_{a′} π̄(s, a′) Q(s, a′),
       δ = δ_l   otherwise.
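A matching sketch of Table 2, building on the PHCAgent sketch above (again our rendering; δ_l > δ_w as in the table):

    from collections import defaultdict

    class WoLFPHCAgent(PHCAgent):
        """WoLF policy hill-climbing (a sketch of Table 2, extending the PHCAgent sketch)."""

        def __init__(self, n_actions, alpha=0.1, delta_w=0.01, delta_l=0.04, gamma=0.9, epsilon=0.05):
            super().__init__(n_actions, alpha=alpha, delta=delta_w, gamma=gamma, epsilon=epsilon)
            self.delta_w, self.delta_l = delta_w, delta_l    # delta_l > delta_w: learn fast while losing
            self.avg_pi = defaultdict(lambda: [1.0 / n_actions] * n_actions)
            self.visits = defaultdict(int)

        def update(self, state, action, reward, next_state):
            # (a, b) Q update exactly as in PHC / Q-learning
            target = reward + self.gamma * max(self.Q[next_state])
            self.Q[state][action] += self.alpha * (target - self.Q[state][action])

            # (c) update the estimate of the average policy for this state
            self.visits[state] += 1
            k = self.visits[state]
            for a in range(self.n):
                self.avg_pi[state][a] += (self.pi[state][a] - self.avg_pi[state][a]) / k

            # (d) "winning" if the current policy's expected value beats the average policy's
            expected_now = sum(p * q for p, q in zip(self.pi[state], self.Q[state]))
            expected_avg = sum(p * q for p, q in zip(self.avg_pi[state], self.Q[state]))
            delta = self.delta_w if expected_now > expected_avg else self.delta_l
            self._hill_climb(state, delta)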
A variable learning rate of this kind has also been analyzed for gradient-ascent learners in two-player, two-action matrix games, where it was proven to converge to a Nash equilibrium in self-play [Bowling and Veloso, 2001].

Something similar to the WoLF principle has also been studied in some form in other areas, notably when considering an adversary. In evolutionary game theory the adjusted replicator dynamics [Weibull, 1995] scales the individuals' growth rate by the inverse of the overall success of the population. This causes the population's composition to change more quickly when the population as a whole is performing poorly. A form of this also appears as a modification to the randomized weighted majority algorithm [Blum and Burch, 1997]: when an expert makes a mistake, a portion of its weight loss is redistributed among the other experts, and if the algorithm is placing large weights on mistaken experts (i.e. the algorithm is "losing"), then a larger portion of the weights is redistributed (i.e. the algorithm adapts more quickly). Neither line of research recognized these modifications as essentially involving a variable learning rate, nor has such an approach been applied to learning in stochastic games.

5 Results

In this section we show results of applying policy hill-climbing and WoLF policy hill-climbing to a number of different games from the multiagent reinforcement learning literature. The domains include two matrix games that help to show how the algorithms work and the effect of the WoLF principle on convergence. The algorithms were also applied to two multi-state SGs. One is a general-sum grid world domain used by Hu & Wellman [1998]. The other is a zero-sum soccer game introduced by Littman [1994].

The experiments involve training the players using the same learning algorithm. Since PHC and WoLF-PHC are rational, we know that if they converge against themselves, then they must have converged to a Nash equilibrium. For the matrix game experiments a smaller δ was used, but for the other results a more aggressive δ was used. In all cases both δ and α were decreased proportionately over time, although the exact proportion varied between domains.
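As a usage illustration only (not the paper's experimental code; the learning rates here are fixed rather than decayed as in the experiments just described), two learners can be run in self-play on matching pennies using the sketches above:

    # Matching pennies: the row player wins +1 on a match, -1 otherwise; single state "s".
    def payoff(a_row, a_col):
        return 1.0 if a_row == a_col else -1.0

    def self_play(agent_cls, steps=50000, **kwargs):
        p1 = agent_cls(2, gamma=0.0, **kwargs)
        p2 = agent_cls(2, gamma=0.0, **kwargs)
        for _ in range(steps):
            a1, a2 = p1.choose("s"), p2.choose("s")
            r = payoff(a1, a2)
            p1.update("s", a1, r, "s")
            p2.update("s", a2, -r, "s")
        return p1.pi["s"]

    print(self_play(WoLFPHCAgent))   # tends toward the mixed equilibrium [0.5, 0.5]
    print(self_play(PHCAgent))       # typically still oscillating away from [0.5, 0.5], as in Figure 2(a)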
Figure 2: (a) Results for matching pennies: the policy for one of the players as a probability distribution while learning with PHC and WoLF-PHC. The other player's policy looks similar. (b) Results for rock-paper-scissors: trajectories of one player's policy. The bottom-left shows PHC in self-play, and the upper-right shows WoLF-PHC in self-play.
5.2 Gridworld

We also examined a gridworld domain introduced by Hu and Wellman [1998] to demonstrate their extension of Minimax-Q to general-sum games. The game consists of a small grid, shown in Figure 3(a). The agents start in two corners and try to reach the goal square on the opposite wall. The players have the four compass actions (i.e. N, S, E, and W), which are in most cases deterministic. If the two players attempt to move to the same square, both moves fail. To make the game interesting and force the players to interact, the North action from the initial starting positions is uncertain, and is only executed with probability 0.5. The optimal path for each agent is to move laterally on the first move and then move North to the goal, but if both players move laterally then those actions fail. There are two Nash equilibria for this game: one player takes the lateral move and the other tries to move North. Hence the game requires that the players coordinate their behaviors.

WoLF policy hill-climbing successfully converges to one of these equilibria. Figure 3(a) shows an example trajectory of the players' strategies for the initial state while learning over 100,000 steps. In this example the players converged to the equilibrium where player one moves East and player two moves North from the initial state. This is evidence that WoLF policy hill-climbing can learn an equilibrium even in a general-sum game with multiple equilibria.
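For concreteness, a rough sketch of the grid world's transition rules as just described (our reconstruction; the 3x3 layout, start corners, and goal location are assumptions, with Figure 3(a) being authoritative for the actual geometry):

    import random

    WIDTH, HEIGHT = 3, 3                        # assumed layout; see Figure 3(a)
    GOAL = (1, 2)                               # middle of the opposite (north) wall -- assumed
    STARTS = [(0, 0), (2, 0)]                   # the two starting corners -- assumed
    MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

    def step(positions, actions):
        """One joint move. Both moves fail if the players try to enter the same square;
        the North action from a start square succeeds only with probability 0.5."""
        targets = []
        for pos, act in zip(positions, actions):
            dx, dy = MOVES[act]
            if act == "N" and pos in STARTS and random.random() < 0.5:
                dx, dy = 0, 0                   # uncertain North from the start squares
            x = min(max(pos[0] + dx, 0), WIDTH - 1)
            y = min(max(pos[1] + dy, 0), HEIGHT - 1)
            targets.append((x, y))
        if targets[0] == targets[1]:            # collision: both moves fail
            return list(positions)
        return targets

    print(step(STARTS, ["E", "W"]))             # both lateral moves target the same square, so both fail

Reaching GOAL would end the game with a positive reward for the player that arrives; that bookkeeping is omitted from the sketch.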
5.3 Soccer

The final domain is a comparatively large zero-sum soccer game introduced by Littman [1994] to demonstrate Minimax-Q. An example of an initial state in this game is shown in Figure 3(b), where player 'B' has possession of the ball. The goal is for the players to carry the ball into the goal on the opposite side of the field. The actions available are the four compass directions and the option to not move. The players select actions simultaneously but they are executed in a random order, which adds non-determinism to their actions. If a player attempts to move to the square occupied by its opponent, the stationary player gets possession of the ball and the move fails. Unlike the grid world domain, the Nash equilibrium for this game requires a mixed policy. In fact, any deterministic policy (and therefore anything learned by a single-agent learner or JAL) can always be defeated [Littman, 1994].

Our experimental setup resembles that used by Littman in order to compare with his results for Minimax-Q. Each player was trained for one million steps. After training, its policy was fixed and a challenger using Q-learning was trained against the player. This determines the learned policy's worst-case performance, and gives an idea of how close the player was to the equilibrium policy, which would perform no worse than losing half its games to its challenger. Unlike Minimax-Q, WoLF-PHC and PHC generally oscillate around the target solution. In order to account for this in the results, training was continued for another 250,000 steps and evaluated after every 50,000 steps. The worst performing policy was then used for the value of that learning run.

Figure 3(b) shows the percentage of games won by the different players when playing their challengers. "Minimax-Q" represents Minimax-Q learning against itself (the results were taken from Littman's original paper). "WoLF" represents WoLF policy hill-climbing learning against itself. "PHC(L)" and "PHC(W)" represent policy hill-climbing with δ = δ_l and δ = δ_w, respectively. "WoLF(2x)" represents WoLF policy hill-climbing with twice the training (i.e. two million steps). The performance of the policies was averaged over fifty training runs, and the standard deviations are shown by the lines beside the bars. The relative ordering by performance is statistically significant.

WoLF-PHC does extremely well, performing equivalently to Minimax-Q with the same amount of training² and continuing to improve with more training. The exact effect of the WoLF principle can be seen in its out-performance of PHC, using either the larger or the smaller learning rate. This shows that the success of WoLF-PHC is not simply due to changing learning rates, but rather to changing the learning rate at the appropriate time so as to encourage convergence.

² The results are not directly comparable due to the use of a different decay of the learning rate. Minimax-Q uses an exponential decay that decreases too quickly for use with WoLF-PHC.

6 Conclusion

In this paper we present two properties, rationality and convergence, that are desirable for a multiagent learning algorithm. We present a new algorithm that uses a variable learning rate based on the WoLF ("Win or Learn Fast") principle. We then showed how this algorithm takes large steps towards achieving these properties on a number and variety of stochastic games. The algorithm is rational and is shown empirically to converge in self-play to an equilibrium even in games with multiple or mixed policy equilibria, which previous multiagent reinforcement learners have not achieved.
Figure 3: (a) Gridworld game. The dashed walls represent the actions that are uncertain. The results show trajectories of two
players’ policies while learning with WoLF-PHC. (b) Soccer game. The results show the percentage of games won against a
specifically trained worst-case opponent after one million steps of training.
Acknowledgements. Thanks to Will Uther for ideas and discussions. This research was sponsored by the United States Air Force under Grants Nos. F30602-00-2-0549 and F30602-98-2-0135. The content of this publication does not necessarily reflect the position or the policy of the sponsors and no official endorsement should be inferred.

References

[Baird and Moore, 1999] L. C. Baird and A. W. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11. The MIT Press, 1999.

[Blum and Burch, 1997] A. Blum and C. Burch. On-line learning and the metrical task system problem. In Tenth Annual Conference on Computational Learning Theory, 1997.

[Bowling and Veloso, 2001] M. Bowling and M. Veloso. Convergence of gradient dynamics with a variable learning rate. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001. To appear.

[Claus and Boutilier, 1998] C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. AAAI Press, 1998.

[Fink, 1964] A. M. Fink. Equilibrium in a stochastic n-person game. Journal of Science in Hiroshima University, Series A-I, 28:89–93, 1964.

[Hu and Wellman, 1998] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 242–250, 1998.

[Jaakkola et al., 1994] T. Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems 6, 1994.

[Littman, 1994] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163, 1994.

[Nash, Jr., 1950] J. F. Nash, Jr. Equilibrium points in n-person games. PNAS, 36:48–49, 1950.

[Shapley, 1953] L. S. Shapley. Stochastic games. PNAS, 39:1095–1100, 1953.

[Singh et al., 2000a] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 2000.

[Singh et al., 2000b] S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 541–548, 2000.

[Stone and Veloso, 2000] P. Stone and M. Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3), 2000.

[Sutton and Barto, 1998] R. S. Sutton and A. G. Barto. Reinforcement Learning. The MIT Press, 1998.

[Weibull, 1995] J. W. Weibull. Evolutionary Game Theory. The MIT Press, 1995.

[Weiß and Sen, 1996] G. Weiß and S. Sen, editors. Adaptation and Learning in Multiagent Systems. Springer, 1996.