Learning To Fight
KEYWORDS
Reinforcement Learning, Fighting Games,
SARSA, Q-Learning, Markov Decision Process
ABSTRACT
We apply reinforcement learning to the problem
of finding good policies for a fighting agent in a
commercial computer game. The learning agent is
trained using the SARSA algorithm for on-policy
learning of an action-value function represented by
linear and neural network function approximators.
We discuss the selection and construction of features, actions, and rewards as well as other design
choices necessary to integrate the learning process into the game. The learning agent is trained
against the built-in AI of the game with different
rewards encouraging aggressive or defensive behaviour. We show that the learning agent finds interesting (and partly near optimal) policies in accordance with the reward functions provided. We
also discuss the particular challenges arising in the
application of reinforcement learning to the domain of computer games.
INTRODUCTION
Computer games constitute a very interesting domain for the application of machine learning techniques. Games can often be considered simula-
tions of (aspects of) reality (Salen and Zimmerman, 2004). As a consequence, modelling the behaviour of agents in computer games may capture
aspects of behaviour in the real world. Also, the
competitive and interactive nature of games allows
the exploration of policies in a rich dynamic environment. In contrast to modelling behaviour in
the real world, there are (at least theoretically) two
great advantages enjoyed by a simulation/game approach: i.) full control of the game universe including full observability of the state ii.) reproducibility of experimental settings and results.
Computer games provide one of the rather few
domains in which artificial intelligence (game AI)
is currently applied in practice. That said, it is a
common complaint of gamers that the game AI either behaves in boring ways or is too strong or too
weak to provide interesting and entertaining game
play. Hence, adaptive game AI has the potential
of making games more interesting and ultimately
more fun to play. This is particularly true since the
sophistication of other areas of computer games
such as sound, graphics, and physics have leapt
ahead of AI in recent years and it is anticipated
that advances in game AI will be a considerable
driving force for the games market in the future.
In fact, games such as Creatures, Black & White, and Virtua Fighter 4 were designed around the notion of adaptation and learning.
Machine learning can be applied in different computer-game-related scenarios (Rabin, 2002). Supervised learning can be used to perform behavioural cloning.
REINFORCEMENT LEARNING AND THE ACTION-VALUE FUNCTION
Markov Decision Processes
We model the agent's decision and learning process in the framework of reinforcement learning (Sutton and Barto, 1998), which aims at finding a (near) optimal policy for an agent acting in a Markov decision process (MDP). An MDP is characterised by a tuple (S, A, T, R) with
1. A state space S with states $s \in S$. In the most straightforward case S is a finite set.
In Tao Feng the state can be represented by
nominal features such as physical situation of
a player (on the ground, in the air, knocked) or
spatial features (wall behind, wall to the right
etc.) However, depending on the representation chosen, real-valued state features such as
distance between players or state of the health
bar are conceivable as well.
2. An action space A with actions $a \in A$. We will only consider the case of A being a finite set. More precisely, we are dealing with action spaces A(s) that depend on the current state s. Typical actions in Tao Feng include punches, kicks, throws, blocks, and combo moves (see the representation sketch after this list).
3. An unknown stochastic transition dynamics $T^{a}_{s,s'} : S \times A \times S \to [0, 1]$ which gives the probability of a transition from state s to state s' if action $a \in A(s)$ is taken,
$T^{a}_{s,s'} := P\left(s_{t+1} = s' \mid s_t = s, a_t = a\right)$. (1)
4. An unknown stochastic reward function $R^{a}_{s,s'} : S \times A \times S \to \mathbb{R}$ whose expected value is given by
$R^{a}_{s,s'} := E\left[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right]$. (2)
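As a concrete illustration of items 1 and 2, a state can be encoded as a combination of nominal features and the action space as a finite list of move identifiers. The sketch below is purely illustrative: the feature and action names are taken from the description above, but the encoding itself is an assumption rather than the game's actual interface.

```python
from itertools import product

# Nominal state features as in item 1 (names follow the text; the encoding is
# an illustrative assumption, not Tao Feng's actual representation).
PHYSICAL_SITUATION = ["on_ground", "in_air", "knocked"]
SPATIAL_SITUATION = ["no_wall", "wall_behind", "wall_right"]

# A finite action set as in item 2 (a small subset of the available moves).
ACTIONS = ["punchlead", "punchtrail", "kicklead", "kicktrail", "throw", "block10"]

# A finite state space built as the Cartesian product of the nominal features.
STATES = list(product(PHYSICAL_SITUATION, SPATIAL_SITUATION))

def state_index(physical: str, spatial: str) -> int:
    """Map a nominal state description to an integer index, e.g. for a look-up table."""
    return STATES.index((physical, spatial))

print(len(STATES), len(ACTIONS), state_index("in_air", "wall_behind"))
```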
The agent's interaction with the game yields a sequence of rewards, which are summarised in the discounted return
$R_t := r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, (3)
where $\gamma \in [0, 1)$ is a discount factor. The goal of the agent is to devise a sequence of actions $(a_t)_{t=0}^{\infty}$ so as to maximise his average return. To this end the agent learns an action-value function Q(s, a) and selects actions according to a Gibbs (soft-max) policy, where the temperature parameter $\tau \geq 0$ determines how peaked the probability distribution is around the greedy action $a^*$. We learn the action-value function rather than the state value function because estimating the state value function would require us to learn a separate model of the Tao Feng dynamics, thus introducing a layer of complication.
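A minimal sketch of such a soft-max action selection is given below, assuming the standard parameterisation in which action-values are scaled by the temperature $\tau$; the function name and the exact scaling are illustrative assumptions rather than the precise form used in the game.

```python
import math
import random

def gibbs_action(q_values, temperature):
    """Sample an action index from a Gibbs (soft-max) distribution over action-values."""
    # Subtract the maximum action-value for numerical stability before exponentiating.
    q_max = max(q_values)
    preferences = [math.exp((q - q_max) / temperature) for q in q_values]
    normaliser = sum(preferences)
    probabilities = [p / normaliser for p in preferences]
    # Draw an action index according to the resulting distribution.
    r, cumulative = random.random(), 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if r < cumulative:
            return index
    return len(probabilities) - 1  # guard against floating-point round-off

# Example: three available actions with estimated action-values.
print(gibbs_action([0.2, -0.5, 1.0], temperature=1.0))
```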
Although we originally intended to apply Q-Learning (Watkins and Dayan, 1992), it turned out to be very difficult to reliably determine the set A(s) of available actions for the evaluation of $\max_{a' \in A(s')} Q(s', a')$. The reason for this complication lies in the graphical animation system of Tao Feng, which rejects certain actions depending on the animation state. As a consequence, it is only possible to submit a given action a (using submitAction(a)) and to check (using getExecutedAction(a)) which action has actually been performed. We chose the SARSA algorithm (Rummery and Niranjan, 1994) because it does not require knowledge of A(s).
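The following tabular sketch makes the distinction concrete (the paper's Algorithm 1 uses function approximation rather than a table, and the step-size and discount defaults below are illustrative): the SARSA target only needs the successor action that was actually executed, whereas the Q-Learning target needs the full set A(s').

```python
from collections import defaultdict

Q = defaultdict(float)  # tabular action-value estimates, keyed by (state, action)

def sarsa_update(s, a, r, s_next, a_next, alpha=0.01, gamma=0.8):
    """On-policy SARSA: the target uses the successor action a_next that was
    actually executed, so A(s_next) never has to be enumerated."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next, available_actions, alpha=0.01, gamma=0.8):
    """Off-policy Q-Learning: the target maximises over A(s_next), which is
    exactly the set that is hard to determine from the animation system."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in available_actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example transition: the executed successor action is reported by the engine.
sarsa_update(s="in_air", a="kicklead", r=-1.0, s_next="on_ground", a_next="getup")
```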
Implementation Issues
The integration of the learning algorithms into the
code base was hindered by the complex multi-threaded and animation-centred architecture of the
system. Systematic monitoring of the learning process was only possible because we devised an online monitoring tool that continuously sent data
such as rewards, actions and parameters from the
Xbox via the network to a PC, where the data was
analysed and visualised in Matlab. Although the
implementation as such is not planned to be productised, it served as a test-bed for a library of
reusable modules including function approximators (look-up tables, linear function approximation,
and neural network) and learning algorithms (Q-Learning and SARSA), which are suitable for use
in future Xbox games.
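As an illustration of the kind of reusable function-approximation module mentioned above, here is a minimal sketch of a linear action-value approximator with a TD-style gradient update; the class and method names are illustrative assumptions, not the library's actual interface.

```python
class LinearQ:
    """Minimal linear action-value approximator: Q(s, a) = w_a . phi(s),
    with one weight vector per action and a TD-style gradient update."""

    def __init__(self, num_features, actions, alpha=0.01):
        self.alpha = alpha  # step size
        self.w = {a: [0.0] * num_features for a in actions}

    def value(self, phi, a):
        """Q(s, a) as the inner product of the weights for action a with phi(s)."""
        return sum(w_i * x_i for w_i, x_i in zip(self.w[a], phi))

    def update(self, phi, a, td_target):
        """Move Q(s, a) towards td_target; for a linear model the gradient is phi."""
        error = td_target - self.value(phi, a)
        self.w[a] = [w_i + self.alpha * error * x_i
                     for w_i, x_i in zip(self.w[a], phi)]

# Usage with two binary state features (e.g. on_ground, wall_behind).
approximator = LinearQ(num_features=2, actions=["punchlead", "block10"])
approximator.update(phi=[1.0, 0.0], a="punchlead", td_target=0.5)
print(approximator.value([1.0, 0.0], "punchlead"))
```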
EXPERIMENTS
We performed experiments in order to see whether we could learn a good policy for fighting against the built-in AI. We employed the SARSA learning algorithm with function approximation as detailed in Algorithm 1. Throughout we used the parameter settings $\alpha = 0.01$ and $\gamma = 0.8$. We used two types of reward functions depending on the change in the two players' health bars.
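Assuming, as described above, that the rewards are derived from the change in the two health bars, the two reward types might be sketched roughly as follows; the signs and the exact combination are illustrative assumptions rather than the definitions used in the experiments.

```python
def reward_aggressive(delta_health_self: float, delta_health_opponent: float) -> float:
    """Reward damage dealt to the opponent and penalise damage received.
    The deltas are the (non-positive) changes of the health bars over one step."""
    return -delta_health_opponent + delta_health_self

def reward_defensive(delta_health_self: float, delta_health_opponent: float) -> float:
    """Penalise only damage received, encouraging defensive play."""
    return delta_health_self

# Example step: the opponent loses 5 health points while the agent loses 2.
print(reward_aggressive(-2.0, -5.0), reward_defensive(-2.0, -5.0))
```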
Figure 1: Average rewards and frequency of selected action classes for a Gibbs policy with $\tau = 2$ in the aggressive setting using linear function approximators. The gray band indicates the performance of a uniformly random policy (mean ± 2 standard deviations).

Figure 2: Average rewards and frequency of selected action classes for a Gibbs policy with $\tau = 1$ in the aggressive setting using linear function approximators. The gray band indicates the performance of a uniformly random policy (mean ± 2 standard deviations).
The actions available to the learning agent were grouped into three classes:

Aggressive: throw, kicktrail, kicklead, punchlead, punchtrail.

Defensive: block10, block25, block50, stepleft, stepright, lungeback.

Neutral: getup, run10, run25, run50, crouch10, crouch25, crouch50.
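When computing the action-class frequencies reported in the figures, such a grouping can be represented as a simple mapping; the sketch below is illustrative and not the actual experiment code.

```python
from collections import Counter

# Illustrative mapping from move identifiers to the three classes listed above.
ACTION_CLASSES = {
    "aggressive": {"throw", "kicktrail", "kicklead", "punchlead", "punchtrail"},
    "defensive": {"block10", "block25", "block50",
                  "stepleft", "stepright", "lungeback"},
    "neutral": {"getup", "run10", "run25", "run50",
                "crouch10", "crouch25", "crouch50"},
}

def class_frequencies(executed_actions):
    """Relative frequency of each action class in a trace of executed actions."""
    counts = Counter(
        cls for a in executed_actions
        for cls, members in ACTION_CLASSES.items() if a in members
    )
    total = sum(counts.values()) or 1
    return {cls: counts[cls] / total for cls in ACTION_CLASSES}

print(class_frequencies(["throw", "block10", "getup", "kicklead"]))
```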
In a first set of experiments we consider the reward function $r_{\mathrm{aggressive}}$ and use linear function approximators for the Q-function. The results for $\tau = 2$ and $\tau = 1$ are shown in Figures 1 and 2. In both cases the reward rate increases, with considerable fluctuations, by about 0.8 and 1.1, respectively. Starting from the random policy, which loses approximately one reward unit per second, the learning agent reduces the loss to roughly 0.2 reward units per second for $\tau = 2$ and even achieves a net gain of roughly 0.1 reward units per second for $\tau = 1$. In the case $\tau = 2$ the average reward stagnates at a sub-optimal level, presumably because the learner is stuck in a local optimum. From the action frequencies it can be seen that, despite the aggressive reward function, the agent prefers defensive actions and lacks the aggressive behaviour the reward function was designed to encourage.
Figure 3: Average rewards and frequency of selected action classes for a Gibbs policy with $\tau = 2$ in the aggressive setting using a neural network function approximator with 3 hidden units.

Figure 4: Average rewards and frequency of selected action classes for a Gibbs policy with $\tau = 2$ in the Aikido setting using a linear function approximator.
CONCLUSIONS
This work demonstrates that reinforcement learning can be applied successfully to the task of learning the behaviour of agents in fighting games, with the
caveat that the implementation requires considerable insight into the mechanics of the game engine.
As mentioned earlier, our current approach neglects hidden state information and adversarial aspects of the game. One idea to tackle this problem is to separately model the game engine and
the opponent. Based on these two models, standard planning approaches (e.g., min-max search,
beam search) can be employed that take into account that there is more than one decision maker in the game. Also, an important aspect of human fighting game play involves timing, which is hardly
captured by the MDP model and requires explicit
representation of time.
ACKNOWLEDGEMENTS
We would like to thank JCAB, Glen Doren and
Shannon Loftis for providing us with the code base
and Mark Hatton for his initial involvement in the
project.
References
Filar, J. and K. Vrieze (1996). Competitive Markov Decision Processes. Berlin: Springer.
Kaelbling, L. P., M. L. Littman, and A. R. Cassandra (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134.
Rabin, S. (2002). AI Game Programming Wisdom.
Hingham, Massachusetts: Charles River Media, Inc.
Rummery, G. and M. Niranjan (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University, Engineering Department.
Salen, K. and E. Zimmerman (2004). Rules of Play:
Game Design Fundamentals. Cambridge, Massachusetts: MIT Press.
Stone, P. and R. S. Sutton (2001). Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 537–544. Morgan Kaufmann, San Francisco, CA.
Sutton, R. S. and A. G. Barto (1998). Reinforcement
Learning: An Introduction. MIT Press.
Tesauro, G. J. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM 38(3), 58–68.

Watkins, C. J. C. H. and P. Dayan (1992). Q-learning. Machine Learning 8, 279–292.